Notes from Dagstuhl: biomedical data sharing
Last February, I had the privilege of participating in a Dagstuhl seminar on privacy for biomedical data sharing.
I know quite a bit about privacy, but very little about the biomedical field, so it was a fantastic opportunity to learn more about this space. This blog post is a semi-random list of things I learned.
Differential privacy is hardly ever used
A quick look at the DP deployment registry is enough to make this point pretty obvious: formal approaches to privacy haven't seen major deployments in the biomedical space. This may seem surprising: access to sensitive data is a major pain point for healthcare research, and robust anonymization technology sounds like a natural solution. Why has there been so little adoption compared to other fields?
Some reasons I heard will sound very familiar to any DP practitioner.
- Nobody understands it. People described meetings where they tried to explain the intuition behind DP, only to watch their audience go blank. In comparison, it's apparently much easier for non-technical folks to understand (the very basics of) multi-party computation or federated learning.
- The idea of adding noise to data is often a non-starter: data truthfulness is too central to the scientific culture among healthcare researchers. People prefer ad hoc techniques like generalization or suppression, even if they severely impact utility, because at least they can rely on the data "not lying to them".
- Attacks are not compelling. There's a strong perception that DP protects against unrealistic attackers, which aren't very relevant to biomedical data sharing settings.
This pushback isn't focused solely on DP — any protection mechanism that brings significant utility loss is met with harsh criticism. One participant recounted a job interview where they were talking about typical anonymization measures, and someone asked them: "Do you hate science?!"
The majority of seminar participants thought that the last point — attacks not being super compelling — was largely valid. There aren't many high-profile failures of anonymization in the medical domain. The existing ones are all "obviously terrible" (as in: the data was extremely poorly protected). This creates the sense that if you do something reasonable, even if it's not super principled, it's probably good enough. The argument goes: if this wasn't the case, then we would hear about anonymization failures a lot more often. As someone succinctly put it: "Where are the bodies?"
Controlled access repositories are all the rage
Instead of trying to reach very strong privacy guarantees by anonymizing the data, sensitive data access solutions primarily rely on risk-based approaches. The main solution is to deploy controlled access repositories, which are becoming widespread in the biomedical domain. These are systems where the data curator gives researchers access to the data, under specific conditions, with many different kinds of risk mitigation measures. Here are some examples.
- Registration and pre-approval of research projects.
- Contractual measures, like data use agreements.
- Prohibitions against data download — all analyses must happen on a cloud platform maintained by the organization sharing the data.
- Logging and monitoring of all actions taken by researchers on the data.
- Disclosure avoidance practices, like generalization and suppression, to make sure the data isn't too easy to re-identify (a minimal sketch of these two techniques follows this list).
- Clear consequences for data misuse.
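To make the disclosure avoidance point concrete, here's a minimal sketch of what generalization (coarsening quasi-identifiers into buckets) and suppression (dropping rows whose quasi-identifier combination is too rare) can look like. The column names, bucket sizes, and the group-size threshold are all made up for illustration; real repositories tune these to their own risk assessments.

```python
import pandas as pd

def generalize(df: pd.DataFrame) -> pd.DataFrame:
    """Coarsen quasi-identifiers: exact ages become 10-year bands,
    full ZIP codes are truncated to their first three digits."""
    out = df.copy()
    decade = out["age"] // 10 * 10
    out["age"] = decade.astype(str) + "-" + (decade + 9).astype(str)
    out["zip"] = out["zip"].str[:3] + "**"
    return out

def suppress_rare_rows(df: pd.DataFrame, quasi_ids: list[str], k: int) -> pd.DataFrame:
    """Drop rows whose quasi-identifier combination appears fewer than k times."""
    group_sizes = df.groupby(quasi_ids)[quasi_ids[0]].transform("size")
    return df[group_sizes >= k]

# Toy data: the column names and the k=3 threshold are invented for illustration.
records = pd.DataFrame({
    "age": [34, 36, 81, 35],
    "zip": ["94110", "94114", "02139", "94117"],
    "diagnosis": ["flu", "flu", "rare disease", "flu"],
})
released = suppress_rare_rows(generalize(records), ["age", "zip"], k=3)
print(released)
```

With `k=3`, the lone 80-89-year-old gets suppressed while the group of three thirty-somethings survives: the output is coarser but still truthful, which is exactly the property that, per the earlier point, healthcare researchers tend to value over noisy data.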
Such systems are not a panacea. For example, sensitive data from the UK Biobank was inadvertently published dozens of times, and even ended up for sale on Alibaba. But what I heard suggested that the solution to this kind of issue won't be "add more anonymization", but rather "establish better data governance practices".
Controlled access repositories also bring complicated questions, in particular around financial sustainability. Maintaining them and making them compliant with all existing regulations is a massively complex and expensive endeavor. So in many cases, large cloud providers (GCP, Azure, etc.) are building the infrastructure, and heavily subsidizing it for initial use cases. Of course, this creates dangerous lock-in effects: if (or when) Google or Microsoft decides to hike up the price of these services, healthcare institutions will find themselves in a very difficult negotiating position.
Risk mitigation frameworks are more mature than in other industries
To enable data sharing while providing an adequate level of risk mitigation, the biomedical community uses conceptual frameworks that reflect a high level of organizational maturity. One example is the Five Safes (safe projects, safe people, safe settings, safe data, safe outputs), used to reason holistically about data sharing systems. Such tools aren't, to my knowledge, typically used in the areas I'm more familiar with, like the tech industry. That's too bad: learning how the biomedical field reasons about privacy risk made me think of ways I could adopt similar frameworks in my own work.
One reason I like this kind of framework is that it covers concerns that anonymization methods don't really address, like data misuse. In a recent example, a dataset containing medical data from 20,000 children was used by fringe "researchers" to publish race science papers. This is a very bad scenario, but it has very little to do with whether individual participants can be re-identified.
Healthcare institutions will also often have documented standards describing how to anonymize data, depending on exposure, sensitivity, context, and so on. Even though these guidelines don't use provably robust methods, their very existence is a sign of organizational maturity: it brings consistency and traceability across multiple data sharing or publication use cases. How many institutions outside the medical domain have something comparable?
This maturity doesn't stop at anonymization practices. For example, I heard people describe well-defined guidelines for how to handle privacy violations by employees. Privacy officers classify violations into fixed categories, depending on whether the offense was inadvertent or malicious, and whether it was performed at scale; each category comes with different disciplinary measures (sketched below).
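As a sketch of what such a classification scheme could encode: the two axes below (intent and scale) come from what I heard at the seminar, but the specific disciplinary measures are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Violation:
    malicious: bool  # deliberate misuse, or an honest mistake?
    at_scale: bool   # bulk access, or a single record?

# The two axes (intent, scale) reflect what I heard at the seminar;
# the disciplinary measures themselves are made up for illustration.
DISCIPLINARY_MEASURES = {
    Violation(malicious=False, at_scale=False): "mandatory retraining",
    Violation(malicious=False, at_scale=True): "retraining, plus access suspension pending review",
    Violation(malicious=True, at_scale=False): "access revocation and formal disciplinary process",
    Violation(malicious=True, at_scale=True): "termination and referral to legal",
}

print(DISCIPLINARY_MEASURES[Violation(malicious=True, at_scale=False)])
```

The point isn't the specific consequences; it's that the mapping is written down in advance, so responses are consistent rather than improvised.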
Privacy can be a convenient excuse not to publish data
I heard from multiple people that institutions often have multiple reasons not to publish data, unrelated to privacy concerns. Fear of embarrassment is a big one. For example, a hospital might be hesitant to publish anonymized data about cancer patients, for fear that the data shows it has worse patient outcomes than its competitors. Nobody will admit this out loud, though!
This is where "privacy concerns" come in. They can be a very convenient, principled-looking rationale for not publishing data. Stakeholders often hide behind this excuse: rather than saying "we're afraid of bad PR", they say "we're concerned about patient privacy". This can lead to frustrating situations for anonymization practitioners: they come up with clever technical ways of solving the privacy problem, only to realize much later that this wasn't the real blocker to data publication.
AI is bringing mostly chaos, and some opportunities
Everyone is trying to figure out how to adapt to a new world, where LLMs are used by people every step of the way.
- Patients are coming to consultations with a preconceived idea of what to expect, because they asked ChatGPT first. They're putting their test results straight into an LLM and asking it to interpret them. They're asking it for second opinions after seeing a doctor.
- Automated transcription services are now widespread for patient consultations. An institution I heard of went from "we ask patients to opt in", to "we tell patients and allow them to opt out", to "we don't tell patients unless they ask (and then allow them to opt out)", in just a few years.
- Doctors are also regularly using LLMs for medical queries, in much the same way they use search engines. Institutions that forbade LLM use saw many of their employees use personal devices and accounts to do it anyway.
- Researchers are relying on AI agents to help with coding tasks and data analysis. Data custodians are anticipating a future where AI agents perform such tasks with less and less supervision. This raises complex data protection questions — how to model the risk of such data accesses?
- I heard of an interesting project where an LLM was used to audit people's behavior in controlled access repositories and flag suspicious data usage patterns. It seemed to me like a compelling use case, because the alternatives — fixed detection rules or manual audits — have severe limitations in practice. (A minimal sketch of the idea follows this list.)
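Here's a minimal sketch contrasting the two approaches. Everything below is my own guess at what such a pipeline could look like: the log schema, the thresholds in the fixed rules, and the `ask_llm` callable (standing in for whatever LLM API the actual project used) are all hypothetical.

```python
import json

def rule_based_flags(events: list[dict]) -> list[str]:
    """Fixed detection rules: cheap and auditable, but they only catch
    patterns that someone anticipated and wrote a rule for."""
    flags = []
    for e in events:
        if e["rows_accessed"] > 100_000:  # threshold is made up
            flags.append(f"{e['user']}: bulk access ({e['rows_accessed']} rows)")
        if e["hour"] < 6:  # "outside working hours" is made up too
            flags.append(f"{e['user']}: access at {e['hour']}:00")
    return flags

def llm_flags(events: list[dict], ask_llm) -> str:
    """LLM-based audit: hand the raw log to a model and ask for anomalies.
    `ask_llm` is a hypothetical stand-in for a real LLM client call."""
    prompt = (
        "You are auditing a controlled access biomedical data repository.\n"
        "Given these access log events, flag any usage patterns that look\n"
        "inconsistent with an approved research project, and explain why:\n"
        + json.dumps(events, indent=2)
    )
    return ask_llm(prompt)

events = [
    {"user": "alice", "hour": 14, "rows_accessed": 250, "query": "cohort statistics"},
    {"user": "bob", "hour": 3, "rows_accessed": 500_000, "query": "SELECT *"},
]
print(rule_based_flags(events))
```

The appeal, as I understood it: the rule-based detector only catches what someone thought to write a rule for, while the LLM can in principle notice that a usage pattern doesn't match the stated research purpose, at the cost of being much harder to evaluate.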
Everyone agrees that organizations need to think strategically about AI use. Currently, institutions are mostly launching a few pilots, and reacting to people's behavior. This is clearly not forward-thinking enough, but nobody knows how to actually formulate a good strategy, since nobody can really predict where the field is going…
