In 1997, the Group Insurance Commission (GIC), a Massachusetts health insurance organization, compiled and published a database of hospital visits by state employees. The dataset was then shared widely with researchers, to encourage scientific research. Of course, there were privacy considerations with publishing people’s health records: “in accordance with the best de-identification practice of the time”, all directly identifying information was removed: name, phone number, full address, social security number, etc.
As you can probably expect, this story does not end well. Some demographic information was left in the database, to allow researchers to run statistical analyses: ZIP code, date of birth, and gender were all part of the data. William Weld, then governor of Massachusetts, “assured the public that GIC had protected patient privacy”. Latanya Sweeney, then a PhD student at MIT, did a back-of-the-envelope calculation, and started suspecting that this apparently innocuous information could be, in fact, very identifying. To prove her point, she decided to try and reidentify Weld’s records from the “anonymized” database.
With just $20, Sweeney bought the public voter records from Massachusetts, which had both full identifiers (names, addresses) and demographic data (ZIP code and date of birth), and contained the governor’s information. Only one record in the hospital dataset matched the governor’s demographic data, and thus, Sweeney was able to know which prescriptions and visits in the data were his. She sent him that information by mail, showing theatrically that their anonymization process was not as solid as it should have been.
Several factors made this attack possible.
- The hospital data contained demographic information that could be used to distinguish between different records.
- A secondary database was available to figure out the demographic information about the target.
- The target was in both datasets.
- And the demographic information of the target (ZIP code, date of birth, and gender) was unique within both datasets: only one record had the demographic values of the governor.
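The core mechanics of such a linkage attack can be sketched in a few lines of Python. The records, names, and values below are made up for illustration; only the structure of the attack (a join on quasi-identifier columns) follows the story above.

```python
# "Anonymized" hospital data: direct identifiers removed, but demographic
# quasi-identifiers (ZIP code, birth date, gender) kept for researchers.
hospital = [
    {"zip": "02138", "birth_date": "1945-07-31", "gender": "M", "diagnosis": "flu"},
    {"zip": "02139", "birth_date": "1962-03-12", "gender": "F", "diagnosis": "asthma"},
]

# Public voter records: full identifiers plus the same demographic columns.
voters = [
    {"name": "W. Weld", "zip": "02138", "birth_date": "1945-07-31", "gender": "M"},
    {"name": "J. Doe", "zip": "02139", "birth_date": "1971-01-01", "gender": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "gender")

def link(target_name):
    """Join the two datasets on the quasi-identifiers for one target."""
    target = next(v for v in voters if v["name"] == target_name)
    key = tuple(target[c] for c in QUASI_IDENTIFIERS)
    # The attack succeeds when this combination is unique in the
    # hospital data: exactly one record matches the target.
    return [h for h in hospital
            if tuple(h[c] for c in QUASI_IDENTIFIERS) == key]

print(link("W. Weld"))  # a single matching record: the diagnosis is exposed
```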
At first glance, all factors appear to be necessary for the attack to work. So, which ones can we afford to remove, while making sure that the data can be used for data analysis tasks? Let us go through all options.
- We could remove all demographic information from the data, or even all information that might be linked to a person using auxiliary sources. Unfortunately, this would also severely hinder the utility of the data: correlations based on age, gender, and geographic info are very useful to researchers!
- Society probably should do something about the existence of public (or commercially available) data sources that can be used in reidentification attacks. For example, regulations might outlaw the publication or sale of such datasets, making them harder to come by. However, this is a complex issue, on which individual data owners have little power.
- Again, there is not much we can do about this aspect. We have no way to modify the secondary (public) dataset. We could decrease the probability that a random target is in our dataset by sub-sampling it, but all people in the sample would still be at risk.
- The uniqueness of the governor’s demographic information is a central aspect of the reidentification attack. Maybe, rather than suppressing all demographic values—which would render the data useless—we could find a less drastic way of making those values less unique.
This last suggestion is the fundamental intuition behind k-anonymity: everyone must be hidden in a crowd of at least k people. More precisely, each combination of demographic values must map to at least k different people. Of course, demographic information is not the only type of data that can be used in such an attack: any personal information that the attacker knows about an individual can be used. Thus, the first step is to define a list of quasi-identifiers: database columns that we assume the attacker knows. To that end, we model the space of possible records as a product of columns: some columns are quasi-identifiers (e.g. age or ZIP code), while the remaining columns are assumed to be unknown by the attacker.
Definition 1 (k-anonymity [337, 354]). Let k be a positive integer. A database is k-anonymous if every possible combination of quasi-identifier values is present in at least k records in the dataset, or in none.
Note that we need every combination of values to appear at least k times. Thus, even if each individual value of each column appears k times in the dataset of Table 2.3, the dataset is not necessarily k-anonymous.
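Checking the definition is mechanical: group records by their quasi-identifier combination and verify that no group is smaller than k. A minimal sketch in Python, with hypothetical column names and records:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check Definition 1: every combination of quasi-identifier values
    that appears in the dataset appears in at least k records."""
    counts = Counter(tuple(r[c] for c in quasi_identifiers) for r in records)
    return all(count >= k for count in counts.values())

# Illustrative records only.
data = [
    {"zip": "4000", "age": 34, "diagnosis": "flu"},
    {"zip": "4000", "age": 34, "diagnosis": "asthma"},
    {"zip": "4000", "age": 39, "diagnosis": "flu"},
    {"zip": "4000", "age": 39, "diagnosis": "flu"},
]

print(is_k_anonymous(data, ("zip", "age"), 2))  # True
print(is_k_anonymous(data, ("zip", "age"), 3))  # False: each (zip, age) pair appears only twice
```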
The intuition is that when a dataset is k-anonymous for a sufficiently large k, the last requirement for a successful reidentification attack is broken. An attacker might find out the demographic information of their target using a secondary database, but this demographic information will then be linked to k different individuals, so it will be impossible to know which record belongs to the target. The privacy guarantee gets stronger with increasing values of k: the larger k, the bigger the crowd in which each individual is “hidden”.
Using this definition in practice requires answering a number of questions. First, what types of data are reidentifying? How should one decide which columns in a dataset are assumed to be known by the attacker, and should be classified as quasi-identifiers? As we mentioned, ZIP codes, age, or gender are all good candidates for reidentification attacks: they are public or easily findable information that is also often found in sensitive datasets, especially medical ones.
In general, however, there is no universal criterion for what constitutes a quasi-identifier. This choice depends on the attack model. Timestamps, medical conditions, physical characteristics, and behavioral information can all be considered potentially reidentifying in some cases. In other cases, it can seem reasonable to assume that someone trying to attack a dataset would not have easy access to these values.
The second question is about the choice of k. Which value is appropriate? Again, it is difficult to say. In the healthcare world, when medical data is shared with a small number of people (typically for research purposes), k is often chosen to be a fairly small value. To the best of our knowledge, there is no official law or regulation which suggests a specific value. Some universities, companies, or other organizations have official guidelines, but the vast majority do not.
This choice is difficult to make. To pick a principled value for k, we would need to understand the link between the parameter value and the risk of a privacy incident. But while k-anonymity is relatively easy to understand, estimating this risk quantitatively is not straightforward. It is difficult to measure how much more “privacy protection” is offered by larger values of k. Furthermore, the level of risk depends on many additional parameters: how valuable the data is, how bad a privacy incident would be, how many people have access to the data, etc.
To use k-anonymity in practice, one has to decide how to modify the initial dataset to create a k-anonymous version of it. Two basic techniques are typically used as building blocks for more complex algorithms: generalization and suppression [246, 247, 338, 353].
Generalization is the process of making a quasi-identifier value less precise, so that records with different values are transformed (or generalized) into records that share the same values. Consider the records in Table 2.4.
The numerical values of these records can be transformed into numerical ranges, so that the resulting table satisfies 2-anonymity. This is demonstrated in Table 2.5.
Generalization makes data more imprecise to satisfy privacy requirements, while still allowing useful data analysis. In our example, changing precise ages into age ranges might be enough to analyze whether a disease affects younger or older people disproportionately.
Transforming a numerical value into a range is one of the most typical ways of performing generalization. Other ways include removing a value entirely (e.g. transforming a gender value into “gender unknown”), or using a generalization hierarchy (e.g. transforming an ICD-10 diagnosis code into a truncated code, or into the corresponding block).
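Both flavors of generalization — coarsening a number into a range, and truncating a code along a hierarchy — are simple transformations. A sketch of the two, with parameter choices (range width, digits kept) picked arbitrarily for illustration:

```python
def generalize_age(age, width=10):
    """Replace an exact age with a range of the given width, e.g. 34 -> "30-39"."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code, keep=3):
    """Truncate a ZIP code to its first digits, e.g. "97555" -> "975**"."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

print(generalize_age(34))       # "30-39"
print(generalize_zip("97555"))  # "975**"
```

The same truncation idea applies to hierarchical codes such as ICD-10 diagnoses: keeping only the leading characters of a code moves it up one level in the hierarchy.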
In the previous example, our records had relatively “close” demographic values, which allowed generalization to keep reasonably accurate information while still ensuring 2-anonymity. What if an additional record were added to Table 2.4, with ZIP code 9755 and age 13? The other four records can be grouped into two pairs as above, but that new record is an outlier. Grouping it with one of the pairs above would mean having very large ranges of values (age between 10 and 39, or the ZIP code being completely removed), which would significantly reduce the utility of the resulting data.
Suppression can be used in this case: we simply remove such outlier records, and only apply generalization to the rest. Using both generalization and suppression on this example could lead to the same 2-anonymous table as before, shown in Table 2.5. When suppression is used, there are usually strictly fewer records in the transformed table than in the original. On large datasets, allowing a small percentage of suppressed records typically allows the result to be k-anonymous without requiring too much generalization.
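Combining the two building blocks gives a naive anonymization pass: generalize every quasi-identifier, then suppress any record whose generalized combination still appears fewer than k times. This is only a sketch of the idea (real algorithms search over many generalization levels); the column names and coarsening functions are assumptions.

```python
from collections import Counter

def anonymize(records, quasi_identifiers, generalizers, k):
    """Generalize each quasi-identifier with the given coarsening function,
    then suppress records whose combination appears fewer than k times."""
    generalized = [
        {**r, **{c: generalizers[c](r[c]) for c in quasi_identifiers}}
        for r in records
    ]
    counts = Counter(tuple(r[c] for c in quasi_identifiers) for r in generalized)
    return [r for r in generalized
            if counts[tuple(r[c] for c in quasi_identifiers)] >= k]

# Illustrative records: two "close" records and one outlier.
data = [
    {"zip": "97530", "age": 34},
    {"zip": "97531", "age": 39},
    {"zip": "97530", "age": 13},  # outlier: will be suppressed
]
result = anonymize(
    data, ("zip", "age"),
    {"zip": lambda z: z[:3] + "**",
     "age": lambda a: f"{(a // 10) * 10}-{(a // 10) * 10 + 9}"},
    k=2,
)
# The two adult records share ("975**", "30-39"); the age-13 outlier is dropped.
```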
How can generalization and suppression be used to reach k-anonymity while maximizing the utility of the output data? Finding the optimal strategy is NP-hard for many reasonable definitions of optimality, but a large number of approximation algorithms have been proposed in the literature, and implemented as open-source software [20, 339, 363, 371].
k-anonymity is simple to understand, and it seems natural to assume that reidentification attacks are well mitigated when a dataset is made k-anonymous. However, it only mitigates this particular kind of attack, and it is difficult to use in some contexts. We present three such limitations, each of which is a major motivation for the alternative definitions we introduce next: k-map, l-diversity, and δ-presence.
First, k-anonymity is difficult to use for smaller datasets, such as those with 100 records or fewer. For many of these datasets, especially in the context of a healthcare study, the attacker might not know that their target is in the dataset in the first place. As such, the “search space” for reidentification is larger than the dataset: if a single user has certain characteristics in the dataset, but many more people share these characteristics outside of the dataset, the attack might still fail.
Second, we implicitly assumed that the attacker only wanted to select a target and link a specific record with this target. In Sweeney’s original attack, the privacy risk came from the additional sensitive data that was obtained after reidentification: medical records associated with the target. If a quasi-identifier combination is always associated with the same sensitive value in a dataset, then the attacker can learn this value even without reidentifying the record.
Third, for some datasets, being in the dataset constitutes sensitive information in the first place: for example, everyone participating in a clinical trial for a specific drug might share the same sensitive diagnosis.
k-anonymity falls short of modeling the three scenarios above. We elaborate on these issues in the following three sections: each section introduces a definition that attempts to solve one of them. k-anonymity has more problems than those listed above; we come back to these when introducing differential privacy.