Ted is writing things

On privacy, research, and privacy research.

k-anonymity, the parent of all privacy definitions

— updated

In 1997, a PhD student named Latanya Sweeney heard about an interesting data release. A health insurance organization from Massachusetts had compiled a database of hospital visits by state employees, and had thought that giving it to researchers could encourage innovation and scientific discovery. Of course, there were privacy considerations: allowing researchers to look at other citizens' health records seemed pretty creepy. So they decided to do the obvious thing, and removed all columns that indicated who a patient was: name, phone number, full address, social security number, etc.

As you can probably guess, this didn't end so well. In this article, I'll describe and analyze Sweeney's successful reidentification attack, and I'll explain the privacy definition that Sweeney invented to prevent this type of attack in the future: \(k\)-anonymity.

What went wrong?

Some demographic information was left in the database, so researchers could still compile useful stats: ZIP code, date of birth, and gender were all part of the data. Sweeney realized that the claims of the Massachusetts governor, who insisted that the privacy of state employees was respected (all identifiers were removed!), were perhaps a little bit over-optimistic. Since the governor himself was a state employee, Sweeney decided to do the obvious thing and reidentify which records of the "anonymized" database were the governor's.

With just $20, Sweeney bought the public voter records from Massachusetts, which had both full identifiers (names, addresses) and demographic data (ZIP code and date of birth), and contained the governor's information. Guess how many records matched the governor's gender, ZIP code, and date of birth inside the hospital database? Only one. Sweeney thus knew exactly which prescriptions and visits in the data were the governor's. She mailed all of it to his office, theatrically demonstrating that the anonymization process wasn't as solid as it should have been.

Several factors made this attack possible. Some are obvious, but not all:

  1. The hospital data contained demographic information that could be used to distinguish between different records.

  2. A secondary database was available to figure out the demographic information about the target.

  3. The target was in both datasets.

  4. And the demographic information of the target (ZIP code, date of birth, and gender) was unique within both datasets: only one record had the demographic values of the governor.

At first glance, these factors appear to be necessary: remove one of them and suddenly, the attack no longer works. (Try it! It's a good mental exercise.)

How to prevent this attack?

As per our previous remark, removing one of the factors should be enough to prevent attacks like these. Which ones can we afford to remove, while making sure that the data can be used for data analysis tasks?

  1. We could remove all demographic information from the data, or even all information that might be linked to a person using auxiliary sources. Unfortunately, this would also severely hinder the utility of the data: correlations based on age, gender, and geographic info are very useful to researchers!

  2. Society probably should do something about the existence of public (or commercially available) data sources that can be used in reidentification attacks. However, this is a complex political issue, so a little bit out of scope for a data owner who wants to publish or share an anonymized version of their data — in practice, there's pretty much nothing we can do about it.

  3. Again, there's not much we can do. We have no way to modify the secondary (public) dataset. We could decrease the probability that a random target is in our dataset by sub-sampling it, but all people in the sample would still be at risk, so this is obviously not a satisfying solution.

  4. Now, this is the interesting point. Maybe suppressing all demographic values would render the data useless, but there might be a middle ground to make sure that the demographic values are no longer unique in the dataset.

This last suggestion is the basic idea of \(k\)-anonymity.

Definition of \(k\)-anonymity

A dataset is said to be \(k\)-anonymous if every combination of values for demographic columns in the dataset appears in at least \(k\) different records.

For example, this dataset is \(2\)-anonymous:

ZIP code    age
--------    ---
4217        34
4217        34
1742        77
1742        77
4217        34

This one isn't:

ZIP code    age
--------    ---
4217        34
1742        77
1743        77
4217        34

Notice that we need every combination of values to appear at least \(k\) times. Thus, even though each individual value in each column appears twice in the following dataset, it's not \(2\)-anonymous:

ZIP code    age
--------    ---
4217        34
1742        34
4217        77
1742        77

The intuition is that when a dataset is \(k\)-anonymous for a sufficiently large \(k\), the last requirement for a successful reidentification attack is broken. An attacker might find out the demographic information of their target using a secondary database, but this information will then be shared by \(k\) different records, so it's impossible to tell which one belongs to the target.
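To make the definition concrete, here is a minimal sketch of a \(k\)-anonymity check in Python (the record representation and column names are my own illustration, not from any particular library): count each combination of quasi-identifier values, and verify that every combination appears at least \(k\) times.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k records."""
    combinations = Counter(
        tuple(record[qi] for qi in quasi_identifiers)
        for record in records
    )
    return all(count >= k for count in combinations.values())

# The 2-anonymous example from above:
table = [
    {"zip": "4217", "age": 34},
    {"zip": "4217", "age": 34},
    {"zip": "1742", "age": 77},
    {"zip": "1742", "age": 77},
    {"zip": "4217", "age": 34},
]
print(is_k_anonymous(table, ["zip", "age"], 2))  # True
```

Because the check counts combinations of values, not individual values, the "crossed" counter-example above correctly fails it.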

What types of data are reidentifying?

Note that we've only talked about "demographic information", which is pretty vague. ZIP codes, age, gender are all good candidates for reidentification attacks, because they're public (or easily findable) information that is also often found in sensitive datasets (especially medical ones). In general, the data owner should consider which columns might be used by the attacker they're concerned about.

These columns, not necessarily sensitive themselves but usable in a reidentification attack, are called quasi-identifiers (or QIs). There is no universal list of quasi-identifiers: it depends on the attack model. While some data types are almost always QIs (ZIP code, age, gender…), many more depend on the context (timestamps, medical conditions, physical characteristics…). The question to ask is: would the person who's trying to attack our dataset have access to these values through public or commercially available data?

I'll try to write more about attack modeling and data classification later. This is not as easily explainable as the various mathematical definitions of privacy: it has lots of human components and as such, is always a bit fuzzy. Which makes it even more interesting! :D But I digress.

How to choose \(k\)?

Short answer: ¯\_(ツ)_/¯

Longer answer: nobody knows. In the healthcare world, when medical data is shared with a small number of people (typically for research purposes), \(k\) is often chosen between \(5\) and \(15\). This choice is very arbitrary and ad hoc. To the best of my knowledge, there is no official law or regulation which suggests a specific value. Some universities, companies or other organizations have official guidelines, but the vast majority don't.

To pick a parameter for a privacy definition, one needs to understand the link between the parameter value and the risk of a privacy incident happening. But this is difficult: while \(k\)-anonymity is relatively easy to understand, estimating risk quantitatively is extremely tricky. I'm also going to write a bit about this later on! In the meantime, here's what this looks like in practice:

  • Regulators don't want to include specific parameter values in laws or guidelines, since there is no convincing argument for any given choice, and the level of risk depends on many fuzzier parameters (how valuable the data is, how bad a privacy incident would be, etc.).
  • Data owners don't know how to choose the parameter either, so they usually hire a privacy consultant to make this choice (and to take care of the anonymization process). The consultant doesn't know what the "good" choice is either, but they usually have more practical experience of which values are common in the industry for similar levels of risk.

This is my first "real" blog post, about the most basic anonymity definition there is, and I've already reached my second digression to say "notice how it's actually super fuzzy and thus, complicated to apply in practice?". Isn't privacy fun? :D

How to make a dataset \(k\)-anonymous?

So, suppose we picked our quasi-identifiers and \(k=2\). Even with such a low value for \(k\), our original dataset will likely not be \(k\)-anonymous: there will be many records with unique combinations of quasi-identifier values.

The two main building blocks used to transform a dataset into a \(k\)-anonymous table are generalization and suppression.

Building block 1: generalization

Generalization is the process of making a quasi-identifier value less precise, so that records with different values are transformed (or generalized) into records that share the same values. Consider the records in this table:

ZIP code    age
--------    ---
4217        34
4217        39
1742        75
1691        77

The numerical values of these records can be transformed into numerical ranges, so that the resulting table satisfies \(2\)-anonymity:

ZIP code    age
---------   -----
4217        30-39
4217        30-39
1000-1999   75-79
1000-1999   75-79

The idea of generalization is to make demographic information more imprecise to satisfy our privacy requirements, but still allow useful data analysis to be done. In our example, changing precise ages into age ranges is probably enough to analyze whether a disease affects young or old people disproportionately.

Transforming a numerical value into a range is one of the most typical ways of performing generalization. Other ways include removing a value entirely (e.g. transforming a gender value into "gender unknown"), or using a generalization hierarchy (e.g. transforming an ICD-10 diagnosis code into a truncated code, or the corresponding block).
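As an illustration, here are two toy generalization functions, assuming numeric ages and string ZIP codes (the bucket width and the number of digits kept are arbitrary choices of mine, not prescribed by \(k\)-anonymity itself):

```python
def generalize_age(age, width=10):
    """Replace an exact age with a range of the given width, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code, digits_kept=1):
    """Keep only the leading digits of a ZIP code, e.g. '1742' -> '1000-1999'."""
    kept = zip_code[:digits_kept]
    suppressed = len(zip_code) - digits_kept
    return f"{kept}{'0' * suppressed}-{kept}{'9' * suppressed}"

print(generalize_age(34))      # 30-39
print(generalize_zip("1742"))  # 1000-1999
```

Widening the bucket (`width=20`, `digits_kept=0`…) trades utility for easier anonymity, which is exactly the knob that \(k\)-anonymization algorithms tune.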

Two types of generalization

Generalization strategies can be classified into two categories: global and local. Consider the following table:

ZIP code    age
--------    ---
4217        34
4217        34
1742        34
1742        31

Global generalization means that a given value for a given column will always be generalized in the same way: if you decide to transform age 34 into age range 30-34 for one record, all records that have ages between 30 and 34 will be transformed into this fixed range of 30-34. Using global generalization, the example could be transformed into:

ZIP code    age
--------    -----
4217        30-34
4217        30-34
1742        30-34
1742        30-34

Local generalization doesn't have that constraint: it allows you to pick a different generalization for each record. A value of 34 in the age column might stay untouched for one record, and be generalized for another:

ZIP code    age
--------    -----
4217        34
4217        34
1742        30-34
1742        30-34

Global generalization usually makes it easier to analyze the generalized data, while local generalization preserves more utility at the cost of a slightly more complex data representation.
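The distinction can be sketched in code: global generalization applies one fixed mapping to an entire column, while local generalization decides per record. The helper names, the 5-year bucket, and the "only generalize the 1742 group" rule are illustrative assumptions of mine:

```python
def globally_generalize(records, column, generalize):
    """Apply the same generalization function to a column in every record."""
    return [{**r, column: generalize(r[column])} for r in records]

def locally_generalize(records, column, generalize, should_generalize):
    """Generalize a column only for the records selected by a predicate."""
    return [
        {**r, column: generalize(r[column])} if should_generalize(r) else r
        for r in records
    ]

age_range = lambda a: f"{(a // 5) * 5}-{(a // 5) * 5 + 4}"

table = [
    {"zip": "4217", "age": 34},
    {"zip": "4217", "age": 34},
    {"zip": "1742", "age": 34},
    {"zip": "1742", "age": 31},
]

# Global: every age between 30 and 34 becomes the fixed range "30-34".
print(globally_generalize(table, "age", age_range))

# Local: only the 1742 group needs generalizing; the 4217 ages stay exact.
print(locally_generalize(table, "age", age_range,
                         lambda r: r["zip"] == "1742"))
```

Both outputs are \(2\)-anonymous here, but the local version keeps the exact ages for the first two records.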

Building block 2: suppression

In our previous example, our records had relatively "close" demographic values, which allowed generalization to keep reasonably accurate information while still ensuring \(2\)-anonymity. What if the table is instead:

ZIP code    age
--------    ---
4217        34
4217        39
1742        75
1691        77
9755        13

The first four records can be grouped in two pairs as above, but the last record is an outlier. Grouping it with one of the pairs would mean using very large ranges of values (age between 10 and 39, or ZIP code between 4000 and 9999), which would significantly reduce the utility of the resulting data. A simple way to deal with such outlier values is to remove them from the data entirely. Using both generalization and suppression on this example could lead to the same \(2\)-anonymous table as before:

ZIP code    age
---------   -----
4217        30-39
4217        30-39
1000-1999   75-79
1000-1999   75-79

Using this method, there are usually strictly fewer records in the transformed table than in the original. On large datasets, allowing a small percentage of suppressed records typically allows the result to be \(k\)-anonymous without requiring too much generalization.
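Putting the two building blocks together, here is a naive sketch of my own: generalize every record with fixed per-column functions, then suppress whatever combinations are still too rare. (Unlike the table above, it also coarsens the 4217 ZIP codes, since it applies the same global generalization everywhere; real algorithms search for much better trade-offs.)

```python
from collections import Counter

def k_anonymize(records, quasi_identifiers, k, generalizers):
    """Generalize every quasi-identifier, then suppress records whose
    generalized combination still appears fewer than k times."""
    generalized = [
        {qi: generalizers[qi](r[qi]) for qi in quasi_identifiers}
        for r in records
    ]
    counts = Counter(
        tuple(r[qi] for qi in quasi_identifiers) for r in generalized
    )
    return [
        r for r in generalized
        if counts[tuple(r[qi] for qi in quasi_identifiers)] >= k
    ]

table = [
    {"zip": "4217", "age": 34},
    {"zip": "4217", "age": 39},
    {"zip": "1742", "age": 75},
    {"zip": "1691", "age": 77},
    {"zip": "9755", "age": 13},  # the outlier
]
generalizers = {
    "age": lambda a: f"{(a // 10) * 10}-{(a // 10) * 10 + 9}",
    "zip": lambda z: f"{z[0]}000-{z[0]}999",
}
result = k_anonymize(table, ["zip", "age"], 2, generalizers)
# The outlier's generalized combination appears only once, so it's suppressed:
print(len(result))  # 4
```

The suppressed record is exactly the \(9755/13\) outlier: after generalization, its group has size 1, below the required \(k = 2\).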

Algorithms

\(k\)-anonymity is the oldest privacy definition, and it's relatively simple to understand, so it was quickly adopted by the healthcare community for their data anonymization needs. As a result, there has been a lot of research on how to transform a dataset into a \(k\)-anonymous table.

The problem of finding an optimal strategy for \(k\)-anonymity is NP-hard, for basically any reasonable definition of optimality. This paper (pdf) presents a few such results, if you're interested in this kind of thing ^^

A list of approximation algorithms for the optimal \(k\)-anonymization problem can be found in this paper (pdf) (Table 4, page 11). 18 different algorithms are listed, and I don't even think the list is exhaustive! The paper contains many links to the original papers, and to some comparisons between methods. Sadly, there is no unified benchmark to know how all these algorithms perform on various data analysis tasks.

In practice

Unless you're a PhD student working on your literature review, you're probably not looking for a bunch of links to research papers about complicated \(k\)-anonymization algorithms. If you're a data owner trying to transform a dataset to get a \(k\)-anonymous table, you may be looking for software instead.

As of 2017, the main open-source tool for data anonymization is ARX. Its interface is a bit difficult to understand at first, but it works fairly well on small to moderately large datasets, and implements a lot more than just \(k\)-anonymity algorithms. Its main limitation is that only global generalization methods are implemented, so newer algorithms which use local generalization to minimize accuracy loss are not available.

There are other tools available online, but none of them is anywhere near as usable as ARX. Many of them are listed on the Related software page of ARX's website. I've tried most of them, only to become convinced that none of them has really reached the point of being a usable product. UTD Anonymization Toolbox is probably the only one worth a look: it requires using the command line and unwieldy configuration files, but it implements a local generalization algorithm (the first of its kind, named Mondrian (pdf), a very cool technique with much better utility preservation than global generalization algorithms).

On the commercial side, I've only heard of a toolkit developed by the consulting company Privacy Analytics. The intended audience seems to be people who know little about privacy: it looks very shiny, but I didn't manage to understand which anonymity property or algorithms they were using ^^ You can get a free trial by filling out a form on their website, but I can only assume the real version is very expensive, since there is no mention of price anywhere.

How convincing really is \(k\)-anonymity?

\(k\)-anonymity is simple to understand, and it seems intuitively obvious that reidentification attacks are well mitigated when a dataset is transformed to become \(k\)-anonymous. However, it only mitigates this particular kind of attack. We assumed that all that the attacker wanted was to select a target, point at a record, and say "this record corresponds to my target" with a high certainty. This matches Sweeney's original attack, but how realistic is this?

When an attacker successfully reidentifies someone in a dataset, it's not necessarily a privacy issue. Consider the voter files from earlier. By law, this data is public, and it contains full names. It's very easy for an attacker to point at a random record and shout "hey, I reidentified this person!": the identification is right there in the dataset. This "attack" always succeeds, but it's not really interesting, nor particularly creepy… Why is that?

In Sweeney's example, the creepy thing isn't just finding the data subject associated with a given record. The sensitive information linked to the record (in our leading example, diagnoses and drug prescriptions) is where the creepiness comes from! The leak of sensitive information associated with one given individual is the problem, not the reidentification itself.

\(k\)-anonymity doesn't really capture this idea. The definition only prevents an attacker from learning the real identity behind an anonymized record. But maybe there are other attacks that allow you to find out sensitive information about someone, without identifying with absolute certainty which record is theirs?

As I'll explain in future articles, other types of attacks do exist, and many other definitions have been proposed in order to mitigate them too. Nonetheless, \(k\)-anonymity is still used in the healthcare world, in large part because of its simplicity and utility preservation compared to other definitions.

I'm always glad to get feedback! If you'd like to contact me, please do so via e-mail (se.niatnofsed@neimad) or Twitter (@TedOnPrivacy).