Lowering the cost of anonymization

a PhD thesis

2.1.2  -map 

Suppose a researcher conducted a survey about human sexual behavior in the US. The answers to that survey are obviously revealing sensitive information about individuals. Assume that the data collected is small-scale: only 40 subjects volunteered for the study. Afterwards, the researcher wants to share this data with other researchers. Looking at the columns in the dataset, they make the assumption that ZIP code and age are likely to be used in reidentification attacks. Could -anonymity be used to share it in a safe way?

It will likely prove difficult to do without losing a lot of valuable information. For , a rather small value, we might need buckets like 20–49 for age. Those would obviously not provide a lot of statistical value. This large loss in utility was to be expected: with so few people in your database, it is necessary to bundle together very different age values.

Do we really need -anonymity? Who are the attackers, in this scenario? The researchers with whom you share the data, and possibly unknown parties if the data ever leaks, might not have background information about who is in the dataset. Thus, the attacker must not only distinguish between different records, but they need to actually find the real identity of a record based on its information. This attacker is significantly weaker than the one implicitly assumed in the -anonymity model!

ZIP code age
85535 79
60629 42
Table 2.6: Two sample rows from a hypothetical clinical study.

Consider two different rows in this hypothetical database, listed in Table 2.6. At first glance, the “amount of information” for these two individuals seems to be the same. Taking a look at the values, however, tells a different story. ZIP code 85535 corresponds to the small town of Eden, Arizona. Approximately 20 people live in this ZIP code: it is likely that only one person in this ZIP code is exactly 79 years old. ZIP code 60629, however, corresponds to a part of the Chicago metropolitan area. More than 100,000 people live there: probably more than 1,000 of them are 42 years old.

Therefore, it seems like the first row is very easy to reidentify, but that we do not have enough information to reidentify the second row. -anonymity does not capture this consideration: both rows are unique in the dataset. A variant of -anonymity is needed for this case: -map.

Just like -anonymity, -map requires us to determine which columns of the database are quasi-identifiers, and be used by the attacker to reidentify their target. This information alone, however, is not enough to compute -map. In the example above, we assumed that the attacker does not know whether their target is in the dataset. So what are they comparing a given record with? This comparison is done across all other individuals sharing the same values in a larger, sometimes implicit, dataset. For the previous example, this could be “everybody living in the US”, assuming the attacker has no prior information on who could potentially be in this dataset. We call this larger table the reidentification dataset, and denote it by .

Once the choice of quasi-identifiers and reidentification dataset is done, the definition is straightforward: it is similar to -anonymity, except we count the number of records in the larger dataset .

Definition 2 (-map [352]). Assume . A database satisfies -map for a reidentification dataset if for every possible combination of quasi-identifier values present in , this combination is also present in at least records in .

In our example, this corresponds to counting the number of people in the US who share the quasi-identifier values of each row in the dataset. For our sample dataset in Table 2.6, the values of the first row matched only one person in the US, so this dataset does not satisfy -map for any value of . How to obtain a larger ? We can use techniques similar to the ones mentioned previously, in Section 2.1.1.0, and generalize some of these quasi-identifiers. An example is given in Table 2.7.

ZIP code age
85000–85999 79
60629 42
Table 2.7: A dataset satisfying -map for a large .

ZIP codes between 85000 and 85999 include the entire city of Phoenix, Arizona. More than 36,000 people between 75 and 84 years old live in Phoenix [315], so there are probably more than 1,000 people who match the quasi-identifiers values of the first row. We saw earlier that the second row also matched 1,000+ people. Thus, this generalized dataset satisfies 1000-map.

Policy questions raised by -map

It is easy to see that -map raises the same difficult policy questions as -anonymity, mentioned in Section 2.1.1.0: it is difficult to choose , and it might not be obvious to choose which columns are considered quasi-identifiers. In addition, -map requires choosing the reidentification dataset , which is another difficult choice. Our choice to take the entire US population in the example above is very optimistic; it implicitly assumed that the attacker had no additional information about their target whatsoever. With such a reidentification dataset, not much generalization is needed to make this dataset -map.

In many cases, it would make sense to assume that the attacker could use a smaller reidentification database. For example, if the survey was advertised in a specific online community, the attacker could make the reasonable assumption that all survey participants are also members of this community. This would allow the attacker to restrict their search, and decrease the number of possible options for each row. As the reidentification dataset gets smaller, so does the in -map. The extreme would be to consider that the dataset itself is used for reidentification: in this case, -map is the same as -anonymity.

-map in practice

The reliance of -map on modeling a reidentification dataset makes it significantly more difficult than -anonymity to use in practice. In general, the person anonymizing the data does not have access to such a reidentification dataset, nor can they know which alternative dataset might be used. Procuring all available commercial datasets is generally impossible. Thus, transforming a dataset so the output satisfies -map seems quite difficult in practice. In some situation however, the problem is tractable.

For example, suppose that instead of releasing all the data, we release a random sample of the original dataset. In this case, it is reasonable to assume that the attacker does not know which of the original records were sampled, so we can compute the -map value based on the original dataset.

Another option is to assume that the data is a representative (or unbiased) sample of a larger dataset. This is a reasonable approximation to make if people in the dataset were selected uniformly at random across a larger population, as is common in scientific studies. In this case, it is possible to compute an estimate of the -map value for the data using statistical methods, either with purely statistical methods [137], or by approximating the reidentification dataset using publicly available data [82].

Finally, for small-case studies, a motivated data owner could try and do the job of an attacker “by hand”: go through each record and try to map it to a real person, similarly to what we did in the toy example of Table 2.6. The result of such an analysis is essentially an estimate of the -map value. This last option is approximative, and obviously not scalable.

All opinions here are my own, not my employers.
I'm always glad to get feedback! If you'd like to contact me, please do so via e-mail (se.niatnofsed@neimad) or Twitter (@TedOnPrivacy).