2.1.3  ℓ-diversity

Consider voter data, commonly available in most US states. This data is public by law, and contains full names. One could point at an arbitrary record and claim that we “reidentified” this individual. This “attack” always succeeds, but is not particularly interesting. In Sweeney’s reidentification attack, detailed in Section 2.1.1, the privacy issue was not only about finding the individual associated with a given record, but about retrieving the sensitive information linked to the record: diagnoses and drug prescriptions. The leak of this sensitive information associated with a specific individual was the real issue, more than the reidentification itself.

Are there cases where it is possible to find out sensitive information about someone, without reidentifying their exact record? Let us look at an example where we explicitly add sensitive information to the dataset. Each row of Table 2.8 contains a direct identifier (name), two quasi-identifiers (ZIP code and age), and one sensitive property (diagnosis).

name     ZIP code  age  diagnosis
Alice    4217      34   Common cold
Bob      4212      39   n/a
Camille  4732      39   HIV
Dede     4743      23   HIV
Table 2.8: A sample dataset with a sensitive column.

The sensitive property is not a quasi-identifier: it is the main information that we are trying to keep secret from the attacker, so it is reasonable to assume that the attacker does not know about it a priori. Let us use generalization to make this dataset 2-anonymous. The result can be found in Table 2.9: all combinations of ZIP code and age appear twice.

ZIP code   age    diagnosis
4210–4219  30–39  Common cold
4210–4219  30–39  n/a
4700–4799  20–39  HIV
4700–4799  20–39  HIV
Table 2.9: Dataset from Table 2.8, after generalization to satisfy 2-anonymity.

Now, assume the attacker wants to find Camille’s diagnosis. They know that Camille has ZIP code 4732 and age 39. They can deduce that Camille’s record is the third or the fourth, but cannot know which. However, it does not matter: both records have the same diagnosis. So the attacker can find out that Camille’s diagnosis is HIV, even without reidentifying Camille’s record! k-anonymity was not enough to protect the sensitive information.
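
To make this attack concrete, here is a minimal sketch in Python. The records are transcribed from Table 2.9; the helper and variable names are ours, chosen for illustration.

```python
# Each record: (ZIP code range, age range, diagnosis), as in Table 2.9.
generalized = [
    ("4210-4219", "30-39", "Common cold"),
    ("4210-4219", "30-39", "n/a"),
    ("4700-4799", "20-39", "HIV"),
    ("4700-4799", "20-39", "HIV"),
]

def matches(record, zip_code, age):
    """Check whether a generalized record is compatible with exact QI values."""
    zip_lo, zip_hi = (int(x) for x in record[0].split("-"))
    age_lo, age_hi = (int(x) for x in record[1].split("-"))
    return zip_lo <= zip_code <= zip_hi and age_lo <= age <= age_hi

# The attacker only knows Camille's quasi-identifiers: ZIP code 4732, age 39.
candidates = [r for r in generalized if matches(r, 4732, 39)]
print(len(candidates))             # 2 -- Camille's record is not reidentified...
print({r[2] for r in candidates})  # {'HIV'} -- ...but the diagnosis leaks anyway
```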

To fix this, an alternative definition was proposed in [265]: ℓ-diversity. It not only mandates that each individual is indistinguishable from sufficiently many others, but also that within each group, there are multiple options for the sensitive value. In this context, the space of possible records is modeled by 𝒯 = 𝒯_QI × 𝒯_S, where 𝒯_S is an arbitrary set of possible sensitive values.

Definition 3 (ℓ-diversity [265]). Assume 𝒯 = 𝒯_QI × 𝒯_S. A database D is ℓ-diverse if for every combination of quasi-identifier values v present in the dataset, there are at least ℓ distinct sensitive values associated with v in D.
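
This definition is easy to check mechanically. Below is a minimal sketch (the function name is ours) that treats each record as a tuple whose last element is the sensitive value and whose other elements are the quasi-identifiers:

```python
from collections import defaultdict

def is_l_diverse(records, l):
    """Return True if every combination of quasi-identifier values appearing
    in `records` is associated with at least `l` distinct sensitive values."""
    sensitive_values = defaultdict(set)
    for *quasi_identifiers, sensitive in records:
        sensitive_values[tuple(quasi_identifiers)].add(sensitive)
    return all(len(values) >= l for values in sensitive_values.values())

# Table 2.9 is 2-anonymous, but not 2-diverse: the second group of records
# only ever contains the value "HIV".
table_2_9 = [
    ("4210-4219", "30-39", "Common cold"),
    ("4210-4219", "30-39", "n/a"),
    ("4700-4799", "20-39", "HIV"),
    ("4700-4799", "20-39", "HIV"),
]
print(is_l_diverse(table_2_9, 2))  # False
```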

Of course, ℓ-diversity implies k-anonymity with k = ℓ. In Table 2.10, we use generalization to make the dataset in Table 2.8 2-diverse. Consider our previous attacker, targeting Camille (third row). Like before, the attacker is unable to know which record in the 2-diverse dataset corresponds to Camille. Moreover, they cannot know whether Camille was healthy, or has been diagnosed with HIV. The sensitive value stays private.

ZIP code   age    diagnosis
4000–4999  20–34  Common cold
4000–4999  35–39  n/a
4000–4999  35–39  HIV
4000–4999  20–34  HIV
Table 2.10: Dataset from Table 2.8, after generalization to satisfy 2-diversity.

ℓ-diversity in practice

Like k-anonymity, ℓ-diversity is easy to compute, once the quasi-identifiers and the sensitive value have been chosen. The basic building blocks to transform a dataset into an ℓ-diverse one are the same: generalization and suppression. Finding the best strategy is also done using trial-and-error heuristics. The approach used for k-anonymity is straightforward to adapt to ℓ-diversity.
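
As an illustration of these building blocks, here is one possible generalization step, hand-picked to reproduce Table 2.10; this is a sketch under our own assumptions, not a published heuristic, and it reuses the is_l_diverse check from the previous snippet. A real implementation would search over many candidate generalizations and keep one that minimizes utility loss.

```python
def generalize(record):
    """One hand-picked generalization step: coarsen ZIP codes to a
    thousand-wide range and ages to two bands, keeping the diagnosis as-is."""
    zip_code, age, diagnosis = record
    low = (zip_code // 1000) * 1000
    zip_range = f"{low}-{low + 999}"
    age_range = "20-34" if age < 35 else "35-39"
    return (zip_range, age_range, diagnosis)

# The original records from Table 2.8.
table_2_8 = [
    (4217, 34, "Common cold"),
    (4212, 39, "n/a"),
    (4732, 39, "HIV"),
    (4743, 23, "HIV"),
]

generalized = [generalize(r) for r in table_2_8]
print(is_l_diverse(generalized, 2))  # True: this reproduces Table 2.10
```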

Since ℓ-diversity is strictly stronger than k-anonymity (for k ≤ ℓ), it might seem like a natural replacement, always beneficial for privacy. However, the utility cost of ℓ-diversity is usually much more significant than that of k-anonymity: some studies found that applying ℓ-diversity can be worse than using k-anonymity, for a typical classification task [54]. This is one of the reasons why it is hardly ever used in practice.

Limits of ℓ-diversity

Unsurprisingly, ℓ-diversity shares similar difficulties with k-anonymity: choosing ℓ is difficult, as is determining which columns are quasi-identifiers and which ones are sensitive. Even if we set aside these issues, another question looms: does ℓ-diversity really protect the sensitive information? Consider the dataset in Table 2.11. Like the dataset from Table 2.10, it is 2-diverse. However, an attacker targeting the third row can still discover that their target has either hepatitis B or HIV. Some uncertainty is preserved, but the attacker still gains very sensitive information.

ZIP code   age    diagnosis
4000–4999  20–34  Common cold
4000–4999  35–39  Hepatitis B
4000–4999  35–39  HIV
4000–4999  20–34  HIV
Table 2.11: A different 2-diverse dataset that reveals sensitive information.

Another way ℓ-diversity can fail is by providing probabilistic information gain to the attacker. Consider the dataset in Table 2.12. An attacker trying to gain information about one of the rows cannot know for sure what their target’s diagnosis is, but they can significantly increase their suspicion that the target has HIV. This is especially true when we consider the background knowledge that the attacker might have about the sensitive value: for example, if the diagnosis of the first record is a condition that the attacker knows their target does not have.

ZIP code   age    diagnosis
4000–4999  20–34  Common cold
4000–4999  20–34  HIV
4000–4999  20–34  HIV
4000–4999  20–34  HIV
4000–4999  20–34  HIV
4000–4999  20–34  HIV
4000–4999  20–34  HIV
4000–4999  20–34  HIV
4000–4999  20–34  HIV
Table 2.12: Another 2-diverse dataset that leaks probabilistic information.
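
The size of this gain is easy to quantify with a back-of-the-envelope computation; the bucket contents below are transcribed from Table 2.12.

```python
# Sensitive values of the single bucket in Table 2.12.
bucket = ["Common cold"] + ["HIV"] * 8

# Without background knowledge, the attacker's best guess for any target
# in this bucket is the empirical frequency of each diagnosis.
p_hiv = bucket.count("HIV") / len(bucket)
print(f"P(HIV) = {p_hiv:.2f}")  # 0.89

# With the background knowledge that the target does not have a common cold,
# the attacker can rule out the only non-HIV record entirely:
possible = {d for d in bucket if d != "Common cold"}
print(possible)  # {'HIV'} -- certainty, despite 2-diversity
```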

How can we protect against this type of probabilistic information gain? Requiring that sensitive attributes are diverse is not enough; we would also need to require that the distribution of sensitive values within each group is roughly the same as in the rest of the data. If 40% of the records are “healthy” in the overall data, then each bucket should also have roughly 40% of “healthy” records. This way, the attacker’s knowledge cannot change too much from the baseline. This is the core idea behind another definition: t-closeness [251]. This definition is stricter than ℓ-diversity and, unsurprisingly, comes with even greater utility loss. We do not formally introduce it here.
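
This intuition can be made concrete with a short sketch that compares each bucket’s distribution of sensitive values to the overall one. To keep the code short, we measure the gap with total variation distance; the actual t-closeness definition in [251] is based on the Earth Mover’s distance, but the idea is the same: the dataset satisfies the requirement when the largest such distance is at most t.

```python
from collections import Counter

def distribution(values):
    """Empirical distribution of a list of sensitive values."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def max_bucket_distance(buckets):
    """Largest total variation distance between a bucket's distribution of
    sensitive values and the distribution over the whole dataset. (Actual
    t-closeness uses the Earth Mover's distance; total variation is used
    here only to keep the sketch short.)"""
    overall = distribution([v for bucket in buckets for v in bucket])
    def tv_distance(local):
        keys = set(local) | set(overall)
        return sum(abs(local.get(k, 0) - overall.get(k, 0)) for k in keys) / 2
    return max(tv_distance(distribution(bucket)) for bucket in buckets)

# The two buckets of Table 2.10: each is at distance 0.25 from the overall
# distribution, so that dataset would only be t-close for t >= 0.25.
print(max_bucket_distance([["Common cold", "HIV"], ["n/a", "HIV"]]))  # 0.25
```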
