Lowering the cost of anonymization

a PhD thesis

2.1.4  $\delta$-presence

The core intuition behind $k$-map was that we assumed the attacker did not know who was in the dataset. Let us go back to this idea, with a slightly different scenario. Instead of a survey about sensitive information, like human sexual behavior, consider a clinical trial for a drug treating a particular condition, like HIV. The goal is still the same: safely share the data with other people.

These two settings look similar at first glance, but there is a crucial difference: which information is sensitive, exactly? For the survey, the answers of each participant were sensitive, as they revealed intimate details. For the clinical trial, however, being in the dataset is itself the sensitive information: everyone in the trial has been diagnosed with HIV. If an attacker finds out that their target has taken part in the trial, they learn that their target suffers from this disease.

So, what does this change in practice? Let us assume that the dataset contains the records in Table 2.13. Following the same reasoning as for $k$-map, we can research the demographics of the population of ZIP code 85535. Table 2.14 contains the number of people living in this ZIP code, broken down by age range.

ZIP code age
85535 10
85535 12
85535 13
85535 13
85535 16
85535 43
Table 2.13: A sample dataset of participants in a clinical trial.

age population
10–19 5
20–29 5
30–39 10
40–49 10
50–59 20
60+ 15
Table 2.14: Hypothetical population of ZIP code 85535.

We could transform this dataset to have it satisfy $5$-map. A possible strategy is listed in Table 2.15.

ZIP code age
85535 10–19
85535 10–19
85535 10–19
85535 10–19
85535 10–19
85535 40–49
Table 2.15: A generalized version of the dataset from Table 2.13 which satisfies $5$-map.

Is this strategy sufficient to hide who participated in the clinical trial? An attacker could know that exactly 5 people aged between 10 and 19 live in ZIP code 85535. Then, by looking at our de-identified dataset, the attacker can figure out that all of them are part of the dataset. Thus, they have all been diagnosed with HIV. As in the discussion in Section 2.1.3, the attacker learned something sensitive about individuals without reidentifying any record.

How can we define a notion of anonymization that protects against this attack? Recall what we measured for our previous privacy definitions. For each combination of quasi-identifiers, we counted the number of records associated with this combination in the dataset (for $k$-anonymity), or in a larger population (for $k$-map). In the previous example, a privacy issue arose because for some combination of quasi-identifiers, these two numbers were equal. To detect this, we instead compute the ratio between those two numbers and impose an upper bound on it. This is the core intuition of $\delta$-presence.

Definition 4 ($\delta$-presence [297]). Assume $\delta \in (0, 1]$. A database $D$ satisfies $\delta$-presence for a reidentification dataset $R$ if for every possible combination of quasi-identifier values $q$, if this combination is present in $n_D$ records in $D$ and $n_R$ records in $R$, then $n_D / n_R \le \delta$.

The lower $\delta$ is, the stronger the definition becomes. The example above had $\delta = 1$: the ratio for the records with ZIP code 85535 and age range 10–19 is $5/5 = 1$. If $\delta$ is below 1 but still close to it, then similarly to the example in Section 2.1.3, the attacker might still get significant probabilistic information. The original definition was introduced with a lower bound on this ratio, in addition to the upper bound. This captures the intuition that the attacker should also not be able to find out that their target is not in the dataset. This latter concern is less frequent in practice, so we omitted it from the definition above for simplicity.
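The check in Definition 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `smallest_delta` and the dictionary encoding of the counts are hypothetical choices made here, and the numbers are the ones from the running example (five trial participants aged 10–19, one aged 40–49, and the population of Table 2.14).

```python
from fractions import Fraction


def smallest_delta(dataset_counts, population_counts):
    """Return the largest ratio n_D / n_R over all quasi-identifier
    combinations present in the dataset. This is the smallest delta
    for which the dataset satisfies delta-presence.

    Both arguments map a quasi-identifier combination to a record count
    (a hypothetical encoding chosen for this sketch).
    """
    worst = Fraction(0)
    for combination, n_d in dataset_counts.items():
        n_r = population_counts.get(combination, 0)
        if n_r < n_d:
            # The dataset is assumed to be a subset of the population.
            raise ValueError(f"inconsistent counts for {combination}")
        worst = max(worst, Fraction(n_d, n_r))
    return worst


# Generalized trial data from the running example: (ZIP code, age range).
dataset_counts = {("85535", "10-19"): 5, ("85535", "40-49"): 1}

# Population of ZIP code 85535 (Table 2.14).
population_counts = {
    ("85535", "10-19"): 5,
    ("85535", "20-29"): 5,
    ("85535", "30-39"): 10,
    ("85535", "40-49"): 10,
    ("85535", "50-59"): 20,
    ("85535", "60+"): 15,
}

print(smallest_delta(dataset_counts, population_counts))  # prints 1
```

Exact fractions are used instead of floats so that the comparison with a target $\delta$ is not affected by rounding.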

To make our dataset satisfy $\delta$-presence for a lower $\delta$, we could generalize the data further, as demonstrated in Table 2.16. This table satisfies $1/4$-presence: the ratio for the records with age range 10–39 becomes $5/20 = 1/4$, while the ratio for the record with age range 40–49 is still $1/10$.

ZIP code age
85535 10–39
85535 10–39
85535 10–39
85535 10–39
85535 10–39
85535 40–49
Table 2.16: A generalized version of the dataset from Table 2.13 which satisfies $1/4$-presence.
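The effect of generalizing further can be reproduced with a short Python sketch. It is a hedged illustration, not a general-purpose tool: the function `presence_ratios`, the representation of age ranges as inclusive `(low, high)` pairs, and the assumption that every range is aligned with the ten-year rows of Table 2.14 are all choices made for this example. The open-ended 60+ row is left out because no range used here covers it.

```python
from fractions import Fraction

# Population of ZIP code 85535 by ten-year age range (Table 2.14),
# keyed by the first age of each range. The 60+ row is omitted.
POPULATION_BY_DECADE = {10: 5, 20: 5, 30: 10, 40: 10, 50: 20}

# Ages of the trial participants (Table 2.13).
AGES = [10, 12, 13, 13, 16, 43]


def presence_ratios(ages, population_by_decade, bins):
    """For each inclusive (low, high) age range in `bins`, return the
    ratio between the number of records and the population in that range.
    Ranges are assumed to be aligned with ten-year boundaries."""
    ratios = {}
    for low, high in bins:
        n_d = sum(1 for age in ages if low <= age <= high)
        n_r = sum(
            count
            for start, count in population_by_decade.items()
            if low <= start and start + 9 <= high
        )
        if n_d > 0:  # only ranges present in the dataset matter
            ratios[(low, high)] = Fraction(n_d, n_r)
    return ratios


fine = presence_ratios(AGES, POPULATION_BY_DECADE, [(10, 19), (40, 49)])
coarse = presence_ratios(AGES, POPULATION_BY_DECADE, [(10, 39), (40, 49)])

print(max(fine.values()))    # prints 1
print(max(coarse.values()))  # prints 1/4
```

Widening the first range from 10–19 to 10–39 leaves the number of records unchanged but raises the matching population from 5 to 20, which is what lowers the worst-case ratio.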

$\delta$-presence in practice

$\delta$-presence suffers from the same policy and applicability difficulties as $k$-map, discussed in Section 2.1.2: without access to the reidentification dataset $R$, it is impossible to compute $\delta$-presence exactly. Approximations are possible, like the one proposed in [298], but they are difficult to use in practice: they require a statistical model of the reidentification dataset, and the proposed approximation algorithm is computationally costly.
