# δ-presence, for when being in the dataset is sensitive

**Remember** \(k\)-map? We used this definition when the attacker
didn't know who was in the dataset. Let's go back to this setting, with a
slightly different scenario. You're no longer a doctor studying human sexual
behavior. You're still a doctor, but this time, you're specialized in treating a
particular chronic disease. Instead of running a survey, you're running a
clinical trial for a new drug to treat this disease. Similarly, you want to
share the data with other people.

At first glance, these two settings look similar — but there is a crucial
difference. Which information is sensitive, exactly? For the survey, the
*answers* of each participant are sensitive, as they reveal intimate details.
But for the clinical study, *being* in the dataset is the sensitive information.
If someone figures out that you've taken part in the study, they learn that you
suffer from this disease.

So, what does it change in practice? Suppose that your dataset contains the following records:

ZIP code | age |
---|---|

85535 | 10 |

85535 | 12 |

85535 | 13 |

85535 | 13 |

85535 | 16 |

85535 | 43 |

You do a little research on who lives in ZIP code 85535. You learn that in this ZIP code:

**5**people have ages between 10 and 19;**5**people have ages between 20 and 29;**10**people have ages between 30 and 39;**10**people have ages between 40 and 49;- and
**20**people are 50 or older.

Transforming this part of your dataset to have it satisfy \(5\)-map is easy:

ZIP code | age |
---|---|

85535 | 10-19 |

85535 | 10-19 |

85535 | 10-19 |

85535 | 10-19 |

85535 | 10-19 |

85535 | 40-49 |

… But what has gone wrong there?

An attacker, using only public data, knows that there are 5 people aged between
10 and 19 in ZIP code 85535. Then, by looking at your de-identified dataset, the
attacker can figure out that *all of them are part of your data*. Thus, they all
have this specific disease. The attacker learned something sensitive about
individuals, without re-identifying any record. Just like in the example of
\(l\)-diversity!

We need yet another definition. Introducing… \(\delta\)-presence!

# Definition

Remember what we counted for our previous privacy definitions? For each combination of quasi-identifier attributes:

- for \(k\)-anonymity, we counted the number of records
*in the dataset*; - and for \(k\)-map, we counted the number of records
*in the larger population*.

What went wrong in our leading example? For certain attributes, these numbers
were *equal*. To detect this, we now compute the *ratio* between those two
numbers. Then, the \(\delta\) in \(\delta\)-presence is the *largest* ratio across
the dataset.

Consider the dataset above. The ratio for the records (85535, 10-19) is \(5/5=1\), and the ratio for the records (85535, 40-49) is \(1/10=0.1\). Thus, since we defined \(\delta\) as the greatest ratio, we have \(\delta=1\). Since the \(k\) of \(k\)-map is always larger than the \(k\) of \(k\)-anonymity, this is the maximum possible value of \(\delta\). Saying that a dataset satisfies \(1\)-presence gives zero guarantees.

Whether \(\delta=1\) is not the only interesting thing. We also want this value to
be *small*. The lower, the better. Consider what it means if \(\delta=0.95\). The
attacker might learn that their target has a 95% chance of being in the dataset.
It's not quite a 100% certainty, but it still can be problematic. For example,
it might be more than enough for an insurance company to deny you coverage…

How do we get to a lower \(\delta\) in our previous example? One solution would be to generalize the age further:

ZIP code | age |
---|---|

85535 | 10-39 |

85535 | 10-39 |

85535 | 10-39 |

85535 | 10-39 |

85535 | 10-39 |

85535 | 40-49 |

Then, the ratio for the records (85535, 10-39) becomes \(5/(5+5+10)=0.25\). The ratio for record (85535, 40-49) is still \(0.1\), so \(\delta=0.25\). (Assuming that no other record in the dataset has ZIP code 85535, and all other records have a smaller ratio).

\(\delta\)-presence was first proposed by Nergiz et al. in a 2007 paper^{
(pdf)}. In this paper, the definition is a bit different. The
authors compute not only the largest ratio, but also the *smallest* one. The
\(\delta\) parameter hides two parameters
\(\left(\delta_{\text{min}},\delta_{\text{max}}\right)\). This was done to protect
against the symmetric attack: hiding that someone is *not* in the dataset. I
never encountered a situation where this is a real concern, so I simplified it a
bit for this post.

# \(\delta\)-presence in practice

\(\delta\)-presence is computed from the ratios between quantities used in \(k\)-anonymity and \(k\)-map. While \(k\)-anonymity is very easy to compute, \(k\)-map is much harder. As such, \(\delta\)-presence has very similar practical characteristics than \(k\)-map. Since you don't typically have access to the full larger dataset, you can't compute \(\delta\) exactly. You can use a pessimistic approximation if your data is a sample of a larger dataset that you own. You can also do the work of estimating \(\delta\)-presence by hand.

What about statistical approximations? Nergiz et al. proposed an interesting
method in a followup paper^{ (pdf)}. Unfortunately, two of
its requirements make it hardly usable in practical scenarios.

- First, to run the algorithm, you need to "describe your beliefs about the world" (in a statistical sense). Unless you're a statistician, this is not something you can really do.
- Second, computing the algorithm exactly is very expensive. The authors propose a lot of approximations to make it tractable… But then, using them makes the results even more uncertain.

Finally, if you still want to use this algorithm, you would also likely have to implement it yourself. I don't know of any available software that does it for you.

# Conclusion

Like \(k\)-map, in theory, it often makes sense to use \(\delta\)-presence. It's a pity that both definitions are so difficult to use in practice! Having simpler (and more usable) approximation algorithms would be great… Which is why I have done some research work in that direction. And the results of this work will be the topic of a future post! =)