2.1.5  Flaws of syntactic privacy definitions

The four definitions mentioned in the previous section all seem to have serious limitations. Many of these attempts at defining anonymization feel like “patches” applied to prior notions to fix their most obvious flaws. However, each of these patches seems to only raise more difficult questions, without eliminating the privacy issues entirely. Choosing when to apply each definition is also difficult: the definitions protect against different things, so one has to anticipate what kind of attacker might target the data before choosing the relevant definition. This choice is itself a difficult policy decision, and choosing the wrong attack model can have devastating consequences.

Perhaps more importantly, these definitions all share some fundamental issues. First, they all break down if the attacker has unexpected background knowledge. If the choice of quasi-identifier columns is wrong, and the attacker knows some auxiliary information that we did not expect, then all guarantees are lost. This background knowledge issue can also arise if the attacker has some information about the original database that we did not expect. For example, in ℓ-diversity, if the attacker knows the information of some other users in the dataset, they might be able to reduce the number of possible options for the sensitive value. This problem is especially worrying if the attacker can potentially influence the data: k-anonymity, for example, is pointless if the attacker can simply add arbitrarily many records to the dataset to make a specific quasi-identifier combination pass the threshold. This last attack vector might not be a major concern for medical data, but it is realistic for e.g. online services, where a single individual could easily create fake accounts.
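To make this last attack vector concrete, here is a minimal sketch in Python of how an attacker could exploit it. The record layout, the value of k, and the specific values are hypothetical, and the generalization step of a real anonymization pipeline is skipped; the sketch only shows why the syntactic threshold provides no protection against injected records.

    from collections import Counter

    # A minimal sketch of the "fake records" attack on k-anonymity.
    # The record layout, the value of k, and all values are hypothetical.

    K = 5
    QUASI_IDENTIFIERS = ("zip_code", "age")

    def is_k_anonymous(records, k=K):
        """Syntactic check: every quasi-identifier combination that appears
        in the data appears at least k times."""
        counts = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in records)
        return all(c >= k for c in counts.values())

    # The target is the only real person with this quasi-identifier combination.
    real_records = [{"zip_code": "4217", "age": 32, "diagnostic": "HIV"}]

    # The attacker registers fake accounts sharing the target's quasi-identifiers,
    # with sensitive values chosen by the attacker (so they can discount them later).
    fake_records = [{"zip_code": "4217", "age": 32, "diagnostic": "Flu"}
                    for _ in range(K - 1)]

    published = real_records + fake_records
    assert is_k_anonymous(published)  # the syntactic threshold is met...

    # ...but the attacker knows which values they injected; whatever remains
    # in the group must be the target's sensitive value.
    remaining = [r["diagnostic"] for r in published]
    for injected in (r["diagnostic"] for r in fake_records):
        remaining.remove(injected)
    print(remaining)  # ['HIV']

In a real pipeline, the anonymizer would generalize or suppress small quasi-identifier groups; flooding the group above the threshold defeats exactly that defense.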

Second, they implicitly assume that the data release happens in isolation: the original data will not be published in any other way, nor will related data from different datasets. Ensuring that a specific dataset is only released once might be feasible in theory. However, it is near-impossible to predict which future releases, possibly made by third parties, might invalidate a risk assessment made at a certain point in time. For example, consider two hospitals that independently release medical information using ℓ-diversity, exemplified in Table 2.17. An attacker knowing that their target (with ZIP code 4217 and age 32) visited both hospitals can compare both releases, look at which diagnostic is the same in both, and deduce that their target has HIV.

First hospital:

ZIP code     age     diagnostic
4000–4999    20–34   Common cold
4000–4999    35–39   Otitis
4000–4999    35–39   HIV
4000–4999    20–34   HIV

Second hospital:

ZIP code     age     diagnostic
4000–4999    20–34   HIV
4000–4999    35–39   Broken leg
4000–4999    35–39   n/a
4000–4999    20–34   Flu

Table 2.17: Two ℓ-diverse datasets.
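As a sketch, assuming the two releases of Table 2.17 are available as simple lists of rows (the data layout and helper function below are illustrative), the attack amounts to a set intersection.

    # A minimal sketch of the intersection attack on the two releases of
    # Table 2.17; the data layout below is an assumption for illustration.

    hospital_a = [
        ("4000-4999", "20-34", "Common cold"),
        ("4000-4999", "35-39", "Otitis"),
        ("4000-4999", "35-39", "HIV"),
        ("4000-4999", "20-34", "HIV"),
    ]
    hospital_b = [
        ("4000-4999", "20-34", "HIV"),
        ("4000-4999", "35-39", "Broken leg"),
        ("4000-4999", "35-39", "n/a"),
        ("4000-4999", "20-34", "Flu"),
    ]

    def possible_diagnostics(release, zip_bucket, age_bucket):
        """All sensitive values compatible with the target's quasi-identifiers."""
        return {d for (z, a, d) in release if z == zip_bucket and a == age_bucket}

    # The target (ZIP code 4217, age 32) falls into the 4000-4999 / 20-34 group
    # of both releases; only one diagnostic is possible in both.
    candidates = (possible_diagnostics(hospital_a, "4000-4999", "20-34")
                  & possible_diagnostics(hospital_b, "4000-4999", "20-34"))
    print(candidates)  # {'HIV'}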

Such attacks are not theoretical: in [158], the authors experimentally show that this scenario can plausibly lead to such disclosures. Two distinct organizations publishing related data is not the only scenario in which this problem can arise: multiple k-anonymous views of the same data can also lead to k-anonymity violations [400], and the same problem can arise for multiple publications of the same dynamic dataset over time [395].

Third, the fact that a dataset satisfies a given privacy definition does not guarantee that the data release did not leak information that we were trying to protect. Indeed, consider an arbitrary privacy definition Def, and pick three arbitrary datasets D1, D2, and D3 that all satisfy this privacy definition. Consider the following algorithm, built by an attacker who wants to find out private information about their target Camille.

  • If Camille is in the dataset and has HIV, output D1.
  • Else, if Camille is in the dataset, output D2.
  • Else, output D3.

This algorithm always produces an output that satisfies the required privacy definition. It is possible to instantiate Def as k-anonymity, ℓ-diversity, δ-presence, or any other definition applying to datasets that we could think of. But of course, the algorithm also reveals private information that all the definitions we previously listed tried to protect!
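As a sketch, such an algorithm could look as follows in code; the record fields ("name", "diagnostic") and the way the datasets D1, D2, and D3 are passed in are assumptions for illustration, since their contents do not matter for the argument.

    # A minimal sketch of the pathological "anonymizer" described above.
    # D1, D2 and D3 are any three fixed datasets that each satisfy the chosen
    # privacy definition; the record fields used here are assumptions.

    def pathological_anonymizer(original_dataset, D1, D2, D3):
        """Always outputs a dataset satisfying the definition, yet the choice
        of output reveals whether Camille is in the data and has HIV."""
        camille_records = [r for r in original_dataset if r.get("name") == "Camille"]
        if any(r.get("diagnostic") == "HIV" for r in camille_records):
            return D1
        elif camille_records:
            return D2
        else:
            return D3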

This example is, of course, not realistic: nobody would actually use an algorithm like this to anonymize their data. However, it demonstrates a deep issue with this class of definitions: knowing that an algorithm always outputs a dataset satisfying a given definition is not enough to trust this algorithm not to inadvertently leak data. This is not just a theoretical concern, either: minimality attacks exploit the behavior of anonymization algorithms that adapt their strategy to optimize the utility of the output data [246, 247, 406] in order to reverse-engineer the original contents of the database [389].

For all the definitions seen so far, it is possible to check, simply by looking at the output of an algorithm, whether this output satisfies the definition. Anonymity is seen as a syntactic property, which is why we call these definitions syntactic privacy definitions. How can we overcome the fundamental issues of this approach? We need to change perspective. Instead of considering anonymization as a property of the output dataset, we need to consider it as a property of the process, and come up with semantic formalizations of anonymity that describe what information this process can potentially leak. This is one of the core insights behind differential privacy.
