Where does privacy risk come from, when releasing anonymized data? What exactly can go wrong? Sweeney provided the first obvious answer: privacy risk appears when you can reidentify a record. That makes sense, and led to the definition of \(k\)-anonymity. If you think your data is anonymous, but somebody pinpoints a record and figures out who it is, clearly, there's a problem.
But as researchers discovered shortly after, it's sometimes not enough. An attacker might figure out private information about someone, without reidentifying their record. Even if the dataset is \(k\)-anonymous. How does this magic work? First, we'll show how it works with an example, then we'll describe the natural solution: \(l\)-diversity.
Suppose you have the following database, which contains everyone in the country.
(It's a rather small country.)
Now, you want to release an anonymized version of this database, for research purposes. Following the \(k\)-anonymity method, you start by wondering which columns are identifying. Let's see.
- name is obviously identifying: we have to remove it completely.
- ZIP code and age are quasi-identifiers. They can help you identify someone, but reducing their precision might prevent this.
- diagnostic is sensitive, but since it's typically secret, we can consider it non-identifying¹.
So, let's make this data \(k\)-anonymous. Here, \(k=2\), because it's a small country.
Since all combinations of ZIP code & age appear twice, this data is \(2\)-anonymous. But now, suppose an attacker wants to find Camille's diagnostic. The attacker knows that Camille has ZIP code 4732 and age 23. They can easily figure out that Camille's record is the third or fourth one, but cannot know which.
And there's the obvious problem: both records have the same diagnostic. So the attacker can deduce that Camille's diagnostic is "Otitis". Even without knowing which record is Camille's! \(k\)-anonymity wasn't enough to protect Camille's private information.
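Here is a minimal sketch of that attack in Python. The rows are hypothetical stand-ins for the generalized table: only Camille's ZIP code and the "Otitis" diagnostic come from the example above, and the function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical generalized rows: (zip_code, age_range, diagnostic).
# Only the "4732" bucket with "Otitis" mirrors the example in the text.
rows = [
    ("42**", "20-29", "Herpes"),
    ("42**", "20-29", "Healthy"),
    ("4732", "20-29", "Otitis"),
    ("4732", "20-29", "Otitis"),
]

def buckets(rows):
    """Group sensitive values by quasi-identifier tuple (ZIP code, age range)."""
    groups = defaultdict(list)
    for zip_code, age_range, diagnostic in rows:
        groups[(zip_code, age_range)].append(diagnostic)
    return dict(groups)

def k_of(rows):
    """Largest k for which the rows are k-anonymous: the smallest bucket size."""
    return min(len(values) for values in buckets(rows).values())

def leaked_buckets(rows):
    """Buckets in which every record shares the same sensitive value."""
    return {
        quasi_id: values[0]
        for quasi_id, values in buckets(rows).items()
        if len(set(values)) == 1
    }

print(k_of(rows))            # 2
print(leaked_buckets(rows))  # {('4732', '20-29'): 'Otitis'}
```

The data passes the \(2\)-anonymity check, yet the second function finds Camille's bucket without ever deciding which record is hers.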
\(l\)-diversity: the obvious fix
So. Let's say that all users with the same quasi-identifier tuple are in the same bucket. If all sensitive values are the same within a bucket, we might leak private information. The obvious solution? Imposing some diversity in the sensitive values associated with the same (generalized) tuple.
This is \(l\)-diversity, as introduced in 2006 by Machanavajjhala et al. It builds on the definition of \(k\)-anonymity. \(l\)-diversity states that each bucket must have at least \(l\) distinct sensitive values. As a consequence, each bucket contains at least \(l\) users: \(l\)-diversity implies \(l\)-anonymity.
Let's try to make the data above \(2\)-diverse.
Now, consider our attacker from earlier, targeting Camille (third row). Like before, the attacker is unable to know which record corresponds to Camille. But this time, they also can't know whether Camille was healthy or suffered from otitis. The sensitive value stays private.
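The definition translates almost directly into code. This is a sketch rather than a reference implementation: the function names and the rows are hypothetical, and it assumes that the sensitive value is the last field of each row.

```python
from collections import defaultdict

def l_of(rows):
    """Largest l for which the rows are l-diverse.

    Each row is (quasi-identifiers..., sensitive_value). The result is the
    smallest number of distinct sensitive values found in any bucket.
    Assumes there is at least one row.
    """
    distinct = defaultdict(set)
    for *quasi_identifiers, sensitive in rows:
        distinct[tuple(quasi_identifiers)].add(sensitive)
    return min(len(values) for values in distinct.values())

def is_l_diverse(rows, l):
    """True if every bucket holds at least l distinct sensitive values."""
    return l_of(rows) >= l

# Hypothetical data: Camille's bucket before and after the fix.
homogeneous = [("4732", "20-29", "Otitis"), ("4732", "20-29", "Otitis")]
diverse = [("47**", "20-29", "Otitis"), ("47**", "20-29", "Healthy")]

print(is_l_diverse(homogeneous, 2))  # False
print(is_l_diverse(diverse, 2))      # True
```

Since a bucket with \(l\) distinct values has at least \(l\) rows, any table that passes this check is also \(l\)-anonymous.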
Wait, that seems too easy
You might have noticed it immediately: the definition of \(l\)-diversity has some flaws. Let's list two of them.
Uncertain information can still be sensitive
What's the key idea behind \(l\)-diversity? If the attacker has uncertainty over the sensitive value, then we avoid leaking private info. But consider the following database, which satisfies \(2\)-diversity:
Suppose the attacker knows that their target has ZIP code 4235 and age 25. The target's record is one of the first two rows. The attacker can learn that their target either has AIDS, or hepatitis B. They can't be sure which one is the correct one… But they can infer that their target has a sexually transmitted infection. This information, of course, might be embarrassing for the target!
How to fix this? One solution could be to group diagnostics into categories, like diagnostic code families. Then, we can require that each bucket has \(l\) different categories of diagnostics. This way, the attacker can't distinguish between sexually transmitted infections, external injuries, respiratory problems, etc.
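As a rough sketch, the check counts distinct categories instead of distinct diagnostics. The category mapping below is invented for illustration (a real one would be a policy decision), and so are the rows and the function name.

```python
from collections import defaultdict

# Hypothetical mapping from diagnostic to category.
CATEGORY = {
    "AIDS": "sexually transmitted infection",
    "Hepatitis B": "sexually transmitted infection",
    "Flu": "respiratory problem",
    "Broken leg": "external injury",
}

def category_l_of(rows, category=CATEGORY):
    """Smallest number of distinct diagnostic categories in any bucket."""
    distinct = defaultdict(set)
    for *quasi_identifiers, diagnostic in rows:
        distinct[tuple(quasi_identifiers)].add(category[diagnostic])
    return min(len(categories) for categories in distinct.values())

# Hypothetical rows: the first bucket has two diagnostics but one category.
rows = [
    ("4235", 25, "AIDS"),
    ("4235", 25, "Hepatitis B"),
    ("4236", 30, "Flu"),
    ("4236", 30, "Broken leg"),
]

print(category_l_of(rows))  # 1
```

The first bucket has two distinct diagnostics, so it would pass a plain \(2\)-diversity check, but it fails once we count categories.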
Unfortunately, choosing these categories is a complicated policy question. There are many possible combinations of sensitive values. Making sure that none of them is sensitive sounds like a laborious task…
Probabilistic information gain
Consider the following database, again satisfying \(2\)-diversity:
Consider the same attacker as before: targeting someone with ZIP code 4235 and age 25. They can't know their target's diagnostic for certain. But they can get a strong suspicion that the target has lupus: 9 out of 10 records share this diagnostic! An insurance company might increase someone's premium because of a suspected pre-existing condition. Isn't that also a privacy issue?
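To put a number on that suspicion, we can compute the share of the most frequent sensitive value in the target's bucket. A small sketch: the 9-out-of-10 split comes from the example, while the tenth value ("Healthy") and the function name are assumptions.

```python
from collections import Counter

def max_confidence(sensitive_values):
    """Most frequent sensitive value in a bucket, and its share of the bucket."""
    value, count = Counter(sensitive_values).most_common(1)[0]
    return value, count / len(sensitive_values)

# 9 out of 10 records share the same diagnostic; the tenth one is made up.
bucket = ["Lupus"] * 9 + ["Healthy"]

print(max_confidence(bucket))  # ('Lupus', 0.9)
```

The bucket has two distinct values, so it is \(2\)-diverse, but the attacker's best guess is right 90% of the time.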
How do we protect against this type of probabilistic information gain? Requiring that sensitive attributes are diverse is not enough. We also need to require that their distribution in each bucket is roughly the same as in the rest of the data. If 40% of the records are "healthy" in the overall data, then each bucket must also have roughly 40% of "healthy" records. This way, the attacker's knowledge can't change too much from the baseline. This is the core idea behind another definition named \(t\)-closeness. I won't go into details here, but you can read about it on Wikipedia or in the original paper that introduced this idea.
Note: this idea is also relevant if the sensitive attribute is numeric, like salary values. A yearly salary of €20,000 is very similar to €20,100: applying \(l\)-diversity doesn't make sense. By contrast, \(t\)-closeness can compare distributions in a more meaningful way.
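To make "roughly the same distribution" concrete, here is a sketch that compares a bucket to the overall data. One caveat: the \(t\)-closeness paper measures this with the Earth Mover's Distance, and I'm swapping in total variation distance here because it fits in a few lines for categorical values. The data and function names are hypothetical.

```python
from collections import Counter

def distribution(values):
    """Map each value to its relative frequency."""
    total = len(values)
    return {value: count / total for value, count in Counter(values).items()}

def total_variation(bucket_values, all_values):
    """Half the sum of absolute frequency differences (0 = identical)."""
    p = distribution(bucket_values)
    q = distribution(all_values)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q))

# Hypothetical overall data with 40% "Healthy" records.
overall = ["Healthy"] * 4 + ["Lupus"] * 6
skewed_bucket = ["Lupus"] * 9 + ["Healthy"]
balanced_bucket = ["Healthy"] * 2 + ["Lupus"] * 3

print(total_variation(skewed_bucket, overall))    # roughly 0.3
print(total_variation(balanced_bucket, overall))  # 0.0
```

A closeness requirement would then reject any bucket whose distance to the overall distribution exceeds some threshold \(t\). For numeric attributes like salaries, a distance that accounts for how far apart values are is a better fit than this one.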
\(l\)-diversity in practice
OK, so even with these flaws, how easy is it to use \(l\)-diversity in practice?
The good news: implementation
From an algorithmic perspective, \(l\)-diversity is very similar to \(k\)-anonymity. The basic blocks are the same: generalization and suppression. Finding the best strategy is also done using trial-and-error heuristics. The approach used for \(k\)-anonymity is straightforward to adapt to \(l\)-diversity.
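To give an idea of what these building blocks look like, here is a toy sketch of generalization, suppression, and a trial-and-error search. Everything in it is an assumption made for illustration: the row layout, the list of candidate strategies, and the suppression budget. Real tools explore a much larger space of strategies.

```python
from collections import defaultdict

def generalize(row, zip_digits, age_width):
    """Keep the first zip_digits of the ZIP code and bin the age."""
    zip_code, age, diagnostic = row
    coarse_zip = zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits)
    low = (age // age_width) * age_width
    return (coarse_zip, f"{low}-{low + age_width - 1}", diagnostic)

def suppress_non_diverse(rows, l):
    """Drop every bucket with fewer than l distinct sensitive values."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[:2]].append(row)
    return [
        row
        for group in buckets.values()
        if len({r[2] for r in group}) >= l
        for row in group
    ]

def anonymize(rows, l, max_suppressed):
    """Try strategies from most precise to coarsest; return the first that fits."""
    strategies = [(4, 1), (3, 5), (2, 10), (0, 100)]  # (zip_digits, age_width)
    for zip_digits, age_width in strategies:
        generalized = [generalize(r, zip_digits, age_width) for r in rows]
        kept = suppress_non_diverse(generalized, l)
        if len(rows) - len(kept) <= max_suppressed:
            return kept
    return []  # no strategy satisfied the suppression budget

# Hypothetical raw rows: (zip_code, age, diagnostic).
raw = [
    ("4217", 34, "Flu"),
    ("4212", 39, "Healthy"),
    ("4732", 23, "Otitis"),
    ("4733", 27, "Healthy"),
]

for row in anonymize(raw, l=2, max_suppressed=0):
    print(row)
```

Swapping the bucket test in `suppress_non_diverse` for a size check (`len(group) >= k`) turns the same loop into a \(k\)-anonymity search, which is why adapting existing approaches is straightforward.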
Unsurprisingly, some software is available to implement it in practice. I won't list them all here, but most options introduced in my article about \(k\)-anonymity can also be used for \(l\)-diversity.
The bad news: policy
Choosing the right value of \(k\) for \(k\)-anonymity is difficult, but \(l\)-diversity is certainly not better. No official guideline or regulation will help you choose the value of \(l\). And it's at least as hard to quantify the "amount of privacy" obtained with a given choice of parameter.
Worse, the flaws described before mean that the question is even subtler than that. Should we classify the sensitive values into categories? Impose that sensitive values don't appear too often? If so, there are even more parameters that one has to choose, and no good way to choose them.
The other bad news: utility loss
\(l\)-diversity, despite its flaws, is strictly stronger than \(k\)-anonymity. And it should be relatively easy to use in practice, once we've chosen a policy… Despite this, it is hardly ever used. A health data de-identification specialist once told me that they only saw it in the wild a handful of times. By contrast, using \(k\)-anonymity is very common.
Why is that? I see two possible reasons.
First, the utility loss of \(l\)-diversity is too significant compared to \(k\)-anonymity. A study compared the utility loss of different anonymization strategies. It found that applying \(3\)-diversity to a dataset was worse than using \(100\)-anonymity! This particular result was for a classification accuracy task, but you get the idea².
Second, the privacy gains are not clear. Especially considering the flaws we described above… And fixing those flaws hurts utility even more. With \(t\)-closeness, we hinder the ability to link demographic features with diagnostics. But this type of analysis is exactly what healthcare researchers want to do! If the privacy definition goes completely against this idea, it won't get much love.
\(l\)-diversity isn't a definition that definitely addresses a particular threat model. Rather, it's a "fix" for one of \(k\)-anonymity's most obvious flaws. But in security, simply patching bugs one after the other isn't a great defense mechanism… For privacy definitions too, fixing only one attack doesn't get you very far.
¹ Which might be over-optimistic: some people make their health issues public on social media, or the press can find out and publicize the medical history of public figures. If you're actually doing this to anonymize a real dataset, you should be more careful when classifying your columns. Here, we're going with a simple assumption for the sake of the example.
² It's hard to find many examples: negative results hardly ever get published…