Are you following tech- or privacy-related news? If so, you might have heard about differential privacy. The concept is popular both in academic circles and inside tech companies. Both Apple and Google use differential privacy to collect data in a private way.
So, what's this definition about? How is it better than definitions that came before? More importantly, why should you care? What makes it so exciting to researchers and tech companies? In this post, I'll try to explain the idea behind differential privacy and its advantages. I'll do my best to keep it simple and accessible for everyone — not only technical folks.
What it means
Suppose you have a process that takes some database as input, and returns some output.
This process can be anything. For example, it can be:
- computing some statistic ("tell me how many users have red hair")
- an anonymization strategy ("remove names and last three digits of ZIP codes")
- a machine learning training process ("build a model to predict which users like cats")
- … you get the idea.
To make a process differentially private, you usually have to modify it a little bit. Typically, you add some randomness, or noise, in some places. What exactly you do, and how much noise you add, depends on which process you're modifying. I'll abstract that part away and simply say that your process is now doing some unspecified ✨ magic ✨.
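If you want a peek behind the magic, here is a minimal sketch of one common way to add noise to a counting query: the Laplace mechanism. The post doesn't commit to any particular technique, so treat this as an assumption. The function name `noisy_count` and the privacy parameter `epsilon` (smaller means more noise) are hypothetical names chosen for illustration. This is a toy: naive floating-point noise like this is known to be unsafe for real deployments.

```python
import random

def noisy_count(rows, predicate, epsilon, rng=random):
    """Count the rows matching `predicate`, then add Laplace noise of scale 1/epsilon."""
    true_count = sum(1 for row in rows if predicate(row))
    # The difference of two independent exponential samples with rate epsilon
    # follows a Laplace distribution centered on 0 with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise

# Hypothetical toy data: "tell me how many users have red hair".
users = [{"hair": "red"}, {"hair": "brown"}, {"hair": "red"}, {"hair": "black"}]
print(noisy_count(users, lambda user: user["hair"] == "red", epsilon=1.0))
```

Each run prints a different number close to 2. That randomness is the whole point.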
Now, remove somebody from your database, and run your new process on the modified database. If the new process is differentially private, then the two outputs (with and without that person) are basically the same. This must be true no matter who you remove, and what database you had in the first place.
By "basically the same", I don't mean "it looks a bit similar". Instead, remember that the magic you added to the process was randomized. You don't always get the same output if you run the new process several times. So what does "basically the same" mean in this context? That the probability distributions are similar. You can get the exact same output with database 1 or with database 2, with similar likelihood.
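To make "similar likelihood" concrete, here is a small sketch. It assumes the process is a count with Laplace noise of scale 1/epsilon, which is one common choice and not something this post specifies. The counts 100 and 99 and the helper `laplace_pdf` are made up for illustration. For any output, the likelihood under database 1 and under database 2 differs by a factor of at most e^epsilon.

```python
import math

def laplace_pdf(x, center, scale):
    """Density of a Laplace distribution centered on `center`."""
    return math.exp(-abs(x - center) / scale) / (2 * scale)

epsilon = 1.0
scale = 1 / epsilon
count_with_target = 100     # hypothetical true count in database 1
count_without_target = 99   # same database with one person removed

ratios = []
for output in (95.0, 99.5, 104.0):
    ratio = (laplace_pdf(output, count_with_target, scale)
             / laplace_pdf(output, count_without_target, scale))
    ratios.append(ratio)
    print(f"output {output}: likelihood ratio {ratio:.3f}")

# Every ratio stays between exp(-epsilon) and exp(epsilon).
print(f"bounds: {math.exp(-epsilon):.3f} to {math.exp(epsilon):.3f}")
```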
What does this have to do with privacy? Well, suppose you're a creepy person trying to figure out whether your target is in the original data. By looking at the output, you can't be 100% certain of anything. Sure, it could have come from a database with your target in it. But it could also have come from the exact same database, without your target. Both options have a similar probability, so there's not much you can say.
You might have noticed that this definition is not like the ones we've seen before. We're not saying that the output data satisfies differential privacy. We're saying that the process does. This is very different from \(k\)-anonymity and other definitions we've seen. There is no way to look at data and determine whether it satisfies differential privacy. You have to know the process to know whether it is "anonymizing" enough.
And that's about it. It's a tad more abstract than other definitions we've seen, but not that complicated. So, why all the hype? What makes it so awesome compared to older, more straightforward definitions?
Why it's awesome
Privacy experts, especially in academia, are enthusiastic about differential privacy. It was first proposed by Cynthia Dwork and Frank McSherry, in 2005. Very soon, almost all researchers working on anonymization started building differentially private algorithms. And, as we've already mentioned, tech companies are also trying to use it whenever possible. I can count three main reasons for this enthusiasm.
You no longer need attack modeling
Remember the previous definitions we've seen? (If not, you're fine, just take my word for it :D) Why did we need \(k\)-map in certain cases, and \(k\)-anonymity or \(\delta\)-presence in others? To choose the right one, we had to figure out the attacker's capabilities and goals. In practice, this is pretty difficult. You might not know exactly what your attacker is capable of. Worse, there might be unknown unknowns: attack vectors that you hadn't imagined at all. You can't make very broad statements when you use old-school definitions. You have to make some assumptions, which you can't be 100% sure of.
By contrast, when you use differential privacy, you get two awesome guarantees.
- You protect any kind of information about an individual. It doesn't matter what the attacker wants to do. Reidentify their target, know if they're in the dataset, deduce some sensitive attribute… All those things are protected. Thus, you don't have to think about the goals of your attacker.
- It works no matter what the attacker knows about your data. They might already know some people in the database. They might even add some fake users to your system. With differential privacy, it doesn't matter. The users that the attacker doesn't know are still protected.
You can quantify the privacy loss
We saw that when using \(k\)-anonymity, choosing the parameter \(k\) is pretty tricky. There is no clear link between which \(k\) to choose and how "private" the dataset is. This problem is present in all the other definitions we've seen so far, and for some of them it is even worse.
Differential privacy is much better. When you use it, you can quantify the greatest possible information gain by the attacker. The corresponding parameter, usually named \(\varepsilon\), allows you to make very strong statements. Suppose \(\varepsilon=1.1\). Then, you can say: "an attacker who thinks their target is in the dataset with probability 50% can increase their level of certainty to at most 75%."
And do you remember the previous point about attack modeling? It means you can change this statement in many ways. You can replace "their target is in the dataset" with anything about one individual. And you can add "no matter what the attacker knows" if you want to be extra-precise. Altogether, that makes differential privacy much stronger than all definitions that came before.
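If you'd like to check the 50% to 75% figure yourself, here is a short sketch. It relies on the standard Bayesian reading of the guarantee: the attacker's odds can grow by a factor of at most e^epsilon. The function name `max_posterior` is made up for this example, and it expects a prior strictly between 0 and 1.

```python
import math

def max_posterior(prior, epsilon):
    """Upper bound on the attacker's certainty after seeing the output."""
    prior_odds = prior / (1 - prior)
    # Differential privacy caps the growth of the odds at exp(epsilon).
    posterior_odds = math.exp(epsilon) * prior_odds
    return posterior_odds / (1 + posterior_odds)

print(round(max_posterior(0.5, 1.1), 2))  # 0.75
```

The same function shows how the guarantee degrades as \(\varepsilon\) grows: try it with larger values to see the bound creep towards 100%.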
You can compose multiple mechanisms
Suppose you have some data. You want to share it with Alex and with Brinn, in some anonymized fashion. You trust Alex and Brinn equally, so you use the same definition of privacy for both of them. They are not interested in the same aspects of the data, so you give them two different versions of your data. Both versions are "anonymous", for the definition you've chosen.
What happens if Alex and Brinn decide to conspire, and compare the data you gave them? Will the union of the two anonymized versions still be anonymous? It turns out that for most definitions of privacy, this is not the case. If you put two \(k\)-anonymous versions of the same data together, the result is not necessarily \(k\)-anonymous. So if Alex and Brinn conspire, they might be able to reidentify users on their own… or even reconstruct all the original data! That's definitely not good news.
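Here is a toy illustration of how that can go wrong. Everything in it is invented for the example: the four records, the column names, and the choice of which column each version hides. Each version on its own is 2-anonymous, yet an attacker who knows the target's ZIP code and age can combine them and narrow the target's diagnosis down to a single value.

```python
from collections import Counter

# Hypothetical toy records: ZIP code and age are the quasi-identifiers.
RECORDS = [
    {"zip": "1000", "age": 30, "disease": "flu"},
    {"zip": "1000", "age": 40, "disease": "cancer"},
    {"zip": "2000", "age": 30, "disease": "asthma"},
    {"zip": "2000", "age": 40, "disease": "diabetes"},
]

def is_k_anonymous(rows, quasi_ids, k):
    """True if every combination of quasi-identifier values appears at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(size >= k for size in groups.values())

# Alex's version hides the age, Brinn's version hides the ZIP code.
for_alex = [{**row, "age": "*"} for row in RECORDS]
for_brinn = [{**row, "zip": "*"} for row in RECORDS]
print(is_k_anonymous(for_alex, ["zip", "age"], 2))   # True
print(is_k_anonymous(for_brinn, ["zip", "age"], 2))  # True

def possible_diseases(release, column, value):
    """Diagnoses compatible with what the attacker knows about one column."""
    return {row["disease"] for row in release if row[column] == value}

# The attacker knows the target lives in ZIP 1000 and is 30 years old.
from_alex = possible_diseases(for_alex, "zip", "1000")  # two candidates
from_brinn = possible_diseases(for_brinn, "age", 30)    # two candidates
leaked = from_alex & from_brinn                         # only one left
print(leaked)  # {'flu'}
```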
If you use differential privacy, you get to avoid this type of scenario. Suppose that you gave differentially private data to Alex and Brinn, each time with a parameter of \(\varepsilon\). Then if they conspire, the resulting data is still protected by differential privacy, except that the privacy is now weaker: the parameter becomes \(2\varepsilon\). So they gain something, but you can still quantify how much information they got. Privacy experts call this property composition.
This scenario sounds a bit far-fetched, but composition is super useful in practice. Organizations often want to do many things with data. Publish statistics, release an anonymized version, train machine learning algorithms… Composition is a way to stay in control of the level of risk as new use cases appear and processes evolve.
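One way to picture "staying in control" is a running tally of the \(\varepsilon\) you have already spent. The sketch below assumes the simple additive rule described above, where the parameters add up. The class name `PrivacyBudget` and the numbers are hypothetical. Real systems often use more refined accounting, so take this as an illustration only.

```python
class PrivacyBudget:
    """Tracks the total epsilon spent, assuming the parameters simply add up."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        """Record one more release, or refuse it if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon:
            raise RuntimeError("privacy budget exceeded")
        self.spent += epsilon
        return self.total_epsilon - self.spent  # what is left

budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.5)  # the version shared with Alex
budget.spend(0.5)  # the version shared with Brinn
print(budget.spent)  # 1.0
```

Any further release would now raise an error, which is the signal to stop and reconsider before sharing more.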
I hope the basic intuition behind differential privacy is now clear. Want a one-line summary? Uncertainty in the process means uncertainty for the attacker, which means better privacy.
I also hope that you're now wondering how it actually works! What hides behind this magic that makes everything private and safe? Why does differential privacy have all the awesome properties I've mentioned? What a coincidence! That'll be the topic of a future post, which will try to give more details while still staying clear of heavy math.