How to publish data about people while protecting their privacy? This question is far from new. Statistical agencies have grappled with it for decades. Computer scientists have proposed a whole bunch of creative notions to capture this idea. None of them was very satisfactory, though: all these notions were shown to be broken in some circumstances. They were also hard to apply without destroying the utility of the data.
This all changed in 2006, when four researchers introduced differential privacy. This new notion took a novel approach to defining privacy leakage, one that would prove much more rigorous and fruitful. So, what makes differential privacy special? How did it get so successful in adademic circles? Why did governments and tech companies start adopting it for their data publications?
This first article introducing differential privacy will attempt to answer that question. First, we'll describe the high-level intuition behind this successful notion. Then, we'll explain why it's so successful: why is it so much more awesome than all the definitions that came before?
The core idea behind differential privacy
Suppose you have a process that takes some database as input, and returns some output.
This process can be anything. For example, it can be:
- a process calculating some statistics ("tell me how many users have red hair")
- a de-identification strategy ("remove names and last three digits of ZIP codes")
- a machine learning training process ("build a model to predict which users like cats")
- … you get the idea.
To make a process differentially private, you usually have to modify it a little bit. Typically, you add some randomness, or noise, in some places. What exactly you do, and how much noise you add, depends on which process you're modifying. I'll abstract that part away and simply say that your process is now doing some unspecified ✨ magic ✨.
Now, remove somebody from your database, and run your new process on it. If the new process is differentially private, then the two outputs are basically the same. This must be true no matter who you remove, and what database you had in the first place.
By "basically the same", I don't mean "it looks a bit similar". Instead, remember that the magic you added to the process was randomized. You don't always get the same output if you run the new process several times. So what does "basically the same" means in this context? It means that you can get the exact same output from both databases with similar likelihood.
What does this have to do with privacy? Well, suppose you're a creepy person trying to figure out whether your target is in the original data. By looking at the output, you can't be 100% certain of anything. Sure, it could have come from a database with your target in it. But it could also have come from the exact same database, without your target. Both options have a similar probability, so there's not much you can say.
You might have noticed that this definition doesn't say anything about what the output data looks like. Differential privacy is not a property of the output data. It's very different from, say, \(k\)-anonymity, one of the first data privacy definitions. You can't look at the output data and determine whether it satisfies differential privacy. Instead, differential privacy is a property of the process: you have to know how the data was generated to determine whether it's differentially private.
That's about it for the high-level intuition. It's a little abstract, but not very complicated. So, why all the hype? What makes it so awesome compared to older, more straightforward definitions?
What makes differential privacy special
Privacy experts, especially in academia, are enthusiastic about differential privacy. It was first proposed by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith in 20061. Very soon, almost all researchers working on anonymization started building differentially private algorithms. Tech companies and governments are adopting it fast. So, why all the hype? I can count three main reasons.
You no longer need attack modeling
All definitions that came before needed some assumptions about the attacker. To choose the right notion, you needed to figure out the attacker's capabilities and goals. How much prior knowledge do they have? What auxiliary data are they allowed to use? What kind of information do they want to learn?
Doing in practice was difficult and very error-prone. Answering these questions is very tricky: in particular, you might not know exactly what the attacker wants or is capable of. Worse, there might be unknown unknowns: attack vectors that you didn't anticipate at all. For that reason, you couldn't make very broad statements with these old-school definitions. You had to make some assumptions, which you couldn't be 100% sure of.
By contrast, when you use differential privacy, you get two awesome guarantees.
- You protect any kind of information about an individual. It doesn't matter what the attacker wants to do. Reidentify their target, know if they're in the dataset, deduce some sensitive attribute… All those things are protected. Thus, you don't have to think about the goals of your attacker.
- It works no matter what the attacker knows about your data. They might already know some people in the database. They might even add some fake users to your system. With differential privacy, it doesn't matter. The users that the attacker doesn't know are still protected.
Differential privacy, like older notions, comes with a numeric parameter that you can tweak. There is a big difference, though, in how meaningful that parameter is. Take \(k\)-anonymity, for example. It tells you that each record in the output dataset "looks like" at least \(k-1\) other records. But does the value of \(k\) tell us about the level of protection?
The answer is… not much. There is no clear link between the value of \(k\) and how private the dataset is. So choosing \(k\) is very handwavy, and can't be justified in a formal way. The problem is even worse with other old-school definitions.
Differential privacy is much better. When you use it, you can quantify the greatest possible information gain by the attacker. The corresponding parameter, named \(\varepsilon\), allows you to make formal statements. Suppose \(\varepsilon=1.1\). Then, you can say: "an attacker who thinks their target is in the dataset with probability 50% can increase their level of certainty to at most 75%." Choosing the exact value of \(\varepsilon\) isn't easy, but at least, it can be interpreted in a formal way.
And do you remember the previous point about attack modeling? It means you can change this statement in many ways. You can replace "their target is is the dataset" by anything about one individual. And you can add "no matter what the attacker knows" if you want to be extra-precise. Altogether, that makes differential privacy much stronger than all definitions that came before.
Suppose you have some data. You want to share it with Alex and with Brinn, in some anonymized fashion. You trust Alex and Brinn equally, so you use the same definition of privacy for both of them. They are not interested in the same aspects of the data, so you give them two different versions of your data. Both versions are "anonymous", for the definition you've chosen.
What happens if Alex and Brinn decide to conspire, and compare the data you gave them? Will the union of the two anonymized versions still be anonymous? It turns out that for most definitions of privacy, this is not the case. If you put two \(k\)-anonymous versions of the same data together, the result won't be \(k\)-anonymous. So if Alex and Brinn collaborate, they might be able to reidentify users on their own… or even reconstruct all the original data! That's not good news.
With differential privacy, you can avoid this failure mode. Suppose that you gave differentially private data to Alex and Brinn. Each time, you used a parameter of \(\varepsilon\). Then if they conspire, the resulting data is still protected by differential privacy. The level of privacy is now weaker: the parameter becomes \(2\varepsilon\). So they still gain some information, but you can now quantify how much. This property is called composition.
This scenario sounds a bit far-fetched, but composition is super useful in practice. Organizations often want to do many things with data. Publish statistics, release an anonymized version, train machine learning algorithms… Composition is a way to stay in control of the level of risk as new use cases appear and processes evolve.
I hope the basic intuition behind differential privacy is now clear. If you remember a single thing, let this be this one-line summary: uncertainty in the process means uncertainty for the attacker, which means better privacy.
I also hope that you're now wondering how it actually works! What hides behind this magic that makes everything safe and private? Why does differential privacy have all the awesome properties I've mentioned? This is the exact topic of the next article in this series, which explains this in more detail while still staying clear of heavy math.