A friendly, non-technical introduction to differential privacy
Differential privacy is getting a lot of attention lately. Companies and governments are starting to publish data anonymized with this notion. Universities are offering courses about it. Statisticians are getting acquainted with this new approach to protecting data. Open-source organizations are publishing tooling to make differential privacy easier to use.
So, you might be wondering: what's the hype all about? What even is differential privacy? What makes it so special? How does it work in practice? And, perhaps more importantly, can I understand it without having to read a bunch of complicated equations?
The good news is: you've come to the right place. Welcome to my friendly blog post series about differential privacy! It provides simple explanations for the core concepts behind differential privacy. It is meant for a wide, non-technical audience: it doesn't assume any prior knowledge, uses as little math as possible, and illustrates everything with simple examples and diagrams.
Sounds interesting? Excellent! Start with these two articles.
- Why differential privacy is awesome presents a high-level, non-technical description of the definition. Then, it explains what makes it so great, and why people are so excited about it.
- Differential privacy in (a bit) more detail introduces the formal definition. It uses very little math, and explains every single concept needed to understand it.
Then, this blog post series splits in two branches. You can read one or the other in any order, depending on what you're most interested in.
The first branch is about the how: what techniques can you use to achieve differential privacy? It's a little bit technical, though I still keep it as simple as I can. If that doesn't sound interesting, feel free to skip over and go directly to the next section!
- Differential privacy in practice (easy version) explains how to make simple statistics differentially private: counts, sums, histograms… In many cases, you don't need much more!
- Almost differential privacy describes how to publish private histograms without knowing the categories in advance. It also introduces an important variant: \((\varepsilon,\delta)\)-differential privacy.
- The privacy loss random variable introduces this central concept in differential privacy. Then, it uses it to explain the real meaning of \(\delta\) in \((\varepsilon,\delta)\)-differential privacy.
- The magic of Gaussian noise introduces Gaussian noise and its shiny properties. In particular, it explains why Gaussian noise is so useful when releasing many statistics.
- Getting more useful results with differential privacy presents five simple techniques to improve the utility of your anonymized data. If you're rolling out your first differential privacy pipeline and it's not performing well, this is the article for you!
- Averaging risk: Rényi DP & zero-concentrated DP introduces two variants of differential privacy. By averaging the privacy loss, these notions allow for much better privacy analysis in a number of common use cases.
- Choosing things privately with the exponential mechanism explains a fundamental building block of differential privacy: a mechanism to privately select the best option among many.
The second branch of this series is about the why. In which contexts can differential privacy be used? Why do organizations decide to adopt it? What policy questions does it raise? The articles in this branch are accessible to non-technical folks, and are all self-contained. You can read them in any order you like!
- Local vs. central differential privacy presents the two main models of differential privacy, depending on who the attacker is. Publishing provably anonymized data is very different from collecting it, and this post explains the difference.
- Why not differential privacy? explores what it means for an algorithm to not be differentially private. Don't understand why people insist on using differential privacy, even when it seems overkill? This is the resource for you.
- Demystifying the US Census Bureau's reconstruction attack explains why the government agency decided to use differential privacy. They ran an attack on the data they published in 2010, and the results are quite scary.
- Don't worry, your data's noisy makes the case that all data is noisy, even without differential privacy. In fact, the noise used in differential privacy is the best kind of noise: the one you can quantify.
- Is differential privacy the right fit for your problem? gives a simple litmus test to quickly get an idea of whether a given use case is well-suited to using differential privacy.
- What anonymization techniques can you trust? lists some common legacy techniques used to anonymize data, explains how they fail, and draws some lessons from these historical failures.
- Mapping privacy-enhancing technologies to your use cases situates differential privacy in the wider context of privacy-enhancing technologies, explaining which kind of data privacy challenge each one addresses.
Finally, one article lists the known real-world deployments of differential privacy, along with their privacy parameters.
This series isn't finished. I have a list of future articles I'd like to write… and I'm adding new ideas to this list faster than I'm writing blog posts! If you're looking for further things to read on differential privacy, you can do two things.
- You can check out this reading list I curated. I particularly recommend it if you're looking for more formal content: textbooks with mathematical proofs, scientific papers, etc.
- You can follow me on Mastodon or subscribe to this blog's RSS feed to stay updated on future posts.