Special edition in a series on differential privacy. Other articles are below.
- Why differential privacy is awesome presents a non-technical explanation of the definition.
- Differential privacy in (a bit) more detail introduces the formal definition, with very little math.
- Differential privacy in practice (easy version) explains how to make simple statistics differentially private.
- Almost differential privacy describes how to publish private histograms without knowing the categories in advance.
- Local vs. global differential privacy presents the two main models of differential privacy, depending on who the attacker is.
- The privacy loss random variable explains the real meaning of \((\varepsilon,\delta)\)-differential privacy.
- The magic of Gaussian noise introduces Gaussian noise and its shiny properties.
- Why not differential privacy? explores what it means for an algorithm to not be differentially private.
- Getting more useful results with differential privacy presents five simple techniques to improve the utility of your anonymized data.
Someone recently asked me for reading suggestions to learn more about differential privacy (DP). I thought that the answer was worth posting somewhere. So, here it is: a list of material that I often recommend to people who want to dive into the field.
Before I start, two remarks. First, most of these are going to be research papers. There's a lack of accessible material on differential privacy. I hope it gets better over time; I'll keep this list updated as I discover more good stuff. Second, this list is going to be very subjective. I'm more interested in some problems than others. I like beautiful theory and very practical applications. I'm less enthusiastic about the research in-between. This list might not be the right one for you. If you're a student starting a research project, ask your advisor what you should read! They'll be a better source of reading material than a random guy on the Internet.
Introductory and reference material
There are less links that I'd like in the category « suitable for a non-technical audience ». That's exactly why I'm writing these blog posts… Still, some people have created great material.
I recently stumbled upon a fantastic 12-minute video called Protecting Privacy with MATH. It's done by MinutePhysics, and it's friendly, understandable, yet very accurate. I love that it explains the basics of reconstruction attacks! It provides a great motivation for differential privacy.
For more in-depth material, I recommend the Primer for a Non-Technical Audience. It's written by folks at the Harvard Privacy Tools Project, and is perfect if you don't want heavy math. I prefer it to the typical recommendation for a reference book on DP, surnamed the « Privacy Book ». Its real name is The Algorithmic Foundations of Differential Privacy. One of the authors is Cynthia Dwork, one of the creators of DP, so it's a good way of discovering the original motivations… But I would mostly recommend Sections 1 to 3, not the whole thing.
Ah, on the topic of things written by the creators of DP! Frank McSherry's blog makes for a fun read (look for the articles tagged « Privacy »).
Local and shuffled model
Excited by my blog post about the different models of DP? When it comes to the privacy model, the RAPPOR paper is a classic. It explains how Google Chrome collected useful statistics without learning information about individuals. The follow-up paper, Prochlo, introduced the shuffled model in a very practical context. To understand its theoretical guarantees, I recommend the later Privacy Blanket paper, it provides a very nice explanation of the intuition behind it1.
Rolling out differential privacy
Moving on to applied views of central DP for statistics. Google colleagues and I recently published recently a paper about a differentially private SQL engine. Even though this isn't the first work in that direction2, I'm pretty proud of it! Section 2, in particular, is a nicer explanation of the technique I describe in a previous blog post. And Section 5 is full of practical considerations.
Want to roll out DP in practice? The Issues Encountered Deploying Differential Privacy paper, by the US Census Bureau, is a must-read. I strongly relate to the problems it raises. For even more fun and absurd implementation things, learn the floating-point attack on the Laplace mechanism, it's a classic.
Theory of differential privacy
There are natural theory questions raised by DP. For example « how many queries can we answer privately and accurately on the same database? » The Complexity of Differential Privacy is a survey of known results on accuracy/privacy trade-offs. This being theory, almost everything in there is asymptotic, of course.
The theory around the privacy loss distribution is beautiful. It gives a global view of the privacy leakage of a mechanism. It's less simple, but it makes much more sense than only considering a pair of (ε,δ) parameters. To learn about it, I recommend the Privacy Loss Classes paper. It has practical consequences, too, on estimating privacy budgets over many compositions. It enables neat ways of estimating the evolution of a privacy budget over many compositions. This is explored in the Privacy Buckets paper, which is also worth a read.
On interesting reformulations of DP, the hypothesis testing formulation is also super cool. The Gaussian Differential Privacy paper does a great job at explaining it. It also gives you nice results on amplification by sampling, a shiny and useful tool in practice3.
Finally, the guarantee given by DP is subject to some debate. Everyone understands the math, but how do we translate it in practice? There's a long-standing debate about data correlations and adversary strength. On that topic, the Differential Privacy as a Causal Property paper is fantastic. It clarifies this debate, links to all relevant references, and proposes a neat way to solve this issue conceptually.
Differentially private machine learning
I don't know much about machine learning, but it felt weird to not mention it at all in an article like this. So I asked my colleague Peter Kairouz to suggest some material! The following suggestions are from him. I was glad to see that there aren't only scientific papers in there!
First, start with this pretty great 2-hour video tutorial (with available slides and historical references). If you then want to actually test this in practice, try out TensorFlow Privacy! For a nice introduction to it, check out this blog post, by one of the TensorFlow Privacy authors. It explains how to run DP-SGD with it, step by step and with a lot of additional references.
The paper that introduced DP-SGD is a classic: Deep Learning with Differential Privacy. Another method for the same problem, DSSGD, appeared earlier in Privacy-Preserving Deep Learning(PDF). Yet another alternative approach, more generic and surprisingly understandable, is called PATE. It's described in a blog post, which comes with references to the original papers.
Most of these methods rely on tight accounting on the privacy budget. A common tool for that is Rényi Differential Privacy, a DP variant that considers the averaged privacy loss. Another is Amplification by Iteration, which studies mechanisms that iterate over the data multiple times, without releasing intermediary results. Both techniques enable better composition results.
Deep learning isn't the only interesting ML application of differential privacy. Methods also exist for Ordinary Least Squares, or for Empirical Risk Minimization. This last method admits Efficient Algorithms and Tight Error Bounds. In general, Convex Optimization problems are a nice fit for DP methods.
If you feel this reading list is missing something, please let me know! I'd like this post to be a living resource. So I'm enthusiastic about adding new material, especially if they're more approchable than scientific papers. My contact info is in the footer of this page.
Thanks to Peter Kairouz, Antoine Amarilli and Úlfar Erlingsson for their helpful comments and suggestions.