A reading list on differential privacy
This post is part of a series on differential privacy. Check out the table of contents to see the other articles!
Someone recently asked me for reading suggestions to learn more about differential privacy (DP). I thought that the answer was worth posting somewhere. So, here it is: a list of material that I often recommend to people who want to dive into the field.
Before I start, two remarks. First, most of these are going to be research papers. There's a lack of accessible material on differential privacy. I hope it gets better over time; I'll keep this list updated as I discover more good stuff. Second, this list is going to be very subjective. I'm more interested in some problems than others. I like beautiful theory and very practical applications. I'm less enthusiastic about the research in-between. This list might not be the right one for you. If you're a student starting a research project, ask your advisor what you should read! They'll be a better source of reading material than a random guy on the Internet.
Introductory and reference material
There are less links that I'd like in the category « suitable for a non-technical audience ». That's exactly why I'm writing these blog posts… Still, some people have created great material.
I recently stumbled upon a fantastic 12-minute video called Protecting Privacy with MATH. It's done by MinutePhysics, and it's friendly, understandable, yet very accurate. I love that it explains the basics of reconstruction attacks! It provides a great motivation for differential privacy.
For more in-depth material, I recommend the Primer for a Non-Technical Audience. It's written by folks at the Harvard Privacy Tools Project, and is perfect if you don't want heavy math. I prefer it to the typical recommendation for a reference book on DP, surnamed the « Privacy Book ». Its real name is The Algorithmic Foundations of Differential Privacy. One of the authors is Cynthia Dwork, one of the creators of DP, so it's a good way of discovering the original motivations… But I would mostly recommend Sections 1 to 3, not the whole thing.
Ah, on the topic of things written by the creators of DP! Frank McSherry's blog makes for a fun read (look for the articles tagged « Privacy »).
Local and shuffled model
Excited by my [blog post][local] about the different models of DP? When it comes to the privacy model, the RAPPOR paper is a classic. It explains how Google Chrome collected useful statistics without learning information about individuals. The follow-up paper, Prochlo, introduced the shuffled model in a very practical context. To understand its theoretical guarantees, I recommend the later Privacy Blanket paper, it provides a very nice explanation of the intuition behind it1.
Rolling out differential privacy
Moving on to applied views of central DP for statistics. Google colleagues and I recently published recently a paper about a differentially private SQL engine. This isn't the first work in that direction2, but I'm still pretty proud of it: we worked hard to make it easy to understand and mostly self-contained! You can see an updated version in Section 4.2 of my PhD thesis. Further in the thesis, you can also find a discussion of operational aspects of DP, a topic that doesn't get as much attention as it probably should.
Want more about rolling out DP in practice? The Issues Encountered Deploying Differential Privacy paper, by the US Census Bureau, is a must-read. I strongly relate to the problems it raises. For even more fun and absurd implementation things, go read about the floating-point attack on the Laplace mechanism, it's a classic.
Theory of differential privacy
There are natural theory questions raised by DP. For example « how many queries can we answer privately and accurately on the same database? » The Complexity of Differential Privacy is a survey of fundamental results on accuracy/privacy trade-offs. This being theory, almost everything in there is asymptotic, of course.
The theory around the privacy loss distribution is beautiful. It gives a global view of the privacy leakage of a mechanism. It's less simple, but it makes much more sense than only considering a pair of (ε,δ) parameters. To learn about it, I recommend the Privacy Loss Classes paper. It has practical consequences, too, on estimating privacy budgets over many compositions. It enables neat ways of estimating the evolution of a privacy budget over many compositions. This is explored in the Privacy Buckets paper, which is also worth a read.
Speaking of interesting reformulations of DP, the hypothesis testing formulation is also super cool. The Gaussian Differential Privacy paper does a great job at explaining it. It also gives you nice results on amplification by sampling, a shiny and useful tool in practice3.
Finally, there are interesting debates around how to interpret the guarantee provided by DP. Everyone understands the math, but how do we translate it in practice? There's a long-standing debate about data correlations and adversary strength. On that topic, the Differential Privacy as a Causal Property paper is fantastic. It clarifies this debate, links to all relevant references, and proposes a neat way to solve this issue conceptually.
Differentially private machine learning
I don't know much about machine learning, but it felt weird to not mention it at all in an article like this. So I asked my colleague Peter Kairouz to suggest some material! The following suggestions are from him. I was glad to see that there aren't only scientific papers in there!
First, start with this pretty great 2-hour video tutorial (with available slides and historical references). If you then want to actually test this in practice, try out TensorFlow Privacy! For a nice introduction to it, check out this blog post, by one of the TensorFlow Privacy authors. It explains how to run DP-SGD with it, step by step and with a lot of additional references.
The paper that introduced DP-SGD is a classic: Deep Learning with Differential Privacy. Another method for the same problem, DSSGD, appeared earlier in Privacy-Preserving Deep Learning(PDF). Yet another alternative approach, more generic and surprisingly understandable, is called PATE. It's described in a blog post, which comes with references to the original papers.
Most of these methods rely on tight accounting on the privacy budget. A common tool for that is Rényi Differential Privacy, a DP variant that considers the averaged privacy loss. Another is Amplification by Iteration, which studies mechanisms that iterate over the data multiple times, without releasing intermediary results. Both techniques enable better composition results.
Deep learning isn't the only interesting ML application of differential privacy. Methods also exist for Ordinary Least Squares, or for Empirical Risk Minimization. This last method admits Efficient Algorithms and Tight Error Bounds. In general, Convex Optimization problems are a nice fit for DP methods.
If you feel this reading list is missing something, please let me know! I'd like this post to be a living resource. So I'm enthusiastic about adding new material, especially if they're more approchable than scientific papers. My contact info is in the footer of this page.
Thanks to Antoine Amarilli, Úlfar Erlingsson, Peter Kairouz, and Rob Yoder for their helpful comments and suggestions.
Amplification by shuffling was initially formalized in the eponymous paper. ↩
The first system to do generic differentially private computations was PINQ, and a major follow-up was Flex. Both papers are worth a read too! ↩
For a slightly more exhaustive and historical view of results in that area, Section 4 of the Amplification by Subsampling paper (and the corresponding references) is a solid resource. ↩