Averaging risk: Rényi DP & zero-concentrated DP
This post is part of a series on differential privacy. Check out the table of contents to see the other articles!
Remember the privacy loss random variable (PLRV)? We saw that it described all values that the privacy leakage could take, and how likely each one was. And we saw that \(\varepsilon\)-DP was a worst-case property: the PLRV must always be lower than \(\varepsilon\). By contrast, we described \((\varepsilon,\delta)\)-DP as an "almost worst-case" notion: it left a little wiggle room for rare events with a privacy loss larger than \(\varepsilon\).
In this article, we'll use the PLRV in yet another interesting way. Instead of only looking at its extreme values, we'll look at the average of the PLRV. This will give us an intuitive explanation of two useful variants of DP: Rényi differential privacy, and zero-concentrated differential privacy.
Basic idea
Recall the core intuition behind the privacy loss random variable (PLRV). Say our secret mechanism \(A\) returns an output \(O\). The attacker is trying to find out whether the input was \(D_1\) or \(D_2\), where both options differ in a single person's data. The PLRV \(\mathcal{L}_{D_1,D_2}(O)\) was kind of the "actual \(\varepsilon\) value" for this attacker: \(e^{\mathcal{L}_{D_1,D_2}(O)}\) was the advantage they gain when observing output \(O\). This gave us a reformulation of \(\varepsilon\)-DP, as an upper bound on this value. If \(A\) is \(\varepsilon\)-DP, then for all possible choices for \(D_1\) and \(D_2\), and all possible outputs \(O\):
How do we transform this worst-case property into an average case definition? Two options come to mind.
- We could average the privacy loss across possible databases \(D_1\) and \(D_2\)…
- … or we could average it across possible outputs \(O\).
The first option turns out to be a Bad Idea™, for subtle reasons that I'm not going to go into here1. The second option, however, makes sense. This is the same "kind" of relaxation as \((\varepsilon,\delta)\)-DP: there's a small probability that something goes worse than we'd like. Importantly, that this probability doesn't depend on the attacker. It only comes from the algorithm's randomness, and doesn't require additional assumptions.
There is a significant difference, however. In \((\varepsilon,\delta)\)-DP, there can be a small probability (at most \(\delta\)) of infinite privacy loss. When we're averaging the privacy loss, that's no longer allowed. If the privacy loss is very low 99.99% of the time, but is infinite with probability 00.01%… then its average is still going to be infinite.
So bounding the average PLRV will be a way of relaxing DP, but without allowing infinitely bad events. Arbitrarily bad events can still happen, but only with vanishing probability. Let's formalize this.
Rényi differential privacy
Here's a first attempt at capturing this intuition of averaging risk. For every pair of databases \(D_1\) and \(D_2\) differing in a single record, we require that:
Here, \(\mathbb{E}\) is the expected value: you weigh each possible event by its probability. A very bad event can be acceptable if it happens almost never. This seems to capture our initial intuition.
One problem with this approach, though, is that we're not really averaging the right thing. The advantage that a Bayesian attacker can gain is \(e^\varepsilon\), not \(\varepsilon\)! So by averaging the privacy loss, we're not really averaging the risk. Let's show this with a little visualization. The following plot shows the attacker gain of a made-up mechanism: how much will the betting odds change, depending on the random output?
With this distribution, the expected value of \(\mathcal{L}\) (the "average \(\varepsilon\)") is around \(3.4\). This translates to an attacker gain of about 30.
That doesn't look right. The actual risk is often a lot larger the "average"! If we want to average out the risk, we should take the mean of \(e^\mathcal{L}\) instead. This would be closer to the intuition. The requirement would become…
This makes more sense: this can be seen as the arithmetic mean of the risk. If we plot it, the corresponding \(e^\varepsilon\) is the average of the blue line.
This still feels a bit arbitrary. Why not take a different averaging function? Large values of the privacy loss correspond to the worst events. These are particularly scary: maybe we want to give them more "weight"? We could do this using a quadratic mean, for example. We would then require something like this:
This gives us a larger average than before.
Let's generalize this. To decide which kind of averaging function to use, we'll introduce a parameter \(\alpha\).
This is Rényi differential privacy. If \(A\) satisfies the above inequality for all choices of \(D_1\) and \(D_2\), we say it's \((\alpha,\varepsilon)\)-Rényi differentially private.
Some special values of \(\alpha\) correspond to common averaging functions.
- \(\alpha\rightarrow1\) bounds the arithmetic mean of \(\mathcal{L}\) or, equivalently, the geometric mean of \(e^\mathcal{L}\);
- \(\alpha=2\) bounds the arithmetic mean of \(e^\mathcal{L}\);
- \(\alpha=3\) bounds the quadratic mean of \(e^\mathcal{L}\);
- \(\alpha=4\) bounds the cubic mean of \(e^\mathcal{L}\);
- and it's also possible to pick \(\alpha=\infty\), which bounds the maximum value of \(e^\mathcal{L}\): it's then equivalent to \(\varepsilon\)-DP.
Let's visualize these options using our previous example.
Rényi DP, invented by Ilya Mironov, has a bunch of neat properties. In particular, it composes nicely, just like DP. If a mechanism \(A\) is \((\alpha,\varepsilon_1)\)-Rényi DP and a mechanism \(A'\) is \((\alpha,\varepsilon_2)\)-DP, then releasing the output of both will be \((\alpha,\varepsilon_1+\varepsilon_2)\)-DP.
Zero-concentrated differential privacy
So Rényi DP is pretty neat, but it involves an additional parameter \(\alpha\). That's a bit annoying. Choosing \(\varepsilon\) was already difficult. Having to make a new decision about "how to average the risk" seems even harder. Yet, this idea of averaging the privacy loss is pretty natural. Ideally, we would like to keep this intuition, but have a single parameter.
What if we covered all possible values of \(\alpha\) at once? Larger \(\alpha\) values put more weight on bad events: the "average" also gets larger as \(\alpha\) grows. So what if we put a bound on the average… but have this bound grow with \(\alpha\)? This seems like a good idea. But now, the question becomes: how fast should it grow? There are a lot of increasing functions. But a logarithm doesn't exactly behave like an exponential!
Since we have a choice, we can think of what other things we'd like from a single-parameter definition. We saw that Gaussian noise was a neat tool to design DP mechanisms: it would be nice to describe its privacy guarantee in a simple way with our new definition. Composition is also important, and if possible, a simple composition result would be better.
To sum up, we're looking for a formulation that:
- has a single parameter,
- corresponds to a larger \(\varepsilon\) for growing values of \(\alpha\),
- describes the guarantee of Gaussian noise in a simple & precise way,
- and has a simple composition guarantee.
That's exactly what zero-concentrated differential privacy (zCDP) provides. Introduced by Mark Bun & Thomas Steinke, it can be interpreted in simple terms: given a single parameter \(\rho\), the \(\varepsilon\) corresponding to each \(\alpha\) must be at most \(\rho\alpha\). In the formalism above, the mechanism is \(\rho\)-zCDP if:
It's easy to verify that it matches all the requirements above.
- The single parameter \(\rho\) corresponds to the arithmetic average of the privacy loss. (Or, equivalently, to the geometric average of the \(e^\mathcal{L}\).)
- It guarantees that the relationship between \(\alpha\) and \(\varepsilon\) is at most linear, which is very simple.
- It describes the Gaussian mechanism beautifully. Suppose that the statistics you're computing have a \(L^2\) sensitivity of \(\Delta\). Then, adding adding Gaussian noise of variance \(\sigma^2\) to the result. Then the result satisfies \(\rho\)-zCDP, with \(\rho=\frac{\Delta^2}{2\sigma^2}\). So much nicer than the formula giving the \((\varepsilon,\delta)\)-DP guarantee!
- And composition is a breeze. If a mechanism is \(\rho_1\)-zCDP and another is \(\rho_2\)-zCDP, then publishing the result of both is \(\left(\rho_1+\rho_2\right)\)-zCDP.
The last two points are super useful to analyze multiple Gaussian mechanisms at once: we can look at them separately, and add their corresponding \(\rho\) values. This works even if they use very different noise magnitudes. And the resulting guarantee is much more precise than if we'd done the accounting with \((\varepsilon,\delta)\)-DP.
These nice properties are why zCDP has been used in practice for some high-profile use cases, like the 2020 U.S. Census Redistricting Data. If you want to release a lot of statistics, you might benefit from using this notion in your privacy analysis as well.
tl;dr
Can we describe all the definitions we've seen so far in a tweet-length summary? Here's an attempt.
- \(\varepsilon\)-DP: the absolute worst case is \(\varepsilon\).
- \(\left(\varepsilon,\delta\right)\)-DP: the worst case is \(\varepsilon\), almost always.
- \(\left(\alpha,\varepsilon\right)\)-Rényi DP: the average case is \(\varepsilon\), and \(\alpha\) tells you which average function to use.
- \(\rho\)-zCDP: many \(\left(\alpha,\varepsilon\right)\)-Rényi DP guarantees at once, well-chosen for convenience.
Simple, right?
Note: I optimized this article for simplicity. I tried to find the simplest possible intuition for these notions, and made up a neat storyline to introduce one after the other. This came at a cost in historical accuracy. If your main goal was getting an intuitive understanding of these definitions, then you can stop here. If you're also interested in learning about the history of these notions, keep reading.
Contrary to the story above, zero-concentrated DP was introduced before Rényi DP. This work itself built on a prior definition, concentrated DP, invented by Dwork and Rothblum. This prior notion says that if you take the PLRV and subtract its mean, you get a distribution that's "smaller than a Gaussian".
Concentrated DP was a fruitful notion, used to prove tighter composition theorems for \((\varepsilon,\delta)\)-DP. It also described the privacy properties of the Gaussian mechanism in a neater way. But has also some shortcomings: it was not closed under post-processing, and its formulation was fairly complex. This is what zero-concentrated DP was introduced to fix: it formalized the "PLRV is smaller than a Gaussian" intuition in a simpler way, keeping the advantages without the problems.
The original goal of both notions was to get better composition results, not to average risk. Rényi DP, introduced afterwards, followed this line of research. Fixing the parameter \(\alpha\) was a way of getting more flexibility in the privacy analysis, in particular for machine learning use cases.
Thanks to Anthony Caruso for letting me know about a mistake in a previous version of this post.
-
This blog post gives a few examples of what this can look like, and the dangers of doing so. ↩