Choosing things privately with the exponential mechanism
This post is part of a series on differential privacy. Check out the table of contents to see the other articles!
The goal of most differential privacy mechanisms is to publish statistics: numbers giving some information about groups of people. But to build more complex DP mechanisms, we sometimes need a different kind of building block. In this article, we won't be adding noise to numbers. Instead, we'll make a choice among multiple options, in a differentially private way.
A simple example
Let's say that we're designing a poll to pick the best science-fiction books published in 2020. First, we'll collect a big list of books published that year. Then, we ask people to select the books they liked. Each person can pick as many books as they want, and we want to select the book with most votes. If there is a tie, we select the winner randomly.
How do we publish this in a privacy-preserving way? Let's take a look at the books with the most votes in our voting results.
We can't simply publish the true answer without randomness. Otherwise, we would be publishing the most popular book (here, Network Effect) 100% of the time. But imagine that we add a single person, who only votes for one of the second-most popular books.
In that case, we would want to release either one or the other with equal probability. This is a violation of differential privacy: 50% of the time, we publish The Hidden Girl, and this couldn't happen without this added user!
To avoid this, we need to add more randomness in our process. How do we do that exactly?
Well, we already know how to publish histograms with differential privacy: we add well-calibrated Laplace noise to each of the statistics. So why don't we just do that? If we make the entire histogram private, we can release all of it. In particular, we can see which book has the highest noisy vote count, and declare it the winner.
But we have a problem here. The noise has to be scaled by the number of statistics that each user can contribute to. In our setting, a particularly enthusiastic user could vote for all the books. If our list has 10,000 books to choose from, then we have to multiply the noise scale by 10,000. This seems… not great. And it also feels unnecessary: we don't want to release the entire histogram, we only want to pick a winner. Could we use that fact to inject less noise into the process?
Let's try to think about what a good strategy would look like. Say we are using \(\varepsilon=\ln(2)\). Let's look at the votes for the most popular books again.
The true winner is Network Effect. So we want to select this true answer with some probability, hopefully as high as possible.
Now, what is the probability of selecting one of the second-best choices? They both have one fewer vote than the winner. They're not the correct answer, so we want to select them with as small a probability as possible. But we're also constrained by our differential privacy guarantee.
Imagine that we add a single new person to the data, and they vote for Hench. Then, we should be selecting Network Effect and Hench with equal probability.
With DP, we must select Hench with similar probability as in this hypothetical scenario. How similar? We chose \(e^\varepsilon=2\), so there can be a factor of at most 2 between these probabilities. We want it to be as small as possible, so let's make it exactly half.
The exact same reasoning holds for The Hidden Girl, which has as many votes as Hench.
What about one of the books that is a little further away from the winner? The Relentless Moon, for example, is 3 votes short: we would need three more votes to get to the winning probability. If we can add one person at a time, we need three steps to get there.
And each time we add one person, we have to respect the differential privacy constraint: we can at most double the probability of selecting The Relentless Moon. To arrive there after these three steps, we need to start from at least \(\left(\frac{1}{2}\right)^3=1/8\) of the maximum probability.
We can repeat this idea and draw our full probability distribution. For each book, if it's \(k\) votes short of the true winner, its selection probability should be \(1/2^k\) of the true winner.
Achieving this for an arbitrary \(\varepsilon\) is straightforward: the probability of selecting a book \(i\) with \(k_i\) votes should be proportional to \(\exp\left(\varepsilon \cdot k_i\right)\). After normalizing to make the probabilities sum to \(1\), we get the following formula:
We call this DP procedure the exponential mechanism.
A generic statement and a simple optimization
The example above is very simple: each book is simply associated to the number of votes it received. But the exponential mechanism is much more generic, and we can use it in more complex settings. Let's say we have a database \(D\), and we have to choose between many items \(O_1\), \(O_2\), and so on. We assume that each item \(O_i\) has a score \(s_i(D)\), which depends on the dataset. Let \(\Delta\) be the sensitivity of the scoring function: the maximum change to \(s_i(D)\) when one person is added to (or removed from) \(D\), for all \(i\). Then the exponential mechanism \(\mathcal{M}\) is defined as:
Proving that it satisfies ε-DP is very easy — the proof in the original paper is just 3 lines long! Try to come up with it by yourself. Or you can also click here:
In our example above, the \(O_i\) are the books, and the score of each book is its number of votes. Adding or removing one person modifies the scores by at most one, so \(\Delta=1\), and we get the same…
Wait a second. We're not getting the same thing! There's a factor of \(2\) in the generic formula that we didn't have in our voting example. So if we were using the generic formula, we would get worse utility: the probability of selecting the winner would be smaller. Can we get rid of this multiplicative factor?
The answer is yes, because our scores are monotonic: if we add a user, they will all get larger. If we remove one, they will all get smaller. That's a common special case, and in that case, you can remove the \(2\) factor:
Again, this is quite easy to prove, especially if you've understood the previous proof.
Note that here, we assumed that we're protecting the addition or removal of a single person in the dataset. If we want to protect any change in a single person's votes instead, the privacy analysis changes: someone could add one vote to a book and remove a vote to another. The scores would no longer be monotonic, and we would need to pay the \(2\) factor in our formula.
More results?!
The exponential mechanism is a central building block in differential privacy. It's been studied from many different angles, so there is a lot to say about it. This blog post is long enough already, but here are a few ✨ selected facts ✨. Follow the links if you'd like to learn more!
- The exponential mechanism can be implemented in a simple way: add noise from a Gumbel distribution to each score, and choose the item with the highest noisy score.
- Its privacy guarantees can be finely analyzed using a notion called bounded range. This allows you to prove that an exponential mechanism calibrated for \(\varepsilon\)-DP also satisfies \(\rho\)-zCDP with \(\rho=\frac{1}{8}\varepsilon^2\): a lot better than the typical conversion of \(\rho=\frac{1}{2}\varepsilon^2\).
- Using the exponential mechanism several times? Don't use regular composition theorems! Instead, using the special structure of this mechanism can get you tighter results.
- You can do better than the exponential mechanism, with a mechanism called Permute-and-Flip. Its original definition is somewhat complicated, but people found a nice characterization afterwards: add noise from a geometric distribution to each score, and pick the highest noisy score.
- However, the exponential mechanism still retains one advantage: it can also be used when the space of possible choices is continuous, like "every real number between 0 and 1"1. This is very useful, for example to compute the median of values in a dataset.
I'm thankful to Daniel Simmons-Marengo, Liudas Panavas, and PeoriaBummer for helpful feedback on this post.
-
In that case, the density of the probability distribution on \(x\) must be proportional to \(\exp\left(\varepsilon\cdot\frac{s_x(D)}{2\Delta}\right)\). The normalization factor (in the denominator) also becomes an integral instead of a discrete sum. ↩