The magic of Gaussian noise
This post is part of a series on differential privacy. Check out the table of contents to see the other articles!
Previously, we used Gaussian noise to explain the real meaning of \(\delta\) in \((\varepsilon,\delta)\)-differential privacy. One question was left unanswered: why would anyone use Gaussian noise in the first place? The guarantees it provides aren't as strong: it gives \((\varepsilon,\delta)\)-DP with \(\delta>0\), while Laplace noise provides pure \(\varepsilon\)-DP. This blog post gives an answer to this question, and describes the situations in which Gaussian noise excels.
Gaussian noise is nice
A first advantage of Gaussian noise is that the distribution itself behaves nicely. It's called the normal distribution for a reason: it has convenient properties, and is very widely used in natural and social sciences. People often use it to model random variables whose actual distribution is unknown. If you sum many independent random variables, you end up with something close to a normal distribution: that's the central limit theorem. And these are just a few of the many properties of this fundamental distribution. Thus, most data analysts and scientists are already familiar with Gaussian noise. It's convenient when you release anonymized statistics: analysts don't need to learn too many new concepts to understand what you're doing to protect the data.
A second advantage is that the Gaussian distribution has nice, thin tails. The vast majority of its probability mass is focused around its mean. Take a normal distribution with mean 0 and standard deviation \(\sigma\). The 68–95–99.7 rule says that a random variable sampled from this distribution will be:
- in \([-\sigma,\sigma]\) with 68% probability;
- in \([-2\sigma,2\sigma]\) with 95% probability;
- and in \([-3\sigma,3\sigma]\) with 99.7% probability.
It gets even better as you go further away from the mean: the probability that the random variable falls outside \([-k\sigma,k\sigma]\) decreases faster than \(e^{-k^2/2}\). In practice, you're rarely surprised by the values that a Gaussian distribution takes. Even if you sample \(1,000,000\) values, they are all probably going to be within \([-5\sigma,5\sigma]\).
Laplace, by comparison, isn't quite as nice. Its tails decrease exponentially fast, but that's still much slower than Gaussian tails. Suppose you sample \(1,000,000\) values from a Laplace distribution of standard deviation \(\sigma\). On average, 849 of them will be outside \([-5\sigma,5\sigma]\).
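To make this concrete, here is a quick back-of-the-envelope check (a minimal Python sketch using the standard closed-form tail probabilities of both distributions):

```python
from math import erfc, exp, sqrt

def gaussian_tail(k):
    # P(|X| > k*sigma) for Gaussian noise with standard deviation sigma
    return erfc(k / sqrt(2))

def laplace_tail(k):
    # P(|X| > k*sigma) for Laplace noise with standard deviation sigma:
    # its scale parameter is sigma/sqrt(2), so the tail is exp(-k*sqrt(2))
    return exp(-k * sqrt(2))

samples = 1_000_000
print(samples * gaussian_tail(5))  # ≈ 0.6: probably no sample lands outside [-5σ, 5σ]
print(samples * laplace_tail(5))   # ≈ 849: hundreds of samples outside [-5σ, 5σ]
```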
OK, so Gaussian noise is nice. But that does not change a simple fact: to get a comparable level of privacy for a single statistic, Laplace is much better. Assume that we're adding noise to a simple count of sensitivity \(1\). This graph compares the Laplace noise needed to get \(\varepsilon=1\) with the Gaussian noise needed to get \(\varepsilon=1\) and \(\delta=10^{-5}\).
Despite its weaker privacy guarantees, the Gaussian distribution is much flatter: its standard deviation is over 3.4, while Laplace's is about 1.4. So Gaussian requires adding much more noise, and analysts care a lot about minimizing noise. Why, then, would Gaussian noise ever be a good option? The answer is simple: because it gets better when you're publishing a lot of statistics.
From one to many statistics
In most of our previous examples, we assumed that each individual appeared in a single statistic. This case is common, for example when partitioning people based on demographic information. But in many applications, this assumption does not hold. Imagine, for example, that you want to answer the question: "what types of specialized physicians did people visit in the past 10 years?"
Assume we're working in the central model. We have a dataset of 〈patient ID, specialist type〉 pairs, and each record corresponds to an individual visiting a specialized physician (cardiologist, dermatologist, radiologist, etc.). We want to count the number of unique patient IDs per specialty.
Note that each patient can influence each count at most once: we count distinct patient IDs, so if you visit dermatologists 10 times, you only add 1 to the "dermatologist" count. However, a single patient might visit many types of specialized physicians, and thus influence many counts at once. Say there are 50 specialties: in the worst case, a single patient influences all 50 counts.
How to make these counts differentially private? A first solution is to split the privacy budget across all the counts: here, into 50 equal parts. If we want to achieve \(\varepsilon=1\), we compute \(\varepsilon'=1/50=0.02\), and add Laplace noise of scale \(1/\varepsilon'=50\) to each count.
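In code, this first solution could look like the following sketch (the counts are made up; `true_counts` stands for the exact number of distinct patients per specialty):

```python
import numpy as np

epsilon = 1.0
num_specialties = 50

# Made-up exact counts of distinct patients per specialty, for illustration.
rng = np.random.default_rng(42)
true_counts = rng.integers(100, 10_000, size=num_specialties)

# Simple composition: each count gets epsilon/50 of the budget,
# which means Laplace noise of scale 50/epsilon = 50 on every count.
noise_scale = num_specialties / epsilon
noisy_counts = true_counts + rng.laplace(scale=noise_scale, size=num_specialties)
```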
Unfortunately, this is a lot of noise.
Fortunately, this is exactly the kind of situation in which Gaussian noise shines. When a single patient can impact \(k\) distinct statistics, we need to scale Laplace noise by \(k\). By contrast, Gaussian noise must only be scaled by \(\sqrt{k}\). Comparing the two gives a much more flattering view of the power of Gaussian noise.
OK, so that's the general idea. Now, why does that happen? How come composition doesn't seem to behave in the same way for Laplace and Gaussian? To understand this better, we'll first introduce the concept of sensitivity.
Different kinds of sensitivities
Consider our example above. For each type of specialized physician, we count the people who consulted with one. But we won't consider this histogram as 50 different counting queries. Instead, we'll consider it as a single function, with values in \(\mathbb{N}^{50}\). It outputs a vector: a list of 50 coordinates, each of which corresponds to a fixed specialty. How to make such a function \(f\) differentially private? We'll add noise, scaled by the function's sensitivity.
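To make this framing concrete, here is a small sketch of what \(f\) could look like (the helper name, the short specialty list, and the records are all made up for illustration):

```python
from collections import defaultdict

SPECIALTIES = ["cardiologist", "dermatologist", "radiologist"]  # ...50 of them in practice

def specialty_histogram(records):
    # records: iterable of (patient ID, specialty) pairs.
    # Returns a single vector with one fixed coordinate per specialty:
    # the number of distinct patients who visited that kind of physician.
    patients_per_specialty = defaultdict(set)
    for patient_id, specialty in records:
        patients_per_specialty[specialty].add(patient_id)
    return [len(patients_per_specialty[s]) for s in SPECIALTIES]

records = [("alice", "cardiologist"), ("alice", "dermatologist"),
           ("bob", "dermatologist"), ("bob", "dermatologist")]
print(specialty_histogram(records))  # [1, 2, 0]: repeat visits count only once
```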
The sensitivity of a function measures how much its output can change when you add one record to the database. If the function returns a single number, we measure the absolute value of the difference: the sensitivity of \(f\) is the maximum value of \(\left|f\left(D_1\right)-f\left(D_2\right)\right|\). We already encountered this notion before: when counting things, if each person could change the statistic by more than \(1\), we had to scale the noise accordingly. The same happened for sums.
Here, the function \(f\) returns a vector. How do we measure the distance between two vectors? We have a few options: we could use the Manhattan distance, the Euclidean distance, or even weirder stuff. As it turns out, the distance we need to use depends on which type of noise we're adding. Laplace noise is scaled by the \(L^1\) sensitivity, itself based on the Manhattan distance. Here is its definition, denoting by \(f_i(D)\) the \(i\)-th coordinate of \(f(D)\):

\[\Delta_1(f) = \max_{D_1,D_2} \sum_i \left|f_i(D_1)-f_i(D_2)\right|\]

where the \(\max\) is taken over pairs of databases \(D_1\) and \(D_2\) differing in a single record. This is easy to understand: you just sum the sensitivities of each coordinate. For our function \(f\), we have \(\Delta_1(f)=50\): Laplace noise needs to be scaled by 50. You might have noticed that this is equivalent to using simple composition: the scale of Laplace noise is \(\Delta_1/\varepsilon\), so dividing \(\varepsilon\) by \(50\) is the same as considering all coordinates together.
By contrast, Gaussian noise needs to be scaled by the \(L^2\) sensitivity. This type of sensitivity is based on the Euclidean distance, and defined by:

\[\Delta_2(f) = \max_{D_1,D_2} \sqrt{\sum_i \left(f_i(D_1)-f_i(D_2)\right)^2}\]

still taking the \(\max\) over \(D_1\) and \(D_2\) differing in a single record. This formula might look complicated, but the Euclidean distance is a simple concept: it's the length of a straight line between two points. If you only have two dimensions, this formula might be reminiscent of the Pythagorean theorem!
The standard deviation \(\sigma\) of Gaussian noise will be proportional to \(\Delta_2(f)\). Let's compute this value for our function. Each patient can change each count by at most \(1\), and might change all 50 of them. Thus:

\[\Delta_2(f) = \sqrt{\sum_{i=1}^{50} 1^2} = \sqrt{50} \approx 7.07.\]

The noise scales with the square root of the number of counts. This is key to Gaussian's superiority in such situations: the \(L^2\) sensitivity grows much more slowly than the \(L^1\) sensitivity. As a result, scaling the noise by the sensitivity hurts accuracy much less. Of course, using Gaussian noise gives you \((\varepsilon,\delta)\)-DP, not pure DP, so there is still a tradeoff. As we saw in the previous article, this isn't a super scary \(\delta\), so it's generally worth it.
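In code, comparing the two norms on the worst-case change that one patient can cause is straightforward (a small illustrative sketch):

```python
import numpy as np

def l1_sensitivity(worst_case_change):
    # Manhattan norm of the worst-case per-coordinate change caused by one record
    return np.sum(np.abs(worst_case_change))

def l2_sensitivity(worst_case_change):
    # Euclidean norm of the same worst-case change
    return np.sqrt(np.sum(np.square(worst_case_change)))

# One patient can change each of the 50 distinct-patient counts by at most 1.
change = np.ones(50)
print(l1_sensitivity(change))  # 50.0
print(l2_sensitivity(change))  # ≈ 7.07
```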
This paper (Theorem 8) gives the exact formula for calibrating Gaussian noise depending on \(\Delta_2(f)\), \(\varepsilon\) and \(\delta\). You need to pick \(\sigma\) such that the following equality holds:

\[\delta = g\!\left(\frac{\sigma}{\Delta_2(f)}\right)\]

where \(g\) is a complicated function that also depends on \(\varepsilon\). As you can see, multiplying \(\Delta_2(f)\) and \(\sigma\) by the same factor has no effect: \(\sigma\) is proportional to \(\Delta_2(f)\). There is no analytic form for \(\sigma\), but since \(g\) is monotonic, you can use e.g. a binary search to approximate its value.
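To make this concrete, here is a sketch of that binary search. The condition inside `delta_for_sigma` is my transcription of the \((\varepsilon,\delta)\)-DP condition from the paper's Theorem 8, so treat it as an illustration rather than a reference implementation:

```python
from math import erf, exp, sqrt

def gaussian_cdf(x):
    # Cumulative distribution function of the standard normal distribution
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def delta_for_sigma(sigma, l2_sensitivity, epsilon):
    # Smallest delta for which Gaussian noise of standard deviation sigma gives
    # (epsilon, delta)-DP, following the analytic Gaussian mechanism condition
    a = l2_sensitivity / (2.0 * sigma)
    b = epsilon * sigma / l2_sensitivity
    return gaussian_cdf(a - b) - exp(epsilon) * gaussian_cdf(-a - b)

def calibrate_sigma(l2_sensitivity, epsilon, delta, tolerance=1e-9):
    # delta_for_sigma decreases as sigma grows, so we binary-search for the
    # smallest sigma whose delta is at most the target
    low, high = 1e-6 * l2_sensitivity, l2_sensitivity
    while delta_for_sigma(high, l2_sensitivity, epsilon) > delta:
        high *= 2.0
    while high - low > tolerance * l2_sensitivity:
        middle = (low + high) / 2.0
        if delta_for_sigma(middle, l2_sensitivity, epsilon) > delta:
            low = middle
        else:
            high = middle
    return high

# Noise needed for our histogram: L2 sensitivity sqrt(50), epsilon=1, delta=1e-5
print(calibrate_sigma(sqrt(50), 1.0, 1e-5))
```

Since \(\sigma\) is proportional to \(\Delta_2(f)\), the value printed for the histogram is simply \(\sqrt{50}\approx 7.07\) times the \(\sigma\) needed for a single count.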
Now, why do these two noise distributions work so differently? Rather than proving this formally, here is a visual intuition. Let's look at the density function of Gaussian and Laplace noise, in two dimensions. The first is Gaussian, the second is Laplace.
As you can see, Gaussian noise has a circular shape, while Laplace noise has a square shape. How indistinguishable are two points when noise is added to both of them? With Gaussian noise, it depends on the straight-line (Euclidean) distance between them. With Laplace noise, it depends on their Manhattan distance.
In conclusion, whether to use Laplace or Gaussian noise depends on two things:
- whether we are OK with a non-zero \(\delta\);
- how many statistics a single individual can influence.
The first point is clear: if we want \(\delta=0\), we can't use Gaussian noise. Let's quantify the second point. If a single person can impact at most one statistic, Laplace is better. If they can impact many, Gaussian is better. Where does the boundary lie? The following graph answers this question, comparing both mechanisms by their standard deviation. We pick \(\varepsilon=1\) and \(\delta=10^{-5}\), and we assume that each person can influence each statistic at most once.
For these values of \(\varepsilon\) and \(\delta\), Gaussian noise is better if each individual can influence 8 statistics or more. Of course, with different privacy parameters, the result might differ. But as the impact of a single individual grows, Gaussian noise will always end up being the least noisy choice.
Further uses
What if each person can influence each statistic more than once, and by a different amount for each? Suppose, for example, that we count the number of visits to each type of physician. A single patient can add many visits to a single count. Worse, the maximum number of visits per patient can vary across physician types. Can we still use Gaussian noise? The answer (hat tip to Mark) is yes: you can scale down each statistic so that its sensitivity becomes \(1\), add noise calibrated to the rescaled vector, then scale the results back up (sketched below). This makes Gaussian noise even more powerful: if you compute many statistics on the same data, you can use the Gaussian mechanism to reduce the noise magnitude, even if the statistics are completely unrelated.
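Here is a sketch of this rescaling trick, reusing the `calibrate_sigma` helper from the earlier snippet (the per-specialty sensitivities and visit counts below are made up for illustration):

```python
import numpy as np

# Made-up maximum number of visits a single patient can add to each count,
# and made-up exact visit totals.
max_visits_per_patient = np.array([20.0, 5.0, 12.0, 3.0])
true_visit_counts = np.array([15_300.0, 4_120.0, 9_880.0, 770.0])

epsilon, delta = 1.0, 1e-5
k = len(true_visit_counts)

# After dividing each statistic by its own sensitivity, one patient changes
# each coordinate by at most 1, so the L2 sensitivity of the vector is sqrt(k).
sigma = calibrate_sigma(np.sqrt(k), epsilon, delta)

rng = np.random.default_rng()
scaled = true_visit_counts / max_visits_per_patient
noisy_scaled = scaled + rng.normal(scale=sigma, size=k)
noisy_visit_counts = noisy_scaled * max_visits_per_patient
```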
Finally, Gaussian noise is heavily used in differentially private machine learning, and the fundamental reason is the same. Consider stochastic gradient descent, a popular training algorithm. At each iteration, each data point influences many coordinates of the gradient vector. To make the training process differentially private, we need to add noise to all of these coordinates at once, so Gaussian noise is a good choice, for the exact same reason as before. Machine learning is full of more Gaussian-related goodness, but this article is long enough already.
Maybe we'll come back to it in future posts! The next article, though, tackles a much simpler problem: what do you do when you try using differential privacy on your data, but the results aren't accurate enough?
Thanks to Frank McSherry and Antoine Amarilli for their helpful comments.