# The magic of Gaussian noise

This post is part of a series on differential privacy. Check out the table of contents to see the other articles!

P**reviously**, we used Gaussian noise to explain
the real meaning of \(\delta\) in \((\varepsilon,\delta)\)-differential
privacy. One question was left unanswered: why would anyone use Gaussian noise
in the first place? The guarantees it provides aren't as strong: it gives
\((\varepsilon,\delta)\)-DP with \(\delta>0\), while Laplace noise provides pure
\(\varepsilon\)-DP. This blog post gives an answer to this question, and describes
the situations in which Gaussian noise excels.

# Gaussian noise is nice

A first advantage of Gaussian noise is that the distribution itself behaves
nicely. It's called the *normal* distribution for a reason: it has
convenient properties, and is very widely used in natural and social sciences.
People often use it to model random variables whose actual distribution is
unknown. If you sum many independent random variables, you end up with a normal
distribution. And these are just a few of the many
properties of this fundamental distribution. Thus, most data
analysts and scientists are already familiar with Gaussian noise. It's
convenient when you release anonymized statistics: analysts don't need to learn
too many new concepts to understand what you're doing to protect the data.

A second advantage is that the Gaussian distribution has nice, *thin* tails. The
vast majority of its probability mass is focused around its mean. Take a normal
distribution with mean 0 and standard deviation \(\sigma\). The 68–95–99.7
rule says that a random variable sampled from this distribution will be:

- in \([-\sigma,\sigma]\) with 68% probability;
- in \([-2\sigma,2\sigma]\) with 95% probability;
- and in \([-3\sigma,3\sigma]\) with 99.7% probability.

It even gets better as you go further away from the mean. The probability that
the random variable is *outside* \([-k\sigma,k\sigma]\) decreases faster than
\(e^{-k^2/2}\). In practice, you're rarely surprised by the values that a Gaussian
distribution takes. Even if you sample \(1,000,000\) values, they are all probably
going to be within \([-5\sigma,5\sigma]\).

Laplace, by comparison, isn't quite as nice. Its tails decrease exponentially fast,
but that's still much slower than Gaussian tails. Suppose you sample \(1,000,000\)
values from a Laplace distribution of standard deviation \(\sigma\). On average,
**849** of them will be outside \([-5\sigma,5\sigma]\).

OK, so Gaussian noise is nice. But that does not change a simple fact: to get a comparable level of privacy for a single statistic, Laplace is much better. Assume that we're adding noise to a simple count, of sensitivity \(1\). This graph compares the Laplace noise needed to get \(\varepsilon=1\), and Gaussian noise needed to get \(\varepsilon=1\) and \(\delta=10^{-5}\).

Despite its weaker privacy guarantees, the Gaussian distribution is much flatter. Its standard deviation is over 3.4, while Laplace's is less than 1.3. Thus, much more noise will need to be added, and analysts care a lot about minimizing the noise. Why, then, would Gaussian noise be a good option? The answer is simple: because it gets better when you're publishing a lot of statistics.

# From one to many statistics

In most of our previous examples, we assumed that each individual
appeared in a *single* statistic. This case is common, for example when
partitioning people based on demographic information. But in many applications,
this assumption does not hold. Imagine, for example, that you want to answer the
question: « what types of specialized physicians did people visit in the past 10
years? »

Assume we're working in the central model. We have a dataset of 〈patient ID, specialist type〉 pairs, and each record corresponds to an individual visiting a specialized physician (cardiologist, dermatologist, radiologist, etc.). We want to count the number of unique patient IDs per specialty.

Note that each patient can only influence a single count *once*. We count
*distinct* patient IDs: if you visit dermatologists 10 times, you will only add 1
to the "dermatologist" count. However, a single patient might visit many types of
specialized physicians. There are many kinds of
specialties, and a single
patient might influence all the counts. Say there are 50 of them.

How to make these counts differentially private? A first solution is to *split
the privacy budget* across all the counts. Here, we can
split our privacy budget in 50. If we want to achieve \(\varepsilon=1\), we
compute \(\varepsilon'=1/50=0.02\), and add Laplace noise of scale
\(1/\varepsilon'=50\) to all the counts.

Unfortunately, this is a lot of noise.

Fortunately, this is exactly the kind of situation in which Gaussian noise shines. When a single patient can impact \(k\) distinct statistics, we need to scale Laplace noise by \(k\). By contrast, Gaussian noise must only be scaled by \(\sqrt{k}\). Comparing the two gives a much more flattering view of the power of Gaussian noise.

OK, so that's the general idea. Now, *why* does that happen? How come
composition doesn't seem to behave in the same way for Laplace and Gaussian? To
understand this better, we'll first introduce the concept of *sensitivity*.

# Different kinds of sensitivities

Consider our example above. For each type of specialized physician, we count the
people who consulted with one. But we won't consider this histogram as 50
different counting queries. Instead, we'll consider it as a *single* function,
with values in \(\mathbb{N}^{50}\). It outputs a *vector*: a list of 50
coordinates, each of which corresponds to a fixed specialty. How to make such a
function \(f\) differentially private? We'll add noise, scaled by the function's
*sensitivity*.

The sensitivity of a function measures how much its output can change when you
add one record in the database. If the function returns a single number, we
measure the *absolute value* of the difference: the sensitivity of \(f\) is the
maximum value of \(\left|f\left(D_1\right)-f\left(D_2\right)\right|\). We already
encountered the sensitivity before: when counting things, if each
patient can change the statistic by more than \(1\), we needed to scale the noise
accordingly. The same happened for sums.

Here, the function \(f\) returns a vector. How do we measure the distance between two vectors? We have a few options. We could use the Manhattan distance, or the Euclidean distance, or even weirder stuff. As it turns out, the distance we need to use depends on which noise function we're adding. Laplace noise is scaled by the \(L^1\) sensitivity, itself based on the Manhattan distance. Here is its definition, denoting by \(f_i(D)\) the \(i\)-th coordinate of \(f(D)\):

where the \(\max\) is taken over \(D_1\) and \(D_2\) differing in a single record. This is easy to understand: you just sum the sensitivities for each coordinate. For our function \(f\), we have \(\Delta_1(f)=50\): Laplace noise needs to be scaled by 50. You might have noticed that this is equivalent to using simple composition: the scale of Laplace noise is \(\Delta_1/\varepsilon\), so dividing \(\varepsilon\) by \(50\) is the same as considering all coordinates together.

By contrast, Gaussian noise needs to be scaled by the \(L^2\) sensitivity. This type of sensitivity is based on the Euclidean distance, and defined by:

still taking the \(\max\) over \(D_1\) and \(D_2\) differing in a single record. This formula might look complicated, but the Euclidean distance is a simple concept: it's the length of a straight line between two points. If you only have two dimensions, this formula might be reminiscent of the Pythagorean theorem!

The standard deviation \(\sigma\) of Gaussian noise will be proportional to \(\Delta_2(f)\). Let's compute this value for our function. Each patient can change a single count by at most one, and each can change all counts. Thus:

The noise scales with the *square root* of the number of counts. This is key to
Gaussian's superiority in such situations: the \(L^2\) sensitivity grows much more
slowly than the \(L^1\) sensitivity. As a result, scaling the noise by the
sensitivity hurts accuracy much less. Of course, using Gaussian noise gives you
\((\varepsilon,\delta)\)-DP, not pure DP, so there is still a tradeoff. As we saw
in the previous article, this isn't a super scary \(\delta\), so it's
generally worth it.

This paper (Theorem 8) gives the exact formula for calibrating Gaussian noise depending on \(\Delta_2(f)\), \(\varepsilon\) and \(\delta\). You need to pick \(\sigma\) such that the following equality holds:

where \(g\) is a complicated function. As you can see, increasing \(\Delta_2(f)\) and \(\sigma\) simultaneously has no effect: \(\sigma\) is proportional to \(\Delta_2(f)\). There is no analytic form for \(\sigma\), but since \(g\) is monotonic, you use e.g. a binary search to approximate its value. If you want to know the exact formula, click here: .

Now, why do these two noise distributions work so differently? Rather than proving this formally, here is a visual intuition. Let's look at the density function of Gaussian and Laplace noise, in two dimensions. The first is Gaussian, the second is Laplace.

As you can see, Gaussian noise has a circular shape, while Laplace noise has a
square shape. How indistinguishable are two points, when noise is added to both
of them? With Gaussian noise, it depends on their distance from each other *in
a straight line*. By contrast, with Laplace, it depends on how far they are in
Manhattan distance.

In conclusion, whether to use Laplace or Gaussian noise depends on two things:

- whether we are OK with a non-zero \(\delta\);
- how many statistics a single individual can influence.

The first point is clear: if we want \(\delta=0\), we can't use Gaussian noise. Let's quantify the second point. If a single person can impact at most one statistic, Laplace is better. If they can impact many, Gaussian is better. Where does the boundary lie? The following graph answers this question, comparing both mechanisms by their standard deviation. We pick \(\varepsilon=1\) and \(\delta=10^{-5}\), and we assume that each person can influence each statistic at most once.

For these values of \(\varepsilon\) and \(\delta\), Gaussian noise is better if each individual can influence 8 statistics or more. Of course, with different privacy parameters, the result might differ. But as the impact of a single individual grows, Gaussian noise will always end up being the least noisy choice.

# Further uses

What if each person can influence each statistic *differently*? Suppose, for
example, that we count the number of *visits* to each type of physician. A
single patient can add many visits to a single count. Worse, the maximum number
of visits per patient can vary across physician types. Can we still use Gaussian
noise? The answer (hat tip to Mark) is *yes*: you can scale down
each statistic so the sensitivity becomes \(1\), add noise to them, then scale
them back up. This makes Gaussian noise even more powerful: if you compute many
statistics on the same data, you can use the Gaussian mechanism to reduce the
noise magnitude, even if the statistics are completely unrelated.

Finally, Gaussian noise is heavily used in differentially private machine learning. The fundamental reason is the same. Consider methods like stochastic gradient descent, a popular algorithm in machine learning. At each iteration of this method, each data point influences many coordinates of a vector. To make it differentially private, we need to add noise to all coordinates. Thus, Gaussian noise is a good choice, for the exact same reason. Machine learning is full of more Gaussian-related goodness, but this article is long enough already.

Maybe we'll come back to it in future posts! The next article, though, tackles a much simpler problem: what do you do when you try using differential privacy on your data, but the results aren't accurate enough?

Thanks to [Frank McSherry] and [Antoine Amarilli] for their helpful comments.