Differential privacy in practice (easy version)

This post is part of a series on differential privacy. Check out the table of contents to see the other articles!

Previously, we saw that differential privacy was pretty awesome, and we looked at the formal definition. Now, how do we obtain it in practice? Let's start with the basics.

Counting unique users

Suppose you have a database, and you want to publish how many people in there satisfy a given condition. Say, how many have green eyes? Even if you have many people in your database, you can't just publish the true answer. Let's take a moment to understand why.

With differential privacy, we assume that the attacker knows almost all elements. They only have uncertainty about their target. Say they want to know whether their target has green eyes. If you output the real number \(k\), they can compare it with the number of people with green eyes among the people they know. If it's \(k-1\), then the target has green eyes. If it's \(k\), then the target does not.

So, what do we do? We compute the exact answer, and we add noise. This noise will come from a probability distribution called the Laplace distribution. This distribution has a parameter, its scale, which determines how "flat" it is. It looks like this:

Graph showing a Laplace distribution with scale 1/ln(3), centered on 0

So, to get \(\varepsilon\)-differential privacy, we pick a random value according to \(\text{Laplace}(1/\varepsilon)\), and we add this noise to the real value. Why does it work? Let's look at the distribution of the number we return, depending on whether the true count is \(k=1000\) (blue line, the target doesn't have green eyes) or \(k=1001\) (yellow line, the target has green eyes).

Graph showing two Laplace distributions with scale 1/ln(3), centered on 1000 and 1001

Let's say the real number is \(k=1001\), and after adding noise, we published \(1003\). Let's put ourselves in the attacker's shoes. What's the likelihood that the original number was \(1001\) vs. \(1000\)? The hypothesis "\(k=1001\)" is a bit more likely: generating a noise of \(2\) is more likely than a noise of \(3\). How much more likely? It turns out that the ratio between these likelihoods is… \(e^\varepsilon\)! So the ratio of probabilities of differential privacy is satisfied.

This works no matter what the output is: the ratio will always be between \(e^\varepsilon\) and \(e^{-\varepsilon}\). If you want to double-check, you can either verify it on the graph, or do the math.

Counting things

OK, so counting unique users was pretty easy. Counting things must also be straightforward, right? Let's say you have a database of suggestions that people sent to your company using a feedback form. You want to publish the number of suggestions you received on a given day. Meanwhile, the attacker wants to get an idea of how many complaints their target published.

What's different about the previous scenario? Can't we just add noise picked from \(\text{Laplace}(1/\varepsilon)\) and get \(\varepsilon\)-differential privacy? There's a catch: what if someone sent more than one complaint during one day? Let's say someone was super unhappy and sent five complaints. The other 1000 customers sent one complaint each. The influence of this one disgruntled customer will be larger than before. The two distributions now look like this:

Graph showing two Laplace distribution with scale 1/ln(3), centered on 1000 and 1005

The difference between the curves is much larger than before. Their ratio is at most \(e^{5\varepsilon}\), so using a parameter of \(1/\varepsilon\) only gives \(5\varepsilon\)-differential privacy. To fix this, we need to add more noise. How much more? It depends on the maximum contribution of one individual user. If the maximum amount of complaints in one day is 5, you must add 5 times the amount of noise. In this example, using \(\text{Laplace}(5/\varepsilon)\) would give you \(\varepsilon\)-differential privacy.

Graph showing two Laplace distribution with scale 5/ln(3), centered on 1000 and 1005

Note that you can't fully automate this process: you need to know what the largest contribution can be. A human, with some knowledge over the process, must make a judgment call. In our case, this could be "users won't post more than five complaints per day".

What happens if that judgment call is wrong, and a user later decides to post 10 complaints in one day? To preserve the desired level of privacy, you need to clamp all values to the estimated maximum. In other words, for this outlier user, you would only count 5 complaints in the non-noisy sum.

This process can introduce unexpected bias in the data. So, be careful when estimating the largest contribution! If clamping only happens very rarely, you should be fine.

Summing or averaging numbers

Let's say each of your users gives your customer service a rating, between -10 and 10. You want to release the average rating. Computing an average is pretty much the same as computing a sum — add all ratings, then divide by the number of users¹. So, what do we do to the sum to achieve differential privacy?

Among all rating options, we only have to consider the worst possible case. How far can the two noise curves be from each other? If the values are all between -10 and 10, the greatest possible difference is \(10-(-10)=20\). It happens when the attacker tries to determine whether a user voted -10 or 10.

Like in the previous example, you have to add noise of \(Laplace(20/\varepsilon)\) to get \(\varepsilon\)-differential privacy. And just as before, you need to check that each value is between your theoretical minimum and maximum. If you find an anomalous value, e.g. lower than the minimum, you need to clamp it to the minimum before adding it to the sum.

In some cases, estimating these minimum or maximum values is difficult. For example, if you're computing the average salary in a large group of people, how should you estimate the upper salary limit? I don't see that problem as a usability flaw of differential privacy. Rather, it suggests that averages are not a meaningful metric in the presence of outliers. Removing these outliers is a good idea for both accuracy and privacy :-)

Releasing many things at once

OK. What if you don't want to release only one statistic, but many of them? Can you add noise to each of them and be fine? Well… it depends. The main question you have to ask yourself is: what is the maximum influence of one individual? There are two distinct possibilities.

Statistics are about different people

Suppose you want to release the number of users you have depending on their age ranges: 20-29, 30-39, 40-49, etc.

Each user will have an influence in at most one of these categories. Either someone is in a given age range, either they're in another one. This situation often appears when you're trying to release a histogram:

Histogram of fake data showing a number of people by age range

When you're in this case, you can safely add Laplace noise of scale \(1/\varepsilon\) to each bucket count. There is no problematic interaction between buckets. Releasing the entire histogram is still \(\varepsilon\)-differentially private.

Noised version of the previous histogram, some bars have changed slightly

Pretty easy, right? Note that this histogram looks a bit weird: the counts are not integers, and one count ended up being negative! We can make it a bit less suspicious, by rounding all counts and replacing all negative numbers by 0.

"Cleaned" version of the previous histogram, all values are now positive integers

This type of post-processing is allowed, thanks to a nifty property of differential property. If you take differentially private data, and make it go through a fixed transformation, you still get differential privacy². Convenient!

One person is in multiple statistics

OK, what if you're releasing multiple statistics, but this time, they might all be about the same user? Let's say that you want to publish how many of your users…

are younger than 35;
are using an iOS device;
are colorblind;
have started using your app less than a month ago.

The same user could be in all those categories! In this scenario, you can't add Laplace noise of scale \(1/\varepsilon\) to each count and get \(\varepsilon\)-differential privacy. Instead, you have to consider each count as a separate data release. Thus, if you have \(C\) different counts, you have to add Laplace noise of scale \(C/\varepsilon\) to each of them. Each independent release will be \(\varepsilon/C\)-differentially private. And we can now use the composition property of differential privacy! This allows us to conclude that the entire release is \(\varepsilon\)-differentially private.

This works for any kind of statistics, not just unique counts. Want to release several pieces of information? Count the maximum influence of one single person, and "split" your \(\varepsilon\) between each data release. This \(\varepsilon\) is called your privacy budget: you choose \(\varepsilon_1\), …, \(\varepsilon_C\) whose sum is \(\varepsilon\), and you release the \(i\)-th statistic with \(\varepsilon_i\)-differential privacy. This solution is more flexible than simply using \(\varepsilon/C\)-differential privacy on each statistic. If one statistic is more important, or more sensitive to noise, you can attribute more budget for it. The larger the budget portion, the lower the noise. Of course, you will have to add more noise to the other values.

Traps to avoid

The mechanisms we saw today are pretty straightforward. But anonymization is full of traps: there are still a number of things that can go wrong.

When summing or averaging numbers, clamping them to the minimum and maximum is essential. Otherwise, all guarantees fly out the window.
Pay attention how you implement this clamping in practice. Special values like NaN can lead to surprising behavior.
Of course, you're not allowed to cheat. You have to choose your privacy strategy in advance, then apply it, and release the noisy data. You can't double-check that it's accurate enough. Otherwise, you skew the randomness, and you lose the privacy guarantees.
When releasing histograms, it's important to choose each category in advance. If you have to look at your data to know what your categories are, you have to use more subtle methods.

This last point is often a problem in practice. For example, say that you want to count how many times each word was used in customer complaints. You can't make a definite word list: people could use words that you didn't predict, or make typos. In that case, the technique I described for histograms doesn't work.

Why doesn't it work? How to fix this? All this, and more, in the next article! Or you can also head over to the table of contents of this blog post series to pick what to read next.

If you want to be extra pedantic, you might also want to add noise to your total number of users. That depends on the flavor of definition that you choose. I'm not going to that level of detail here, and you probably shouldn't either. ↩
If you're wondering why this is true, here's a super short proof. If there was a post-processing function that would break the differential privacy property… The attacker could run it too, and distinguish between two outcomes. But it's impossible, because differential privacy forbids it :-) ↩

..@..♦.D.	About	Blog	Recipes
latest — rss — archives ← previous — next →