Almost differential privacy
Part of a series on differential privacy. You might want to start with the previous articles below!
 Why differential privacy is awesome presents a nontechnical explanation of the definition.
 Differential privacy in (a bit) more detail introduces the formal definition, with very little math.
 Differential privacy in practice (easy version) explains how to make simple statistics differentially private.
 Almost differential privacy (this article) describes how to publish private histograms without knowing the categories in advance.
 Local vs. global differential privacy presents the two main models of differential privacy, depending on who the attacker is.
 The privacy loss random variable explains the real meaning of \((\varepsilon,\delta)\)differential privacy.
 The magic of Gaussian noise introduces Gaussian noise and its shiny properties.
 Why not differential privacy? explores what it means for an algorithm to not be differentially private.
 Getting more useful results with differential privacy presents five simple techniques to improve the utility of your anonymized data.
Let's continue where we left off. In the last article, we saw how to publish histograms in a privacypreserving way. Adding noise to each count was enough to get εdifferential privacy. But we finished with a puzzling statement: I mentioned that if you don't know the categories in advance, the technique no longer works. In fact, the problem gets much trickier. We'll even need to introduce a variant of the original definition! Let's dive in.
Openended survey question
Let's say you're doing a survey where you asked people what's their favorite color. Instead of giving them a list of fixed options, you let them write whatever text they want. Lots of answers are going to be common colors: blue, green, pink… But realworld data is noisy, and you're surely going to get unpredictable answers. Some might be junk answers: people misunderstanding the question, or trolling the survey. Other might simply be rare colors. You want to publish a histogram of answers.
Let's use the same technique as before. What happens if we add Laplace noise of scale \(1/\varepsilon\) to each category? We need to compare the output of this process for two databases that differ on a single element. There are two possibilities.
The two databases have the same categories
If you're lucky, the two databases have the same categories. For example:
 In one, you got 10 green answers, 5 red, and 2 yellow.
 In the other, you got 10 green answers, 5 red, and 3 yellow.
Then, adding noise to each category works fine. The only difference is in the yellow category. By adding noise, we hide the difference between the two values, exactly like before.
The two databases don't have the same categories
This is where it gets trickier. For example:
 In one, you got 10 green answers, 5 red, 2 yellow.
 In the other, you got 10 green answers, 5 red, 2 yellow, and one ultramarine.
Let's see what happens if you add noise to both. Each column will end up with a slightly different number than the real one. But there's something glaringly obvious: the categories are different!
No need to squint at the numbers to notice the difference between these two histograms! It's easy for an attacker to tell apart outputs with different categories. We call this a distinguishing event: the attacker can learn with 100% certainty which database is the right one. Thus, the process is not differentially private. How to fix this?
Maybe we could list all possible categories, and add noise to each of them, including zero counts. Unfortunately, it's not as simple as that. In our example, people can enter anything: there's an infinite number of possibilities. The good news is that at the cost of a slight relaxation in our privacy guarantee, we can overcome that problem.
A solution: thresholding
It's fairly difficult to make sure distinguishing events never happen^{1}. Instead, we can settle for the next best thing: we prevent them from happening most of the time. One way to do that is thresholding. Not only do we add noise to each category, but we also remove all categories with low counts. Let's say that our threshold is 5. In the example above, we would end up releasing only two categories:
There's a price to that strategy: we're losing rare categories. In this example, we didn't only drop the ultramarine category, but yellow as well. Any category whose count is close to 5 (or less) has a significant chance of being lost. Often, that's not a big problem: rare answers have a larger chance of being meaningless.
That solution isn't perfect from a privacy perspective. For example, what if the noise added to the ultramarine category is larger than 4? Then the total count is 5 or more, we end up publishing this category, and it breaks differential privacy. Fortunately, this doesn't happen too often: only 0.6% of the time with Laplace noise of parameter \(1/\ln(3)\).
Tying it all together: \((\varepsilon,\delta)\)differential privacy
Our strategy is a little more complicated than before. We now have two parameters.
 The amount of noise we're adding. Just like before, if we're aiming for \(\varepsilon\)differential privacy most of the time, we need to add Laplace noise of scale \(1/\varepsilon\).
 The threshold we're using to drop rare categories, after adding noise. It induces a natural tradeoff. The bigger the threshold, you more data you lose… But the bigger the threshold, the more you reduce the odds of having a distinguishing event.
Let's visualize this. For each threshold, what are the odds that by adding noise to a category with count 1, you end up above the threshold? The following graph assumes Laplace noise of parameter \(1/\ln(3)\).
Using a logarithmic scale, the graph is a straight line. That makes sense: Laplace noise is a double exponential distribution.
Now, the choice of threshold is specific to the algorithm. For a different algorithm, or a different noise function, the same threshold might have a different effect. So it's not a good idea to use it directly to quantify privacy. Instead, we use the odds of a distinguishing event as an additional parameter to our modified definition.
Formal definition
From \(\varepsilon\)differential privacy, we get \((\varepsilon,\delta)\)differential privacy. This new definition is stricly weaker than the original definition, and has a similar formulation. For all databases \(D_1\) and \(D_2\) which differ in only one individual, and all sets \(S\) of outputs:
The meaning of \(\varepsilon\) is the same as before. The only new element is the \(\delta\). It captures the odds that something goes wrong^{2}. In our example above, \(\delta\approx0.006=0.6\%\). By using \((\varepsilon,\delta)\)differential privacy, we're saying that the algorithm is almost \(\varepsilon\)differentially private. And here, almost means with probability \(1\delta\): the closer \(\delta\) is to 0, the better.
Criticisms of the definition
As I said, you can see \(\delta\) as the probability that something goes terribly wrong. For a privacy definition, this seems like a bad thing to have. Consider the following algorithm, which takes a database as input. With probability \(1\delta\), it returns 42. With probability \(\delta\), it returns the entire database. Talk about a data leak! Still, this algorithm is \((0,\delta)\)differentially private.
This example illustrates that this \(\delta\) parameter allows for catastrophic failures^{3}. Knowing this, you have two options.
 Either you work harder to predefine categories, or use more advanced techniques, and only use "true" differential privacy.
 Either you accept that bad things can happen, and try to limit the risk by mandating a tiny \(\delta\).
I'd argue that the second solution is not a bad choice. The probability of getting hit by lightning in your lifetime is on the order of \(10^{4}\). The probability of a given bit in your RAM being randomly flipped by a cosmic ray in one year is about \(10^{6}\). In many situations, it's reasonable to consider these a negligible risk.
My perspective is that everything in data protection is about risk mitigation. You'll never reduce the risk to 0. Even if you use "true" differential privacy, your implementation might have critical bugs. Or you might get hacked, and your entire anonymization strategy might become irrelevant. Or someone might drug you and hit you with a $5 wrench until you give them your database. What are the odds of this happening? If your \(\delta\) is even smaller, it might be an acceptable price to pay for more convenience.
How to choose \(\delta\)?
Considering the catastrophic scenarios above, maybe our \(\delta\) of 0.6% is a bit too large to use everywhere. But what's a good number? A common option is to pick a \(\delta\) that is significantly smaller than \(1/n\), where \(n\) is the total number of people in the database. The reasoning goes as follows. Each person has, in the worst case, a \(\delta\) chance that their data leaks. So the total odds that someone's data leaks is \(\approx n\delta\): we need to make sure that this number is small enough^{4}.
Luckily, in the problem above, you don't need huge thresholds to get tiny values of \(\delta\). If you have a million users, and you want \(n\delta<0.1\), a threshold of 15 is enough.
Cool properties
\((\varepsilon,\delta)\)differential privacy has the same cool properties as differential privacy.
 Composition: suppose you have two \((\varepsilon,\delta)\)differentially private mechanisms. Then, publishing the result of both satisfies \((2\varepsilon,2\delta)\)differential privacy.
 Postprocessing: suppose you have a \((\varepsilon,\delta)\)differentially private mechanism. Then if you make its output go through a fixed transformation, you still get \((\varepsilon,\delta)\)differential privacy.
That means that most of what we learned in the simpler case of predefined categories still applies. You can round noisy values to integers without risk. If the same person can be in multiple buckets, you can adapt the values of \(\varepsilon\) and \(\delta\). You can also compute sums, although you should be careful in how you adapt the threshold when doing so.
With that, we covered the most frequent and easy use cases for differential privacy. The next post covers an important distinction between the local and global model in differential privacy, and gives a glimpse at what comes inbetween.
Thanks to Frank McSherry and Antoine Amarilli for their helpful comments.

But, as it turns out, not completely impossible. The "Improved Adaptative Histogram" method described in this paper does exactly that, even if the space of possible categories is infinite. I don't know how it compares to the approach described in this post in terms of data loss & truthfulness. It'd be interesting to figure out! ↩

This intuition is technically incorrect, but it's a good first approximation. Most people can understand the idea of "a small chance that something goes wrong". The real interpretation is more complex, I wrote a blog post that explains it. ↩

For more fun examples, check out this blog post. Its author is one of the original creators of differential privacy. I recommend checking his other posts! ↩

Note that this assumes independence between all the possible data leak events. This is wrong in general, but it's a good enough approximation in practice. ↩