Lowering the cost of anonymization

2.1.6 Differential privacy ✿

The flaws of syntactic definitions pointed in the previous section are impossible to fix without a change in paradigm. Trying to find a criterion for a database to be anonymized is always going to be flawed, so we need a different approach. One of the core insights of differential privacy, which we define in this section, is that anonymization should be a property of the process instead. The definition should apply to the anonymization mechanism, and not to its output.

What property do we want an anonymization mechanism to have? Intuitively, we want it to not leak private information: an attacker with access to the output of the mechanism should not learn anything about an individual. A first attempt could be to require the mechanism $M$ to be such that for any databases $D_{1}$ and $D_{2}$ that differ in only one record, $M (D_{1}) = M (D_{2})$ .

Why would such a definition accomplish our goal? For any single record of a database $D$ , if we change or remove that record completely, we would still obtain the same output. This means that the output does not leak anything about the individual corresponding to this record: if it did, changing the record would change the output, and an attacker might be able to notice.

This definition corresponds to a privacy goal set out by Dalenius. In 1977, he defined statistical disclosure as follows.

If the release of a statistic $S$ makes it possible to determine the value [of a single record] more accurately than is possible without access to $S$ , a disclosure has taken place […]

The goal is to prevent this from happening. Our definition satisfies this goal: the output does not leak anything about individual records. The problem, of course, is that this definition is trivial. Any database $D_{2}$ can be obtained from any database $D_{1}$ by a series of “neighboring” databases, removing and adding records one by one as needed; so for any databases $D_{1}$ and $D_{2}$ , $M (D_{1}) = M (D_{2})$ . It is not only impossible to learn anything about individual records by looking at the output, but it is impossible to learn anything about the entire database. Allowing the mechanism $M$ to be randomized would not change anything: the equality would be between probability distributions, but the result would be the same.

In that sense, Dalenius’ privacy goal is fundamentally impossible to achieve. If we cannot learn anything about individuals, then by extension, we cannot learn anything about the database either. Thus, we have to accept leaking some information about individuals, and try to quantify and limit this leakage of private information. This is what $ε$ -differential privacy does: rather than requiring that $M (D_{1})$ is exactly the same as $M (D_{2})$ , for neighboring $D_{1}$ and $D_{2}$ , it requires that $M (D_{1})$ is almost the same as $M (D_{2})$ . This notion is formalized via the concept of $ε$ -indistinguishability.

Definition 5 ( $ε$ -indistinguishability [128]¹). Two random variables $A$ and $B$ are $ε$ -indistinguishable, denoted $A \approx_{ε} B$ , if for all measurable sets $X$ of possible events:

P [A \in X] \leq e^{ε} \cdot P [B \in X] and P [B \in X] \leq e^{ε} \cdot P [A \in X] .

Informally, $A$ and $B$ are $ε$ -indistinguishable if their distributions are “close”; and the smaller $ε$ is, the closer $A$ and $B$ are. This notion can then be used to define differential privacy.

Definition 6 ( $ε$ -differential privacy [122 , 128]). A mechanism $M$ is $ε$ -differential private (or $ε$ -DP) if for all datasets $D_{1}$ and $D_{2}$ that differ only in one record, $M (D_{1}) \approx_{ε} M (D_{2})$ .

Just like $k$ in $k$ -anonymity, this definition depends on a numeric parameter $ε$ . The smaller the $ε$ , the stronger the definition; if $ε = 0$ , the distributions are exactly the same, and as discussed above, $M$ is trivial. Setting $ε$ is, again, a non-trivial policy decision. However, this time, we can formally quantify the privacy leakage depending on the value of $ε$ , and make statements such as “an attacker who thinks their target is in the dataset with probability 50% can increase their level of certainty to at most 75%”.

How does this work? Let us first consider a simple example, called randomized response.

A simple example: randomized response

Interestingly, randomized response was introduced in 1965 [383], 40 years before differential privacy. It works as follows. Suppose we want to do a survey to know how many people are respecting social distancing guidelines during the COVID-19 pandemic. If we naively go out and ask people whether they are, many might feel uncomfortable answering honestly, and lie to us. So we can the following mechanism. The participants no longer directly answer the question “are you respecting social distancing guidelines?”. Instead, each of them will flip a coin, without showing it to us.

On heads, the participant tells the truth (Yes or No).
On tails, they flip a second coin. If the second coin lands on heads, they answer Yes. Otherwise, they answer No.

How is this better for survey respondents? They can now answer Yes without revealing that they are doing something illegal. When someone answers No, we cannot know their true answer for sure. They could be breaking social distancing guidelines, but they might also have answered at random. Let us compute the probabilities of each answer someone who broke social distancing guidelines.

With probability 50%, they will say the truth and answer No.
With probability 50%, they will answer at random.
- They then have another 50% chance to answer Yes, so 25% chance in total.
- Similarly, in total, they have a 25% chance to answer No.

All in all, we get a 75% chance to answer No and a 25% chance to answer Yes. For someone who respecting social distancing guidelines, the probabilities are reversed: 25% chance to answer No and 75% to answer Yes. Calling this randomization mechanism $M$ :

\begin{matrix} P [M (Yes) = Yes] = 0.75 & P [M (Yes) = No] = 0.25 P [M (No) = Yes] = 0.25 & P [M (No) = No] = 0.75 \end{matrix}

Now, $0.75$ is three times bigger than $0.25$ . So if we choose $ε$ such as $e^{ε} = 3$ (which corresponds to $ε ≃ 1.1$ ), this process is $ε$ -differentially private. This intuitive notion of plausible deniability translates naturally in the language of differential privacy.

Of course, with a differentially private process like this one, we are getting some noise into our data: some answers are going to be random. But given enough answers, with high probability, the noise will cancel itself out. Suppose that we have 1000 answers in total: 400 of them are Yes and 600 are No. About 50% of all 1000 answers are random, so we can remove 250 answers from each count. In total, we get 150 Yes answers out of 500 non-random answers, so about 30% of Yes overall.

What if we want the process to offer a higher level of privacy? Instead of having the participants answer randomly with probability 50%, we can have them answer randomly 75% of the time. Conversely, if you want less noise instead, we could have them answer randomly only 25% of the time. There is a natural trade-off between privacy (for individuals) and utility (for the data), which is present in all applications of differential privacy.

How can we now translate this intuition into a formal statement about the information gain of an attacker? A relatively simple formalization is the Bayesian interpretation of differential privacy, generalizing an analysis that was originally done in the context of randomized response mechanisms [144].

Formally quantifying the attacker’s information gain

Let us consider a more generic scenario than the previous example. We have a mechanism $M$ which is $ε$ -differentially private. We run it on some database $D$ , and release the output $M (D)$ to an attacker. Then, the attacker tries to find out whether their target is in $D$ . This attack goal is similar to the one of $δ$ -presence (see Section 2.1.4), but we could just as easily have the attacker try to find out the sensitive value associated to their target, or any other information about a single individual.

Under differential privacy, the attacker cannot gain a lot of information about their target. This is true even if this attacker has a lot of knowledge about the dataset. Let us take the stronger attacker we can think of: they know all the database, except their target. This attacker has to determine which database is the real one, between two options: one with their target in it (denote it $D_{in}$ ), the other without ( $D_{out}$ ).

So, in the attacker’s model of the world, the actual database $D$ can be either $D_{in}$ or $D_{out}$ . They might have an initial suspicion that their target is in the database. This suspicion is represented by a Bayesian probability $P [D = D_{in}]$ . This probability can be anything between $0$ and $1$ . Say, $0.9$ if the attacker’s suspicion is strong, $0.01$ if they think it is very unlikely, $0.5$ if they have no idea. Similarly, their suspicion that their target is not in the dataset is also a probability, $P [D = D_{out}]$ . Since there are only two options, $P [D = D_{out}] = 1 - P [D = D_{in}]$ .

Now, suppose the attacker sees that the mechanism returns output $O$ . How much information did the attacker gain? This is captured by looking at how much their suspicion changed after seeing this output. In mathematical terms, we have to compare the initial suspicion (the prior) $P [D = D_{in}]$ with the updated suspicion $P [D = D_{in} | M (D) = O]$ (the posterior). This posterior is the attacker’s model of the world after seeing $O$ . With $ε$ -differential privacy, the posterior is never too far from the prior, and we can quantify this phenomenon exactly, as shown in Proposition 1.

Proposition 1. Assume $D$ is a database chosen randomly between $D_{1}$ and $D_{2}$ . Then the posterior $P [D = D_{1} | M (D) = O]$ is bounded by:

\frac{P [D = D_{1}]}{e^{ε} + (1 - e^{ε}) P [D = D_{1}]} \leq P [D = D_{1} | M (D) = O] \leq \frac{e^{ε} \cdot P [D = D_{1}]}{e^{ε} + (e^{ε} - 1) P [D = D_{1}]} .

Proof. First, we use Bayes’ rule to express the posterior as a function of the prior:

\begin{matrix} P [D = D_{1} | M (D) = O] & = \frac{P [D = D_{1}] \cdot P [M (D) = O | D = D_{1}]}{P [M (D) = O]} = \frac{P [D = D_{1}] \cdot P [M (D_{1}) = O]}{P [M (D) = O]} . \end{matrix}

Then, we divide the posterior on $D_{1}$ and the posterior on $D_{2}$ to make the unknown $P [M (D) = O]$ term disappear:

\frac{P [D = D_{1} | M (D) = O]}{P [D = D_{2} | M (D) = O]} = \frac{P [D = D_{1}]}{P [D = D_{2}]} \cdot \frac{P [M (D_{1}) = O]}{P [M (D_{2}) = O]} .

Note that the right-most term are the same as in the definition of differential privacy, so we have:

e^{- ε} \leq \frac{P [M (D_{1}) = O]}{P [M (D_{2}) = O]} \leq e^{ε} .

Plugging this into the previous formula, we get:

\begin{matrix} e^{- ε} \frac{P [D = D_{1}]}{P [D = D_{2}]} \leq \frac{P [D = D_{1} | M (D) = O]}{P [D = D_{2} | M (D) = O]} \leq e^{ε} \frac{P [D = D_{1}]}{P [D = D_{2}]} . & (2.1) \end{matrix}

Replacing $P [D = D_{2}]$ by $1 - P [D = D_{1}]$ , doing the same for $P [D = D_{2} | M (D) = O]$ , and solving for $P [D = D_{1} | M (D) = O]$ , leads to the desired result. □

Note that the ratio of probabilities that appeared in the proof of Proposition 1 a simple interpretation: those are what gamblers call betting odds. In a sport game of team A vs. team B, betting odds of 2:1 means that according to bookies, the probability for A to win is twice as much as for B. This corresponds to probabilities of about 67% and 33%, respectively. In particular, Equation 2.1 show that an alternative interpretation of differential privacy is that betting odds cannot change much after knowing the output of a DP mechanism.

For example, with $ε = ln 3$ , the upper and lower bounds for the posterior as a function of the prior are depicted in Figure 2.1. The black line is what happens if the attacker did not get their prior updated at all. The blue lines are the lower and upper bounds on the updated suspicion: it can be anywhere between the two. We can visualize the example mentioned in the previous section: for an initial suspicion of 50%, the updated suspicion is approximately between 25% and 75%.

000001000001IU........np24682468itdiaaltesudspsuiscpioincion — Figure 2.1: Prior-posterior bounds for $ε$ -DP with $ε = ln 3$ .

This property can be used to convey the meaning of $ε$ to people who are not experts. An $ε$ of $ln 3$ can be said to guarantee that at most, the attacker “halves their uncertainty”. Similarly, $ε = ln 2$ can “reduce the uncertainty by one third”, while $ε = ln 4$ can “reduce it by two thirds”. Though not very formal, this formulation is understandable by most, and has the advantage of creating appropriate rejection reactions to large values of $ε$ .

What does it look like for various values of $ε$ ? In Figure 2.2, we visualize the generalization of the bounds in Figure 2.2. For larger values of $ε$ , the information gain increases fast: for example, with $ε = 5$ . Then, an attacker can go from a small suspicion (say, 10%) to a very high degree of certainty (94%).

000001000001InU1234567𝜀.2.4.6.8.2.4.6.8iptdiaalt seuds spuicsipoincion — Figure 2.2: Prior-posterior bounds for $ε$ -DP, for various values of $ε$ .

Properties of differential privacy

Differential privacy has several properties that make it a more convincing and convenient definition of anonymization than other definitions seen so far. First, many problems we pointed out in syntactic definitions do not apply to differential privacy. The definition protects anything associated with a single individual, including presence in the database and sensitive information. Differential privacy also protects against an attacker that has unlimited background knowledge: $D_{1}$ and $D_{2}$ can be arbitrary neighboring databases, so the attacker could even choose all records and only have uncertainty about their target. These two aspects greatly simplify policy choices: a manual assessment of the attacker’s goals and capabilities is no longer required.

As proven in the previous section, and explored in more detail in Section 2.2, differential privacy allows us to prove semantic properties: it limits the amount of information that an attacker can gain, no matter what attack method is used and what kind of output the mechanism generates. It does not matter what exact form the output takes: it can be a statistic, synthetic data, a machine learning model, etc.

In addition, differential privacy has a few properties that make it convenient to use. Its semantic guarantees are preserved by transformations that happen after the differential privacy stage. This property, called post-processing, makes it easy to reason about complex data pipelines that involve an anonymization step: everything downstream of the DP step stays DP.

Proposition 2 (Post-processing of differential privacy [30]). Let $M$ be an $ε$ -DP mechanism, and let $f$ be an arbitrary function. Then the function $D \to f (M (D))$ is also $ε$ -DP.

Differential privacy also satisfies convexity, a second property which is less crucial to practical use cases, but is another indication that DP has the kind of semantic guarantees that we expect: choosing between two DP mechanisms at random still leads to a DP mechanism.

Proposition 3 (Convexity of differential privacy [225 , 226]). Let $M_{1}$ and $M_{2}$ be two $ε$ -DP mechanisms, and $p \in (0, 1)$ . The mechanism $M$ defined by $M (D) = M_{1} (D)$ with probability $p$ , and $M (D) = M_{2} (D)$ with probability $1 - p$ , is $ε$ -DP.

Differential privacy also composes well: combining the result of two DP mechanisms is still DP. This property, especially combined with post-processing, is how most complex DP algorithms are built: by assembling simpler mechanisms. For example, a DP algorithm for sums can be combined with a DP algorithm for counts to create a DP algorithm for average. There are several types of composition: parallel composition, sequential composition, and adaptive composition. We introduce the first two below.

Proposition 4 (Parallel composition [122]). Let $M_{1}$ be a $ε_{1}$ -DP mechanism, and $M_{2}$ a $ε_{2}$ -DP mechanism. For any dataset $D$ , let $D_{1}$ and $D_{2}$ be the result of an operation that separates records in two disjoint datasets. Then the mechanism $M$ defined by $M (D) = (M_{1} (D_{1}), M_{2} (D_{2}))$ is $(max (ε_{1}, ε_{2}))$ -DP.

This property allows us to build locally differentially private mechanisms, in which a central server can compute global statistics without accessing the raw data from each individual. In this thesis, we mostly focus on sequential composition, which we simply call composition.

Proposition 5 (Sequential composition [122]). Let $M_{1}$ be a $ε_{1}$ -DP mechanism, and $M_{2}$ a $ε_{2}$ -DP mechanism. Then the mechanism $M$ defined by $M (D) = (M_{1} (D), M_{2} (D))$ is $(ε_{1} + ε_{2})$ -DP.

Note that the formula in Proposition 5 is not tight, especially when many mechanisms are composed. A significant body of research focuses on obtaining tighter composition bounds, either with or without knowledge of mechanisms used for composition [57 , 132 , 133, 213 , 278 , 284]. Proposition 5 still holds if $M_{2}$ depends on the value of $M_{1} (D)$ : this variant is called adaptive composition. This latter property allows to quantify the gain of information over time of an attacker interacting with a differentially private query engine. Proposition 5 still holds for sequential composition; most of the tighter results also do, with some exceptions [114].

Data privacy definitions that are a property of a mechanism and not of the output dataset were already proposed in as early as 2003 [144]. However, thanks to its useful properties, differential privacy quickly became the flagship of data privacy definitions. Thousands of scientific papers have explored this notion and its variants, or introduced algorithms that satisfy it. Large organizations are also investing in differential privacy research and exploring the use of differential privacy for practical use cases: examples include the US Census Bureau [4 , 159], Google [141], Apple [360], Microsoft [107], LinkedIn [224], Uber [210], etc.

The Laplace mechanism

Making simple statistics differentially private is relatively easy. Suppose, for example, that we have a database collecting people’s eye colors, and we want to publish how many people have green eyes? Publishing the true answer is not differentially private, even if a lot of people have green eyes. Indeed, with differential privacy, we need to assume that the attacker knows almost all records, and only has uncertainty about their target. In this example, it means that they already know the eye color of everyone in the dataset, except their target. So if we output the real number $k$ , they can compare it with the number of people with green eyes among the people they know. If it is $k - 1$ , then the target has green eyes. If it is $k$ , then the target does not.

What can we do to prevent this? We can add noise to the real statistic: instead of returning the true answer, we generate a random number from a certain probability distribution, add this number to the original answer, and return this noisy sum. The most common probability distribution used in differential privacy is called the Laplace distribution. This distribution has a parameter, its scale, which determines how “flat” it is.

Definition 7 (Laplace distribution). The Laplace distribution of mean 0 and of scale $b > 0$ is the probability distribution with probability density function:

PDF (x) = \frac{1}{2 b} exp (- \frac{| x |}{b}) .

Figure 2.3 shows the probability density function of a Laplace distribution of scale $b = \frac{1}{ln 3}$ .

−−024000042...246 — Figure 2.3: Laplace distribution of scale $b = \frac{1}{ln 3}$ .

To make our statistic above $ε$ -DP, we pick a random value according to a Laplace distribution of scale $\frac{1}{ε}$ , and we add it to the real value. Figure 2.4 gives a visual intuition of why it works. It shows the distribution of the number we return, depending on whether the true count is $1000$ (blue line, the target does not have green eyes) or $1001$ (red line, the target has green eyes).

991111000099,,,,...68000024600000246 — Figure 2.4: Laplace distribution of scale $b = \frac{1}{ln 3}$ added to $1000$ (blue) and $1001$ (red).

Suppose the real number is $1001$ , and we returned $1003$ . From the attacker’s perspective, what is the likelihood that the original value is $1000$ vs. $1001$ ? The hypothesis “the true answer is $1001$ ” is a bit more likely: generating a noise of $2$ is more likely than a noise of $3$ . How much more likely? The ratio of probability density functions is exactly:

\begin{matrix} \frac{PDF (2)}{PDF (3)} & = \frac{\frac{1}{2 b} exp (- \frac{| 2 |}{b})}{\frac{1}{2 b} exp (- \frac{| 3 |}{b})} = exp ((- 2 + 3) ln 3) = 3 . \end{matrix}

We can perform the same computation to compare any two events where we add a noise of $k$ vs. $k + 1$ for any $k$ , and it is easy to check that the shape of the Laplace distribution guarantees that the ratio will always be between $- 3$ and $3$ . This is exactly the guarantee we need for $ε$ -DP, with $ε = ln 3$ : adding Laplace noise of scale $\frac{1}{ε}$ is enough to guarantee $ε$ -DP for cases where we count unique individuals.

What if we want to compute and release more complicated statistics? For example, we might have a dataset of suggestions that your customers sent us using a feedback form, where each customer can use the form at most 5 times. We want to publish the number of suggestions we received in total, while the attacker wants to get an idea of how many suggestions their target published.

Adding noise picked from a Laplace distribution of scale $\frac{1}{ε}$ is not enough to get $ε$ -DP. Indeed, assume that the attacker’s target sent 5 suggestions, while our other 1000 customers sent one suggestion each. The influence of the target will be larger than before, as shown in Figure 2.5. The difference between the two curves is much larger than before; their ratio is at most $e^{5 ε}$ : using Laplace noise of scale $\frac{1}{ε}$ only gives $5 ε$ -DP.

99111111000099000000...68000001246024680 — Figure 2.5: Laplace distribution of scale $b = \frac{1}{ln 3}$ added to $1000$ (blue) and $1005$ (red).

To fix this, we need to add more noise. How much more? It depends on the maximum contribution of one individual customer. If the maximum amount of suggestions is 5, we must add 5 times the amount of noise. In this example, using Laplace noise of scale $\frac{5}{ε}$ would give us $ε$ -DP; we display this in Figure 2.6.

99111111024680099000000⋅⋅⋅⋅.1.1680000011111&#x0030246800000−−−−2222 — Figure 2.6: Laplace distribution of scale $b = \frac{5}{ln 3}$ added to $1000$ (blue) and $1005$ (red).

This informal reasoning can be extended to a wider class of statistics. The maximum contribution of an individual is called the global sensitivity, and adding Laplace noise scaled by this global sensitivity provides differential privacy. This method for making arbitrary statistics differentially private, which we formalized below, is called the Laplace mechanism.

Proposition 6 (Proposition 1 in [128]). Let $f : D \to R$ be a deterministic function. The global sensitivity of $f$ is the smallest number $s$ such that for all datasets $D_{1}$ and $D_{2}$ differing only in one record:

| f (D_{1}) - f (D_{2}) | \leq s .

Furthermore, the mechanism $M$ defined by $M (D) = f (D) + X$ , where $X$ is a random variable sampled from a Laplace distribution of scale $\frac{s}{ε}$ , is $ε$ -differentially private.

Note that if we want to protect individuals, we need to be careful with the definition of a record: all suggestions from the same customer must be part of the same record. We come back to these considerations in Section 2.2.3. In Section 4.2, we also explain how to use this basic building block for a wide range of practical applications.

When releasing integers, instead of the Laplace mechanism, we can use a discrete distribution instead: the geometric mechanism.

Definition 8 (Two-sided geometric distribution). The two-sided geometric distribution of mean 0 and of parameter $p \in (0, 1)$ is the probability distribution such that a random variable $X$ sampled from the distribution follows, for all $k \in Z$ :

P [X = k] = \frac{1 - p}{1 + p} p^{| k |} .

Proposition 7 ([166]). Let $f : D \to Z$ be a deterministic function of global sensitivity $s \in N$ . The mechanism $M$ defined by $M (D) = f (D) + X$ , where $X$ is a random variable sampled from the two-sided geometric distribution of mean $0$ and of parameter $p$ , is $ln (\frac{1}{p})$ -differentially private.

Using the geometric distribution for integer-valued data has multiple advantages: it avoids unexpected vulnerabilities from floating-point implementations [283], returns results that are easier to understand for non-experts (which might be surprised by a non-integer result for e.g. a count of unique people), and has slightly better utility than simply rounding the result of a Laplace distribution.

¹This notion is similar to the cryptographic notion of indistinguishability [169], except the cryptographic version has

ε = 0

, and an additive error term instead, which is negligible. A similar notion,

(1, ε)

-privacy, was defined in [70], where

(1 + ε)

used in place of

e^{ε}

, and it was also called log-ratio distance in [182]

LINKPREV LINKUP LINKNEXT