Lowering the cost of anonymization

3.2.1 Counting queries

The initial motivation for limiting the attacker’s background knowledge was to show that, under this assumption, some noiseless mechanisms preserve the individuals’ privacy [46]. A typical example is a counting query, which answers the question “How many users satisfy $P$ ?” for some property $P$ . We can model this by a data-generating distribution $θ$ where each record $D (i)$ is either $0$ or $1$ with some probability $p_{i}$ , and we want to measure the privacy of the mechanism $M (D) = \sum_{i} D (i)$ . Records are assumed to be independent, and the adversary is assumed to know some portion of the records. As an immediate consequence of Theorem 2 and Proposition 26, it does not matter whether the attacker can modify, or only see, this portion of records: the values of $ε$ and $δ$ are identical for APKDP and PPKDP.

Furthermore, the closer $p_{i}$ are to $0$ or $1$ , the less randomness is present in the data. For extremely small or large values of $p_{i}$ , the situation is very similar to one where the attacker exactly knows $D (i)$ . As such, it is natural to assume that among the records that are unknown by the attacker, all $p_{i}$ are between $λ$ and $1 - λ$ , for some $λ$ not too close to $0$ . This assumption can easily be communicated to non-specialists: “we assume that there are at least 1000 records that the attacker does not know, and that their level of uncertainty is at least 10% for these records.”

Initial asymptotic results in this context appeared in [46] and more precise bounds were derived in [176]. In the special case where all $p_{i}$ are equal to a fixed value $p$ , Theorem 5 in [46] and Theorem 1 in [176] show that counting queries are APKDP with $ε = O (\sqrt{\frac{ln (1 ∕ δ)}{n}})$ (for small $δ$ , and increasing $n$ ) and $δ = e^{- Ω (ε^{2} n)}$ (for small $ε$ , and increasing $n$ ). This provides tiny values of $ε$ and $δ$ for moderate values of $n$ and $p$ . However, the assumption that all $p_{i}$ are identical is unrealistic: in practice, there is no reason to assume that all users have an equal chance of satisfying $P$ . Theorem 7 in [46] and Theorem 2 in [176] show that without this assumption, the $ε$ obtained is still small: $ε = O (\sqrt{\frac{ln (n)}{n}})$ , but the upper bound obtained on $δ$ is significantly larger: $δ = O (\frac{1}{\sqrt{n}})$ . This is more than what is typically acceptable; a common recommendation is to choose $δ = o (\frac{1}{n})$ .

In the following theorem, we show that the exponential decrease of $δ$ with $n$ still holds in the general case where all $p_{i}$ are different. For simplicity, we assume that the attacker has no background knowledge: because all records are independent, adding some partial knowledge has a fixed, reversible effect on the output space, similarly to $Θ$ -reducibility. In this case, having the attacker know $m$ records out of $n$ is the same as having the attacker know no records among $n - m$ .

Theorem 3. Let $θ$ be a distribution that generates $n$ records, where $D (i)$ is the result of an independent Bernoulli trial of probability $p_{i}$ . Let $λ$ be such that for all $i$ , $λ < p_{i} < 1 - λ$ . Let $M$ be defined by $M (D) = \sum_{i} D (i)$ . Then $M$ is $(Θ, ε, δ)$ -APKDP, for any $ε$ and $δ$ such that:

δ \geq P [\frac{X}{Y} \geq ε]

where $X$ and $Y$ are independent random variables sampled from a binomial distribution with $n - 1$ trials and success probability $2 λ$ . For a fixed $ε \leq 1$ , this condition is satisfied if:

ε \geq max (\sqrt{\frac{14 ln (1 ∕ δ)}{λ (n - 1)}}, \frac{27}{λ (n - 1)})

which gives $δ = e^{- Ω (ε^{2} λn)}$ .

Proof. The proof uses existing results on privacy amplification by shuffling: in [27 , 140], the authors show that adding noise independently to each data point, and then shuffling the results (hiding from the attacker which record comes from which user), provide strong DP guarantees. Even though our problem looks different, the same reasoning can be applied. First, we show that $θ$ can be seen as applying randomized response on each record. Then, since a counting query is a symmetric boolean function, it can be composed with a shuffle of its input, which allows us to use amplification by shuffling.

Let us formalize this intuition. A Bernoulli trial of probability $p_{i}$ (denoted $Bernoulli (p_{i})$ ), with $λ < p_{i} < 1 - λ$ , can be decomposed into the following process:

Generate $b \sim Bernoulli (2 λ)$ .
If $b = 0$ , return $b_{ν_{i}} \sim Bernoulli (\frac{p_{i} - λ}{1 - 2 λ})$ .
If $b = 1$ , return $b_{rr} \sim Bernoulli (0.5)$ .

This can be seen as a randomized response process applied on some input $b_{ν_{i}}$ , itself random: $θ = R_{2 λ} (ν)$ , where $ν$ is a distribution generating $n$ records, where the $i$ -th record is generated by $b_{ν_{i}} \sim Bernoulli (\frac{p_{i} - λ}{1 - 2 λ})$ , and $R_{2 λ}$ is a binary randomized response process with parameter $2 λ$ .

Further, note that $M$ can be seen as the composition between itself and a pre-shuffling phase: $M = M \circ S$ , where $S : D \to D$ is a function that applies a random permutation to the input records. Thus, $M {(D)}_{| D \sim θ} = M {(S (R_{2 λ} (D)))}_{| D \sim ν}$ . We can now apply Theorem 3.1 in [27] and its proof to show that $S \circ R_{2 λ}$ is $(ε, δ)$ -DP for $ε$ within the constraints above (with $k = 2$ and $γ = 2 λ$ ). By post-processing, $M \circ S \circ R_{2 λ}$ is also $(ε, δ)$ -DP, which directly yields that $M$ is $(ε, δ)$ -APKDP.

Note that we omitted a small technical detail: conditioning $θ$ on $D (i) = a$ is not identical to conditioning $ν$ on $D (i) = a$ , since no noise is added to the record $i$ in the former case. To fix this, we need to define $R_{2 λ}$ as randomizing all records except a fixed one $i$ . The proof of Theorem 3.1 in [27] assumes that no noise is added to the target record, so the result still holds. □

We compare this result with the previous state-of-the-art. First, we reformulate a previously known result from [176] that applies to our setting.

Proposition 27 (Theorem 3 in [176]). Let $θ$ be a distribution that generates $n$ records, where $D (i)$ is the result of an independent Bernoulli trial of probability $p_{i}$ , and let $M$ be defined by $M (D) = \sum_{i} D (i)$ . Let $μ_{2} = \frac{1}{n} \sum_{i} p_{i} (1 - p_{i})$ and $μ_{3} = \frac{1}{n} \sum_{i} p_{i} (1 - p_{i}) | 1 - 2 p_{i} |$ be respectively the average second moment and average absolute third moment of the $D (i)$ . Then for any $δ_{2} \geq 1.25 e_{2}^{- n μ_{2} ∕ 2}$ , $M$ is $(θ, ε, δ)$ -APKDP, with

ε = \sqrt{\frac{2 ln (1.25 ∕ δ_{2})}{n μ_{2}}} and δ = \frac{1.12 μ_{3}}{\sqrt{n} μ_{2}^{3}} (1 + e^{ε}) + δ_{2} .

Proof. For $δ_{2} = \frac{4}{5 \sqrt{n}}$ , this is a direct application of Theorem 3 in [176]. Changing the value $δ_{2}$ in its proof (Appendix A.2) allows us to obtain the more general formula above. This requires Fact 1 to be true, which is the case when $ε \leq 1$ (the authors omit this detail), or equivalently, when $δ_{2} \geq 1.25 e_{2}^{- n μ_{2} ∕ 2}$ . □

The comparison between this result and Theorem 3 is not completely straightforward. Aside from $n$ , Theorem 3 only depends on a global bound on the “amount of randomness” ( $p_{i}$ ) of each user, while Proposition 27 depends on the average behavior of all users. As such, the global bound $λ$ can be small because of one single user having a low $p_{i}$ , even if all other users have a lot of variance because their $p_{i}$ is close to 0.5. We therefore provide two experimental comparisons. In the first one, $p_{1} = λ = 0.05$ and $p_{i} = 0.5$ for all $i > 1$ . This case is designed to have the parameters of Theorem 3 underperform (as we underestimate the total amount of randomness) and those of Proposition 27 perform well. In the second one, the $p_{i}$ are uniformly distributed in $[λ, 1 - λ] = [0.05, 1 - 0.05]$ . In both cases, we compare the $(ε, δ)$ graphs obtained for $n = 1000$ and $n = 100, 000$ , and present the results in Figure 3.2.

The graphs show that if we consider the smallest possible $ε$ given by the definitions, our theorem leads to a large $δ$ : with $ε = Θ (1 ∕ \sqrt{n})$ , we obtain $δ = Θ (1)$ ; in contrast, Proposition 27 leads to $δ = Θ (1 ∕ \sqrt{n})$ . However, increasing $ε$ to slightly larger values quickly leads to tiny values of $δ$ , which was impossible with the previous state-of-the-art results. They also show that the closed-form bound from [27] is far from tight, as numerically computating these bounds improves them by several orders of magnitude. This leads to a natural open question: is there a better asymptotic formulation of the bounds given by amplification by shuffling for randomized response?

0011211111𝜀δ.5.500000−−−−−97531PPTTrrhhooeeppoooorrsseeiimmtitioo33nn,, 2 2 c n77lu,,om c cseaaerssdiee- fco12rm — Figure 3.2: Comparison of $(ε, δ)$ bounds given by Theorem 3 and Proposition 27, for $λ = 0.05$ , $n = 100$ (left) and $n = 10, 000$ (right).
Case 1: all but one $p_{i}$ are 0.5. Case 2: the $p_{i}$ are distributed uniformly over $[0.05, 0.95]$ .

000001111111𝜀.2.4.6.8000000−−−−−01196352 — Figure 3.2: Comparison of $(ε, δ)$ bounds given by Theorem 3 and Proposition 27, for $λ = 0.05$ , $n = 100$ (left) and $n = 10, 000$ (right).
Case 1: all but one $p_{i}$ are 0.5. Case 2: the $p_{i}$ are distributed uniformly over $[0.05, 0.95]$ .

What is the impact of $λ$ on the privacy guarantees? In Figure 3.3, we plot the $ε$ obtained for $δ = 0.01 ∕ n$ as a function of $λ$ , for various values of $n$ .

n = 100
n = 1000
n = 10,000
000001234λ𝜀.0.0.0.100101n 1= 100,000 — Figure 3.3: Comparison of $ε$ bounds given by the numerical computation of Theorem 3, with varying $λ$ , for various values of $n$ and $δ = 0.01 ∕ n$ .

One natural application for this result is voting: in typical elections, the total tally is released without any noise. Adding noise to the election results, or not releasing them, would both be unacceptable. Thus, the results are not $(ε, δ)$ -DP for any $(ε, δ)$ parameters, even though publishing the tally is not perceived as a breach of privacy. The intuitive explanation for this is that attackers are assumed not to have complete background knowledge of the secret votes. Our results confirm this intuition and quantify it. These results can easily be extended to votes between multiple candidates.

Corollary 1. Let $θ$ be a distribution that generates $n$ records, where $D (i) \in {1, \dots, K}$ for $K > 1$ , and where $P [D_{i} = k] = p_{i, k}$ , and every record is independent from all others. Let $λ$ be such that for all $i$ and all $k$ , $λ < p_{i, k}$ . Let $M$ return the histogram of all values: $M (D) = (N_{1}, \dots, N_{K})$ , where $N_{k}$ is the number of records $i$ such that $D (i) = k$ . Then $M$ is $(Θ, ε, δ)$ -APKDP, for any $ε \leq 1$ and $δ > 0$ such that:

ε \geq max (\sqrt{\frac{14 ln (1 ∕ δ)}{λ (n - 1)}}, \frac{27}{λ (n - 1)}) .

Proof. The proof is the same as for Theorem 3. With multiple options, the parameter $γ$ of the multi-category randomized response is $γ = Kλ$ , which leads to the same $(ε, δ)$ parameters. □

The results in this section apply to individual counting queries. This applies to scenarios like votes, but in many practical use cases, multiple queries are released. Can the results of this section be generalized to these cases? In general, noiseless mechanism do not compose. For example, fixing an individual $t$ , queries like “How many people voted 1?” and “How many people who are not $t$ voted 1?” can both be private on their own. However, publishing both results will reveal $t$ ’s vote: the composition of both queries cannot be private. Are there special cases where noiseless counting queries can be composed?

One such case happens when each counting query contains the data of a number of new users, independent from users in previous counting queries. This can happen in situations where statistics are collected on actions that each user can only do once, for example, registering with an online service. In this case, we can restrict the privacy analysis of each new query to the set of independent users in its input, and use the previous results from this section: this approach is formalized and proven in Theorem 1 in [46].

What if this approach is impossible, for example if there are dependencies between each input record of each query? For example, a referendum could ask voters multiple questions, with correlations between the different possible answers. Another example could be app usage statistics published every day, where the data for day $d + 1$ for each user is correlated to the user’s data on the previous day $d$ . In this case, to compute the privacy loss of the first $d$ binary queries, we can consider them as a single query with $2^{d}$ options. Afterwards, we can take into account the temporal correlations to compute the probabilities associated with each option, and use Corollary 1.

LINKPREV LINKUP LINKNEXT