Lowering the cost of anonymization

3.2.2 Thresholding

Theorem 3 gives good $ε$ and $δ$ parameters when there are many people who vote with “enough randomness”: there is a $λ$ such that $λ < p_{i} < 1 - λ$ . The parameters have a dependency on $λn$ , which in practice translates to scenarios where both options have large counts with high probability. In many practical applications, however, it is hard to know in advance whether this will be the case. Consider, for example, a mobile app gathering usage metrics on possible sequences of actions carried by users within the app. Some sequences will be very probable, and have high counts. But if there are arbitrarily many such sequences, some will be very rare: like many practical distributions, there will be a long tail.

To protect the data from these outlier users, a typical protection employed is thresholding: only return the user count associated with a sequence if it is larger than a given threshold $T$ . What level of protection does such a technique provide? In this section, we formalize the intuition given in Example 7, and show that, assuming a passive attacker, thresholding provides protection when all voters vote with a very small or a very large probability. First, we formalize the notion of a thresholding mechanism in a simple context.

Definition 60 (Simple thresholding). Given a database $D = (D (1), \dots, D (n))$ with values in ${0, 1}$ and a threshold $T$ , the $T$ -thresholding mechanism $M_{T}$ evaluates $˜ k = \sum_{i} D (i)$ and returns $⊥$ if $˜ k \leq T$ , and $k$ otherwise.

Note that $M_{T}$ only thresholds low counts. In many practical situations however, thresholding must be applied in both directions: it must also catch the case where, with high probability, almost all records are $1$ . This situation is symmetrical to thresholding low counts: without loss of generality, we can assume $T < n ∕ 2$ , and the symmetric version of all results in this section hold.

Let us now show the main result of this section: if participants vote $1$ with low probability, then thresholding protects against passive attackers, and in some cases also against certain active attackers. This privacy property only holds if the expected value of the count is lower than the threshold; and the level of protection depends on the ratio between the threshold and the expected value (denoted by $r$ below).

Theorem 4. Let $θ$ be a distribution that returns $n$ independent records, each of which is $1$ with low probability: $D (i) \sim Ber (p_{i})$ , and $p_{i} < p$ for all $i$ ; moreover, let $Θ = {θ}$ . Suppose that there is no partial knowledge, i.e., $| B | = 0$ , and let us denote by $f (s, n, p)$ the probability that a random variable following a binomial distribution with parameters $n$ and $p$ has value $s$ : $f (s, n, p) = p^{s} {(1 - p)}^{n - s} (\frac{n}{s})$ .

Then, if $r = \frac{p (n - 1)}{(1 - p) T} < 1$ , $M_{T}$ is $(Θ, ε, δ)$ -APKDP (and thus, $(Θ, ε, δ)$ -PPKDP), with:

\begin{matrix} ε & = - ln (1 - \frac{f (T, n - 1, p)}{1 - r}) δ & = \frac{f (T, n - 1, p)}{1 - r} \end{matrix}

For a large $n$ , assuming $pn$ is fixed, we can use the Poisson approximation and get $δ \approx \frac{{(pn)}^{T} e^{- pn}}{(1 - r) T!}$ . If this quantity is small enough, $ε \approx δ$ .

If the background knowledge $B$ is not empty, assume that the attacker knows a subset $| B |$ of records. Let $b_{max}$ be such that $r_{b} = \frac{p | B |}{(1 - p) b_{max}} < 1$ and $r^{'} = \frac{p (n - | B | - 1)}{(1 - p) (T - b_{max})} < 1$ . Then $M_{T}$ is $(Θ, ε, δ)$ -PPKDP, with:

\begin{matrix} ε & = - ln (1 - \frac{f (T - b_{max}, n - | B | - 1, p)}{1 - r^{'}}) δ & = \frac{f (b_{max}, | B |, p)}{1 - r_{b}} + \frac{f (T - b_{max}, n - | B | - 1, p)}{1 - r^{'}} . \end{matrix}

Proof. The proof is presented in three stages.

1.: First, we consider the simpler case where all $p_{i}$ are equal and there is no background knowledge. This allows us to compute the PLRV exactly, and we can then split the output space into two parts. Most of its mass will be in the $⊥$ event, and we can compute it there exactly. All other events will be captured by $δ$ .
2.: Second, we extending this to non-empty partial knowledge in a similar fashion: for some $b_{max}$ , with high probability, the background knowledge will not have more than $b_{max}$ records whose value is 1: the rest of the probability mass goes in the $δ$ , and this allows us to use the previous idea with a new threshold $T^{'} = T - b_{max}$ .
3.: Finally, we use a coupling argument to extend this to the case where the $p_{i}$ are not all the same.

First, let us compute the PLRV for $M_{T}$ depending on the output $k$ and the value of the background knowledge $ˆ B$ , assuming a simple distribution $θ$ where records are i.i.d. Denote by $b$ the number of records in $ˆ B$ which are 1 and by $¯ ¯ b = | ˆ B | - b$ the number of records that are 0. The targeted record will be called $D (t)$ , and we assume it is never part of $ˆ B$ . Let $θ$ be a distribution that returns $n$ i.i.d records according to $D (i) \sim Ber (p)$ . Then we can directly compute:

\begin{matrix} L_{i, 0, 1}^{M_{T}, θ} (k, ˆ B) & = ln \frac{P [M (D) = k ∣ ∣ D (t) = 0, B = ˆ B]}{P [M (D) = k ∣ ∣ D (t) = 1, B = ˆ B]} = ⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ \begin{matrix} ln \frac{T - b \sum s = 0 f (s, n - | B | - 1, p)}{T - b - 1 \sum s = 0 f (s, n - | B | - 1, p)} & if k = ⊥ \end{matrix} \end{matrix}

Note that if $k = b$ , then $L_{i, 0, 1}^{M_{T}, θ} (k, B) = \infty$ . The case where $k < b$ is impossible regardless of $D (i)$ : there cannot be more $1$ s in the background knowledge than the mechanism outputs. In the case where there is no background knowledge, this becomes:

L_{i, 0, 1}^{M_{T}, θ} (k) = ⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ \begin{matrix} ln \frac{T \sum s = 0 f (s, n - 1, p)}{T - 1 \sum s = 0 f (s, n - 1, p)} & if k = ⊥ \end{matrix}

This calculation allows us to bound $ε$ and $δ$ in the simpler case where there is no background knowledge. To do so, we need a technical lemma to bound the probability mass of the tail of the binomial distribution appearing above.

Lemma 1. For any $n$ , $p$ , and $m$ such that $m > \frac{pn}{1 - p}$ , we have:

\begin{matrix} n \sum s = m f (s, n, p) < \frac{f (m, n, p)}{1 - \frac{pn}{(1 - p) m}} \end{matrix}

Proof. Note that for all $s \geq m$ :

\frac{f (s + 1, n, p)}{f (s, n, p)} = \frac{p}{1 - p} (\frac{n - s}{s + 1}) < \frac{pn}{(1 - p) m} .

Since $m > \frac{pn}{1 - p}$ , this is strictly lower than $1$ , so the sum converges at least as fast as a geometric series, which directly gives the desired result. □

We can now start proving the main theorem; first in the case where there is no background knowledge and all $p_{i}$ are equal. There are two possibilities to consider, $L_{i, 1, 0}^{M_{T}, θ}$ and $L_{i, 0, 1}^{M_{T}, θ}$ . First, we have:

\begin{matrix} P_{θ} [M_{T} (D) \Leftrightarrow ⊥ | D (i) = 0] & < P_{θ} [M_{T} (D) \Leftrightarrow ⊥ | D (i) = 1] = 1 - T - 1 \sum s = 0 f (s, n - 1, p) = n - 1 \sum s = T f (s, n - 1, p) < \frac{f (T, n - 1, p)}{1 - \frac{p (n - 1)}{(1 - p) T}} \end{matrix}

since $T > \frac{p (n - 1)}{1 - p}$ , so we can use Lemma 1. Let us denote this quantity as $δ$ . Now, we have

\begin{matrix} L_{i, 1, 0}^{M_{T}, θ} (⊥) = ln \frac{T - 1 \sum s = 0 f (s, n - 1, p)}{T \sum s = 0 f (s, n - 1, p)} < 0 . \end{matrix}

Furthermore:

\begin{matrix} L_{i, 0, 1}^{M_{T}, θ} (⊥) & = ln \frac{T \sum s = 0 f (s, n - 1, p)}{T - 1 \sum s = 0 f (s, n - 1, p)} < ln ⎛ ⎜ ⎜ ⎜ ⎜ ⎝ \frac{1}{1 - n - 1 \sum s = T f (s, n - 1, p)} ⎞ ⎟ ⎟ ⎟ ⎟ ⎠ < - ln ⎛ ⎜ ⎜ ⎝ 1 - \frac{f (T, n - 1, p)}{1 - \frac{p (n - 1)}{(1 - p) T}} ⎞ ⎟ ⎟ ⎠ . \end{matrix}

Thus, when the output is thresholded, the PLRV is smaller than $ε = - ln (1 - \frac{f (T, n - 1, p)}{1 - r})$ , and the event “the output is not thresholded” only happens with a probability smaller than $δ$ , which proves the initial statement in the simpler case.

Now, in the case where the background knowledge is non-empty, we must not only split the output space, but also $B$ as well. Denoting $b$ the number of “1” entries in $B$ , there are three cases we must consider:

1.: $b \geq b_{max}$ : if $b_{max}$ is large enough, this happens with small probability, which we put in the $δ$ term;
2.: $b < b_{max}$ and $M_{T} (D) \Leftrightarrow ⊥$ : if $T^{'} = T - b_{max}$ is large enough, this happens with small probability, which we put in the $δ$ ;
3.: $b < b_{max}$ and $M_{T} (D) = ⊥$ : this is the event in which most of the probability mass is concentrated on, so we bound its privacy loss to obtain $ε$ .

The probability of the first event can be bounded by:

\begin{matrix} P_{θ} [b \geq b_{max}] & = | B | \sum s = b_{max} f (s, | B |, p) < \frac{f (b_{max}, | B |, p)}{1 - \frac{p | B |}{(1 - p) b_{max}}} \end{matrix}

by Lemma 1. Similarly, the probability of the second event can be bounded by:

\begin{matrix} P_{θ} [M_{T} (D) \Leftrightarrow ⊥ | b < b_{max}, D (i) = 1] & < n - | B | - 1 \sum s = T^{'} f (s, n - | B | - 1, p) < \frac{f (T^{'}, n - | B | - 1, p)}{1 - \frac{p (n - | B | - 1)}{(1 - p) T^{'}}} \end{matrix}

so we can bound $δ$ by the sum of those two terms. Now, let us compute the privacy loss for the third case. Assuming $b < b_{max}$ , we have:

\begin{matrix} L_{i, 1, 0}^{M_{T}, θ} (⊥, ˆ B) = ln \frac{T - b - 1 \sum s = 0 f (s, n - | B | - 1, p)}{T - b \sum s = 0 f (s, n - | B | - 1, p)} < 0 \end{matrix}

and:

\begin{matrix} L_{i, 0, 1}^{M_{T}, θ} (⊥, ˆ B) & = ln \frac{T - b \sum s = 0 f (s, n - | B | - 1, p)}{T - b - 1 \sum s = 0 f (s, n - | B | - 1, p)} < ln ⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ \frac{1}{T^{'} - 1 \sum s = 0 f (s, n - | B | - 1, p)} ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ < - ln (1 - \frac{f (T^{'}, n - | B | - 1, p)}{1 - r^{'}}) \end{matrix}

Denoting this by $ε$ , this proves the theorem in the special case where all $p_{i}$ are equal to a constant $p$ .

Now, we extend the first case (where the background knowledge is empty) to the case where all $p_{i}$ are different, and $p_{i} < p$ for all $p$ . Let us denote $θ_{p}$ the distribution where all users vote with the same probability $p$ . Let $g (s, n, p) = P_{θ} [M_{T} (D) = s]$ . By a simple coupling argument between $θ$ and $θ_{p}$ , we have for all $t$ :

\begin{matrix} n \sum s = t + 1 g (s, n, p) & = P_{θ} [M_{T} (D) > t] \leq P_{θ_{p}} [M_{T} (D) > t] = n \sum s = t + 1 f (s, n, p) . \end{matrix}

We can then use this fact throughout the previous proof. The application of this bound for $δ$ is immediate, and for $ε$ , we have:

\begin{matrix} L_{i, 0, 1}^{M_{T}, θ} (⊥, ˆ B) = ln \frac{T \sum s = 0 g (s, n - 1, p)}{T - 1 \sum s = 0 g (s, n - 1, p)} . \end{matrix}

We can bound the numerator by $1$ and the denominator expands to:

\begin{matrix} T - 1 \sum s = 0 g (s, n - 1, p) & = 1 - n - 1 \sum s = T g (s, n - 1, p) \geq 1 - n - 1 \sum s = T f (s, n - 1, p) = T - 1 \sum s = 0 f (s, n - 1, p) \end{matrix}

so we can reuse the previous bound. The bounds translate to the case where the background knowledge is not empty by a similar argument. □

As shown in Figure 3.4, when the threshold is above the expected value, the values $ε$ and $δ$ given by Theorem 4 are very close. Moreover, for large $n$ , these values are extremely small. This shows that thresholding counts constitutes a good practice, which can be used to meaningfully improve user privacy without having to know about the data distribution in advance, like in the usage statistics example at the beginning of this section.

A practitioner can apply the following reasoning: for each possible sequence of actions captured by the system collecting app usage statistics, either many users are likely to have a value of 1, in which case Theorem 3 applies and thresholding will likely not impact data utility; or the vast majority of users will have a value of $0$ , in which case Theorem 4 applies and thresholding will protect the rare users whose value is $1$ .

What if the attacker has non-zero partial knowledge, but is able to interact with the system? We saw in Example 7 that if this partial knowledge is larger than the threshold, the mechanism is not private. But if this partial knowledge is small enough, then privacy is still possible: it is equivalent to reducing the threshold for an attacker with no partial knowledge.

Proposition 28. Let $θ$ be the same distribution as in Theorem 4, with background knowledge of size $| B | \leq T$ . Let $θ^{'}$ be the equivalent distribution but with $n^{'} = n - | B |$ , and no partial knowledge. Then for any $ε$ and $δ$ , $M_{T}$ is $(θ, ε, δ)$ -APKDP iff $M_{T - | B |}$ is $({θ^{'}}, ε, δ)$ -PPKDP.

Proof. By writing down its explicit value, one can see that ${APK}_{i, t, t^{'}, B}^{'}$ only depends on the difference between $T$ and the number of ones $b$ in the background knowledge $B$ . The same applies for the variables $n$ and $| B |$ , which appear only in the form of $n - | B |$ . This shows the equivalence of $M_{T}$ being $({θ}, ε, δ)$ -APKDP and $M_{T - | B |}$ being $({θ^{'}}, ε, δ)$ -APKDP. Since APKDP and PPKDP are the same when there is no background knowledge, the statement follows. □

68111111111111T024680000000−−−−−−0654321𝜀δ — Figure 3.4: $ϵ$ and $δ$ from Theorem 4 as a function of the threshold $T$ , where $| B | = 0$ , $p = 0.005$ , and three different values of $n$ : $n = 1000$ (top), $n = 10, 000$ (middle), and $n = 100, 000$ (bottom).

−1−1−8−4062𝜀
678911111111T000001200000 000δ — Figure 3.4: $ϵ$ and $δ$ from Theorem 4 as a function of the threshold $T$ , where $| B | = 0$ , $p = 0.005$ , and three different values of $n$ : $n = 1000$ (top), $n = 10, 000$ (middle), and $n = 100, 000$ (bottom).

LINKPREV LINKUP LINKNEXT