Lowering the cost of anonymization

a PhD thesis

3.2.2  Thresholding

Theorem 3 gives good and parameters when there are many people who vote with “enough randomness”: there is a such that . The parameters have a dependency on , which in practice translates to scenarios where both options have large counts with high probability. In many practical applications, however, it is hard to know in advance whether this will be the case. Consider, for example, a mobile app gathering usage metrics on possible sequences of actions carried by users within the app. Some sequences will be very probable, and have high counts. But if there are arbitrarily many such sequences, some will be very rare: like many practical distributions, there will be a long tail.

To protect the data from these outlier users, a typical protection employed is thresholding: only return the user count associated with a sequence if it is larger than a given threshold . What level of protection does such a technique provide? In this section, we formalize the intuition given in Example 7, and show that, assuming a passive attacker, thresholding provides protection when all voters vote with a very small or a very large probability. First, we formalize the notion of a thresholding mechanism in a simple context.

Definition 60 (Simple thresholding). Given a database with values in and a threshold , the -thresholding mechanism evaluates and returns if , and otherwise.

Note that only thresholds low counts. In many practical situations however, thresholding must be applied in both directions: it must also catch the case where, with high probability, almost all records are . This situation is symmetrical to thresholding low counts: without loss of generality, we can assume , and the symmetric version of all results in this section hold.

Let us now show the main result of this section: if participants vote with low probability, then thresholding protects against passive attackers, and in some cases also against certain active attackers. This privacy property only holds if the expected value of the count is lower than the threshold; and the level of protection depends on the ratio between the threshold and the expected value (denoted by below).

Theorem 4. Let be a distribution that returns independent records, each of which is with low probability: , and for all ; moreover, let . Suppose that there is no partial knowledge, i.e., , and let us denote by the probability that a random variable following a binomial distribution with parameters and has value : .

Then, if , is -APKDP (and thus, -PPKDP), with:

For a large , assuming is fixed, we can use the Poisson approximation and get . If this quantity is small enough, .

If the background knowledge is not empty, assume that the attacker knows a subset of records. Let be such that and . Then is -PPKDP, with:

Proof. The proof is presented in three stages.

First, we consider the simpler case where all are equal and there is no background knowledge. This allows us to compute the PLRV exactly, and we can then split the output space into two parts. Most of its mass will be in the event, and we can compute it there exactly. All other events will be captured by .
Second, we extending this to non-empty partial knowledge in a similar fashion: for some , with high probability, the background knowledge will not have more than records whose value is 1: the rest of the probability mass goes in the , and this allows us to use the previous idea with a new threshold .
Finally, we use a coupling argument to extend this to the case where the are not all the same.

First, let us compute the PLRV for depending on the output and the value of the background knowledge , assuming a simple distribution where records are i.i.d. Denote by the number of records in which are 1 and by the number of records that are 0. The targeted record will be called , and we assume it is never part of . Let be a distribution that returns i.i.d records according to . Then we can directly compute:

Note that if , then . The case where is impossible regardless of : there cannot be more s in the background knowledge than the mechanism outputs. In the case where there is no background knowledge, this becomes:

This calculation allows us to bound and in the simpler case where there is no background knowledge. To do so, we need a technical lemma to bound the probability mass of the tail of the binomial distribution appearing above.

Lemma 1. For any , , and such that , we have:

Proof. Note that for all :

Since , this is strictly lower than , so the sum converges at least as fast as a geometric series, which directly gives the desired result. □

We can now start proving the main theorem; first in the case where there is no background knowledge and all are equal. There are two possibilities to consider, and . First, we have:

since , so we can use Lemma 1. Let us denote this quantity as . Now, we have


Thus, when the output is thresholded, the PLRV is smaller than , and the event “the output is not thresholded” only happens with a probability smaller than , which proves the initial statement in the simpler case.

Now, in the case where the background knowledge is non-empty, we must not only split the output space, but also as well. Denoting the number of “1” entries in , there are three cases we must consider:

: if is large enough, this happens with small probability, which we put in the term;
and : if is large enough, this happens with small probability, which we put in the ;
and : this is the event in which most of the probability mass is concentrated on, so we bound its privacy loss to obtain .

The probability of the first event can be bounded by:

by Lemma 1. Similarly, the probability of the second event can be bounded by:

so we can bound by the sum of those two terms. Now, let us compute the privacy loss for the third case. Assuming , we have:


Denoting this by , this proves the theorem in the special case where all are equal to a constant .

Now, we extend the first case (where the background knowledge is empty) to the case where all are different, and for all . Let us denote the distribution where all users vote with the same probability . Let . By a simple coupling argument between and , we have for all :

We can then use this fact throughout the previous proof. The application of this bound for is immediate, and for , we have:

We can bound the numerator by and the denominator expands to:

so we can reuse the previous bound. The bounds translate to the case where the background knowledge is not empty by a similar argument. □

As shown in Figure 3.4, when the threshold is above the expected value, the values and given by Theorem 4 are very close. Moreover, for large , these values are extremely small. This shows that thresholding counts constitutes a good practice, which can be used to meaningfully improve user privacy without having to know about the data distribution in advance, like in the usage statistics example at the beginning of this section.

A practitioner can apply the following reasoning: for each possible sequence of actions captured by the system collecting app usage statistics, either many users are likely to have a value of 1, in which case Theorem 3 applies and thresholding will likely not impact data utility; or the vast majority of users will have a value of , in which case Theorem 4 applies and thresholding will protect the rare users whose value is .

What if the attacker has non-zero partial knowledge, but is able to interact with the system? We saw in Example 7 that if this partial knowledge is larger than the threshold, the mechanism is not private. But if this partial knowledge is small enough, then privacy is still possible: it is equivalent to reducing the threshold for an attacker with no partial knowledge.

Proposition 28. Let be the same distribution as in Theorem 4, with background knowledge of size . Let be the equivalent distribution but with , and no partial knowledge. Then for any and , is -APKDP iff is -PPKDP.

Proof. By writing down its explicit value, one can see that only depends on the difference between and the number of ones in the background knowledge . The same applies for the variables and , which appear only in the form of . This shows the equivalence of being -APKDP and being -APKDP. Since APKDP and PPKDP are the same when there is no background knowledge, the statement follows. □

678911111111T000001200000 000δ
5566666711111T6802468000000 00000000δ
Figure 3.4:  and from Theorem 4 as a function of the threshold , where , , and three different values of : (top), (middle), and (bottom).
All opinions here are my own, not my employers.
I'm always glad to get feedback! If you'd like to contact me, please do so via e-mail (se.niatnofsed@neimad) or Twitter (@TedOnPrivacy).