
3.3.3  Weakening the privacy definition

Our main result is negative: no cardinality estimator satisfying our privacy definition can maintain good accuracy. Thus, it is natural to wonder whether our privacy definition is too strict, and whether the result still holds for weaker variants.

In this section, we consider two classes of weaker variants of our privacy definition: allowing the privacy loss to be larger than $\varepsilon$ on some inputs (relaxing the definition along dimension Q, described in Section 2.2.2), and allowing some proportion of users to not be protected (relaxing the definition along dimension V, described in Section 2.2.4).

First, we show that for the former class of variants, our negative result still holds: allowing a small probability of bad outcomes does not help. Second, we show that even though our negative result does not hold for the latter class of variants, cardinality estimators used in practice still leak a lot of information, even when we average the privacy loss across users.

Allowing larger values of the privacy loss

$\varepsilon$-sketch privacy provides a bound on how much information the attacker can gain in the worst case. As exemplified by the DP variants along dimension Q in Section 2.2.2, it is natural to relax this requirement.

A first natural relaxation is to accept a small probability of failure, similarly to $(\varepsilon,\delta)$-DP (Definition 14 in Section 2.2.2): requiring a bound on the information gain in most cases, and accepting a potentially unbounded information gain with low probability.

Definition 70. A cardinality estimator satisfies $(\varepsilon,\delta)$-probabilistic sketch privacy above cardinality $N$ if for every $n \geq N$, $t \in \mathcal{U}$, and $\mathcal{O}' \subseteq \mathcal{O}$,
\[ \mathbb{P}\left[S_E \in \mathcal{O}' \mid t \in E\right] \leq e^{\varepsilon} \cdot \mathbb{P}\left[S_E \in \mathcal{O}' \mid t \notin E\right] + \delta, \]
where $\mathcal{O}$ is the space of possible sketches and $E$ is a uniformly random subset of $\mathcal{U}$ of size $n$.

Unfortunately, our negative result still holds for this variant of the definition. Indeed, we show that a close variant of Lemma 4 holds, and the rest follows directly.

Lemma 10. Let $\delta < 1/2$. A cardinality estimator that satisfies $(\varepsilon,\delta)$-probabilistic sketch privacy above cardinality $N$ satisfies, for all $t \in \mathcal{U}$ and $n \geq N$:
\[ \mathbb{P}\left[\mathrm{add}(S_E,t) = S_E \mid t \notin E\right] \geq \frac{1-2\delta}{2} \cdot e^{-\varepsilon}. \]

Proof. First, we show that if a cardinality estimator satisfies $(\varepsilon,\delta)$-probabilistic sketch privacy at a given cardinality, then for each target $t$, we can find an explicit decomposition of the possible outputs: the bound on the information gain is satisfied for each possible output, except on a set of small density. This is similar to the definition of probabilistic differential privacy (Definition 15 in Section 2.2.2), except that the decomposition depends on the choice of $t$. We then use this decomposition to prove a variant of our negative result.

Lemma 11. If a cardinality estimator satisfies $(\varepsilon,\delta)$-probabilistic sketch privacy above cardinality $N$, then for every $n \geq N$ and $t \in \mathcal{U}$, we can decompose the space of possible sketches into $\mathcal{O} = \mathcal{O}_t \cup \hat{\mathcal{O}}_t$ such that $\mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \in E] \leq 2\delta$, and for all $S \in \mathcal{O}_t$:
\[ \mathbb{P}\left[S_E = S \mid t \in E\right] \leq 2e^{\varepsilon} \cdot \mathbb{P}\left[S_E = S \mid t \notin E\right]. \]

Proof. Suppose the cardinality estimator satisfies $(\varepsilon,\delta)$-probabilistic sketch privacy above cardinality $N$, and fix $n \geq N$ and $t \in \mathcal{U}$. Let $\varepsilon' = \varepsilon + \ln 2$.

Let $\hat{\mathcal{O}}_t$ be the set of outputs for which the privacy loss is higher than $\varepsilon'$. Formally, $\hat{\mathcal{O}}_t$ is the set of sketches $S$ that satisfy
\[ \mathbb{P}[S_E = S \mid t \in E] > e^{\varepsilon'} \cdot \mathbb{P}[S_E = S \mid t \notin E]. \]

We show that $\hat{\mathcal{O}}_t$ has a density bounded by $2\delta$.

Suppose for the sake of contradiction that this set has density at least $2\delta$ given that $t \in E$:
\[ \mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \in E] \geq 2\delta. \]

We can sum the inequalities in $\hat{\mathcal{O}}_t$ to obtain:
\[ \mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \in E] > 2e^{\varepsilon} \cdot \mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \notin E]. \]

Averaging both inequalities, we get:
\[ \mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \in E] > \delta + e^{\varepsilon} \cdot \mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \notin E] \]

since $e^{\varepsilon'} = 2e^{\varepsilon}$. This contradicts the hypothesis that the cardinality estimator satisfies $(\varepsilon,\delta)$-probabilistic sketch privacy at cardinality $n$.

Thus, $\mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \in E] < 2\delta$, and by the definition of $\hat{\mathcal{O}}_t$, every output $S \in \mathcal{O}_t = \mathcal{O} \setminus \hat{\mathcal{O}}_t$ verifies:
\[ \mathbb{P}[S_E = S \mid t \in E] \leq 2e^{\varepsilon} \cdot \mathbb{P}[S_E = S \mid t \notin E]. \qquad \square \]

We can then use this decomposition to prove Lemma 10. Lemma 11 allows us to get two sets $\mathcal{O}_t$ and $\hat{\mathcal{O}}_t$ such that:

  • $\mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \in E] \leq 2\delta$;
  • and for all $S \in \mathcal{O}_t$, $\mathbb{P}[S_E = S \mid t \in E] \leq 2e^{\varepsilon} \cdot \mathbb{P}[S_E = S \mid t \notin E]$.

We decompose $\mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \in E]$ into:
\[ \mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \in E] = \sum_{\substack{S \in \mathcal{O}_t \\ \mathrm{add}(S,t) = S}} \mathbb{P}[S_E = S \mid t \in E] + \sum_{\substack{S \in \hat{\mathcal{O}}_t \\ \mathrm{add}(S,t) = S}} \mathbb{P}[S_E = S \mid t \in E]. \]

The same reasoning as in the proof of Lemma 4 gives:
\[ \sum_{\substack{S \in \mathcal{O}_t \\ \mathrm{add}(S,t) = S}} \mathbb{P}[S_E = S \mid t \in E] \leq 2e^{\varepsilon} \cdot \mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E] \]

and since $\mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \in E] \leq 2\delta$, we immediately have:
\[ \sum_{\substack{S \in \hat{\mathcal{O}}_t \\ \mathrm{add}(S,t) = S}} \mathbb{P}[S_E = S \mid t \in E] \leq 2\delta. \]

We conclude that
\[ \mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \in E] \leq 2e^{\varepsilon} \cdot \mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E] + 2\delta. \]

Now, Lemma 2 gives $\mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \in E] = 1$, and finally:
\[ \mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E] \geq \frac{1-2\delta}{2} \cdot e^{-\varepsilon}. \qquad \square \]

We can then deduce an analogue of our negative result for this weaker privacy definition.

Theorem 9. An unbiased cardinality estimator that satisfies $(\varepsilon,\delta)$-probabilistic sketch privacy above cardinality $N$ has a variance at least as large as the bound of Lemma 6, for any $n \geq N$, where the probability of ignoring a new record is $p = \frac{1-2\delta}{2} e^{-\varepsilon}$ (the bound from Lemma 10). It is therefore not precise if $\delta < 1/2$.

Proof. This follows from Lemmas 10, 5, and 6. □

Instead of requiring that the attacker's information gain is bounded by $\varepsilon$ for every possible output, we could bound the average information gain. This is equivalent to accepting a larger privacy loss in some cases, as long as other cases have a lower privacy loss. This intuition is similar to $\varepsilon$-Kullback-Leibler privacy (Definition 16 in Section 2.2.2), often used in similar contexts [104, 133, 327, 328]. In our case, we adapt it to maintain the asymmetry of our original privacy definition. First, we formally define the privacy loss of a user $t$ given output $S$.

Definition 71. Given a cardinality estimator, the positive privacy loss of $t$ given output $S$ at cardinality $n$ is defined as
\[ \varepsilon_{t,S}(n) = \max\left(0,\; \ln \frac{\mathbb{P}[S_E = S \mid t \in E]}{\mathbb{P}[S_E = S \mid t \notin E]}\right). \]

This privacy loss is never negative: this is equivalent to discarding the cases where the attacker gains negative information. Now, we bound the average of this loss over all possible values of $S$, given $t \in E$.

Definition 72. A cardinality estimator satisfies $\varepsilon$-sketch KL privacy above cardinality $N$ if for every $n \geq N$ and $t \in \mathcal{U}$, we have
\[ \mathbb{E}_{S \sim S_E \mid t \in E}\left[\varepsilon_{t,S}(n)\right] \leq \varepsilon. \]
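To make Definitions 71 and 72 concrete, here is a minimal numeric illustration in Python (the output distributions are toy values invented for this example, not ones coming from a real estimator):

```python
import math

def positive_privacy_loss(p_in: float, p_out: float) -> float:
    """Definition 71: max(0, ln(P[S_E = S | t in E] / P[S_E = S | t not in E]))."""
    return max(0.0, math.log(p_in / p_out))

def average_positive_loss(dist_in: list[float], dist_out: list[float]) -> float:
    """Definition 72: average positive loss over outputs S drawn given t in E."""
    return sum(p_in * positive_privacy_loss(p_in, p_out)
               for p_in, p_out in zip(dist_in, dist_out) if p_in > 0)

# Toy output distributions over three possible sketches:
dist_in = [0.5, 0.3, 0.2]     # P[S_E = S | t in E]
dist_out = [0.25, 0.35, 0.4]  # P[S_E = S | t not in E]
print(average_positive_loss(dist_in, dist_out))  # ~0.35: only the first output contributes
```

An estimator satisfies $\varepsilon$-sketch KL privacy when this average is at most $\varepsilon$ for every target and every cardinality above the threshold.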

It is easy to check that $\varepsilon$-sketch KL privacy above cardinality $N$ is strictly weaker than $\varepsilon$-sketch privacy above cardinality $N$. Unfortunately, this definition is also stronger than $(\varepsilon',\delta)$-probabilistic sketch privacy above cardinality $N$ for certain values of $\varepsilon'$ and $\delta$, and as such, Lemma 10 also applies. We prove this in the following lemma.

Lemma 12. If a cardinality estimator satisfies $\varepsilon$-sketch KL privacy above cardinality $N$, then it also satisfies $(\varepsilon/\delta, \delta)$-probabilistic sketch privacy above cardinality $N$ for any $\delta > 0$.

Proof. Let $n \geq N$ and $t \in \mathcal{U}$. Suppose that a cardinality estimator does not satisfy $(\varepsilon/\delta, \delta)$-probabilistic sketch privacy at cardinality $n$. Then with probability strictly larger than $\delta$, the output does not satisfy $(\varepsilon/\delta)$-sketch privacy. Formally, there is a set of sketches $\mathcal{O}'$ such that $\mathbb{P}[S_E \in \mathcal{O}' \mid t \in E] > \delta$, and such that $\varepsilon_{t,S}(n) > \varepsilon/\delta$ for all $S \in \mathcal{O}'$. Since all values of $\varepsilon_{t,S}(n)$ are nonnegative, we have
\[ \mathbb{E}_{S \sim S_E \mid t \in E}\left[\varepsilon_{t,S}(n)\right] > \delta \cdot \frac{\varepsilon}{\delta} = \varepsilon. \]

Hence this cardinality estimator does not satisfy $\varepsilon$-sketch KL privacy at cardinality $n$. □

This lemma leads to a similar version of the negative result.

Theorem 10. An unbiased cardinality estimator that satisfies $\varepsilon$-sketch KL privacy above cardinality $N$ has a variance at least as large as the bound of Theorem 9 with parameters $(4\varepsilon, 1/4)$, for any $n \geq N$. It is therefore not precise.

Proof. This follows directly from Lemma 12 with , and Theorem 9. □

Recall that all existing cardinality estimators satisfy our axioms and have bounded accuracy. Thus, an immediate corollary is that for all cardinality estimators used in practice, there are some users for whom the average privacy loss is very large.

Note that we could obtain a result similar to Theorem 9 with another notion of average, using for example the same approach as Rényi DP (Definition 17 in Section 2.2.2). We could either simply use the fact that Rényi DP of order $\alpha > 1$ is strictly stronger than KL privacy (which corresponds to $\alpha = 1$), or use the same reasoning as in Proposition 3 of [284] to show that such a notion of average also implies $(\varepsilon,\delta)$-probabilistic sketch privacy with even lower parameters.

Privacy loss of individual users

$\varepsilon$-sketch privacy protects all users uniformly. Similarly to the variants of DP from dimension V, we could relax this requirement and allow some users to have less privacy than others. Variants obtained this way would generally not be sufficiently convincing to be used in practice: one typically wants to protect all users, not just a majority of them. In this section, we show that even if we relax this requirement, cardinality estimators would in practice still leak a significant amount of information.

Allowing unbounded privacy loss for some users

First, what happens if we allow some users to have an unbounded privacy loss? We could achieve this by requiring the existence of a subset $\mathcal{T} \subseteq \mathcal{U}$ of users of density $1-\delta$, such that every user in $\mathcal{T}$ is protected by $\varepsilon$-sketch privacy above cardinality $N$. In this case, a ratio $\delta$ of possible targets are not protected. This would be somewhat similar to random DP (Definition 29 in Section 2.2.4), except that the unprotected users are always the same.

This approach only makes sense if the attacker cannot choose the target $t$. For our attacker model, this might be realistic: suppose that the attacker wants to target just one particular person. Since all user identifiers are hashed before being passed to the cardinality estimator, this person will be associated with a hash value that the attacker can neither predict nor influence. Thus, although the attacker picks $t$, the true target of the attack is $h(t)$, which the attacker cannot choose.

Unfortunately, this drastic increase in privacy risk for some users does not lead to a large increase in accuracy. Indeed, from an accuracy perspective, the best possible use of this ratio $\delta$ of unprotected users would simply be to count exactly the users in a sample of sampling ratio $\delta$.

Estimating the total cardinality based on this sample, similarly to what the optimal estimator in the proof of Lemma 6 does, leads to a variance of $n(1-\delta)/\delta$, which is very large if $\delta$ is reasonably small: for values of $n$ on the order of $1/\delta$ or less, the standard deviation $\sqrt{n(1-\delta)/\delta}$ is about as large as $n$ itself. This is not surprising: if all but a small fraction $\delta$ of the values are ignored by the cardinality estimator, we cannot expect it to count values of $n$ much smaller than $1/\delta$ (say, values in the thousands when $\delta = 0.1\%$). But even such a value of $\delta$ is larger than the $\delta$ typically used with $(\varepsilon,\delta)$-differential privacy, where $\delta$ is often required to be much smaller than $1/n$.
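As a quick numeric illustration of this variance formula (the sampling ratio below is an arbitrary example value, not one from the text):

```python
import math

def sampling_estimator_stddev(n: int, delta: float) -> float:
    """Standard deviation of n_hat = X / delta, where X ~ Binomial(n, delta):
    Var(n_hat) = n * delta * (1 - delta) / delta**2 = n * (1 - delta) / delta."""
    return math.sqrt(n * (1 - delta) / delta)

for n in (1_000, 100_000, 10_000_000):
    # with delta = 0.1%, the error at n = 1,000 is about as large as n itself
    print(n, round(sampling_estimator_stddev(n, delta=0.001)))
```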

But in our running example, sketches must yield a reasonable accuracy at both small and large cardinalities, even when many sketches are aggregated. This implicitly assumes that the service operates at a large scale, say with millions of users. With $\delta = 0.1\%$, this means that thousands of users are not covered by the privacy property. This is unacceptable for most applications.

Averaging the privacy loss across users

Instead of requiring the same $\varepsilon$ for every user, we could require that the average information gain of the attacker is bounded by $\varepsilon$. This approach is similar to on-average KL privacy [381], which we mentioned in Section 2.2.4. In this section, we use the example of HyperLogLog to show that accuracy is not incompatible with this notion of average privacy, but that cardinality estimators used in practice do not preserve privacy even if we average across all users.

First, we define this notion of average information gain across users.

Definition 73. Recall the definition of the positive privacy loss of $t$ given output $S$ at cardinality $n$ from Definition 71:
\[ \varepsilon_{t,S}(n) = \max\left(0,\; \ln \frac{\mathbb{P}[S_E = S \mid t \in E]}{\mathbb{P}[S_E = S \mid t \notin E]}\right). \]

The maximum privacy loss of $t$ at cardinality $n$ is defined as $\varepsilon_t(n) = \max_S \varepsilon_{t,S}(n)$. A cardinality estimator satisfies $\varepsilon$-sketch privacy on average above cardinality $N$ if we have, for all $n \geq N$, $\mathbb{E}_{t \sim \mathcal{U}}\left[\varepsilon_t(n)\right] \leq \varepsilon$.

In this definition, we accept that some users might have less privacy, as long as the average user satisfies our initial privacy definition. Here, we average over all values of $t$, but as described earlier in this section, we could use other averaging functions, which would lead to strictly stronger definitions.

We show that HyperLogLog satisfies this definition, and we discuss the value of $\varepsilon$ for various parameters and their significance. Intuitively, a HyperLogLog cardinality estimator puts every record in a random bucket, and each bucket counts the maximum number of leading zeroes of the records added to this bucket. HyperLogLog cardinality estimators have a parameter $p$ that determines their memory consumption, their accuracy, and, as we will see, their level of average privacy.

Definition 74. Let $h$ be a uniformly distributed hash function. A HyperLogLog cardinality estimator of parameter $p$ is defined as follows. A sketch consists of a list of $2^p$ counters $C_0, \dots, C_{2^p-1}$, all initialized to $0$. When adding a record $e$ to the sketch, we compute $h(e)$, and represent it as a binary string $x_1 x_2 x_3 \dots$. Let $b(e) = \langle x_1 \dots x_p \rangle_2$, i.e., the integer represented by the binary digits $x_1 \dots x_p$, and let $\rho(e)$ be the position of the leftmost $1$-bit in $x_{p+1} x_{p+2} \dots$. Then we update counter $C_{b(e)}$ with $\max(C_{b(e)}, \rho(e))$.

For example, suppose that $p = 4$ and $h(e) = 0110\,0010\dots$: then $b(e) = \langle 0110 \rangle_2 = 6$, and $\rho(e) = 3$ (the position of the leftmost $1$-bit in $0010\dots$). So we must look at the value of the counter $C_6$ and, if $C_6 < 3$, set $C_6$ to $3$.
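The following minimal Python sketch illustrates the update procedure of Definition 74. It is an illustration only, not the implementation evaluated in this chapter: the SHA-256-based hash stands in for the uniformly distributed hash function $h$, and only the first 64 bits of the hash are used.

```python
import hashlib

P = 4                # sketch parameter: 2**P counters
HASH_BITS = 64       # number of hash bits we use for h(e)

def new_sketch() -> list[int]:
    return [0] * 2**P

def add(counters: list[int], record: str) -> None:
    """Add a record to the sketch, as in Definition 74."""
    x = int.from_bytes(hashlib.sha256(record.encode()).digest()[:8], "big")
    b = x >> (HASH_BITS - P)                   # first p bits: bucket index
    suffix = x & ((1 << (HASH_BITS - P)) - 1)  # remaining bits
    # position of the leftmost 1-bit in the suffix (1-indexed);
    # an all-zero suffix yields HASH_BITS - P + 1 by convention
    rho = (HASH_BITS - P) - suffix.bit_length() + 1
    counters[b] = max(counters[b], rho)

sketch = new_sketch()
for i in range(1000):
    add(sketch, f"user-{i}")
```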

Theorem 11. Assuming a sufficiently large $|\mathcal{U}|$, a HyperLogLog cardinality estimator of parameter $p$ satisfies $\varepsilon_n$-sketch privacy on average above cardinality $n$, where
\[ \varepsilon_n = \sum_{k \geq 1} 2^{-k} \ln\left(\left(1 - \left(1 - 2^{-p-k}\right)^n\right)^{-1}\right). \]

Proof. To simplify our analysis, we assume in this proof that $|\mathcal{U}|$ is very large: for all reasonable values of $n$, picking $n$ records uniformly at random in $\mathcal{U}$ is the same as picking a subset of $\mathcal{U}$ of size $n$ uniformly at random. In particular, we approximate $\mathbb{P}[S_E = S \mid t \notin E]$ by $\mathbb{P}[S_E = S]$.

First, we compute $\varepsilon_{t,S}(n)$ for each sketch $S$, by considering the counter values of $S$. We then use this to determine $\varepsilon_t(n)$, and, averaging over all $t$, deduce the desired result.

Step 1: Computing $\varepsilon_{t,S}(n)$

Let $t \in \mathcal{U}$, and let $S$ be a sketch such that $\mathbb{P}[S_E = S \mid t \in E] > 0$. Decompose the binary string $h(t)$ into two parts to get $b = b(t)$ and $\rho = \rho(t)$. Denote by $C_0, \dots, C_{2^p-1}$ the counters of $S$. Note that the event $S_E = S$ is characterized by two conditions:

  • $W_i$ for all $i$, where $W_i$ denotes the condition that if bucket $i$ is non-empty, then a record with this number of leading zeroes was added to this bucket. Formally, $W_i$ holds iff whenever $C_i > 0$, there exists $e \in E$ such that $b(e) = i$ and $\rho(e) = C_i$.
  • $B$, where $B$ denotes the condition that no record has more leading zeroes than the counter for its bucket. Formally, $B$ holds iff for all $e \in E$, $\rho(e) \leq C_{b(e)}$.

Now, we compute the value of $\mathbb{P}[S_E = S \mid t \in E]$, depending on the value of $C_b$. Without loss of generality, we assume that $C_b \geq \rho$: otherwise, condition $B$ fails for $t$, and $\mathbb{P}[S_E = S \mid t \in E] = 0$.

Case 1: Suppose $C_b = \rho$.

We compute the probability of observing $S$ given $t \in E$. Condition $W_b$ is already satisfied by $t$, as $b(t) = b$ and $\rho(t) = \rho = C_b$. So for $S_E$ to be equal to $S$, $E$'s other records only have to satisfy $W_i$ for $i \neq b$, and $B$:
\[ \mathbb{P}[S_E = S \mid t \in E] = \mathbb{P}\Big[B \wedge \bigwedge_{i \neq b} W_i\Big], \]
where the probability on the right-hand side is taken over the $n-1$ other records of $E$.

Next, we compute the probability of observing $S$ given $t \notin E$. This time, all records of $E$ are chosen randomly, so
\[ \mathbb{P}[S_E = S \mid t \notin E] = \mathbb{P}\Big[B \wedge \bigwedge_{i} W_i\Big]. \]

We can decompose this condition: there is a witness for $W_b$ (i.e., a record $e$ with $b(e) = b$ and $\rho(e) = C_b$), and all other records satisfy the same conditions as in the case $t \in E$, namely $B$ and $W_i$ for all $i \neq b$, since this is equivalent to $B$ and $W_i$ for all $i$ by the choice of the witness. If $e$ is chosen uniformly in $\mathcal{U}$, since $h$ is a uniformly distributed hash function, then the following holds.

  • The probability of $b(e) = b$ is $2^{-p}$.
  • The probability of $\rho(e) = C_b$ is $2^{-C_b}$: writing the suffix of $h(e)$ as $x_{p+1} x_{p+2} \dots$, we have $\rho(e) = 1$ with probability $1/2$, $\rho(e) = 2$ with probability $1/4$, etc.

Thus, the probability that a given record witnesses $W_b$ is $2^{-p-C_b}$, as $b(e)$ and $\rho(e)$ are independent. Since the set $E$ has size $n$, there are $n$ chances that such a record is chosen: the probability that at least one record witnesses $W_b$ is $1 - \left(1 - 2^{-p-C_b}\right)^n$. We can thus approximate $\mathbb{P}[S_E = S \mid t \notin E]$ by
\[ \mathbb{P}[S_E = S \mid t \notin E] \approx \left(1 - \left(1 - 2^{-p-\rho}\right)^n\right) \cdot \mathbb{P}\Big[B \wedge \bigwedge_{i \neq b} W_i\Big] \]

and thus
\[ \varepsilon_{t,S}(n) = \ln\left(\left(1 - \left(1 - 2^{-p-\rho}\right)^n\right)^{-1}\right). \]

Case 2: Suppose $C_b > \rho$.

We can compute the probabilities $\mathbb{P}[S_E = S \mid t \in E]$ and $\mathbb{P}[S_E = S \mid t \notin E]$ in a similar fashion.

Since (by hypothesis) $\mathbb{P}[S_E = S \mid t \in E] > 0$, there exist $n-1$ distinct records of $\mathcal{U} \setminus \{t\}$ which, together with $t$, satisfy conditions $B$ and $W_i$ for all $i$. This allows us to bound $\mathbb{P}[S_E = S \mid t \notin E]$ from below. Suppose one record $e$ of $E$ satisfies $b(e) = b$ and $\rho(e) = C_b$, and the other $n-1$ records in $E$ satisfy the conditions $B$ and $W_i$ for all $i \neq b$. Then $E$ satisfies $B$ and $W_i$ for all $i$. The lower bound follows as before:
\[ \mathbb{P}[S_E = S \mid t \notin E] \geq \left(1 - \left(1 - 2^{-p-C_b}\right)^n\right) \cdot \mathbb{P}\Big[B \wedge \bigwedge_{i \neq b} W_i\Big] \]

and thus
\[ \varepsilon_{t,S}(n) \leq \ln\left(\left(1 - \left(1 - 2^{-p-C_b}\right)^n\right)^{-1}\right) \leq \ln\left(\left(1 - \left(1 - 2^{-p-\rho}\right)^n\right)^{-1}\right). \]

We can use the results of both cases and immediately conclude that
\[ \varepsilon_{t,S}(n) \leq \ln\left(\left(1 - \left(1 - 2^{-p-\rho(t)}\right)^n\right)^{-1}\right), \]
and that equality holds if $S$ satisfies $C_{b(t)} = \rho(t)$.

Step 2: Determining $\varepsilon_t(n)$

The previous reasoning shows that the worst case happens when the counter corresponding to $t$'s bucket contains exactly the number of leading zeroes of $t$, i.e., when $C_{b(t)} = \rho(t)$:
\[ \varepsilon_t(n) = \ln\left(\left(1 - \left(1 - 2^{-p-\rho(t)}\right)^n\right)^{-1}\right). \]

We can then average this value over all $t$. Since the hash function is uniformly distributed, half of the values of $t$ satisfy $\rho(t) = 1$, a quarter satisfy $\rho(t) = 2$, etc. Thus
\[ \varepsilon_n = \mathbb{E}_t\left[\varepsilon_t(n)\right] = \sum_{k \geq 1} 2^{-k} \ln\left(\left(1 - \left(1 - 2^{-p-k}\right)^n\right)^{-1}\right). \qquad \square \]
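As a sanity check on this formula, the following numerically stable Python snippet evaluates $\varepsilon_n$; it is our own illustration of the theorem's bound, not code from the original analysis:

```python
import math

def epsilon_n(p: int, n: int, max_k: int = 64) -> float:
    """Average privacy loss eps_n of Theorem 11 for a HyperLogLog
    sketch of parameter p at cardinality n."""
    total = 0.0
    for k in range(1, max_k + 1):
        q = 2.0 ** -(p + k)  # P[a random record witnesses counter value k in t's bucket]
        # hit = 1 - (1 - q)^n, computed stably even for tiny q
        hit = -math.expm1(n * math.log1p(-q))
        total += 2.0 ** -k * -math.log(hit)
    return total

print(epsilon_n(p=9, n=1000))  # roughly 1: a weak guarantee on average
```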

How does this positive result fit practical use cases? Figure 3.9 plots $\varepsilon_n$ for three different HyperLogLog cardinality estimators. It shows two important results.

[Plot omitted; it shows curves for $p = 9$ (5% relative standard error), $p = 15$ (1%), and $p = 21$ (0.1%).]

Figure 3.9: $\varepsilon_n$ as a function of $n$, for HyperLogLog cardinality estimators of different parameters $p$ (and their corresponding relative standard error).

First, cardinality estimators used in practice do not preserve privacy. For example, the default parameter used for production pipelines at Google and on the BigQuery service [48] is $p = 15$. For this value of $p$, an attacker can determine with significant accuracy whether a target was added to a sketch; not only in the worst case, but for the average user too. The average risk only becomes reasonable for cardinalities in the hundreds of thousands, a threshold too large for most data analysis tasks.

Second, by sacrificing some accuracy, it is possible to obtain a reasonable average privacy. For example, a HyperLogLog sketch with $p = 9$ has a relative standard error of about 5%, and an $\varepsilon_n$ of about 1 for $n = 1000$. Unfortunately, even when the average risk is acceptable, some users will still be at a higher risk: users with a large number of leading zeroes are much more identifiable than the average. For example, if $n = 1000$, there is a significant chance that at least one user has $\rho(t) \geq 12$; for such a user, $\varepsilon_t(n)$ is larger than 7, a very high value.
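The tail-risk claim can be checked numerically with the formulas from the proof of Theorem 11 (an illustration with example numbers, using the hypothetical helper below):

```python
import math

def epsilon_t(p: int, n: int, rho: int) -> float:
    """Worst-case privacy loss of a user whose hash has rho leading zeroes."""
    q = 2.0 ** -(p + rho)
    return -math.log(-math.expm1(n * math.log1p(-q)))

n = 1000
# chance that at least one of n users has rho(t) >= 12 (each has probability 2^-11)
print(-math.expm1(n * math.log1p(-2.0 ** -11)))  # about 0.39
print(epsilon_t(p=9, n=n, rho=12))               # about 7.6
```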

Our calculations yield only an approximation of $\varepsilon_n$ that is an upper bound on the actual privacy loss in HyperLogLog sketches. However, these alarming results can be confirmed experimentally. We simulated $\mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E]$, for uniformly random values of $t$, using HyperLogLog sketches with the parameter $p = 15$. For each cardinality $n$, we generated 10,000 different random target values, and added each one to 1,000 HyperLogLog sketches of cardinality $n$ (generated from random values). For each target, we counted the number of sketches that ignored it.
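A condensed version of this experiment fits in a few lines of Python. This is a sketch under assumptions (the SHA-256-based hash used earlier, smaller counts for speed), not the original experimental code:

```python
import hashlib, random

P = 15
HASH_BITS = 64

def bucket_rho(record: str) -> tuple[int, int]:
    x = int.from_bytes(hashlib.sha256(record.encode()).digest()[:8], "big")
    suffix = x & ((1 << (HASH_BITS - P)) - 1)
    return x >> (HASH_BITS - P), (HASH_BITS - P) - suffix.bit_length() + 1

def ignored_fraction(target: str, n: int, n_sketches: int = 100) -> float:
    """Fraction of random n-record sketches left unchanged when adding `target`."""
    ignored = 0
    for _ in range(n_sketches):
        counters: dict[int, int] = {}
        for _ in range(n):
            b, r = bucket_rho(str(random.random()))
            counters[b] = max(counters.get(b, 0), r)
        b, r = bucket_rho(target)
        ignored += r <= counters.get(b, 0)  # sketch unchanged: the target is ignored
    return ignored / n_sketches

print(ignored_fraction("target-user", n=10_000))
```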

Figure 3.10 plots some percentile values. For example, the all-targets curve (100th percentile) has a value of 33% at cardinality $n$ = 10,000. This means that each of the 10,000 random targets was ignored by at most 33% of the 1,000 random sketches of this cardinality, i.e., $\mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E] \leq 33\%$ for all $t$. In other words, an attacker observes with at least 67% probability a change when adding a random target to a random sketch that did not contain it. Similarly, the 10th percentile at $n$ = 10,000 has a value of 3.8%. So 10% of the targets were ignored by at most 3.8% of the sketches, i.e., $\mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E] \leq 3.8\%$ for 10% of all users $t$. That is, for the average user $t$, there is a 10% chance that a sketch with 10,000 records changes with likelihood at least 96.2% when $t$ is first added.

[Plot omitted; it shows $\mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E]$ as a function of $n$, with curves for all targets (100th percentile), the 10th percentile, the 1st percentile, and the 1st permille.]

Figure 3.10: Simulation of $\mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E]$ for uniformly chosen values of $t$, using HyperLogLog sketches with parameter $p = 15$.

For small cardinalities (say, $n \leq 1000$), adding a record that has not yet been added to the sketch will almost certainly modify the sketch: an attacker observing that a sketch does not change after adding $t$ can deduce with near-certainty that $t$ was added previously.

Even for larger cardinalities, there is always a constant number of people with a high privacy loss. For $n$ = 1,000, no target was ignored by more than 5.5% of the sketches. For $n$ = 10,000, 10% of the users were ignored by at most 3.8% of the sketches. Similarly, the 1st percentile at $n$ = 100,000 and the 1st permille at $n$ = 1,000,000 are 4.6% and 4.5%, respectively. In summary, across all cardinalities $n$, at least 1,000 users have $\mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E] \leq 5.5\%$. For these users, the corresponding privacy loss is $\varepsilon_t(n) \geq \ln(1/0.055) \approx 2.9$. Concretely, if the attacker initially believes that $\mathbb{P}[t \in E]$ is 1%, this number grows to 15% after observing that $\mathrm{add}(S_E,t) = S_E$. If it is initially 10%, it grows to 66%. And if it is initially 25%, it grows to 86%.
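These posterior values follow from Bayes' theorem, using the fact that a sketch never changes when a record it already contains is re-added. A quick check, taking the 5.5% experimental bound quoted above as the probability that the sketch ignores an absent target (small differences with the numbers in the text are due to rounding):

```python
def posterior(prior: float, p_ignored: float) -> float:
    """Attacker's belief that t was added, after seeing the sketch unchanged.
    Assumes P[sketch unchanged | t in E] = 1."""
    return prior / (prior + (1 - prior) * p_ignored)

for prior in (0.01, 0.10, 0.25):
    print(f"{prior:.0%} -> {posterior(prior, p_ignored=0.055):.0%}")
# prints: 1% -> 16%, 10% -> 67%, 25% -> 86%
```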
