
3.3.3  Weakening the privacy definition

Our main result is negative: no cardinality estimator satisfying our privacy definition can maintain good accuracy. Thus, it is natural to wonder whether our privacy definition is too strict, and whether the result still holds for weaker variants.

In this section, we consider two classes of weaker variants of our privacy definition: allowing the privacy loss to be larger than $\varepsilon$ on some inputs (relaxing the definition along dimension Q, described in Section 2.2.2), and allowing some proportion of users to not be protected (relaxing the definition along dimension V, described in Section 2.2.4).

First, we show that for the former class of variants, our negative result still holds: allowing a small probability of bad outcomes does not help. Second, we show that even though our negative result does not hold for the latter class of variants, cardinality estimators used in practice still leak a lot of information, even when we average the privacy loss across users.

Allowing larger values of the privacy loss

$\varepsilon$-sketch privacy provides a bound on how much information the attacker can gain in the worst case. As exemplified by the DP variants along dimension Q in Section 2.2.2, it is natural to relax this requirement.

A first natural relaxation is to accept a small probability of failure, similarly to $(\varepsilon,\delta)$-DP (Definition 14 in Section 2.2.2): requiring a bound on the information gain in most cases, and accepting a potentially unbounded information gain with low probability.

Definition 70. A cardinality estimator satisfies $(\varepsilon,\delta)$-probabilistic sketch privacy above cardinality $N$ if for every $n \geq N$, $t \in \mathcal{U}$, and $\mathcal{O}' \subseteq \mathcal{O}$,
\[ \mathbb{P}\left[S_E \in \mathcal{O}' \mid t \in E\right] \leq e^{\varepsilon} \cdot \mathbb{P}\left[S_E \in \mathcal{O}' \mid t \notin E\right] + \delta, \]
where $\mathcal{O}$ is the space of possible sketches and $E$ is a uniformly random subset of $\mathcal{U}$ of size $n$.

Unfortunately, our negative result still holds for this variant of the definition. Indeed, we show that a close variant of Lemma 4 holds, and the rest follows directly.

Lemma 10. Let $\delta < 1/2$. A cardinality estimator that satisfies $(\varepsilon,\delta)$-probabilistic sketch privacy above cardinality $N$ satisfies, for all $t \in \mathcal{U}$ and $n \geq N$:
\[ \mathbb{P}\left[\mathrm{add}(S_E,t) = S_E \mid t \notin E\right] \geq \frac{1-2\delta}{2} \cdot e^{-\varepsilon}. \]

Proof. First, we show that if a cardinality estimator satisfies $(\varepsilon,\delta)$-probabilistic sketch privacy at a given cardinality, then for each target $t$, we can find an explicit decomposition of the possible outputs: the bound on the information gain is satisfied for each possible output, except on a set of small density. This is similar to the definition of probabilistic differential privacy (Definition 15 in Section 2.2.2), except that the decomposition depends on the choice of $t$. We then use this decomposition to prove a variant of our negative result.

Lemma 11. If a cardinality estimator satisfies $(\varepsilon,\delta)$-probabilistic sketch privacy above cardinality $N$, then for every $n \geq N$ and $t \in \mathcal{U}$, we can decompose the space of possible sketches into $\mathcal{O} = \mathcal{O}_t \cup \hat{\mathcal{O}}_t$ such that $\mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \in E] \leq 2\delta$, and for all $S \in \mathcal{O}_t$:
\[ \mathbb{P}\left[S_E = S \mid t \in E\right] \leq 2e^{\varepsilon} \cdot \mathbb{P}\left[S_E = S \mid t \notin E\right]. \]

Proof. Suppose the cardinality estimator satisfies $(\varepsilon,\delta)$-probabilistic sketch privacy above cardinality $N$, and fix $n \geq N$ and $t \in \mathcal{U}$. Let $\varepsilon' = \varepsilon + \ln 2$.

Let $\hat{\mathcal{O}}_t$ be the set of outputs for which the privacy loss is higher than $\varepsilon'$. Formally, $\hat{\mathcal{O}}_t$ is the set of sketches $S$ that satisfy
\[ \mathbb{P}[S_E = S \mid t \in E] > e^{\varepsilon'} \cdot \mathbb{P}[S_E = S \mid t \notin E]. \]

We show that $\hat{\mathcal{O}}_t$ has a density bounded by $2\delta$.

Suppose for the sake of contradiction that this set has density at least $2\delta$ given that $t \in E$:
\[ \mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \in E] \geq 2\delta. \]

We can sum the inequalities in $\hat{\mathcal{O}}_t$ to obtain:
\[ \mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \in E] > 2e^{\varepsilon} \cdot \mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \notin E]. \]

Averaging both inequalities, we get:
\[ \mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \in E] > \delta + e^{\varepsilon} \cdot \mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \notin E] \]

since $e^{\varepsilon'} = 2e^{\varepsilon}$. This contradicts the hypothesis that the cardinality estimator satisfies $(\varepsilon,\delta)$-probabilistic sketch privacy at cardinality $n$.

Thus, $\mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \in E] < 2\delta$, and by the definition of $\hat{\mathcal{O}}_t$, every output $S \in \mathcal{O}_t = \mathcal{O} \setminus \hat{\mathcal{O}}_t$ verifies:
\[ \mathbb{P}[S_E = S \mid t \in E] \leq 2e^{\varepsilon} \cdot \mathbb{P}[S_E = S \mid t \notin E]. \qquad \square \]

We can then use this decomposition to prove Lemma 10. Lemma 11 allows us to get two sets $\mathcal{O}_t$ and $\hat{\mathcal{O}}_t$ such that:

  • $\mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \in E] \leq 2\delta$;
  • and for all $S \in \mathcal{O}_t$, $\mathbb{P}[S_E = S \mid t \in E] \leq 2e^{\varepsilon} \cdot \mathbb{P}[S_E = S \mid t \notin E]$.

We decompose $\mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \in E]$ into:
\[ \mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \in E] = \sum_{\substack{S \in \mathcal{O}_t \\ \mathrm{add}(S,t) = S}} \mathbb{P}[S_E = S \mid t \in E] + \sum_{\substack{S \in \hat{\mathcal{O}}_t \\ \mathrm{add}(S,t) = S}} \mathbb{P}[S_E = S \mid t \in E]. \]

The same reasoning as in the proof of Lemma 4 gives:
\[ \sum_{\substack{S \in \mathcal{O}_t \\ \mathrm{add}(S,t) = S}} \mathbb{P}[S_E = S \mid t \in E] \leq 2e^{\varepsilon} \cdot \mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E] \]

and since $\mathbb{P}[S_E \in \hat{\mathcal{O}}_t \mid t \in E] \leq 2\delta$, we immediately have:
\[ \sum_{\substack{S \in \hat{\mathcal{O}}_t \\ \mathrm{add}(S,t) = S}} \mathbb{P}[S_E = S \mid t \in E] \leq 2\delta. \]

We conclude that
\[ \mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \in E] \leq 2e^{\varepsilon} \cdot \mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E] + 2\delta. \]

Now, Lemma 2 gives $\mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \in E] = 1$, and finally:
\[ \mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E] \geq \frac{1-2\delta}{2} \cdot e^{-\varepsilon}. \qquad \square \]

We can then deduce an analogue of our negative result for this weaker privacy definition.

Theorem 9. An unbiased cardinality estimator that satisfies $(\varepsilon,\delta)$-probabilistic sketch privacy above cardinality $N$ has a variance at least as large as the bound of Lemma 6, for any $n \geq N$, where the probability of ignoring a new record is $p = \frac{1-2\delta}{2} e^{-\varepsilon}$ (the bound from Lemma 10). It is therefore not precise if $\delta < 1/2$.

Proof. This follows from Lemmas 10, 5, and 6. □

Instead of requiring that the attacker's information gain is bounded by $\varepsilon$ for every possible output, we could bound the average information gain. This is equivalent to accepting a larger privacy loss in some cases, as long as other cases have a lower privacy loss. This intuition is similar to $\varepsilon$-Kullback-Leibler privacy (Definition 16 in Section 2.2.2), often used in similar contexts [104, 133, 327, 328]. In our case, we adapt it to maintain the asymmetry of our original privacy definition. First, we formally define the privacy loss of a user $t$ given output $S$.

Definition 71. Given a cardinality estimator, the positive privacy loss of $t$ given output $S$ at cardinality $n$ is defined as
\[ \varepsilon_{t,S}(n) = \max\left(0,\; \ln \frac{\mathbb{P}[S_E = S \mid t \in E]}{\mathbb{P}[S_E = S \mid t \notin E]}\right). \]

This privacy loss is never negative: this is equivalent to discarding the cases where the attacker gains negative information. Now, we bound the average of this loss over all possible values of $S$, given $t \in E$.

Definition 72. A cardinality estimator satisfies $\varepsilon$-sketch KL privacy above cardinality $N$ if for every $n \geq N$ and $t \in \mathcal{U}$, we have
\[ \mathbb{E}_{S \sim S_E \mid t \in E}\left[\varepsilon_{t,S}(n)\right] \leq \varepsilon. \]
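To make Definitions 71 and 72 concrete, here is a minimal numeric illustration in Python (the output distributions are toy values invented for this example, not ones coming from a real estimator):

```python
import math

def positive_privacy_loss(p_in: float, p_out: float) -> float:
    """Definition 71: max(0, ln(P[S_E = S | t in E] / P[S_E = S | t not in E]))."""
    return max(0.0, math.log(p_in / p_out))

def average_positive_loss(dist_in: list[float], dist_out: list[float]) -> float:
    """Definition 72: average positive loss over outputs S drawn given t in E."""
    return sum(p_in * positive_privacy_loss(p_in, p_out)
               for p_in, p_out in zip(dist_in, dist_out) if p_in > 0)

# Toy output distributions over three possible sketches:
dist_in = [0.5, 0.3, 0.2]     # P[S_E = S | t in E]
dist_out = [0.25, 0.35, 0.4]  # P[S_E = S | t not in E]
print(average_positive_loss(dist_in, dist_out))  # ~0.35: only the first output contributes
```

An estimator satisfies $\varepsilon$-sketch KL privacy when this average is at most $\varepsilon$ for every target and every cardinality above the threshold.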

It is easy to check that $\varepsilon$-sketch KL privacy above cardinality $N$ is strictly weaker than $\varepsilon$-sketch privacy above cardinality $N$. Unfortunately, this definition is also stronger than $(\varepsilon',\delta)$-probabilistic sketch privacy above cardinality $N$ for certain values of $\varepsilon'$ and $\delta$, and as such, Lemma 10 also applies. We prove this in the following lemma.

Lemma 12. If a cardinality estimator satisfies $\varepsilon$-sketch KL privacy above cardinality $N$, then it also satisfies $(\varepsilon/\delta, \delta)$-probabilistic sketch privacy above cardinality $N$ for any $\delta > 0$.

Proof. Let $n \geq N$ and $t \in \mathcal{U}$. Suppose that a cardinality estimator does not satisfy $(\varepsilon/\delta, \delta)$-probabilistic sketch privacy at cardinality $n$. Then with probability strictly larger than $\delta$, the output does not satisfy $(\varepsilon/\delta)$-sketch privacy. Formally, there is a set of sketches $\mathcal{O}'$ such that $\mathbb{P}[S_E \in \mathcal{O}' \mid t \in E] > \delta$, and such that $\varepsilon_{t,S}(n) > \varepsilon/\delta$ for all $S \in \mathcal{O}'$. Since all values of $\varepsilon_{t,S}(n)$ are nonnegative, we have
\[ \mathbb{E}_{S \sim S_E \mid t \in E}\left[\varepsilon_{t,S}(n)\right] > \delta \cdot \frac{\varepsilon}{\delta} = \varepsilon. \]

Hence this cardinality estimator does not satisfy $\varepsilon$-sketch KL privacy at cardinality $n$. □

This lemma leads to a similar version of the negative result.

Theorem 10. An unbiased cardinality estimator that satisfies $\varepsilon$-sketch KL privacy above cardinality $N$ has a variance at least as large as the bound of Theorem 9 with parameters $(4\varepsilon, 1/4)$, for any $n \geq N$. It is therefore not precise.

Proof. This follows directly from Lemma 12 with , and Theorem 9. □

Recall that all existing cardinality estimators satisfy our axioms and have bounded accuracy. Thus, an immediate corollary is that for all cardinality estimators used in practice, there are some users for whom the average privacy loss is very large.

Note that we could obtain a result similar to Theorem 9 with another notion of average, using for example the same approach as Rényi DP (Definition 17 in Section 2.2.2). We could either simply use the fact that Rényi DP of order $\alpha > 1$ is strictly stronger than KL privacy (which corresponds to $\alpha = 1$), or use the same reasoning as in Proposition 3 of [284] to show that such a notion of average also implies $(\varepsilon,\delta)$-probabilistic sketch privacy with even lower parameters.

Privacy loss of individual users

$\varepsilon$-sketch privacy protects all users uniformly. Similarly to the variants of DP from dimension V, we could relax this requirement and allow some users to have less privacy than others. Variants obtained this way would generally not be sufficiently convincing to be used in practice: one typically wants to protect all users, not just a majority of them. In this section, we show that even if we relax this requirement, cardinality estimators would in practice still leak a significant amount of information.

Allowing unbounded privacy loss for some users

First, what happens if we allow some users to have an unbounded privacy loss? We could achieve this by requiring the existence of a subset $\mathcal{T} \subseteq \mathcal{U}$ of users of density $1-\delta$, such that every user in $\mathcal{T}$ is protected by $\varepsilon$-sketch privacy above cardinality $N$. In this case, a ratio $\delta$ of possible targets are not protected. This would be somewhat similar to random DP (Definition 29 in Section 2.2.4), except that the unprotected users are always the same.

This approach only makes sense if the attacker cannot choose the target $t$. For our attacker model, this might be realistic: suppose that the attacker wants to target just one particular person. Since all user identifiers are hashed before being passed to the cardinality estimator, this person will be associated with a hash value that the attacker can neither predict nor influence. Thus, although the attacker picks $t$, the true target of the attack is $h(t)$, which the attacker cannot choose.

Unfortunately, this drastic increase in privacy risk for some users does not lead to a large increase in accuracy. Indeed, from an accuracy perspective, the best possible use of this ratio $\delta$ of unprotected users would simply be to count exactly the users in a sample of sampling ratio $\delta$.

Estimating the total cardinality based on this sample, similarly to what the optimal estimator in the proof of Lemma 6 does, leads to a variance of $n(1-\delta)/\delta$, which is very large if $\delta$ is reasonably small: for values of $n$ on the order of $1/\delta$ or less, the standard deviation $\sqrt{n(1-\delta)/\delta}$ is about as large as $n$ itself. This is not surprising: if all but a small fraction $\delta$ of the values are ignored by the cardinality estimator, we cannot expect it to count values of $n$ much smaller than $1/\delta$ (say, values in the thousands when $\delta = 0.1\%$). But even such a value of $\delta$ is larger than the $\delta$ typically used with $(\varepsilon,\delta)$-differential privacy, where $\delta$ is often required to be much smaller than $1/n$.
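As a quick numeric illustration of this variance formula (the sampling ratio below is an arbitrary example value, not one from the text):

```python
import math

def sampling_estimator_stddev(n: int, delta: float) -> float:
    """Standard deviation of n_hat = X / delta, where X ~ Binomial(n, delta):
    Var(n_hat) = n * delta * (1 - delta) / delta**2 = n * (1 - delta) / delta."""
    return math.sqrt(n * (1 - delta) / delta)

for n in (1_000, 100_000, 10_000_000):
    # with delta = 0.1%, the error at n = 1,000 is about as large as n itself
    print(n, round(sampling_estimator_stddev(n, delta=0.001)))
```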

But in our running example, sketches must yield a reasonable accuracy at both small and large cardinalities, even when many sketches are aggregated. This implicitly assumes that the service operates at a large scale, say with millions of users. With $\delta = 0.1\%$, this means that thousands of users are not covered by the privacy property. This is unacceptable for most applications.

Averaging the privacy loss across users

Instead of requiring the same $\varepsilon$ for every user, we could require that the average information gain of the attacker is bounded by $\varepsilon$. This approach is similar to on-average KL privacy [381], which we mentioned in Section 2.2.4. In this section, we use the example of HyperLogLog to show that accuracy is not incompatible with this notion of average privacy, but that cardinality estimators used in practice do not preserve privacy even if we average across all users.

First, we define this notion of average information gain across users.

Definition 73. Recall the definition of the positive privacy loss of $t$ given output $S$ at cardinality $n$ from Definition 71:
\[ \varepsilon_{t,S}(n) = \max\left(0,\; \ln \frac{\mathbb{P}[S_E = S \mid t \in E]}{\mathbb{P}[S_E = S \mid t \notin E]}\right). \]

The maximum privacy loss of $t$ at cardinality $n$ is defined as $\varepsilon_t(n) = \max_S \varepsilon_{t,S}(n)$. A cardinality estimator satisfies $\varepsilon$-sketch privacy on average above cardinality $N$ if we have, for all $n \geq N$, $\mathbb{E}_{t \sim \mathcal{U}}\left[\varepsilon_t(n)\right] \leq \varepsilon$.

In this definition, we accept that some users might have less privacy, as long as the average user satisfies our initial privacy definition. Here, we average over all values of $t$, but as described earlier in this section, we could use other averaging functions, which would lead to strictly stronger definitions.

We show that HyperLogLog satisfies this definition, and we discuss the value of $\varepsilon$ for various parameters and their significance. Intuitively, a HyperLogLog cardinality estimator puts every record in a random bucket, and each bucket counts the maximum number of leading zeroes of the records added to this bucket. HyperLogLog cardinality estimators have a parameter $p$ that determines their memory consumption, their accuracy, and, as we will see, their level of average privacy.

Definition 74. Let $h$ be a uniformly distributed hash function. A HyperLogLog cardinality estimator of parameter $p$ is defined as follows. A sketch consists of a list of $2^p$ counters $C_0, \dots, C_{2^p-1}$, all initialized to $0$. When adding a record $e$ to the sketch, we compute $h(e)$, and represent it as a binary string $x_1 x_2 x_3 \dots$. Let $b(e) = \langle x_1 \dots x_p \rangle_2$, i.e., the integer represented by the binary digits $x_1 \dots x_p$, and let $\rho(e)$ be the position of the leftmost $1$-bit in $x_{p+1} x_{p+2} \dots$. Then we update counter $C_{b(e)}$ with $\max(C_{b(e)}, \rho(e))$.

For example, suppose that $p = 4$ and $h(e) = 0110\,0010\dots$: then $b(e) = \langle 0110 \rangle_2 = 6$, and $\rho(e) = 3$ (the position of the leftmost $1$-bit in $0010\dots$). So we must look at the value of the counter $C_6$ and, if $C_6 < 3$, set $C_6$ to $3$.
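The following minimal Python sketch illustrates the update procedure of Definition 74. It is an illustration only, not the implementation evaluated in this chapter: the SHA-256-based hash stands in for the uniformly distributed hash function $h$, and only the first 64 bits of the hash are used.

```python
import hashlib

P = 4                # sketch parameter: 2**P counters
HASH_BITS = 64       # number of hash bits we use for h(e)

def new_sketch() -> list[int]:
    return [0] * 2**P

def add(counters: list[int], record: str) -> None:
    """Add a record to the sketch, as in Definition 74."""
    x = int.from_bytes(hashlib.sha256(record.encode()).digest()[:8], "big")
    b = x >> (HASH_BITS - P)                   # first p bits: bucket index
    suffix = x & ((1 << (HASH_BITS - P)) - 1)  # remaining bits
    # position of the leftmost 1-bit in the suffix (1-indexed);
    # an all-zero suffix yields HASH_BITS - P + 1 by convention
    rho = (HASH_BITS - P) - suffix.bit_length() + 1
    counters[b] = max(counters[b], rho)

sketch = new_sketch()
for i in range(1000):
    add(sketch, f"user-{i}")
```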

Theorem 11. Assuming a sufficiently large $|\mathcal{U}|$, a HyperLogLog cardinality estimator of parameter $p$ satisfies $\varepsilon_n$-sketch privacy on average above cardinality $n$, where
\[ \varepsilon_n = \sum_{k \geq 1} 2^{-k} \ln\left(\left(1 - \left(1 - 2^{-p-k}\right)^n\right)^{-1}\right). \]

Proof. To simplify our analysis, we assume in this proof that $|\mathcal{U}|$ is very large: for all reasonable values of $n$, picking $n$ records uniformly at random in $\mathcal{U}$ is the same as picking a subset of $\mathcal{U}$ of size $n$ uniformly at random. In particular, we approximate $\mathbb{P}[S_E = S \mid t \notin E]$ by $\mathbb{P}[S_E = S]$.

First, we compute $\varepsilon_{t,S}(n)$ for each sketch $S$, by considering the counter values of $S$. We then use this to determine $\varepsilon_t(n)$, and, averaging over all $t$, deduce the desired result.

Step 1: Computing $\varepsilon_{t,S}(n)$

Let $t \in \mathcal{U}$, and let $S$ be a sketch such that $\mathbb{P}[S_E = S \mid t \in E] > 0$. Decompose the binary string $h(t)$ into two parts to get $b = b(t)$ and $\rho = \rho(t)$. Denote by $C_0, \dots, C_{2^p-1}$ the counters of $S$. Note that the event $S_E = S$ is characterized by two conditions:

  • $W_i$ for all $i$, where $W_i$ denotes the condition that if bucket $i$ is non-empty, then a record with this number of leading zeroes was added to this bucket. Formally, $W_i$ holds iff whenever $C_i > 0$, there exists $e \in E$ such that $b(e) = i$ and $\rho(e) = C_i$.
  • $B$, where $B$ denotes the condition that no record has more leading zeroes than the counter for its bucket. Formally, $B$ holds iff for all $e \in E$, $\rho(e) \leq C_{b(e)}$.

Now, we compute the value of $\mathbb{P}[S_E = S \mid t \in E]$, depending on the value of $C_b$. Without loss of generality, we assume that $C_b \geq \rho$: otherwise, condition $B$ fails for $t$, and $\mathbb{P}[S_E = S \mid t \in E] = 0$.

Case 1: Suppose $C_b = \rho$.

We compute the probability of observing $S$ given $t \in E$. Condition $W_b$ is already satisfied by $t$, as $b(t) = b$ and $\rho(t) = \rho = C_b$. So for $S_E$ to be equal to $S$, $E$'s other records only have to satisfy $W_i$ for $i \neq b$, and $B$:
\[ \mathbb{P}[S_E = S \mid t \in E] = \mathbb{P}\Big[B \wedge \bigwedge_{i \neq b} W_i\Big], \]
where the probability on the right-hand side is taken over the $n-1$ other records of $E$.

Next, we compute the probability of observing $S$ given $t \notin E$. This time, all records of $E$ are chosen randomly, so
\[ \mathbb{P}[S_E = S \mid t \notin E] = \mathbb{P}\Big[B \wedge \bigwedge_{i} W_i\Big]. \]

We can decompose this condition: there is a witness for $W_b$ (i.e., a record $e$ with $b(e) = b$ and $\rho(e) = C_b$), and all other records satisfy the same conditions as in the case $t \in E$, namely $B$ and $W_i$ for all $i \neq b$, since this is equivalent to $B$ and $W_i$ for all $i$ by the choice of the witness. If $e$ is chosen uniformly in $\mathcal{U}$, since $h$ is a uniformly distributed hash function, then the following holds.

  • The probability of $b(e) = b$ is $2^{-p}$.
  • The probability of $\rho(e) = C_b$ is $2^{-C_b}$: writing the suffix of $h(e)$ as $x_{p+1} x_{p+2} \dots$, we have $\rho(e) = 1$ with probability $1/2$, $\rho(e) = 2$ with probability $1/4$, etc.

Thus, the probability that a given record witnesses $W_b$ is $2^{-p-C_b}$, as $b(e)$ and $\rho(e)$ are independent. Since the set $E$ has size $n$, there are $n$ chances that such a record is chosen: the probability that at least one record witnesses $W_b$ is $1 - \left(1 - 2^{-p-C_b}\right)^n$. We can thus approximate $\mathbb{P}[S_E = S \mid t \notin E]$ by
\[ \mathbb{P}[S_E = S \mid t \notin E] \approx \left(1 - \left(1 - 2^{-p-\rho}\right)^n\right) \cdot \mathbb{P}\Big[B \wedge \bigwedge_{i \neq b} W_i\Big] \]

and thus
\[ \varepsilon_{t,S}(n) = \ln\left(\left(1 - \left(1 - 2^{-p-\rho}\right)^n\right)^{-1}\right). \]

Case 2: Suppose $C_b > \rho$.

We can compute the probabilities $\mathbb{P}[S_E = S \mid t \in E]$ and $\mathbb{P}[S_E = S \mid t \notin E]$ in a similar fashion.

Since (by hypothesis) $\mathbb{P}[S_E = S \mid t \in E] > 0$, there exist $n-1$ distinct records of $\mathcal{U} \setminus \{t\}$ which, together with $t$, satisfy conditions $B$ and $W_i$ for all $i$. This allows us to bound $\mathbb{P}[S_E = S \mid t \notin E]$ from below. Suppose one record $e$ of $E$ satisfies $b(e) = b$ and $\rho(e) = C_b$, and the other $n-1$ records in $E$ satisfy the conditions $B$ and $W_i$ for all $i \neq b$. Then $E$ satisfies $B$ and $W_i$ for all $i$. The lower bound follows as before:
\[ \mathbb{P}[S_E = S \mid t \notin E] \geq \left(1 - \left(1 - 2^{-p-C_b}\right)^n\right) \cdot \mathbb{P}\Big[B \wedge \bigwedge_{i \neq b} W_i\Big] \]

and thus
\[ \varepsilon_{t,S}(n) \leq \ln\left(\left(1 - \left(1 - 2^{-p-C_b}\right)^n\right)^{-1}\right) \leq \ln\left(\left(1 - \left(1 - 2^{-p-\rho}\right)^n\right)^{-1}\right). \]

We can use the results of both cases and immediately conclude that
\[ \varepsilon_{t,S}(n) \leq \ln\left(\left(1 - \left(1 - 2^{-p-\rho(t)}\right)^n\right)^{-1}\right), \]
and that equality holds if $S$ satisfies $C_{b(t)} = \rho(t)$.

Step 2: Determining $\varepsilon_t(n)$

The previous reasoning shows that the worst case happens when the counter corresponding to $t$'s bucket contains exactly the number of leading zeroes of $t$, i.e., when $C_{b(t)} = \rho(t)$:
\[ \varepsilon_t(n) = \ln\left(\left(1 - \left(1 - 2^{-p-\rho(t)}\right)^n\right)^{-1}\right). \]

We can then average this value over all $t$. Since the hash function is uniformly distributed, half of the values of $t$ satisfy $\rho(t) = 1$, a quarter satisfy $\rho(t) = 2$, etc. Thus
\[ \varepsilon_n = \mathbb{E}_t\left[\varepsilon_t(n)\right] = \sum_{k \geq 1} 2^{-k} \ln\left(\left(1 - \left(1 - 2^{-p-k}\right)^n\right)^{-1}\right). \qquad \square \]
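As a sanity check on this formula, the following numerically stable Python snippet evaluates $\varepsilon_n$; it is our own illustration of the theorem's bound, not code from the original analysis:

```python
import math

def epsilon_n(p: int, n: int, max_k: int = 64) -> float:
    """Average privacy loss eps_n of Theorem 11 for a HyperLogLog
    sketch of parameter p at cardinality n."""
    total = 0.0
    for k in range(1, max_k + 1):
        q = 2.0 ** -(p + k)  # P[a random record witnesses counter value k in t's bucket]
        # hit = 1 - (1 - q)^n, computed stably even for tiny q
        hit = -math.expm1(n * math.log1p(-q))
        total += 2.0 ** -k * -math.log(hit)
    return total

print(epsilon_n(p=9, n=1000))  # roughly 1: a weak guarantee on average
```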

How does this positive result fit practical use cases? Figure 3.9 plots $\varepsilon_n$ for three different HyperLogLog cardinality estimators. It shows two important results.

[Plot omitted; it shows curves for $p = 9$ (5% relative standard error), $p = 15$ (1%), and $p = 21$ (0.1%).]

Figure 3.9: $\varepsilon_n$ as a function of $n$, for HyperLogLog cardinality estimators of different parameters $p$ (and their corresponding relative standard error).

First, cardinality estimators used in practice do not preserve privacy. For example, the default parameter used for production pipelines at Google and on the BigQuery service [48] is $p = 15$. For this value of $p$, an attacker can determine with significant accuracy whether a target was added to a sketch; not only in the worst case, but for the average user too. The average risk only becomes reasonable for cardinalities in the hundreds of thousands, a threshold too large for most data analysis tasks.

Second, by sacrificing some accuracy, it is possible to obtain a reasonable average privacy. For example, a HyperLogLog sketch with $p = 9$ has a relative standard error of about 5%, and an $\varepsilon_n$ of about 1 for $n = 1000$. Unfortunately, even when the average risk is acceptable, some users will still be at a higher risk: users with a large number of leading zeroes are much more identifiable than the average. For example, if $n = 1000$, there is a significant chance that at least one user has $\rho(t) \geq 12$; for such a user, $\varepsilon_t(n)$ is larger than 7, a very high value.
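The tail-risk claim can be checked numerically with the formulas from the proof of Theorem 11 (an illustration with example numbers, using the hypothetical helper below):

```python
import math

def epsilon_t(p: int, n: int, rho: int) -> float:
    """Worst-case privacy loss of a user whose hash has rho leading zeroes."""
    q = 2.0 ** -(p + rho)
    return -math.log(-math.expm1(n * math.log1p(-q)))

n = 1000
# chance that at least one of n users has rho(t) >= 12 (each has probability 2^-11)
print(-math.expm1(n * math.log1p(-2.0 ** -11)))  # about 0.39
print(epsilon_t(p=9, n=n, rho=12))               # about 7.6
```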

Our calculations yield only an approximation of $\varepsilon_n$ that is an upper bound on the actual privacy loss in HyperLogLog sketches. However, these alarming results can be confirmed experimentally. We simulated $\mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E]$, for uniformly random values of $t$, using HyperLogLog sketches with the parameter $p = 15$. For each cardinality $n$, we generated 10,000 different random target values, and added each one to 1,000 HyperLogLog sketches of cardinality $n$ (generated from random values). For each target, we counted the number of sketches that ignored it.
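A condensed version of this experiment fits in a few lines of Python. This is a sketch under assumptions (the SHA-256-based hash used earlier, smaller counts for speed), not the original experimental code:

```python
import hashlib, random

P = 15
HASH_BITS = 64

def bucket_rho(record: str) -> tuple[int, int]:
    x = int.from_bytes(hashlib.sha256(record.encode()).digest()[:8], "big")
    suffix = x & ((1 << (HASH_BITS - P)) - 1)
    return x >> (HASH_BITS - P), (HASH_BITS - P) - suffix.bit_length() + 1

def ignored_fraction(target: str, n: int, n_sketches: int = 100) -> float:
    """Fraction of random n-record sketches left unchanged when adding `target`."""
    ignored = 0
    for _ in range(n_sketches):
        counters: dict[int, int] = {}
        for _ in range(n):
            b, r = bucket_rho(str(random.random()))
            counters[b] = max(counters.get(b, 0), r)
        b, r = bucket_rho(target)
        ignored += r <= counters.get(b, 0)  # sketch unchanged: the target is ignored
    return ignored / n_sketches

print(ignored_fraction("target-user", n=10_000))
```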

Figure 3.10 plots some percentile values. For example, the all-targets curve (100th percentile) has a value of 33% at cardinality $n$ = 10,000. This means that each of the 10,000 random targets was ignored by at most 33% of the 1,000 random sketches of this cardinality, i.e., $\mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E] \leq 33\%$ for all $t$. In other words, an attacker observes with at least 67% probability a change when adding a random target to a random sketch that did not contain it. Similarly, the 10th percentile at $n$ = 10,000 has a value of 3.8%. So 10% of the targets were ignored by at most 3.8% of the sketches, i.e., $\mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E] \leq 3.8\%$ for 10% of all users $t$. That is, for the average user $t$, there is a 10% chance that a sketch with 10,000 records changes with likelihood at least 96.2% when $t$ is first added.

[Plot omitted; it shows $\mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E]$ as a function of $n$, with curves for all targets (100th percentile), the 10th percentile, the 1st percentile, and the 1st permille.]

Figure 3.10: Simulation of $\mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E]$ for uniformly chosen values of $t$, using HyperLogLog sketches with parameter $p = 15$.

For small cardinalities (say, $n \leq 1000$), adding a record that has not yet been added to the sketch will almost certainly modify the sketch: an attacker observing that a sketch does not change after adding $t$ can deduce with near-certainty that $t$ was added previously.

Even for larger cardinalities, there is always a constant number of people with a high privacy loss. For $n$ = 1,000, no target was ignored by more than 5.5% of the sketches. For $n$ = 10,000, 10% of the users were ignored by at most 3.8% of the sketches. Similarly, the 1st percentile at $n$ = 100,000 and the 1st permille at $n$ = 1,000,000 are 4.6% and 4.5%, respectively. In summary, across all cardinalities $n$, at least 1,000 users have $\mathbb{P}[\mathrm{add}(S_E,t) = S_E \mid t \notin E] \leq 5.5\%$. For these users, the corresponding privacy loss is $\varepsilon_t(n) \geq \ln(1/0.055) \approx 2.9$. Concretely, if the attacker initially believes that $\mathbb{P}[t \in E]$ is 1%, this number grows to 15% after observing that $\mathrm{add}(S_E,t) = S_E$. If it is initially 10%, it grows to 66%. And if it is initially 25%, it grows to 86%.
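These posterior values follow from Bayes' theorem, using the fact that a sketch never changes when a record it already contains is re-added. A quick check, taking the 5.5% experimental bound quoted above as the probability that the sketch ignores an absent target (small differences with the numbers in the text are due to rounding):

```python
def posterior(prior: float, p_ignored: float) -> float:
    """Attacker's belief that t was added, after seeing the sketch unchanged.
    Assumes P[sketch unchanged | t in E] = 1."""
    return prior / (prior + (1 - prior) * p_ignored)

for prior in (0.01, 0.10, 0.25):
    print(f"{prior:.0%} -> {posterior(prior, p_ignored=0.055):.0%}")
# prints: 1% -> 16%, 10% -> 67%, 25% -> 86%
```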
