Lowering the cost of anonymization

3.3.2 Private cardinality estimators are imprecise

Let us return to our privacy problem: someone with access to a sketch wants to know whether a given individual belongs to the aggregated individuals in the sketch. Formally, given a target $t$ and a sketch $S_{E}$ , the attacker must guess whether $t \in E$ with high probability. In Section 3.3.2.0, we explain how the attacker can use a simple test to gain significant information if the cardinality estimator is deterministic. Then, in Section 3.3.2.0, we reformulate the main technical lemma in probabilistic terms, and prove an equivalent theorem for probabilistic cardinality estimators.

Deterministic case

Given a target $t$ and a sketch $S_{E}$ , the attacker can perform the following simple attack to guess whether $t \in E$ . They can try to add the target $t$ to the sketch $S_{E}$ , and observe whether the sketch changes. In other words, they check whether $add (S_{E}, t) = S_{E}$ . If the sketch changes, this means with certainty that $t \notin E$ . Thus, Bayes’ law indicates that if $add (S_{E}, t) = S_{E}$ , then the probability of $t \in E$ cannot decrease.
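As an illustration, the attack can be sketched in Python on a toy deterministic sketch. Everything in this example is an assumption made for illustration and is not part of the formalism: the sketch keeps the `K` smallest hash values seen so far (a k-minimum-values scheme), and the names `add`, `build`, and `may_contain` are hypothetical.

```python
import hashlib

K = 4  # hypothetical sketch size: number of hash values retained


def record_hash(record: str) -> int:
    """Deterministic 64-bit hash of a record."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def add(sketch: frozenset, record: str) -> frozenset:
    """Return the sketch obtained by adding one record (keeps the K smallest hashes)."""
    return frozenset(sorted(sketch | {record_hash(record)})[:K])


def build(records) -> frozenset:
    """Build the sketch of a collection of records, starting from the empty sketch."""
    sketch = frozenset()
    for record in records:
        sketch = add(sketch, record)
    return sketch


def may_contain(sketch: frozenset, target: str) -> bool:
    """The attack: add the target and check whether the sketch is unchanged."""
    return add(sketch, target) == sketch
```

On this toy sketch, `may_contain` returns `True` for every record that was aggregated. It returns `False` for a record outside $E$ whenever that record's hash would enter the `K` smallest values, in which case the attacker learns with certainty that $t \notin E$ .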

How large is this increase? Intuitively, it depends on how likely it is that adding a record to a sketch does not change it if the record has not previously been added to the sketch. Formally, it depends on $P [add (S_{E}, t) = S_{E} | t \notin E]$ .

If $P [add (S_{E}, t) = S_{E} | t \notin E]$ is close to $0$ , for example if the sketch is just a list of all records seen so far, then observing that $add (S_{E}, t) = S_{E}$ will lead the attacker to believe with high probability that $t \in E$ .
If $P [add (S_{E}, t) = S_{E} | t \notin E]$ is close to $1$ , it means that adding a record to a sketch often does not change it. The previous attack does not reveal much information. But then, it also means that many records are ignored when they are added to the sketch, that is, the sketch does not change when adding the record. Intuitively, the accuracy of an estimator based solely on a sketch that ignores many records cannot be very good.
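To make these two cases concrete, Bayes' law can be evaluated numerically. The sketch below is only an illustration: the function name `posterior_membership` and the attacker's prior are assumptions introduced here. It relies on the fact that the sketch never changes when $t \in E$ , so $P [add (S_{E}, t) = S_{E} | t \in E] = 1$ .

```python
def posterior_membership(prior: float, q: float) -> float:
    """P[t in E | add(S_E, t) == S_E] by Bayes' law.

    prior: attacker's prior probability that t is in E (assumed > 0).
    q:     P[add(S_E, t) == S_E | t not in E], the chance that a record
           outside E is ignored by the sketch.
    """
    # The likelihood of "sketch unchanged" is 1 when t is in E, and q otherwise.
    return prior / (prior + q * (1.0 - prior))
```

With a prior of $0.5$ , a value of `q` equal to $0$ makes the attacker certain that $t \in E$ , while a value of `q` equal to $1$ leaves the prior unchanged.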

We formalize this intuition in the following theorem.

Figure 3.8: Minimum standard error for a cardinality estimator with $ε$ -sketch privacy above cardinality $100$ (left) and $500$ (right). The dotted line is the relative standard error of HyperLogLog with standard parameters. [Plots omitted: relative standard error, in %, as a function of $n$ , for $ε = 0.2$ , $ε = 0.7$ , and $ε = 1.4$ .]

Theorem 7. An unbiased deterministic cardinality estimator that satisfies $ε$ -sketch privacy above cardinality $N$ is not precise. Namely, its variance is at least $\frac{1 - c^{k}}{c^{k}} (n - k \cdot N)$ , for any $n \geq N$ and any integer $1 \leq k \leq \frac{n}{N}$ , where $c = 1 - e^{- ε}$ .

Proof. The proof consists of three steps, following the intuition previously given.

1.: We show that a sketch $S_{E}$ , computed from a random set $E$ with an $ε$ -sketch private estimator above cardinality $N$ , will ignore many records after $N$ (Lemma 4).
2.: We prove that if a cardinality estimator ignores a certain ratio of records after adding $n = N$ records, then it will ignore an even larger ratio of records as $n$ increases (Lemma 5).
3.: We conclude by proving that an unbiased cardinality estimator that ignores many records must have a large variance (Lemma 6).

The theorem follows directly from these lemmas. □

Note that if we were using differential privacy, this result would be trivial: no deterministic algorithm can ever be differentially private. However, this is not so obvious for our definition of privacy: prior work [36 , 46 , 176] as well as the results in Section 3.2 show that when the attacker is assumed to have some uncertainty about the data, even deterministic algorithms can satisfy the corresponding definition of privacy.

Figure 3.8 shows plots of the lower bound on the standard error of a cardinality estimator with $ε$ -sketch privacy at two cardinalities (100 and 500). It shows that the standard error increases exponentially with the number of records added to the sketch. This demonstrates that even if we require the privacy property for a large value of $N$ (500) and a relatively large $ε$ , the standard error of a cardinality estimator will become unreasonably large after 20,000 records.
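The bound of Theorem 7 is straightforward to evaluate numerically. The following Python sketch takes, for a given $n$ , the best bound over all integers $k$ with $1 \leq k \leq n / N$ . The function names are hypothetical, and this is not claimed to be the code behind Figure 3.8.

```python
import math


def variance_lower_bound(n: int, N: int, eps: float) -> float:
    """Largest value of (1 - c^k) / c^k * (n - k*N) over integers 1 <= k <= n/N."""
    c = 1.0 - math.exp(-eps)
    best = 0.0
    for k in range(1, n // N + 1):
        ck = c ** k
        best = max(best, (1.0 - ck) / ck * (n - k * N))
    return best


def relative_standard_error_bound(n: int, N: int, eps: float) -> float:
    """Lower bound on the standard error, divided by the true cardinality n."""
    return math.sqrt(variance_lower_bound(n, N, eps)) / n
```

For instance, `relative_standard_error_bound(20000, 500, 1.4)` evaluates the bound in the setting discussed above ( $N = 500$ , $ε = 1.4$ , 20,000 records).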

Lemma 4. Let $t \in T$ . A deterministic cardinality estimator with $ε$ -sketch privacy above cardinality $N$ satisfies $P_{n} [add (S_{E}, t) = S_{E} | t \notin E] \geq e^{- ε}$ for $n \geq N$ .

Proof. We first prove that such an estimator also satisfies

P_{n} [add (S_{E}, t) = S_{E} | t \in E] \leq e^{ε} \cdot P_{n} [add (S_{E}, t) = S_{E} | t \notin E] .

We decompose the left-hand side of the inequality over all possible values of $S_{E}$ such that $add (S_{E}, t) = S_{E}$ . If we denote this set by $I_{t} = \{S \mid add (S, t) = S\}$ , we have:

\begin{matrix} P_{n} [add (S_{E}, t) = S_{E} | t \in E] & = \sum_{S \in I_{t}} P_{n} [S_{E} = S | t \in E] \leq e^{ε} \cdot \sum_{S \in I_{t}} P_{n} [S_{E} = S | t \notin E] \leq e^{ε} \cdot P_{n} [add (S_{E}, t) = S_{E} | t \notin E], \end{matrix}

where the first inequality is obtained directly from the definition of $ε$ -sketch privacy.

Now, Lemma 2 gives $P_{n} [add (S_{E}, t) = S_{E} | t \in E] = 1$ , and finally:

P_{n} [add (S_{E}, t) = S_{E} | t \notin E] \geq e^{- ε} .

□

Lemma 5. Let $t \in T$ . Suppose a deterministic cardinality estimator satisfies

P_{n} [add (S_{E}, t) = S_{E} | t \notin E] \geq p

for any $n \geq N$ . Then for any integer $k \geq 1$ , it also satisfies $P_{n} [add (S_{E}, t) = S_{E} | t \notin E] \geq 1 - {(1 - p)}^{k}$ , for $n \geq k \cdot N$ .

Proof. First, note that if $F \subseteq E$ , and $add (S_{F}, t) = S_{F}$ , then $add (S_{E}, t) = S_{E}$ . This is a direct consequence of Lemma 3: $S_{E} = merge (S_{E ∖ F}, S_{F})$ , so:

\begin{matrix} add (S_{E}, t) & = merge (S_{E ∖ F}, add (S_{F}, t)) = merge (S_{E ∖ F}, S_{F}) = S_{E} \end{matrix}

We show next that when $n \geq k \cdot N$ , generating a set $E \in P_{n} (T)$ uniformly randomly can be seen as generating $k$ independent sets in $P_{N} (T)$ , then merging them. Indeed, generating such a set can be done as follows:

1.: For each $i \in \{1, \dots, k\}$ , generate a set $E_{i} \in P_{N} (T)$ uniformly randomly, then take the union of all these sets: $E_{\cup} = \bigcup_{i} E_{i}$ .
2.: If some records appear in multiple $E_{i}$ , the total cardinality might be lower than $n$ , so we need to add records uniformly at random to reach the desired cardinality: calculate $d = | E_{\cup} |$ , then generate a set $E^{'} \in P_{n - d} (T ∖ E_{\cup})$ uniformly randomly.
3.: Add the missing items to complete the set: $E = E_{\cup} \cup E^{'}$ .

Step 1 ensures that we used $k$ independent sets of cardinality $N$ to generate $E$ , and steps 2 and 3 ensure that $E$ has exactly $n$ records.
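The three steps can be written down directly. The following Python sketch is a minimal illustration under stated assumptions: the universe $T$ is passed as a list, the function name `generate_set` is hypothetical, and the code only mirrors the procedure without proving anything about the resulting distribution.

```python
import random


def generate_set(universe: list, n: int, N: int, k: int, rng: random.Random):
    """Build a set E of n records from k independent sets of N records each."""
    assert k * N <= n <= len(universe)
    # Step 1: draw k independent sets of cardinality N and take their union.
    parts = [set(rng.sample(universe, N)) for _ in range(k)]
    union = set().union(*parts)
    # Step 2: draw the n - d missing records uniformly from the rest of the universe.
    d = len(union)
    remaining = [record for record in universe if record not in union]
    extra = set(rng.sample(remaining, n - d))
    # Step 3: complete the set.
    return union | extra, parts
```

The function also returns the intermediate sets $E_{i}$ , so that one can check that each of them is included in $E$ .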

Intuitively, each time we generate a set $E_{i}$ of cardinality $N$ uniformly at random in $T$ , we have one chance that $t$ will be ignored by $E_{i}$ (and thus by $E$ ). So $t$ can be ignored by $S_{E}$ with a certain probability because it was ignored by $S_{E_{1}}$ . Similarly, it can also be ignored because of $S_{E_{2}}$ , etc. Since the choice of $E_{i}$ is independent of the choice of records in $\bigcup_{j \neq i} E_{j}$ , we can rewrite:

\begin{matrix} P_{n} [add (S_{E}, t) \neq S_{E} | t \notin E] & \leq \prod_{i = 1}^{k} P_{n} [add (S_{E_{i}}, t) \neq S_{E_{i}} | t \notin E] \\ & \leq \prod_{i = 1}^{k} (1 - P_{n} [add (S_{E_{i}}, t) = S_{E_{i}} | t \notin E_{i}]) \\ & \leq {(1 - p)}^{k} \end{matrix}

using the hypothesis of the lemma. Thus:

P_{n} [add (S_{E}, t) = S_{E} | t \notin E] \geq 1 - {(1 - p)}^{k} .

□

Lemma 6. Suppose a deterministic cardinality estimator satisfies:

P_{n} [add (S_{E}, t) = S_{E} | t \notin E] \geq 1 - p

for any $n \geq N$ and all $t$ . Then its variance for $n \geq N$ is at least $\frac{1 - p}{p} (n - N)$ .

Proof. The proof’s intuition is as follows. The hypothesis of the lemma requires that the cardinality estimator, on average, ignores a proportion $1 - p$ of new records added to a sketch (once $N$ records have been added): the sketch is not changed when a new record is added. The best thing that the cardinality estimator can do, then, is to store all records that it does not ignore, count the number of unique records among these, and multiply this number by $1 ∕ p$ to correct for the records ignored. It is well-known that estimating the size $k$ of a set based on the size of a uniform sample of sampling ratio $p$ has a variance of $\frac{1 - p}{p} k$ . Hence, our cardinality estimator has a variance of at least $\frac{1 - p}{p} (n - N)$ .

Formalizing this idea requires some additional technical steps, and the proof is decomposed into three steps. In Step 1, we fix the first $N$ records added to the sketch, and we bound the variance of the estimator that has these records as initial input. In Step 2, we explicitly compute the probability for a record $t$ to be ignored by $S_{E}$ , first when $t$ is fixed, then when $E$ is fixed. We then use these results in Step 3 to average the bound over all possible choices for the $N$ records in $E$ , which gives us an overall bound.

Step 1: Bound the estimator variance.

Let $E \in P_{N} (T)$ . Let $X_{E} = \{t \in T \mid add (S_{E}, t) \neq S_{E}\}$ be the set of records that are not ignored by $S_{E}$ . Let $p_{E} = \frac{| X_{E} |}{| T |}$ , or equivalently, let $p_{E} = P_{t} [add (S_{E}, t) \neq S_{E} | t \notin E]$ , where $P_{t}$ is the distribution that picks $t$ uniformly randomly in $T$ .

$X_{E}$ can be seen as the sampling set of the estimator with $E$ as initial input, while $p_{E}$ can be seen as its sampling fraction: all other records are discarded, so the estimator only has access to records in $E$ and $X_{E}$ to compute its estimate.

The optimal estimator to minimize variance in this context simply counts the exact number of distinct records in the sample (remembering each one), and divides this number by $p_{E}$ to estimate the total number of distinct records.²

What is the variance $V_{n | E}$ of this optimal estimator? Suppose we added $n - N$ records after reaching the sampling part. The number of records in the sample is a random variable with variance $p_{E} (1 - p_{E}) (n - N)$ . Dividing this random variable by $p_{E}$ gives a variance of $\frac{1 - p_{E}}{p_{E}} \cdot (n - N)$ . Thus:

V_{n | E} \geq \frac{1 - p_{E}}{p_{E}} \cdot (n - N) .
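The variance of the rescaled sample count can be checked exactly on small cases. The sketch below assumes, as in the argument above, that each of the $m = n - N$ new records lands in the sample independently with probability $p_{E}$ , so that the sample size follows a binomial distribution. The function name is hypothetical.

```python
import math


def rescaled_count_variance(m: int, p: float) -> float:
    """Exact variance of (sample size / p) when the sample size is Binomial(m, p)."""
    # Probability mass function of the number of sampled records.
    pmf = [math.comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(m + 1)]
    # The estimate for a sample of size j is j / p.
    mean = sum(w * j / p for j, w in enumerate(pmf))
    return sum(w * (j / p - mean) ** 2 for j, w in enumerate(pmf))
```

For $m = 50$ and $p_{E} = 0.25$ , this agrees with the closed form $\frac{1 - p_{E}}{p_{E}} \cdot m = 150$ up to floating-point error.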

Since the first $N$ records are chosen uniformly at random, the variance of the overall estimator is bounded below by the average of these bounds over all choices of $E$ . If we denote by $P_{N} (T)$ the set of all possible subsets $E \subseteq T$ of cardinality $N$ , we have:

V_{n} \geq {avg}_{E \in P_{N} (T)} \frac{1 - p_{E}}{p_{E}} (n - N)

(3.1)

where $avg$ stands for the average.

Step 2: Intermediary results.

Fix $E \in P_{N} (T)$ . Denoting by $1_{X}$ the function whose value is 1 if $X$ is satisfied and 0 otherwise,

\begin{matrix} P_{t} [add (S_{E}, t) \neq S_{E} | t \notin E] & = \frac{P_{t} [add (S_{E}, t) \neq S_{E}, t \notin E]}{P_{t} [t \notin E]} = \frac{| T |}{| T | - N} \cdot {avg}_{t \in T} (1_{add (S_{E}, t) \neq S_{E}}) . & (3.2) \end{matrix}

Indeed, $P_{t} [t \notin E] = \frac{| T | - N}{| T |}$ is straightforward, and since $t \in E$ implies $add (S_{E}, t) = S_{E}$ , the condition $add (S_{E}, t) \neq S_{E}, t \notin E$ can be simplified to $add (S_{E}, t) \neq S_{E}$ .

Now, fix $t \in T$ . Then

\begin{matrix} P_{N} [add (S_{E}, t) \neq S_{E} | t \notin E] & = \frac{P_{N} [add (S_{E}, t) \neq S_{E}, t \notin E]}{P_{N} [t \notin E]} = \frac{| T |}{| T | - N} \cdot {avg}_{E \in P_{N} (T)} (1_{add (S_{E}, t) \neq S_{E}}) . & (3.3) \end{matrix}

Indeed, we have

\begin{matrix} P_{N} [t \notin E] & = \binom{| T | - 1}{N} \cdot {\binom{| T |}{N}}^{- 1} = \frac{(| T | - 1)!}{(| T | - 1 - N)! N!} \cdot \frac{(| T | - N)! N!}{| T |!} = \frac{| T | - N}{| T |} . \end{matrix}
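This identity can be double-checked with exact rational arithmetic: among the subsets of $T$ of cardinality $N$ , those avoiding a fixed $t$ are exactly the subsets of $T ∖ \{t\}$ of cardinality $N$ . The function name below is hypothetical.

```python
from fractions import Fraction
from math import comb


def prob_target_absent(universe_size: int, N: int) -> Fraction:
    """P_N[t not in E]: fraction of size-N subsets of T that avoid a fixed record t."""
    return Fraction(comb(universe_size - 1, N), comb(universe_size, N))
```

For example, `prob_target_absent(10, 3)` equals `Fraction(7, 10)`, which matches $\frac{| T | - N}{| T |}$ .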

Step 3: Conclude using convexity.

We now prove that ${avg}_{E \in P_{N} (T)} (p_{E}) \leq p$ . Our initial hypothesis states that for all $t$ , we have $p \geq P_{N} [add (S_{E}, t) \neq S_{E} | t \notin E]$ . We can average this for every $t$ and use (3.3) and (3.2):

\begin{matrix} p & \geq {avg}_{t \in T} (P_{N} [add (S_{E}, t) \neq S_{E} | t \notin E]) \\ & \overset{(3.3)}{=} {avg}_{t \in T} (\frac{| T |}{| T | - N} \cdot {avg}_{E \in P_{N} (T)} (1_{add (S_{E}, t) \neq S_{E}})) \\ & = {avg}_{E \in P_{N} (T)} (\frac{| T |}{| T | - N} \cdot {avg}_{t \in T} (1_{add (S_{E}, t) \neq S_{E}})) \\ & \overset{(3.2)}{=} {avg}_{E \in P_{N} (T)} (P_{t} [add (S_{E}, t) \neq S_{E} | t \notin E]) \\ & \geq {avg}_{E \in P_{N} (T)} (p_{E}) & (3.4) \end{matrix}

Now, using (3.1) and (3.4) along with the fact that the function $x \to \frac{1 - x}{x} \cdot (n - N)$ is convex and decreasing on $(0, 1)$ , we can conclude by Jensen’s inequality that

V_{n} \geq \frac{1 - p}{p} (n - N) .
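The chain of inequalities behind this last step can be illustrated numerically. In the sketch below, the sampling fractions `p_values` and the bound `p` are made-up numbers chosen so that their average is at most `p`, and the function names are hypothetical.

```python
def f(x: float, n: int, N: int) -> float:
    """The convex, decreasing function x -> (1 - x) / x * (n - N) on (0, 1)."""
    return (1.0 - x) / x * (n - N)


def jensen_chain(p_values: list, p: float, n: int, N: int):
    """Return (avg of f(p_E), f(avg of p_E), f(p)) for hypothetical sampling fractions."""
    avg_p = sum(p_values) / len(p_values)
    assert avg_p <= p  # corresponds to inequality (3.4)
    avg_f = sum(f(x, n, N) for x in p_values) / len(p_values)
    return avg_f, f(avg_p, n, N), f(p, n, N)
```

Convexity gives the first inequality (the average of $f$ is at least $f$ of the average), and monotonicity gives the second, so the three returned values are in non-increasing order.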

□

All existing cardinality estimators satisfy our axioms and their standard error remains low even for large values of $n$ . Theorem 7 shows, for all of them, that there are some users whose privacy loss is significant. In Section 3.3.3.0, we quantify this precisely for HyperLogLog.

Probabilistic case

Algorithms that add noise to their output, or more generally, are allowed to use a source of randomness, are often used in privacy contexts. As such, even though all cardinality estimators used in practical applications are deterministic, it is reasonable to hope that a probabilistic cardinality estimator could satisfy our very weak privacy definition. Unfortunately, this is not the case.

In the deterministic case, we showed that for any record $t$ , the probability that $t$ has an influence on a random sketch $S$ decreases exponentially with the sketch size. Or, equivalently, the distribution of sketches of size $kn$ that do not contain $t$ is “almost the same” (up to a density of probability ${(1 - e^{- ε})}^{k}$ ) as the distribution of sketches of the same size, but containing $t$ .

The following lemma establishes the same result in the probabilistic setting. Instead of reasoning about the probability that a record $t$ is “ignored” by a sketch $S$ , we reason about the probability that $t$ has a meaningful influence on this sketch. We show that this probability decreases exponentially, even if $P [S \neq add (S, t)]$ is very high.

First, we prove a technical lemma on the structure that the $merge$ operation imposes on the space of sketch distributions. Then, we find an upper bound on the “meaningful influence” of a record $t$ , when added to a random sketch of cardinality $n$ . We then use this upper bound, characterized using the statistical distance, to show that the estimator variance is as imprecise as for the deterministic case.

Definition 69. Let $V$ be the real vector space spanned by the family $\{σ_{E} \mid E \subseteq T\}$ (seen as vectors of $\mathbb{R}^{S}$ ). For any probability distributions $σ, σ^{'} \in V$ , we denote $σ \cdot σ^{'} = merge (σ, σ^{'})$ . We show in Lemma 7 that this notation makes sense: on $V$ , we can do computations as if $merge$ was a multiplicative operation.

Lemma 7. The $merge$ operation defines a commutative and associative algebra on $V$ .

Proof. By the properties required from probabilistic cardinality estimators in Definition 67, the $merge$ operation is commutative and associative on the family $\{σ_{E} \mid E \subseteq T\}$ . By linearity of the $merge$ operation, these properties are preserved for any linear combination of vectors $σ_{E}$ . □
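To see what computing with $merge$ as a product looks like, here is a minimal Python sketch on a toy model. It rests on assumptions that are not taken from the definitions above: sketches are modelled as frozensets, $merge$ on sketches is set union, and $merge$ on two distributions is the distribution of the merge of two independent draws. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def merge_sketches(a: frozenset, b: frozenset) -> frozenset:
    """Toy merge operation on sketches: set union."""
    return a | b


def merge_distributions(sigma: dict, tau: dict) -> dict:
    """Bilinear extension of merge to distributions over sketches.

    Distributions are dicts mapping a sketch to its probability.
    """
    out = defaultdict(float)
    for (s1, p1), (s2, p2) in product(sigma.items(), tau.items()):
        out[merge_sketches(s1, s2)] += p1 * p2
    return dict(out)
```

In this toy model, `merge_distributions(sigma, tau)` and `merge_distributions(tau, sigma)` coincide, and merging a distribution supported on sketches that already contain a record with the point mass on that record's sketch leaves it unchanged, which is the idempotence property used in the proof of Lemma 8.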

Lemma 8. Suppose a cardinality estimator satisfies $ε$ -sketch privacy above cardinality $N$ , and let $t \in T$ . Let $σ_{out, n}$ be the distribution of sketches obtained by adding $n$ uniformly random records of $T ∖ \{t\}$ into $S_{\emptyset}$ (or, equivalently, $σ_{out, n} (S) = P_{n} [S_{E} = S | t \notin E]$ ). Then:

υ (σ_{out, kn}, add (σ_{out, kn}, t)) \leq {(1 - e^{- ε})}^{k}

where $υ$ is the statistical distance between probability distributions.
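For finite distributions, the statistical distance is half the $L_{1}$ distance between the probability vectors, which is the form used at the end of the proof. A minimal Python sketch, assuming distributions are represented as dictionaries from sketches to probabilities (the function name is hypothetical):

```python
def statistical_distance(sigma: dict, tau: dict) -> float:
    """Half the L1 distance between two finite probability distributions."""
    support = set(sigma) | set(tau)
    return 0.5 * sum(abs(sigma.get(s, 0.0) - tau.get(s, 0.0)) for s in support)
```

The distance is $0$ for identical distributions and $1$ for distributions with disjoint supports.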

Proof. Let $σ_{in, n}$ be the distribution of sketches obtained by adding $t$ , then $n - 1$ uniformly random records of $T$ into $S_{\emptyset}$ (or, equivalently, $σ_{in, n} (S) = P_{n} [S_{E} = S | t \in E]$ ). Then the definition of $ε$ -sketch privacy gives that for every sketch $S$ , $σ_{out, n} (S) \geq e^{- ε} σ_{in, n} (S)$ . So we can express $σ_{out, n}$ as the sum of two distributions:

σ_{out, n} = e^{- ε} σ_{in, n} + (1 - e^{- ε}) ς

for a certain distribution $ς$ .

First, we show that $σ_{out, kn} = {(σ_{out, n})}^{k} \cdot σ$ for a certain distribution $σ$ . Indeed, to generate a sketch of cardinality $kn$ that does not contain $t$ uniformly randomly, one can use the following process.

1.: Generate $k$ random sketches of cardinality $n$ which do not contain $t$ , and merge them.
2.: For all $E \subseteq T$ , denote by $p_{E}$ the probability that the $k$ sketches were generated with the records in $E$ . There might be “collisions” between the $k$ sketches: if several sketches were generated using the same record, $| E | < kn$ . When this happens, we need to “correct” the distribution, and add additional records. Enumerating all the options, we denote $σ = \sum p_{E} σ_{E, nk}^{c}$ , where $σ_{E, nk}^{c}$ is obtained by adding $nk - | E |$ uniformly random records in $T ∖ E$ to $S_{\emptyset}$ . Thus, $σ_{out, kn} = {(σ_{out, n})}^{k} \cdot σ$ .

Note that these distributions are in $V$ : we can write $σ_{out, n} = {avg}_{E \in P_{n} (T), t \notin E} σ_{E}$ , and similarly for $σ_{in, n} = {avg}_{E \in P_{n} (T), t \in E} σ_{E}$ , as well as $ς = {(1 - e^{- ε})}^{- 1} (σ_{out, n} - e^{- ε} σ_{in, n})$ , etc. Thus:

\begin{matrix} σ_{out, kn} & = {(σ_{out, n})}^{k} \cdot σ = {(e^{- ε} σ_{in, n} + (1 - e^{- ε}) ς)}^{k} \cdot σ = \sum_{i = 0}^{k} \binom{k}{i} e^{- i \cdot ε} {(1 - e^{- ε})}^{k - i} σ_{in, n}^{i} \cdot ς^{k - i} \cdot σ . \end{matrix}

Denoting $α = \sum_{i = 1}^{k} \binom{k}{i} e^{- i \cdot ε} {(1 - e^{- ε})}^{k - i} σ_{in, n}^{i - 1} \cdot ς^{k - i} \cdot σ$ and $β = ς^{k} \cdot σ$ , this gives us:

σ_{out, kn} = α \cdot σ_{in, n} + {(1 - e^{- ε})}^{k} β .
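As a sanity check, the decomposition can be written out for $k = 2$ . This worked case is only an illustration; in the block below, `\varsigma` stands for the distribution $ς$ .

```latex
\begin{aligned}
\sigma_{out,2n}
  &= \left(e^{-\varepsilon}\sigma_{in,n} + (1-e^{-\varepsilon})\varsigma\right)^{2}\cdot\sigma \\
  &= \left(e^{-2\varepsilon}\sigma_{in,n}^{2}
     + 2e^{-\varepsilon}(1-e^{-\varepsilon})\,\sigma_{in,n}\cdot\varsigma
     + (1-e^{-\varepsilon})^{2}\varsigma^{2}\right)\cdot\sigma \\
  &= \underbrace{\left(e^{-2\varepsilon}\sigma_{in,n}
     + 2e^{-\varepsilon}(1-e^{-\varepsilon})\,\varsigma\right)\cdot\sigma}_{\alpha}\cdot\,\sigma_{in,n}
     + (1-e^{-\varepsilon})^{2}\,\underbrace{\varsigma^{2}\cdot\sigma}_{\beta}
\end{aligned}
```

Every term except the last contains at least one factor $σ_{in, n}$ , and the remaining term carries the weight ${(1 - e^{- ε})}^{2}$ .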

Finally, we can compute $add (σ_{out, kn}, t)$ :

\begin{matrix} add (σ_{out, kn}, t) & = α \cdot σ_{in, n} \cdot σ_{\{t\}} + {(1 - e^{- ε})}^{k} β \cdot σ_{\{t\}} = α \cdot σ_{in, n} + {(1 - e^{- ε})}^{k} β \cdot σ_{\{t\}} \end{matrix}

Note that since $σ_{in, n} = {avg}_{E \in P_{n} (T), t \in E} σ_{E}$ , we have $σ_{in, n} \cdot σ_{\{t\}} = σ_{in, n}$ by idempotence, and:

\begin{matrix} υ (σ_{out, kn}, add (σ_{out, kn}, t)) & = \frac{1}{2} {∥ σ_{out, kn} - add (σ_{out, kn}, t) ∥}_{1} \\ & = \frac{1}{2} {∥ {(1 - e^{- ε})}^{k} β - {(1 - e^{- ε})}^{k} β \cdot σ_{\{t\}} ∥}_{1} \\ & \leq \frac{{(1 - e^{- ε})}^{k}}{2} ({∥ β ∥}_{1} + {∥ β \cdot σ_{\{t\}} ∥}_{1}) \\ & \leq {(1 - e^{- ε})}^{k} . \end{matrix}

□

Lemma 8 is the probabilistic equivalent of Lemmas 4 and 5. Now, we state the equivalent of Lemma 6, and explain why its intuition still holds in the probabilistic case.

Lemma 9. Suppose that a cardinality estimator satisfies, for any $n \geq N$ and all $t$ :

υ (σ_{out, n}, add (σ_{out, n}, t)) \leq p .

Then its variance for $n \geq N$ is at least $\frac{1 - p}{p} (n - N)$ .

Proof. The condition “ $υ (σ_{out, n}, add (σ_{out, n}, t)) \leq p$ ” is equivalent to the condition of Lemma 6: with probability $(1 - p)$ , the cardinality estimator “ignores” a new record $t$ when it is added to a sketch. Just like in Lemma 6’s proof, we can convert this constraint into estimating the size of a set based on a sampling set. The best known estimator for this problem is deterministic, so allowing the cardinality estimator to be probabilistic does not help improve the optimal variance.

The same result as in Lemma 6 follows. □

Lemmas 8 and 9 together immediately lead to the equivalent of Theorem 7 in the probabilistic case.

Theorem 8. An unbiased probabilistic cardinality estimator that satisfies $ε$ -sketch privacy above cardinality $N$ is not precise. Namely, its variance is at least $\frac{1 - c^{k}}{c^{k}} (n - k \cdot N)$ , for any $n \geq N$ and any integer $1 \leq k \leq \frac{n}{N}$ , where $c = 1 - e^{- ε}$ .

Somewhat surprisingly, allowing the algorithm to add noise to the data seems to be pointless from a privacy perspective. Indeed, given the same privacy guarantee, the lower bound on accuracy is the same for deterministic and probabilistic cardinality estimators. This suggests that the constraints of these algorithms (idempotence and commutativity) require them to somehow keep a trace of who was added to the sketch (at least for some users), which is fundamentally incompatible with even weak notions of privacy.

²The optimality of an estimator under these constraints is proven e.g. in [62].
