Let us return to our privacy problem: someone with access to a sketch wants to know whether a given individual belongs to the aggregated individuals in the sketch. Formally, given a target and a sketch , the attacker must guess whether with high probability. In Section 3.3.2, we explain how the attacker can use a simple test to gain significant information if the cardinality estimator is deterministic. Then, in Section 3.3.2, we reformulate the main technical lemma in probabilistic terms, and prove an equivalent theorem for probabilistic cardinality estimators.
Given a target and a sketch , the attacker can perform the following simple attack to guess whether . They can try to add the target to the sketch , and observe whether the sketch changes. In other words, they check whether . If the sketch changes, this means with certainty that . Thus, Bayes’ law indicates that if , then the probability of cannot decrease.
How large is this increase? Intuitively, it depends on how likely it is that adding an record to a sketch does not change it if the record has not previously been added to the sketch. Formally, it depends on .
- If is close to , for example if the sketch is just a list of all records seen so far, then observing that will lead the attacker to believe with high probability that .
- If is close to , it means that adding an record to a sketch often does not change it. The previous attack does not reveal much information. But then, it also means that many records are ignored when they are added to the sketch, that is, the sketch does not change when adding the record. Intuitively, the accuracy of an estimator based solely on a sketch that ignores many records cannot be very good.
We formalize this intuition in the following theorem.
Proof. The proof is comprised of three steps, following the intuition previously given.
- We show that a sketch , computed from a random set with an -sketch private estimator above cardinality , will ignore many records after (Lemma 4).
- We prove that if a cardinality estimator ignores a certain ratio of records after adding records, then it will ignore an even larger ratio of records as increases (Lemma 5).
- We conclude by proving that an unbiased cardinality estimator that ignores many records must have a large variance (Lemma 6).
The theorem follows directly from these lemmas. □
Note that if we were using differential privacy, this result would be trivial: no deterministic algorithm can ever be differentially private. However, this is not so obvious for our definition of privacy: prior work [36, 46, 176] as well as the results in Section 3.2 show that when the attacker is assumed to have some uncertainty about the data, even deterministic algorithms can satisfy the corresponding definition of privacy.
Figure 3.8 shows plots of the lower bound on the standard error of a cardinality estimator with -sketch privacy at two cardinalities (100 and 500). It shows that the standard error increases exponentially with the number of records added to the sketch. This demonstrates that even if we require the privacy property for a large value of (500) and a relatively large , the standard error of a cardinality estimator will become unreasonably large after 20,000 records.
Proof. We first prove that such an estimator also satisfies
We decompose the left-hand side of the inequality over all possible values of which that . If we abbreviate this set , we have:
where the first inequality is obtained directly from the definition of -sketch privacy.
Now, Lemma 2 gives , and finally:
for any . Then for any integer , it also satisfies , for .
Proof. First, note that if , and , then . This is a direct consequence of Lemma 3: , so:
We show next that when , generating a set uniformly randomly can be seen as generating independent sets in , then merging them. Indeed, generating such a set can be done by as follows:
- For each , generate a set uniformly randomly, then take the union of all these sets: .
- If some records appear in multiple , the total cardinality might be lower than , so we need add records uniformly at random to reach the desired cardinality: calculate , then generate a set uniformly randomly.
- Add the missing items to complete the set: .
Step 1 ensures that we used independent sets of cardinality to generate , and step 2 and 3 ensure that has exactly records.
Intuitively, each time we generate a set of cardinality uniformly at random in , we have one chance that will be ignored by (and thus by ). So can be ignored by with a certain probability because it was ignored by . Similarly, it can also be ignored because of , etc. Since the choice of is independent of the choice of records in , we can rewrite:
using the hypothesis of the lemma. Thus:
for any and all . Then its variance for is at least .
Proof. The proof’s intuition is as follows. The hypothesis of the lemma requires that the cardinality estimator, on average, ignores a proportion of new records added to a sketch (once records have been added): the sketch is not changed when a new record is added. The best thing that the cardinality estimator can do, then, is to store all records that it does not ignore, count the number of unique records among these, and multiply this number by to correct for the records ignored. It is well-known that estimating the size of a set based on the size of a uniform sample of sampling ratio has a variance of . Hence, our cardinality estimator has a variance of at least .
Formalizing this idea requires some additional technical steps, and the proof is decomposed into three steps. In Step 1, we fix the first records added to the sketch, and we bound the variance of the estimator that has these records as initial input. In Step 2, we explicitly compute the probability for an record to be ignored by , first when is fixed, then when is fixed. We then use these results in Step 3 to average the bound over all possible choices for the records in , which gives us an overall bound.
Step 1: Bound the estimator variance.
Let . Let be the set of records that are not ignored by . Let , or equivalently, let , where is the distribution that picks uniformly randomly in .
can be seen as the sampling set of the estimator with as initial input, while can be seen as its sampling fraction: all other records are discarded, so the estimator only has access to records in and to compute its estimate.
The optimal estimator to minimize variance in this context simply counts the exact number of distinct records in the sample (remembering each one), and divides this number by to estimate the total number of distinct records.2 What is the variance of this optimal estimator? Suppose we added records after reaching the sampling part. The number of records in the sample is a random variable with variance . Dividing this random variable by gives a variance of . Thus:
Since the first records are chosen uniformly at random, the variance of the overall estimator is bounded by their average. If we denote by the set of all possible subsets of cardinality , we have:
where stands for the average.
Step 2: Intermediary results.
Fix . Denoting by the function whose value is 1 if is satisfied and 0 otherwise,
Indeed, is straightforward, and since implies , the condition can be simplified to .
Now, fix . Then
Indeed, we have
Step 3: Conclude using convexity.
All existing cardinality estimators satisfy our axioms and their standard error remains low even for large values of . Theorem 7 shows, for all of them, that there are some users whose privacy loss is significant. In Section 3.3.3, we quantify this precisely for HyperLogLog.
Algorithms that add noise to their output, or more generally, are allowed to use a source of randomness, are often used in privacy contexts. As such, even though all cardinality estimators used in practical applications are deterministic, it is reasonable to hope that a probabilistic cardinality estimator could satisfy our very weak privacy definition. Unfortunately, this is not the case.
In the deterministic case, we showed that for any record , the probability that has an influence on a random sketch decreases exponentially with the sketch size. Or, equivalently, the distribution of sketches of size that do not contain is “almost the same” (up to a density of probability ) as the distribution of sketches of the same size, but containing .
The following lemma establishes the same result in the probabilistic setting. Instead of reasoning about the probability that an record is “ignored” by a sketch , we reason about the probability that has a meaningful influence on this sketch. We show that this probability decreases exponentially, even if is very high.
First, we prove a technical lemma on the structure that the operation imposes on the space of sketch distributions. Then, we find an upper bound on the “meaningful influence” of an record , when added to a random sketch of cardinality . We then use this upper bound, characterized using the statistical distance, to show that the estimator variance is as imprecise as for the deterministic case.
Definition 69. Let be the real vector space spanned by the family (seen as vectors of ). For any probability distributions , we denote . We show in Lemma 7 that this notation makes sense: on , we can do computations as if was a multiplicative operation.
Proof. By the properties required from probabilistic cardinality estimators in Definition 67, the operation is commutative and associative on the family . By linearity of the operation, these properties are preserved for any linear combination of vectors . □
Lemma 8. Suppose a cardinality estimator satisfies -sketch privacy above cardinality , and let . Let be the distribution of sketches obtained by adding uniformly random records of into (or, equivalently, ). Then:
where is the statistical distance between probability distributions.
Proof. Let be the distribution of sketches obtained by adding , then uniformly random records of into (or, equivalently, ). Then the definition of -sketch privacy gives that for every sketch , . So we can express as the sum of two distributions:
for a certain distribution .
First, we show that for a certain distribution . Indeed, to generate a sketch of cardinality that does not contain uniformly randomly, one can use the following process.
- Generate random sketches of cardinality which do not contain , and merge them.
- For all , denote by the probability that the sketches were generated with the records in . There might be “collisions” between the sketches: if several sketches were generated using the same record, . When this happens, we need to “correct” the distribution, and add additional records. Enumerating all the options, we denote , where is obtained by adding uniformly random records in to . Thus, .
Note that these distributions are in : we can write , and similarly for , as well as , etc. Thus:
Denoting and , this gives us:
Finally, we can compute :
Note that since , we have by idempotence, and:□
Then its variance for is at least .
Proof. The condition “” is equivalent to the condition of Lemma 6: with probability , the cardinality estimator “ignores” when a new record is added to a sketch. Just like in Lemma 6’s proof, we can convert this constraint into estimating the size of a set based on a sampling set. The best known estimator for this problem is deterministic, so allowing the cardinality estimator to be probabilistic does not help improving the optimal variance.
The same result than in Lemma 6 follows. □
Somewhat surprisingly, allowing the algorithm to add noise to the data seems to be pointless from a privacy perspective. Indeed, given the same privacy guarantee, the lower bound on accuracy is the same for deterministic and probabilistic cardinality estimators. This suggests that the constraints of these algorithms (idempotence and commutativity) require them to somehow keep a trace of who was added to the sketch (at least for some users), which is fundamentally incompatible with even weak notions of privacy.