A corollary of Theorem 7 and of our analysis of Section 3.3.3 is that the cardinality estimators used in practice do not preserve privacy. How can we best protect cardinality estimator sketches against insider threats,in realistic settings? Of course, classical data protection techniques are relevant: encryption, access controls, auditing of manual accesses, etc. But in addition to these best practices, cardinality estimators like HyperLogLog allow for specific risk mitigation techniques, which restrict the attacker’s capabilities.
As explained in Section 3.3.1, most cardinality estimators use a hash function as the first step of the operation: only depends on and the hash value . This hash can be salted with a secret value. This salt can be made inaccessible to humans, with access controls restricting access to production binaries compiled from trusted code. Thus, an adversary cannot learn all the relevant parameters of the cardinality estimator and can no longer add users to sketches. Of course, to avoid a salt reconstruction attack, a cryptographic hash function must be used.
The use of a salt does not hinder the usefulness of sketches: they can still be merged (for all cardinality estimators given as examples in Section 3.3.1) and the cardinality can still be estimated without accuracy loss. However, if an attacker gains direct access to a sketch with the aim of targeting a user and does not know the secret salt, then they cannot compute and therefore cannot compute . This prevents the previous obvious attack of adding to and observing whether the result is different.
However, this solution has two issues.
- Secret salt rotation: the secret salt must be the same for all sketches as otherwise sketches cannot be merged. Indeed, if a hash function is used to create a sketch and is used to create , then if for some that is added to both and , will be seen as a different user in and : the cardinality estimator no longer ignores duplicates. Good key management practices also recommend regularly rotating secret keys. In this context, changing the key requires recomputing all previously computed sketches. This requires keeping the original raw data, makes pipelines more complex, and can be computationally costly.
- Sketch intersection: for most cardinality estimators given as examples in Section 3.3.1, it
is possible for an attacker to guess
from a family of sketches (,
for which the attacker knows that .
For example, intersecting the lists stored in K-Minimum Values sketches can provide
information on which hashes come from users that have been in all sketches. For HyperLogLog,
one can use the leading zeroes in non-empty buckets to get partial information on the hash
value of users who are in all sketches. Moreover, HyperLogLog++  has a sparse
mode that stores full hashes when the sketch contains a small number of values; this makes
intersection attacks even easier.
Intersection attacks are realistic, although they are significantly more complex than simply checking if . In our running example, sketches come from counting users across locations and time periods. If an internal attacker wants to target someone they know, they can gather information about where they went using side channels like social media posts. This gives them a series of sketches that they know their target belongs to, and from these, they can get information on and use it to perform an attack equivalent to checking whether .
Another possible risk mitigation technique is homomorphic encryption. Each sketch could be encrypted in a way that allows sketches to be merged, and new records to be added; while ensuring that an attacker cannot do any operation without some secret key. Homomorphic encryption typically has significant overhead, so it is likely to be too costly for most use cases. Our impossibility results assume a computationally unbounded attacker; however, it is possible that an accurate sketching mechanism using homomorphic encryption could provide privacy against polynomial-time attackers. We leave this area of research for future work.
Using cardinality estimator sketches to perform data analysis tasks only requires access to two operations: and . So a simple option is to process the sketches over an API that only allows this type of operation. One option is to provide a SQL engine on a database, and only allow SQL functions that correspond to and over the column containing sketches. In the BigQuery SQL engine, this corresponds to allowing HLL_COUNT.MERGE and HLL_COUNT.EXTRACT functions, but not other functions over the column containing sketches . Thus, the attacker cannot access the raw sketches.
Under this technique, an attacker who only has access to the API can no longer directly check whether . Since they do not have access to the sketch internals, they cannot perform the intersection attack described previously either. To perform the check, their easiest option is to impersonate their target within the service, interact with the service so that a sketch containing only their target is created in the sketch database, and compare the estimates obtained from and . Following the intuition given in Section 3.3.2, if these estimates are the same, then the target is more likely to be in the dataset. How much information the attacker gets this way depends on . We can increase this quantity by rounding the result of the operation, thus limiting the accuracy of the external attacker. This would make the attack described in this work slightly more difficult to execute, and less efficient. However, it is likely that the attack could be adapted, for example by repeating it multiple times with additional fake records.
This risk mitigation technique can be combined with the previous one. The restricted API protects the sketches during normal use by data analysts, i.e., against external attackers. The hash salting mitigates the risk of manual access to the sketches, e.g., by internal attackers. This type of direct access is not needed for most data analysis tasks, so it can be monitored via other means.