Lowering the cost of anonymization

a PhD thesis

3.3  Limits: cardinality estimation

So far, we showed that considering an uncertain attacker could lead to positive results about the privacy of noiseless aggregations. In this section, we show that this setting can also be used to derive negative results: some problems simply cannot be solved in a privacy-preserving way, even under the assumption that the attacker has complete uncertainty over the original data.

The problem we now focus on is cardinality estimation. Cardinality estimators, like HyperLogLog, are algorithms that summarize a large multiset into a compact data structure called a sketch. A sketch can be used to estimate the number of distinct records in the multiset, and sketches can also be merged, counting duplicates across sketches only once. Their widespread use in privacy-sensitive contexts leads to a natural question: can they be made anonymous according to some reasonable privacy notion?
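To make the add/merge/estimate interface concrete, here is a minimal k-minimum-values (KMV) sketch. This is only an illustration, not the abstract construction analyzed in this section, and not how HyperLogLog works internally; the class name, the parameter `k`, and the choice of SHA-256 are all arbitrary choices for the example.

```python
import hashlib

class KMVSketch:
    """Toy k-minimum-values cardinality sketch: it keeps the k smallest
    hash values seen. If items are hashed uniformly into [0, 1), the
    k-th smallest hash is around k/n for n distinct items, which lets
    us estimate n. HyperLogLog uses a different statistic but exposes
    the same add/merge/estimate operations."""

    def __init__(self, k=128):
        self.k = k
        self.mins = set()  # the k smallest hash values, as floats in [0, 1)

    @staticmethod
    def _hash(item):
        digest = hashlib.sha256(str(item).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        self.mins.add(self._hash(item))
        if len(self.mins) > self.k:
            self.mins.remove(max(self.mins))

    def merge(self, other):
        # Union of the stored minima, truncated back to the k smallest:
        # duplicates across sketches hash identically, so they count once.
        merged = KMVSketch(self.k)
        merged.mins = set(sorted(self.mins | other.mins)[: self.k])
        return merged

    def estimate(self):
        if len(self.mins) < self.k:
            return len(self.mins)  # exact while under capacity
        return int((self.k - 1) / max(self.mins))
```

Note that merging is lossless in a precise sense: merging the sketches of two multisets yields exactly the sketch of their union. This is the aggregation property that the negative result below hinges on.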

In this section, we show that the answer to this question is negative. We formulate an abstract notion of cardinality estimators that captures their aggregation requirement: one can merge sketches without losing precision. We show that for all cardinality estimators satisfying this aggregation requirement, sketches are almost as sensitive as raw data. Indeed, even in a setting where the attacker is assumed to have zero background knowledge about the input data, they can still gain significant knowledge about their target by looking at the output of the algorithm.

Our results are a natural consequence of the aggregation property: to aggregate sketches without counting the same user twice, they must contain information about which users were previously added. The attacker can use this information, even if they do not know any other users in the sketch. Furthermore, adding noise either violates the aggregation property or has no influence on the success of the attack. Thus, it is pointless to try to design privacy-preserving cardinality estimators: privacy and accurate aggregation are fundamentally incompatible.
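The intuition above can be sketched in code. Assume a toy sketch that stores the k smallest hash values of its inputs (a KMV-style sketch; the function names and the value of `k` are illustrative, and this is a simplified stand-in for the general attack, not the formal argument of Section 3.3.2). An attacker who sees the sketch, and nothing else, can hash their target and check whether that hash is stored.

```python
import hashlib

def h(item):
    # Deterministic hash into [0, 1); stands in for the sketch's hash function.
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv_sketch(items, k=128):
    # Toy sketch: the k smallest hash values of the input multiset.
    return set(sorted({h(x) for x in items})[:k])

def attacker_guess(sketch, target):
    """Membership inference with zero background knowledge.
    If hash(target) is stored, the target was almost certainly added
    (a hash collision is the only alternative). If hash(target) falls
    below the sketch's threshold but is absent, the target was
    certainly not added, since it would have been among the k minima."""
    target_hash = h(target)
    if target_hash in sketch:
        return "in"
    if target_hash < max(sketch):
        return "out"
    return "unknown"  # hash above threshold: the sketch carries no evidence
```

The attack only ever gives a confident answer for a small fraction of possible targets, but for those targets it is essentially always correct, and no noise added to the *estimate* changes what the stored hashes reveal.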

This section is organized as follows.

In Section 3.3.1, we motivate our work with an example of how cardinality estimators can be used in practice. We formally define an abstract notion of cardinality estimators, and give a weak definition of privacy that is well-suited to how they are used in practice.
In Section 3.3.2, we prove that cardinality estimators cannot satisfy this privacy property while maintaining good accuracy asymptotically.
In Section 3.3.3, we consider even weaker versions of our initial definition. We show that our results still hold for some variants, while other notions are compatible with accurate cardinality estimation. We then highlight the consequences of our results for practical applications.
In Section 3.3.4, we propose and discuss various risk mitigation techniques to use cardinality estimators in practice.

All opinions here are my own, not my employer's.
I'm always glad to get feedback! If you'd like to contact me, please do so via e-mail (se.niatnofsed@neimad) or Twitter (@TedOnPrivacy).