# Research post: Cardinality Estimators do not Preserve Privacy

**Next** summer at PETS 2019, I'll present a paper I wrote with Andreas
Lochbihler and my PhD advisor David Basin. This post will
attempt to explain what the paper is about, and what its results mean in
practice.

# tl;dr

You can't both remember unique individuals and not remember them.

It's not incredibly surprising, but still cool to have a formal negative result about it :D

# Cardinality estimators

The title of the paper is *Cardinality Estimators do not Preserve Privacy*.
First, what's a cardinality estimator? It's an algorithm, and an associated data
structure called a *sketch*. It does two things:

- it can
*count unique items*in a list (without duplicates); - and you can
*merge*several of them to count unique items in multiple lists.

Suppose you have two huge log files, each containing one billion unique identifiers. How many unique identifiers are present in the union of both files? If you only have counts, there's no way to tell. Both files might have the same identifiers, so the total count is one billion. Or maybe they only contain different identifiers, so the total count is two billion. Or it could be in-between these two extreme options.

A cardinality estimator can answer that question. You can apply it to each log
and get two sketches. Each can estimate the number of unique items in its input
log. Further, you can merge the two sketches to get a new sketch… And the
estimated count of this new sketch is the *deduplicated* version of the two
counts.

The simplest example is a set that remembers all the items it saw. It will
return an exact count, and merging two sketches is straightforward.
Unfortunately, it doesn't scale very well: you need a lot of memory to remember
every single element, and lots of time to merge large sketches. Instead, we
often trade precision for performance, and use *approximate* algorithms. They
don't return an exact answer, but are much faster and memory-efficient.

The first cardinality estimator I encountered is *HyperLogLog*. It's a very
popular choice: very efficient, and precise enough for most use cases. The idea
behind it is quite smart: if you want to learn about it, I recommend this blog
post and this research paper.

As part of my privacy reviewer role at Google, a team approached me with a question. They were storing HyperLogLog sketches, and asked me: "are those sketches as sensitive as the original data?".

# Privacy modeling

To answer that question, one first has to define what we mean by sensitive. We used the idea behind differential privacy, the attacker's information gain. Intuitively, if an attacker has some suspicion that a user was added to a sketch, and then looks at the sketch… How much can their suspicion increase, or decrease?

However, we added two twists to this idea.

- First, we only care if the level of suspicion
*increases*, not if it decreases. If the attacker learns that their target is in a given sketch, it's a problem. If they learn that their target is*not*in a given sketch, then we don't consider it that big of a deal. - Second, we don't assume that the attacker knows all users except their target.
Instead, we assume that the attacker knows
*nothing*about the sketch. From their perspective, the data is 100% random.

These choices only made sense for the original use case (the Google privacy
review). For the sketches I was considering, learning that somebody was *not* in
a sketch wouldn't have been very problematic. There was also no reason to assume
the attacker would have any background knowledge. These assumptions are much
weaker than in most data privacy papers and use cases.

If this research had led to a positive result, it wouldn't have been very convincing. "Here is a cardinality estimator that satisfies this super weak notion of privacy!" People would have pushed back, saying that the assumptions were too weak.

Luckily (for me), the result was otherwise. First, I found that HyperLogLog was not private according to this definition. That was the easy part, and it led to a natural follow-up question: can we make it private? Or more generally, can we build a private cardinality estimator? We want it to have the same nice properties as HyperLogLog, but a better privacy.

# Main result

It turns out that the answer is negative. Even with our weak privacy notion, the
problem is *unsolvable*. No cardinality estimator can be both private and
accurate. A private cardinality estimator has its accuracy gets exponentially
worse with its size.

Since the result is negative, the privacy definition's weakness makes our result
*stronger*. Accuracy is incompatible with a weak notion of privacy… So it's also
incompatible with stronger notions. We also considered even weaker variants,
e.g. allowing for a small probability of error. It didn't change the
negative result. There seems to be a fundamental incompatibility between privacy
and accuracy.

There is one caveat: this is only true if you want to be able to merge an arbitrary number of sketches. If the accuracy can get worse as you merge sketches, the result does not hold. In such a context, privacy-preserving schemes might exist. So, if your use case only requires you to merge a bounded number of sketches, you might have options. But if you want analysts to be able to do arbitrary aggregations of sketches and still get reasonably good results… then privacy is an impossible goal.

With this added caveat, our result becomes more intuitive. To merge two sketches that count unique users, you have to deduplicate users… So you have to keep the information about which users are in the sketch. As with HyperLogLog, this information doesn't have to be exact. But the more you remember, the more an attacker can use it to break the privacy property. HyperLogLog remembers some users more than others, and that's what allows it to stay accurate. If you can't remember any user well, then your cardinality estimator gets very inaccurate.

So there are two contributions: a theoretical one and a practical one.

The theoretical part is a confirmation and formalization of an expected phenomenon. It's still interesting, because it's quite rare. There aren't many negative results in the world of differential privacy. A typical privacy paper takes a problem and solves it in a differentially private way. Here, we're presenting a problem for which this is impossible. This leads to an open question, which we ask at the end of the paper: what's a minimal set of constraints that make differential privacy impossible?

There is also one practical consequence: cardinality estimators in use today are not private. Their sketches should be considered roughly as sensitive as raw data. We proved it manually for HyperLogLog… But our result is generic, so it holds for all cardinality estimators.

# Behind the scenes

The story we tell in the paper isn't exactly the path we actually followed to
get our results. For example, the attacker's lack of background knowledge came
from a practical constraint. In the original problem, cardinality estimators
were built using large-scale tools like MapReduce. Such tools assume that the
aggregation primitives are *deterministic*: for example, MapReduce double-checks
the computation results for fault tolerance. HyperLogLog is deterministic: a
sketch formed with a given input is always the same. All other cardinality
estimators we found were also deterministic. So at first, we required that any
solution to our question should also be deterministic.

But it's impossible to get differential privacy without adding noise to the data. We assumed that the attacker lacked knowledge about the data to get around this problem. If the data itself is random, it can play the same role as noise from the attacker's perspective.

The first negative result we got had the assumption that cardinality estimators were deterministic. I was unhappy about it, and wanted the paper to have a more generic result. For a good chunk of 2017, I tried to extend the result to arbitrary cardinality estimators. We gave up and tried to submit the paper in autumn, but got rejected for exactly this reason. This gave me a motivation boost: reviewers, too, thought this was important. The next submission had the generalized result. A good example of the peer review process working well! ^^

*August 2019 edit: this paper and the presentation I gave to PETS won a
Best Student Paper award!
After two rejections from other conferences, and the fact that it only got
accepted to PETS thanks to a
consistency experiment (one group of
reviewers accepted it, not the other), it was unexpected and nice. My talk was
recorded, you can watch it here.*

Also, the "a-ha!" moment to get the generic result didn't happen when I expected. It didn't click when I was spending hours working on it in the lab. Instead, it struck in the shower, after spending a week of vacation without an Internet connection. I strongly suspect this isn't a coincidence… Logging off must have helped my brain be more relaxed and creative or something.

# What comes next

I'll try to write a similar blog post for each research paper I publish^{1}. If
you're a researcher, I'd encourage you to do the same. The time it takes to
publish a write-up like this is negligible compared to doing the research and
writing the paper… And many more people will be able to read it!