
3  Differential privacy under partial knowledge

However, just because something isn’t surprising doesn’t mean it’s easy to deal with. (Nnedi Okorafor, Binti)

The success of differential privacy, along with the many variants and extensions proposed since its introduction, suggests a solution to our initial problem. To pick a definition of privacy that makes sense in a given context, we can start from differential privacy and optionally modify it to better fit our use case.

If we want to modify the protected characteristic, we can change the definition of neighborhood (Section 2.2.3). If it is too expensive to consider the absolute worst-case output, we can relax this requirement by averaging the privacy loss, or by accepting a very small risk (Section 2.2.2). If different people in our dataset have distinct privacy requirements, we can also model this in the definition (Section 2.2.4). In all the aforementioned cases, we can pick a variant that satisfies useful properties like post-processing or composition, and still keep the strong formal guarantees that differential privacy offers.
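For instance, the relaxation that accepts a very small risk is typically obtained by adding an additive slack $\delta$ to the core differential privacy inequality; in standard notation (the exact symbols used in Chapter 2 may differ slightly), it requires that for all neighboring datasets $D_1$ and $D_2$ and all sets of outputs $S$,
\[
\Pr\left[\mathcal{M}(D_1) \in S\right] \le e^{\varepsilon} \cdot \Pr\left[\mathcal{M}(D_2) \in S\right] + \delta.
\]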

From a purely academic standpoint, it seems like this definitional problem is now solved. Unfortunately, things are a bit more complicated in practice. In real-world scenarios, people advocating for formal privacy guarantees often encounter pushback. “Why is it necessary to opt for such a strong notion?”, a skeptic might say. “This data is heavily aggregated, surely the risk is low enough. It seems fine to me. Can you find a realistic way of attacking it? Maybe then I will be convinced that stronger guarantees are necessary.”

To advocate for using differential privacy, we can try explaining what happens if the definition is not satisfied. DP states that changing the data of a single user cannot be detected by looking at the mechanism’s output. So, by definition, a mechanism that is not differentially private allows exactly this kind of detection: there exist two datasets $D_1$ and $D_2$ that differ in the data of a single user, such that the difference between $\mathcal{M}(D_1)$ and $\mathcal{M}(D_2)$ is arbitrarily large, at least for some outputs. This seems bad!
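To make this concrete, recall the guarantee in question, in standard notation (the exact formulation in Chapter 2 may differ slightly): a mechanism $\mathcal{M}$ is $\varepsilon$-differentially private if, for all pairs of neighboring datasets $D_1$ and $D_2$ and all sets of outputs $S$,
\[
\Pr\left[\mathcal{M}(D_1) \in S\right] \le e^{\varepsilon} \cdot \Pr\left[\mathcal{M}(D_2) \in S\right].
\]
A mechanism that fails this for every finite $\varepsilon$ is one for which the ratio $\Pr\left[\mathcal{M}(D_1) \in S\right] / \Pr\left[\mathcal{M}(D_2) \in S\right]$ is unbounded over pairs of neighbors and output sets: this is the “arbitrarily large difference” above.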

However, the “attack” above is not convincing to laypeople: it feels too artificial. Surely, they say, a realistic attacker cannot craft an arbitrary pair of datasets $D_1$ and $D_2$ and observe the results. They do not control the data, nor do they have perfect knowledge about it. If the data is heavily aggregated, for example using $k$-anonymity with a reasonably large $k$, can an attacker really find out sensitive information about individuals?

In practice, aggregated data is frequently published without noise, under the assumption that this is safe enough. The most prominent example is probably voting: it would be unacceptable to add noise to the tally of a democratic election, yet nobody seems to worry that publishing the exact count might somehow constitute a privacy risk. This intuition hinges on the implicit assumption that the secrecy of individual votes makes the result uncertain, from the point of view of any realistic attacker.
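The following toy simulation (illustrative only, with made-up numbers; not an example from the thesis) shows why background knowledge matters here: an attacker who already knows every ballot except the target’s can read the target’s vote directly off the exact tally, while an attacker who is genuinely uncertain about the other ballots cannot.

```python
import random

def exact_tally(votes):
    """Publish the exact number of 'yes' votes, with no noise added."""
    return sum(votes)

# Attacker with full background knowledge: all ballots except the target's.
known_votes = [random.randint(0, 1) for _ in range(999)]
target_vote = 1
published = exact_tally(known_votes + [target_vote])

# The exact tally deterministically reveals the target's vote.
inferred = published - sum(known_votes)
assert inferred == target_vote

# An attacker with only partial knowledge of the other 999 ballots cannot do
# this subtraction: many combinations of secret votes are consistent with the
# same published tally, so the target's vote stays hidden in the aggregate.
```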

Can this intuition be formalized? We saw in Section 2.2.5 that this idea is not new, and that multiple variants of differential privacy were introduced to capture an attacker with uncertainty over the input data. In this chapter, we focus on this specific setting. We show that modeling this scenario correctly is much more difficult than it initially appears, but that a careful analysis can formalize the above intuition, draw novel links between syntactic definitions like $k$-anonymity and differential privacy, and be used to derive negative results on problems that are impossible to solve in a privacy-preserving way.

On the theory front, we start by reviewing existing definitions that formalize the scenario where the attacker only has partial knowledge of the data. We point out fundamental flaws in these notions when the data contains correlations, and introduce an additional criterion needed to fix these definitional problems. We then delineate two cases that behave differently in the presence of partial knowledge: whether the attacker can actively influence the input dataset, or whether they only have passive partial knowledge of the original data.

We then use this novel theory to formalize positive results in our setting of interest. We show that aggregating the data of many users does provide privacy guarantees under the partial knowledge assumption, by drawing links between this setting and amplification by shuffling. We also show that thresholding is an effective privacy-preserving strategy against passive attackers, and use this insight to derive new links between $k$-anonymity and differential privacy.
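For readers unfamiliar with the term, here is a minimal sketch of what thresholding means in this context (an illustration of the general idea only, not the precise mechanism analyzed later in the chapter): aggregates backed by fewer than a fixed number of users are suppressed rather than published.

```python
def threshold_release(counts, k=10):
    """Toy thresholding: publish only aggregates backed by at least k users.

    A sketch of the general idea; the guarantees discussed later in the
    chapter concern a precisely specified mechanism and attacker model.
    """
    return {group: n for group, n in counts.items() if n >= k}

# Small groups are suppressed, larger ones are published exactly.
print(threshold_release({"group A": 3, "group B": 142}, k=10))
# -> {'group B': 142}
```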

Finally, we show that even under this assumption of partial knowledge, some use cases cannot be solved in a privacy-preserving way. In particular, we study cardinality estimators: data structures that estimate the number of distinct elements in large datasets, and that can be merged while removing duplicates. This class of algorithms cannot be made private, even under extremely weak assumptions.
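To give a concrete (and purely illustrative) instance of this class of data structures, here is a minimal K-Minimum-Values sketch; the impossibility result in this chapter applies to cardinality estimators in general, not to this particular construction.

```python
import hashlib

class KMVSketch:
    """K-Minimum-Values: a simple mergeable cardinality estimator.

    Keeps the k smallest hash values seen so far; duplicate items hash to the
    same value, so merging two sketches deduplicates for free.
    """

    def __init__(self, k=256):
        self.k = k
        self.mins = set()  # the k smallest hashes, as floats in [0, 1)

    def _hash(self, item):
        digest = hashlib.sha256(str(item).encode()).hexdigest()
        return int(digest, 16) / 2**256  # map to [0, 1)

    def add(self, item):
        self.mins.add(self._hash(item))
        if len(self.mins) > self.k:
            self.mins.remove(max(self.mins))

    def merge(self, other):
        """Union of two sketches: keep the k smallest hashes overall."""
        merged = KMVSketch(self.k)
        merged.mins = set(sorted(self.mins | other.mins)[: self.k])
        return merged

    def estimate(self):
        if len(self.mins) < self.k:
            return len(self.mins)  # fewer than k distinct items seen: exact count
        return int((self.k - 1) / max(self.mins))  # standard KMV estimator
```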
