3.4  Conclusion

It is natural to want to model attackers with only partial background knowledge, and to take this uncertainty into account when quantifying the privacy properties of a given mechanism. Certainly, in an ideal world where privacy would be the only concern, this would not be necessary: it is safer to always assume the most powerful adversary. However, we do not live in such a world. In practice, stakeholders frequently complain about the utility or complexity cost of differential privacy, and question whether its use is really necessary to reach a reasonable level of protection. Surely, aggregating data over many people is enough to prevent realistic adversaries from getting information about single individuals?

Ignoring those complaints, or dismissing them as entirely misguided, is an inadequate answer. Real-world adversaries do not have full background knowledge, and as such, it is worthwhile to improve our formal understanding of the guarantees provided by this uncertainty. This chapter provided three complementary insights about this setting.

The first insight is that modeling an attacker with partial background knowledge is more complicated and hazardous than it initially appears. As we showed in Section 3.1.2, correctly modeling such adversaries requires carefully considering the influence of data correlations, and cleanly delineating the attacker’s partial knowledge from the information that we are trying to protect. Furthermore, as we showed in Section 3.1.3, an attacker who only passively knows part of the data is a priori weaker than an attacker who can influence the data, so practical applications must also consider this dichotomy, a subtlety that does not exist for classical differential privacy. On the positive side, our work provides a solid theoretical basis for reasoning about partial knowledge in a formal, correct way; we hope that this can be of use to researchers and practitioners trying to quantify the privacy of given mechanisms under this assumption.

The second insight is that some of these intuitions about the inherent safety of certain mechanisms under partial knowledge are correct. Results from Section 3.2.1 show that aggregating data across many users does preserve privacy under reasonable assumptions. Our results focus on counting, but they could easily be extended to other aggregations. Indeed, the core insight from the proof of Theorem 3 comes from drawing a link between our setting and amplification by shuffling. In essence, uncertainty about individual records can be expressed with (ε,δ)-indistinguishability, and symmetric aggregation is at least as good as random shuffling at boosting privacy guarantees. This suggests that additional results and links could be drawn, and that both settings could be combined to obtain finer privacy analyses with reasonable attacker assumptions.
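
As a reminder, and not specific to this chapter, the standard definition of this notion is the following: two random variables A and B are (ε,δ)-indistinguishable if, for every measurable set S of outputs,

\[
\Pr[A \in S] \le e^{\varepsilon}\,\Pr[B \in S] + \delta
\qquad\text{and}\qquad
\Pr[B \in S] \le e^{\varepsilon}\,\Pr[A \in S] + \delta.
\]

(ε,δ)-differential privacy requires exactly this condition between the output distributions of a mechanism on any two neighboring datasets.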

We also showed in Section 3.2.2 that thresholding preserves privacy against passive attackers; this result provides a natural and intuitive example of the distinction between passive and active attackers with partial knowledge. Formalizing this intuition also allows us to understand under which conditions simple k-anonymity mechanisms preserve privacy against an attacker with partial knowledge. This last result depends on a number of assumptions, most of which can be linked to a corresponding weakness of k-anonymity: if one of these assumptions does not hold, then one can find an example of when k-anonymity fails to protect individual privacy. This shows one of the weaknesses of differential privacy under partial knowledge: it is difficult to obtain robust results that do not depend on brittle assumptions about the exact nature of the attacker’s uncertainty.
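
To make this kind of mechanism concrete, here is a minimal Python sketch of a thresholding release, under the simplifying assumption that each user contributes a single value; the function name and the threshold value are illustrative, and this is not the exact mechanism analyzed in Section 3.2.2.

from collections import Counter

def thresholded_release(records, k=10):
    # Release the count of each value only if at least k users report it;
    # values reported by fewer than k users are suppressed entirely.
    counts = Counter(records)
    return {value: count for value, count in counts.items() if count >= k}

# Example: "blue" (12 users) clears the threshold, "red" (3 users) is dropped.
assert thresholded_release(["blue"] * 12 + ["red"] * 3) == {"blue": 12}

One way to picture the passive/active distinction on this example: a passive attacker can only reason about the records they happen to know, whereas an active attacker could inject records of their own to push a targeted value over the threshold.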

Finally, in Section 3.2.4, we looked at composition properties for differential privacy with partial knowledge. In contrast to differential privacy in its original form, sequential composition does not hold in general. However, we found that quantifying the correlation between the results of two queries makes it possible to compose privacy guarantees, and that yet again, (ε,δ)-indistinguishability is a natural formalism to do so. We also raised the interesting question of combining guarantees from partial knowledge and noise addition, a question that we largely leave open.
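
For reference, the sequential composition property of classical differential privacy, which is the property that no longer holds in general under partial knowledge, can be stated as:

\[
M_1 \text{ is } \varepsilon_1\text{-DP} \;\text{ and }\; M_2 \text{ is } \varepsilon_2\text{-DP}
\;\Longrightarrow\;
(M_1, M_2) \text{ is } (\varepsilon_1 + \varepsilon_2)\text{-DP}.
\]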

The third insight, from Section 3.3, is that modeling an attacker with partial knowledge can be used to obtain strong negative results: some practical problems are inherently impossible to solve in a privacy-preserving way, even if we assume an attacker with complete uncertainty over the input data. Our main result on cardinality estimation is an important first step in this direction, and it suggests a wide range of open follow-up questions. Can we extend it to cardinality estimators whose merging operation only preserves precision, but not necessarily the full distribution of sketches? Can we find a finer trade-off between privacy and accuracy, allowing e.g. a small error to be incurred by each merging operation in exchange for better privacy guarantees? Are there other similar problems that cannot be solved in a privacy-preserving way without giving up strong aggregation properties, e.g. sketching algorithms used for quantile estimation?
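
To illustrate the kind of aggregation property at stake, here is a toy Python sketch of a HyperLogLog-style structure; it is purely illustrative (the register count and hash function are arbitrary, and this is not the formal model of Section 3.3). Its merge operation is an element-wise maximum, so merging the sketches of two sets yields exactly the sketch of their union; this lossless merging is the kind of strong aggregation property that the negative result targets.

import hashlib

NUM_REGISTERS = 64  # arbitrary choice for this illustration

def _hash(item):
    return int(hashlib.sha256(str(item).encode()).hexdigest(), 16)

def sketch(items):
    # Each register keeps the maximum "rank" (position of the lowest set bit
    # of the remaining hash bits) seen among the items hashed to its bucket.
    registers = [0] * NUM_REGISTERS
    for item in items:
        h = _hash(item)
        bucket, rest = h % NUM_REGISTERS, h // NUM_REGISTERS
        rank = (rest & -rest).bit_length() if rest else 1
        registers[bucket] = max(registers[bucket], rank)
    return registers

def merge(a, b):
    # Lossless union: element-wise maximum of the registers.
    return [max(x, y) for x, y in zip(a, b)]

# Merging sketches of two overlapping sets gives exactly the sketch of their
# union, regardless of how the data was split.
assert merge(sketch(range(100)), sketch(range(50, 150))) == sketch(range(150))

The actual cardinality estimate is omitted here, since only the merge structure matters for the point being made.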
