3.2  Opportunities: aggregations & thresholding

Consider a national referendum where more than 10 million people vote on a Yes/No question. Do the exact results of this referendum reveal information about individuals? It is very reasonable to assume that no realistic attacker has background knowledge of more than 99% of all votes. The remaining 1% of the data (100,000 votes) then provides enough uncertainty to show, if properly quantified, that no attacker can use the results of the referendum to determine how a given individual voted: the referendum results are private, even though no noise was added to them.
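
To make this intuition concrete, here is a back-of-the-envelope numeric sketch (an illustration, not a formal proof): assuming the attacker models each unknown vote as an independent fair coin flip, the log-ratio between the probabilities of the released total under "the target voted Yes" and "the target voted No" stays tiny for all typical outputs.

```python
# Illustrative sketch, not a formal proof. Assumptions: the attacker knows
# every vote except the target's and those of n = 100,000 other voters, and
# models each unknown vote as an independent fair coin flip.
import numpy as np
from scipy.stats import binom

n = 100_000  # votes the attacker is uncertain about, besides the target's
p = 0.5      # attacker's prior probability that each unknown vote is Yes

# The released total fixes s, the number of Yes votes among the unknown voters
# plus the target. If the target voted Yes, the unknown sum is s - 1; if they
# voted No, it is s. The privacy loss is the log-ratio of these probabilities.
s = np.arange(1, n)  # skip the extremes, where the ratio is unbounded
loss = np.abs(binom.logpmf(s - 1, n, p) - binom.logpmf(s, n, p))

# Restrict to "typical" outputs: within 3 standard deviations of the mean.
std = np.sqrt(n * p * (1 - p))
typical = np.abs(s - n * p) <= 3 * std
print(f"max privacy loss on typical outputs: {loss[typical].max():.3f}")
# ~0.019: tiny compared to commonly used values of epsilon, with no noise added.
```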

In the previous section, we introduced conceptual tools to rigorously analyze such scenarios. In this section, we build on these foundations to analyze noiseless mechanisms and prove strong, intuitive results about their privacy. We also analyze the conditions necessary to use these notions in practical contexts, without making risky assumptions about the attacker's uncertainty or capabilities. Our main contributions are as follows.

First, we provide a formal account of the privacy of common kinds of queries under partial knowledge, using the tools introduced in the previous section. We show that counting queries under partial knowledge can provide privacy, with significantly stronger bounds on the privacy loss than those previously given.

Second, we show that thresholding (returning the output only if it exceeds a given threshold) can render counting queries private against passive attackers, even when there is not enough entropy in the data for the previous result to apply.
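
Here is a minimal sketch of the kind of mechanism this result applies to (the threshold value tau = 10 and the data are arbitrary placeholders, not recommendations from this work):

```python
# A thresholded counting query: counts below the threshold are suppressed
# instead of being noised.
from typing import Callable, Optional

def thresholded_count(records: list, predicate: Callable,
                      tau: int = 10) -> Optional[int]:
    """Return the exact number of records matching `predicate`,
    or None if that number falls below `tau`."""
    count = sum(1 for r in records if predicate(r))
    return count if count >= tau else None

ages = [23, 34, 45, 29, 31, 67, 23, 25, 41, 38, 52, 60]
print(thresholded_count(ages, lambda a: a > 30))   # 8 matches: suppressed (None)
print(thresholded_count(ages, lambda a: a >= 23))  # 12 matches: released as 12
```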

Third, we combine these results to show that, under certain conditions, k-anonymity protects against a passive attacker with partial knowledge. We discuss the generality and the limits of this result, which clearly delineates the conditions under which k-anonymity provides meaningful protection.
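
One way to see the connection with the previous result: releasing a k-anonymous histogram amounts to thresholding each bucket of quasi-identifier values at k. The following toy sketch (the function and data are our own illustration, not the formalism used in the thesis) makes this explicit:

```python
# k-anonymity as per-bucket thresholding: count rows per combination of
# quasi-identifier values, and suppress every bucket with fewer than k rows.
from collections import Counter

def k_anonymous_histogram(rows, quasi_identifiers, k=5):
    """Release the histogram over quasi-identifier combinations,
    keeping only buckets of size at least k."""
    buckets = Counter(tuple(row[c] for c in quasi_identifiers) for row in rows)
    return {qi: count for qi, count in buckets.items() if count >= k}

rows = [{"zip": "8001", "age_range": "30-39"}] * 7 \
     + [{"zip": "8002", "age_range": "40-49"}] * 2
print(k_anonymous_histogram(rows, ["zip", "age_range"], k=5))
# {('8001', '30-39'): 7}: the bucket with only 2 rows is suppressed.
```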

Finally, we look at composition, a property that is crucial to the practical applicability of privacy notions. We prove bounds for the sequential composition of noiseless mechanisms, allowing us to quantify the privacy leakage of multiple mechanisms with the same input, or of a mechanism repeated over time. We also look at nested composition: combining the uncertainty coming from the attacker's partial knowledge with noise added to the result of the aggregation. We show basic properties of such mechanisms, and explain how numeric evaluation can be used to estimate the privacy loss in this context.
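
To give a flavor of what such a numeric evaluation looks like, the following toy computation (our illustration; all parameters are arbitrary) evaluates the privacy loss of a count protected by two nested sources of uncertainty: Laplace noise, and attacker uncertainty over 1,000 votes modeled as fair coin flips.

```python
# Toy numeric evaluation of nested composition: the released value is the true
# count plus Laplace noise, and the attacker is additionally uncertain about
# n votes, modeled as independent fair coin flips.
import numpy as np
from scipy.stats import binom, laplace

n, p, scale = 1_000, 0.5, 2.0   # unknown votes, prior, Laplace noise scale
s = np.arange(n + 1)
prior = binom.pmf(s, n, p)       # attacker's belief over the unknown sum

def output_density(t, target_vote):
    # Density of observing noisy output t given the target's vote,
    # marginalizing over the unknown sum: t = s + target_vote + Laplace(scale).
    return np.sum(prior * laplace.pdf(t - s - target_vote, scale=scale))

# Evaluate the privacy loss on typical outputs (the bulk of the distribution).
std = np.sqrt(n * p * (1 - p))
ts = np.linspace(n * p - 4 * std, n * p + 4 * std, 500)
loss = [abs(np.log(output_density(t, 1) / output_density(t, 0))) for t in ts]
print(f"max numeric privacy loss on typical outputs: {max(loss):.3f}")
# Noticeably below 1/scale = 0.5, the loss of the Laplace mechanism alone:
# on typical outputs, the two sources of uncertainty compound.
```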
