Lowering the cost of anonymization

a PhD thesis

3.1  Definitional intricacies of partial knowledge

As we saw previously, differential privacy models strong attackers that can not only learn, but also influence, all but one record of the input dataset. Because this strong attacker model over-approximates realistic attackers, it can lead to overly cautious choices of noise parameters, unnecessarily degrading the algorithms' accuracy. Relaxing the assumptions about attackers' background knowledge and their influence on the dataset can allow smaller noise parameters and, in turn, more accurate results.
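As a point of reference (using generic notation, not necessarily the notation used elsewhere in this thesis), the guarantee being relaxed is the standard ε-differential privacy condition, and the link between ε and accuracy is direct: for instance, the Laplace mechanism for a query of sensitivity Δ adds noise of scale Δ/ε, so a more cautious (smaller) ε directly means noisier answers.

```latex
% Standard \varepsilon-differential privacy: for all neighboring datasets
% D \simeq D' and all sets S of possible outputs,
\Pr\left[\mathcal{M}(D) \in S\right]
  \;\le\; e^{\varepsilon} \cdot \Pr\left[\mathcal{M}(D') \in S\right].

% Laplace mechanism for a query f with sensitivity \Delta:
\mathcal{M}(D) \;=\; f(D) + \mathrm{Lap}\!\left(\tfrac{\Delta}{\varepsilon}\right),
\qquad\text{with noise standard deviation } \sqrt{2}\,\tfrac{\Delta}{\varepsilon}.
```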

In some scenarios [373], one cannot rule out that the attacker is able to influence the entire dataset. But there are many natural scenarios in which the risk of an attacker injecting a large number of data points into the dataset is negligible: censuses, phone polls, or elections are natural examples¹. In those cases, it is appropriate to consider privacy guarantees that model weaker attackers without influence over the dataset, but with some background knowledge. While precisely estimating an attacker's capabilities may be difficult, a more balanced privacy analysis should characterize the privacy leakage against attackers with varying degrees of background knowledge and influence over the dataset.

As we show in this section, while there is a rich body of prior work on differential privacy with partial knowledge [36, 46, 116, 228, 254, 326], it fails to account for data with correlations, and it does not make the attacker model explicit and precise. We provide a theoretical foundation that addresses both issues: our formalism solves the problems that previous definitions encounter when the data contains correlations, and it clearly delineates between attackers that only have some background knowledge and attackers that can also influence the data. Our main contributions are as follows.

First, we show that existing notions of privacy under partial knowledge break down when the data contains correlations, allowing very revealing mechanisms to be mistakenly considered private. We propose a practical criterion and approach to fix this class of issues.
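To make this failure mode concrete, here is a minimal, self-contained simulation (an illustrative sketch, not an example taken from the thesis; the mechanism, data distributions, and function names are assumptions made for this illustration). A hypothetical "noiseless" mechanism releases the exact sum of binary records; the script compares what that output reveals about one individual record when records are independent versus when they are perfectly correlated.

```python
import random
from collections import Counter

def sample_independent(n):
    # Each record is an i.i.d. uniform bit: marginally, every record is 50/50.
    return [random.randint(0, 1) for _ in range(n)]

def sample_correlated(n):
    # All records are copies of one uniform bit: marginally still 50/50,
    # but the records are perfectly correlated with each other.
    b = random.randint(0, 1)
    return [b] * n

def mechanism(records):
    # Hypothetical "noiseless" mechanism: release the exact sum of the records.
    return sum(records)

def posterior_on_first_record(sampler, n, trials=100_000):
    # Empirical P[record_0 = 1 | output], for every output value observed:
    # what an attacker learns about one specific record from the output alone.
    ones = Counter()
    outputs = Counter()
    for _ in range(trials):
        records = sampler(n)
        out = mechanism(records)
        outputs[out] += 1
        ones[out] += records[0]
    return {out: round(ones[out] / outputs[out], 2) for out in sorted(outputs)}

if __name__ == "__main__":
    n = 10
    print("independent records:", posterior_on_first_record(sample_independent, n))
    print("correlated records: ", posterior_on_first_record(sample_correlated, n))
```

With independent records, the posterior on any single record stays close to the uniform prior for typical outputs (it equals k/n when the released sum is k), which is the intuition behind definitions that replace noise with attacker uncertainty. With perfectly correlated records, the only possible outputs are 0 and n, and either one pins down every record exactly: a mechanism that looks acceptable under an independence assumption becomes fully revealing as soon as correlations are present.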

Second, we show that there are two distinct ways to model partial knowledge, depending on whether the attacker can only learn some properties of the data or can also modify the data. We define two corresponding notions of privacy under partial knowledge: active partial knowledge, where the attacker can influence the dataset, and passive partial knowledge, where the attacker cannot. We show that these notions have natural properties, and prove that they are equivalent for a large class of common mechanisms and assumptions over the data. Moreover, we show that the active partial knowledge assumption can be used to alleviate the challenge of precisely estimating the dataset distribution.
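To give a sense of the shape such definitions take (this is a generic sketch in the spirit of prior distribution-based definitions such as noiseless privacy, not the exact formalization developed later in this chapter), the passive variant can be phrased as an indistinguishability condition over a data-generating distribution, conditioned on the attacker's background knowledge:

```latex
% Sketch of a passive partial-knowledge guarantee: the dataset D is drawn
% from a distribution \theta, the attacker's background knowledge is an
% event B (e.g. the values of the records they already know), and the
% target individual's record is D_i. For all values a, b and output sets S:
\Pr_{D \sim \theta}\!\left[\mathcal{M}(D) \in S \;\middle|\; D_i = a,\, B\right]
  \;\le\;
  e^{\varepsilon} \cdot
  \Pr_{D \sim \theta}\!\left[\mathcal{M}(D) \in S \;\middle|\; D_i = b,\, B\right].
% In an active variant, the attacker may additionally choose the records
% they control, rather than merely knowing their values.
```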

¹ This implicitly assumes the absence of malicious insiders with direct access to the collected data, but those could break privacy anyway by leaking the entire dataset, so anonymization becomes irrelevant.
