2.2.1  Preliminaries

In this section, we introduce the notation used throughout this work, formally define basic properties of privacy definitions and how definitions can relate to each other, present the dimensions along which DP can be modified, and describe the methodology used in this literature review.

Throughout Section 2.2, we denote by $\mathcal{T}$ an arbitrary set of possible records, and by $\mathcal{D}$ the space of possible datasets, where a dataset is a finite indexed family of records. We typically use $t$ or $t'$ to denote records, and $D$, $D_1$, or $D_2$ to denote datasets. The indices of a dataset are typically called $i$ and $j$, with $D_i$ referring to the $i$-th record of a dataset $D$. We denote by $D_{-i}$ the dataset $D$ whose $i$-th record has been removed.

Let $\mathcal{O}$ denote an arbitrary set of possible outputs; outputs are typically called $O$, and sets of outputs are called $S$. A mechanism is a function, possibly randomized, which takes a dataset as input and returns an output. Mechanisms are typically called $\mathcal{M}$, while $\mathcal{M}(D)$ is usually a random variable.
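To make this notation concrete, here is a minimal, purely illustrative Python sketch of a randomized mechanism (the function name noisy_count and its parameters are ours, not from this work): a counting query with Laplace noise of scale $1/\varepsilon$, i.e., the standard Laplace mechanism for a query of sensitivity 1.

```python
import numpy as np

def noisy_count(dataset, epsilon=1.0):
    """A toy randomized mechanism: a counting query with Laplace noise.

    Illustrative sketch only; the noise scale 1/epsilon corresponds to the
    standard Laplace mechanism for a query of sensitivity 1.
    """
    true_count = len(dataset)
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# The output of the mechanism on a fixed dataset D is a random variable:
# two calls on the same input generally return different values.
D = ["record_1", "record_2", "record_3"]
print(noisy_count(D), noisy_count(D))
```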

Probability distributions on $\mathcal{T}$ are called $\pi$, probability distributions on $\mathcal{D}$ are called $\theta$, and families of probability distributions on $\mathcal{D}$ are called $\Theta$. Given some event $E$, let $\mathcal{M}(\hat{D}_{\theta|E})$ denote the random variable corresponding to the output of $\mathcal{M}$ when the dataset $\hat{D}$ is drawn from a distribution $\theta$ conditioned on $E$. These notations, and others used throughout this section, are summarized in the tables on pages 29–32.

Dimensions

Variants and extensions of differential privacy modify the original definition in various ways. To establish a comprehensive taxonomy, a natural approach is to partition them into categories, depending on which aspect of the definition they change. Unfortunately, this approach fails for privacy definitions, many of which modify several aspects at once, so it is impossible to have a categorization such that every definition falls neatly into only one category.

The approach we take is to define dimensions along which the original definition can be modified. Each variant or extension of DP can be seen as a point in a multidimensional space, where each coordinate corresponds to one possible way of changing the definition along a particular dimension. To make this representation possible, our dimensions need to satisfy the following two properties.

  • Mutual compatibility: definitions that vary along different dimensions can be combined to form a new, meaningful definition.
  • Inner exclusivity: definitions in the same dimension cannot be combined to form a new, meaningful definition (but they can be pairwise comparable).

In addition, each dimension should be clearly motivated: there should be an intuitive explanation of what it means to modify DP along it. Moreover, each possible choice within a dimension should be similarly understandable, so that new practitioners can quickly determine which kind of definition they should use or study, depending on their use case.

We introduce our dimensions by reformulating the guarantee offered by DP, highlighting the aspects that have been modified by its variants or extensions. Each dimension is assigned a letter, and we note the letter corresponding to each highlighted aspect. This formulation takes the point of view of an attacker trying to find out sensitive information about some input data using the output of a mechanism.

An attacker with perfect background knowledge (B) and unbounded computation power (C) is unable (R) to distinguish (F) anything about an individual (N), uniformly across users (V) even in the worst-case scenario (Q).

This informal definition of DP, with its seven highlighted aspects, gives us seven distinct dimensions. We denote each one by a letter and summarize them in Table 2.18. Each is introduced in its corresponding section.

Dimension | Description | Typical motivations
Quantification of privacy loss (Q) | How is the privacy loss quantified across outputs? | Averaging risk, obtaining better composition properties
Neighborhood definition (N) | Which properties are protected from the attacker? | Protecting specific values or multiple individuals
Variation of privacy loss (V) | Can the privacy loss vary across inputs? | Modeling users with different privacy requirements
Background knowledge (B) | How much prior knowledge does the attacker have? | Using mechanisms that add less or no noise to data
Formalism change (F) | Which formalism describes the attacker's knowledge gain? | Exploring other intuitive notions of privacy
Relativization of knowledge gain (R) | What is the knowledge gain relative to? | Guaranteeing privacy for correlated data
Computational power (C) | How much computational power can the attacker use? | Combining cryptography techniques with DP
Table 2.18: The seven dimensions and their typical motivation.

Note that the interpretation of DP is subject to some debate. In [368], the authors summarize this debate and show that DP can be interpreted through two possible lenses: as an associative property, or as a causal property. The difference between the two interpretations is particularly clear when the input dataset is modeled as being generated by a probability distribution.

  • In the associative view, this probability distribution is conditioned upon the value of one record. If the distribution has correlations, this change can affect other records as well.
  • In the causal view, the dataset is first generated, and the value of one record is then changed before computing the result of the mechanism.

While the causal view does not require any additional assumption to capture the intuition behind DP, the associative view requires that either all records are independent in the original probability distribution (the independence assumption), or the adversary must know all data points except one (the strong adversary assumption, which we picked in the reformulation above). These considerations can have a significant impact on DP variants and extensions, either leading to distinct variants that attempt to capture the same intuition, or to the same variant being interpreted in different ways; in this work, we point it out when the distinction between the two interpretations is particularly relevant to a specific variant.
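To make the difference between the two views concrete, the following small simulation (ours, with an artificial toy distribution) uses two perfectly correlated binary records. Conditioning the data-generating distribution on the value of the first record (associative view) also changes the distribution of the second record, whereas generating the dataset first and then overwriting the first record (causal view) leaves the second record untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_correlated_dataset():
    # Two perfectly correlated binary records (e.g., data about twins).
    x = int(rng.integers(0, 2))
    return [x, x]

# Associative view: condition the distribution on the first record being 1.
# Because the records are correlated, the second record is affected too.
associative = [d for d in (sample_correlated_dataset() for _ in range(10_000)) if d[0] == 1]
print(np.mean([d[1] for d in associative]))  # close to 1.0

# Causal view: sample the dataset first, then set the first record to 1.
# The second record keeps its original distribution.
causal = []
for _ in range(10_000):
    d = sample_correlated_dataset()
    d[0] = 1
    causal.append(d)
print(np.mean([d[1] for d in causal]))  # close to 0.5
```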

Properties of privacy definitions

In this section, we introduce three abstract, desirable properties of data privacy definitions. As shown in Section 2.1.6.0, these are satisfied by differential privacy itself. We will see in this work that they are often, but not always, satisfied by its variants and extensions.

Two important properties of data privacy notions, proposed in [225, 226], are called privacy axioms. These are not axioms in the sense that they are assumed to be true; rather, they are consistency checks: properties that, if not satisfied by a data privacy definition, indicate a flaw in the definition3.

Definition 9 (Privacy axioms [225, 226]).

1. Post-processing4 (or transformation invariance): A privacy definition $\mathsf{Def}$ satisfies the post-processing axiom if, for any mechanism $\mathcal{M}$ satisfying $\mathsf{Def}$ and any probabilistic function $f$, the mechanism $f \circ \mathcal{M}$ also satisfies $\mathsf{Def}$.
2. Convexity (or privacy axiom of choice): A privacy definition $\mathsf{Def}$ satisfies the convexity axiom if, for any two mechanisms $\mathcal{M}_1$ and $\mathcal{M}_2$ satisfying $\mathsf{Def}$, the mechanism $\mathcal{M}$ defined by $\mathcal{M}(D) = \mathcal{M}_1(D)$ with probability $p$ and $\mathcal{M}(D) = \mathcal{M}_2(D)$ with probability $1-p$ also satisfies $\mathsf{Def}$.
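As a sanity check (ours, for the special case of $\varepsilon$-DP and a deterministic post-processing function $f$), both axioms can be verified directly from the definition. For any set of outputs $S$ and any neighboring datasets $D$ and $D'$:

$$\Pr[f(\mathcal{M}(D)) \in S] = \Pr[\mathcal{M}(D) \in f^{-1}(S)] \le e^{\varepsilon} \Pr[\mathcal{M}(D') \in f^{-1}(S)] = e^{\varepsilon} \Pr[f(\mathcal{M}(D')) \in S]$$

(the probabilistic case follows by averaging over the randomness of $f$), and for the convex combination $\mathcal{M}$ of $\mathcal{M}_1$ and $\mathcal{M}_2$:

$$\Pr[\mathcal{M}(D) \in S] = p \Pr[\mathcal{M}_1(D) \in S] + (1-p) \Pr[\mathcal{M}_2(D) \in S] \le e^{\varepsilon} \Pr[\mathcal{M}(D') \in S].$$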

A third important property is, as we mentioned in Section 2.1.6.0, one of differential privacy’s main strengths: composability. We saw that there were three types of composition; in this survey, we focus on sequential composition, which we simply call composition.

Definition 10 (Composability). A privacy definition $\mathsf{Def}$ with parameter $\alpha$ is composable if for any two mechanisms $\mathcal{M}_1$ and $\mathcal{M}_2$ satisfying respectively $\alpha_1$-$\mathsf{Def}$ and $\alpha_2$-$\mathsf{Def}$, the mechanism $D \mapsto (\mathcal{M}_1(D), \mathcal{M}_2(D))$ satisfies $\alpha_3$-$\mathsf{Def}$ for some (non-trivial) $\alpha_3$.
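For instance, $\varepsilon$-DP is composable with $\alpha_3 = \varepsilon_1 + \varepsilon_2$; a minimal sketch of this standard result, assuming the two mechanisms use independent randomness and discrete outputs: for any neighboring datasets $D$ and $D'$ and any pair of outputs $(O_1, O_2)$,

$$\frac{\Pr[(\mathcal{M}_1(D), \mathcal{M}_2(D)) = (O_1, O_2)]}{\Pr[(\mathcal{M}_1(D'), \mathcal{M}_2(D')) = (O_1, O_2)]} = \frac{\Pr[\mathcal{M}_1(D) = O_1]}{\Pr[\mathcal{M}_1(D') = O_1]} \cdot \frac{\Pr[\mathcal{M}_2(D) = O_2]}{\Pr[\mathcal{M}_2(D') = O_2]} \le e^{\varepsilon_1} \cdot e^{\varepsilon_2}.$$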

We highlight the properties satisfied by each variant and extension of DP in Table 2.20 in Section 2.2.9.

Relations between definitions

When learning about a new data privacy notion, it is often useful to know how it relates to other definitions. However, definitions have parameters that often have different meanings, and whose values are not directly comparable. To capture extensions, where one definition can be seen as a special case of another, we introduce the following definition.

Definition 11 (Extensions). Let $\mathsf{Def}_1$ and $\mathsf{Def}_2$ be data privacy definitions, with respective parameters $\alpha$ and $\beta$. We say that $\mathsf{Def}_1$ is extended by $\mathsf{Def}_2$, and denote it as $\mathsf{Def}_1 \subseteq \mathsf{Def}_2$, if for all $\alpha$, there is a value of $\beta$ such that $\alpha$-$\mathsf{Def}_1$ is identical to $\beta$-$\mathsf{Def}_2$.

Concerning variants, to claim that a definition is stronger than another, we adopt the concept of ordering established in [88], using $\alpha$ and $\beta$ as tuples encoding multiple parameters. We slightly changed the original definition: the original only required the second condition to hold, which would classify any extension as a stronger variant.

Definition 12 (Relative strength of privacy definitions). Let $\mathsf{Def}_1$ and $\mathsf{Def}_2$ be data privacy definitions. We say that $\mathsf{Def}_1$ is stronger than $\mathsf{Def}_2$, and denote it $\mathsf{Def}_1 \succeq \mathsf{Def}_2$, if:

1. for all $\alpha$, there is a $\beta$ such that $\alpha$-$\mathsf{Def}_1$ implies $\beta$-$\mathsf{Def}_2$;
2. for all $\beta$, there is an $\alpha$ such that $\alpha$-$\mathsf{Def}_1$ implies $\beta$-$\mathsf{Def}_2$.

If $\mathsf{Def}_1$ is both stronger than and weaker than $\mathsf{Def}_2$, we say that the two definitions are equivalent, and denote it $\mathsf{Def}_1 \approx \mathsf{Def}_2$.

Relative strength defines a partial order on the space of possible definitions. On the other hand, if two definitions are equivalent, this does not mean that they are equal: they could be equal only up to a change in parameters. Both relations are reflexive and transitive, and we also use their symmetric counterparts (i.e., $\supseteq$ and $\preceq$). Moreover, for brevity, we combine these two concepts in a single notation: if $\mathsf{Def}_1 \subseteq \mathsf{Def}_2$ and $\mathsf{Def}_2 \preceq \mathsf{Def}_1$, we say that $\mathsf{Def}_2$ is a weaker extension of $\mathsf{Def}_1$, and denote it $\mathsf{Def}_1 \sqsubseteq \mathsf{Def}_2$.
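As a simple illustration of these relations (ours, using approximate DP as the example): $(\varepsilon,\delta)$-DP extends $\varepsilon$-DP, since setting $\delta = 0$ recovers the original definition; and $\varepsilon$-DP is stronger than $(\varepsilon,\delta)$-DP, since $\varepsilon$-DP implies $(\varepsilon,\delta)$-DP for every $\delta \ge 0$, which satisfies both conditions of Definition 12. Together, this makes $(\varepsilon,\delta)$-DP a weaker extension of $\varepsilon$-DP.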

A summary table is presented in Section 2.2.9 (Table 2.20). For each definition, we highlight its dimensions and its relation to other notions, and we also specify whether these notions satisfy the privacy axioms and the composability property (✓: yes, ✗: no, ?: currently unknown). In Section 2.2.9.0, we either provide a reference or a novel proof for each of these claims.

Methodology

Whether a data privacy definition falls within the scope of this survey is not always obvious, so we use the following criterion: the attacker’s capabilities must be clearly defined, and the definition must prevent this attacker from learning about a protected property. Consequently, we do not consider:

  • syntactic definitions, which are a property of the output data and not of the mechanism;
  • variants of technical notions that are not data privacy properties;
  • definitions whose only difference from DP lies in the context and not in the formal property, like the distinction between the local and central models of differential privacy.

In Section 2.2.10.0, we list the notions that we found during our survey but considered to be out of scope for this work.

To find a comprehensive list of DP notions, besides the definitions we were already aware of or that were suggested to us by experts, we conducted a wide literature review using two academic search engines: BASE [34] and Google Scholar [172]. The exact queries were run on 2020-06-01. The corresponding result counts are summarized in Table 2.19.

Query (BASE) | Hits
“differential privacy” relax year:[2000 to *] | 130
“differential privacy” variant -relax year:[2000 to *] | 115

Query (Google Scholar) | Hits
“differential privacy” “new notion” | 224
“differential privacy” “new definition” -“new notion” | 180

Table 2.19: Queries for the literature review.

We manually reviewed each paper and filtered the results until only papers that either contained a new definition or applied DP in a new setting remained. All papers that define a variant or extension of DP are cited in this work.

3The necessity of these axioms was questioned in [202], where the authors exhibit natural notions of anonymity that contradict them.

4This definition must be slightly adapted for some variants; see for example Proposition 17 in Section 2.2.9.
