Lowering the cost of anonymization

a PhD thesis

In this section, we detail our criteria for excluding particular data privacy definitions from our work, list some relevant definitions that are excluded by the criteria presented in Section 2.2.1.0, and discuss related works and existing surveys in the field of data privacy.

Out-of-scope definitions

As detailed in Section 2.2.1.0, we considered certain data privacy definitions to be out of scope for our work, even when they appear closely related to differential privacy. This section elaborates on such definitions.

Lack of semantic guarantees

Some definitions do not provide clear semantic privacy guarantees, or are only used as tools to prove links between existing definitions. As such, we did not include them in our survey.
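
For contrast, recall the kind of semantic guarantee that ε-DP itself admits. The following Bayesian reformulation is standard; it is stated here informally, for a discrete output space, and in generic notation rather than that of any specific work cited below. If the adversary knows that the input is either D or D' (two neighboring datasets), then observing an output o of an ε-DP mechanism M shifts their odds between the two alternatives by a factor of at most e^ε:

\[
\frac{\Pr[\text{input} = D \mid M(\text{input}) = o]}{\Pr[\text{input} = D' \mid M(\text{input}) = o]}
\;\le\;
e^{\varepsilon} \cdot \frac{\Pr[\text{input} = D]}{\Pr[\text{input} = D']}.
\]

The definitions listed below do not come with a comparably explicit statement.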

  • ε-privacy, introduced in [264], was a first attempt at formalizing an adversary with restricted background knowledge. Its formulation does not provide a semantic guarantee, and it was superseded by noiseless privacy [46, 116] (introduced in Section 2.2.5).
  • Relaxed indistinguishability, introduced in [326], is a relaxation of adversarial privacy that provides plausible deniability by requiring that, for each tuple, at least k tuples exist with ε-indistinguishability. It does not provide any guarantee against Bayesian adversaries.
  • Differential identifiability, introduced in [244], bounds the probability that a given individual’s information is included in the input dataset, but does not measure the change in probabilities between the two alternatives. As such, it does not provide any guarantee against Bayesian adversaries²⁴.
  • Crowd-blending privacy, introduced in [160], combines differential privacy with k-anonymity. As it is strictly weaker than the guarantee provided by any mechanism that always returns a k-anonymous dataset, the protection it offers against a Bayesian adversary is unclear. It is mainly used to show that combining crowd-blending privacy with pre-sampling implies zero-knowledge privacy [160, 263].
  • Membership privacy²⁵, introduced in [336], is tailored to membership inference attacks on machine learning models; the guarantees it provides are not clear.
  • (k,ε)-anonymity, introduced in [197], first performs k-anonymisation on a subset of the quasi-identifiers and then applies ε-DP on the remaining quasi-identifiers, with different settings for each equivalence class of the k-anonymous dataset. The semantic guarantees of this definition are not made explicit.
  • Posteriori DP, introduced in [378], compares two posteriors in a way similar to inferential privacy, but does not make the prior (and thus, the attacker model) explicit.
  • Noiseless privacy²⁶, introduced in [147], limits the change in the number of possible outputs when one record in the dataset changes. As it does not bound the change in probabilities of the mechanism, it does not seem to offer clear guarantees against a Bayesian adversary.
  • Weak DP, introduced in [382], adapts DP for streams, but it only provides a DP guarantee for the average of all possible mechanism outputs²⁷, rather than for the mechanism itself. Thus, its semantic guarantees are also unclear.
  • Error Preserving Privacy, introduced in [90], states that the variance of the adversary’s error when trying to guess a given user’s record does not change significantly after the adversary accesses the output of the mechanism. The exact adversary model is not specified.

Variants of sensitivity

An important technical tool used when designing differentially private mechanisms is the sensitivity of the function that we try to compute. There are many variants of the initial concept of global sensitivity [128], including local sensitivity [303], smooth sensitivity [303], restricted sensitivity [52], empirical sensitivity [74], empirical differential privacy²⁸,²⁹ [5], recommendation-aware sensitivity [410], record and correlated sensitivity [411], dependence sensitivity [258], per-instance sensitivity [379], individual sensitivity [89], elastic sensitivity [209] and derivative sensitivity [238]. These notions only change how to achieve a given privacy definition (typically DP), and are not relevant to the definition itself, so we did not consider them in our work.
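
As a reminder of the baseline notion that these variants refine (written in generic notation, for a real-valued function f and neighboring datasets D ∼ D'), the global sensitivity and the local sensitivity at a dataset D are:

\[
\Delta f = \max_{D \sim D'} \left| f(D) - f(D') \right|,
\qquad
LS_f(D) = \max_{D' :\, D \sim D'} \left| f(D) - f(D') \right|,
\]

and the Laplace mechanism, for instance, achieves ε-DP by adding noise of scale Δf/ε. The variants above change which quantity plays the role of Δf, not the privacy definition being targeted.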

Local model and other contexts

In this work, we focused on DP variants/extensions typically used in the global model, in which a central entity has access to the whole dataset. It is also possible to use DP in other contexts, without formally changing the definition. The main alternative is the local model, where each individual randomizes their own data before sending it to an aggregator. This model, formally introduced in [117], is used, e.g., by Google [141], Apple [360], and Microsoft [107]. These models can be thought of as different ways of deploying a given privacy definition, rather than distinct definitions.
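
To make the distinction concrete, here is a minimal sketch of randomized response, the textbook local DP mechanism: each user perturbs a single bit on their own device, and the aggregator only ever sees the noisy reports. The function names and the debiasing step are illustrative, not taken from any of the deployments cited above.

```python
import math
import random

def randomize_bit(true_bit: int, epsilon: float) -> int:
    """Randomized response: report the true bit with probability
    e^epsilon / (1 + e^epsilon), otherwise report the opposite bit.
    Each report, taken on its own, satisfies epsilon-local DP."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return true_bit if random.random() < p_truth else 1 - true_bit

def estimate_proportion(reports, epsilon):
    """Aggregator-side debiasing: unbiased estimate of the true
    proportion of 1s, computed from the randomized reports only."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[observed] = p * true + (1 - p) * (1 - true); solve for true.
    return (observed - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)

# Each user randomizes locally; only noisy bits ever reach the aggregator.
epsilon = 1.0
true_bits = [random.randint(0, 1) for _ in range(10_000)]
noisy_reports = [randomize_bit(b, epsilon) for b in true_bits]
print(estimate_proportion(noisy_reports, epsilon))
```

With the truth probability set to e^ε/(1+e^ε), the ratio of output probabilities between any two possible inputs is at most e^ε, which is exactly the ε-local DP condition.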

Many definitions we listed were initially presented in the local model, such as dχ-privacy [67], geo-indistinguishability [15], earth mover’s privacy [151], location privacy [138], profile-based DP [162], divergence DP and smooth DP from [31], and extended DP, distribution privacy, and extended distribution privacy from [219].

Below, we list definitions that are the same as previously listed ones, but are used in a different attacker setting; the list also includes alternatives to the local and global models.

  • In [344], the authors introduce distributed DP, which corresponds to local DP with the additional assumption that only a portion of the participants are honest.
  • In [220], the authors define joint DP to model a game in which each player cannot learn the data of any other player, but is still allowed to observe the influence of their own data on the mechanism output. In [390], the authors define a slightly different version of this idea, multiparty DP, in which the view of each subgroup of players is differentially private with respect to the other players’ inputs.
  • In [50], the authors define DP in the shuffled model, which falls in between the global and the local model: the local model is augmented by an anonymous channel that randomly permutes a set of user-supplied messages, and differential privacy is only required of the output of the shuffler (see the sketch after this list).
  • In [207], the authors define localized information privacy, a local version of information privacy (mentioned in Section 2.2.6).
  • In [289], the authors define utility-optimized local DP, a local version of one-sided differential privacy (mentioned in Section 2.2.3), which additionally guarantees that if the data is considered sensitive, then a certain set of outputs is forbidden.
  • In [6, 112, 301], the authors define personalized local DP, a local version of personalized DP (mentioned in Section 2.2.4).
  • In [12], the authors define dχ-local DP, a local version of dχ-privacy (mentioned in Section 2.2.4); this was redefined as condensed local DP in [180].
  • In [250], the authors define task-global DP and task-local DP, which are equivalents of element-level DP (mentioned in Section 2.2.3) in a meta-learning context.
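
Since the shuffled model may be the least familiar of these settings, the following sketch (ours, purely illustrative) shows its structure: users apply a local randomizer, a shuffler applies a uniformly random permutation that severs the link between users and messages, and the analyzer only sees the shuffled reports.

```python
import math
import random

def local_randomizer(bit, epsilon):
    """Same randomized response as in the earlier sketch: each user
    perturbs their own bit before it leaves their device."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if random.random() < p_truth else 1 - bit

def shuffler(reports):
    """The shuffler's only job: apply a uniformly random permutation,
    severing the link between users and the messages they sent."""
    permuted = list(reports)
    random.shuffle(permuted)
    return permuted

# The analyzer never learns who sent which report, only the shuffled multiset;
# differential privacy is required of this shuffled output.
epsilon = 1.0
true_bits = [random.randint(0, 1) for _ in range(1_000)]
analyzer_view = shuffler([local_randomizer(b, epsilon) for b in true_bits])
print(sum(analyzer_view), "reports equal to 1 out of", len(analyzer_view))
```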

The relation between the main syntactic models of anonymity and DP was studied in [78], in which the authors claim that the former are designed for privacy-preserving data publishing (PPDP), while DP is more suitable for privacy-preserving data mining (PPDM). We disagree with this assessment, and discuss differentially private data publishing at length in Chapter 4.

In [196], the authors classify different privacy enhancing technologies (PETs) into 7 complementary dimensions. Indistinguishability falls into the Aim dimension, but within this category, only k-anonymity and oblivious transfer are considered; differential privacy is not mentioned. In [7], the authors survey privacy concerns, measurements, and privacy-preserving techniques used in online social networks and recommender systems. They classify privacy into 5 categories; DP falls into Privacy-preserving models along with, e.g., k-anonymity. In [376], the authors classify 80+ privacy metrics into 8 categories based on the output of the privacy mechanism. One of their classes is Indistinguishability, which contains DP as well as several variants. Some variants are classified into other categories; for example, Rényi DP is classified into Uncertainty and mutual-information DP into Information gain/loss. The authors list 8 differential privacy variants; our taxonomy can be seen as an extension of their work (and in particular of its Indistinguishability category).

In [378], the authors establish connections between differential privacy (seen as the additional disclosure of an individual’s information due to the release of the data), identifiability (seen as the posterior probability of recovering the original data from the released data), and mutual-information privacy (which measures the average amount of information about the original dataset contained in the released data).

The appropriate selection of the privacy parameters for DP has also been extensively studied. This problem is not trivial, and many factors can be considered: in [200], the authors used economic incentives; in [234, 243, 312], the authors looked at individual preferences; and in [237, 259], the authors took into account an adversary’s capability in terms of hypothesis testing and guessing advantage, respectively.

The earliest surveys focusing on DP summarize algorithms for achieving DP and their applications [123, 124]. The more detailed “privacy book” [131] presents an in-depth discussion of the fundamentals of DP, techniques for achieving it, and applications to query-release mechanisms, distributed computations, and data streams. Other textbooks have focused on the empirical performance of various algorithms [252] or on asymptotic upper and lower bounds for various tasks [372], or have tried to make differential privacy more approachable to non-experts [304]. Other surveys focus on the release of histograms and synthetic data with DP [189, 296].

Finally, some surveys focus on location privacy. In [266], the authors highlight privacy concerns in this context and list mechanisms with formal provable privacy guarantees; they describe several variants of differential privacy for streaming (e.g., pan-privacy) and location data (e.g., geo-indistinguishability), along with extensions such as pufferfish and blowfish privacy. In [68], the authors analyze different kinds of privacy breaches and compare metrics that have been proposed to protect location data.

²⁴ Differential identifiability was reformulated in [254] as an instance of membership privacy.

²⁵ Another definition with the same name is introduced in [254]; we mention it in Section 2.2.6.

²⁶ Another definition with the same name is introduced in [46, 116]; we mention it in Section 2.2.5.

²⁷ It also assumes that some uncertainty comes from the data itself, similarly to definitions in Section 2.2.5.

²⁸ Even though it was introduced as a variant of DP, it was later shown to be a measure of sensitivity [64].

²⁹ Another definition with the same name is introduced in [58]; we mention it in Section 2.2.5.
