Lowering the cost of anonymization

a PhD thesis

2.2.3  Neighborhood definition (N)

The original definition of differential privacy considers datasets differing in one record. Thus, the datasets can differ in two possible ways: either they have the same size and differ on only one record, or one is a copy of the other with one extra record. These two options do not protect the same thing: the former protects the value of the records, while the latter also protects their presence in the data; together, they protect any property about a single individual.

In many scenarios, it makes sense to protect a different property of the data, e.g., the value of a specific sensitive field, or entire groups of individuals. It is straightforward to adapt DP to protect different sensitive properties: all one has to do is change the definition of neighborhood in the original definition.

Changing the sensitive property

The original definition states that the $\varepsilon$-indistinguishability property should hold for "any datasets $D_1$ and $D_2$ that differ only on the data of one individual". Modifying the set of pairs $(D_1, D_2)$ for which $\mathcal{M}(D_1) \approx_\varepsilon \mathcal{M}(D_2)$ is required is equivalent to changing the protected sensitive property.

Weaker notions

In DP, the difference between $D_1$ and $D_2$ is sometimes interpreted as "one record value is different", or as "one record has been added or removed". In [227], the authors formalize these two options as bounded DP and unbounded DP. They also introduce attribute DP and bit DP, for smaller changes within the differing record.

Definition 20 ([227]). If a privacy mechanism $\mathcal{M}$ satisfies $\mathcal{M}(D_1) \approx_\varepsilon \mathcal{M}(D_2)$ for any pair of datasets $(D_1, D_2)$, where $D_2$ can be obtained from $D_1$ by…

  • …adding or removing one record, then $\mathcal{M}$ is $\varepsilon$-unbounded DP (uBoDP).
  • …changing exactly one record, then $\mathcal{M}$ is $\varepsilon$-bounded DP (BoDP).
  • …changing one attribute in a record, then $\mathcal{M}$ is $\varepsilon$-attribute DP (AttDP).
  • …changing one bit of an attribute in a record, then $\mathcal{M}$ is $\varepsilon$-bit DP (BitDP).
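To make the contrast between these neighborhood notions concrete, here is a minimal Python sketch (an illustration, not a mechanism from [227]): a Laplace mechanism for a counting query, whose noise scale depends on which neighborhood notion is targeted. Under unbounded DP the dataset size itself must be protected, so the count is noised; under bounded DP the size is public and the exact count can be released.

```python
import numpy as np

def noisy_count(records, epsilon, neighborhood="unbounded"):
    """Release the number of records under the chosen neighborhood notion.

    Illustrative sketch: the sensitivity of the counting query is 1 under
    unbounded DP (adding or removing a record changes the count by 1) and 0
    under bounded DP (replacing a record leaves the count unchanged).
    """
    true_count = len(records)
    if neighborhood == "unbounded":
        sensitivity = 1
    elif neighborhood == "bounded":
        sensitivity = 0
    else:
        raise ValueError("unknown neighborhood notion")
    if sensitivity == 0:
        return true_count  # bounded DP does not protect the dataset size
    return true_count + np.random.laplace(scale=sensitivity / epsilon)
```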

In [227], the authors show that $\varepsilon$-unbounded DP implies $2\varepsilon$-bounded DP, as changing a record can be seen as deleting it and adding a new one in its place. The original definition of $\varepsilon$-DP is the conjunction of $\varepsilon$-unbounded DP and $\varepsilon$-bounded DP. However, bounded DP is frequently used in the literature: locally differentially private mechanisms in particular, in which each user randomizes their own data before sending it to the aggregator, typically use bounded DP. It is often simply called differential privacy, and sometimes renamed, like in [149], where the authors call it per-person DP.
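A minimal sketch of that argument, spelled out with an explicit intermediate dataset (using the notation of Definition 20):

```latex
% If D_1 and D_2 differ by changing one record, let D' be D_1 with that record
% removed: (D_1, D') and (D', D_2) are then both unbounded-DP neighbors, so for
% any set of outputs S,
\[
  \Pr[\mathcal{M}(D_1) \in S]
  \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S]
  \;\le\; e^{2\varepsilon}\,\Pr[\mathcal{M}(D_2) \in S],
\]
% and symmetrically with D_1 and D_2 exchanged, which is exactly
% 2\varepsilon-bounded DP.
```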

Variants of differential privacy that do not protect individuals, but single contributions (in the case where the same person can contribute multiple times to a dataset), are also often used in practice, especially for machine learning applications [271]. Some recent works also argue that in-between definitions are appropriate: rather than protecting a single contribution or a user's entire set of contributions, the authors in [22] suggest protecting elements that reveal information about users, after deduplicating or clustering contributions. For example, rather than protecting all website visits by a single user, or each visit individually, one might choose to protect the fact that a user ever visited a given website (but not whether the user visited that website once or many times). They call the corresponding definition element-level DP (ELDP).

Another way to relax the neighborhood definition in DP is to consider that only certain types of information are sensitive. For example, if the attacker learns that their target has cancer, this is more problematic than if they learn that their target does not have cancer. This idea is captured by one-sided DP [231]: the neighbors of a dataset are obtained by replacing a single sensitive record with any other record (sensitive or not). The idea of sensitivity is formalized by a policy $P$, which specifies which records are sensitive. This idea cannot be captured simply by $\varepsilon$-indistinguishability, since one-sided DP is asymmetric.

Definition 21 ($(P,\varepsilon)$-one-sided differential privacy [231]). Given a policy $P$, a privacy mechanism $\mathcal{M}$ is $(P,\varepsilon)$-one-sided DP (OSDP) iff for all datasets $D_1$ and $D_2$, where $D_2$ has been obtained by replacing a record of $D_1$ that belongs to $P$ by any other record, and for all sets of outputs $S$:
\[
  \Pr[\mathcal{M}(D_1) \in S] \le e^{\varepsilon} \cdot \Pr[\mathcal{M}(D_2) \in S].
\]
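As an illustration of this asymmetric neighborhood relation, here is a hypothetical Python sketch (the predicate is_sensitive stands in for the policy $P$, and a finite record domain is assumed so that neighbors can be enumerated):

```python
def one_sided_neighbors(dataset, is_sensitive, record_domain):
    """Yield every dataset obtained by replacing one sensitive record of
    `dataset` with an arbitrary record from `record_domain`."""
    for i, record in enumerate(dataset):
        if not is_sensitive(record):
            continue  # non-sensitive records are not protected
        for replacement in record_domain:
            if replacement == record:
                continue
            neighbor = list(dataset)
            neighbor[i] = replacement
            yield neighbor

# Note the asymmetry: a dataset containing no sensitive record has no
# neighbors at all, so one-sided DP puts no constraint on its outputs.
```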
When every record is sensitive ($P$ contains all possible records), this is equivalent to bounded DP. Similar ideas were proposed in multiple papers:

  • In [23], the authors propose sensitive privacy, which determines which records are sensitive based on the data itself, using a normality property and a graph-based definition of neighborhood, instead of a data-independent determination.
  • In [51], the authors introduce anomaly-restricted DP, which assumes that there is only one outlier in the dataset, and that this outlier should not be protected.

Stronger notions

More restrictive definitions are also possible. First, some definitions make the definition of neighborhood more explicit when a single person can contribute multiple times to a dataset; this is the case for client/participant DP, defined in [271]. In [123], the authors implicitly define $k$-group privacy, which considers datasets that differ not in one record but in up to $k$ records, to protect multiple individuals. This can also be interpreted as taking correlations into account when using DP: DP under correlation [73] uses an extra parameter to describe the maximum number of records that the change of one individual can influence.

These two definitions are formally equivalent, but the implicit interpretation of DP behind them is different. $k$-group privacy is compatible with the associative view under the strong adversary assumption (the adversary knows all records except $k$ of them) or with the causal view ($k$ records are changed after the data is generated). Meanwhile, DP under correlation implicitly considers the associative view with the independence assumption, and tries to relax that assumption. This last approach was further developed via dependent DP [258], which uses "dependence relationships" to describe how much the variation in one record can influence the other records.

Definition 22 ($(R,L,\varepsilon)$-dependent differential privacy [258]). A privacy mechanism $\mathcal{M}$ is $(R,L,\varepsilon)$-dependent DP (DepDP), where $R$ is the probabilistic dependence relationship and $L$ is the dependence size, if for any pair of datasets $D_1$ and $D_2$, where $D_2$ has been obtained from $D_1$ by changing one record and the corresponding at most $L-1$ other records according to $R$, $\mathcal{M}(D_1) \approx_\varepsilon \mathcal{M}(D_2)$.

Note that when $R$ is the empty relation, or when $L = 1$, this definition is equivalent to bounded DP: under the associative view of DP, this represents independence between records. Similar definitions appear in [392, 393] as correlated DP (CorDP), in which correlations are defined by an observation on other datasets, and in [399] as Bayesian DP6 (BayDP[399]), where the neighborhood relation is defined by an adversary having some knowledge about correlations in the data. An extension is proposed in [256] as prior DP (PriDP), which considers a family of adversaries instead of a single adversary.

The strongest possible variant is considered in [227], where the authors define free lunch privacy, in which the attacker must be unable to distinguish between any two datasets, even if they are completely different. This guarantee can be interpreted as a reformulation of Dalenius’ privacy goal [91], which we mentioned in Section 2.1.6. As such, all mechanisms that satisfy free lunch privacy have a near-total lack of utility.

Definition 23 ($\varepsilon$-free lunch privacy [227]). A privacy mechanism $\mathcal{M}$ satisfies $\varepsilon$-free lunch privacy (FLPr) if $\mathcal{M}(D_1) \approx_\varepsilon \mathcal{M}(D_2)$ for any pair of datasets $(D_1, D_2)$.

Limiting the scope of the definition

Redefining the neighborhood property can also be used to reduce the scope of the definitions. In [349], the authors note that DP requires $\varepsilon$-indistinguishability of results between any pair of neighboring datasets, but in practice, the data custodian has only one dataset they want to protect. Thus, they only require $\varepsilon$-indistinguishability between this dataset and all its neighbors, calling the resulting definition individual DP. An equivalent definition was proposed in [64] as conditioned DP.

Definition 24 ($(D,\varepsilon)$-individual differential privacy [349]). Given a dataset $D$, a privacy mechanism $\mathcal{M}$ satisfies $(D,\varepsilon)$-individual DP (IndDP) if for any dataset $D'$ that differs in at most one record from $D$, $\mathcal{M}(D) \approx_\varepsilon \mathcal{M}(D')$.

This definition was further restricted in [379], where besides fixing a dataset $D$, a record is also fixed.

Applying the definition to other types of input

Many adaptations of differential privacy simply change the neighborhood definition to protect types of input data other than datasets. A few examples follow.

  • In [75, 138, 396], the authors adapt differential privacy to location data. In [138], the authors define location privacy, in which neighbors are datasets that differ in at most one record, and the two differing records are at a physical distance smaller than a given threshold. This definition also appears in [75] as DP on -location set7. Several more location-related differential privacy variants were defined in [291]: untrackability (which adapts differential privacy to sets of locations, by protecting whether they originated from a single user or from two users), as well as undetectability and multi-user untrackability (which extend this idea by not assuming that both sets originated from the same private data, and to multiple users, respectively).
  • In [108, 188, 316, 329, 351, 359], the authors adapt differential privacy to graph-structured data. In [188], the authors present multiple alternative definitions, which protect different parts of the graph: the strongest is node-DP, which protects a node and all its edges; the weakest is edge-DP, which only protects one edge; and an intermediate definition is $k$-edge-DP, which protects a total of $k$ edges and nodes (a sketch contrasting edge and node neighborhoods follows this list). In [359], the authors introduce out-link privacy, which protects all outgoing edges from a given node. In [329], the author introduces -edge-labeled DP, similar to out-link privacy, but only protecting a predetermined subset of outgoing edges. In [316], the author introduces -weighted DP, in which graph edges are weighted, and graphs are neighbors when the total weight of their differing edges is smaller than 1; this notion was also defined implicitly in [340]. In [351], the authors define decentralized DP, which extends the graph neighborhood to two jumps. In [221], the authors introduce protected DP, which adapts DP to graphs and guarantees that no observer can learn much about the set of edges corresponding to any protected node, while offering no guarantees for the other nodes. Finally, in [108], the authors introduce seamless privacy, which, rather than protecting characteristics of a specific input graph, ensures that certain pairs of queries on this graph return similar answers.
  • In [125, 129, 130, 148, 223, 382], the authors adapt differential privacy to a streaming context, where the attacker can access the mechanism’s internal states. In [125, 129, 130], the authors define pan-privacy, which comes in two variants: event-level pan-privacy (called strong DP in [382]) protects individual events, and user-level pan-privacy protects all events associated with a single user. In [223], the authors extend the previous idea and propose $w$-event privacy, which protects any event sequence occurring within a window of at most $w$ timestamps. In [148, 291], this was further extended to an infinite horizon via discounted differential privacy (which assigns smaller and smaller weights to events further in the past) and everlasting privacy (which limits the information leakage users suffer, no matter how many executions of a mechanism they participate in), respectively.
  • In [375], the authors adapt differential privacy to Random Access Memory (RAM) and Private Information Retrieval (PIR). For RAM, the neighborhood is defined over the sequence of logical memory requests over time; the same notion appears in [63] as differential obliviousness and in [11] as oblivious DP. The adaptation of the neighborhood is similar in the case of PIR; a similar notion appears in [364] as $\varepsilon$-private PIR and in [311] as -DPIR. Additionally, in [222], the authors use a similar idea to define differential privacy for outsourced database systems.
  • In [211], the authors adapt differential privacy for symbolic control systems, and introduce word-DP and substitution-word-DP, protecting respectively pairs of words whose Levenshtein distance is lower than a given parameter, or whose Hamming distance is lower than a given parameter.
  • In [405], the authors adapt differential privacy for text vectors, and propose text indistinguishability, in which the neighborhood relationship between two word vectors depends on their Euclidean distance.
  • In [398], the authors define set operation DP, which adapts differential privacy to set operations: two pairs of inputs $(A, B)$ and $(A', B')$ are neighbors if either $A$ and $A'$, or $B$ and $B'$, are neighbors.
  • In [242, 401], the authors define refinement DP and pixel DP respectively, which adapt the definition to images, with neighbors given by some transformation or metric.
  • In [201], the authors adapt DP to gossip protocols to protect the privacy of the information source.
  • In [307], the authors define functional DP, which adapts differential privacy to functions.
  • In [347], the authors define phenotypic DP, which adapts differential privacy to genomic data, where the difference in the phenotype vector defines the neighborhood.
  • In [178], the authors adapt differential privacy to recommendation systems, and define distance-based DP, which protects not only a given user's recommendation, but also all the recommendations of that user for similar items.
  • In [262], the authors adapt differential privacy to machine learning, and define differential training privacy, which quantifies the risk of membership inference of a record with respect to a classifier and its training data.
  • In [366], the authors define DP for bandit algorithms, where the neighborhood notion is defined by changing any single reward in a multi-armed bandit game sequence. In [41], this definition is renamed instantaneous DP (Def 5), and a few more variants are proposed for this problem space: local DP for bandits, pan-privacy for bandits, sequential privacy for bandits, and environment privacy for bandits.
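As mentioned in the graph item above, here is a small hypothetical Python sketch contrasting edge and node neighborhoods (representing a graph as a set of nodes and a set of undirected edges is an assumption made for this illustration):

```python
def edge_neighbors(nodes, edges):
    """Yield graphs differing from (nodes, edges) by a single edge,
    i.e. the neighborhood used by edge-DP."""
    for e in edges:
        yield nodes, edges - {e}  # remove an existing edge
    for u in nodes:
        for v in nodes:
            if u < v and frozenset((u, v)) not in edges:
                yield nodes, edges | {frozenset((u, v))}  # add a missing edge

def node_neighbors(nodes, edges):
    """Yield graphs obtained by removing one node together with all its
    incident edges, i.e. (half of) the neighborhood used by node-DP;
    adding a new node with arbitrary edges is analogous and omitted."""
    for u in nodes:
        yield nodes - {u}, {e for e in edges if u not in e}

# Example: a triangle graph on three nodes.
nodes = {1, 2, 3}
edges = {frozenset((1, 2)), frozenset((2, 3)), frozenset((1, 3))}
```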

Extensions

It is natural to generalize the variants of this section to arbitrary neighboring relationships. One example is mentioned in [227], under the name generic DP8, where the neighboring relation is entirely captured by a relation $R$ between datasets.

Definition 25 ($(R,\varepsilon)$-generic differential privacy [227]). Given a relation $R$ between datasets, a privacy mechanism $\mathcal{M}$ satisfies $(R,\varepsilon)$-generic DP (GcDP[227]) if for all $(D_1, D_2) \in R$, $\mathcal{M}(D_1) \approx_\varepsilon \mathcal{M}(D_2)$.
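For instance (a restatement for illustration, not a formula taken from [227]), bounded DP and free lunch privacy correspond to particular choices of the relation $R$:

```latex
\begin{align*}
  R_{\mathrm{bounded}}     &= \{(D_1, D_2) : |D_1| = |D_2| \text{ and } D_1, D_2
                               \text{ differ on exactly one record}\},\\
  R_{\mathrm{free\ lunch}} &= \{(D_1, D_2) : D_1, D_2 \text{ are arbitrary datasets}\}.
\end{align*}
```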

This definition is symmetric, but it can easily be modified to accommodate asymmetric definitions like one-sided DP.

Other definitions use different formalizations to also generalize the concept of changing the neighborhood relationship. Some (like pufferfish privacy, see Definition 35 in Section 2.2.5) use pairs of predicates that $D_1$ and $D_2$ must respectively satisfy to be neighbors. Others (like coupled-worlds privacy, see Definition 44 in Section 2.2.7) use private functions taking datasets as input, and define neighbors to be datasets that have a different output according to this private function. Others use a distance function between datasets, and define neighbors as datasets at a distance lower than a given threshold; this is the case for DP under a neighborhood (DPUN), introduced in [145]; adjacent DP (AdjDP), introduced in [235]9; constrained DP (ConsDP), introduced in [409] (where the distance captures a utility-related constraint); and distributional privacy10 (DlPr[409]), also introduced in [409] (with additional constraints on the neighborhood definition: neighboring datasets must be part of a fixed set and have elements in common). This distance can also be defined as the sensitivity of the mechanism, like in sensitivity-induced DP [335] (SIDP), or implicitly defined by a set of constraints, as is done implicitly in [227] via induced neighbors DP (INDP).
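The distance-based pattern shared by these last definitions can be written generically (the distance $d$ and threshold $\theta$ are placeholder notation, not taken from the cited papers):

```latex
% Neighborhood induced by a distance d over datasets and a threshold \theta;
% \varepsilon-indistinguishability is then required for every such pair.
\[
  D_1 \sim D_2 \iff d(D_1, D_2) \le \theta .
\]
```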

One notable instantiation of generic DP is blowfish privacy [193]. Its major building blocks are a policy graph $G$, which specifies which pairs of values in the data domain should not be distinguishable by an adversary, and a set of constraints $Q$, which specifies the set of possible datasets that the definition protects. It was inspired by the Pufferfish framework [228] (see Definition 35 in Section 2.2.5), but the attacker is not assumed to have uncertainty over the data: instead, it models an attacker whose knowledge is a set of deterministic constraints on the data.

Definition 26 ($(G,Q,\varepsilon)$-blowfish privacy [187, 193]). Given a policy graph $G$ and a set of constraints $Q$, a privacy mechanism $\mathcal{M}$ satisfies $(G,Q,\varepsilon)$-blowfish privacy (BFPr) if for all datasets $D_1$ and $D_2$ that satisfy the constraints in $Q$ and differ in only one element, such that the two differing values are neighbors in $G$, $\mathcal{M}(D_1) \approx_\varepsilon \mathcal{M}(D_2)$.

As noted in [193], a mechanism satisfies $\varepsilon$-bounded DP if and only if it satisfies $(G,Q,\varepsilon)$-blowfish privacy, where $G$ is the complete graph, and the constraints $Q$ allow any dataset of size $n$. A particular instantiation of this idea is explored in [227] as induced DP, where the definition of neighbors is induced by a set of constraints.
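As a hypothetical illustration of how a policy graph weakens the protected property, consider the following Python sketch (the age-based policy, function names, and dataset representation are assumptions made for the example): only ages that are close must remain indistinguishable, so a blowfish-private mechanism may reveal a coarse age bracket while hiding the exact age.

```python
AGES = range(0, 101)

def policy_graph_edges(max_gap=5):
    """Pairs of ages the adversary must not be able to distinguish between:
    here, ages differing by at most `max_gap` years."""
    return {(a, b) for a in AGES for b in AGES
            if a != b and abs(a - b) <= max_gap}

def are_blowfish_neighbors(d1, d2, edges):
    """Datasets (tuples of ages) are neighbors if they differ on exactly one
    element and the two differing values are adjacent in the policy graph."""
    if len(d1) != len(d2):
        return False
    diffs = [(x, y) for x, y in zip(d1, d2) if x != y]
    return len(diffs) == 1 and diffs[0] in edges
```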

Multidimensional definitions

Modifying the protected property is orthogonal to modifying the risk model implied by the quantification of privacy loss: it is straightforward to combine these two dimensions. Indeed, many definitions mentioned in this section were actually introduced with a parameter allowing for a small probability of error. One particularly general example is adjacency relation divergence DP [218], which combines an arbitrary neighborhood definition (like in generic DP) with an arbitrary divergence function (like in divergence DP).

As the examples above show, it is very common to change the definition of neighborhood in practical contexts to adapt what aspect of the data is protected. Further, local DP mechanisms like RAPPOR [141] implicitly use bounded DP: the participation of one individual is not secret; only the value of their record is protected. Variants that limit the scope of the definition to one particular dataset or user, however, provide few formal guarantees and do not seem to be used in practice.

6There are two other notions with the same name, introduced in [248] and [367]; we mention them in Sections 2.2.4 and 2.2.5, respectively.

7Distinct from DP on -location set [396], which we mention in Section 2.2.4.

8Another definition with the same name is introduced in [225, 226]; we mention it in Section 2.2.4.

9Originally simply called “differential privacy” by its authors.

10Another definition with the same name is introduced in [53, 333]; we mention it in Section 2.2.4.
