
2.2.4  Variation of privacy loss (V)

In DP, the privacy parameter ε is uniform: the level of protection is the same for all protected users or attributes or, equivalently, only the level of risk of the most at-risk user is considered. In practice, some users might require a higher level of protection than others, or a data custodian might want to consider the level of risk across all users, rather than only considering the worst case. Some definitions take this into account by allowing the privacy loss to vary across inputs, either explicitly (by associating each user with an acceptable level of risk), or implicitly (by allowing some users to be at risk, or averaging the risk across users).

Varying the privacy level across inputs

In Section 2.2.3, we saw how changing the definition of the neighborhood can be used to adapt the definition of privacy and protect different aspects of the input data. However, the privacy protection in those variants is binary: either a given property is protected, or it is not. One option to generalize this idea further is to allow the privacy level to vary across possible inputs.

One natural example is to consider that some users might have higher privacy requirements than others, and make the ε vary depending on which user differs between the two datasets. This is done in personalized DP, a notion first defined informally in [302], then independently in [134, 165, 212, 261]. An equivalent notion is also defined in [10] as heterogeneous DP, while a location-based definition is presented in [97] as personalized location DP.

Definition 27 (Ψ-personalized differential privacy [212]). A privacy mechanism M provides Ψ-personalized DP (PerDP) if for every pair of neighboring datasets D and D' differing only on the i-th record, and for all sets of outputs S:

P[M(D) ∈ S] ≤ e^(Ψ(i)) · P[M(D') ∈ S]

where Ψ is a privacy specification: Ψ maps the records to personal privacy preferences, and Ψ(i) denotes the privacy preference of the i-th record.

As shown in [89, 212], ε-DP implies Ψ-PerDP, where Ψ(i) = ε for all i; and if the definition is modified to become symmetric, Ψ-PerDP implies ε-DP where ε = max_i Ψ(i).
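To make this more concrete, here is a minimal sketch of a mechanism satisfying Ψ-personalized DP for a sum of values in [0, 1] (this is not the mechanism of [212]; the query, the function names, and the scaling strategy are illustrative assumptions): each record's contribution is scaled by Ψ(i)/ε₀ before Laplace noise of scale 1/ε₀ is added, so changing the i-th record moves the true answer by at most Ψ(i)/ε₀, and the corresponding privacy loss is bounded by Ψ(i).

```python
import random

def personalized_dp_sum(values, personal_eps, eps0):
    """Noisy sum of values in [0, 1] under per-record privacy preferences.

    Record i is scaled by personal_eps[i] / eps0 (assumed <= 1, i.e. eps0 is
    at least the largest personal preference), then Laplace noise of scale
    1 / eps0 is added. Changing record i moves the weighted sum by at most
    personal_eps[i] / eps0, so its privacy loss is at most personal_eps[i].
    """
    assert all(0.0 <= v <= 1.0 for v in values)
    assert all(0.0 < e <= eps0 for e in personal_eps)
    weighted_sum = sum((e / eps0) * v for v, e in zip(values, personal_eps))
    # A Laplace(0, 1/eps0) sample, obtained as the difference of two exponentials.
    noise = random.expovariate(eps0) - random.expovariate(eps0)
    return weighted_sum + noise

# Three users with personal privacy preferences 0.1, 0.5, and 1.0.
print(personalized_dp_sum([0.2, 0.9, 0.4], [0.1, 0.5, 1.0], eps0=1.0))
```

Down-weighting records with strict preferences is wasteful in terms of accuracy; better mechanisms exist (e.g., based on non-uniform sampling), but this sketch is only meant to illustrate how a single noisy output can give each record its own privacy guarantee.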

This definition can be seen as a refinement of the intuition behind one-sided DP, which separates records into sensitive and non-sensitive ones. The idea of making the privacy level vary across inputs can be generalized further, by also making the privacy level depend on the entire dataset, and not only on the differing record. This is done in [263], where the authors define tailored DP.

Definition 28 (ε(·)-tailored differential privacy [263]). A mechanism M satisfies ε(·)-tailored differential privacy (TaiDP) for a function ε(·) mapping a record and a dataset to a privacy parameter if, for any dataset D and any record t in D, M is ε(t, D)-indistinguishable between D and the dataset obtained by removing t from D.

This concept can be applied to strengthen or weaken the privacy requirement for a record depending on whether it is an outlier in the dataset. In [263], the authors formalize this idea and introduce outlier privacy, which tailors an individual’s protection level to their “outlierness”. Other refinements are also introduced in [263]: simple outlier privacy (SOPr), simple outlier DP (SODP), and staircase outlier privacy (SCODP). A similar idea was explored in [216], which introduced Pareto DP (ParDP): it uses a Pareto distribution to separate a large number of low-frequency individuals from a small number of high-frequency ones, and the sensitivity is calculated based only on the low-frequency individuals.

Finally, varying the privacy level across inputs also makes sense in continuous scenarios, where the neighborhood relationship between two datasets is not binary, but quantified. This is, for example, the case for ε-geo-indistinguishability [15], where two datasets D and D' are considered d-neighbors if the only differing records between D and D' are at a distance d of each other, and the privacy parameter grows linearly with d.
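As an illustration of this continuous scaling (a minimal sketch, not the planar Laplace mechanism of [15]: the use of the Manhattan distance instead of the Euclidean one, and the function name, are simplifying assumptions), adding independent Laplace noise of scale 1/ε to each coordinate of a location yields output distributions that differ by a factor of at most e^(ε·d) for two locations at Manhattan distance d: nearby locations are nearly indistinguishable, while faraway ones remain distinguishable.

```python
import random

def noisy_location(x, y, eps):
    """Perturb a 2D location with independent Laplace noise of scale 1/eps.

    For two locations at Manhattan distance d of each other, the output
    distributions differ by a factor of at most exp(eps * d): the privacy
    guarantee degrades linearly with the distance between the locations.
    """
    lap = lambda: random.expovariate(eps) - random.expovariate(eps)
    return x + lap(), y + lap()

print(noisy_location(46.52, 6.57, eps=0.1))  # noisy version of a (lat, lon)-like pair
```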

Randomizing the variation of privacy levels

Varying the privacy level across inputs can also be done in a randomized way, by guaranteeing that some random fraction of users have a certain privacy level. One example is proposed in [185] as random DP: the authors note that rather than requiring DP to hold for every possible dataset, it is natural to only consider realistic datasets, and to allow “edge-case” or very unrealistic datasets not to be protected. This is captured by generating the data randomly, and allowing a small proportion of cases to not satisfy the ε-indistinguishability property.

Definition 29 ((ε, γ)-random differential privacy [185]). Let π be a probability distribution on T, D a dataset generated by drawing n i.i.d. elements from π, and D' the same dataset as D, except one element was changed to a new element drawn from π. A mechanism M is (ε, γ)-random DP (RanDP) if M(D) and M(D') are ε-indistinguishable, with probability at least 1 − γ on the choice of D and D'.

The exact meaning of “with probability at least 1 − γ on the choice of D and D'” can vary slightly. In [184] and in [270], the authors introduce predictive DP (PredDP) and model-specific DP respectively, which quantify over all possible choices of D, and pick D' randomly in the neighborhood of D. In [111], D and D' are both taken out of a set of density larger than 1 − γ, and the authors call this definition generalized DP (GdDP). The distribution generating the dataset is also not always assumed to be generating i.i.d. records; we denote the corresponding parameter by θ.

Random DP might look similar to probabilistic DP: in both cases, there is a small probability that the privacy loss is unbounded. However, they are very different: in random DP, this probability is computed across the inputs of the mechanism (i.e., users or datasets), while for probabilistic DP, it is computed across mechanism outputs. Also similarly to probabilistic DP, excluding some cases altogether creates definitional issues: random DP does not satisfy the convexity axiom (see Proposition 15 in Section 2.2.9). We postulate that using a different tool to allow some inputs to not satisfy the ε-indistinguishability property, similar to approximate DP or Rényi DP, could solve this problem.
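To get a quantitative feel for random DP, here is a small simulation (an illustrative sketch, not taken from [185]; the Gaussian data distribution, the noisy-sum mechanism, and all names are assumptions made here): for a Laplace-noised sum of unbounded i.i.d. values, the privacy loss between D and D' depends only on the replaced element, and equals ε₀·|x − x′|, where x is the original element and x′ its replacement. This loss is unbounded in the worst case, so the mechanism is not ε-DP for any ε, but one can estimate the probability γ that it exceeds a target ε over the random choice of D and D'.

```python
import random

def estimate_gamma(eps, eps0, trials=100_000, seed=0):
    """Estimate the gamma of (eps, gamma)-random DP for a Laplace-noised sum.

    The mechanism is M(D) = sum(D) + Laplace(1/eps0), on datasets of i.i.d.
    standard Gaussian values. Replacing one element x by a fresh draw x'
    changes the sum by |x - x'| (the other elements cancel out), so that
    pair of datasets is exactly (eps0 * |x - x'|)-indistinguishable. We
    estimate how often this exceeds the target eps.
    """
    rng = random.Random(seed)
    exceed = sum(
        1 for _ in range(trials)
        if eps0 * abs(rng.gauss(0, 1) - rng.gauss(0, 1)) > eps
    )
    return exceed / trials

# With eps0 = 0.5, most random pairs of neighboring datasets are
# 2.0-indistinguishable; the printed value estimates the exceptions (gamma).
print(estimate_gamma(eps=2.0, eps0=0.5))
```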

Usually, data-generating distributions are used for other purposes: they typically model an adversary with partial knowledge. However, definitions in this section still compare the outputs of the mechanism on fixed neighboring datasets: the only randomness in the indistinguishability property comes from the mechanism. By contrast, definitions of Section 2.2.5 compare the outputs of the mechanism on a random dataset, so the randomness comes both from the data-generating distribution and the mechanism.

Multidimensional definitions

As varying the privacy level and limiting the considered datasets are two distinct ways of relaxing differential privacy, it is possible to combine them with the previously mentioned dimensions.

Combination with N

The definitions described in Section 2.2.3 (e.g., generic DP or blowfish privacy) have the same privacy constraint for all neighboring datasets. Thus, they cannot capture definitions that vary the privacy level across inputs. However, both ideas can be naturally captured together via distance functions. In [67], the authors introduce d-privacy, in which the function d takes both datasets as input, and returns the corresponding maximal privacy loss (the ε) depending on the difference between the two datasets.

Definition 30 (d-privacy [67]). Let d be a function associating a non-negative number to each pair of datasets. A privacy mechanism M satisfies d-privacy (d-Pr) if for all pairs of datasets D and D', and all sets of outputs S:

P[M(D) ∈ S] ≤ e^(d(D,D')) · P[M(D') ∈ S]

When d is proportional to the Hamming distance between datasets, this is equivalent to ε-DP. In d-privacy, the function d specifies both the privacy parameter and the definition of the neighborhood: it can simply return an arbitrarily large value on non-neighboring datasets, and vary the privacy level across inputs for neighboring datasets. In the original definition, the authors impose that d is symmetric, but this condition can also be relaxed to allow d-privacy to extend definitions like one-sided DP.
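As a small example of this equivalence (a sketch under assumptions not taken from [67]: a counting query and the scaled Hamming distance), the Laplace mechanism on a counting query satisfies d-privacy for d(D, D') = ε · (number of records on which D and D' differ), since changing k records changes the true count by at most k.

```python
import random

def noisy_count(dataset, predicate, eps):
    """Count the records satisfying `predicate`, with Laplace(1/eps) noise.

    Changing k records changes the true count by at most k, so for any two
    datasets D and D' the outputs are d(D, D')-indistinguishable with
    d(D, D') = eps * (number of differing records): an instance of d-privacy
    that coincides with eps-DP on neighboring datasets.
    """
    true_count = sum(1 for record in dataset if predicate(record))
    noise = random.expovariate(eps) - random.expovariate(eps)
    return true_count + noise

ages = [34, 27, 61, 45, 19, 52]
print(noisy_count(ages, lambda age: age >= 40, eps=0.5))
```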

Equivalent definitions of d-privacy also appeared in [138], and in [219] as extended DP. Several other definitions, such as weighted DP [322] (WeiDP), smooth DP [31] (SmoDP), and earth mover’s privacy [151] (EMDP), can be seen as particular instantiations of d-privacy for specific functions d measuring the distance between datasets. This is also the case for some definitions tailored for location privacy, like geo-graph-indistinguishability [357], which specifically applies to network graphs.

Random DP can also be combined with changing the neighborhood definition: in [396], the authors define DP on a δ-location set¹¹, for which the neighborhood is defined by a set of “plausible” locations around the true location of a user. A notable definition using the same combination of dimensions is distributional privacy¹², introduced in [53, 333]: it combines random DP (for a large family of distributions) and free lunch privacy.

Definition 31 ((ε, γ)-distributional privacy [53, 333]). An algorithm A satisfies (ε, γ)-distributional privacy (DlPr[53, 333]) if for any distribution π over possible tuples, if D and D' are picked by randomly drawing a fixed number of elements from π without replacement, then with probability at least 1 − γ over the choice of D and D', A(D) and A(D') are ε-indistinguishable.

Interestingly, this definition captures an intuition similar to the variants in Section 2.2.7: the adversary can only learn properties of the data-generating distribution, but not about particular samples (except with probability lower than γ). The authors also prove that if A is (ε, γ)-DlPr and γ is small enough as a function of the size of the dataset being generated, then A is also ε-DP.

Combination with Q

Different risk models, like the definitions in Section 2.2.2, are also compatible with varying the privacy parameters across inputs. For example, in [234], the author proposes endogenous DP (EndDP), which combines such a relaxed risk model with personalized DP. Similarly, pseudo-metric DP (PsDP), defined in [106], is a combination of d-privacy and (ε, δ)-DP; while extended divergence DP (EDivDP), defined in [218], is a combination of d-privacy and divergence DP.

Randomly limiting the scope of the definition can also be combined with ideas from the previous sections. For example, in [367], the authors introduce weak Bayesian DP (WBDP), which combines random DP and approximate DP. In [381], the authors introduce on-average KL privacy, which uses the KL-divergence as its quantification metric, but only requires the property to hold for an “average dataset”, like random DP; a similar notion appears in [150] as average leave-one-out KL stability. In [92, 367], the authors introduce Bayesian DP¹³ (BayDP[367]) and privacy at risk (PAR) respectively; both definitions combine random DP with probabilistic DP, with slightly different approaches: the former quantifies over all possible datasets and changes one fixed record randomly, while the latter selects both datasets randomly, conditioned on these datasets being neighbors.

In [225], Kifer et al. go further: they take the intuition behind generic DP, introduced in Section 2.2.3, and generalize the indistinguishability condition entirely. The resulting definition is also called generic differential privacy.

Definition 32 ((R, f)-generic differential privacy [225, 226]). A privacy mechanism M satisfies (R, f)-generic DP (GcDP[225]) if for all measurable sets of outputs S and for all (D, D') ∈ R:

P[M(D) ∈ S] ≤ f_{D,D'}(P[M(D') ∈ S])

where each f_{D,D'} is a concave function continuous on [0, 1], such as x ↦ min(1, e^ε · x + δ).

The privacy relation R is still the generalization of the neighborhood, and the privacy predicate f is the generalization of ε-indistinguishability to arbitrary functions. In particular, it can encompass all the variants of Q described in Section 2.2.2 in addition to the ones in this section: for example, if the predicate holds with f_{D,D'}(x) = e^ε · x for all neighboring datasets D and D' and all sets of outputs S, then this is equivalent to ε-DP. This definition was an attempt at finding the most generic definition that still satisfies the privacy axioms: another extension defined in the same work, abstract DP (AbsDP), is even more generic, but no longer satisfies the privacy axioms.
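As a concrete illustration (a sketch written for this section, not code from [225]; the use of randomized response and all names are assumptions), one can check the generic DP predicate numerically for a mechanism whose output distributions are known exactly, instantiating the predicate with the functions corresponding to ε-DP and (ε, δ)-DP.

```python
import itertools
import math

def rr_distribution(bit, eps):
    """Output distribution of randomized response on a single bit."""
    p_true = math.exp(eps) / (1 + math.exp(eps))
    return {bit: p_true, 1 - bit: 1 - p_true}

def satisfies_generic_dp(dist1, dist2, f):
    """Check P[M(D) in S] <= f(P[M(D') in S]) for every set of outputs S."""
    outputs = list(dist1)
    for size in range(len(outputs) + 1):
        for subset in itertools.combinations(outputs, size):
            p1 = sum(dist1[o] for o in subset)
            p2 = sum(dist2[o] for o in subset)
            if p1 > f(p2) + 1e-12:  # small tolerance for floating-point error
                return False
    return True

eps, delta = math.log(3), 0.01
d0, d1 = rr_distribution(0, eps), rr_distribution(1, eps)
# Predicate for pure eps-DP, then for the relaxed (eps, delta)-DP.
print(satisfies_generic_dp(d0, d1, lambda x: math.exp(eps) * x))
print(satisfies_generic_dp(d0, d1, lambda x: min(1.0, math.exp(eps) * x + delta)))
```

Both checks print True, since randomized response with flipping probability 1/(1 + e^ε) is ε-DP, and the (ε, δ) predicate is weaker.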

Definitions in this section are particularly used in the context of local DP¹⁴, especially for applications to location privacy: various metrics have been discussed to quantify how indistinguishable different places should be, in order to provide users of a local DP mechanism with meaningful privacy protection [68].

¹¹ Distinct from the DP on a location set of [75], which we mention in Section 2.2.3.

¹² Another definition with the same name is introduced in [409]; we mention it in Section 2.2.3.

¹³ There are two other notions with the same name, introduced in [248, 399]; we mention them in Sections 2.2.3 and 2.2.5 respectively.

¹⁴ For details, see Section 2.2.10.
