2.2.5  Background knowledge (B)

In differential privacy, the attacker is implicitly assumed to have full knowledge of the dataset: their only uncertainty is about their target. This implicit assumption is also present for the definitions of the previous dimensions: indeed, the attacker has to distinguish between two fixed datasets $D_1$ and $D_2$. The only source of randomness in $\varepsilon$-indistinguishability comes from the mechanism itself. In many cases, this assumption is unrealistic, and it is natural to consider weaker adversaries, who do not have full background knowledge. One of the main motivations to do so is to use significantly less noise in the mechanism [116].

The typical way to represent this uncertainty formally is to assume that the input data comes from a certain probability distribution (named data evolution scenario in [228]): the randomness of this distribution models the attacker’s uncertainty. Informally, the more random this probability distribution is, the less knowledge the attacker has. However, the definition that follows depends on whether DP is considered with the associative or the causal view. In the associative view, the sensitive property changes before the data is generated: it conditions the data-generating probability distribution. In the causal view, however, the sensitive property is only changed after the data is generated. The two options lead to very distinct definitions.

Conditioning the output on the sensitive property

Using a probability distribution to generate the input data means that the $\varepsilon$-indistinguishability property cannot be expressed between two fixed datasets. Instead, one natural way to express it is to condition this distribution on some sensitive property. The corresponding notion, noiseless privacy15, was first introduced in [116] and formalized in [46].

Definition 33 ($(\Theta,\varepsilon)$-noiseless privacy [46, 116]). Given a family $\Theta$ of probability distributions on $\mathcal{D}$, a mechanism $\mathcal{M}$ is $(\Theta,\varepsilon)$-noiseless private (NPr) if for all $\theta \in \Theta$, all $i$, all $t, t' \in \mathcal{T}$, and all $S \subseteq \mathcal{O}$:

$$\mathbb{P}_{D \sim \theta}\left[\mathcal{M}(D) \in S \mid D_i = t\right] \leq e^{\varepsilon} \cdot \mathbb{P}_{D \sim \theta}\left[\mathcal{M}(D) \in S \mid D_i = t'\right]$$
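To make this definition concrete, here is a small numerical sketch (not taken from this text): it estimates the privacy level that a completely noiseless counting query $\mathcal{M}(D) = \sum_i D_i$ obtains purely from the attacker's uncertainty, assuming records are i.i.d. Bernoulli($p$). The small $\delta$ budget for extremely unlikely outputs is an addition of this sketch, since a pure $\varepsilon$ bound cannot hold for outputs that are possible under only one of the two conditionings.

```python
# Illustrative sketch (assumptions: i.i.d. Bernoulli(p) records, counting query,
# and a small delta budget for extremely unlikely outputs -- none of this is
# prescribed by Definition 33 itself).
from math import comb, log

def binom_pmf(k, n, p):
    """Probability that a Binomial(n, p) variable equals k."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

def noiseless_privacy_epsilon(n, p, delta=1e-9):
    """Smallest epsilon such that the output distribution of the *noiseless*
    counting query M(D) = sum(D), conditioned on D_i = 1, is within a factor
    e^epsilon of the one conditioned on D_i = 0, except on a set of outputs
    of total probability mass at most delta."""
    outputs = []
    for s in range(n + 1):
        p_given_1 = binom_pmf(s - 1, n - 1, p)  # P[M(D) = s | D_i = 1]
        p_given_0 = binom_pmf(s, n - 1, p)      # P[M(D) = s | D_i = 0]
        if min(p_given_1, p_given_0) == 0:
            ratio = float("inf")  # output possible under only one conditioning
        else:
            ratio = abs(log(p_given_1 / p_given_0))
        outputs.append((ratio, max(p_given_1, p_given_0)))
    outputs.sort(reverse=True)  # consider the worst outputs first
    discarded = 0.0
    for ratio, mass in outputs:
        if discarded + mass > delta:
            return ratio        # this output must satisfy the epsilon bound
        discarded += mass       # unlikely enough to be absorbed by delta
    return 0.0

# More unknown records means more attacker uncertainty, hence a smaller epsilon,
# even though the mechanism itself adds no noise at all.
print(noiseless_privacy_epsilon(n=100, p=0.5))
print(noiseless_privacy_epsilon(n=1000, p=0.5))
```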

In the original definition, the auxiliary knowledge of the attacker is explicitly pointed out in an additional parameter. As we show in Section 3.1.2.0 after Definition 49, in the case where there is no $\delta$ (like in $\varepsilon$-DP), this syntactic add-on is not necessary, so we omitted it here.

This definition follows naturally from considering the associative view of DP with the strong adversary assumption, and attempting to relax this assumption. The exact way to model the adversary’s uncertainty can be changed; for example DP under sampling [255], an instance of noiseless privacy, models it using random sampling.

This definition was not the first attempt at formalizing an adversary with partial background knowledge. In [264], the authors define $\varepsilon$-privacy, which represents the partial knowledge as a Dirichlet distribution (instead of an arbitrary distribution), whose parameters are interpreted as characteristics of the attacker. However, this definition imposes a condition on the output, but not on the mechanism which produced the output. As such, it does not offer strong semantic guarantees like the other definitions presented in this survey.

Removing the effect of correlations in the data

In [36], however, the authors argue that in the presence of correlations in the data, noiseless privacy can be too strong, making it impossible to learn global properties of the data. Indeed, if one record can have an arbitrarily large influence on the rest of the data, conditioning on the value of this record can lead to very distinguishable outputs, even if the mechanism only depends on global properties of the data. To fix this problem, they propose distributional DP (DistDP), an alternative definition that only conditions the data-generating distribution on one possible value of the target record, and quantifies the information leakage from the mechanism. DistDP uses a simulator, similarly to variants introduced in Section 2.2.7.

As we show later, this creates definitional issues in the presence of partial knowledge. These issues are discussed at length in Section 3.1.2, where we also introduce an alternative definition that captures the same intuition without encountering the same problems: causal DP.

Definition 34 ($(\Theta,\varepsilon)$-causal differential privacy). Given a family $\Theta$ of probability distributions on $\mathcal{D}$, a mechanism $\mathcal{M}$ satisfies $(\Theta,\varepsilon)$-causal DP (CausDP) if for all probability distributions $\theta \in \Theta$, all $i$, all $t, t' \in \mathcal{T}$, and all $S \subseteq \mathcal{O}$:

$$\mathbb{P}_{D \sim \theta}\left[\mathcal{M}(D) \in S \mid D_i = t\right] \leq e^{\varepsilon} \cdot \mathbb{P}_{D \sim \theta}\left[\mathcal{M}(D_{i \to t'}) \in S \mid D_i = t\right]$$

where $D_{i \to t'}$ is the dataset obtained by changing the $i$-th record of $D$ into $t'$.

In this definition, one record is changed after the dataset has been generated, so it does not affect other records through dependence relationships. These dependence relationships are the only difference between noiseless privacy and causal DP: as we show in Proposition 22 in Section 3.1.2.0, when each record is independent of all others, this definition is equivalent to noiseless privacy.
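The difference is easiest to see on an extreme example, sketched below (illustrative, not from this text): $n$ perfectly correlated binary records, all equal to record 1, and a mechanism that releases the sum with Laplace noise of scale $1/\varepsilon$. Conditioning on the value of record 1, as noiseless privacy does, shifts the entire dataset, while causal DP only changes that one record after the data is generated.

```python
# Illustrative sketch: perfectly correlated records (all copies of record 1)
# and a Laplace-noised sum.  We compare the worst-case log density ratio of
# the two conditionings used by noiseless privacy and by causal DP.
import numpy as np

def log_laplace_pdf(x, mu, b):
    """Log-density of a Laplace(mu, b) distribution at x."""
    return -np.abs(x - mu) / b - np.log(2 * b)

n, eps = 1000, 1.0
b = 1.0 / eps                           # noise calibrated to sensitivity 1
xs = np.linspace(-50, n + 50, 200001)   # grid of possible noisy outputs

# Noiseless privacy: conditioning on record 1 determines *every* record, so
# the noisy sum is centred at n when D_1 = 1 and at 0 when D_1 = 0.
npr = np.max(np.abs(log_laplace_pdf(xs, n, b) - log_laplace_pdf(xs, 0, b)))

# Causal DP: record 1 is changed *after* the data is generated, so only its
# own contribution moves (centres n vs n - 1, both conditioned on D_1 = 1).
causal = np.max(np.abs(log_laplace_pdf(xs, n, b) - log_laplace_pdf(xs, n - 1, b)))

print(f"noiseless privacy requires epsilon >= {npr:.1f}")   # about n * eps
print(f"causal DP requires epsilon >= {causal:.1f}")        # about eps
```

With the same amount of noise, the noiseless privacy comparison offers essentially no protection in this correlated setting, while the causal comparison retains the usual $\varepsilon$ guarantee.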

Multidimensional definitions

Limiting the background knowledge of an attacker is orthogonal to the dimensions introduced previously: one can modify the risk model, introduce different neighborhood definitions, or even vary the privacy parameters across the protected properties along with limiting the attacker’s background knowledge.

Combination with Q

Modifying the risk model while limiting the attacker’s background knowledge has interesting consequences. In Section 3.1.3, we show that two options are possible: either consider the partial knowledge as additional information given to the attacker or let the attacker influence the partial knowledge. This distinction between an active and a passive attacker does not matter if only the worst-case scenario is considered, like in noiseless privacy. However, under different risk models, such as allowing a small probability of error, they lead to two different definitions: active partial knowledge DP (APKDP) and passive partial knowledge DP (PPKDP). Both definitions are introduced and discussed in more detail in Section 3.1.3.

To model an active attacker, APKDP quantifies over all possible values of the partial knowledge. It was first introduced in [36, 46] as noiseless privacy, with an additional parameter. We reformulate it to clarify that it implicitly assumes an active attacker. One specialization of this definition is DP under sampling [255] (DPuS), which mandates DP to be satisfied after a random sampling is applied to the dataset. The authors use this definition to show that applying $k$-anonymity to a randomly sampled dataset provides differential privacy; but this definition could also be used on its own, to model the attacker’s uncertainty using a randomly sampled distribution.
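As a rough sketch of this setting (the function name and sampling rate below are illustrative choices, not from [255]), the object whose privacy is analyzed is the composition of a random sampling step with the subsequent mechanism:

```python
# Minimal sketch of the "DP under sampling" setting: the guarantee is required
# of the composed procedure "sample, then release", so the sampling step itself
# becomes part of the attacker's uncertainty about the data.
import random

def sample_then_release(dataset, mechanism, sampling_rate=0.1, seed=None):
    """Keep each record independently with probability `sampling_rate`,
    then apply `mechanism` (e.g. a k-anonymization step, or any aggregate)
    to the sample.  DPuS asks that this whole procedure satisfies DP,
    even when `mechanism` alone does not."""
    rng = random.Random(seed)
    sample = [record for record in dataset if rng.random() < sampling_rate]
    return mechanism(sample)

# Example: releasing the size of a Bernoulli sample of a toy dataset.
print(sample_then_release(list(range(1000)), mechanism=len, sampling_rate=0.1))
```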

PPKDP is strictly weaker: it models a passive attacker, who cannot choose their partial knowledge, and thus cannot influence the data. In this context, the $\delta$ does not only apply to the output of the mechanism, but also to the value of the partial knowledge.

Causal DP can also be adapted to a risk model similar to $(\varepsilon,\delta)$-DP: in [58], the authors introduce a notion similar to causal DP, called inherit DP (InhDP), with the small difference that the second dataset is obtained by removing one record from the first dataset instead of replacing it, and $(\varepsilon,\delta)$-indistinguishability is used. The authors also define empirical DP [58]16, which is identical, except that the empirical distribution is used instead of the actual data distribution, in contexts where the latter is unknown. In both cases, the influence of $\delta$ on the attacker model is unclear.

Combination with N

Modifying the neighborhood definition is simpler: it is clearly orthogonal to the dimensions introduced in this section. In all definitions of this section so far, the two possibilities between which the adversary must distinguish are similar to bounded DP. This can easily be changed to choose other properties to protect from the attacker. This is done in pufferfish privacy [228], which extends the concept of neighboring datasets to neighboring distributions of datasets.

Definition 35 ($(\Theta,\Phi,\varepsilon)$-pufferfish privacy [228]). Given a family $\Theta$ of probability distributions on $\mathcal{D}$, and a family $\Phi$ of pairs of predicates on datasets, a mechanism $\mathcal{M}$ verifies $(\Theta,\Phi,\varepsilon)$-pufferfish privacy (PFPr) if for all distributions $\theta \in \Theta$, all pairs of predicates $(\phi_1, \phi_2) \in \Phi$, and all $S \subseteq \mathcal{O}$:

$$\mathbb{P}_{D \sim \theta}\left[\mathcal{M}(D) \in S \mid \phi_1(D)\right] \leq e^{\varepsilon} \cdot \mathbb{P}_{D \sim \theta}\left[\mathcal{M}(D) \in S \mid \phi_2(D)\right]$$

Pufferfish privacy obtains these neighboring distributions by starting with a set of data-generating distributions, then conditioning them on sensitive predicates. The resulting definition, which compares pairs of conditioned distributions, encompasses noiseless privacy as well as other notions. For example, it captures Bayesian DP17 (BayDP), introduced in [248], in which neighboring datasets have up to a fixed number of records in common and all other records are generated randomly from a known distribution.
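For instance, spelling out the claim above with the notation of Definition 35, choosing as pairs of predicates the possible values of a single record recovers noiseless privacy:

$$\Phi = \bigl\{ (\phi_{i,t}, \phi_{i,t'}) : i \in \{1, \dots, n\},\ t, t' \in \mathcal{T} \bigr\}, \qquad \phi_{i,t}(D) \equiv \bigl(D_i = t\bigr).$$

With this choice, the pufferfish condition becomes

$$\mathbb{P}_{D \sim \theta}\left[\mathcal{M}(D) \in S \mid D_i = t\right] \leq e^{\varepsilon} \cdot \mathbb{P}_{D \sim \theta}\left[\mathcal{M}(D) \in S \mid D_i = t'\right],$$

which is exactly the $(\Theta,\varepsilon)$-noiseless privacy condition of Definition 33.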

The same idea can be formalized by comparing pairs of distributions directly. This is done in [206, 219] via distribution privacy (DnPr). The two formalisms are equivalent: an arbitrary pair of distributions can be seen as a single distribution, conditioned on the value of a secret parameter. Distribution privacy was instantiated in [162] via profile-based DP (PBDP), in which the attacker tries to distinguish between different probabilistic user profiles.

Further relaxations encompassing the previously introduced dimensions are probabilistic distribution privacy [219] (PDnPr), a combination of distribution privacy and probabilistic DP; extended distribution privacy [219] (EDnPr), a combination of distribution privacy and $d$-privacy; divergence distribution privacy [218] (DDnPr), a combination of distribution privacy and divergence DP; and extended divergence distribution privacy [218] (EDDnPr), which combines the latter two definitions. Finally, divergence distribution privacy with auxiliary inputs [218] considers the setting where the attacker might not know the input probability distribution perfectly.

The definitions in this section are an active area of research; a typical question is to quantify under which conditions deterministic mechanisms can provide some level of privacy. However, they are not widely used in practice, likely because of their brittleness: if the assumptions about the attacker’s limited background knowledge turn out to be wrong, then the definitions do not provide any guarantee of protection.

15Another definition with the same name is introduced in [147]; we mention it in Section 2.2.10.0.

16Another definition with the same name is introduced in [5]; we mention it in Section 2.2.10.0.

17There are two other notions with the same name, introduced in [367] and [399]; we mention them in Sections 2.2.3 and 2.2.4 respectively.
