Lowering the cost of anonymization

a PhD thesis

3.1.3  Passive vs. active attackers

In this section and the rest of this chapter, we assume that the criterion introduced in Definition 54 is satisfied: the data-generating distributions considered are always acceptable. For simplicity, we also assume that the sensitive property is the value of a single record. Under these conditions, any value of the partial knowledge is compatible with any value of the sensitive record, since these two events are independent. This also allows us to only consider the formalism of NPr, since it is equivalent to CausDP under this criterion (Proposition 24).

Correlations in the data are not the only issue to account for when limiting the attacker’s background knowledge in DP. Another important question is whether the attacker simply receives some partial knowledge passively, or whether they can actively influence the data. In this section, we show how to model these two situations by adapting the notion of a privacy loss random variable to an attacker with partial knowledge, and we explore the relationship between the two corresponding definitions. To help understand this distinction, we use the following example throughout this section.

Example 7 (Thresholding). 1000 people take part in a Yes/No referendum. Each person votes “Yes” with some probability, independently of the others. The mechanism counts the number of “Yes” votes, but only returns this count if it is above 100; otherwise, it returns $\bot$. The partial knowledge contains the votes of 100 participants, and the attacker wants to know the vote of another individual. We will see that if the probability that each person votes “Yes” is very small, the privacy of this scheme depends on whether the attacker is passive or active.
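To make this example concrete, here is a minimal Python sketch of the setup (the vote probability of $10^{-5}$ and all names are illustrative choices, not values fixed by the text): it samples the votes, applies the thresholded counting mechanism, and extracts the attacker’s partial knowledge.

```python
import random

N_VOTERS = 1000      # total number of participants (from Example 7)
THRESHOLD = 100      # the count is only released if it exceeds this value
KNOWN = 100          # number of votes in the attacker's partial knowledge
P_YES = 1e-5         # illustrative stand-in for a "very small" probability

def sample_votes(p_yes=P_YES, n=N_VOTERS, rng=random):
    """Each person votes Yes (1) independently with probability p_yes."""
    return [1 if rng.random() < p_yes else 0 for _ in range(n)]

def thresholded_count(votes, threshold=THRESHOLD):
    """Return the number of Yes votes if it exceeds the threshold, else None (standing for ⊥)."""
    count = sum(votes)
    return count if count > threshold else None

if __name__ == "__main__":
    votes = sample_votes()
    partial_knowledge = votes[:KNOWN]   # the attacker passively sees these 100 votes
    target_vote = votes[KNOWN]          # the attacker wants to learn this vote
    print("Yes votes in partial knowledge:", sum(partial_knowledge))
    print("mechanism output:", thresholded_count(votes))  # almost always None (⊥) here
```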

Privacy loss random variable for partial knowledge

In Section 2.2.2, we introduced the privacy loss random variable (Definition 13), defined for all outputs $O$ as:

$$\mathcal{L}_{D_1/D_2}(O) = \ln\left(\frac{\mathbb{P}\left[\mathcal{M}(D_1)=O\right]}{\mathbb{P}\left[\mathcal{M}(D_2)=O\right]}\right).$$

The PLRV quantifies how much information is revealed by the output of a mechanism, and we saw in Proposition 8 that a mechanism $\mathcal{M}$ is $(\varepsilon,\delta)$-DP iff for all neighboring databases $D_1$ and $D_2$:

$$\mathbb{E}_{O\sim\mathcal{M}(D_1)}\left[\max\left(0,\,1-e^{\varepsilon-\mathcal{L}_{D_1/D_2}(O)}\right)\right]\le\delta.$$
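As a quick illustration of this characterization, here is a small Python sketch (the mechanism, randomized response on a single bit with truthful-answer probability 3/4, and the helper names are illustrative choices): it computes the PLRV of Definition 13 directly from the two output distributions, and evaluates the expectation of Proposition 8.

```python
import math

EPS = math.log(3)  # with truthful-answer probability 3/4, randomized response is (ln 3, 0)-DP

def randomized_response_dist(bit, p=0.75):
    """Output distribution of randomized response: report the true bit with probability p."""
    return {bit: p, 1 - bit: 1 - p}

def plrv(dist1, dist2, output):
    """Privacy loss of an output when distinguishing dist1 from dist2 (Definition 13)."""
    p1, p2 = dist1.get(output, 0.0), dist2.get(output, 0.0)
    if p2 == 0.0:
        return math.inf
    return math.log(p1 / p2)

def delta_from_plrv(dist1, dist2, eps):
    """E_{O ~ dist1}[max(0, 1 - exp(eps - PLRV(O)))], the quantity of Proposition 8."""
    total = 0.0
    for output, p1 in dist1.items():
        total += p1 * max(0.0, 1.0 - math.exp(eps - plrv(dist1, dist2, output)))
    return total

if __name__ == "__main__":
    d1 = randomized_response_dist(0)
    d2 = randomized_response_dist(1)
    print(delta_from_plrv(d1, d2, EPS))          # 0.0: (ln 3, 0)-DP holds
    print(delta_from_plrv(d1, d2, EPS - 0.5))    # > 0: a smaller epsilon requires some delta
```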
Now, suppose the attacker only has partial knowledge about the data. How should the definition of the PLRV be adapted to this setting? The data $D$ comes from a distribution $\theta$, and the attacker tries to distinguish between $D_i=a$ and $D_i=b$ by observing $\mathcal{M}(D)$, given partial knowledge $\hat{B}$: a value of a random variable $B$ correlated with $D$ (for example, a subset of its records). Since $\hat{B}$ is given to the attacker prior to $\mathcal{M}(D)$, we must condition the probabilities on $B=\hat{B}$.

Definition 56 (PLRV for partial knowledge). Given a mechanism $\mathcal{M}$, a distribution $\theta$ with values in $\mathcal{D}$ (the set of possible databases), an index $i$, and values $a,b\in\mathcal{T}$ (the set of possible records), the PLRV of an output $O$ given partial knowledge $\hat{B}$ is:

$$\mathcal{L}^{\theta,i}_{a/b}\left(O,\hat{B}\right) = \ln\left(\frac{\mathbb{P}_{D\sim\theta}\left[\mathcal{M}(D)=O \mid D_i=a,\,B=\hat{B}\right]}{\mathbb{P}_{D\sim\theta}\left[\mathcal{M}(D)=O \mid D_i=b,\,B=\hat{B}\right]}\right)$$

using the convention $\frac{x}{0}=\infty$ for all $x$.

This PLRV captures the same idea as for classical DP: it quantifies the attacker’s information gain. If the partial knowledge determines all records except the target’s, this definition is the same as the classical PLRV.
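To make Definition 56 concrete on Example 7, here is a small Python sketch (the vote probability of $10^{-5}$ and the helper names are illustrative choices of mine): it computes the partial-knowledge PLRV exactly, using the fact that the unknown votes follow a binomial distribution.

```python
import math

N, KNOWN, THRESHOLD, P_YES = 1000, 100, 100, 1e-5   # P_YES is an illustrative choice
UNKNOWN = N - KNOWN - 1    # records that are neither known to the attacker nor the target's

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def output_prob(output, target_vote, known_yes, p=P_YES):
    """P[mechanism output | target's vote, number of Yes votes in the partial knowledge]."""
    def count_prob(count):
        rest = count - known_yes - target_vote   # Yes votes needed among the unknown records
        return binom_pmf(rest, UNKNOWN, p) if 0 <= rest <= UNKNOWN else 0.0
    if output is None:                           # None stands for ⊥ (count not released)
        return sum(count_prob(c) for c in range(THRESHOLD + 1))
    return count_prob(output) if output > THRESHOLD else 0.0

def plrv_partial(output, known_yes, a=1, b=0):
    """PLRV of Definition 56 for Example 7, conditioned on the partial knowledge."""
    num = output_prob(output, a, known_yes)
    den = output_prob(output, b, known_yes)
    if den == 0.0:
        return math.inf
    if num == 0.0:
        return -math.inf
    return math.log(num / den)

if __name__ == "__main__":
    # Partial knowledge containing no Yes votes (the overwhelmingly likely case):
    # the mechanism almost surely returns ⊥, which carries almost no information.
    print(plrv_partial(None, known_yes=0))     # ≈ 0
    # Partial knowledge containing 100 Yes votes: a released count of 101 strongly
    # suggests that the target voted Yes.
    print(plrv_partial(101, known_yes=100))    # large positive value (≈ 4.7 with these parameters)
```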

Now that we have translated the concept of the PLRV to account for partial knowledge, we can use it to adapt the privacy definition. The formula in Proposition 8 averages the PLRV over all possible outputs $O$, but the PLRV with partial knowledge has a second parameter, $\hat{B}$. How should this new parameter be handled? There are at least two reasonable possibilities.

Active partial knowledge

The first option is to quantify over all possibilities for the attacker’s partial knowledge. We assume the worst: we consider the value of the partial knowledge for which the privacy loss is the greatest. This models a scenario where the attacker can not only see, but also influence the data. If the attacker can, for example, add fake users to the database, then they can choose the values associated with these new records to maximize their chances of an information gain. We therefore call this option active partial knowledge, short for “partial knowledge under active attacks”.

Definition 57 (APKDP). Given a family of distributions $\Theta$, a mechanism $\mathcal{M}$ is $(\varepsilon,\delta,\Theta)$-APKDP (Active Partial Knowledge Differential Privacy) if for all distributions $\theta\in\Theta$, all indices $i$, all $a,b\in\mathcal{T}$, and all possible values $\hat{B}$ of the partial knowledge:

$$\mathbb{E}_{O\sim\mathcal{M}(D)\mid D_i=a,\,B=\hat{B}}\left[\max\left(0,\,1-e^{\varepsilon-\mathcal{L}^{\theta,i}_{a/b}(O,\hat{B})}\right)\right]\le\delta$$

or, equivalently, for all sets $S$ of possible outputs:

$$\mathbb{P}_{D\sim\theta}\left[\mathcal{M}(D)\in S \mid D_i=a,\,B=\hat{B}\right]\le e^{\varepsilon}\cdot\mathbb{P}_{D\sim\theta}\left[\mathcal{M}(D)\in S \mid D_i=b,\,B=\hat{B}\right]+\delta.$$
The proof of this equivalence is the same as that of Lemma 1 in [MM18], which makes it explicit that APKDP is the same as NPr in its reformulation from [BGKS13]. As shown in [TSD20], APKDP and DP coincide whenever the attacker has full knowledge (when the partial knowledge determines all records except the target’s).

With APKDP, a fixed part of the distribution can be arbitrarily determined. In Example 7, this corresponds to having the attacker control some percentage of the voters. Such an active attacker can simply add many fake “Yes” votes to the database to reach the threshold of 100: this makes the thresholding pointless. The mechanism then becomes a simple counting query, which does not provide privacy: with high probability, every participant outside the attacker’s control votes “No”, and the only uncertainty left is over the attacker’s target.
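A short numerical sketch of this attack, under the same illustrative parameters as before (in particular a vote probability of $10^{-5}$, which is my choice, not a value fixed by the text): once the attacker injects 100 “Yes” votes, the output distribution conditioned on the target’s vote separates almost completely.

```python
import math

N, KNOWN, THRESHOLD, P_YES = 1000, 100, 100, 1e-5
UNKNOWN = N - KNOWN - 1   # 899 records the attacker neither controls nor targets

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def output_dist(target_vote, injected_yes=KNOWN, p=P_YES):
    """Distribution of the thresholded count when the attacker injects `injected_yes` Yes votes."""
    dist = {}
    for unknown_yes in range(UNKNOWN + 1):
        count = injected_yes + target_vote + unknown_yes
        output = count if count > THRESHOLD else None   # None stands for ⊥
        dist[output] = dist.get(output, 0.0) + binom_pmf(unknown_yes, UNKNOWN, p)
    return dist

if __name__ == "__main__":
    yes_dist = output_dist(target_vote=1)
    no_dist = output_dist(target_vote=0)
    print("P[output = ⊥   | target voted Yes] ≈", yes_dist.get(None, 0.0))   # 0.0
    print("P[output = ⊥   | target voted No ] ≈", no_dist.get(None, 0.0))    # ≈ 0.991
    print("P[output = 101 | target voted Yes] ≈", yes_dist.get(101, 0.0))    # ≈ 0.991
    print("P[output = 101 | target voted No ] ≈", no_dist.get(101, 0.0))     # ≈ 0.009
```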

In addition to modeling an active attacker, APKDP can also be used in scenarios where $\theta$ is unknown, but can be approximated. The “partial knowledge” can then represent the error between the true distribution and the approximation, and if APKDP is satisfied, then the privacy property also holds for the true distribution.

Note that in this context, explicitly modeling the background knowledge is technically not necessary. Instead, we could simply create a new family of probability distributions, by conditioning each $\theta\in\Theta$ on each possible value $\hat{B}$ of the partial knowledge. We make this background knowledge explicit instead, so that APKDP is easier to compare with PPKDP, defined in the next section.

Passive partial knowledge

APKDP represents situations where the attacker can modify the data. An example is an online service that publishes statistics about its use, where the attacker can interact with the service before the usage statistics are published. Now, what if the attacker cannot interact with the data? Consider e.g. researchers publishing the results of a clinical study about patients with a given medical condition. A typical attacker cannot influence the clinical data, but might have some partial knowledge about other participants in the study.

How can we model such a passive attacker, who has access to some information about the data, but cannot influence it? It no longer makes sense to quantify over arbitrary partial knowledge. In the same way that the reformulation of $(\varepsilon,\delta)$-DP using the PLRV averages the PLRV over all possible outputs, we must average the PLRV over all possible values of the partial knowledge.

Definition 58 (PPKDP). Given a family of distributions $\Theta$, a mechanism $\mathcal{M}$ is $(\varepsilon,\delta,\Theta)$-PPKDP (Passive Partial Knowledge Differential Privacy) if for all distributions $\theta\in\Theta$, all indices $i$, and all $a,b\in\mathcal{T}$:

$$\mathbb{E}_{(O,\hat{B})\sim\left(\mathcal{M}(D),\,B\right)\mid D_i=a}\left[\max\left(0,\,1-e^{\varepsilon-\mathcal{L}^{\theta,i}_{a/b}(O,\hat{B})}\right)\right]\le\delta.$$
In this context, $\delta$ has a similar meaning as in $(\varepsilon,\delta)$-DP: it captures the probability that the attacker gets lucky. In $(\varepsilon,\delta)$-DP, it means that the output $O$ allows the attacker to distinguish between $D_1$ and $D_2$ with high probability, or, equivalently, that the PLRV associated with the output $O$ is large. In $(\varepsilon,\delta,\Theta)$-PPKDP however, $\delta$ captures the probability of the attacker getting either a favorable output $O$, or a favorable partial knowledge $\hat{B}$.

With PPKDP, the thresholding mechanism of Example 7 is private. Indeed, with high probability, the partial knowledge will contain only “No” votes; and almost certainly, the mechanism will output $\bot$ and give no information. We formalize this intuition in Section 3.2.2.
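A rough numerical sketch of why the averaged quantity stays small in Example 7, under the same illustrative parameters as before: both “lucky” events for a passive attacker — a released count, or partial knowledge containing “Yes” votes — are very unlikely.

```python
import math

N, KNOWN, THRESHOLD, P_YES = 1000, 100, 100, 1e-5

def prob_count_released(n=N, threshold=THRESHOLD, p=P_YES):
    """P[total Yes count > threshold] under passive sampling (a binomial tail)."""
    return sum(
        math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(threshold + 1, n + 1)
    )

if __name__ == "__main__":
    # Prints 0.0: the true tail probability is astronomically small, far below double precision.
    print("P[count released]                    ≈", prob_count_released())
    print("P[a given known vote is Yes]         =", P_YES)
    print("P[any of the 100 known votes is Yes] ≈", 1 - (1 - P_YES) ** KNOWN)
```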

Note that as opposed to APKDP, PPKDP cannot be easily reformulated using $(\varepsilon,\delta)$-indistinguishability. Since the statement conditions both probabilities on $B=\hat{B}$, the $\delta$ in $(\varepsilon,\delta)$-indistinguishability only applies to the randomness of $\mathcal{M}(D)$. To use an indistinguishability-based formulation, we would need to use $\delta$ twice, and (for example) explicitly require that $(\varepsilon,\delta)$-indistinguishability holds with probability at least $1-\delta$ over the choice of $\hat{B}$.

Remark 1. PPKDP shares some characteristics with inference-based distributional differential privacy (IBDDP), introduced in [BGKS13]. A mechanism $\mathcal{M}$ satisfies IBDDP if there is a simulator $\mathrm{Sim}$ such that for all probability distributions $\theta\in\Theta$ and indices $i$, the statement:

$$\mathbb{P}\left[D_i\in S \mid \mathcal{M}(D)=O,\,B=\hat{B}\right]\le e^{\varepsilon}\cdot\mathbb{P}\left[D_i\in S \mid \mathrm{Sim}\left(D_{-i}\right)=O,\,B=\hat{B}\right]+\delta\quad\text{for all sets of values }S$$

holds with probability at least $1-\delta$ over the choice of $O$ and $\hat{B}$.

Leaving aside the simulator, note that $\delta$ is used in two separate parts of the definition: both over the choice of $O$ and $\hat{B}$, and in the indistinguishability. As such, it is difficult to see intuitively what $\delta$ corresponds to, and the interpretation based on the “probability that the attacker gets more information than $\varepsilon$” is not correct. This is one of the reasons why a PLRV-based formulation is more convenient: $\delta$ can simply be interpreted in the same way as in $(\varepsilon,\delta)$-differential privacy.

Further, the strict implication between DistDP and IBDDP proven in [BGKS13] can be explained by a similar distinction between an active and a passive attacker, even if it is not made explicit in the original paper. Indeed, in DistDP, $\delta$ applies only to the indistinguishability property, and DistDP quantifies over all possible values of the background knowledge: the attacker is assumed to be able to choose the most favorable value of the background knowledge. In contrast, with IBDDP, $\delta$ is applied both to the indistinguishability property and to the choice of the background knowledge; hence the attacker is implicitly assumed to get the background knowledge randomly.

Relation between definitions

In this section, we formalize the relation between PPKDP and APKDP and show basic results on those definitions.

First, APKDP and PPKDP satisfy both privacy axioms proposed in  [KL10], which we reproduced in Definition 9 in Section 2.2.1. These axioms express natural properties that we expect to be true for any reasonable definition of privacy.

Proposition 25. PPKDP satisfies the post-processing axiom: if a mechanism $\mathcal{M}$ is $(\varepsilon,\delta,\Theta)$-PPKDP, then for any function $f$, the mechanism $f\circ\mathcal{M}$ defined by $\left(f\circ\mathcal{M}\right)(D)=f\left(\mathcal{M}(D)\right)$ is also $(\varepsilon,\delta,\Theta)$-PPKDP. It also satisfies the convexity axiom: if two mechanisms $\mathcal{M}_1$ and $\mathcal{M}_2$ are both $(\varepsilon,\delta,\Theta)$-PPKDP, then the mechanism that applies $\mathcal{M}_1$ with some probability $p$ and $\mathcal{M}_2$ with probability $1-p$ is also $(\varepsilon,\delta,\Theta)$-PPKDP.

APKDP also satisfies these axioms.

Proof. For APKDP, the reformulation of the definition using classical $(\varepsilon,\delta)$-indistinguishability makes the result straightforward: the proof is the same as for $(\varepsilon,\delta)$-DP. For PPKDP, we first reformulate the definition using the $\alpha$-divergence $\Delta_\alpha(P,Q)=\sum_{O}\max\left(0,\,P(O)-\alpha\cdot Q(O)\right)$. Let $\alpha=e^{\varepsilon}$. Then a mechanism $\mathcal{M}$ is $(\varepsilon,\delta,\Theta)$-PPKDP iff for all $\theta\in\Theta$, all indices $i$, and all $a,b\in\mathcal{T}$:

$$\mathbb{E}_{\hat{B}\sim B\mid D_i=a}\left[\Delta_{e^{\varepsilon}}\left(\left(\mathcal{M}(D)\mid D_i=a,\,B=\hat{B}\right),\ \left(\mathcal{M}(D)\mid D_i=b,\,B=\hat{B}\right)\right)\right]\le\delta.$$

This view allows us to use the monotonicity and joint convexity properties of the $\alpha$-divergence to immediately prove the result for PPKDP. □
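As an informal illustration of the divergence used in this reformulation, here is a minimal Python sketch (the distributions and function names are arbitrary choices of mine): it computes $\Delta_\alpha$ on finite distributions and checks numerically that deterministic post-processing does not increase it, which is the monotonicity property used in the proof.

```python
import math

def alpha_divergence(p, q, alpha):
    """Delta_alpha(P, Q) = sum_O max(0, P(O) - alpha * Q(O)) over finite distributions."""
    outputs = set(p) | set(q)
    return sum(max(0.0, p.get(o, 0.0) - alpha * q.get(o, 0.0)) for o in outputs)

def post_process(dist, f):
    """Distribution of f(O) when O is drawn from dist (deterministic post-processing)."""
    out = {}
    for o, prob in dist.items():
        out[f(o)] = out.get(f(o), 0.0) + prob
    return out

if __name__ == "__main__":
    # Two arbitrary output distributions, and a post-processing function that coarsens outputs:
    p = {0: 0.4, 1: 0.3, 2: 0.1, 3: 0.1, 4: 0.1}
    q = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.2, 4: 0.2}
    coarsen = lambda o: o % 2
    alpha = math.exp(0.5)   # alpha = e^eps with eps = 0.5
    before = alpha_divergence(p, q, alpha)
    after = alpha_divergence(post_process(p, coarsen), post_process(q, coarsen), alpha)
    # Monotonicity (data-processing inequality): post-processing can only decrease the divergence.
    print(before, ">=", after, ":", before >= after)   # True
```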

We saw that $(\varepsilon,\delta,\Theta)$-APKDP bounds the probability mass of the PLRV above $\varepsilon$ by $\delta$, for all possible values $\hat{B}$ of the partial knowledge. By contrast, PPKDP bounds the same probability mass, averaged over all possible values of $\hat{B}$, weighted by their likelihood. We formalize this interpretation and use it to show that APKDP is, as expected, stronger than PPKDP. More surprisingly, we also use it to show that when $\delta=0$, both definitions are equivalent.

Theorem 1. Given a distribution $\theta$, a mechanism $\mathcal{M}$, an index $i$, two values $a,b\in\mathcal{T}$, an output $O$, a possible value $\hat{B}$ of the partial knowledge, and a fixed $\varepsilon$, let us denote $m_{O,\hat{B}} = \max\left(0,\,1-e^{\varepsilon-\mathcal{L}^{\theta,i}_{a/b}(O,\hat{B})}\right)$. The respective quantities bounded by the requirements of APKDP and PPKDP are:

$$\delta_{\mathrm{APK}}\left(\hat{B}\right)=\mathbb{E}_{O\sim\mathcal{M}(D)\mid D_i=a,\,B=\hat{B}}\left[m_{O,\hat{B}}\right]$$

and:

$$\delta_{\mathrm{PPK}}=\mathbb{E}_{(O,\hat{B})\sim\left(\mathcal{M}(D),\,B\right)\mid D_i=a}\left[m_{O,\hat{B}}\right]=\mathbb{E}_{\hat{B}\sim B\mid D_i=a}\left[\delta_{\mathrm{APK}}\left(\hat{B}\right)\right].$$

As an immediate consequence, if $\mathcal{M}$ is $(\varepsilon,\delta,\Theta)$-APKDP, then it is also $(\varepsilon,\delta,\Theta)$-PPKDP. Further, $(\varepsilon,0,\Theta)$-PPKDP and $(\varepsilon,0,\Theta)$-APKDP are equivalent.

Proof. We decompose $\delta_{\mathrm{PPK}}$ depending on the value of $\hat{B}$:

$$\delta_{\mathrm{PPK}}=\sum_{\hat{B}}\mathbb{P}\left[B=\hat{B}\mid D_i=a\right]\cdot\mathbb{E}_{O\sim\mathcal{M}(D)\mid D_i=a,\,B=\hat{B}}\left[m_{O,\hat{B}}\right]=\mathbb{E}_{\hat{B}\sim B\mid D_i=a}\left[\delta_{\mathrm{APK}}\left(\hat{B}\right)\right].$$

For the second part of the statement, if the mechanism is $(\varepsilon,\delta,\Theta)$-APKDP, then for all $\theta\in\Theta$, $i$, $a,b\in\mathcal{T}$, and $\hat{B}$, we have $\delta_{\mathrm{APK}}\left(\hat{B}\right)\le\delta$, so:

$$\delta_{\mathrm{PPK}}=\mathbb{E}_{\hat{B}\sim B\mid D_i=a}\left[\delta_{\mathrm{APK}}\left(\hat{B}\right)\right]\le\delta.$$

For the last part of the statement, assume that $\mathcal{M}$ is $(\varepsilon,0,\Theta)$-PPKDP. Then:

$$\sum_{\hat{B}}\mathbb{P}\left[B=\hat{B}\mid D_i=a\right]\cdot\delta_{\mathrm{APK}}\left(\hat{B}\right)=\delta_{\mathrm{PPK}}=0.$$

All summands are non-negative, so the sum can only be $0$ if all summands are $0$: for all $\hat{B}$ with $\mathbb{P}\left[B=\hat{B}\mid D_i=a\right]>0$, $\delta_{\mathrm{APK}}\left(\hat{B}\right)=0$, and $\mathcal{M}$ is $(\varepsilon,0,\Theta)$-APKDP. □

When $\delta>0$, the implication is strict: the PLRV can be arbitrarily higher for certain values of the background knowledge. Thus, quantifying over all possible values can lead to much larger values of $\varepsilon$ and $\delta$ than averaging over all possible values of the background knowledge: Example 7 illustrates this phenomenon. When $\delta=0$ however, $(\varepsilon,0,\Theta)$-APKDP and $(\varepsilon,0,\Theta)$-PPKDP are both worst-case properties, like $\varepsilon$-DP: an attacker’s ability to choose the background knowledge does not matter, since even for the passive attacker, we need to consider the worst possible output $O$ and background knowledge $\hat{B}$.
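The gap between the two quantities of Theorem 1 can be made concrete on Example 7. The following Python sketch (with the same illustrative parameters as before, and hypothetical helper names) computes the quantity bounded by APKDP for two specific values of the partial knowledge, and the average of that quantity over the partial knowledge a passive attacker would actually observe:

```python
import math

N, KNOWN, THRESHOLD, P_YES, EPS = 1000, 100, 100, 1e-5, 1.0   # EPS is an illustrative epsilon
UNKNOWN = N - KNOWN - 1

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def output_dist(target_vote, known_yes, p=P_YES):
    """Distribution of the thresholded count given the target's vote and the known votes."""
    dist = {}
    for u in range(UNKNOWN + 1):
        count = known_yes + target_vote + u
        o = count if count > THRESHOLD else None
        dist[o] = dist.get(o, 0.0) + binom_pmf(u, UNKNOWN, p)
    return dist

def delta_apk(known_yes, a=1, b=0, eps=EPS):
    """E_O[max(0, 1 - exp(eps - PLRV))] for one fixed value of the partial knowledge."""
    pa, pb = output_dist(a, known_yes), output_dist(b, known_yes)
    # sum_O max(0, pa(O) - e^eps * pb(O)) equals the expectation above.
    return sum(max(0.0, p_o - math.exp(eps) * pb.get(o, 0.0)) for o, p_o in pa.items())

if __name__ == "__main__":
    # Quantity bounded by APKDP: worst case over the partial knowledge (for one choice of a, b).
    print("delta_APK(100 known Yes votes) ≈", delta_apk(known_yes=KNOWN))   # close to 1
    print("delta_APK(0 known Yes votes)   ≈", delta_apk(known_yes=0))       # close to 0
    # Quantity bounded by PPKDP: average over the partial knowledge; here delta_APK only
    # depends on the number of Yes votes among the 100 known records, which is binomial.
    delta_ppk = sum(binom_pmf(k, KNOWN, P_YES) * delta_apk(known_yes=k) for k in range(KNOWN + 1))
    print("delta_PPK (averaged)           ≈", delta_ppk)                    # close to 0
```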

$B$-reducible mechanisms

For some mechanisms and probability distributions, active attackers are, perhaps surprisingly, not more powerful than passive attackers, even when $\delta>0$. We introduce here a sufficient condition for APKDP and PPKDP to be equivalent, and we show that this condition appears in natural contexts.

Consider the example of a referendum where 2000 users take part in a vote with two options. Each user votes “yes” with some probability $p$, and “no” with probability $1-p$. The mechanism returns the exact tally of the vote. We assume that the attacker knows half of the votes: their partial knowledge is the vote of 1000 users. They might know that e.g. 500 of these users voted “yes”, 500 voted “no”, and the remaining 1000 votes are unknown. The attacker aims to get information on the vote of their target.

Does it matter, in this situation, whether the attacker is passive or active? If the attacker can choose the votes of 1000 users, they can decide that each known user will vote “yes”. But changing these votes will only modify the tally in a predictable way: the attacker can remove these votes from the total tally. Intuitively, it does not matter whether these known users all vote “yes”, “no”, or have any other behavior known to the attacker. Without dependency relationships between users, the attacker’s uncertainty solely resides in the unknown votes: a passive attacker is not weaker than an active attacker.
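A small numerical check of this intuition, under illustrative assumptions ($p=0.3$, and helper names of my choosing): shifting the tally by the difference in known “yes” votes maps the output distribution under one value of the partial knowledge exactly onto the distribution under another.

```python
import math

TOTAL, KNOWN_N, P = 2000, 1000, 0.3    # sizes from the referendum example; p is illustrative
UNKNOWN = TOTAL - KNOWN_N - 1          # votes that are neither known nor the target's

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def tally_dist(target_vote, known_yes):
    """Distribution of the exact tally given the target's vote and the known votes."""
    return {
        known_yes + target_vote + u: binom_pmf(u, UNKNOWN, P)
        for u in range(UNKNOWN + 1)
    }

if __name__ == "__main__":
    # Two different values of the partial knowledge: 500 known Yes votes vs 1000 known Yes votes.
    d_500 = tally_dist(target_vote=1, known_yes=500)
    d_1000 = tally_dist(target_vote=1, known_yes=1000)
    # The bijection "shift the output by the difference in known Yes votes" maps one
    # conditional distribution exactly onto the other:
    shifted = {tally + 500: prob for tally, prob in d_500.items()}
    print(all(math.isclose(shifted[t], d_1000[t]) for t in d_1000))   # True
```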

We generalize this intuition via the concept of $B$-reducibility, which captures scenarios where all possible values of the background knowledge are equivalent from a privacy perspective. We then show that under this condition, APKDP is equivalent to PPKDP, and formalize the above example to show that it satisfies this condition.

Definition 59 ($B$-reducibility). A mechanism $\mathcal{M}$ is $B$-reducible if for all $\theta\in\Theta$, all indices $i$, and all possible values $\hat{B}_1$ and $\hat{B}_2$ of the partial knowledge, there is a bijective mapping $\sigma_{\hat{B}_1,\hat{B}_2}$ on outputs such that for all $a\in\mathcal{T}$ and for all outputs $O$:

$$\mathbb{P}_{D\sim\theta}\left[\mathcal{M}(D)=O\mid D_i=a,\,B=\hat{B}_1\right]=\mathbb{P}_{D\sim\theta}\left[\mathcal{M}(D)=\sigma_{\hat{B}_1,\hat{B}_2}(O)\mid D_i=a,\,B=\hat{B}_2\right].$$
This equivalence between possible outputs under $\hat{B}_1$ and $\hat{B}_2$ can be translated to an equivalence between the corresponding PLRVs: if $\mathcal{M}$ is $B$-reducible, then the global behavior of $\mathcal{L}^{\theta,i}_{a/b}\left(\cdot,\hat{B}_1\right)$ and of $\mathcal{L}^{\theta,i}_{a/b}\left(\cdot,\hat{B}_2\right)$ are the same. This equivalence between PLRVs enables us to show that $B$-reducibility implies that APKDP and PPKDP are the same, even when $\delta>0$. So there are mechanisms and background knowledge functions for which an active attacker is not stronger than a passive one.

Theorem 2. Let $\Theta$ be a family of probability distributions, and let $\mathcal{M}$ be a $B$-reducible mechanism. Then $\mathcal{M}$ is $(\varepsilon,\delta,\Theta)$-APKDP iff it is $(\varepsilon,\delta,\Theta)$-PPKDP.

Proof. First, we show that for all outputs $O$ and all possible values $\hat{B}_1$ and $\hat{B}_2$ of the partial knowledge:

$$\mathcal{L}^{\theta,i}_{a/b}\left(O,\hat{B}_1\right)=\mathcal{L}^{\theta,i}_{a/b}\left(\sigma_{\hat{B}_1,\hat{B}_2}(O),\hat{B}_2\right).$$

This statement directly follows by unfolding the definition of the PLRV (Definition 56) and of $B$-reducibility (Definition 59):

$$\mathcal{L}^{\theta,i}_{a/b}\left(O,\hat{B}_1\right)=\ln\left(\frac{\mathbb{P}\left[\mathcal{M}(D)=O\mid D_i=a,\,B=\hat{B}_1\right]}{\mathbb{P}\left[\mathcal{M}(D)=O\mid D_i=b,\,B=\hat{B}_1\right]}\right)=\ln\left(\frac{\mathbb{P}\left[\mathcal{M}(D)=\sigma_{\hat{B}_1,\hat{B}_2}(O)\mid D_i=a,\,B=\hat{B}_2\right]}{\mathbb{P}\left[\mathcal{M}(D)=\sigma_{\hat{B}_1,\hat{B}_2}(O)\mid D_i=b,\,B=\hat{B}_2\right]}\right)=\mathcal{L}^{\theta,i}_{a/b}\left(\sigma_{\hat{B}_1,\hat{B}_2}(O),\hat{B}_2\right).$$

We can now prove Theorem 2. Since $\delta_{\mathrm{PPK}}$ is the expected value of $\delta_{\mathrm{APK}}\left(\hat{B}\right)$ (Theorem 1), it is enough to prove that for any $\hat{B}_1$ and $\hat{B}_2$, $\delta_{\mathrm{APK}}\left(\hat{B}_1\right)=\delta_{\mathrm{APK}}\left(\hat{B}_2\right)$. Recall that $\delta_{\mathrm{APK}}\left(\hat{B}\right)=\mathbb{E}_{O\sim\mathcal{M}(D)\mid D_i=a,\,B=\hat{B}}\left[m_{O,\hat{B}}\right]$, and fix $\hat{B}_1$ and $\hat{B}_2$ among the possible values of the partial knowledge. We have:

$$\delta_{\mathrm{APK}}\left(\hat{B}_1\right)=\sum_{O}\mathbb{P}\left[\mathcal{M}(D)=O\mid D_i=a,\,B=\hat{B}_1\right]\cdot m_{O,\hat{B}_1}=\sum_{O}\mathbb{P}\left[\mathcal{M}(D)=\sigma_{\hat{B}_1,\hat{B}_2}(O)\mid D_i=a,\,B=\hat{B}_2\right]\cdot m_{\sigma_{\hat{B}_1,\hat{B}_2}(O),\hat{B}_2}$$

using Definition 59 and the technical result above. We can then reindex the sum using the bijection $\sigma_{\hat{B}_1,\hat{B}_2}$, and conclude:

$$\delta_{\mathrm{APK}}\left(\hat{B}_1\right)=\sum_{O'}\mathbb{P}\left[\mathcal{M}(D)=O'\mid D_i=a,\,B=\hat{B}_2\right]\cdot m_{O',\hat{B}_2}=\delta_{\mathrm{APK}}\left(\hat{B}_2\right).\qquad\square$$
$B$-reducible mechanisms are fairly common, especially under the natural assumption that the attacker knows a fixed part of the dataset. We give a few examples.

Proposition 26. Let $\Theta$ be a family of distributions in which each $\theta\in\Theta$ generates the record of each user independently, and assume that the background knowledge of the attacker is $k$ fixed records of the database, for a given $k$. Then the following mechanisms are $B$-reducible.

1. Counting queries: given a predicate $\phi$, the algorithm’s output is the number of records for which $\phi$ is true. This generalizes binary voting.
2. Linear queries: given a fixed family of weights $(w_j)_j$, the mechanism returns $\sum_j w_j D_j$.
3. Different types of means: the arithmetic mean, the geometric mean, the harmonic mean, and the quadratic mean (assuming all records are positive).

Proof. Fix $\theta\in\Theta$ and an index $i$. For counting queries, given two values $\hat{B}_1$ and $\hat{B}_2$ of the partial knowledge, define $\sigma_{\hat{B}_1,\hat{B}_2}(O)=O-n_\phi\left(\hat{B}_1\right)+n_\phi\left(\hat{B}_2\right)$, where $n_\phi\left(\hat{B}\right)$ is the number of records in $\hat{B}$ satisfying $\phi$. This function is linear, so injective. Call $\rho$ the distribution of the records not present in the partial knowledge. Then for all $a\in\mathcal{T}$ and all outputs $O$:

$$\mathbb{P}\left[\mathcal{M}(D)=O\mid D_i=a,\,B=\hat{B}_1\right]=\mathbb{P}_{R\sim\rho}\left[n_\phi(R)+\phi(a)+n_\phi\left(\hat{B}_1\right)=O\right]=\mathbb{P}\left[\mathcal{M}(D)=\sigma_{\hat{B}_1,\hat{B}_2}(O)\mid D_i=a,\,B=\hat{B}_2\right].$$
For linear queries, we use a mapping similar to the one above, which also depends on the weights. Let $K$ be the set of indices of the records present in the partial knowledge; then a linear query is $B$-reducible with:

$$\sigma_{\hat{B}_1,\hat{B}_2}(O)=O-\sum_{j\in K}w_j\,\hat{B}_{1,j}+\sum_{j\in K}w_j\,\hat{B}_{2,j}.$$
This also proves the result for the arithmetic mean.

Similarly, it is easy to verify that the following functions show $B$-reducibility for the geometric, harmonic, and quadratic means (with $n$ the size of the database and $K$ the indices of the known records):

$$\sigma^{\mathrm{geom}}_{\hat{B}_1,\hat{B}_2}(O)=O\cdot\left(\frac{\prod_{j\in K}\hat{B}_{2,j}}{\prod_{j\in K}\hat{B}_{1,j}}\right)^{1/n}
\qquad
\sigma^{\mathrm{harm}}_{\hat{B}_1,\hat{B}_2}(O)=\frac{n}{\frac{n}{O}-\sum_{j\in K}\frac{1}{\hat{B}_{1,j}}+\sum_{j\in K}\frac{1}{\hat{B}_{2,j}}}
\qquad
\sigma^{\mathrm{quad}}_{\hat{B}_1,\hat{B}_2}(O)=\sqrt{O^2-\frac{1}{n}\sum_{j\in K}\hat{B}_{1,j}^2+\frac{1}{n}\sum_{j\in K}\hat{B}_{2,j}^2}.$$
The injectivity of each of these functions is clear. □
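As a sanity check of the geometric-mean case, here is a small Python sketch on a toy database (the size, record values, and helper names are illustrative choices of mine): it verifies that rescaling the output by the ratio of the products of the known records maps one conditional output distribution onto the other.

```python
import itertools
import math

N = 5                  # illustrative toy database size
VALUES = [1, 2, 3]     # records take positive values, uniformly and independently

def geometric_mean(records):
    return math.prod(records) ** (1.0 / len(records))

def output_dist(known, target_value):
    """Distribution of the geometric mean given the known records and the target's value."""
    unknown_count = N - len(known) - 1
    dist = {}
    for rest in itertools.product(VALUES, repeat=unknown_count):
        out = geometric_mean(list(known) + [target_value] + list(rest))
        dist[out] = dist.get(out, 0.0) + 1.0 / len(VALUES) ** unknown_count
    return dist

def same_dist(d1, d2):
    """Compare two finite distributions up to floating-point error."""
    if len(d1) != len(d2):
        return False
    return all(
        math.isclose(o1, o2) and math.isclose(p1, p2)
        for (o1, p1), (o2, p2) in zip(sorted(d1.items()), sorted(d2.items()))
    )

if __name__ == "__main__":
    b1, b2, target_value = (1, 1), (3, 2), 2   # two possible values of the partial knowledge
    # Candidate bijection: rescale the output by the ratio of the products of the known records.
    scale = (math.prod(b2) / math.prod(b1)) ** (1.0 / N)
    mapped = {o * scale: p for o, p in output_dist(b1, target_value).items()}
    print(same_dist(mapped, output_dist(b2, target_value)))   # True
```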

Note that in Proposition 26, it is crucial that the records in the attacker’s partial knowledge are fixed. In general, knowing the records of $k$ users is not equivalent to knowing the records of $k$ other users. Indeed, if these records are generated with different probabilities, the randomness from the unknown records differs between both scenarios, and it is in general impossible to convert one into the other.

Remark 2. Observation 1 in [GK17b] claims that the partial knowledge of $k$ records in a database of size $n$ is the same as no partial knowledge in a database of size $n-k$. For counting queries, this holds for the same reason that counting queries are $B$-reducible: one can “remove” the known records from the mechanism output and obtain a bijection between the cases with and without partial knowledge. Thus, the partial knowledge is irrelevant to the mechanism’s privacy and can be ignored.

This observation, however, does not hold in general: we show later that the counting queries with thresholding from Example 7 are not $B$-reducible, and in this case, the knowledge of $k$ records in a database of size $n$ has a very different effect than no partial knowledge in a database of size $n-k$.
