Lowering the cost of anonymization

3.2.3 Application to $k$ -anonymity

Sections 3.2.1 and 3.2.2 formalize two intuitive phenomena under partial knowledge. First, if the attacker has significant enough uncertainty about enough people, counting queries do not leak too much information about individuals. Second, for counting queries that apply to rare enough behavior, thresholding provides meaningful protection against a passive attacker. This suggests a link to an older anonymization notion: $k$ -anonymity. In this section, we formalize that link, and combine these two intuitions to provide a relation between $k$ -anonymity and differential privacy under partial knowledge.

$k$ -anonymity, which we introduce in Section 2.1.1, requires each record in a database to be indistinguishable from at least $k - 1$ other records. The intuition is that blending into a large enough crowd provides protection; this intuition is close to the results of Section 3.2.1. $k$ -anonymity is generally obtained by generalizing the data to group similar records together, then dropping the groups with fewer than $k$ records. This last step is a thresholding operation, hence the link with the results of Section 3.2.2.

To formalize it, we need to clarify the notion of a $k$ -anonymity mechanism. For simplicity, we assume that such a mechanism groups records by their value and returns a truncated histogram, from which all values with fewer than $k$ records have been removed.

Definition 61 ( $k$ -anonymity mechanism). The $k$ -anonymity mechanism $M_{k}$ takes a dataset $D \in \mathcal{D}$ as input, and returns a histogram in ${(\mathbb{N} \cup \{⊥\})}^{T}$ . For each $t \in T \cup \{⊥\}$ , $M_{k} (D) (t)$ is defined as:

for all $t \in T$ , $M_{k} (D) (t) = | \{i \mid D (i) = t\} |$ if this number is at least $k$ ;
$M_{k} (D) (t) = ⊥$ otherwise (if there are fewer than $k$ records with value $t$ in $D$ ).

If an input record is not in $T$ , it is ignored by $M_{k}$ .
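As an illustration of Definition 61, the mechanism can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `k_anonymity_mechanism`, the use of `None` to stand for the output $⊥$ , and the representation of the dataset as a plain list are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Optional, Set


def k_anonymity_mechanism(
    dataset: Iterable[Hashable], support: Set[Hashable], k: int
) -> Dict[Hashable, Optional[int]]:
    """Truncated histogram over `support`; None plays the role of the symbol ⊥."""
    # Records whose value is not in the support are ignored.
    counts = Counter(record for record in dataset if record in support)
    # A count is released only if it is at least k; otherwise it is suppressed.
    return {t: (counts[t] if counts[t] >= k else None) for t in support}


example = k_anonymity_mechanism(["a", "a", "a", "b", "z"], {"a", "b", "c"}, k=2)
```

In this example, the value `"a"` appears three times and is released, `"b"` appears once and is suppressed, `"c"` never appears and is suppressed, and the record `"z"` is ignored because it is outside the support.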

Note that we skipped the generalization step. The results below can be easily extended to any fixed generalization strategy, i.e. a fixed mapping between $T$ and an arbitrary space forming the support of the histogram. It is important that this strategy is fixed. If this function depends on the data, arbitrary correlations can be embedded in the output, which might leak additional information; minimality attacks [389] provide an example of this phenomenon.

Now, under which conditions is such a mechanism private? The distribution that captures the attacker’s uncertainty must be such that for all possible values $t \in T$ , either this value is rare enough to be thresholded with high probability, or there is sufficient randomness in the input data that releasing the exact count does not leak too much information.

In addition, we assume that a given record can take the value $⊥$ , representing its absence from the dataset. The count corresponding to $⊥$ is never released. We discuss later the importance of such a special value, and its practical interpretation.

Theorem 5. Let $θ$ be a distribution that generates $n$ independent records in $T \cup \{⊥\}$ . Assume that there is a $λ$ such that for all $t \in T$ :

either for all indices $i$ , $P [D (i) = t] \leq λ$ ,
or for all indices $i$ , $λ \leq P [D (i) = t] \leq 1 - λ$ ;

furthermore, assume that for all indices $i$ , $λ \leq P [D (i) = ⊥] \leq 1 - λ$ , and that the attacker does not have any background knowledge.

Let $T$ be a threshold such that $r = \frac{λ (n - 1)}{(1 - λ) k} < 1$ . Then $M_{T}$ is $(\{θ\}, ε, δ)$ -APKDP for all $δ \geq δ_{0}$ , where:

$$\begin{aligned} δ_{0} &= \frac{2 \cdot f (T, n - 1, λ)}{1 - r} \\ ε &= 2 \cdot \max \left(- \ln \left(1 - \frac{f (T, n - 1, λ)}{1 - r}\right), ε_{c}\right) \end{aligned}$$

and $ε_{c}$ is such that $δ \geq P [\frac{X}{Y} \geq ε_{c}]$ , where $X$ and $Y$ are two independent random variables sampled from a binomial distribution with $n - 1$ trials and success probability $2 λ$ .
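To make the hypotheses of Theorem 5 more concrete, the following sketch checks the dichotomy on $λ$ and computes the ratio $r$ for a given attacker distribution. It rests on several assumptions made for illustration only: the distribution $θ$ is represented as one dictionary of probabilities per record, `None` stands for $⊥$ , and the function names are hypothetical. The threshold is passed as `threshold` and used in the denominator of $r$ . The function $f$ comes from Section 3.2.2 and is not reproduced here, so $δ_{0}$ and $ε$ are not computed.

```python
from typing import Dict, Hashable, List, Set

BOTTOM = None  # stands for the special value ⊥ (assumption of this sketch)


def satisfies_hypotheses(
    probs: List[Dict[Hashable, float]], support: Set[Hashable], lam: float
) -> bool:
    """Check the 'rare for everyone or uncertain for everyone' condition."""
    for t in support:
        p_t = [p.get(t, 0.0) for p in probs]
        rare = all(p <= lam for p in p_t)
        uncertain = all(lam <= p <= 1 - lam for p in p_t)
        if not (rare or uncertain):
            return False
    # The absence value must itself be uncertain for every record.
    return all(lam <= p.get(BOTTOM, 0.0) <= 1 - lam for p in probs)


def ratio_r(lam: float, n: int, threshold: int) -> float:
    """r = lam * (n - 1) / ((1 - lam) * threshold); the theorem needs r < 1."""
    return lam * (n - 1) / ((1 - lam) * threshold)


record = {"a": 0.01, "b": 0.5, BOTTOM: 0.49}
theta = [dict(record) for _ in range(100)]
ok = satisfies_hypotheses(theta, {"a", "b"}, lam=0.05)
r = ratio_r(lam=0.05, n=100, threshold=20)
```

Here the value `"a"` is rare for every record, `"b"` and $⊥$ are uncertain for every record, so the hypotheses hold, and $r \approx 0.26 < 1$ . If a single record had probability $0.2$ of taking the value `"a"` while the others kept $0.01$ , neither branch of the dichotomy would hold for `"a"` .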

Proof. For a given index $i$ and a possible record $t \in T$ , we compare the events $D (i) = t$ and $D (i) = ⊥$ . If we find $ε$ and $δ$ such that $M {(D)}_{| D (i) = t} \approx_{ε, δ} M {(D)}_{| D (i) = ⊥}$ for all $t \in T$ , then we have $M {(D)}_{| D (i) = t} \approx_{2 ε, 2 δ} M {(D)}_{| D (i) = t^{'}}$ for all $t, t^{'} \in T$ , which concludes the proof.

There are two options that we must consider: either for all indices $i$ , $P [D (i) = t] \leq λ$ , or for all indices $i$ , $λ \leq P [D (i) = t] \leq 1 - λ$ .

In the first case, we can reuse the analysis of Section 3.2.2: with probability $1 - δ_{0}$ , where $δ_{0} = \frac{f (T, n - 1, λ)}{1 - r}$ , the result is thresholded, and the corresponding privacy loss is bounded by $ε_{0} = - \ln (1 - \frac{f (T, n - 1, λ)}{1 - r})$ , following the same reasoning as in the proof of Theorem 4. Importantly, in this case, comparing $D (i) = t$ and $D (i) = ⊥$ allows us to restrict our analysis to the value of $M_{T} (D) (t)$ : the distributions of values of $M_{T} (D) (t^{'})$ for all $t^{'} \neq t$ are the same when $θ$ is conditioned on $D (i) = t$ or on $D (i) = ⊥$ .

In the second case, we reuse the analysis of Section 3.2.1: the distribution of $M_{T} (D) (t)$ can be seen as the sum of records, each of which has been randomized using a binary randomized response with parameter $2 λ$ . Since $M_{T} (D) (⊥)$ also follows this binary randomized response process, we can directly apply the proof of Theorem 3 with $δ_{0}$ .

Combining both cases directly leads to the desired result, using the indistinguishability property between $M_{T} (D) (t)$ and $M_{T} (D) (⊥)$ to get one between arbitrary $t$ and $t^{'}$ . □

Theorem 5 is relatively complex, and depends on a number of conditions. Let us discuss its limitations. Some of them are necessary for the result to be true, others could be overcome with a more careful analysis, at the cost of simplicity.

First, we assume that the attacker has no partial knowledge over the data. The result can easily be extended to the case where the attacker has passive partial knowledge of $m$ records: for the counting case, we can simply remove these $m$ records and obtain the results with $n - m$ instead of $n$ , and for the thresholding case, we can apply Theorem 4 directly. The discussion of Theorem 4 shows that the result cannot be easily extended to the case where the attacker has the ability to influence the data, unless only a very small number of records can be influenced (as in Proposition 28). This captures the correct intuition that $k$ -anonymity is vulnerable to active attackers.

Second, the choice of distribution $θ$ might seem artificial, carefully made so that the previous results can be applied. Why would there be a value $λ$ such that all records have a probability either lower than $λ$ or larger than $λ$ of being in a fixed category? The first option is reasonable: many real-life distributions are long-tailed; some types of actions, or characteristics, are simply very rare. The second option is less natural: a characteristic that is common for many people may be extremely rare for others, so requiring all records to have a high enough probability of taking this value seems too restrictive. However, note that this high probability captures the attacker’s uncertainty: if the attacker knows that some records have a particularly low probability of taking a certain value, it is possible to over-approximate this knowledge, and simply consider these records as known by the attacker. We can then use the previous point to still get an upper bound on the attacker’s information gain.

Third, what is the meaning of the $⊥$ special case, and is it necessary for the proof of Theorem 5 to work? We use it to prove the desired indistinguishability property in the second case of the proof. Without it, it turns out that subtle problems can arise. Suppose, for example, that $T = \{a, b, c\}$ , and that for all $i$ , $P [D (i) = a]$ is infinitesimally small, while $P [D (i) = b]$ and $P [D (i) = c]$ are both close to $0.5$ . If the total number of records is fixed (and implicitly assumed to be known by the attacker), note that thresholding the count for $a$ is pointless: with high probability, we can retrieve it by computing the difference between $n$ and the counts for $b$ and $c$ . This phenomenon is a real vulnerability of $k$ -anonymity when the total number of participants is known: any result showing that $k$ -anonymity protects privacy under partial knowledge must find a way of guaranteeing that this does not happen.
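The differencing attack described above is easy to reproduce. The sketch below is only an illustration under stated assumptions: the helper names `release` and `recover_suppressed` are hypothetical, `None` stands for a suppressed count, every record takes a value in the support (there is no $⊥$ category), and the dataset is a fixed toy example rather than a sample from $θ$ .

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional, Set, Tuple


def release(
    dataset: List[Hashable], support: Set[Hashable], k: int
) -> Dict[Hashable, Optional[int]]:
    """Thresholded histogram: counts below k are replaced by None."""
    counts = Counter(dataset)
    return {t: (counts[t] if counts[t] >= k else None) for t in support}


def recover_suppressed(
    released: Dict[Hashable, Optional[int]], n: int
) -> Optional[Tuple[Hashable, int]]:
    """If exactly one count is suppressed and n is known, recover it exactly."""
    hidden = [t for t, count in released.items() if count is None]
    if len(hidden) != 1:
        return None  # the attack as sketched needs a single unknown
    visible_total = sum(c for c in released.values() if c is not None)
    return hidden[0], n - visible_total


# Toy dataset: "a" is very rare, "b" and "c" each cover about half the records.
dataset = ["a"] * 2 + ["b"] * 51 + ["c"] * 47
released = release(dataset, {"a", "b", "c"}, k=5)
attack = recover_suppressed(released, n=len(dataset))
```

The count for `"a"` is suppressed in `released` , yet `attack` returns it exactly, since $100 - 51 - 47 = 2$ . The attack relies entirely on the attacker knowing $n$ and on a single count being suppressed.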

Creating an artificial category $⊥$ whose count is never released solves this problem, assuming that this category has sufficient uncertainty. This hides the total number of participants and mitigates this vulnerability. Another way would be to require that the distribution $θ$ has multiple $t \in T$ whose counts will likely be thresholded, and that these $t$ together have enough uncertainty to hide the total count. This is also realistic in practice, given that most distributions are long-tailed, but would likely require a more complex analysis and complicate the theorem statement.

Note that a link between $k$ -anonymity and differential privacy was already introduced in [255]. We use the same notion of a $k$ -anonymity mechanism, but we model the attacker’s partial knowledge differently. In [255], the attacker is assumed to know the value of every single record from the original dataset, but not which records have been randomly sampled from it. Arguably, the only way to satisfy that assumption in practice is to have the mechanism actually sample the data before applying $k$ -anonymity. In that case, the original differential privacy definition is satisfied. By contrast, our setting assumes an attacker that has some uncertainty about the values of the records themselves; we argue that this is a much more natural way of capturing the assumption that the attacker has partial knowledge over the data.

How strong are the privacy parameters provided by Theorem 5 for realistic use cases? First, note that as shown by Figure 3.4, whenever the $δ$ parameter from the thresholding operation is reasonably low, we have $ε \approx δ$ , which is much lower than the $ε$ values from the results on the privacy of noiseless aggregations (Section 3.2.1). Thus, to use Theorem 5 in practice, one would presumably need to:

1. first, fix the target $ε$ and $δ$ that we want to obtain;
2. these privacy parameters give a range of acceptable values of $λ$ , according to Theorem 3; we then select one such value that will serve as a boundary between “rare events” and “common events”;
3. finally, calculate the threshold $T$ based on $λ$ and $δ$ , according to Theorem 4.

Admittedly, the above process is not trivial to apply, and the constraints on $λ$ make our second point above even more salient: this choice is brittle. All in all, our result is an interesting link between two important notions, and formalizes a natural intuition about the inherent privacy of simple aggregation and thresholding mechanisms, but is probably not suited for practical applications. Can it be significantly improved or extended, without adding more brittle assumptions? We leave this as an open question.
