Lowering the cost of anonymization

4.3.2 Partition selection

Recall the running example of Section 4.2.2.

SELECT 
  browser_agent, 
  COUNT(*) AS visits 
FROM access_logs 
GROUP BY browser_agent

Listing 4.12: Simple histogram query

One of the pitfalls of making such a query differentially private, identified in Section 4.2.2.0, is the choice of which partitions (here, browser agents) are present in the output. In Section 4.2.2.0 and Section 4.2.3.0, we reused an insight from [230] and used Laplace-based thresholding to avoid this pitfall: we count the unique users associated with each partition, add Laplace noise to each count, and keep only the partitions whose noisy counts are above a fixed threshold. The scale of the noise and the threshold value determine $ε$ and $δ$ .

In this section, we explore possible improvements to this partition selection method. We start by discussing prior work in more detail (Section 4.3.2.0) and introducing definitions (Section 4.3.2.0). Then, we present a partition selection mechanism for the case where each user contributes to one partition, prove its optimality (Section 4.3.2.0), and experimentally compare it to existing methods (Section 4.3.2.0). We then discuss possible extensions to cases where each user contributes to multiple partitions as well as implementation considerations (Section 4.3.2.0).

Prior work and contributions

Even though Laplace thresholding was introduced in 2009 in [230], the specific primitive of partition selection did not receive much attention until [179], where the authors call the generic problem differentially private set union. Each user is associated with one or several partitions, and the goal is to release as many partitions as possible while making sure that the output is differentially private.

In [179], the main use case is word and n-gram discovery in Natural Language Processing: data used in training models must not leak private information about individuals. In this context, each user potentially contributes to many elements; the sensitivity of the mechanism can be high. The authors propose two strategies applicable in this context. First, they use a weighted histogram so that if a user contributes to fewer elements than the maximum sensitivity, these elements can add more weight to the histogram count. Second, they introduce policies that determine which elements to add to the histogram depending on which histogram counts are already above the threshold. These strategies obtain significant utility improvements over the simple Laplace-based strategy.

In this work, in contrast to [179], we focus on the low-sensitivity use case: each user contributes to exactly one partition. This different setting is common in data analysis: when the GROUP BY operation partitions the set of users in distinct partitions, each user contributes exactly one element to the set union. Choosing the contributions of each user is therefore not relevant; the only question is to optimize the probability of releasing each element in the final result. For this specific problem, we introduce an optimal approach, which maximizes this probability.

Definitions

Throughout most of this work, we will assume that each user contributes to only one partition; and the goal is to release as many partitions as possible. In that case, each partition can be considered independently, so the problem is simple to model. Each partition has a certain number of users associated with it, and the only question is: with which probability do we release this partition? Thus, a strategy for partition selection is simply a function associating the number of users in a partition with the probability of keeping the partition.

Definition 82 (Partition selection primitive). A partition selection primitive is a function $π : N \to [0, 1]$ such that $π (0) = 0$ . The corresponding partition selection strategy $ρ_{π}$ counts the number $n$ of users in each partition, and releases this partition with probability $π (n)$ .

Formally, we say that a partition selection primitive is $(ε, δ)$ -differentially private if the corresponding partition selection strategy $ρ_{π} : N \to \{drop, keep\}$ , defined by:

ρ_{π} (n) = \begin{cases} keep & \text{with probability } π (n) \\ drop & \text{with probability } 1 - π (n) \end{cases}

is $(ε, δ)$ -differentially private.

Note that partitions associated with no users are not present at all in the input data, so the probability of releasing them must be $0$ : the definition requires $π (0) = 0$ .
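The definition above can be sketched in a few lines of Python. This is only an illustration under the assumption that the input is a list of (user, partition) pairs in which each user appears in a single partition; the names `select_partitions`, `rows`, `pi`, and `rng` are hypothetical and do not come from the original text.

```python
import random
from collections import defaultdict


def select_partitions(rows, pi, rng=None):
    """Apply the strategy rho_pi to an iterable of (user_id, partition) pairs.

    Assumes each user contributes to exactly one partition, as in the text.
    `pi` is any partition selection primitive: a function from the number of
    users n to a probability in [0, 1], with pi(0) == 0.
    """
    rng = rng or random.Random()
    users_per_partition = defaultdict(set)
    for user_id, partition in rows:
        # Sets deduplicate repeated rows from the same user.
        users_per_partition[partition].add(user_id)

    released = set()
    for partition, users in users_per_partition.items():
        n = len(users)
        # rng.random() is uniform on [0, 1), so this keeps the partition
        # with probability pi(n).
        if rng.random() < pi(n):
            released.add(partition)
    return released
```

Partitions with no users never appear in `users_per_partition`, which matches the requirement that $π (0) = 0$ .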

Main result

Let us define an $(ε, δ)$ -DP partition selection primitive $π_{opt}$ and prove that the corresponding partition selection strategy is optimal. In this context, optimal means that it maximizes the probability of releasing a partition with $n$ users, for all $n$ .

Definition 83 (Optimal partition selection primitive). A partition selection primitive $π_{opt}$ is optimal for $(ε, δ)$ -DP if it is $(ε, δ)$ -DP, and if for all $(ε, δ)$ -DP partition selection primitives $π$ and all $n \in N$ :

π (n) \leq π_{opt} (n) .

We introduce our main result, then we prove it in two steps: we first prove that the optimal partition selection primitive can be obtained recursively, then derive the closed-form formula of our main result from the recurrence relation.

Theorem 13 (General solution for $π_{opt}$ ). Let $ε > 0$ and $δ \in (0, 1)$ . Defining:

\begin{matrix}
n_{1} & = 1 + ⌊ \frac{1}{ε} ln (\frac{e^{ε} + 2 δ - 1}{(e^{ε} + 1) δ}) ⌋, \\
n_{2} & = n_{1} + ⌊ \frac{1}{ε} ln (1 + \frac{e^{ε} - 1}{δ} (1 - π_{opt} (n_{1}))) ⌋,
\end{matrix}

and $m = n - n_{1}$ , the partition selection primitive $π_{opt}$ defined by:

π_{opt} (n) = \begin{cases}
\frac{e^{nε} - 1}{e^{ε} - 1} \cdot δ & \text{if } n \leq n_{1} \\
(1 - e^{- mε}) (1 + \frac{δ}{e^{ε} - 1}) + e^{- mε} π_{opt} (n_{1}) & \text{if } n > n_{1} \text{ and } n \leq n_{2} \\
1 & \text{otherwise}
\end{cases}

is optimal for $(ε, δ)$ -DP.

These formulas assume $ε > 0$ and $δ > 0$ . We also cover the special cases where $ε = 0$ or $δ = 0$ .

Theorem 14 (Special cases for $π_{opt}$ ).

1. If $δ = 0$ , partition selection is impossible: the optimal partition selection primitive $π_{opt}$ for $(ε, 0)$ -DP is defined by $π_{opt} (n) = 0$ for all $n$ .
2. If $ε = 0$ , the optimal partition selection primitive $π_{opt}$ for $(0, δ)$ -DP is defined by $π_{opt} (n) = min (1, nδ)$ for all $n$ .
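The closed form of Theorem 13, together with the special cases of Theorem 14, can be evaluated directly. The following Python function is a minimal sketch rather than a reference implementation: the name `pi_opt` and the input validation are assumptions made for illustration, and the result is taken to be $1$ beyond $n_{2}$ .

```python
import math


def pi_opt(n, eps, delta):
    """Sketch of the closed-form probability of keeping a partition with n users."""
    if eps < 0 or not 0 <= delta < 1:
        raise ValueError("expected eps >= 0 and 0 <= delta < 1")
    if n <= 0 or delta == 0:
        return 0.0  # pi(0) = 0, and nothing can be released when delta = 0
    if eps == 0:
        return min(1.0, n * delta)  # special case of Theorem 14

    def first_branch(k):
        # (e^{k eps} - 1) / (e^eps - 1) * delta, computed with expm1 for accuracy
        return math.expm1(k * eps) / math.expm1(eps) * delta

    e = math.exp(eps)
    n1 = 1 + math.floor(math.log((e + 2 * delta - 1) / ((e + 1) * delta)) / eps)
    if n <= n1:
        return min(1.0, first_branch(n))
    p1 = first_branch(n1)
    n2 = n1 + math.floor(math.log1p(math.expm1(eps) / delta * (1 - p1)) / eps)
    if n <= n2:
        m = n - n1
        decay = math.exp(-m * eps)
        return min(1.0, (1 - decay) * (1 + delta / math.expm1(eps)) + decay * p1)
    return 1.0
```

Because $n_{1}$ and $n_{2}$ are obtained with a floor, a floating-point evaluation can be off by one when the argument of the floor is extremely close to an integer; a careful implementation would need to handle that case explicitly.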

The rest of this section is a proof of Theorem 13.

Recursive construction

How do we construct a partition selection primitive $π$ so that the partition is output with the highest possible probability under the constraint that $π$ is $(ε, δ)$ -DP? Using the definition of differential privacy, the following inequalities must hold for all $n \in N$ .

\begin{matrix}
π (n + 1) & \leq e^{ε} π (n) + δ & (4.1) \\
π (n) & \leq e^{ε} π (n + 1) + δ & (4.2) \\
(1 - π (n + 1)) & \leq e^{ε} (1 - π (n)) + δ & (4.3) \\
(1 - π (n)) & \leq e^{ε} (1 - π (n + 1)) + δ . & (4.4)
\end{matrix}

These inequalities are not only necessary, but also sufficient for $π$ to be DP. Thus, the optimal partition selection primitive can be constructed by recurrence, maximizing each value while still satisfying the inequalities above. As we will show, only inequalities (4.1) and (4.4) above need be included in the recurrence relationship. The latter can be rearranged as:

π_{opt} (n + 1) \leq 1 - e^{- ε} (1 - π_{opt} (n) - δ)

which leads to the following recursive formulation for $π_{opt}$ .

Lemma 13 (Recursive solution for $π_{opt}$ ). Given $δ \in [0, 1]$ and $ε \geq 0$ , $π_{opt}$ satisfies the following recurrence relationship: $π_{opt} (0) = 0$ , and for all $n > 0$ :

π_{opt} (n) = min (e^{ε} π_{opt} (n - 1) + δ, 1 - e^{- ε} (1 - π_{opt} (n - 1) - δ), 1) \qquad (4.5)

Proof. Let $π_{0}$ be defined by recurrence as above; we will prove that $π_{0} = π_{opt}$ .

First, let us show that $π_{0}$ is monotonic. Fix $n \in N$ . It suffices to show that each argument of the min function in (4.5), evaluated at $π_{0} (n)$ , is at least $π_{0} (n)$ .

First argument: since $ε \geq 0$ implies $e^{ε} \geq 1$ and $δ \geq 0$ , we have $e^{ε} π_{0} (n) + δ \geq π_{0} (n)$ .

Second argument: we have

\begin{matrix}
1 - e^{- ε} (1 - π_{0} (n) - δ) & = 1 - e^{- ε} (1 - π_{0} (n)) + e^{- ε} δ \\
& \geq 1 - (1 - π_{0} (n)) \\
& = π_{0} (n)
\end{matrix}

using that $1 - π_{0} (n) \geq 0$ since $π_{0} (n) \leq 1$ by (4.5).

Third argument: this is immediate given (4.5) and the fact that $π_{0} (0) = 0$ .

It follows that $π_{0} (n + 1) \geq π_{0} (n)$ .

Because $π_{0}$ is monotonic, it immediately satisfies inequalities (4.2) and (4.3), and inequalities (4.1) and (4.4) are satisfied by definition.

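Since $π_{0}$ satisfies all four inequalities above, it is $(ε, δ)$ -DP. Its optimality follows by induction. Let $π$ be any $(ε, δ)$ -DP partition selection primitive: we have $π (0) = 0 = π_{0} (0)$ , and if $π (n) \leq π_{0} (n)$ , then inequalities (4.1) and (4.4) give $π (n + 1) \leq min (e^{ε} π (n) + δ, 1 - e^{- ε} (1 - π (n) - δ), 1) \leq π_{0} (n + 1)$ , since each argument of the min is nondecreasing in $π (n)$ . Thus, $π_{0}$ is the fastest-growing DP partition selection primitive, and is therefore equal to $π_{opt}$ . □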

The special cases for $π_{opt}$ in Theorem 14 can be immediately derived from Lemma 13: the rest of this section focuses on proving the general form in Theorem 13.
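The recurrence of Lemma 13 and the four inequalities (4.1) to (4.4) are easy to check numerically. The sketch below is illustrative only: the function names and the tolerance `tol` (used to absorb floating-point rounding) are assumptions, not part of the original text.

```python
import math


def pi_opt_recursive(n_max, eps, delta):
    """Return [pi_opt(0), ..., pi_opt(n_max)] using recurrence (4.5)."""
    values = [0.0]
    for _ in range(n_max):
        prev = values[-1]
        values.append(min(
            math.exp(eps) * prev + delta,               # from inequality (4.1)
            1 - math.exp(-eps) * (1 - prev - delta),    # from inequality (4.4)
            1.0,                                        # probabilities are at most 1
        ))
    return values


def satisfies_dp(values, eps, delta, tol=1e-12):
    """Check inequalities (4.1)-(4.4) on every pair of consecutive values."""
    e = math.exp(eps)
    for a, b in zip(values, values[1:]):  # a = pi(n), b = pi(n + 1)
        ok = (
            b <= e * a + delta + tol
            and a <= e * b + delta + tol
            and 1 - b <= e * (1 - a) + delta + tol
            and 1 - a <= e * (1 - b) + delta + tol
        )
        if not ok:
            return False
    return True
```

Setting `delta = 0` yields a sequence that stays at $0$ , and setting `eps = 0` yields $min (1, nδ)$ up to rounding, in line with the special cases of Theorem 14.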

Derivation of the closed-form solution

Let us now show that the closed-form solution of Theorem 13 can be derived from the recursive solution in Lemma 13. First, we show that there is a crossover point $n_{1}$ , below which only the first term of the recurrence relation matters, and after which only the second term matters (until $π_{opt} (n)$ reaches $1$ ).

Lemma 14. Assume $ε > 0$ and $δ > 0$ . There are crossover points $n_{1}, n_{2} \in N$ such that $0 < n_{1} \leq n_{2}$ and:

π_{opt} (n) = \begin{cases}
0 & \text{if } n = 0 \\
π_{opt} (n - 1) e^{ε} + δ & \text{if } n > 0 \text{ and } n \leq n_{1} \\
1 - e^{- ε} (1 - π_{opt} (n - 1) - δ) & \text{if } n > n_{1} \text{ and } n \leq n_{2} \\
1 & \text{otherwise}
\end{cases} \qquad (4.6)

Proof. We consider the arguments in the min statement in (4.5), substituting $x$ for $π_{opt} (n - 1)$ :

\begin{matrix}
α_{1} (x) & = e^{ε} x + δ \\
α_{2} (x) & = 1 - e^{- ε} (1 - x - δ) \\
α_{3} (x) & = 1
\end{matrix}

This substitution allows us to work directly in the space of probabilities instead of restricting ourselves to the sequence ${(π_{opt} (n))}_{n = 0}^{\infty}$ . Taking the first derivative of these functions yields:

\begin{matrix}
α_{1}^{'} (x) & = e^{ε} \\
α_{2}^{'} (x) & = e^{- ε} \\
α_{3}^{'} (x) & = 0
\end{matrix}

Since the derivative of $α_{1} (x) - α_{2} (x)$ is $e^{ε} - e^{- ε} > 0$ , there exists at most one crossover point $x_{1}$ such that $α_{1} (x) < α_{2} (x)$ for all $x < x_{1}$ , $α_{2} (x_{1}) = α_{1} (x_{1})$ , and $α_{1} (x) > α_{2} (x)$ for all $x > x_{1}$ . Setting $α_{1} (x) = α_{2} (x)$ and solving for $x$ yields:

e^{ε} x + δ = 1 - e^{- ε} (1 - x - δ)

which leads to:

e^{ε} x - e^{- ε} x = 1 - δ - e^{- ε} (1 - δ)

and finally:

x_{1} = (1 - δ) \cdot \frac{1 - e^{- ε}}{e^{ε} - e^{- ε}} .

Since the derivative of $α_{2} (x) - α_{3} (x)$ is $e^{- ε} > 0$ , there exists at most one crossover point $x_{2}$ such that $α_{2} (x) < α_{3} (x)$ for all $x < x_{2}$ , $α_{2} (x_{2}) = α_{3} (x_{2})$ , and $α_{2} (x) > α_{3} (x)$ for all $x > x_{2}$ . Setting $α_{2} (x) = α_{3} (x)$ and solving for $x$ yields:

x_{2} = 1 - δ .

From the formulas for $x_{1}$ and $x_{2}$ , it is immediate that $0 < x_{1} < x_{2} < 1$ . As such, the interval $[0, 1]$ can be divided into three non-empty intervals:

1. On $[0, x_{1}]$ , $α_{1} (x)$ is the active argument of $min (α_{1} (x), α_{2} (x), α_{3} (x))$ .
2. On $[x_{1}, x_{2}]$ , $α_{2} (x)$ is the active argument of $min (α_{1} (x), α_{2} (x), α_{3} (x))$ .
3. On $[x_{2}, 1]$ , $α_{3} (x)$ is the active argument of $min (α_{1} (x), α_{2} (x), α_{3} (x))$ .

The existence of the crossover points is not enough to prove the lemma: we must also show that these points are reached in a finite number of steps. For all $n \geq 1$ such that $π_{opt} (n) \neq 1$ , we have:

\begin{matrix}
π_{opt} (n) - π_{opt} (n - 1) & = min (e^{ε} π_{opt} (n - 1) + δ, 1 - e^{- ε} (1 - π_{opt} (n - 1) - δ)) - π_{opt} (n - 1) \\
& \geq min (δ, (1 - e^{- ε}) (1 - π_{opt} (n - 1)) + e^{- ε} δ) \\
& \geq e^{- ε} δ .
\end{matrix}

Since $π_{opt} (n) - π_{opt} (n - 1)$ is bounded from below by a strictly positive constant $e^{- ε} δ$ , the sequence achieves the maximal probability 1 for finite $n$ . □
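As a small worked example of the three regimes, consider the illustrative values $e^{ε} = 2$ (that is, $ε = ln 2$ ) and $δ = 0.1$ . These values are chosen only because they make the arithmetic exact; they are not taken from the experiments in this section.

```latex
\begin{aligned}
x_{1} &= (1 - δ) \cdot \frac{1 - e^{-ε}}{e^{ε} - e^{-ε}} = 0.9 \cdot \frac{0.5}{1.5} = 0.3,
\qquad x_{2} = 1 - δ = 0.9, \\
π_{opt}(1) &= α_{1}(0) = 0.1, \\
π_{opt}(2) &= α_{1}(0.1) = 0.3, \\
π_{opt}(3) &= α_{1}(0.3) = α_{2}(0.3) = 0.7, \\
π_{opt}(4) &= α_{2}(0.7) = 0.9, \\
π_{opt}(5) &= α_{2}(0.9) = α_{3}(0.9) = 1.
\end{aligned}
```

In this example, the first argument is active up to $n = 3$ , the second from there until the value $1$ is reached at $n = 5$ , which agrees with the values $n_{1} = 3$ and $n_{2} = 5$ given by the formulas of Theorem 13.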

This allows us to derive the closed-form solution for $n \leq n_{1}$ and for $n_{1} < n \leq n_{2}$ stated in Theorem 13.

Lemma 15. Assume $ε > 0$ and $δ > 0$ . If $n \leq n_{1}$ , then $π_{opt} (n) = \frac{e^{nε} - 1}{e^{ε} - 1} \cdot δ$ . If $n_{1} \leq n \leq n_{2}$ , then denoting $m = n - n_{1}$ :

π_{opt} (n) = (1 - e^{- mε}) (1 + \frac{δ}{e^{ε} - 1}) + e^{- mε} π_{opt} (n_{1}) .

Proof. For $n \leq n_{1}$ , expanding the recurrence relation yields:

\begin{matrix}
π_{opt} (n) & = π_{opt} (n - 1) e^{ε} + δ \\
& = δ \sum_{k = 0}^{n - 1} e^{kε} \\
& = \frac{e^{nε} - 1}{e^{ε} - 1} \cdot δ .
\end{matrix}

For $n_{1} \leq n \leq n_{2}$ , denoting $m = n - n_{1}$ , expanding the recurrence relation yields:

\begin{matrix}
π_{opt} (n) & = 1 - e^{- ε} (1 - π_{opt} (n - 1) - δ) \\
& = (1 - e^{- ε} + δ e^{- ε}) \sum_{k = 0}^{m - 1} e^{- kε} + e^{- mε} π_{opt} (n_{1}) \\
& = (1 - e^{- ε} + δ e^{- ε}) \frac{1 - e^{- mε}}{1 - e^{- ε}} + e^{- mε} π_{opt} (n_{1}) \\
& = (1 - e^{- mε}) (1 + \frac{δ}{e^{ε} - 1}) + e^{- mε} π_{opt} (n_{1}) .
\end{matrix}

□

We can now find a closed-form solution for $n_{1}$ and for $n_{2}$ .

Lemma 16. The first crossover point $n_{1}$ is:

n_{1} = 1 + ⌊ \frac{1}{ε} ln (\frac{e^{ε} + 2 δ - 1}{δ (e^{ε} + 1)}) ⌋

Proof. Using the formula for $x_{1}$ in the proof of Lemma 14, we see that $π_{opt} (n - 1) \leq x_{1}$ whenever:

\frac{e^{(n - 1) ε} - 1}{e^{ε} - 1} \cdot δ \leq \frac{1 - δ}{e^{ε} + 1} .

Rearranging terms, we can rewrite this inequality as:

\begin{matrix}
n & \leq 1 + \frac{1}{ε} ln [\frac{(1 - δ) (e^{ε} - 1)}{δ (e^{ε} + 1)} + 1] \\
& = 1 + \frac{1}{ε} ln [\frac{(1 - δ) (e^{ε} - 1) + δ (e^{ε} + 1)}{δ (e^{ε} + 1)}] \\
& = 1 + \frac{1}{ε} ln [\frac{e^{ε} + 2 δ - 1}{δ (e^{ε} + 1)}] .
\end{matrix}

Since $n$ is an integer, the supremum value defining $n_{1}$ is achieved by taking the floor of the right-hand side of this inequality, which concludes the proof. □

Lemma 17. The second crossover point $n_{2}$ is:

n_{2} = n_{1} + ⌊ \frac{1}{ε} ln (1 + \frac{e^{ε} - 1}{δ} (1 - π_{opt} (n_{1}))) ⌋

Proof. We want to find the maximal $m$ such that:

(1 - e^{- mε}) (1 + \frac{δ}{e^{ε} - 1}) + e^{- mε} π_{opt} (n_{1}) \leq 1 .

We can rewrite this condition into:

- e^{- mε} (1 + \frac{δ}{e^{ε} - 1} - π_{opt} (n_{1})) \leq \frac{- δ}{e^{ε} - 1}

which leads to:

\begin{matrix}
e^{mε} & \leq \frac{e^{ε} - 1}{δ} (1 + \frac{δ}{e^{ε} - 1} - π_{opt} (n_{1})) \\
& = 1 + \frac{e^{ε} - 1}{δ} (1 - π_{opt} (n_{1}))
\end{matrix}

and finally:

m \leq \frac{1}{ε} ln (1 + \frac{e^{ε} - 1}{δ} (1 - π_{opt} (n_{1})))

Since $m$ must be an integer, we take the floor of the right-hand side of this inequality to obtain the result. □
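The closed forms of Lemmas 16 and 17 can be sanity-checked against the recurrence (4.5). The sketch below reads the crossover points off the iterated sequence, using the characterization from the proofs: step $n$ uses the first argument as long as $π_{opt} (n - 1) \leq x_{1}$ , and the second one stays at most $1$ as long as $π_{opt} (n - 1) \leq x_{2}$ . Function names are hypothetical, and the code assumes $ε > 0$ and $0 < δ < 1$ .

```python
import math


def crossover_closed_form(eps, delta):
    """n_1 and n_2 from Lemmas 16 and 17 (assumes eps > 0 and 0 < delta < 1)."""
    e = math.exp(eps)
    n1 = 1 + math.floor(math.log((e + 2 * delta - 1) / (delta * (e + 1))) / eps)
    p1 = math.expm1(n1 * eps) / math.expm1(eps) * delta  # pi_opt(n_1)
    n2 = n1 + math.floor(math.log1p(math.expm1(eps) / delta * (1 - p1)) / eps)
    return n1, n2


def crossover_by_iteration(eps, delta):
    """Recover n_1 and n_2 by iterating recurrence (4.5) until it reaches 1."""
    x1 = (1 - delta) / (math.exp(eps) + 1)  # equals the formula for x_1
    x2 = 1 - delta
    values = [0.0]
    while values[-1] < 1.0:
        p = values[-1]
        values.append(min(
            math.exp(eps) * p + delta,
            1 - math.exp(-eps) * (1 - p - delta),
            1.0,
        ))
    steps = range(1, len(values))
    n1 = max(n for n in steps if values[n - 1] <= x1)
    n2 = max(n for n in steps if values[n - 1] <= x2)
    return n1, n2
```

For parameters where the argument of a floor is very close to an integer, floating-point rounding may make the two functions disagree by one; the comparison is meant as a sanity check, not as a proof.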

Numerical validation

Theorem 13 shows that the optimal partition selection primitive $π_{opt}$ outperforms all other options. How does it compare with our previous strategy of adding Laplace noise and thresholding the result, described in Section 4.2.3.0? For simplicity, we recall this strategy in the simpler case where the sensitivity is one.

Definition 84 (Laplace-based partition selection [230]). We denote by $Lap (b)$ a random variable sampled from a Laplace distribution of mean $0$ and of scale $b$ . The following partition selection strategy $ρ_{Lap}$ , called Laplace-based partition selection, is $(ε, δ)$ -differentially private:

ρ_{Lap} (n) = \begin{cases} drop & \text{if } n + Lap (\frac{1}{ε}) < 1 - \frac{ln (2 δ)}{ε} \\ keep & \text{otherwise} \end{cases}

We denote by $π_{Lap}$ the corresponding partition selection primitive:

π_{Lap} (n) = P [ρ_{Lap} (n) = keep] .
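The probability $π_{Lap} (n)$ has a simple expression in terms of the tail of the Laplace distribution. The sketch below assumes the standard survival function of a centered Laplace variable of scale $b$ , namely $\frac{1}{2} e^{- x / b}$ for $x \geq 0$ and $1 - \frac{1}{2} e^{x / b}$ otherwise; the name `pi_lap` is hypothetical, and the value for $n = 0$ is set to $0$ because empty partitions never appear in the data.

```python
import math


def pi_lap(n, eps, delta):
    """Probability that Laplace-based partition selection keeps a partition
    with n users (assumes eps > 0 and delta > 0)."""
    if n <= 0:
        return 0.0  # partitions without users are absent from the input
    threshold = 1 - math.log(2 * delta) / eps
    x = threshold - n  # the partition is kept iff Lap(1/eps) >= x
    if x >= 0:
        return 0.5 * math.exp(-eps * x)
    return 1 - 0.5 * math.exp(eps * x)
```

With this threshold, a partition with a single user is kept with probability exactly $δ$ , and a partition whose user count equals the threshold is kept with probability $1 / 2$ .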

As expected, using the optimal partition selection primitive translates to a larger probability of releasing a partition with the same number of users. As shown in Figure 4.18, the difference is especially large in the high-privacy regime.

[Figure 4.18 (two panels): probability ℙ[ρ(n) = keep] of releasing a partition as a function of the number of users n, for π_opt and π_Lap.]

To better understand the dependency on $ε$ and $δ$ , we also compare the midpoint obtained for both partition selection strategies $ρ$ : the number $n$ for which the probability of releasing a partition with $n$ users is $0.5$ . For Laplace-based partition selection, this $n$ is simply the threshold. As Figure 4.19 shows, the gains are especially substantial when $ε$ is small, and not significant for $ε > 1$ . Figure 4.20 shows the dependency on $δ$ : for a fixed $ε$ , there is a constant interval between the midpoints of both strategies. Thus, the relative gains are larger for a larger $δ$ , since the midpoint is also smaller.
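The midpoint comparison can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the midpoint of the optimal primitive is taken to be the smallest integer $n$ with $π_{opt} (n) \geq 0.5$ , found by iterating recurrence (4.5), while the Laplace midpoint is the real-valued threshold of Definition 84. Function names are hypothetical.

```python
import math


def midpoint_opt(eps, delta):
    """Smallest n such that pi_opt(n) >= 1/2 (assumes eps >= 0, 0 < delta <= 1)."""
    if eps < 0 or not 0 < delta <= 1:
        raise ValueError("expected eps >= 0 and 0 < delta <= 1")
    p, n = 0.0, 0
    while p < 0.5:
        # One step of recurrence (4.5).
        p = min(
            math.exp(eps) * p + delta,
            1 - math.exp(-eps) * (1 - p - delta),
            1.0,
        )
        n += 1
    return n


def midpoint_lap(eps, delta):
    """Threshold of Laplace-based partition selection (assumes eps > 0)."""
    return 1 - math.log(2 * delta) / eps
```

For instance, `midpoint_opt(0.1, 1e-5)` can be compared with `midpoint_lap(0.1, 1e-5)` to see the gap at small $ε$ , and the same comparison at `eps = 1.0` shows a much smaller difference.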

[Figure 4.19 (two panels): midpoint n such that ℙ[ρ(n) = keep] = 1∕2, as a function of ε, for π_opt and π_Lap.]

[Figure 4.20 (two panels): midpoint n such that ℙ[ρ(n) = keep] = 1∕2, as a function of δ, for π_opt and π_Lap.]

Discussion

The approach presented here is both easy to implement and efficient. The random decision for a given partition takes a constant time to compute, thanks to the closed-form formula in Theorem 13. Counting the number of unique users per partition can be done in one pass over the data and is massively parallelizable. Furthermore, since there is a relatively small value $N$ such that the probability of keeping a partition with $n \geq N$ users is 1, the counting process can be interrupted as soon as a partition reaches $N$ users. This keeps memory usage low (in $O (N)$ ) without requiring approximate count-distinct algorithms like HyperLogLog, for which a more complex sensitivity analysis would be needed.
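The early-stopping idea can be sketched as follows. The code is illustrative only: it assumes the data is available as (user, partition) pairs, computes the saturation point $N$ by iterating recurrence (4.5) rather than through the closed form, and uses hypothetical names throughout.

```python
import math
from collections import defaultdict


def saturation_point(eps, delta):
    """Smallest N such that pi_opt(N) == 1 (assumes eps >= 0 and 0 < delta <= 1)."""
    if eps < 0 or not 0 < delta <= 1:
        raise ValueError("expected eps >= 0 and 0 < delta <= 1")
    p, n = 0.0, 0
    while p < 1.0:
        p = min(
            math.exp(eps) * p + delta,
            1 - math.exp(-eps) * (1 - p - delta),
            1.0,
        )
        n += 1
    return n


def capped_user_counts(rows, cap):
    """Count unique users per partition, storing at most `cap` users each."""
    seen = defaultdict(set)
    for user_id, partition in rows:
        users = seen[partition]
        if len(users) < cap:  # once the cap is reached, the outcome is decided
            users.add(user_id)
    return {partition: len(users) for partition, users in seen.items()}
```

A partition whose capped count reaches `saturation_point(eps, delta)` is released with probability $1$ , so the exact number of users beyond that point is never needed.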

Our approach could, in principle, be extended to cases where each user can contribute to $Δ > 1$ partitions. Following the intuition of Lemma 13, we could list a set of recursive equations defining $π_{opt} (n)$ as a function of $π_{opt} (k)$ for $k < n$ . However, this recursive formulation is much more complex when $Δ$ is large, for multiple reasons.

1. With $Δ$ partitions to consider at the same time, the set of possible outcomes has cardinality $2^{Δ}$ : $keep$ or $drop$ for each partition.
2. To compute $π_{opt} (n)$ by recurrence, we must consider a large set of possible neighboring databases: each of the other $Δ - 1$ partitions to which the user contributes can have anywhere from $0$ to $n - 1$ users. Separately considering all those $n^{Δ - 1}$ possibilities quickly becomes intractable.
3. For each of these possibilities, the probability can be expressed as a polynomial of degree $Δ$ in the values of $π_{opt} (k)$ , for $k < n$ . Solving these also becomes intractable as $Δ$ increases.

The above approach might be workable for $Δ = 2$ or even $Δ = 3$ , but the complexity cost is likely too high to be worth implementing. Furthermore, the recurrence-based proof of optimality of $π_{opt}$ only holds assuming that each user contributes to exactly $Δ$ partitions in the original dataset, so that strategies based on selecting which partitions to contribute to, or weighting the partitions of each user if there are fewer than $Δ$ , cannot bring additional benefits. This case is relatively frequent for $Δ = 1$ , but rarely happens for larger values of $Δ$ .

Thus, this work leaves two obvious open questions. Is it possible to overcome the problems described above, and extend our optimal approach to larger sensitivities in a simple and efficient manner? Furthermore, is it possible to combine this primitive with existing approaches to differentially private set union [179], like weighted histograms or policy-based strategies?

In the meantime, can we simply use this primitive for the low-sensitivity use case, and adopt the approach from [179] when each user can contribute to multiple partitions? If scalability is a hard requirement, then the answer is not straightforward.

The policy-based approaches described in [179] process the values of each user one after the other, and compare them with the histogram built from all previous users. This sequential processing prevents us from implementing the algorithm in a massively parallel fashion, and the full histogram must fit in memory. Both are significant obstacles to scalability; designing alternative algorithms with better scalability properties is left as an open question.
However, part of their core insights can still be used. For most values of $ε$ and $δ$ , Gaussian noise gives better results for $Δ > 3$ than naively splitting the budget across contributions and using our partition selection primitive, as shown in Figure 4.21. Furthermore, for users who contribute fewer values than $Δ$ , giving these contributions more weight in the histogram is also possible.

[Figure 4.21: Comparison of our method with Laplace-based and Gaussian-based thresholding, with $ε = 1.1$ and $δ = 10^{- 5}$ ; the plot shows the midpoint n such that ℙ[ρ(n) = keep] = 1∕2 for π_opt, π_Lap, and Gauss. For our method and Laplace-based thresholding, we split the privacy budget in $Δ$ equal parts. For Gaussian-based thresholding, we use the formula in Theorem 16 (Section 4.3.3.0) to add noise, and split $δ$ between noise addition and thresholding in a way that minimizes the threshold.]
