4.3.1  Beyond counts and sums

Differentially private counts and sums are enough to solve a large class of use cases requiring anonymized data aggregation. However, analysts sometimes need more complex aggregations, and some of them raise interesting questions. These questions fall into two categories: how to fit these aggregations in the two-step aggregation model of our framework (per-user aggregation first, cross-user aggregation second) in a usable fashion, and how to implement the related primitives in a scalable fashion.
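
To make this two-step model concrete, the following sketch shows what the corresponding interface could look like in Python. The names and types are purely illustrative, not taken from an actual implementation; the key point is that the per-user step sees all the values of a single user, while only its output reaches the differentially private cross-user step.

from typing import Callable, Dict, List, TypeVar

PerUserAggregate = TypeVar("PerUserAggregate")  # output of step 1
Result = TypeVar("Result")                      # final anonymized output

def two_step_aggregation(
    values_per_user: Dict[str, List[float]],
    per_user_agg: Callable[[List[float]], PerUserAggregate],    # step 1: not DP by itself
    cross_user_agg: Callable[[List[PerUserAggregate]], Result]  # step 2: adds DP noise
) -> Result:
    # Step 1: aggregate the values of each user separately. This bounds the
    # influence of any single user on the input of the next step.
    partials = [per_user_agg(values) for values in values_per_user.values()]
    # Step 2: run a differentially private aggregation across users.
    return cross_user_agg(partials)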

Average

Averages are a first simple example. At first glance, they seem to be nothing more than a sum divided by a count. However, to compute them this way, the analyst would need to specify three parameters: a lower and an upper bound for the sum, and an upper bound on the number of contributions per user. Thus, they would need to know not only the range of typical values, but also the number of values that each user typically contributes. This is a lot to ask of an analyst who simply wants to know the average of a column. We could use automatic bounds determination for both operations, but the privacy budget cost would then be particularly high.
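
For illustration, here is a simplified sketch of this three-parameter approach; function and parameter names are ours, and the even budget split between sum and count is a naive choice, not a recommendation.

import numpy as np

def dp_average(values_per_user, lower, upper, max_contributions, epsilon,
               rng=None):
    # The analyst must supply three parameters: value bounds [lower, upper]
    # and a bound on the number of contributions per user.
    rng = rng or np.random.default_rng()
    eps_sum, eps_count = epsilon / 2, epsilon / 2  # naive budget split
    total_sum, total_count = 0.0, 0
    for values in values_per_user.values():
        values = list(values)[:max_contributions]  # bound contributions
        total_sum += float(sum(np.clip(values, lower, upper)))  # clamp values
        total_count += len(values)
    # One user changes the sum by at most max_contributions * max(|lower|, |upper|),
    # and the count by at most max_contributions.
    sum_sensitivity = max_contributions * max(abs(lower), abs(upper))
    noisy_sum = total_sum + rng.laplace(0, sum_sensitivity / eps_sum)
    noisy_count = total_count + rng.laplace(0, max_contributions / eps_count)
    return noisy_sum / max(noisy_count, 1.0)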

To compute averages, an alternative option is to first compute a non-private per-user average, and then to average the results across all users in a private way. This changes the semantics of the operation: we are now computing an average of averages, instead of an average of all values. If there is a correlation between the distribution of values and the number of contributions (for example, each user contributes either a few low values or many large values), this might introduce bias. This change of semantics might seem shocking at first glance, but note that it is not limited to averages: clamping, or contribution bounding, also changes the semantics of a query.
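
A sketch of this alternative, under the same illustrative conventions as above: each user is reduced to a single clamped average, so the analyst only needs to provide the value bounds, and each user's contribution to the cross-user step is bounded to one value by construction.

import numpy as np

def dp_average_of_averages(values_per_user, lower, upper, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    # Per-user step: a non-private average, clamped to [lower, upper].
    per_user_avgs = [float(np.clip(np.mean(values), lower, upper))
                     for values in values_per_user.values() if len(values) > 0]
    # Cross-user step: each user now contributes exactly one value.
    eps_sum, eps_count = epsilon / 2, epsilon / 2
    sum_sensitivity = max(abs(lower), abs(upper))
    noisy_sum = sum(per_user_avgs) + rng.laplace(0, sum_sensitivity / eps_sum)
    noisy_count = len(per_user_avgs) + rng.laplace(0, 1 / eps_count)
    return noisy_sum / max(noisy_count, 1.0)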

In practice, averaging averages does not create significant quality issues. Thus, in scenarios where usability is most important, like the SQL implementation of our framework, we use this method. For implementations of the framework that tend to be used by engineers with more knowledge about their data and with the ability to run exploratory analyses, we opt for the mechanism that performs a global sum and divides it by a global count.

Note that in that case, the output of the per-user aggregation is different from the overall result. Each per-user aggregation must return a partial sum and a partial count, while the cross-user aggregation only returns an average value. Thus, we have to modify our framework to remove the simplifying assumption that the per-user aggregation is simply the non-private version of the cross-user aggregation.

Variance

The variance operation has the same basic problem as the average operation, although the right decision is even more clear-cut: computing a per-user variance, and taking the variance of the results, is obviously not a valid way to compute the total variance of all values. In fact, there is no obvious way to compute a per-user partial aggregation that would allow us to compute the variance of the values. The only option is to preserve the original values, and for the per-user “aggregation” step to be only a sampling step, which selects up to a fixed number of values associated with each user. This underscores the importance of extending our framework to make sure that the per-user aggregation step can be arbitrarily different from the cross-user DP aggregation.
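
As an illustrative sketch, the cross-user step can compute the variance from three noisy aggregates (count, sum, and sum of squares), using Var(X) = E[X^2] - E[X]^2, while the per-user step only truncates each user's list of values. Names and the naive three-way budget split are ours, not taken from an actual implementation.

import numpy as np

def dp_variance(values_per_user, lower, upper, max_values, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    eps = epsilon / 3  # naive split across the three aggregates
    count, total, total_sq = 0, 0.0, 0.0
    for values in values_per_user.values():
        for v in list(values)[:max_values]:  # per-user step: sampling only
            v = min(max(v, lower), upper)    # clamp to [lower, upper]
            count += 1
            total += v
            total_sq += v * v
    max_abs = max(abs(lower), abs(upper))
    noisy_count = count + rng.laplace(0, max_values / eps)
    noisy_sum = total + rng.laplace(0, max_values * max_abs / eps)
    noisy_sum_sq = total_sq + rng.laplace(0, max_values * max_abs ** 2 / eps)
    mean = noisy_sum / max(noisy_count, 1.0)
    # Noise can make this estimate slightly negative; clamp it at 0.
    return max(noisy_sum_sq / max(noisy_count, 1.0) - mean * mean, 0.0)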

Quantiles

Quantiles present interesting challenges. First, they raise the same question as averages: is it appropriate to use the non-private version of the aggregation as a per-user aggregation step? For some quantiles, the answer is clearly positive: for example, to compute the minimum value of a dataset, we can take the per-user minimum, and run the differentially private minimum on this set. This limits the contribution per user to 1, which is good for utility, and does not change the semantics of the aggregation.

However, the situation is different for non-extremum quantiles. For medians, taking the per-user median value and computing the median of those can introduce the same kind of bias as for averages. Computing near-extremum quantile values (like the 1st and 99th percentiles) per user is also highly nontrivial if each user only contributes a handful of values. Thus, as for averages, it might be a good idea to simply sample a fixed number of records as a per-user “aggregation” step, although deciding when to make this choice is not obvious.
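
Such a per-user “aggregation” step could look like the following sketch: a uniform reservoir sample of up to a fixed number of values per user, leaving the actual quantile computation to the cross-user DP step. The function name is ours.

import random

def per_user_sample(values, max_values, rng=None):
    # Reservoir sampling (Algorithm R): keep a uniform random sample of up
    # to max_values elements, in a single pass and with bounded memory.
    rng = rng or random.Random()
    sample = []
    for i, v in enumerate(values):
        if i < max_values:
            sample.append(v)
        else:
            j = rng.randint(0, i)  # inclusive bounds
            if j < max_values:
                sample[j] = v
    return sample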

Second, privately computing quantiles in a scalable way does not seem to be a solved problem. Most common options implicitly assume that all values can fit in memory, which is unsuited to large-scale datasets and massively parallel computation frameworks. This is the case for Bayesian binary search [BOH08, KK07], for smooth sensitivity [NRS07], and for the exponential mechanism [DL09]. Existing methods have been proposed in the literature for the non-private case, like KLL [KLL16], but those do not provide privacy guarantees, and naively adding noise to their result does not provide good utility, since their global sensitivity is unbounded. A naive option is to randomly sample values when there are too many values to fit them all in memory, but this option likely provides very suboptimal accuracy.

Third, all the methods that we could find in the literature on differentially private quantiles also implicitly assume that each user contributes at most one value. They can be adapted to the case where each user contributes multiple values by bounding the total number of values contributed by each user, and using composition theorems, but there is no reason why this naive strategy would provide optimal or even good utility.

Finally, many practical use cases for quantiles actually require computing multiple quantiles for the same series of values. For example, it is common to compute latency at the 50th, 95th, and 99th percentiles [api]. When naively using the private methods mentioned above, each such percentile must be computed on a separate privacy budget, which is likely far from optimal. We leave it to future work to improve the state of the art of differentially private quantiles.

Counting distinct values

Counting distinct values corresponds to the following operation.

 
SELECT
  website,
  COUNT(DISTINCT browser_agent)
FROM access_logs
GROUP BY website

Listing 4.11: Count distinct operation

This is easy to do in a scalable way if we want to count unique user identifiers, but if we want to count distinct values (here, browser agents), it presents an interesting scalability challenge. If the total number of distinct values is small enough to keep all of them in memory, the problem is trivial: simply add Laplace noise to the exact count, scaled by the maximum number of contributions for any single user. However, if linear memory usage is not an option because of scalability requirements, we need a smarter approach.
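
A sketch of this in-memory case, under add-or-remove-one-user differential privacy: removing a user changes the distinct count by at most their contribution bound, so Laplace noise with that scale suffices. Function and parameter names are illustrative.

import numpy as np

def dp_count_distinct(values_per_user, max_contributions, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    distinct = set()
    for values in values_per_user.values():
        # Bound each user's contribution to max_contributions values.
        distinct.update(list(values)[:max_contributions])
    # Removing one user changes the distinct count by at most
    # max_contributions, so this is the sensitivity of the exact count.
    return len(distinct) + rng.laplace(0, max_contributions / epsilon)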

Many non-private sketching algorithms have been proposed for this problem, and a number of them are listed in Section 3.3.1. How can we make their output differentially private? Note that this is a fundamentally different problem from the one tackled in Section 3.3: we do not want to protect the sketches themselves, but only their output. A noisy variant of Bloom filters was proposed for a related problem in [ACG18], and in [CDSKY20], the authors show that LogLog [DF03] achieves differential privacy “for free” if we assume the hashes used in the algorithm are computed after adding a random salt to the input values. This method can serve as a first approach to implement this important primitive.
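
To illustrate the idea (this is a minimal sketch, not the exact construction analyzed in [CDSKY20]), here is a LogLog sketch in which all hashes are salted with a secret random value; the privacy guarantee relies on the salt never being released, and on the conditions stated in the paper.

import hashlib
import os

class SaltedLogLog:
    def __init__(self, log2_buckets=10):
        self.b = log2_buckets
        self.m = 1 << log2_buckets
        self.registers = [0] * self.m
        self.salt = os.urandom(16)  # secret salt: must never be released

    def add(self, value: str):
        h = int.from_bytes(
            hashlib.sha256(self.salt + value.encode()).digest(), "big")
        bucket = h & (self.m - 1)   # low bits pick the register
        rest = h >> self.b
        rank = 1                    # position of the first 1-bit
        while rest & 1 == 0 and rank < 64:
            rank += 1
            rest >>= 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def estimate(self) -> float:
        # LogLog estimator [DF03]; 0.39701 is the limit of the bias
        # correction constant for large numbers of registers.
        return 0.39701 * self.m * 2 ** (sum(self.registers) / self.m)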

In the generic case where each user contributes multiple values to the initial dataset, another natural problem emerges: how to choose the values that each user contributes in an optimal way? For example, assume each user contributes two values: one fixed value which is the same across all users, and a value that is different for each user. If the contribution bound is fixed to 1, choosing one of these two values at random for each user is very suboptimal: we should, instead, try to pick rare values to increment the total count. This raises a problem similar to the one tackled in [GGK20]: the choice for each user should depend on the choice for other users, but this obviously makes it trickier to calculate sensitivity. We are not aware of any research on this specific point; we leave this as an open research question.
