4.3.3  Improving coverage and utility

In Section 4.2, we presented a framework for computing differentially private statistics, and we explained how it could be implemented as an extension of a general-purpose SQL engine. SQL has major usability advantages: the ability to write SQL queries is more widespread than the ability to write code in a traditional programming language. The declarative nature of SQL also gives us great freedom to rewrite or optimize the way queries are run, and semantic analysis is much easier to perform on SQL queries than on code written in a Turing-complete programming language.

However, more powerful programming languages and computation models also come with a number of distinct advantages. In this section, we briefly present an alternative implementation of our framework on Apache Beam [apa], then we illustrate the possible benefits that can be gained from such an alternative.

Privacy on Beam

SQL is the tool of choice of many data analysts, but many engineers who build pipelines for large-scale systems opt for more powerful computation frameworks, like Apache Beam  [apa], an evolution of prior technologies like MapReduce  [DG10]. Apache Beam is a programming model in which a client can describe a data processing pipeline, and run it in a massively parallel fashion without having to implement parallelization themselves, or worry about technical details like redundancy.

Apache Beam has client libraries written in Java, Go, and Python. They allow clients to specify complex data transformations that would be difficult to express in SQL, either because of syntactic inelegance (e.g. chaining a large number of search & replace operations on a string) or lack of language support (e.g. computations that require loops, or calling an external service). Such examples are very common in complex data pipelines, which explains the popularity of frameworks like Apache Beam in organizations running large-scale computational tasks. Furthermore, even though frameworks like Apache Beam also have a declarative aspect (high-level API functions do not provide interface guarantees on their execution plan), it is often easier than with SQL to understand which actual computation tasks are executed when running a pipeline, allowing engineers to finely optimize their performance if needed.

Migration costs are one of the main obstacles to the wide-scale adoption of privacy technologies: nobody likes to hear that they will need to change their practices and infrastructure entirely to add a layer of privacy protection, and asking people to do so is usually a losing battle. This is especially the case if their use case requires a more powerful framework than the one we are trying to push them towards. Thus, it is crucial to implement differential privacy features in the tools that people are already using. This is a core reason why we invested in adapting the model we presented in Section 4.2 to frameworks like Apache Beam. We have done so and published an open-source implementation in Go, available at [prib]. In this section, we discuss the advantages of using a more powerful framework, and the challenges that came with this project.

Overview

Let us give an example, freely inspired by the public codelab for Privacy on Beam [pria]. The following Apache Beam pipeline, using the Go SDK, assumes that the input data is a collection of structs of type Visit, which encapsulate information about individual visits to a given shop.

 
type Visit struct {
  VisitorID  string
  Day        time.Time // day of the visit
  EurosSpent int
}

Suppose we want to compute the total euros spent by weekday. A traditional Apache Beam pipeline would look like the following, assuming that the initial collection is stored in a variable named input.

 
func extractDayAndSpend(v Visit) (time.Time, int) {
  return v.Day, v.EurosSpent
} 
 
// Initialize the pipeline and its scope 
p := beam.NewPipeline() 
s := p.Root() 
input := readInput(s, "path/to/file") 
 
// Extract day and spend from each Visit 
extracted := beam.ParDo(s, extractDayAndSpend, input) 
// Sum the total spend per day
totalSpent := stats.SumPerKey(s, extracted) 
 
// Determine where to write the results and execute the pipeline
writeOutput(s, totalSpent)
runner.Execute(context.Background(), p) 

In the code above, input is a PCollection<Visit>: it is conceptually similar to an array, except that elements are not directly accessible, as a PCollection can represent a very large collection stored across multiple machines. Then, extracted and totalSpent are PCollections with (key, value) type <time.Time,int>.

Now, to make this pipeline differentially private, we need to first encapsulate the input into a PrivatePCollection, which will implicitly track user identifiers.

 
// Encapsulate the input 
epsilon, delta := 1.0, 1e-10 
privacySpec := pbeam.NewPrivacySpec(epsilon, delta)
pcol := pbeam.MakePrivateFromStruct( 
  s, input, privacySpec, "VisitorID" 
) 
 
// Extract day and spend from each Visit 
extracted := pbeam.ParDo(s, extractDayAndSpend, pcol) 
// Sum the spend per day, bounding each user's contributions
sumParams := pbeam.SumParams{ 
  MaxPartitionsContributed: 5, 
  MinValue: 0, 
  MaxValue: 100, 
} 
totalSpent := pbeam.SumPerKey(s, extracted, sumParams) 

The MaxPartitionsContributed, MinValue, and MaxValue parameters are conceptually the same as the $C_u$, $L$, and $U$ introduced in Section 4.2: how many different partitions a single user can contribute to, and how many euros they can contribute at most in a single partition.

Interface design

A SQL query is conceptually easy to understand: there is a clear input and output, and it is relatively easy to check that it satisfies all the constraints of our framework (see Section 4.2.3). By contrast, an Apache Beam pipeline can have multiple outputs, and contains transformations with arbitrarily complex logic implemented in a Turing-complete language. Thus, it is much more difficult to automatically analyze its semantics, or to use rewriting operations like in SQL. This makes the adaptation of our framework difficult.

We solve this problem by introducing PrivatePCollection. The high-level intuition is that a PrivatePCollection encapsulates a PCollection, and provides a simple yet powerful interface guarantee: all PCollections that are output from this PrivatePCollection are differentially private, with the parameters and user identifier given during initialization. Not only does this cleanly delineate the private from the non-private parts of a pipeline, but it also prevents library users from mistakenly copying data that has not yet been made differentially private.

This interface is also easy to understand for programmers already familiar with Apache Beam: PrivatePCollection can be used just like PCollection, with additional constraints to ensure privacy guarantees. Ideally, an operation is allowed by the interface if and only if it is compatible with our privacy model.

Behind the scenes, a PrivatePCollection<Visit> is actually a PCollection of type <userID,Visit>: we keep track of the user identifier associated with each record, and this user identifier is then used for all privacy-sensitive operations. This plays the same role as the subquery constraints from Section 4.2.3, which also guarantee that each record belongs to a single user. Importantly, per-user transformations are still allowed: using pbeam.ParDo, the equivalent of beam.ParDo, a client can use arbitrary logic to transform a record, including filtering records out or creating multiple records from a single one. The outputs of a pbeam.ParDo transformation on a record owned by a given user ID are also owned by the same user ID.
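For instance, here is a minimal sketch of such a per-user transformation, reusing the Visit type and the pcol variable from the overview above; the keepLargeVisits helper is purely illustrative, not part of the library.

func keepLargeVisits(v Visit, emit func(Visit)) {
  // Arbitrary per-record logic is allowed, including dropping records.
  if v.EurosSpent >= 10 {
    emit(v)
  }
}

// The output PrivatePCollection keeps the same ownership metadata: every
// record emitted here still belongs to the user who owned the input record.
filtered := pbeam.ParDo(s, keepLargeVisits, pcol)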

Improved expressibility

The power of a programming language like Go is convenient to clients of the library, as it enables them to implement complex transformations. It also presents an opportunity for us to surface more options to the client, so they can fine-tune the behavior of their differential privacy pipeline and optimize the utility of the data they generate.

We briefly touched on an example in Section 4.3.1: in the SQL implementation of our framework, we compute averages as the average of per-user averages, for usability reasons. Many aggregations share the same two options (a lower and an upper bound), so having a third option only for averages would be confusing and error-prone, as SQL options are specified as unnamed arguments to the corresponding function. When using Apache Beam, this problem is mitigated: we can use structs with named fields as options for each aggregation, which makes the code easier to read.
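For instance, here is a sketch of a differentially private mean spend per day, with the per-partition contribution count surfaced as an explicit named field; the field names follow the open-source library, but the exact set of options may vary across versions.

avgSpent := pbeam.MeanPerKey(s, extracted, pbeam.MeanParams{
  MaxPartitionsContributed:     5,   // cross-partition bound
  MaxContributionsPerPartition: 2,   // the extra per-partition option
  MinValue:                     0,   // clamping bounds for each value
  MaxValue:                     100,
})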

The possibilities offered by a powerful framework like Apache Beam do not stop there. In the SQL version of our framework, a single value of $C_u$ was used for all aggregations of a query, and the privacy budget was shared equally across aggregations. Again, usability is a main reason for these technical choices: surfacing too many options is awkward and confusing. But these choices are not optimal in general, and Apache Beam offers more customization options, so pipeline owners can tweak these values to optimize the utility of their data more finely.
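As an illustration, here is a sketch of a pipeline splitting its budget unevenly between two aggregations, using per-aggregation Epsilon and Delta fields as exposed by the open-source library; the exact budget-accounting semantics may differ across versions, and extractDay is a hypothetical helper.

func extractDay(v Visit) time.Time { return v.Day }

// Spend a fifth of the budget counting visits per day...
days := pbeam.ParDo(s, extractDay, pcol)
visitsPerDay := pbeam.Count(s, days, pbeam.CountParams{
  Epsilon:                  0.2,
  Delta:                    5e-11,
  MaxPartitionsContributed: 5,
  MaxValue:                 1, // each visitor counted at most once per day
})

// ... and most of it on the more important sum.
totalSpent := pbeam.SumPerKey(s, extracted, pbeam.SumParams{
  Epsilon:                  0.8,
  Delta:                    5e-11,
  MaxPartitionsContributed: 5,
  MinValue:                 0,
  MaxValue:                 100,
})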

Finally, in our SQL system, the partition selection primitive was silently added next to the other aggregations present in a query. By contrast, in Privacy on Beam, generating a list of partitions can be a standalone primitive, and differentially private aggregations can use a fixed list of partitions. This is not only useful to manually set the parameters used within this primitive (exact mechanism, privacy budget, number of partitions each user can contribute to), but also allows an analyst to manually specify a list of partitions that is not data-dependent, and skip thresholding altogether.
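Concretely, this could look like the following sketch, assuming for the example that partitions are keyed by weekday names; function and parameter names follow the open-source library, but may differ across versions.

// Standalone, tunable partition selection.
selected := pbeam.SelectPartitions(s, pcol, pbeam.SelectPartitionsParams{
  Epsilon:                  0.1,
  Delta:                    1e-11,
  MaxPartitionsContributed: 5,
})

// Alternatively, a fixed, data-independent list of partitions skips
// thresholding altogether: the seven weekdays are public knowledge.
weekdays := beam.CreateList(s, []string{
  "Monday", "Tuesday", "Wednesday", "Thursday",
  "Friday", "Saturday", "Sunday",
})
sumParams := pbeam.SumParams{
  MaxPartitionsContributed: 5,
  MinValue:                 0,
  MaxValue:                 100,
  PublicPartitions:         weekdays,
}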

Gaussian noise

The Laplace mechanism is, by far, the most common way of building differentially private algorithms. Its simplicity makes it particularly easy to reason about and use in a framework like ours: it provides $\varepsilon$-DP (with $\delta=0$) for a single metric, and the noise it requires scales linearly with the sensitivity of each aggregation and with the maximum number of partitions that a single user can contribute to. This last fact is a direct application of the basic composition theorem (Proposition 5, Section 2.1.6), which is known not to be tight [DRV10, KOV17]. In particular, in [KOV17], the authors prove the following composition theorem, which is tight in general.

Theorem 15 (Theorem 3.3 in [KOV17]). For any $\varepsilon \ge 0$ and $\delta \in [0,1]$, the sequential composition of $k$ independent $(\varepsilon,\delta)$-DP mechanisms satisfies $\left(\varepsilon_i,\,1-(1-\delta)^k(1-\delta_i)\right)$-DP, with:
\[ \varepsilon_i = (k-2i)\,\varepsilon \]
for all integers $0 \le i \le \lfloor k/2 \rfloor$, where:
\[ \delta_i = \frac{\sum_{\ell=0}^{i-1}\binom{k}{\ell}\left(e^{(k-\ell)\varepsilon}-e^{(k-2i+\ell)\varepsilon}\right)}{\left(1+e^{\varepsilon}\right)^{k}}. \]
Thus, a natural option for improving utility is to use this result to split the privacy budget between different aggregations, and between the different contributions of a single user across partitions in each aggregation. However, this idea is less attractive than it initially appears. First, it only brings significant utility gains for large values of $k$ (larger than 20), which limits its usefulness in practice, as the maximum number of partitions a single user contributes to is often lower. Second, it is very tricky to implement: one needs to invert the formula above to split a given privacy budget, and floating-point issues appear very quickly when doing so.

Note that this tight composition theorem is generic: it holds for any $(\varepsilon,\delta)$-DP mechanism. But in our framework, we really only care about the noise mechanism used in practice. If we can find a better composition result for Laplace noise, or even change the type of noise we are using, we might get stronger results and pay a smaller complexity cost.

The Gaussian mechanism turns out to be a strong candidate for such an optimization: instead of using noise drawn from a Laplace distribution, the Gaussian mechanism adds Gaussian noise to the result of the original query. We first introduce this mechanism, then discuss its implementation, and finish by pointing out a few natural open questions.

Definition

The Gaussian mechanism was first proposed in [DR14], and is a fundamental building block of differentially private machine learning [BST14, ACG+16]. Let us first introduce the concepts necessary to define this mechanism formally.

Definition 85 (Gaussian distribution). The Gaussian distribution of mean $\mu$ and of variance $\sigma^2$ is the probability distribution with probability density function:
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \]

Using the Gaussian distribution provides benefits besides the utility gains we will discuss shortly. This distribution plays a central role in many scientific theories and practices, so most data analysts and engineers are already familiar with it and its basic properties. It also has a much more “predictable” behavior, with tails decaying extremely fast: the probability that a random variable sampled from a Gaussian distribution of mean $0$ and variance $\sigma^2$ falls outside $[-5\sigma,5\sigma]$ is less than one in a million.

Crucially, the Gaussian mechanism depends on a different notion of sensitivity than the Laplace mechanism: the $L^2$ sensitivity.

Definition 86 ($L^2$ sensitivity). Let $f : \mathcal{D} \to \mathbb{R}^d$ be a deterministic function. The $L^2$ sensitivity of $f$ is the smallest number $\Delta_2(f)$ such that for all datasets $D_1$ and $D_2$ differing only in one record:
\[ \left\lVert f(D_1)-f(D_2) \right\rVert_2 \le \Delta_2(f) \]
where $\lVert\cdot\rVert_2$ is the $L^2$ norm.

We can now introduce a tight analytical result from [BW18], which quantifies the noise variance that is necessary and sufficient for the Gaussian mechanism to provide differential privacy, based on the $L^2$ sensitivity of a function.

Theorem 16 (Theorem 8 in [BW18]). Let $f : \mathcal{D} \to \mathbb{R}^d$ be a deterministic function with finite $L^2$ sensitivity $\Delta_2(f)$. The mechanism defined by $M(D) = f(D) + X$, where $X$ is a random vector of $\mathbb{R}^d$ whose coordinates are independently sampled from a Gaussian distribution of mean $0$ and variance $\sigma^2$, is $(\varepsilon,\delta)$-differentially private iff:
\[ \Phi\!\left(\frac{\Delta_2(f)}{2\sigma}-\frac{\varepsilon\sigma}{\Delta_2(f)}\right) - e^{\varepsilon}\,\Phi\!\left(-\frac{\Delta_2(f)}{2\sigma}-\frac{\varepsilon\sigma}{\Delta_2(f)}\right) \le \delta \]
where $\Phi$ is the cumulative distribution function of a Gaussian distribution of mean $0$ and variance $1$.

This formula looks complex, but it is straightforward to implement a binary-search-based algorithm that finds the lowest possible $\sigma$ providing $(\varepsilon,\delta)$-DP given the sensitivity $\Delta_2(f)$.
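For illustration, here is a minimal sketch of such a binary search in Go, ignoring the floating-point hardening concerns discussed in the Implementation subsection below; the function names are ours, not those of any particular library.

import "math"

// phi is the cumulative distribution function of the standard Gaussian.
func phi(x float64) float64 {
  return 0.5 * (1 + math.Erf(x/math.Sqrt2))
}

// deltaFor evaluates the left-hand side of the formula in Theorem 16: the
// smallest delta for which Gaussian noise of standard deviation sigma is
// (epsilon, delta)-DP for a function of L2 sensitivity l2Sens.
func deltaFor(sigma, l2Sens, epsilon float64) float64 {
  a := l2Sens / (2 * sigma)
  b := epsilon * sigma / l2Sens
  return phi(a-b) - math.Exp(epsilon)*phi(-a-b)
}

// minSigma binary-searches for the lowest sigma achieving (epsilon, delta)-DP.
// deltaFor is decreasing in sigma, so we first grow an upper bound, then bisect.
func minSigma(l2Sens, epsilon, delta float64) float64 {
  lo, hi := 0.0, l2Sens
  for deltaFor(hi, l2Sens, epsilon) > delta {
    lo, hi = hi, 2*hi
  }
  for i := 0; i < 100; i++ { // plenty of iterations for float64 precision
    mid := (lo + hi) / 2
    if deltaFor(mid, l2Sens, epsilon) > delta {
      lo = mid
    } else {
      hi = mid
    }
  }
  return hi
}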

Importantly, this formula is invariant if $\Delta_2(f)$ and $\sigma$ are scaled by the same number: the standard deviation of the noise required for $(\varepsilon,\delta)$-DP is a linear function of $\Delta_2(f)$. This insight is the main reason why Gaussian noise is particularly well-suited to contexts where each user can contribute to many partitions. Indeed, recall that the $L^2$ norm of a vector $x \in \mathbb{R}^d$ is defined by:
\[ \lVert x \rVert_2 = \sqrt{\sum_{i=1}^d x_i^2}. \]
Each partition corresponds to a particular dimension of $f$: to compute $\Delta_2(f)$, note that if a user can contribute to at most $C_u$ partitions, all but $C_u$ of the coordinates of $f(D_1)-f(D_2)$ are going to be $0$. If the maximum change to each partition is $\Delta$, then $\Delta_2(f) = \Delta\sqrt{C_u}$: the magnitude of the noise scales with the square root of the number of partitions contributed to, rather than linearly as with Laplace noise. With typical values of $\varepsilon$ and $\delta$, this makes Gaussian noise the better choice when $C_u$ is large.
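To get a rough sense of where the crossover happens, we can compare standard deviations, assuming on the Laplace side that $\varepsilon$ is split equally across the $C_u$ partitions a user contributes to, and using for simplicity the classical, non-tight Gaussian bound $\sigma = \Delta_2(f)\sqrt{2\ln(1.25/\delta)}/\varepsilon$ from [DR14] rather than Theorem 16:
\[
\sigma_{\text{Laplace}} = \sqrt{2}\,\frac{C_u\,\Delta}{\varepsilon}
\qquad\text{vs.}\qquad
\sigma_{\text{Gaussian}} = \frac{\Delta\sqrt{C_u}\,\sqrt{2\ln(1.25/\delta)}}{\varepsilon}
\approx 6.8\,\frac{\Delta\sqrt{C_u}}{\varepsilon}
\quad\text{for } \delta = 10^{-10}.
\]
With these (pessimistic for Gaussian) numbers, the Gaussian standard deviation drops below the Laplace one as soon as $\sqrt{C_u} \gtrsim 6.8/\sqrt{2}$, i.e. $C_u \gtrsim 23$; the tight bound of Theorem 16 pushes this crossover even lower.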

Implementation

Implementing Gaussian noise properly presents a few challenges. First, naively implementing this continuous probability distribution on finite computers introduces vulnerabilities. This fact is well-known for Laplace noise [Mir12], and a quick analysis of common random number generation libraries shows that the problem would also arise for naively implemented Gaussian noise: to generate 64-bit floating-point numbers, Java uses the Marsaglia polar method [MB64] with only 53 bits of randomness [javb, java], Python uses the same method [pyta, pytb], Go uses the Ziggurat method, which only uses 32 bits of randomness 99% of the time [gon], etc. Thus, the noisy values will necessarily be sparse within the space of all possible 64-bit floating-point values, which creates the same vulnerability as the one described in [Mir12]. It is therefore crucial to devote some attention to this problem, and come up with robust implementations that are proven to uphold $(\varepsilon,\delta)$-DP guarantees. A handful of solutions have recently been proposed [CKS20, Tea, Proc], so we assume that this problem can be solved.

Another challenge is that we do not want Gaussian noise to be the only option in our framework: some applications require pure $\varepsilon$-DP, and for low values of $C_u$, using Laplace noise is much better for utility, even for relatively large values of $\delta$. So we need to design a system in which we can easily swap one type of noise for another. However, Laplace noise and Gaussian noise use different notions of sensitivity, which makes this non-trivial. With Laplace noise, we could simply use the standard composition theorem to split the privacy budget across partitions, and consider each partition independently. With Gaussian noise, however, we need a global view of what each aggregation is doing in each partition: we cannot simply rely on splitting the value of $\varepsilon$ into equal parts.

A natural way to solve this design question is to pass the number of partitions that each user contributes to as a parameter to each aggregator, along with the privacy budget associated with the entire aggregation. This way, each aggregator can compute the appropriate magnitude of the noise. Note that yet another subtlety appears here: Theorem 16 assumes that there is a uniform bound on the per-partition contribution of individual users, but this is not always the case in practice, for example if we use approximate bounds determination (Section 4.2.5) per partition. Luckily, it is relatively straightforward to extend this result to the case where a user can contribute more in some partitions than in others.

Proposition 31 (Generalized analytical Gaussian mechanism). Let $f : \mathcal{D} \to \mathbb{R}^d$ be a deterministic function such that for all databases $D_1$ and $D_2$ differing in a single record, and for all $i \in \{1,\dots,d\}$:
\[ \left|f_i(D_1)-f_i(D_2)\right| \le \Delta_i \]
where $f_i$ denotes the $i$-th coordinate of $f$. The mechanism defined by $M(D) = f(D) + X$, where $X$ is a random vector of $\mathbb{R}^d$ whose $i$-th coordinate is independently sampled from a Gaussian distribution of mean $0$ and variance $\sigma_i^2$, is $(\varepsilon,\delta)$-differentially private if for all $i$:
\[ \Phi\!\left(\frac{\Delta_i\sqrt{d}}{2\sigma_i}-\frac{\varepsilon\sigma_i}{\Delta_i\sqrt{d}}\right) - e^{\varepsilon}\,\Phi\!\left(-\frac{\Delta_i\sqrt{d}}{2\sigma_i}-\frac{\varepsilon\sigma_i}{\Delta_i\sqrt{d}}\right) \le \delta. \]
Proof. Note that this mechanism is equivalent to computing $f$, dividing each coordinate $f_i$ by $\Delta_i$, adding i.i.d. Gaussian noise according to Theorem 16 with $\Delta_2 = \sqrt{d}$, then multiplying each coordinate $i$ back by $\Delta_i$.

Once reformulated this way, the result is almost immediate: the first rescaling gives a mechanism of $L^2$ sensitivity at most $\sqrt{d}$, the noise addition step guarantees $(\varepsilon,\delta)$-differential privacy according to Theorem 16, and the result follows directly by post-processing. □
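To make the earlier design discussion concrete, here is a hypothetical sketch of what a swappable noise interface could look like in Go. The names are ours, not the actual API of the open-source libraries; the samplers are the naive (insecure) ones flagged above, kept only to make the sketch self-contained; and minSigma is the binary search sketched earlier in this section.

import (
  "math"
  "math/rand"
)

// noiseAdder is a hypothetical interface: each aggregation passes its
// contribution bounds and the full privacy budget, and each noise type
// derives its own parameters from them.
type noiseAdder interface {
  // addNoise perturbs one per-partition value; l0 bounds the number of
  // partitions a user contributes to, lInf the per-partition change.
  addNoise(value float64, l0 int64, lInf, epsilon, delta float64) float64
}

type laplaceNoise struct{}

func (laplaceNoise) addNoise(v float64, l0 int64, lInf, eps, _ float64) float64 {
  // The L1 sensitivity is l0*lInf: this is equivalent to splitting epsilon
  // equally across the l0 partitions a user touches.
  scale := float64(l0) * lInf / eps
  u := rand.Float64() - 0.5 // naive sampling: insecure, see [Mir12]
  return v - scale*sign(u)*math.Log(1-2*math.Abs(u))
}

type gaussianNoise struct{}

func (gaussianNoise) addNoise(v float64, l0 int64, lInf, eps, delta float64) float64 {
  // The L2 sensitivity is lInf*sqrt(l0); minSigma is the binary search
  // sketched earlier in this section.
  sigma := minSigma(lInf*math.Sqrt(float64(l0)), eps, delta)
  return v + rand.NormFloat64()*sigma // same caveat: naive sampling
}

func sign(x float64) float64 {
  if x < 0 {
    return -1
  }
  return 1
}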

Finally, note that the core insight behind using Gaussian noise to improve the utility of differentially private queries is to consider the release of multiple metrics together, as if they were a single query producing one output in $\mathbb{R}^d$, instead of considering them separately. Doing this for a single histogram is natural. An additional possible extension is to apply the same reasoning across entirely different queries. Consider, for example, the following queries.

 
SELECT 
  browser_agent, 
  COUNT(DISTINCT uid) 
FROM access_logs 
GROUP BY browser_agent  
Listing 4.13: Query counting users, grouped by browser agent
 
SELECT 
  hour_of_day, 
  COUNT(DISTINCT uid) 
FROM access_logs 
GROUP BY hour_of_day  
Listing 4.14: Query counting users, grouped by hour of day

Both queries have a per-partition sensitivity of $1$, and we might choose different cross-partition contribution bounds for the two queries (say, $C_1$ and $C_2$). To make the results of both queries $(\varepsilon,\delta)$-differentially private using Gaussian noise, it is natural to split the privacy budget between the two. But for the same reason as before, it is better to consider them as a single query, in which each user can contribute to at most $C_1+C_2$ partitions: this allows us to scale the noise by $\sqrt{C_1+C_2}$ while using the full budget for both queries, instead of adding noise scaled by $\sqrt{C_1}$ and $\sqrt{C_2}$ but computed from roughly half the budget each, which is about twice as large. This leads to a natural trade-off between utility and design best practices: such optimizations require a simultaneous view of all queries and of the way each of them is implemented, which goes against typical good practices in design, like the compartmentalization of a system into simple parts.
