Lowering the cost of anonymization

a PhD thesis

4.2.1  Introduction

Query engines are a major analysis tool for data scientists, and one of the most common ways for analysts to write queries is with Structured Query Language (SQL). As a result, multiple query engines have been developed to enable data analysis while enforcing DP [42, 209, 233, 273], and all of them use a SQL-like syntax.

However, as we discuss in Section 4.2.2, these differentially private query engines make some implicit assumptions, notably that each individual in the underlying dataset is associated with at most one dataset record. This does not hold in many real-world datasets, so the privacy guarantee offered by these systems is weaker than advertised for those datasets. To overcome this limitation, we introduce a generic mechanism for bounding user contributions to a large class of differentially private aggregate functions. We then propose a design for a SQL engine using these contribution bounding mechanisms to enforce DP, even when a given individual can be associated with arbitrarily many records or when the query contains joins.
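
To make the idea of contribution bounding concrete, here is a minimal sketch (ours, not the engine's actual implementation) of a differentially private sum in which each user may own many rows: the per-user totals are clamped to an assumed bound `max_contribution` before Laplace noise is added, so that the noise scale only depends on that bound rather than on how many rows a single user contributes. All names and parameters below are illustrative assumptions.

```python
import random
from collections import defaultdict

def bounded_dp_sum(rows, max_contribution, epsilon):
    """Differentially private sum over a dataset where one user may own many rows.

    rows: iterable of (user_id, value) pairs.
    max_contribution: each user's total contribution is clamped to
        [-max_contribution, max_contribution], so the user-level sensitivity
        of the sum is at most max_contribution.
    epsilon: privacy parameter of the Laplace mechanism.
    """
    # Group the rows by owner and clamp each user's total contribution.
    per_user = defaultdict(float)
    for user_id, value in rows:
        per_user[user_id] += value
    bounded_total = sum(
        max(-max_contribution, min(max_contribution, total))
        for total in per_user.values()
    )
    # Add Laplace noise scaled to the user-level sensitivity. The difference of
    # two independent Exp(1/scale) draws is distributed as Laplace(0, scale).
    scale = max_contribution / epsilon
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return bounded_total + noise
```

With this approach, `bounded_dp_sum(rows, max_contribution=10.0, epsilon=1.0)` adds noise of scale 10 regardless of how many rows any single user owns; without the clamping step, one prolific user could shift the true sum arbitrarily far, and no fixed noise scale would hide their presence.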

Our work goes beyond this design and accompanying analysis: we also describe the implementation of these mechanisms as part of a SQL engine, and the challenges encountered in the process. We describe the testing framework we use to increase our level of trust in the system's robustness. To aid in reproducing our work and to encourage wider adoption of differential privacy, we release core components of the system, as well as a distinct implementation of this framework, as open-source software.

Requirements and contributions

To be useful for non-expert analysts, a differentially private SQL engine must satisfy at least the following requirements.

  • It must make realistic assumptions about the data, specifically allowing multiple records to be associated with an individual user.
  • It must support typical data analysis operations, such as counts, sums, means, percentiles, etc.
  • It must provide analysts with information about the accuracy of the results returned by the engine, and uphold clear privacy guarantees.
  • It must provide a way to test the integrity of the engine and validate the engine’s privacy claims.

In this work, we present a differentially private SQL engine that satisfies these requirements. Our contributions are as follows.

  • We detail how we use the concept of row ownership to enforce the original meaning of differential privacy: the output of the analysis does not reveal anything about a single individual. In our engine, multiple rows can be associated with the same “owner” (hereafter referred to as a user, although the owner could also be a group), and the differential privacy property is enforced at the user level.
  • We implement common aggregations (counts, sums, medians, etc.), arbitrary per-record transforms, and joins on the row owner column as part of our engine. To do so, we provide a method of bounding query sensitivity and stability across transforms and joins, and a mechanism to enforce row ownership throughout the query transformation.
  • We detail some of the usability challenges that arise when trying to use such a system in production and increase its adoption. In particular, we explain how we communicate the accuracy impact of differential privacy to analysts, and we experimentally verify that the noise levels are acceptable in typical conditions. We also propose an algorithm for automatic sensitivity determination.
  • We present a testing framework that helps verify that $(\varepsilon,\delta)$-DP aggregation functions are correctly implemented, and that can be used to detect software regressions that break the privacy guarantees; a sketch of the kind of check it performs follows this list.
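
As a hedged illustration of the last point, the sketch below estimates output histograms of a mechanism on two neighboring datasets and flags an empirical violation of the $(\varepsilon,\delta)$-DP inequality on the sampled buckets. The mechanism under test, the bucketing, and the sample size are assumptions of this example, not the framework's actual interface.

```python
import math
from collections import Counter

def violates_dp_predicate(mechanism, data, neighbor, epsilon, delta,
                          num_samples=100_000, bucket_width=1.0):
    """Empirically look for a violation of the (epsilon, delta)-DP inequality.

    mechanism: a randomized function taking a dataset and returning a number.
    data, neighbor: two neighboring datasets (row-level or user-level).
    Returns True if some output bucket empirically violates
        P[M(data) in S] <= exp(epsilon) * P[M(neighbor) in S] + delta
    (or the symmetric inequality). True suggests a bug; False is not a proof.
    """
    def output_histogram(dataset):
        counts = Counter(
            math.floor(mechanism(dataset) / bucket_width)
            for _ in range(num_samples)
        )
        return {bucket: count / num_samples for bucket, count in counts.items()}

    hist_a = output_histogram(data)
    hist_b = output_histogram(neighbor)
    for bucket in set(hist_a) | set(hist_b):
        p, q = hist_a.get(bucket, 0.0), hist_b.get(bucket, 0.0)
        if p > math.exp(epsilon) * q + delta or q > math.exp(epsilon) * p + delta:
            return True
    return False
```

A production-grade tester additionally has to account for sampling error (for example, by only flagging violations that exceed a confidence bound) and to search over many pairs of neighboring datasets; this sketch only conveys the shape of the predicate being checked.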

We hope that this work, and the associated open-source release, can increase the appropriate adoption of differential privacy by providing a usable system based on popular tools used by data analysts.

Multiple differentially private query engines have been proposed in the literature. In this work, we mainly compare our system to two existing differentially private query engines: PINQ [273] and Flex [209]. Our work differs in two major ways from these engines: we support the common case where a single user is associated with multiple rows, and we support arbitrary GROUP BY statements.

Another line of research focuses on building frameworks to define differentially private algorithms: examples include Airavat [334], Ektelo [404], and OpenDP's programming framework [156]. These are building blocks that help write correct differentially private algorithms, but they require significant changes in how programs are written, and we argue that they cannot be used as-is without prior expertise in differential privacy.

In these systems, a single organization is assumed to hold all the raw data. Query engines can also be used in other contexts: differential privacy can be used in concert with secure multiparty computation techniques to enable join queries between datasets held by different organizations; systems such as DJoin [293] and Shrinkwrap [42] tackle this specific use case.

A significant amount of research focuses on improving the accuracy of query results while still maintaining differential privacy. In this work, for clarity, we keep the description of our system conceptually simple, and explicitly do not make use of techniques like smooth sensitivity [303], tight privacy budget computation methods [213, 278], variants of the differential privacy definition (Section 2.2), adjustment of noise levels to a pre-specified set of queries [249], or generation of differentially private synthetic data to answer arbitrarily many queries afterwards [49, 232, 233]. We revisit certain design choices and outline possible improvements later, in Section 4.3.

The testing framework we introduce in Section 4.2.6.0 is similar to recent work in verification for differential privacy [47, 109, 167], and was developed independently. Other approaches use semantic analysis, possibly augmented with automated proving techniques, to certify that algorithms are differentially private [32, 33, 292].

Our work is not the first to use noise and thresholding to preserve privacy: this method was originally proposed in [174, 230] in the specific context of releasing search logs with $(\varepsilon,\delta)$-DP; our work can be seen as an extension and generalization of this insight. Diffix [155] is another system using similar primitives; however, it does not provide any formal privacy guarantee, and has been shown to be vulnerable to attacks [79, 80, 157], so a meaningful comparison with our work is not feasible. In Section 4.2.4, we provide a comparison of query accuracy between our work, PINQ, and Flex.

Preliminaries

We introduce here the definitions and notations used throughout this section, which are also summarized on page 38. Because a significant part of this work specifically focuses on the case where a single user contributes multiple records to the dataset, we no longer consider a dataset as a sequence of records in $\mathcal{X}$, but as a sequence of rows, where each row is a pair $(u, x) \in \mathcal{U} \times \mathcal{X}$: each row is a record $x$ associated with a specific user $u$.

Definition 78 (Distance between datasets). We call row-level change the addition or removal of a single row from a dataset, and user-level change the addition or removal of all rows associated with a user. Given two datasets $D$ and $D'$, we denote by $\|D - D'\|_{row}$ the minimum number of row-level changes necessary to transform $D$ into $D'$, and by $\|D - D'\|_{user}$ the minimum number of user-level changes necessary to transform $D$ into $D'$; we call these quantities the row-level distance and the user-level distance, respectively.
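
As a small worked example (added here for illustration, not part of the original definition): if $D'$ is obtained from $D$ by deleting the three rows owned by a single user, then
\[
\|D - D'\|_{row} = 3 \qquad \text{while} \qquad \|D - D'\|_{user} = 1,
\]
so a mechanism whose noise is calibrated only to row-level changes underestimates the impact of adding or removing that user.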

In the original definition of $\varepsilon$-differential privacy, each user is implicitly assumed to contribute a single record. Since we want to consider the case where this is not true, we define two variants of differential privacy to make this distinction explicit.

Definition 79 ($(\varepsilon,\delta)$-row-level and user-level differential privacy). A randomized mechanism $M$ satisfies row-level $(\varepsilon,\delta)$-DP (respectively user-level $(\varepsilon,\delta)$-DP) if for all pairs of datasets $D$ and $D'$ that satisfy $\|D - D'\|_{row} \le 1$ (respectively $\|D - D'\|_{user} \le 1$), $M(D) \approx_{\varepsilon,\delta} M(D')$.

As previously, $\varepsilon$-DP is an alias for $(\varepsilon, 0)$-DP, and $\approx_{\varepsilon,\delta}$ denotes $(\varepsilon,\delta)$-indistinguishability (see Definition 14).
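
As a reminder, using the standard formulation of this notion (Definition 14 appears earlier in this thesis and is not restated here), two random variables $A$ and $B$ are $(\varepsilon,\delta)$-indistinguishable if, for all sets $S$ of possible outputs,
\[
\Pr[A \in S] \le e^{\varepsilon} \Pr[B \in S] + \delta
\qquad \text{and} \qquad
\Pr[B \in S] \le e^{\varepsilon} \Pr[A \in S] + \delta.
\]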

Note that this notion is technically unbounded differential privacy, defined in [227] and mentioned in Section 2.2.3.0: we only allow neighboring datasets to differ in a row (or all rows associated with a user) that has been added or removed, not changed. Up to a change in parameters, it is equivalent to the classical definitions, but we found that this choice significantly simplifies the analysis and the implementation of differential privacy tooling.

Finally, let us define $L_p$-sensitivity, which is defined on functions that take a dataset as input and return a vector in $\mathbb{R}^d$, for some integer $d$.

Definition 80 ($L_p$-Sensitivity). Let $\|\cdot\|_p$ be the $L_p$-norm on $\mathbb{R}^d$. The global $L_p$-sensitivity of a function $f$ is the smallest number $\Delta_p(f)$ such that:

\[
\|f(D) - f(D')\|_p \le \Delta_p(f)
\]

for all $D$ and $D'$ such that $\|D - D'\|_{row} \le 1$. The user-global $L_p$-sensitivity of $f$ is the smallest number $\Delta_p^{user}(f)$ such that the above holds for all $D$ and $D'$ such that $\|D - D'\|_{user} \le 1$.
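
As a worked example that we add for illustration: consider the counting function $f(D) = |D|$, which returns the number of rows in the dataset. Its global $L_p$-sensitivity is $1$, since adding or removing one row changes the count by exactly one. Its user-global $L_p$-sensitivity, by contrast, is unbounded unless the number of rows per user is bounded: if each user owns at most $C$ rows, then
\[
\|f(D) - f(D')\|_p \le C \quad \text{for all } D, D' \text{ such that } \|D - D'\|_{user} \le 1,
\]
which is exactly why the contribution bounding mechanisms described in this section are needed before noise can be calibrated.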
