4.2.2  A simple example: histograms

Before describing the technical details of our system, let us first give an intuition of how it works using a simple example: histogram queries. Consider a simple dataset that logs accesses to a given website. An analyst wants to know which browser agents are most commonly used among users visiting the page. A typical query to do so is presented in Listing 4.1.

 
SELECT 
  browser_agent, 
  COUNT(*) AS visits 
FROM access_logs 
GROUP BY browser_agent  
Listing 4.1: Simple histogram query

How would one make this simple operation ε-differentially private? One naive approach is to add Laplace noise of scale 1/ε to each count, as described in Section 2.1.6. Unfortunately, this solution suffers from several shortcomings.
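To make this naive approach concrete, here is a small Python sketch; it is not part of the system described in this chapter, and the names naive_dp_histogram, raw_counts, and epsilon are purely illustrative. It simply adds Laplace noise of scale 1/ε to each count of a precomputed histogram.

import numpy as np

def naive_dp_histogram(raw_counts, epsilon):
    # Add Laplace noise of scale 1/epsilon to every count. This is only
    # epsilon-DP if each record changes a single count by at most 1,
    # which, as discussed below, fails when one user contributes many rows.
    return {
        partition: count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)
        for partition, count in raw_counts.items()
    }

# Example: raw per-browser-agent visit counts.
print(naive_dp_histogram({"Firefox": 1200, "Chrome": 3400}, epsilon=1.0))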

First pitfall: multiple contributions within a partition

It might look like the naive approach correctly hides the existence of individual records in the dataset: each record of the access log influences at most one of the returned counts, and changes it by at most 1, and it is well known [DMNS06] that this basic mechanism provides ε-DP. However, it fails to protect the existence of individual users: the same user could have visited the example page many times with a particular browser agent, and therefore could have contributed an arbitrarily large number of rows to the visits count of a particular GROUP BY partition, violating our assumption that the query sensitivity is 1.

In PINQ and Flex, the differential privacy definition explicitly considers records as the privacy unit. Because we instead want to protect the full contribution of users, we need to explicitly include a notion of a user in our system design. In this work, we do this via the notion of a user identifier, or user ID.

Because Listing 4.1 has unbounded sensitivity, adding noise to counts is not enough to enforce differential privacy; we need to bound user contribution to each partition. This can be addressed by counting distinct users, which has a user-global sensitivity of 1, instead of counting rows. Although this modifies query semantics, we chose this approach to keep the example simple. We present the modified query in Listing 4.2, which assumes that the column containing user ID information is named uid.

 
SELECT 
  browser_agent, 
  COUNT(DISTINCT uid) + Laplace(1/ε) 
FROM access_logs 
GROUP BY browser_agent  
Listing 4.2: Per-partition contribution bounding

In other contexts, it might make more sense to allow a user to contribute more than once to a partition. For example, we might want to count up to five visits from each distinct user with each distinct browser agent. In this case, we would need to further modify the query to allow multiple contributions per user, and to scale the noise to the correspondingly larger sensitivity: the maximum number of contributions allowed per user.
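As an illustration of this kind of per-partition contribution bounding, here is a hedged Python sketch; the input format (a list of (uid, browser_agent) pairs) and the parameter max_contributions are assumptions made for the example, not part of the system. Each user is counted at most max_contributions times in each partition, and the noise is scaled accordingly; with max_contributions=1, this is equivalent to COUNT(DISTINCT uid).

from collections import defaultdict
import numpy as np

def per_partition_bounded_counts(rows, epsilon, max_contributions=1):
    # Keep at most `max_contributions` rows per (partition, user) pair, so
    # each user changes any single partition's count by at most
    # `max_contributions`; Laplace noise of scale max_contributions/epsilon
    # then hides that bounded contribution.
    kept = defaultdict(int)    # (partition, uid) -> rows kept so far
    counts = defaultdict(int)  # partition -> bounded count
    for uid, partition in rows:
        if kept[(partition, uid)] < max_contributions:
            kept[(partition, uid)] += 1
            counts[partition] += 1
    scale = max_contributions / epsilon
    return {p: c + np.random.laplace(0.0, scale) for p, c in counts.items()}

rows = [("alice", "Firefox"), ("alice", "Firefox"), ("bob", "Chrome")]
print(per_partition_bounded_counts(rows, epsilon=1.0, max_contributions=1))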

Second pitfall: leaking partitions

Even if we bound the contribution of a user to each partition and adapt noise levels accordingly, the query is still not ε-DP. Indeed, suppose that the attacker is trying to distinguish between two datasets D₁ and D₂ differing in only one record, and that this record contains a unique browser agent b: this browser agent does not appear in D₁, but appears once in D₂. Then, irrespective of the value of the noisy counts, the GROUP BY partitions themselves are enough to distinguish between the two datasets simply by looking at the output: b will appear in the query output for D₂ but not for D₁.

A simple solution to this problem was proposed in [KKMN09]: the idea is to drop from the results all partitions associated with a noisy count lower than a certain threshold τ. The threshold τ is chosen independently of the data, and the resulting process is (ε,δ)-DP with δ > 0. We call this mechanism τ-thresholding. With a sufficiently high τ, the output rows with partitions present in D₂ but not in D₁ (and vice versa) will be dropped with high probability, making the partitions indistinguishable to an attacker. A longer discussion of the relation between ε, δ, and τ can be found in Section 4.2.3. This approach is represented in SQL in Listing 4.3.

 
SELECT 
  browser_agent, 
  COUNT(DISTINCT uid) + Laplace(1/ε) AS c 
FROM access_logs 
GROUP BY browser_agent 
HAVING c >= τ
Listing 4.3: GROUP BY filtering
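To make the effect of τ-thresholding concrete, here is a short Python sketch; it is again illustrative, the parameter names and input format are assumptions, and the precise relationship between ε, δ, and τ is deferred to Section 4.2.3. It adds noise to per-partition distinct-user counts and drops every partition whose noisy count falls below τ.

import numpy as np

def thresholded_histogram(distinct_user_counts, epsilon, tau):
    # distinct_user_counts maps each partition to its COUNT(DISTINCT uid).
    # Partitions whose *noisy* count falls below tau are dropped, so a
    # partition that exists only because of a single user is unlikely to
    # be released; the residual risk is what delta quantifies.
    result = {}
    for partition, count in distinct_user_counts.items():
        noisy = count + np.random.laplace(0.0, 1.0 / epsilon)
        if noisy >= tau:
            result[partition] = noisy
    return result

counts = {"Firefox": 1200, "Chrome": 3400, "RareBrowser/0.1": 1}
print(thresholded_histogram(counts, epsilon=1.0, tau=10))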

PINQ and Flex handle this issue by requiring the analyst to enumerate all partitions to use in a GROUP BY operation, and returning noisy counts for exactly those partitions¹. This enforces ε-DP, but impairs usability: the range of possible values is often large (potentially the set of all strings) and difficult to enumerate, especially if the analyst cannot look at the raw data. Some data synthesis algorithms have been proposed to release ε-DP histograms over an unbounded set of partitions [McSa], but those seem to be limited to datasets amenable to hierarchical decomposition. Our approach is simpler and more generic, at some cost in the privacy guarantee.

Third pitfall: contributions to multiple partitions

Finally, we must consider the possibility of a user contributing to multiple partitions in our query. Imagine a user visiting the example page with many different browsers, each with a different browser agent. Such a user could potentially contribute a value of 1 to each partition’s count, changing the sensitivity of the query to be the number of partitions, which is unbounded!

Because both PINQ and Flex consider records as the privacy unit, this is not an issue for their privacy models. So long as they are only used on datasets where that requirement holds true, and where the sensitivity and stability impact of joins (and related operations) are carefully considered, they will provide adequate DP guarantees. However, as shown in [McSb], these conditions do not always hold.

Instead of adding strict requirements on the nature of the underlying dataset and on how joins are used, we introduce a novel mechanism for bounding user contribution across partitions. Concretely, we first choose a number C_u, and for each user, we randomly keep the contributions to at most C_u partitions, dropping the contributions to all other partitions. This operation allows us to bound the global sensitivity of the aggregation: each user can then influence at most C_u unique counts, and we can adapt the noise level added to each count by using Laplace noise of scale C_u/ε.
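The sketch below illustrates this cross-partition bounding step in Python, under the same illustrative assumptions as the earlier snippets (a list of (uid, partition) pairs, a parameter named c_u): for each user, it keeps at most c_u distinct partitions, chosen uniformly at random, and drops the rest.

import random
from collections import defaultdict

def bound_partitions_per_user(rows, c_u, seed=None):
    # For each user, keep contributions to at most c_u partitions and drop
    # the others, so that a single user can influence at most c_u of the
    # per-partition counts.
    rng = random.Random(seed)
    partitions_per_user = defaultdict(set)
    for uid, partition in rows:
        partitions_per_user[uid].add(partition)

    kept = []
    for uid, partitions in partitions_per_user.items():
        sample = rng.sample(sorted(partitions), min(c_u, len(partitions)))
        kept.extend((uid, p) for p in sample)
    return kept  # deduplicated (uid, partition) pairs, at most c_u per user

rows = [("alice", "Firefox"), ("alice", "Chrome"), ("alice", "Safari")]
print(bound_partitions_per_user(rows, c_u=2, seed=0))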

The final version of our query is shown in Listing 4.4. It uses a non-standard variant of the SQL TABLESAMPLE operator, which supports partitioning and reservoir sampling, to represent the mechanism we introduced. This final version satisfies (ε,δ)-differential privacy for well-chosen parameters.

 
SELECT 
  browser_agent, 
  COUNT(DISTINCT uid) + Laplace(C_u/ε) AS c 
FROM ( 
  SELECT browser_agent, uid 
  FROM access_logs 
  GROUP BY browser_agent, uid 
) 
TABLESAMPLE RESERVOIR (C_u ROWS PARTITION BY uid) 
GROUP BY browser_agent 
HAVING c >= τ
Listing 4.4: An (ε,δ)-DP query
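Putting the pieces together, the following self-contained Python sketch mirrors the structure of Listing 4.4. It is a conceptual model built on the same illustrative assumptions as the earlier snippets, not the actual implementation: deduplication per (user, partition), reservoir-style sampling of at most C_u partitions per user, Laplace noise of scale C_u/ε, and τ-thresholding.

import random
from collections import defaultdict
import numpy as np

def dp_browser_agent_histogram(rows, epsilon, tau, c_u):
    # rows: iterable of (uid, browser_agent) pairs from access_logs.
    # Inner GROUP BY browser_agent, uid: each user counts once per partition.
    partitions_per_user = defaultdict(set)
    for uid, agent in rows:
        partitions_per_user[uid].add(agent)

    # TABLESAMPLE RESERVOIR (c_u ROWS PARTITION BY uid): keep at most c_u
    # randomly chosen partitions per user.
    counts = defaultdict(int)
    for uid, agents in partitions_per_user.items():
        for agent in random.sample(sorted(agents), min(c_u, len(agents))):
            counts[agent] += 1  # COUNT(DISTINCT uid) per browser_agent

    # Laplace noise of scale c_u/epsilon, then HAVING c >= tau.
    noisy = {a: c + np.random.laplace(0.0, c_u / epsilon)
             for a, c in counts.items()}
    return {a: c for a, c in noisy.items() if c >= tau}

rows = [("alice", "Firefox"), ("alice", "Chrome"), ("bob", "Firefox")]
print(dp_browser_agent_histogram(rows, epsilon=1.0, tau=3, c_u=1))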

In the remainder of this section, we formalize this approach, and adapt it to a larger set of operations. In particular, we extend it to arbitrary aggregations with bounded sensitivity, and we explain how to make this model compatible with joins.

¹ The open-source implementation of Flex [JN], however, does not appear to implement this requirement.
