Lowering the cost of anonymization

a PhD thesis

4.2  Building a usable differentially private query engine

In the previous section, we saw that risk analysis use cases, where datasets are generated via an unknown process and we are trying to quantify their risk, cannot be easily tackled with mechanism-centric definitions like differential privacy. This is only a partial answer to the original question of this chapter: for all cases where we can control the data generation process, why is differential privacy not more widely used? Since its introduction, differentially private variants have been proposed for most common data analysis tasks: from simple aggregations, like counts [DMNS06], sums, or medians [NRS07, DL09], to more complex analyses, like k-means clustering [NRS07] or empirical risk minimization [CMS11]. Yet, real-world deployments of such algorithms seem to be few and far between.

In this section, we argue that one of the main reasons for this state of affairs is usability. On paper, differential privacy is relatively simple, but in practice, few implementations are readily available, and few people will go through the trouble of implementing an algorithm from a scientific paper by themselves. The lack of existing implementations is partially explained by a misalignment of incentives: academics get more recognition and career progress out of scientific papers than usable open-source software. But this is only a partial explanation: it is also surprisingly difficult to convert algorithms from the scientific literature into a usable, scalable, and solid implementation.
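To make the "on paper" simplicity concrete, here is a minimal sketch of a differentially private count using the Laplace mechanism. It is written in Python purely for illustration; the function name and interface are made up, and it relies on the usual textbook assumption that each user contributes at most one record, so the count has sensitivity 1.

```python
import numpy as np

# Minimal sketch of a differentially private count (illustrative only).
# Assumes each user contributes at most one record, so the count has
# sensitivity 1 and Laplace noise of scale 1/epsilon suffices on paper.
def dp_count(records, epsilon):
    return len(records) + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Toy usage: a noisy count of three records with epsilon = 1.
print(dp_count(["alice", "bob", "carol"], epsilon=1.0))
```

The rest of this section is, in a sense, about everything this tiny mechanism glosses over.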

The potential issues are numerous. Basic primitives such as noise addition are often modeled using continuous numbers with infinite precision, and translating them naively to finite computers using floating-point numbers can lead to significant vulnerabilities. Many scientific papers make assumptions that are unrealistic in practical use cases, like the implicit premise that each user contributes a single data point to the input dataset, that the aggregations happen along partitions that are known in advance, or that the input dataset can fit entirely in memory. Production applications also require a high degree of trust in the implementation, and it is non-trivial to adapt software reliability techniques like unit testing to randomized algorithms. Further, a tool for differential privacy must be usable by non-experts, and this requires putting significant thought into interface design.
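As an illustration of the first pitfall, the sketch below (Python, with made-up function names) contrasts a naive floating-point Laplace sum with a variant that rounds the noisy output to a coarse grid, in the spirit of the "snapping" mitigations discussed in the literature. This is only a sketch of the idea under simplifying assumptions, not the defense used by the engine described below.

```python
import numpy as np

def naive_laplace_sum(values, epsilon, clamp=1.0):
    # Clamp each value, then add Laplace noise as if floating-point
    # numbers were real numbers; the uneven spacing of floats means the
    # set of reachable outputs can betray information about the true sum.
    true_sum = np.sum(np.clip(values, -clamp, clamp))
    return true_sum + np.random.laplace(scale=clamp / epsilon)

def rounded_laplace_sum(values, epsilon, clamp=1.0, granularity=2 ** -10):
    # Same mechanism, but the output is rounded to a fixed grid, which
    # removes tell-tale patterns in the low-order bits. A real defense
    # (e.g. a snapping mechanism) requires a more careful analysis.
    noisy = naive_laplace_sum(values, epsilon, clamp)
    return granularity * round(noisy / granularity)
```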

In this section, we propose a generic and scalable framework to perform differentially private aggregations on databases, and use it as an example to introduce possible approaches to tackle the problems listed above. This framework, which we express as an operator in relational algebra, provides differential privacy even when each individual can be associated with arbitrarily many records. To validate this system, we test the utility of typical queries on industry benchmarks, and verify its correctness with a stochastic testing framework. We highlight the benefits provided by such a query engine and the pitfalls encountered in the course of its deployment. Finally, we publish one of its implementations as open-source software.
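To make the contribution-bounding idea concrete, here is a hedged sketch, in Python with made-up names, of what a per-partition aggregation with bounded user contributions could look like: each user's records are grouped, the number of partitions they influence and the magnitude of each contribution are bounded, and noise is scaled to the resulting per-user sensitivity. It is not the actual operator described in this section, and it sidesteps the separate question of which partitions are safe to release at all.

```python
import numpy as np
from collections import defaultdict

def dp_sum_per_partition(rows, epsilon, max_partitions=1, max_value=1.0):
    # `rows` is an iterable of (user_id, partition_key, value) triples.
    # Group each user's contributions per partition.
    per_user = defaultdict(lambda: defaultdict(float))
    for user_id, partition, value in rows:
        per_user[user_id][partition] += value

    # Bound contributions: each user keeps at most `max_partitions`
    # partitions (truncated arbitrarily here), and each per-partition
    # contribution is clamped to [-max_value, max_value].
    sums = defaultdict(float)
    for contributions in per_user.values():
        for partition, value in list(contributions.items())[:max_partitions]:
            sums[partition] += float(np.clip(value, -max_value, max_value))

    # A single user now changes the vector of sums by at most
    # max_partitions * max_value in total, so scale the Laplace noise
    # to this L1 sensitivity.
    scale = max_partitions * max_value / epsilon
    return {p: s + np.random.laplace(scale=scale) for p, s in sums.items()}
```

The point of the sketch is that the bounding happens per user, not per record: this is what allows the privacy guarantee to hold even when one individual appears in arbitrarily many rows.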
