Lowering the cost of anonymization

a PhD thesis

4.3  Further improvements and open questions

In the previous section, we presented a generic system to answer SQL queries with user-level differential privacy. This system is remarkably simple: it analyzes the ways in which a typical SQL query can break differential privacy, and proposes simple building blocks to overcome each obstacle. Importantly, it covers most use cases we see in practice: counts and sums are enough for the vast majority of aggregation queries that analysts and engineers typically need to run on sensitive data. For these typical use cases, our simple system performs well, scales to arbitrary data sizes, and can be implemented and tested in a robust, production-ready fashion. We hope that this simplicity, together with the open-source publication of the implementation, will encourage further research and engineering contributions in this direction, and increase the adoption of differential privacy.
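To make the "simple building blocks" concrete, here is a minimal sketch of a user-level differentially private sum: each user's total contribution is clamped to fixed bounds, and Laplace noise calibrated to those bounds is added. This is an illustrative toy, not the thesis's actual implementation; the function name, the dict-based input, and the inverse-CDF Laplace sampler are all choices made here for brevity.

```python
import math
import random

def dp_sum(values_per_user, lower, upper, epsilon):
    """Toy user-level DP sum (illustrative sketch, not the thesis's code).

    values_per_user: dict mapping a user id to the list of values
    that user contributed.
    """
    total = 0.0
    for values in values_per_user.values():
        # Bound each user's influence by clamping their summed contribution.
        contribution = max(lower, min(upper, sum(values)))
        total += contribution
    # After clamping, one user can change the result by at most this much.
    sensitivity = max(abs(lower), abs(upper))
    scale = sensitivity / epsilon
    # Sample Laplace(0, scale) noise via the inverse CDF.
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return total + noise
```

With a large privacy budget the noise is small, so the output stays close to the clamped (not the raw) sum: a user contributing 10 with an upper bound of 5 only adds 5 to the total.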

In this section, we list possible improvements and extensions to our framework. We raise many questions and provide only a few answers, most of which are partial: the road to building a general-purpose, usable, scalable, and robust differentially private query engine is long, and many aspects of this problem are still largely unexplored.

First, in Section 4.3.1, we discuss how to fit other kinds of aggregations within our framework. For some of them, this task is not trivial: it requires making the computation model more generic, which presents interesting design challenges. Further, some of these primitives are simply not well-studied in the existing literature, especially under strict scalability requirements, and under the assumption that each user can contribute multiple data points.

Second, in Section 4.3.2, we come back to the problem of partition selection, introduced in Section 4.2.2.0 and solved using Laplace noise and thresholding in Section 4.2.3.0. We introduce an optimal partition selection primitive for the common special case where each user contributes to only a single partition, and we discuss its extension to other cases.
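As a reminder of the baseline this section improves on, here is a toy sketch of Laplace-based partition selection by thresholding, assuming each user contributes to exactly one partition. The threshold formula used below is one common calibration for Laplace thresholding with sensitivity 1; it is not necessarily the exact one used in Section 4.2.3.0, and the function name and input format are invented for this example.

```python
import math
import random

def select_partitions(user_partitions, epsilon, delta):
    """Toy Laplace thresholding for partition selection (illustrative only).

    user_partitions: dict mapping a user id to that user's single partition.
    Returns the set of partition keys deemed safe to release.
    """
    # Count distinct users per partition; one user changes one count by 1.
    counts = {}
    for partition in user_partitions.values():
        counts[partition] = counts.get(partition, 0) + 1
    # A common threshold choice for (epsilon, delta)-DP Laplace thresholding.
    threshold = 1.0 + math.log(1.0 / (2.0 * delta)) / epsilon
    released = set()
    for partition, count in counts.items():
        # Add Laplace(1/epsilon) noise to the user count.
        u = random.random() - 0.5
        noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
        if count + noise >= threshold:
            released.add(partition)
    return released
```

Partitions with many users almost always survive the threshold, while a partition backed by a single user is released only with probability on the order of delta; the optimal primitive discussed in this section improves on this trade-off.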

Third, in Section 4.3.3, we discuss other possible utility and usability improvements: we present Privacy on Beam, an open-source implementation of the framework described in Section 4.2 that makes slightly different design trade-offs, and we explain how Gaussian noise can be used to boost utility when a single user can contribute to many partitions.
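The intuition behind the Gaussian-noise improvement can be sketched with a back-of-the-envelope comparison: when a user contributes to up to c partitions, Laplace noise must scale with the L1 sensitivity (proportional to c), while Gaussian noise scales with the L2 sensitivity (proportional to √c). The sketch below uses the classic (ε, δ) Gaussian mechanism calibration; tighter analyses exist, and the thesis may use a different one, so treat the constants as illustrative.

```python
import math

def laplace_std(c, epsilon):
    """Per-partition noise std. dev. with Laplace noise, when one user
    may change up to c partition counts by 1 (L1 sensitivity c)."""
    scale = c / epsilon           # Laplace scale b = sensitivity / epsilon
    return math.sqrt(2) * scale   # std. dev. of Laplace(b) is b * sqrt(2)

def gaussian_std(c, epsilon, delta):
    """Per-partition noise std. dev. with the classic (epsilon, delta)
    Gaussian mechanism (L2 sensitivity sqrt(c))."""
    l2_sensitivity = math.sqrt(c)
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
```

For a single partition, Laplace noise is smaller; but as c grows, the Gaussian mechanism's √c scaling wins, which is the regime this section targets.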

Fourth, in Section 4.3.4, we focus on operational questions that arise when rolling out differential privacy tooling in practice, and discuss policy topics such as parameter choice or privacy budget tracking.

Finally, in Section 4.3.5, we take a step back and discuss other possible directions for future research on differentially private query engines.

All opinions here are my own, not my employer's.
I'm always glad to get feedback! If you'd like to contact me, please do so via e-mail (se.niatnofsed@neimad) or Twitter (@TedOnPrivacy).