4.3.5  Other possible research directions

The discussion throughout Section 4.3 makes it clear that we are only at the very beginning of the path towards building usable, general-purpose, and robust differential privacy tooling and infrastructure. We introduced many open problems, most of which were about adding features or improving the utility of specific primitives. But this work also suggests a number of larger open questions and possible research directions.

For example, on the utility side, it might be possible to compensate for the data loss due to contribution bounding and thresholding: since we drop records according to a simple probability distribution, it might be feasible to automatically correct for this sampling and rescale the results, in a private way, to limit the bias introduced. Optimizing the algorithms used for specific sets of queries, or caching some results to allow people to re-run queries for free, are also natural options. Results on amplification by sampling might also be used in query frameworks like ours to obtain better privacy/accuracy trade-offs.
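
As a concrete illustration of the first idea, here is a minimal Python sketch of what such a correction could look like for a simple count. It is purely hypothetical and not the algorithm used in our engine: the budget split, the clamping cap, and the assumption that the number of users is public are arbitrary choices made for the sake of the example.

    import numpy as np

    def rescaled_bounded_count(records_per_user, max_contributions, epsilon, seed=0):
        """Toy sketch: a noisy count under per-user contribution bounding,
        rescaled by a privately estimated average number of records per user
        to partially compensate for the records dropped by the bounding."""
        rng = np.random.default_rng(seed)
        n_users = len(records_per_user)  # assumed public (and non-zero) in this sketch

        # Naive 50/50 split of the privacy budget between the two noisy quantities.
        eps_count, eps_avg = epsilon / 2, epsilon / 2

        # Count the kept records: each user contributes at most `max_contributions`
        # of them, so this count has sensitivity `max_contributions`.
        kept = np.minimum(records_per_user, max_contributions)
        noisy_count = kept.sum() + rng.laplace(0, max_contributions / eps_count)

        # Privately estimate the average number of records per user, clamping each
        # user's record count at an arbitrary cap to bound the sensitivity.
        cap = 100 * max_contributions
        clamped = np.minimum(records_per_user, cap)
        noisy_avg = (clamped.sum() + rng.laplace(0, cap / eps_avg)) / n_users

        # Post-processing: inflate the bounded count by the estimated fraction of
        # records that survive the bounding. This is a crude heuristic that ignores
        # the shape of the per-user distribution; a real correction would do better.
        survival = min(noisy_avg, max_contributions) / max(noisy_avg, 1e-9)
        survival = min(max(survival, 1e-3), 1.0)
        return noisy_count / survival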

More generally, we believe that future work on DP should consider that realistic data typically includes multiple contributions from a single user. Taking this into account is not easy: in particular, one of the main obstacles is that asymptotic results then become much more difficult to derive. If every user can contribute multiple records, what is the distribution of the number of records per user? Depending on this distribution, different algorithms may have very different behaviors and privacy/accuracy trade-offs. Comparisons with the state of the art will necessarily be less straightforward, and will involve more experimental validation. Considering this question while designing differentially private algorithms is, however, of utmost importance: many real-world datasets simply do not satisfy the assumption that the input dataset only has a single record per user.
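
As an illustration of how much the answer depends on this distribution, the following small simulation (a made-up experiment, not part of our evaluation) compares the error of a Laplace-noised count under contribution bounding for a light-tailed and a heavy-tailed distribution of records per user: with a heavy tail, small contribution bounds discard a large fraction of the data, regardless of how the noise is calibrated.

    import numpy as np

    def relative_error(records_per_user, max_contributions, epsilon, rng):
        """Relative error of a Laplace-noised count under contribution bounding."""
        true_count = records_per_user.sum()
        bounded_count = np.minimum(records_per_user, max_contributions).sum()
        noisy_count = bounded_count + rng.laplace(0, max_contributions / epsilon)
        return abs(noisy_count - true_count) / true_count

    rng = np.random.default_rng(42)
    n_users, epsilon = 100_000, 1.0

    # Two made-up populations: a light-tailed one, and a heavy-tailed one in which
    # a few users contribute a very large number of records.
    populations = {
        "light tail (Poisson)": rng.poisson(3, n_users) + 1,
        "heavy tail (Zipf)": rng.zipf(2.1, n_users),
    }

    for name, records in populations.items():
        for bound in (1, 5, 50):
            errors = [relative_error(records, bound, epsilon, rng) for _ in range(20)]
            print(f"{name}, bound={bound:>2}: median relative error = {np.median(errors):.3f}")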

This remark about experimental validation underscores the importance of a crucial and largely unexplored area of differential privacy research: benchmarking. Nowadays, scientific papers often provide comparisons with prior work on ad hoc datasets or on synthetic data. The evaluation we presented in Section 4.2.4 is no exception: TPC-H was designed as a general-purpose performance benchmark for database systems [87], and its queries are not representative of those typically run in situations where applying differential privacy makes sense. A few benchmarks today focus on differential privacy, but they only cover specific problems like one- and two-dimensional histograms [189, 190], and they do not account for multiple contributions by the same user. We argue that a good benchmark for general-purpose differentially private query engines should have several important characteristics.

  • The datasets used must be publicly available.
  • The data and the queries run must be representative of real-world use cases for differentially private query engines, including aggregations besides histograms.
  • The datasets and queries used must be large enough to mirror practical scalability requirements.
  • The system under test must provide user-level differential privacy, even when a single user can contribute multiple records.

Building a benchmarking system that satisfies these requirements will not be easy: one of the challenges is that differential privacy is typically applied to sensitive data, which is, by nature, rarely public. Besides the requirements listed above, defining a set of good scoring metrics is also going to be highly non-trivial. However, we believe that such a benchmark could encourage valuable research, making it a promising and worthwhile research direction.
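
To give an idea of the overall shape such a benchmark could take, here is a very rough Python skeleton of a scoring harness. The interface names (BenchmarkQuery, run_dp_query) are invented for this example, and median relative error is only one of many possible metrics; the point is simply that the engine under test is treated as a black box answering queries over rows in which the same user can appear many times.

    from dataclasses import dataclass
    from statistics import median
    from typing import Callable, Dict, Sequence

    Row = tuple  # e.g. (user_id, value): the same user may appear in many rows

    @dataclass
    class BenchmarkQuery:
        name: str
        exact: Callable[[Sequence[Row]], float]                # ground-truth aggregation
        run_dp_query: Callable[[Sequence[Row], float], float]  # (rows, epsilon) -> noisy answer

    def score(queries: Sequence[BenchmarkQuery],
              rows: Sequence[Row],
              epsilon: float,
              repetitions: int = 25) -> Dict[str, float]:
        """Median relative error of each query over repeated runs of the
        (randomized) differentially private engine under test."""
        scores = {}
        for query in queries:
            truth = query.exact(rows)
            errors = [abs(query.run_dp_query(rows, epsilon) - truth) / max(abs(truth), 1.0)
                      for _ in range(repetitions)]
            scores[query.name] = median(errors)
        return scores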

We made a number of statements about improving usability, but we have been defining this concept very loosely, and our remarks were purely based on practical experience helping engineers and data analysts use differential privacy. Such anecdotal experience, however, hardly constitutes scientific evidence that one method is better than another from a usability perspective. To get a better understanding of what helps people use differential privacy in practice, systematic qualitative and quantitative user research is certainly needed.

Finally, we hope that the discussion on operational challenges can serve as a starting point for honest and genuine discussions around best operational practices within the community of anonymization practitioners. However, reaching consensus on these questions will take more than technical expertise: questions like "how do we explain anonymization best practices to non-experts?" cannot be answered by a clever mathematical formula, and a multidisciplinary approach is needed. Academic work can strengthen our understanding of the link between parameter choice and attack success, or take an experimental approach to find out which policies produce the best outcomes, but an open and transparent discussion across the industry is also needed to define good anonymization practices. Perhaps standards bodies or open-source collaborations could be good places to host such discussions?
