4  From theory to practice

Without a plane, what was I supposed to do? Math the problem to death?
(Mary Robinette Kowal, The Calculating Stars)

Despite the academic success of differential privacy, practical applications have largely lagged behind. Scientific papers on DP often cite a few high-profile use cases to show the relevance of this field of research, but these examples almost always come from very large companies or organizations: Google [141], Apple [360], the US Census Bureau [4, 159], Microsoft [107, 224], Uber [210], and others. Is differential privacy a luxury that only large organizations can afford?

Of course, there might be some selection bias at play in this observation. Researchers might see use cases from large organizations as the best way to demonstrate practical relevance, and smaller organizations using DP might have neither the incentives nor the resources to publish about their anonymization practices. Still, using state-of-the-art privacy techniques makes for good press, and the cost of writing a blog post is low. Despite this, in 2020, looking up differential privacy in a search engine still surfaces mostly academic resources and use cases from large organizations.

There are a few exceptions, most of which are fairly recent. Some startups are using differential privacy as a core part of their offering [19, 192, 241, 370], and a few efforts have been initiated to implement differential privacy libraries and publish them as open-source software [105, 173, 308, 356]. Even in those cases, the people behind those projects frequently have an academic background: only organizations that can afford to hire or train domain experts get to use differential privacy. This lag between theory and practice feels surprising: as we showed in Section 2.1.6, the basic intuition behind DP is simple, and it seems easy to generate differentially private statistics by combining simple mechanisms and using the composition properties.
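To make this concrete, here is a minimal sketch of the kind of pipeline this refers to: a count and a bounded sum released with the Laplace mechanism, with the total privacy budget split across the two statistics via sequential composition. This is a generic illustration of the textbook mechanisms recalled in Section 2.1.6, written in Python with NumPy; the data, clamping bounds, and budget split are arbitrary assumptions, and it is not a preview of the system described later in this chapter.

```python
import numpy as np

rng = np.random.default_rng()

def laplace_count(values, epsilon):
    """Release a count with the Laplace mechanism.

    Assumes each individual contributes at most one record, so the
    sensitivity of the count is 1 and the noise scale is 1/epsilon.
    """
    return len(values) + rng.laplace(loc=0.0, scale=1.0 / epsilon)

def laplace_bounded_sum(values, lower, upper, epsilon):
    """Release a sum after clamping each contribution to [lower, upper].

    Clamping bounds the sensitivity of the sum by max(|lower|, |upper|).
    """
    clamped = np.clip(values, lower, upper)
    sensitivity = max(abs(lower), abs(upper))
    return float(np.sum(clamped)) + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Toy data: one value per user, total budget of 1.0 split across two statistics
# (sequential composition: 0.5 + 0.5 = 1.0).
ages = [34, 29, 41, 58, 23, 37]
noisy_count = laplace_count(ages, epsilon=0.5)
noisy_sum = laplace_bounded_sum(ages, lower=0, upper=100, epsilon=0.5)
print(noisy_count, noisy_sum)
```

Even this toy example hints at questions that come up immediately in practice: how to pick the clamping bounds, how to split the budget, and what to do when a user can contribute more than one record.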

What, then, are the obstacles preventing a wider adoption of differential privacy? In this chapter, we present a number of results and insights from four years of working on anonymization at a large company, in a role encompassing consulting, policy, and engineering.

First, we show that certain practical problems simply cannot be solved by applying differential privacy. In particular, this is the case for risk analysis, where the goal is not to make data anonymous, but to estimate “how risky” a given dataset is under some risk model. Tackling this class of problems can be the first step in a company’s journey towards stronger anonymization practices: before deciding to implement better techniques for data sharing and retention, data owners want to understand the risk associated with their existing practices. This setting will force us to revisit the syntactic privacy definitions presented in Section 2.1: when we only have access to the output of an algorithm, without any information about the algorithm itself, we cannot know whether it is differentially private.
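As a simple illustration of what a purely output-based risk estimate can look like, the sketch below computes the size of the smallest group of records sharing the same quasi-identifier values, i.e. the k for which a table is k-anonymous (Section 2.1): a syntactic measure that only needs the data itself, not the process that generated it. The column names, the choice of quasi-identifiers, and the pandas-based implementation are illustrative assumptions, not the risk models discussed in this chapter.

```python
import pandas as pd

def smallest_group_size(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Return the size of the smallest group of records sharing the same
    quasi-identifier values: the table is k-anonymous for this value of k."""
    return int(df.groupby(quasi_identifiers).size().min())

# Toy dataset; the quasi-identifier columns are a hypothetical choice.
records = pd.DataFrame({
    "zip_code": ["8001", "8001", "8002", "8002", "8002"],
    "birth_year": [1985, 1985, 1990, 1990, 1990],
    "diagnosis": ["flu", "cold", "flu", "asthma", "flu"],
})

k = smallest_group_size(records, ["zip_code", "birth_year"])
print(f"The dataset is {k}-anonymous with respect to (zip_code, birth_year).")
```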

Second, we present a differentially private SQL engine, whose primary goal is to make it easy for people—especially non-experts—to generate differentially private statistics. Safety, scalability, and usability are the three main requirements of this system. Taking our SQL engine as an example, we explain how each of these requirements can be met. Safety considerations require a thoughtful and realistic attacker model, going beyond the classical “untrusted analyst” assumption from the scientific literature. Scalability is obtained via a massively parallelizable design, which can be reproduced in contexts other than this SQL engine. To make the system usable, we design an interface that mirrors the ones data analysts are already familiar with; this requires novel methods to remove roadblocks that classical differential privacy mechanisms would otherwise introduce.
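To give a flavor of the kind of computation such an engine performs, here is a conceptual sketch of an anonymized per-group count with per-user contribution bounding and Laplace noise. This is not the engine’s actual implementation: the function names, parameters, and noise calibration are simplifying assumptions (in particular, each user is counted at most once per group). Each step is a per-key map or reduction, which hints at why this kind of design parallelizes well.

```python
from collections import defaultdict
import numpy as np

rng = np.random.default_rng()

def dp_user_counts_per_group(rows, epsilon, max_groups_per_user=1):
    """Conceptual sketch of an anonymized
    SELECT group_key, COUNT(DISTINCT user_id) ... GROUP BY group_key.

    `rows` is an iterable of (user_id, group_key) pairs. Keeping each user in
    at most `max_groups_per_user` groups bounds how much one user can change
    the output, which determines the noise scale.
    """
    # 1. Contribution bounding: keep at most a fixed number of groups per user.
    groups_per_user = defaultdict(set)
    for user_id, group_key in rows:
        kept = groups_per_user[user_id]
        if group_key in kept or len(kept) < max_groups_per_user:
            kept.add(group_key)

    # 2. Aggregation: count distinct contributing users per group. Steps 1 and
    #    2 are simple per-key maps and reductions, hence easy to distribute.
    counts = defaultdict(int)
    for kept in groups_per_user.values():
        for group_key in kept:
            counts[group_key] += 1

    # 3. Noise: one user affects at most `max_groups_per_user` counts, each by
    #    at most 1, so Laplace noise with this scale makes the counts
    #    epsilon-DP. (Which groups appear at all is a separate problem:
    #    partition selection, discussed below.)
    scale = max_groups_per_user / epsilon
    return {g: c + rng.laplace(scale=scale) for g, c in counts.items()}
```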

Third, we raise a number of natural open questions which emerged as we were building and rolling out this differentially private SQL engine, and partially answer them when feasible. Some of these questions are about supporting a wider range of differentially private aggregations, or about improving the utility of a particularly crucial feature; in particular, we present a novel algorithm to improve the utility of the partition selection primitive. Other questions are about operational aspects: how should anonymization parameters be chosen, and how can we provide helpful guidance to users of differential privacy tools?
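For context, a common baseline for partition selection is noisy thresholding: count the users in each partition, add Laplace noise, and release only the partitions whose noisy count exceeds a threshold derived from δ. The sketch below only illustrates what the partition selection primitive does, under the assumption that each user appears in at most one partition; the threshold is one standard choice under that assumption, the parameter values are arbitrary, and this is not the novel algorithm presented later in this chapter.

```python
from collections import Counter
import math
import numpy as np

rng = np.random.default_rng()

def select_partitions(user_partition_pairs, epsilon, delta):
    """Baseline partition selection by noisy thresholding.

    Assumes each user appears in at most one partition: counts the users in
    each partition, adds Laplace(1/epsilon) noise, and keeps a partition only
    if its noisy count reaches a threshold chosen so that a partition with a
    single user is released with probability at most delta.
    """
    # Count distinct users per partition (deduplicate pairs first).
    counts = Counter(partition for _, partition in set(user_partition_pairs))

    # One standard threshold choice under the assumption above.
    threshold = 1.0 + math.log(1.0 / (2.0 * delta)) / epsilon

    kept = []
    for partition, count in counts.items():
        if count + rng.laplace(scale=1.0 / epsilon) >= threshold:
            kept.append(partition)
    return kept

# Toy usage: partitions with few users are very unlikely to be released.
pairs = [(f"user{i}", "zurich") for i in range(50)] + \
        [("alice", "geneva"), ("bob", "geneva")]
print(select_partitions(pairs, epsilon=1.0, delta=1e-5))  # likely ['zurich']
```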

We hope that, taken together, the insights gathered in this chapter will prove helpful to current and future anonymization practitioners.
