Lowering the cost of anonymization

a PhD thesis

4.3.4  Operational aspects of anonymization

Technical contributions can only get us so far. Building the right tooling is only half the battle: rolling out this tooling in practical scenarios presents an array of operational challenges. In this section, we discuss two such challenges, which we encountered while working to scale the use of differential privacy in a large technology company.

The first challenge is about anonymization parameters: how to pick values of ε, δ, or even the unit of privacy in a given setting? The second is about policies: how to best help engineers understand the guarantees provided by our tooling, and correctly apply differential privacy principles to their pipelines? In both cases, we only provide partial answers to these difficult questions, and these answers are largely unscientific. We still hope that they can be a valuable contribution to a wider conversation between anonymization researchers and practitioners about these challenges.

Setting anonymization parameters

As we mentioned in Section 2.1, choosing reasonable anonymization parameters for a given use case has always been a tricky problem. Differential privacy gives a clearer semantic meaning to its parameters than past definitions, allowing one to express the meaning of ε using Bayesian inference (Section 2.1.6) or hypothesis testing (Section 2.2.6). These guarantees are stronger than past attempts at quantifying the risk of e.g. certain choices of k for k-anonymity [EED08], but they do not resolve this issue entirely: choosing parameters remains difficult.
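As a reminder of the kind of semantics alluded to here (a hedged restatement in our own notation; the precise formulation in Section 2.1.6 may differ slightly), for an ε-DP mechanism M, two neighboring datasets D₁ and D₂, and any output O, an attacker's posterior odds between the two hypotheses can exceed their prior odds by a factor of at most e^ε:

\[
\frac{\Pr[D = D_1 \mid M(D) = O]}{\Pr[D = D_2 \mid M(D) = O]}
\;\le\; e^{\varepsilon} \cdot \frac{\Pr[D = D_1]}{\Pr[D = D_2]}.
\]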

The first question around parameters has to be: what are we protecting, exactly? Do we assume that each user contributes exactly one record to the dataset, as is generally done in the literature? Do we protect all contributions from each user, as we argued for in Section 4.2? But the question goes deeper: which unit of privacy should we protect? As the list of variants in Section 2.2.3 shows, there is nothing universal about this choice.

Protecting individuals might be too weak: for example, if statistics are collected about families or households, protecting entire households might make more sense. Using device identifiers, or account identifiers, can be one way of implementing this goal, for example when everyone in a given household interacts with the same voice assistant device. But protecting individuals can also be too strong to give reasonable utility: Facebook’s URL dataset  [MDH20], for example, protects single actions taken on the website (e.g. a single user viewing a single post or sharing a specific URL).

The authors justify this choice by explaining that this allows them to “add significantly less noise compared with [user-level] differential privacy”. This policy view is not entirely incompatible with user-level DP: from action-level DP, the authors derive user-level guarantees that depend on the number of actions each user has taken. They guarantee a given user-level ε for 99% of users, and derive the parameters for action-level DP accordingly. The remaining 1% of users, who contribute more actions, get a weaker degree of protection. This two-step approach is certainly worth considering, although it is not without risks: in [MDH20], adding inactive users to the dataset would implicitly make the privacy guarantees worse for everyone, as more users would end up in the 1% of particularly active users.
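To make this two-step approach concrete, here is a simplified version of the underlying calculation, using the pure-DP group privacy bound (the actual accounting in [MDH20] relies on Gaussian noise and differs in its details; the notation below is ours):

\[
\varepsilon_{\text{user}}(k) \;\le\; k \cdot \varepsilon_{\text{action}}
\qquad\Longrightarrow\qquad
\varepsilon_{\text{action}} \;=\; \frac{\varepsilon_{\text{user}}^{\text{target}}}{k_{99}},
\]

where k is the number of actions taken by a given user and k₉₉ is the 99th percentile of per-user action counts: users with at most k₉₉ actions get the target user-level guarantee, while more active users get a weaker one.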

In other releases, the privacy unit has a time component: Google’s Community Mobility Reports  [ABC20] and Symptom Search Trends  [BDD20] protect users’ contributions in a single day, while LinkedIn’s Audience Engagements API  [RSP20] protects the contributions of each user during each month. When releasing new data regularly for an unbounded period of time, having a temporal dimension in the unit of privacy is a requirement; this is sometimes called “renewing the privacy budget”  [TKB17]. Doing so means that no hard limits can be given on the total privacy loss for each user throughout the unbounded period.
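The underlying arithmetic is simple: if each daily release only uses that day's data and is ε_day-DP with respect to a user-day, then after T days, composition only bounds the user-level privacy loss by

\[
\varepsilon_{\text{total}}(T) \;\le\; T \cdot \varepsilon_{\text{day}},
\]

a quantity that grows without limit as T increases (advanced composition improves this to roughly \(\sqrt{T}\) scaling, at the cost of an additional δ, but the conclusion is the same).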

How risky is such a choice, then? The few practical examples of reconstruction attacks [XTL17, Abo18a] (where the attacker tries to find out full original data records) or membership inference attacks [SSSS17, NSH18, HMDDC19, PTC18, MHKH19] (where the attacker tries to guess whether a data record was part of the original dataset) do not seem to be immediately applicable to such long-lived DP publications. Since correlations in the data are often exploited in attacks [XTL17, PTC18], maybe the tooling introduced in Chapter 3 can be leveraged to quantify the intuition that some data releases present a sufficiently low level of risk, even without formal guarantees? Bridging the gap between strong formal guarantees, like user-level DP with a small ε, and the feasibility of practical attacks, remains an open question.

Choosing ε is also a challenging question. We found that the Bayesian interpretation of differential privacy (Section 2.1.6) can be very useful to help people understand the impact of ε on the level of protection. Concrete examples like randomized response (Section 2.1.6) also help (a short sketch below, after the three categories, illustrates this). Perhaps provocatively, we argue that when trying to roll out differential privacy at scale, looking for a perfect way to determine ε is not a good use of resources; rather, choosing a default value with a reasonable order of magnitude is a better strategy. Having a fixed default value as the starting point for discussions saves a lot of time and effort: use cases whose utility constraints are compatible with the default value can simply use it, limiting the resources spent on these lower-risk projects. Other use cases, for which the default value does not seem to work, typically fall into three categories.

First, the use case might be fundamentally incompatible with any form of anonymization, e.g. if the data is high-dimensional and cannot be aggregated further. Other options must then be explored: a slightly larger ε will not help.

Second, the anonymization strategy might need to be optimized. For example, metrics might be aggregated at a needlessly fine-grained temporal granularity (say, every minute), even though coarser granularities (say, hourly or daily metrics) would be enough to solve the original problem. In that case, adapting privacy parameters is not the right answer: modifying the aggregation process itself is the correct approach. This can also happen when many aggregations seem to be necessary, but all of them can be derived from a smaller subset of aggregations that can be anonymized with the default value of ε.

Third, there are projects for which small variations in ε make a difference in feasibility. In this case, finer-grained analyses are needed, and one can define conditions under which the default value may be relaxed, depending on e.g. data sensitivity, exposure, or retention times. Importantly, a reasonable default value makes these discussions rare: time is better spent improving the tooling, usability, and utility of the vast majority of use cases, rather than discussing parameters for specific projects.
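Here is the randomized response sketch promised above (a minimal illustration, not part of any production tooling; the function name is ours). In binary randomized response, ε maps directly to the probability of answering truthfully, which is often easier to reason about than ε itself:

    import math

    def truthful_response_probability(epsilon: float) -> float:
        # In binary randomized response, answering truthfully with probability
        # p = e^eps / (1 + e^eps) and lying otherwise is eps-DP: the likelihood
        # ratio p / (1 - p) is exactly e^eps.
        return math.exp(epsilon) / (1.0 + math.exp(epsilon))

    for eps in [0.1, 0.5, 1.0, 2.0, 5.0]:
        p = truthful_response_probability(eps)
        print(f"epsilon = {eps:>3}: answer truthfully with probability {p:.3f}")

For ε = 1, for instance, the respondent answers truthfully about 73% of the time, which gives a concrete sense of the plausible deniability they retain.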

Finally, many practical use cases use (ε,δ)-DP with a non-zero δ. What are reasonable values for this parameter? A common suggestion is to pick δ much smaller than 1/n, where n is the number of users in the dataset: a mechanism that releases a uniformly random record is (ε,δ)-DP for ε = 0 and δ = 1/n, since two datasets that differ in a single record would output this record only with probability 1/n, and every other record with the exact same probability under both datasets. To avoid classifying this obviously bad mechanism as private, δ should be significantly smaller than 1/n.
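Spelling out that reasoning (assuming replacement-based neighboring datasets, so both contain n records, and denoting by r the record on which D and D′ differ):

\[
\Pr[M(D) = r] = \tfrac{1}{n}, \qquad \Pr[M(D') = r] = 0,
\]

while every other record is output with probability 1/n under both datasets. The (0, δ)-DP condition \(\Pr[M(D) \in S] \le \Pr[M(D') \in S] + \delta\), taken with S = {r}, therefore requires δ ≥ 1/n, and δ = 1/n suffices: the mechanism satisfies (0, 1/n)-DP but nothing stronger.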

This reasoning, however, only applies to this specific mechanism, which would presumably not be used in practice. Practical (ε,δ)-DP mechanisms do not release records at random. For example, the δ from Laplace-based partition selection (Section 4.3.2) only applies to partitions with a single element in them, and the δ from Gaussian noise (Section 4.3.3) only means that the privacy loss can be higher than ε, but is unlikely to exceed it significantly. Conversely, even worse mechanisms (like releasing the entire dataset with probability δ) can also be (ε,δ)-DP, but do not seem reasonable for any non-zero value of δ [McS17b].

Thus, we argue that δ should be chosen differently depending on the exact mechanism. Vanishingly small values are necessary for some rare use cases, so that a catastrophic event is essentially guaranteed to never happen in practice, but larger values make sense in other scenarios: Facebook's URL dataset, for example, uses a comparatively large δ for its Gaussian noise [MDH20]. The δ could also conceivably depend on the data itself: for example, with Laplace-based partition selection, since the risk associated with the δ only depends on the number of partitions with a single user, we could first estimate the number of such partitions, and select the δ accordingly.

This suggests a way to simplify the interface of differentially private tooling: only surface ε to tool users, and choose δ automatically. A default can be chosen for some algorithms; for others, more information could be asked of the analyst (e.g. an order of magnitude of the number of partitions with a single user), or determined automatically in a data-dependent way, possibly using a small fraction of the privacy budget.
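To illustrate what such a data-dependent choice could look like, here is a minimal sketch under simplifying assumptions: each user contributes to at most one partition (so counts have sensitivity 1), partition selection is the usual “add Laplace noise to the count and threshold” approach, and the function name, default target probability, and union-bound reasoning are ours, not the exact mechanism of Section 4.3.2.

    import math

    def choose_partition_selection_parameters(
        epsilon: float,
        estimated_single_user_partitions: int,
        target_leak_probability: float = 1e-6,
    ):
        """Pick a per-partition delta and a Laplace threshold so that, by a
        union bound, the probability of releasing *any* single-user partition
        stays below target_leak_probability."""
        m = max(1, estimated_single_user_partitions)
        # Union bound over the estimated number of single-user partitions.
        delta = target_leak_probability / m
        # A single-user partition has a true count of 1; with Lap(1/epsilon)
        # noise, it is released when 1 + Lap(1/epsilon) >= tau, which happens
        # with probability 0.5 * exp(-epsilon * (tau - 1)). Solving for tau:
        tau = 1.0 + math.log(1.0 / (2.0 * delta)) / epsilon
        return delta, tau

    delta, tau = choose_partition_selection_parameters(
        epsilon=1.0, estimated_single_user_partitions=10_000)
    print(f"delta = {delta:.1e}, threshold = {tau:.1f}")

The estimate of the number of single-user partitions could be provided by the analyst, or computed in a noisy, budget-consuming way, as discussed above.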

Attack models and policies

Rolling out strong anonymization practices to a large number of use cases takes more than building tooling and picking reasonable parameter values. It is also essential to provide education and operational guidance to users of this tooling: even when the interface is simplified as much as possible, differential privacy tooling is still easy to apply incorrectly. In this section, we focus on a particular use case: long-lived anonymization pipelines, in the trusted curator model.

Note that this scenario is different from what is assumed in much of the DP literature, where an analyst is given differentially private query access to a private database. Relying solely on DP for this latter use case presents immense challenges: such a system must track privacy budgets and rate-limit queries, possibly split the budget across analysts, handle query errors in a way that does not reveal information but still allows well-meaning analysts to debug their queries, and prevent side-channel attacks [HPN11, AKM15]. LinkedIn's Audience Engagements API [RSP20] is perhaps the only public system actually operating under this model; it uses very large privacy budgets and a fairly restricted class of allowed queries.

Instead, we focus on a scenario in which analysts have some level of access to the original, raw data (presumably, with a number of security measures in place) and are trying to publish or share some anonymized statistics over this data. This is the model used in data releases such as the 2020 US Census  [Abo18b] or Facebook’s URL dataset  [MDH20]. In addition, we assume that this release is long-lived: the raw dataset grows over time, and new data is regularly published; this is the case for e.g. Google’s Community Mobility Reports  [ABC20] or Symptom Search Trends dataset  [BDD20].

In this scenario, the people running differentially private algorithms write one or several non-adversarial queries, make sure that they provide differential privacy guarantees with a fixed privacy budget, and release the output of those queries. Mistakes can still happen: for example, floating-point attacks from  [Mir12] can cause unexpected privacy loss even without ill intentions from the person running the tool. Perhaps more importantly, honest mistakes can also weaken or break privacy guarantees: users must be supported with education and operational guidance to make sure they understand how to use the tooling correctly.

Let us take a simple example. If a user repeatedly runs a differentially private pipeline, and no budget tracking mechanism is implemented across runs, the total privacy loss is unbounded, and differential privacy no longer provides any formal guarantee. Preventing even simple mistakes like this one is more complicated than it appears. Tracking the privacy budget in the tooling itself, and setting a cap on how much one can query the data, causes significant usability issues for legitimate users: analysts trying to decide which anonymization strategy to use for a specific data release often need a lot of experimentation, which is incompatible with such hard limits.
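For concreteness, here is a minimal sketch of what such cross-run tracking could look like (the class and its interface are hypothetical, not part of the tooling described in this thesis); it also shows where a hard cap starts getting in the way of experimentation.

    class PrivacyBudgetAccountant:
        """Tracks the total epsilon consumed by successive pipeline runs,
        using basic sequential composition, and enforces a hard cap."""

        def __init__(self, total_epsilon: float):
            self.total_epsilon = total_epsilon
            self.spent = 0.0

        def charge(self, run_epsilon: float) -> None:
            if self.spent + run_epsilon > self.total_epsilon:
                # This is where legitimate experimentation gets blocked:
                # analysts iterating on a strategy hit the cap quickly.
                raise RuntimeError("privacy budget exhausted for this dataset")
            self.spent += run_epsilon

    accountant = PrivacyBudgetAccountant(total_epsilon=5.0)
    accountant.charge(1.0)  # first run of the pipeline
    accountant.charge(1.0)  # re-run after a bug fix: the budget keeps shrinking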

Instead, analysts must understand the concept of a privacy budget: the collection of many individual outputs of anonymization pipelines is “less anonymized” than each in isolation. The data we publish cannot be easily recomputed and re-published: otherwise, the initial privacy budget computations will need to be adjusted. This is far from natural to non-experts. No other privacy or security property works this way: the union of encrypted data is still encrypted, adding data to an access-controlled directory does not change its access level, etc.

This underscores the need for policies that specifically address data generated for experimental purposes. Although such data might technically qualify as anonymized, good practices should still be followed, such as limiting access control lists to those who need access and setting short retention periods.

This aspect also has important implications on planning, testing, validation, and anomaly detection practices: finding out before publication that some data is erroneous is crucial, and since manual validation does not scale, this requires significant investment into automated testing. Further, it is also a good practice to plan for extra privacy budget in case something does need to be recomputed: technical changes, like underlying inference algorithms being improved over time, can unexpectedly impact the utility of the published metrics, and testing is always imperfect. Techniques like scaling factors  [ABC20] can help reduce the privacy cost of such events, but not avoid it entirely.

Finally, the need to experiment and to validate the output data also requires a somewhat relaxed view of what counts towards the privacy budget. The very idea of running experiments on the raw data to develop the final anonymization strategy might sound shocking to theorists! With an extremely principled approach to differential privacy, every decision we make based on the data influences what eventually gets published (and when!), and should therefore be counted as part of the privacy budget. So if multiple anonymization strategies are compared, and one ends up being chosen, all of them should count towards the privacy budget, or the choice itself should be made in a differentially private manner. Similarly, since the result of automated testing ends up being visible to a potential attacker (who could notice if data is held back), testing should itself be differentially private, and count towards the budget.

We argue that such a view is entirely incompatible with practical requirements. Nobody will agree to publish noisy data without first making sure that the result is usable, which typically requires many rounds of trial and error. Good testing also means not relying on any single piece of infrastructure or logic to validate the data: checking the accuracy of differentially private data against other noisy data, when the raw data is available, is nonsensical to data analysts who require strong guarantees on data quality.

Simply not counting any of this towards the privacy budget feels somewhat uncomfortable. As DP mechanisms get more and more complex to optimize privacy/accuracy trade-offs, they also get more hyperparameters, which can be fine-tuned to maximize the mechanism's utility on a given dataset. Running a very large number of simulations to find the best values for these parameters seems unwise: the hyperparameter values themselves might leak private data…

One possible compromise is to count automatic tuning of parameters towards the privacy budget, but not manual experimentation. For example, the data-dependent method for choosing δ sketched earlier in this section would not count towards the total privacy budget if the user manually provides an order of magnitude of the number of small partitions in the data, but would consume some budget if this determination is done automatically (and, presumably, more precisely).

In the case of testing, this criterion does not quite work: automated testing is necessary for long-lived anonymization pipelines and, as we mentioned before, cannot realistically be done in a DP manner. How can we avoid inadvertent information leakage? One mitigation strategy is to ensure that the effects of testing are coarse enough not to reveal significant insights. For example, publishing all metrics that pass some quality tests while holding back the others seems unwise. However, having some global tests and preventing the publication of all metrics when those tests fail is, intuitively, very unlikely to cause any privacy issue.
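A minimal sketch of the difference between these two approaches (the function and the check interface are hypothetical):

    def publish_with_global_gate(noisy_metrics: dict, quality_checks) -> dict:
        # All-or-nothing release: if any global quality check fails, nothing is
        # published, so the outcome of testing reveals a single coarse bit
        # rather than a per-metric pattern of what was held back.
        if all(check(noisy_metrics) for check in quality_checks):
            return noisy_metrics
        return {}  # hold back the entire release and investigate offline

    # By contrast, a per-metric filter such as
    #     {name: value for name, value in noisy_metrics.items() if looks_sane(value)}
    # silently encodes, in the set of published metrics, information that was
    # never accounted for in the privacy budget.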

These considerations must be translated into education and guidance, which requires significant investment: one cannot simply build the right tooling and consider the problem solved. This should not be surprising: in any technological system providing security or privacy properties, honest mistakes are much more likely than ill-intentioned attacks, and relying on technology alone often fails to deliver the desired outcomes.
