Part of a series on differential privacy. You might want to start with the previous articles below!
- Why differential privacy is awesome presents a non-technical explanation of the definition.
- Differential privacy in (a bit) more detail introduces the formal definition, with very little math.
- Differential privacy in practice (easy version) explains how to make simple statistics differentially private.
- Almost differential privacy describes how to publish private histograms without knowing the categories in advance.
- Local vs. global differential privacy presents the two main models of differential privacy, depending on who the attacker is.
- The privacy loss random variable explains the real meaning of \((\varepsilon,\delta)\)-differential privacy.
- The magic of Gaussian noise introduces Gaussian noise and its shiny properties.
- Why not differential privacy? explores what it means for an algorithm to not be differentially private.
- Getting more useful results with differential privacy (this article) presents five simple techniques to improve the utility of your anonymized data.
This post was co-written by Daniel Simmons-Marengo and myself.
So you decided to use differential privacy (DP) to publish or share some statistical data. You ran your first pipeline or query1, all excited, but then… You're getting useless results. Maybe all your data seems to have disappeared. Or the statistics look very unreliable: the noise completely drowns out the useful signal. Don't lose hope! This situation is common the first time people try to use differential privacy. And chances are that you can fix it with a few simple changes.
In this post, we'll cover five basic strategies to improve the utility of your anonymized data. These are far from the only tricks you can use, but they're a good place to start. And none of them requires you to sacrifice any privacy guarantees.
Aggregate as much data, as coarsely, as possible
DP algorithms produce better results when run over more data. Remember: the noise they introduce is proportional to the contribution of a single person. It doesn't depend on the size of the input data. So, the more people you have in a statistic, the smaller the relative noise will be. Individual contributions will "vanish into the crowd", and so will the added uncertainty.
Increasing the total amount of input data will improve utility, but you may not be able to get more data. Luckily, there are other ways to take advantage of this property. What matters is the amount of data that contributes to each statistic you produce. In other words, the finer you slice and dice your data, the worse your utility will be. If you can, slice the data into coarser partitions. For example, calculate weekly statistics rather than daily statistics. Or aggregate your data per-country rather than per-city. You get the idea.
Another common trick is to slice by fewer dimensions at the same time. Suppose that your query calculates the number of visitors by country and language. Do you need the combination of both dimensions? Many combinations of country and language are rare, and will only have a few visitors. Instead, calculating them separately might work better: you will get more users in each statistic, so the overall impact of the noise might be lower.
Minimize the number of aggregations
DP bounds the total amount you can learn about any individual from a series of data releases. Every statistic you calculate reveals some information about individuals in the data. To bound the global privacy cost of a data release, you have to set a privacy budget. This is the total \(\varepsilon\) and \(\delta\) cost of your set of calculations.
Each statistic spends part of this privacy budget. So if you have a fixed privacy budget, and you want to calculate more statistics, each one must reveal less. That means more noise must be added to it. Limiting how much noise is needed to protect many statistics is an active area of research. The scientific literature is full of clever tricks to that end. But the best solution is often the simplest one: calculate fewer statistics.
OK, this is a bit abstract. How can you decrease the number of statistics you calculate, in practice? Here are some common strategies.
- Remove metrics. For example, if you're calculating both the number of page views and the number of unique visitors… Could you, instead, use only one of the two?
- Remove dimensions. Do you need to calculate the number of visitors per country and per language? Or would only one of the two get you the information you need?
- Remove time periods. Do you need to calculate the number of unique visitors in the past day, week, month and year? Or would one or two of these statistics be enough?
- Remove "sliding windows". What's a sliding window? Suppose that every day, you calculate e.g. the number of visits in the past week. In that case, each data point will count towards seven separate statistics… Would calculating that metric only once a week do the trick, instead?
Split the privacy budget unevenly
This trick is related to the previous one. Suppose that you reduced the number of aggregations, but you still have several. The idea is to split your total privacy budget unevenly between them. Say your total privacy budget is \(\varepsilon=1\) and you have five statistics. You don't have to allocate \(\varepsilon=0.2\) to each of them! You could instead use \(\varepsilon=0.8\) for one statistic, and \(\varepsilon=0.05\) for all others.
Splitting the privacy budget unevenly is useful in two common situations.
- You care much more about the accuracy of some statistics than others. In that case, you might want to allocate a bigger portion of the privacy budget to the most important ones.
- Some of your statistics are more fine-grained than others. For example, suppose that you calculate both daily and weekly statistics. On average, weekly statistics will have 7 times more data than daily statistics: you could use a budget that is 7 times smaller for them. In doing so, the relative impact of the noise will be about the same for both.
Set aggressive contribution bounds
Most DP algorithms bound each individual's contribution to each statistic. For example, if you're counting the number of visits per web page, you need to bound:
- the number of different pages each individual can contribute to;
- and the number of times each individual can visit each page.
How should you pick these bounds? It’s tempting to use generous bounds that will cover any conceivable input. But this is usually bad for utility: the magnitude of the added noise grows with the bounds. There is a tradeoff between two sources of error:
- larger bounds will lose less data, but require more noise;
- smaller bounds might lose more data, but require less noise.
Often, reducing the level of noise on your whole dataset is worth truncating a few outliers. The best cut-off depends on the distribution of your dataset: the 95th percentile might work well for one dataset, while another might do better with the 98th. In most use cases though, you’ll reach optimal utility when part of your dataset exceeds the bounds.
Note that some systems don't make you specify these bounds. Instead, they can be automatically calculated by an auxiliary DP algorithm2. But this operation uses some part of the privacy budget! In that case, if you specify the bounds by hand instead, you can save that part of the budget, and get less noisy results.
Use public partitions
Most DP pipelines produce many statistics, grouped by partitions. Partitions are like the buckets in a histogram: in the example where we count the number of visits per web page, each page is a partition. By default, pipelines only release statistics for some partitions in the dataset. Typically, the partitions with most people in them are kept, and the ones with few users are dropped. This process can reveal information about individuals: it must be done in a DP way3. Like before, we need to use part of the privacy budget for this, and add extra noise to the statistics to compensate.
You can skip this step by listing all the partitions you want to appear in your output before you run the query. If you do so using non-private data, it is no longer necessary to choose partitions in a DP manner. This allows you to save budget, and return more partitions.
There is a downside: all partitions that you specified will appear in the output, even if they have little or no data in the dataset. In that case, they can be extremely noisy. Still, if you can list the partitions you want in advance, this is often an excellent technique. You can see how it works in practice in e.g. the Privacy on Beam codelab. Note that you're not allowed to look at the private data to build your list of public partitions!
More clever ideas
There are a myriad other techniques out there to squeeze more utility out of DP pipelines. Most of them are more complex than the ones listed in this post. Some might be the topic of future blog posts! In the meantime, here are three other, more generic suggestions.
- You can try looking up your problem in your search engine of choice, and add "differential privacy". This will often dig up relevant literature. Unless you're lucky, you won't find a readily available implementation. But you might get valuable ideas or insights!
- You can send a friendly message to one of the communities out there working on DP tooling. Between Google, OpenMined, or OpenDP, someone might be happy to help! (And try to convince you to use their tooling :D)
- You can also try to think about how sensitive your problem is to individual changes in the data. If a single person changes their data, will you come to different decisions or results? If yes, your use case might be fundamentally incompatible with DP, and no clever trick will fix that.
You did not roll your own tooling, right? Doing so is generally unwise; implementing DP correctly is much more tricky that you'd expect. There are some excellent open-source tools out there that you should use instead. Like these libraries that my team at Google published! ↩