Ted is writing things

On privacy, research, and privacy research.

Getting more useful results with differential privacy

— updated

This post is part of a series on differential privacy. Check out the table of contents to see the other articles!

This post was co-written by Daniel Simmons-Marengo and myself.

So you decided to use differential privacy (DP) to publish or share some statistical data. You ran your first pipeline or query1, all excited, but then… You're getting useless results. Maybe all your data seems to have disappeared. Or the statistics look very unreliable: the noise completely drowns out the useful signal. Don't lose hope! This situation is common the first time people try to use differential privacy. And chances are that you can fix it with a few simple changes.

In this post, we'll cover five basic strategies to improve the utility of your anonymized data. These are far from the only tricks you can use, but they're a good place to start. And none of them requires you to sacrifice any privacy guarantees.

Aggregate as much data, as coarsely, as possible

DP algorithms produce better results when run over more data. Remember: the noise they introduce is proportional to the contribution of a single person. It doesn't depend on the size of the input data. So, the more people you have in a statistic, the smaller the relative noise will be. Individual contributions will "vanish into the crowd", and so will the added uncertainty.

Increasing the total amount of input data will improve utility, but you may not be able to get more data. Luckily, there are other ways to take advantage of this property. What matters is the amount of data that contributes to each statistic you produce. In other words, the finer you slice and dice your data, the worse your utility will be. If you can, slice the data into coarser partitions. For example, calculate weekly statistics rather than daily statistics. Or aggregate your data per-country rather than per-city. You get the idea.

Visual aid showing the data represented as a square cut into four big pieces
on the left, and the same square cut into sixteen pieces on the right, with an
arrow going from left to right between both

Another common trick is to slice by fewer dimensions at the same time. Suppose that your query calculates the number of visitors by country and language. Do you need the combination of both dimensions? Many combinations of country and language are rare, and will only have a few visitors. Instead, calculating them separately might work better: you will get more users in each statistic, so the overall impact of the noise might be lower.

Minimize the number of aggregations

DP bounds the total amount you can learn about any individual from a series of data releases. Every statistic you calculate reveals some information about individuals in the data. To bound the global privacy cost of a data release, you have to set a privacy budget. This is the total \(\varepsilon\) and \(\delta\) cost of your set of calculations.

Each statistic spends part of this privacy budget. So if you have a fixed privacy budget, and you want to calculate more statistics, each one must reveal less. That means more noise must be added to it. Limiting how much noise is needed to protect many statistics is an active area of research. The scientific literature is full of clever tricks to that end. But the best solution is often the simplest one: calculate fewer statistics.

Visual aid showing many round circles representing aggregations, and an arrow
pointing to the right with fewer

OK, this is a bit abstract. How can you decrease the number of statistics you calculate, in practice? Here are some common strategies.

  • Remove metrics. For example, if you're calculating both the number of page views and the number of unique visitors… Could you, instead, use only one of the two?
  • Remove dimensions. Do you need to calculate the number of visitors per country and per language? Or would only one of the two get you the information you need?
  • Remove time periods. Do you need to calculate the number of unique visitors in the past day, week, month and year? Or would one or two of these statistics be enough?
  • Remove "sliding windows". What's a sliding window? Suppose that every day, you calculate e.g. the number of visits in the past week. In that case, each data point will count towards seven separate statistics… Would calculating that metric only once a week do the trick, instead?

Visual aid showing fourteen successive days, with many arrows on the top
covering all possible one-week periods within these fourteen days. An arrow
points to the right, where there are only two arrows for two successive

Split the privacy budget unevenly

This trick is related to the previous one. Suppose that you reduced the number of aggregations, but you still have several. The idea is to split your total privacy budget unevenly between them. Say your total privacy budget is \(\varepsilon=1\) and you have five statistics. You don't have to allocate \(\varepsilon=0.2\) to each of them! You could instead use \(\varepsilon=0.8\) for one statistic, and \(\varepsilon=0.05\) for all others.

Splitting the privacy budget unevenly is useful in two common situations.

  1. You care much more about the accuracy of some statistics than others. In that case, you might want to allocate a bigger portion of the privacy budget to the most important ones.
  2. Some of your statistics are more fine-grained than others. For example, suppose that you calculate both daily and weekly statistics. On average, weekly statistics will have 7 times more data than daily statistics: you could use a budget that is 7 times smaller for them. In doing so, the relative impact of the noise will be about the same for both.

Visual aid showing one ε represented as a rectangle split in four equal parts
labeled ε₁, ε₂, ε₃ and ε₄ on the left, and an arrow pointing right to the same
rectangle split in four parts, with the same labels but very different

Set aggressive contribution bounds

Most DP algorithms bound each individual's contribution to each statistic. For example, if you're counting the number of visits per web page, you need to bound:

  • the number of different pages each individual can contribute to;
  • and the number of times each individual can visit each page.

How should you pick these bounds? It’s tempting to use generous bounds that will cover any conceivable input. But this is usually bad for utility: the magnitude of the added noise grows with the bounds. There is a tradeoff between two sources of error:

  • larger bounds will lose less data, but require more noise;
  • smaller bounds might lose more data, but require less noise.

Often, reducing the level of noise on your whole dataset is worth truncating a few outliers. The best cut-off depends on the distribution of your dataset: the 95th percentile might work well for one dataset, while another might do better with the 98th. In most use cases though, you’ll reach optimal utility when part of your dataset exceeds the bounds.

Note that some systems don't make you specify these bounds. Instead, they can be automatically calculated by an auxiliary DP algorithm2. But this operation uses some part of the privacy budget! In that case, if you specify the bounds by hand instead, you can save that part of the budget, and get less noisy results.

Visual aid showing two histograms of contribution size per user, with an arrow
pointing from the left to the right histogram. A horizontal line representing
the contribution bound splits the histogram in two parts, the top part (which
represents the part of the contribution that is dropped) is of a paler color.
The bound is high on the left and lower on the

Use public partitions

Most DP pipelines produce many statistics, grouped by partitions. Partitions are like the buckets in a histogram: in the example where we count the number of visits per web page, each page is a partition. By default, pipelines only release statistics for some partitions in the dataset. Typically, the partitions with most people in them are kept, and the ones with few users are dropped. This process can reveal information about individuals: it must be done in a DP way3. Like before, we need to use part of the privacy budget for this, and add extra noise to the statistics to compensate.

You can skip this step by listing all the partitions you want to appear in your output before you run the query. If you do so using non-private data, it is no longer necessary to choose partitions in a DP manner. This allows you to save budget, and return more partitions.

There is a downside: all partitions that you specified will appear in the output, even if they have little or no data in the dataset. In that case, they can be extremely noisy. Still, if you can list the partitions you want in advance, this is often an excellent technique. You can see how it works in practice in e.g. the Privacy on Beam codelab. Note that you're not allowed to look at the private data to build your list of public partitions!

Visual aid showing two histograms of number of people per partitions, with an
arrow pointing from the left to the right histogram. A horizontal line labeled
"threshold" splits the left histogram in two: the partitions below the threshold
are paler, to represent them being removed from the output. The right histogram
has no threshold, however additional partitions are added to the output, to
represent the public partitions that are not present in the data but are still
present in the output.

More clever ideas

There are a myriad other techniques out there to squeeze more utility out of DP pipelines. Most of them are more complex than the ones listed in this post. Some might be the topic of future blog posts! In the meantime, here are three other, more generic suggestions.

  • You can try looking up your problem in your search engine of choice, and add "differential privacy". This will often dig up relevant literature. Unless you're lucky, you won't find a readily available implementation. But you might get valuable ideas or insights!
  • You can send a friendly message to one of the communities out there working on DP tooling. Between Google, OpenMined, or OpenDP, someone might be happy to help! (And try to convince you to use their tooling :D)
  • You can also try to think about how sensitive your problem is to individual changes in the data. If a single person changes their data, will you come to different decisions or results? If yes, your use case might be fundamentally incompatible with DP, and no clever trick will fix that.

Want to keep learning more about differential privacy? Head over to the table of contents of this blog post series to see its other articles.

  1. You did not roll your own tooling, right? Doing so is generally unwise; implementing DP correctly is much more tricky that you'd expect. There are some excellent open-source tools out there that you should use instead. Like these libraries that my team at Google published! 

  2. Section 5.1.1 of our paper describes such an algorithm, implemented e.g. here

  3. This operation is described in a previous blog post, and is also the topic of a paper I co-authored

All opinions here are my own, not my employer's.   |   Feedback on these posts is very welcome! Please reach out via e-mail (se.niatnofsed@neimad) or Twitter (@TedOnPrivacy) for comments and suggestions.   |   Interested in deploying formal anonymization methods? My colleagues and I at Tumult Labs can help. Contact me at oi.tlmt@neimad, and let's chat!