Don't worry, your data's noisy
This post is part of a series on differential privacy. Check out the table of contents to see the other articles!
Here is a cold, hard, inescapable truth: your data has noise in it. No, we're not talking about differential privacy (yet)! Nobody added random numbers to your statistics so far. But still, your data is noisy. It's wrong. It's not 100% accurate. It's uncertain. Worse, there are two kinds of uncertainty.
Two kinds of uncertainty
Let's start with the first kind. Here's a diagram.
The point on the left is how most people think about statistics: precise numbers that correspond to exact truths. Take, for example, a database with eye color information in it. Here is a statistic: « there are 4217 people with brown eyes in this database ». What did you learn about the world when reading this statistic? One answer could be « well, I've learned that the dataset contains 4217 people with brown eyes ». That sounds reasonable enough. Right?
At best, you've learned that the number of people with brown eyes in the database is around 4217. It might be the best estimate you've got. But is this statistic 100% accurate? Would you bet money on it being 100% accurate? What would it take for you to be willing to take such a bet?
I can hear your objections already. « Whoa, wait a second. What does 100% accurate mean? How is eye color defined exactly? How do we double-check the number? And how was this statistic generated, anyway? Did each person answer a question about their own eye color? Or did someone else do the classifying? Is eye color information available for 100% of the people in the database? » And so on. You probably have many other valid questions. Behind those questions lies the deep, uncomfortable truth: this statistic is almost certainly noisy.
At the very least, you need to add error bars if you want to represent this statistic in an honest way. This is what happens on the right side of the previous picture. We've got an exact number, represented by a point, and a confidence interval (or error bars) around this point. This statistic no longer says « there are exactly 4217 people with brown eyes in this database ». Instead, it says something more complex, but more accurate. « With 95% certainty, there are between 4209 and 4226 brown-eyed people in the database. The best estimate we've got is 4217. »
Wait a second.
Right. How was this confidence interval computed? What do we mean by 95% certainty? What is the uncertainty capturing? Did we miss some sources of uncertainty? Is there uncertainty in this uncertainty estimate[1]?
This leads me to my second point, even more devastating than the first. Your data is noisy, and also, you probably don't even know how much noise is in it. The statistics you have are best guesses, not much more.
The figure on the right represents this unknown uncertainty. You suspect that the number you have isn't too far removed from the actual number. But you can't quantify how far.
Let's take a closer look at both kinds of uncertainty. Quantifiable uncertainties can take several forms.
- Your data might be a uniform sample of a larger population. In this case, you can calculate the sampling error of your statistics.
- A large-scale data collection process might miss a few events. This happens, for example, when servers crash, or network problems occur. But you might be able to estimate how often that happens, and say the uncertainty is below e.g. 1%.
- The accuracy of certain algorithms can be measured. For example, you train machine learning algorithms on a training set, and evaluate them on a test set.
- If humans are labelling data by hand, you can have several analysts label the same data. This way, you can get an idea of how often they agree, and estimate uncertainty this way.
After quantifying the uncertainty, you can take it into account in the analyses you run. For example, you can propagate the error bars. Or only return results if they are very likely to still be valid despite the sources of error.
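To make the first bullet point concrete, here is a minimal sketch of a sampling error calculation. It assumes a uniform sample and uses the usual normal approximation for a proportion. The function name and all the numbers are made up for illustration.

```python
import math

def sampled_count_interval(hits, sample_size, population_size, z=1.96):
    """Estimate a population count from a uniform sample.

    Returns (low, best, high), where [low, high] is an approximate 95%
    confidence interval when z=1.96. Assumes a uniform sample, and ignores
    the finite-population correction.
    """
    p = hits / sample_size
    # Standard error of a proportion, under the normal approximation.
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    best = p * population_size
    margin = z * standard_error * population_size
    return best - margin, best, best + margin

# Hypothetical numbers: 420 brown-eyed people in a uniform sample of 1000,
# drawn from a population of 10000.
low, best, high = sampled_count_interval(420, 1000, 10000)
print(f"best estimate: {best:.0f}, interval: [{low:.0f}, {high:.0f}]")
```

This only captures the sampling error. If the sample isn't actually uniform, the interval looks just as precise, but it no longer means what it claims to mean.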
Meanwhile, unquantifiable uncertainties can also come from several places.
- In practice, statistical sampling is often not uniform: your statistics might suffer from selection bias or survivorship bias. You can guesstimate the effect of those on your data, but it's hard to be 100% sure.
- When you ask people questions, their answers might not be accurate. This is called reporting bias. There are many reasons why it can happen, and the overall impact is also difficult to estimate.
- People sometimes dig into data until they find something interesting to report. This practice, called data dredging, generates good-looking, but completely meaningless results. This is made worse by the fact that positive results are more likely to get published. Both phenomena lead to real problems in practice! They create inherent uncertainty in the results reported in the scientific literature. And again, it's hard to quantify this uncertainty.
Sometimes, the situation is even worse than that. People might use ad hoc protection techniques that add some fuzziness to the data, and not tell you about it. Or they might give you an idea, but no details. This is what the US Census Bureau did for their 2000 and 2010 releases: they randomly swapped records, but didn't publish any details about how that procedure worked. Sadly, it didn't even succeed in protecting the data. But it did make the entire dataset noisy in ways nobody could find out or take into account.
Often, both quantifiable and unquantifiable effects end up affecting your data. So you end up with something like this.
The quantifiable uncertainty, for which you can draw confidence intervals, is in blue. But there is still some unquantifiable uncertainty, here in brown: you should take the whole thing, including the error bars, with a grain of salt.
You probably see what's coming next. What if we add noise to the statistics, to get differential privacy guarantees?
Adding differential privacy to the mix
Differential privacy is typically obtained by adding noise to statistics. We pick a number at random according to some distribution, and add it to the statistics. This distribution isn't secret, only the random number is. So this is the nice kind of uncertainty: the one we can quantify. If you already have some quantifiable uncertainty, you can combine both, and get a single confidence interval. And of course, the whole thing might still be uncertain in ways we can't quite estimate.
We didn't change the situation much. We only made the confidence interval a little bit wider. The quantifiable uncertainty increased a little, and we could quantify how much.
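Here is a rough sketch of what combining both can look like. It assumes the existing uncertainty is a normally distributed sampling error and that the DP noise is Laplace noise, which is one common choice (the text above doesn't say which distribution is used). The combined interval is estimated by simulation. The function names and numbers are hypothetical, and Python's `random` module is fine for analyzing noise like this, but is not a safe way to generate noise in a real DP system.

```python
import math
import random

def laplace_half_width(scale, confidence=0.95):
    # For Laplace noise with scale b, P(|noise| > t) = exp(-t / b),
    # so the half-width at a given confidence level is b * ln(1 / (1 - confidence)).
    return scale * math.log(1 / (1 - confidence))

def combined_half_width(sampling_std, laplace_scale, confidence=0.95,
                        trials=100_000, seed=0):
    """Simulate sampling error + Laplace noise, return the empirical half-width."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        sampling_error = rng.gauss(0, sampling_std)
        # The difference of two exponential variables is Laplace-distributed.
        dp_noise = (rng.expovariate(1 / laplace_scale)
                    - rng.expovariate(1 / laplace_scale))
        errors.append(abs(sampling_error + dp_noise))
    errors.sort()
    return errors[int(confidence * trials)]

sampling_only = 1.96 * 4.0                      # assumed pre-existing error bars
dp_only = laplace_half_width(1.0)               # Laplace noise of scale 1
combined = combined_half_width(sampling_std=4.0, laplace_scale=1.0)
print(f"sampling: ±{sampling_only:.2f}, DP: ±{dp_only:.2f}, both: ±{combined:.2f}")
```

With these made-up numbers, the combined interval is only a little wider than the one from sampling alone, and noticeably narrower than what you would get by adding the two half-widths together.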
Some DP algorithms are more complex, and add noise to data in more creative ways. Computing the uncertainty for these algorithms can be difficult. But most building blocks can be analyzed to find out confidence intervals. This might be a bit of a pain, so in an ideal world, you don't have to do this by hand: the tooling you use does it for you.
Sometimes, though, you get the not-so-nice kind of uncertainty. For example, you might need to limit the contribution of each person in your database. This can take several forms, like clamping or subsampling. This creates additional error, or uncertainty, on top of the noise itself. But this one is hard to quantify: in theory, someone could contribute a billion things, and we would only count 5 of them. In that case, the error due to this single outlier would be huge.
If you want to quantify this effect, you have two options. The people publishing the data can tell you the magnitude of data loss due to clamping. If that didn't happen, you can make some reasonable assumptions on these outliers: how many there are, and how much data was dropped. In both cases, note that this clamping is often positive for data quality: robust statistical analyses shouldn't be sensitive to huge outliers.
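As an illustration, here is a small sketch of what clamping does, and of the kind of data loss figure a data publisher could report. The function name, the people, and their contributions are all hypothetical. Note that a real release would likely also need to protect this data loss figure, since it depends on the raw data.

```python
def clamp_contributions(counts_per_person, max_per_person):
    """Cap each person's contribution, and report how much data was dropped."""
    clamped_total = sum(min(count, max_per_person)
                        for count in counts_per_person.values())
    true_total = sum(counts_per_person.values())
    dropped = true_total - clamped_total
    dropped_fraction = dropped / true_total if true_total else 0.0
    return clamped_total, dropped, dropped_fraction

# Three ordinary contributors, and one extreme outlier.
contributions = {"alice": 3, "bob": 5, "carol": 2, "outlier": 1_000_000_000}
clamped_total, dropped, dropped_fraction = clamp_contributions(contributions, 5)
print(clamped_total, dropped, f"{dropped_fraction:.6%}")
```

Here, nearly all of the raw total comes from the single outlier, and clamping removes it. Whether that counts as a large error or as a welcome robustness fix depends on what the statistic is supposed to measure.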
Differential privacy's effect on data isn't as world-ending as you might think. Your data didn't go from perfect to noisy. It was noisy all along! DP only made it a little more so. And you can quantify the effect of this new noise, which is nice. (At least, nicer than some of the uncertainty sources that were already there.)
This might come as a surprise. You might have been using your data as if it were a perfect source of absolute truth. Not the most scientific approach, but maybe that was good enough for your application. In that case… you can probably continue doing the same with DP data? If the people who designed the DP process did a decent job, the statistics shouldn't be too far from the real data. Unless you were looking at very small statistics, of course. But in that case, you were looking at pure statistical noise in the first place.
If you were already taking uncertainty into account, DP doesn't change much. It only adds a new, quantifiable element to this uncertainty. With one caveat: you need to know exactly what was done to the data. Privacy parameters aren't enough for you to guess. You need to know the shape and magnitude of the noise, and all other ways in which the data was altered.
This also means that DP practitioners should give you this information. People creating and implementing DP algorithms share part of this responsibility, too. Novel algorithms should come with a way to calculate arbitrary confidence intervals. Tools should return uncertainty information along with the output: nobody wants to compute it by hand.
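What could this look like in practice? Below is one possible shape for it, as a sketch rather than the API of any existing tool. Every name here is invented, and it assumes a simple count protected with Laplace noise, where each person changes the count by at most 1. The noise generation uses Python's `random` module to keep the example short, which is not safe for real-world DP.

```python
import math
import random
from dataclasses import dataclass

@dataclass
class NoisyCount:
    value: float
    noise_distribution: str  # published alongside the value, not kept secret
    noise_scale: float

    def interval(self, confidence=0.95):
        """Confidence interval accounting for the DP noise only."""
        half_width = self.noise_scale * math.log(1 / (1 - confidence))
        return self.value - half_width, self.value + half_width

def release_count(true_count, epsilon, rng):
    # Assumption: each person changes the count by at most 1 (sensitivity 1).
    scale = 1 / epsilon
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return NoisyCount(true_count + noise, "laplace", scale)

released = release_count(4217, epsilon=0.5, rng=random.Random(42))
low, high = released.interval()
print(f"{released.value:.1f}, 95% interval: [{low:.1f}, {high:.1f}]")
```

The point is that the person receiving `released` doesn't have to guess anything from the privacy parameters: the shape and magnitude of the noise travel with the number.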
In a few words: transparency is key. It brings trust. It makes the data more useful. And it can show that the accuracy impact of DP is smaller than people might think!
For more musings on differential privacy, head over to the table of contents of this blog post series.
I'm thankful to Cynthia Dwork and danah boyd for their helpful feedback on drafts of this blog post.
[1] This is not a silly question! And it can go further. I have seen physicists calculating confidence intervals for confidence interval bounds for confidence interval bounds. It is a real thing that very serious scientists sometimes do.