Ted is writing things

On privacy, research, and privacy research.

Research highlights: Privacy attacks on statistical data

If you publish many statistics about a sensitive dataset, and these statistics are reasonably accurate, then an attacker can reconstruct part of the original dataset, using only the statistics. This is the statement of a simple but devastating theorem, proven by Irit Dinur and Kobbi Nissim twenty years ago in a seminal paper.

[Diagram: sensitive data about individuals → a program generating (possibly noisy) statistics → a bunch of statistics → a mathematical process (e.g. solving equation systems) → reconstructed data, matching the private data for many rows.]
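
To make this concrete, here is a minimal toy version of such an attack, in Python. This is my own illustrative sketch, not the construction from the paper: each person has a single secret bit, the published statistics are noisy counts over random subsets of people, and the attacker recovers most bits by solving the over-determined system with ordinary least squares. The sizes and the noise magnitude are arbitrary.

```python
# Toy Dinur-Nissim-style reconstruction: recover secret bits from noisy subset counts.
# All parameters (50 people, 400 statistics, noise in [-2, 2]) are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)

n = 50                                         # number of individuals
secret = rng.integers(0, 2, size=n)            # one sensitive bit per person

m = 400                                        # number of published statistics
queries = rng.integers(0, 2, size=(m, n))      # each statistic counts a random subset
noise = rng.integers(-2, 3, size=m)            # small bounded noise added to each count
answers = queries @ secret + noise             # the published (noisy) statistics

# The attack: find bits consistent with the published counts.
# Least squares followed by rounding is enough for this toy example.
estimate, *_ = np.linalg.lstsq(queries, answers, rcond=None)
reconstructed = (estimate > 0.5).astype(int)

print("fraction of secret bits reconstructed correctly:",
      (reconstructed == secret).mean())
```

Even though every published count is noisy, most of the secret bits typically come back correct.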

This fact, later called the Fundamental Law of Information Recovery, is bad news: it means that there is some kind of inherent privacy leakage whenever one publishes information about a dataset. It also means that this leakage still exists even when information doesn't seem too revealing, like aggregate statistics.

After this work was published, the scientific community started looking for better ways to understand this trade-off between statistical utility and privacy leakage. Differential privacy was invented a few years later, and everyone moved to this more robust approach to releasing insights from sensitive data. This solved the thorny problem of privacy-safe data sharing and publication once and for all.

… just kidding. Instead, people outside of academia largely ignored this line of work. It was easy to do so: the fundamental law of information recovery is a pretty theoretical attack. Its applicability to real-world data releases was unclear, so it didn't seem very urgent to move to provably robust approaches.

This all changed a few years ago, when the U.S. Census Bureau ran a reconstruction attack on their own data. This attack got a lot of attention for multiple reasons.

  • It was performed on a real dataset (the data collected by the Census).
  • The queries used in the data publication were not chosen by the attacker, but corresponded to a real-world workload (the 2010 Census tabulations).
  • The attack was very successful, demonstrating worrying rates of reconstruction and re-identification.

Nonetheless, the attack was criticized by some researchers. The U.S. Census Bureau initially did not provide a lot of technical details, so there were misunderstandings about the methods and the results. And there were some philosophical objections as well:

  • "The demographic data used as input does not seem very sensitive. Is the attacker really learning anything problematic?"
  • "Sure, the attacker can reconstruct some of the data. But they have no way of knowing whether each reconstructed record is accurate! It doesn't really feel like a privacy breach if there's still a lot of uncertainty."
  • "These attacks only work when a ton of statistics are published from the same data. The majority of real-world data releases are much simpler, so they should be safe."

Recent work shows that these arguments are, sadly, overly optimistic. Over the past few years, multiple papers have made it clear that the privacy risk from statistical releases is real and serious. In this blog post, I'll list some of these recent developments, with links to the original sources if you want to learn more.

How sensitive is the information?

The attack run by the U.S. Census Bureau took the example of an attacker learning someone's declared race and ethnicity. This is the kind of information that must be protected according to U.S. law, so the agency saw this as a real issue. Some people, however, have argued that such a privacy failure did not translate to real-world harm, and that the risk was minimal. This raises the question: what other kinds of harm can come out of privacy attacks on demographic datasets?

Two follow-up papers answered that question, and showed that such statistical data releases can actually reveal extremely sensitive information.

In The Risk of Linked Census Data to Transgender Youth, Os Keyes and Abraham D. Flaxman show that badly-protected statistics can reveal people's transgender identity. The idea is scarily simple: reconstruct data from successive releases, and re-identify the records of people who declared a different gender from one release to the next. In the current political climate, it's not hard to imagine how individuals targeted by such an attack could suffer severe consequences.

[Diagram: demographic data for some year → demographic statistics → reconstructed and re-identified data for that year; the same pipeline for a later year; a red arrow compares the successive releases to find who reported a different gender (or sex).]

Another example is presented in Quantifying Privacy Risks of Public Statistics to Residents of Subsidized Housing, by Ryan Steed, Diana Qing, and Zhiwei Steven Wu. The paper shows how public statistics can be attacked to reveal which households violate occupancy guidelines in subsidized housing. If a working-class family grows to have three children in a single bedroom, or if a household is providing temporary accommodation to a family member, they could be evicted by their landlord or property manager.

[Figure 1 from the paper, outlining the experimental approach.]
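
To make the inference concrete, here is a small sketch using hypothetical reconstructed household sizes. The "2 persons per bedroom, plus 1" occupancy guideline is an assumption I'm using for illustration; it is not necessarily the rule studied in the paper.

```python
# Toy occupancy inference on hypothetical reconstructed data.
reconstructed_household_sizes = {   # unit id -> reconstructed number of residents
    "unit 101": 3,
    "unit 102": 6,
    "unit 103": 2,
}
bedrooms = {"unit 101": 1, "unit 102": 2, "unit 103": 1}

def exceeds_guideline(residents, n_bedrooms):
    # Assumed guideline: at most 2 persons per bedroom, plus 1.
    return residents > 2 * n_bedrooms + 1

flagged = [unit for unit, size in reconstructed_household_sizes.items()
           if exceeds_guideline(size, bedrooms[unit])]
print("households flagged as potentially over-occupied:", flagged)
```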

In both cases, the privacy risk also leads to a data quality issue: vulnerable populations are more likely to provide false information when answering surveys, to avoid the very concrete harms that could result from privacy breaches.1

Increasing the attacker's confidence

In the Census attack, the idea was to reconstruct individual records with demographic data, then link them to real-world identities. But an attacker attempting this might run into an issue: how can they know whether the reconstructed records are accurate? There might be randomness in the process generating the statistics, or in the attack itself. So the information learned might not be real, which would hardly qualify as a convincing privacy attack!

Two papers provide compelling counterpoints to this argument.

The first one is Confidence-ranked reconstruction of census microdata from published statistics, by Travis Dick, Cynthia Dwork, Michael Kearns, Terrance Liu, Aaron Roth, Giuseppe Vietri, and Zhiwei Steven Wu. I love the conceptual simplicity of their approach. Instead of running one reconstruction attack, they suggest running many, and looking at the results. Do certain records appear in every single reconstructed dataset? If so, they're much more likely to be correct than records that only appear in 10% of them. Generalizing this idea gives the attacker a ranked list of reconstructed records: the ones on top of the list are very likely to be correct, the ones near the bottom are much more uncertain. This allows the attacker to focus on the records that are most likely to be accurate in later steps (like re-identification).

[Diagram: a bunch of statistics → run the reconstruction process many times (its inherent randomness yields different results) → count how many times each record appears across reconstructions → an ordered list of reconstructed records, with the most likely to be correct first.]

The second one is the longer description of the Census attack: A Simulated Reconstruction and Reidentification Attack on the 2010 U.S. Census, by John M. Abowd, Tamara Adams, Robert Ashmead, David Darais, Sourya Dey, Simson Garfinkel, Nathan Goldschlag, Michael B. Hawes, Daniel Kifer, Philip Leclerc, Ethan Lew, Scott Moore, Rolando A. Rodríguez, Ramy N. Tadros, and Lars Vilhuber. The paper not only considers the percentage of correctly re-identified records, but also measures solution variability: how many possible reconstructions are there for a given set of statistics? When this solution variability is 0, there is only one possible solution, so the accuracy is maximal. The authors found that more than 97 million records can be reconstructed exactly, with 100% certainty, proving once and for all that the privacy risk of this data release was not overstated.

[Table 2 from the paper, listing solution variability by block percentile; the highlighted part shows reconstruction with 100% certainty for 70% of Census blocks, or 97 million people.]
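
Solution variability is easy to illustrate on a toy example (hypothetical statistics, my own sketch): brute-force every possible dataset of three records over a tiny universe, and count how many of them match the published statistics. If exactly one does, the variability is zero and reconstruction is exact.

```python
# Toy illustration of solution variability: count the datasets consistent
# with a set of published statistics (all numbers are hypothetical).
from itertools import combinations_with_replacement

ages = ["18-24", "25-44", "65+"]
sexes = ["F", "M"]
universe = [(age, sex) for age in ages for sex in sexes]   # every possible record

def statistics(dataset):
    return (
        sum(1 for age, sex in dataset if sex == "F"),      # number of women
        sum(1 for age, sex in dataset if age == "65+"),    # number of people 65+
        sum(1 for age, sex in dataset if age == "18-24"),  # number of people 18-24
    )

published = (3, 1, 0)   # the published statistics for a block of 3 people

consistent = [d for d in combinations_with_replacement(universe, 3)
              if statistics(d) == published]

print("number of consistent reconstructions:", len(consistent))
print(consistent)       # a single solution: reconstruction is exact
```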

Attacking smaller data releases

So, statistical releases with a very large number of tables and aggregates are severely at risk of reconstruction and re-identification attacks. What about more modest data publications, in which only a smaller number of output statistics are made available? Maybe the reconstruction stage will admit a much larger number of possible solutions, and the attacker will be unable to reconstruct records with high confidence?

Sadly, this is not the case, as shown in a paper titled Generate-then-Verify: Reconstructing Data from Limited Published Statistics, by Terrance Liu, Eileen Xiao, Adam Smith, Pratiksha Thaker, and Zhiwei Steven Wu. In this work, the authors design a different kind of attack, working in two stages. In the generate stage, the attacker lists possible "claims", which are statements like "there is exactly one record with age 70 in this Census block". Then, in the verify stage, the attacker uses integer programming to prove that some of the claims are true for every possible reconstruction of the dataset. If the claim is about a single person, then the attacker successfully singled out a specific record in the dataset, with 100% certainty.

[Figure 1 from the paper, illustrating the claims studied in this work.]

Crucially, the attack works even when there are multiple possible reconstructed datasets. Previous approaches would never reach 100% certainty in those cases. By contrast, this attack succeeds in learning sensitive information about some records, and this information is guaranteed to be correct.
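
Here is a toy sketch of the verify stage, under assumptions of mine: a tiny universe of possible ages, made-up published statistics, and SciPy's integer programming solver standing in for the paper's machinery. To verify a claim such as "exactly one person is 72 years old", compute the minimum and the maximum of that count over every dataset consistent with the published statistics; if both are equal to 1, the claim holds in every possible reconstruction.

```python
# Toy verify stage: bound a claimed count over all datasets consistent with
# published statistics, using integer programming (hypothetical numbers).
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

ages = np.arange(68, 73)        # possible ages: 68, 69, 70, 71, 72
# Decision variables: how many people have each possible age.

# Published statistics (hypothetical): 3 people in total, 2 of them aged 69
# or younger, and the sum of all ages is 210.
A = np.vstack([np.ones(len(ages)), (ages <= 69).astype(float), ages.astype(float)])
published = np.array([3, 2, 210])

constraints = LinearConstraint(A, published, published)  # must match exactly
integrality = np.ones(len(ages))                         # counts are integers
bounds = Bounds(0, np.inf)                               # counts are non-negative

claim = (ages == 72).astype(float)   # the count behind "exactly one person is 72"

lowest = milp(c=claim, constraints=constraints, integrality=integrality, bounds=bounds)
highest = milp(c=-claim, constraints=constraints, integrality=integrality, bounds=bounds)
print("number of 72-year-olds across all consistent datasets:",
      round(lowest.fun), "to", round(-highest.fun))
# If both bounds equal 1, the claim is true in every possible reconstruction.
```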

What should we do, then?

Say you're a data steward at an organization that publishes statistical data releases, or shares aggregate data with third parties. Reading this might have helped you realize that these use cases likely reveal more than you intend about the original data. What should you do about it?

One answer is to assume that the risk is real and severe, and work towards mitigating it. The best practice is to use robust privacy-enhancing technology, like differential privacy: this gives you a principled approach to quantify and control disclosure risk, in a way that will stand the test of time. Think of it like using standardized algorithms to encrypt data: something validated by experts as the best option going forward.
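
For intuition, here is what the simplest differentially private release looks like: a single count protected with the Laplace mechanism. This is a bare-bones sketch with a placeholder epsilon; for real deployments, you'd rely on a vetted open-source library rather than rolling your own.

```python
# Minimal Laplace mechanism for a single counting query (sketch, not production code).
import numpy as np

rng = np.random.default_rng()

def dp_count(true_count, epsilon):
    # A counting query has sensitivity 1: one person joining or leaving the data
    # changes it by at most 1, so Laplace noise with scale 1/epsilon gives
    # epsilon-differential privacy.
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

print(dp_count(true_count=127, epsilon=1.0))   # 127 and 1.0 are placeholder values
```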

It can also be useful to run privacy audits: get an expert to quantify the practical risk of your existing data releases, for example by performing such attacks. Doing so will help you get a better idea of how risky your existing use cases are, and prioritize mitigation work. Think of it like penetration testing in security: you want to know the ways in which your existing practices can fail before someone with ill intentions exploits them to harm people.

If you have a use case that could benefit from a privacy audit or a robust anonymization strategy, and are looking for an expert to assist you, hit me up! My independent consultancy, Hiding Nemo, provides exactly this kind of service. I'd love to hear about your use case and discuss how I could help.


Thanks to Antoine Amarilli, Daniel Kifer, Lars Vilhuber, Ryan Steed, and Travis Dick for their helpful comments on previous versions of this post.


  1. This is, by the way, not a theoretical issue: studies show that a number of people don't list all their children on census forms for fear of reprisal from various institutions. 

Feedback on these posts is welcome! Reach out via e-mail (se.niatnofsed@neimad) for comments and suggestions.
Interested in using privacy-enhancing technology to do more with your data, with respect and compliance built-in? I can help! Check out the website of my independent consultancy, Hiding Nemo, to learn more.