Demystifying the US Census Bureau's reconstruction attack
This post is part of a series on differential privacy. Check out the table of contents to see the other articles!
Every 10 years, the US Census Bureau asks every American household a series of questions. How many people are living here? How old are they? What is their race and ethnicity? What is their relationship to each other?
The collected answers are of very high quality, for two reasons. First, all households have to respond truthfully to these questions, by law. Second, the US Census Bureau has a legal duty to keep the answers secret for 72 years after each survey. Both aspects are key to convincing everyone to answer truthfully. Appealing to people's sense of civic duty also helps!
What is the data used for, then? The Census Bureau aggregates it and publishes statistics about the US population. A lot of statistics: in 2010, it released over 150 billion of them. These statistics have many applications: scientific research, informing public policy, determining election districts, etc.
Confidentiality is central to the Census Bureau's mission. By law, they have to make sure that nobody can use their data to find out information about individuals. Disclosing such information even comes with criminal penalties! This has led to the creation of an entire field of study: statistical disclosure control. It predates even the oldest definition of privacy mentioned on this blog (k-anonymity).
How did statistical disclosure control work in practice? Before 1990, the method was pretty blunt: they removed the tables with fewer than five individuals or households in them. Then, from 1990 to 2010, the Census Bureau used a more complex technique called swapping. Swapping works in two steps. First, it selects households at random in small geographic areas: those are the ones most at risk of re-identification. Then, it exchanges records between these households and others before generating the statistics. The details of this swapping process were secret, to make it harder for people to design attacks.
Nowadays, the Census Bureau is moving towards formal notions: the statistics published for the 2020 Census will be differentially private. If you've read the previous articles of this blog, this might not surprise you. Differential privacy (DP) is designed to solve this exact problem: publishing statistics without revealing information about individuals. But this choice was far from obvious! Adding noise to statistics is quite scary for data users[^1]. Using DP for such a complex release is also unprecedented.
So, what convinced the Census Bureau to take this decision? Their scientists ran an attack on some of the 2010 data, to better understand the privacy risks. And they realized that the attack was much more successful than they expected. The attack is simple, not very expensive, and pretty scary. The scientists then measured how much more swapping it would take for the attack to fail… and observed that the utility loss would be catastrophic. Older techniques like suppression were also ineffective. The only way to mitigate the risks and limit accuracy loss was differential privacy.
This choice, of course, has other benefits. Differential privacy provides quantifiable guarantees. It composes well, and protects even against very powerful attackers. It's also good for transparency: for the first time, the Census Bureau will be able to publish the details of their process. But these nice theoretical properties weren't the main factor in their decision. Instead, the choice of DP was pragmatic: it worked best to prevent realistic attacks without sacrificing too much utility.
In this blog post, we'll take a look at the details of this compelling attack, which was at the core of this decision. The attack has two stages: reconstruction and re-identification.
Reconstruction
The first step in the attack is to reconstruct records, using statistical data. The statistical data is as follows. In each geographic area (like Census blocks), we can get the answers to questions like: how many people of age 47 live there? How many people between the ages of 25 and 29 self-identify as white? How many households with 2 adults and 2 children are there?
How can we use this kind of information and reconstruct the original records? Let's take a simplified example. Say that a hypothetical Census block has four people in it.
| Age | Race[^2] |
|-----|----------|
| 17  | White    |
| 17  | Asian    |
| 43  | White    |
| 47  | Asian    |
Now, suppose that we have the following statistical data about this Census block.
1. There are four people in total.
2. Two of these people have age 17.
3. Two of these people self-identify as White.
4. Two of these people self-identify as Asian.
5. The average age of people who self-identify as White is 30.
6. The average age of people who self-identify as Asian is 32.
This data is statistical in nature: these numbers are all aggregated over several people (here, two). Yet, it's not difficult to guess what the original table looks like based on the numbers.
Not obvious yet? Take the two people aged 17. Points 1, 3 and 4 tell us that:
- either they both self-identify as White,
- or they both self-identify as Asian,
- or one of them self-identifies as White and the other as Asian.
The first option is impossible: if they both self-identified as White, then the average age of White people would be 17, not 30 (point 5). The second option is impossible for the same reason (with point 6). So the third option is correct. We now know the first two records, and we can find the ages of the other two using the average age of each race group. It's like a fun puzzle!
That's the idea behind reconstruction attacks: taking statistical information and using simple reasoning to reverse-engineer the original records. Of course, when we have billions of statistics to work with, we don't do this by hand. Instead, we convert the data into a massive system of equations. Each piece of individual information is an unknown variable. The data gives us the relationships between them. Then, we can use a big computer to find a possible solution to this large system. This solution, in turn, gives us reconstructed records.
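To make this concrete, here's what the reconstruction of our toy example looks like as code. This is a minimal sketch using the z3 constraint solver (`pip install z3-solver`); the encoding and variable names are mine, and the Census Bureau's actual system was vastly larger (and, as we'll see, used different tooling).

```python
# Encode the six published statistics as constraints over unknown records,
# then ask a solver for values that satisfy all of them at once.
from z3 import Int, If, Sum, Solver, sat

N = 4  # statistic 1: there are four people in total
ages = [Int(f"age_{i}") for i in range(N)]
races = [Int(f"race_{i}") for i in range(N)]  # 0 = White, 1 = Asian

s = Solver()
for i in range(N):
    s.add(ages[i] >= 0, ages[i] <= 115)  # ages must be plausible
    s.add(races[i] >= 0, races[i] <= 1)  # everyone here is White or Asian

def count(condition):
    """Number of records satisfying a per-record condition."""
    return Sum([If(condition(i), 1, 0) for i in range(N)])

s.add(count(lambda i: ages[i] == 17) == 2)  # statistic 2: two people aged 17
s.add(count(lambda i: races[i] == 0) == 2)  # statistic 3: two White people
s.add(count(lambda i: races[i] == 1) == 2)  # statistic 4: two Asian people
# Statistics 5 and 6: an average over two people is just a sum divided by
# two, so "average age of White people is 30" means "their ages sum to 60".
s.add(Sum([If(races[i] == 0, ages[i], 0) for i in range(N)]) == 2 * 30)
s.add(Sum([If(races[i] == 1, ages[i], 0) for i in range(N)]) == 2 * 32)

assert s.check() == sat
model = s.model()
for i in range(N):
    race = "White" if model[races[i]].as_long() == 0 else "Asian"
    print(model[ages[i]].as_long(), race)
```

Running this prints the four original records (in some order): in this toy example, the statistics pin down the records uniquely, up to reordering.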
So, the team at the Census Bureau did exactly that, with statistical data from the 2010 Census. They transformed it into many equations, and used Gurobi, a commercial optimization solver, to reconstruct the raw data. The records they obtained matched 46% of the original records exactly. That's pretty impressive! Especially since the attack used only a small fraction of the published statistics (6.2 billion out of 150 billion). Swapping was not enough to prevent accurate reconstruction.
At first glance, that result looks pretty scary. But let's take a step back: how bad is it really? What does this 46% number actually tell us?
How bad is reconstruction?
Imagine that a given Census block has a particularly homogeneous population: out of 100 inhabitants, 95 self-identify as White and are evenly distributed between the ages of 20 and 39[^3]. The other 5% of people in this block do not belong to this demographic, and we don't have any information about them.
Can we "reconstruct" this dataset with high accuracy? Yes, and it will be easier than we expect: simply use the data on the majority group, and ignore the minority population. In practice, for each age between 20 and 39, we output 5 records with this age, and White as a race.
Given the statistics on this block, it's likely that our naive guess is pretty close to the truth. After all, only 5% of people don't belong to the majority demographic, and we know the distribution of the other 95%. Recall the success metric from the previous section: the percentage of matching records between real data and reconstructed data. According to this metric, our naive strategy performs very well! Accuracy is about 95%, if the age distribution is very uniform.
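If you'd like to check that arithmetic, here's a quick simulation of the naive strategy. The five remaining records are made up; only their count matters for the result.

```python
# Compare a naive "everyone is in the majority group" reconstruction
# against a simulated block, using exact record matches as the metric.
from collections import Counter

# 95 White residents, ages spread evenly over 20-39...
real = [(20 + i % 20, "White") for i in range(95)]
# ...plus 5 residents outside that demographic (made-up records).
real += [(52, "Black"), (8, "Asian"), (71, "White"), (33, "Black"), (45, "Asian")]

# Naive reconstruction: 5 records for each age in 20-39, all White.
guess = [(age, "White") for age in range(20, 40) for _ in range(5)]

# Count exact matches, treating both datasets as multisets of records.
matched = sum((Counter(real) & Counter(guess)).values())
print(f"match rate: {matched / len(real):.0%}")  # prints "match rate: 95%"
```

No statistics were even used in this "attack", and yet the metric looks excellent.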
Has this process revealed sensitive information? Well… not really, right? Each reconstructed record is identical to those of many other people, so it doesn't seem to reveal anything very sensitive… And the people in demographic minorities are safe from the attack.
It's also not clear yet how problematic these reconstructed records are. All we get is a list of records that are compatible with the published statistics. But how do we know which ones are actually correct? And how can we use them to learn something new and sensitive about individuals?
These questions show that on its own, the 46% number from the Census Bureau doesn't tell us much. But the Census Bureau didn't stop at reconstruction. The attack has a second step, re-identification, and this one gets much scarier results. Let's take a look at how it works.
Re-identification
In the Census Bureau attack, the reconstruction step outputs records with five characteristics: Census block, age, sex, race, and ethnicity. The idea of the re-identification attack is simple. First, they buy the kind of commercial data that an attacker could have access to. Second, they link this information with reconstructed records.
This "commercial data" is exactly what it sounds like: information about people, compiled by random companies, and made available to anyone who pays for it (or steals it). These companies, called "data brokers", are largely unregulated in the US. Their privacy practices are about as awful as you can imagine4. For their attack, Census obtained data from five different companies. Their goal was to simulate what an attacker would likely have access to in 2010.
These commercial datasets typically contain people's names, associated with demographic information: location of residence, age (or date of birth), and sex[^5]. We will use these three characteristics to re-identify reconstructed records.
The technique is simple. We look at each record of the commercial dataset, one after the other, and we try to match its characteristics with the reconstructed records. Is there a single reconstructed record with the same location, age, and sex? If so, we link both records together. Here is a visualization of that process (with made-up data). The commercial dataset is in blue, the reconstructed Census records are in green.
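Alongside the picture, here's a minimal sketch of that matching logic in Python. All the records are made up, and the real attack also has extra logic (like fuzzy matching) that this sketch skips.

```python
# Link commercial records to reconstructed records on (block, age, sex),
# accepting a match only when exactly one reconstructed record fits.
from collections import defaultdict

reconstructed = [  # output of the reconstruction step (made-up data)
    {"block": "1001", "age": 47, "sex": "M", "race": "Asian", "ethnicity": "Not Hispanic"},
    {"block": "1001", "age": 17, "sex": "F", "race": "White", "ethnicity": "Hispanic"},
    {"block": "1002", "age": 47, "sex": "M", "race": "White", "ethnicity": "Not Hispanic"},
]
commercial = [  # what a data broker might sell (made-up data)
    {"name": "James Link", "block": "1001", "age": 47, "sex": "M"},
]

# Index the reconstructed records by the linking key.
by_key = defaultdict(list)
for rec in reconstructed:
    by_key[(rec["block"], rec["age"], rec["sex"])].append(rec)

for person in commercial:
    candidates = by_key[(person["block"], person["age"], person["sex"])]
    if len(candidates) == 1:  # a unique candidate: link the two records
        match = candidates[0]
        print(f'{person["name"]}: race={match["race"]}, ethnicity={match["ethnicity"]}')
```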
Linking datasets in this way achieves two goals.
- It confirms that the reconstructed record corresponds to a specific individual. In fact, it allows us to re-identify the reconstructed record. Here, the commercial data tells us that this person is James Link.
- It gives us more information about this person, which the commercial data didn't have. Here, we learn James Link's race and ethnicity.
How do we quantify the success of this attack? We can look at two numbers. First, how many records can be linked between datasets in this way? Second, out of these linked records, how many are accurate? After all, some of these re-identifications might be wrong: both datasets might be incomplete or inaccurate. The percentage of correct answers is the precision of the attack:

precision = (number of correct re-identifications) / (number of linked records)
So how does the attack perform? Census Bureau scientists linked 138 million records between both datasets. And their average precision was 38%. This means that the attack successfully re-identified 52 million records. Scary!
One could argue that the attacker can't know which records are correctly re-identified. Some of them will be false positives! After all, a precision of 38% doesn't make for great odds. But the analysis from Census Bureau scientists doesn't stop there. Two further considerations make the attack even more compelling.
- The precision of re-identification goes up for people in small Census blocks: it's 72% on the smallest Census blocks (in which 8 million people live). This makes sense: statistics across a few people are more revealing than aggregates over large groups. It's not surprising, but it's still bad news: it means that folks in minority populations are more at risk of being re-identified. But disclosure avoidance is precisely trying to protect these people!
- The precision also goes up if one has better-quality data. Census Bureau scientists use a neat trick to find worst-case guarantees: they use the raw Census data itself as an approximation of the best possible data an attacker could find. Using this, they show that the global precision increases to 75%, and even goes up to 97% on small Census blocks.
Is this assumption of high-quality data unrealistic? No, for two reasons.
- The commercial data used in the attack is what someone could have had access to in 2010 — more than 10 years ago. Data brokers have much better data available for sale by now.
- The re-identification step only requires high-quality data about our targets. The attack works just fine even if we're trying to re-identify a single person, or a handful of people. With high-quality information about them, the precision goes up to these worst-case numbers.
There's nothing preventing businesses from running this attack on their employees or customers. In this kind of context, access to high-quality data isn't an optimistic assumption: it's a given.
Conclusion
This attack is bad news.
It proves two points beyond doubt. First, anyone can reconstruct Census records. Many of these reconstructed records are accurate, especially in small Census blocks. This does not require massive computing power: you can run smaller-scale attacks in minutes on your average laptop. Second, high-quality data about some people is enough to re-identify their Census records. This re-identification has high precision, and reveals previously-unknown information about these people.
In the attack, the attacker could learn race and ethnicity information. Data brokers might want to do this to augment their datasets, for example. But the risks can get even more tangible. Remember: the attack only used a small fraction of all published statistics. The targeted tables had only demographic information. What if someone were to attack household tables instead? This would likely reveal intimate details about the people you live with. Do you have a partner of the same sex? Children of a different race? Anyone with basic information about you — employer, acquaintance, data broker… — might find out.
Some of that information is particularly sensitive, and otherwise unavailable via data brokers. For example, reconstruction also works on children's data, which is illegal to sell in the US. Someone could combine household reconstruction with re-identification, and locate kids of a specific race, sex and age range in a given area. No need to spell out the possible harms this could enable.
Risks to individuals can translate into risks to data quality in the long run. Suppose Census data can no longer be considered confidential. How will this impact people's willingness to answer future Census surveys? Minority groups are already more likely to have privacy concerns about their answers. This will get worse if the Census Bureau can no longer guarantee the confidentiality of the data.
This explains the Census Bureau's decision to move to differential privacy. Everything we described is just one possible attack, and it used only a fraction of the released data. The Census Bureau must protect against all the other attacks that people might come up with in the future! And that's exactly what differential privacy provides: provable privacy guarantees, even against attacks that haven't been invented yet.
If you'd like to learn more about differential privacy, I have good news for you: this blog also contains an entire article series about this notion, introducing it in friendly, non-technical terms. Head over to the table of contents to see the other posts!
Nice, you made it all the way to the end of this article! Here are some more links and disclaimers.
This article is almost entirely sourced from documents from a lawsuit. The main one is this declaration from John Abowd, the Census Bureau's chief scientist. It explains the history and context behind the use of DP for the 2020 Census. It starts on page 85 of this PDF, and Appendix B (starting on page 147) describes the attack in more detail. Some of the numbers also come from Abowd's supplemental declaration.
That last declaration was filed in response to a filing by a couple of demographers. They make an argument similar to the one outlined in the second part of this article: you can reconstruct records by picking them randomly, so reconstruction doesn't mean anything. Hopefully, this post managed to convince you that this argument has two flaws: it ignores the higher risks for minority groups, and it fails to address re-identification, the second part of the attack.
Still hungry for more legal filings? Many leading data privacy experts filed an amicus brief supporting the use of DP for the 2020 Census. It's clear, concise, and makes a lot of great points. Worth a read!
In case this wasn't obvious, this article made a lot of simplifications. I rounded all numbers so they would look nicer. I optimized the reconstruction example for clarity and fun, not for accuracy: in particular, the Census Bureau doesn't actually release statistics like "average age". The real attack doesn't use only a system of equations for reconstruction, but inequalities as well; I also suppose that there is some optimization done, but the details aren't public. The linking attack has some additional logic to do fuzzy matching. And I'm sure I made other shortcuts along the way. If you're looking for more accurate information, you will probably find it in the documents linked above.
There's a lot more to say about swapping, too. For example, it has surprising negative effects on data quality! Since folks in demographic minorities in each location are more likely to have their record swapped, the process biases the data: it makes all areas seem more homogeneous than they actually are. Recall that the details of swapping are secret: data users couldn't quantify such effects to take them into account in their analysis! Interestingly, the attack run by Census Bureau scientists ignored swapping entirely. A more clever attack might take it into account, and attempt to reverse it. This could make the attack even more accurate, especially for folks in demographic minorities.
Finally: I have not been involved in any way with Census work. I'm thankful to Aloni Cohen, Cynthia Dwork, Thomas Steinke, Kunal Talwar, and Yuan Yuan Zheng for helpful comments and suggestions on drafts of this post. Of course, if there are inaccuracies or errors left, that's entirely on me — please let me know if you find any!
[^1]: The people using the data: scientists, people drawing electoral district boundaries, public agencies, businesses, etc.

[^2]: We're using the classification from the Census here. Census data also has a separate Hispanic origin field, called "ethnicity". Don't ask me to explain this, I have no idea.

[^3]: This hypothetical Census block is basically a tech company.

[^4]: To learn more about this industry, this EFF paper is a pretty solid resource.

[^5]: Or gender, depending on the dataset. These are not the same thing, but the Census asks about sex, so we'll assume the commercial datasets use the same notion. This inevitably introduces errors and feels a bit icky.