
A list of real-world uses of differential privacy

— updated

This post is part of a series on differential privacy. Check out the table of contents to see the other articles!


That's pretty much all there is to this article: a list of real-world deployments of differential privacy, with their privacy parameters. One day, we might have a proper Epsilon Registry, but in the meantime…

First, some notes.

  • The main list only includes projects with documented values of the privacy parameters, including what the privacy unit is. Projects that don't publish this information are listed at the end.
  • All use cases use central DP unless specified otherwise.
  • The list is sorted alphabetically by the organization publishing the data.
  • When a project uses open-source differential privacy tooling, I added a link to it.
  • I also added some caveats and general comments at the end of this post.

If you'd like to add or correct something, don't hesitate to hit me up! My contact info is at the bottom of this page.

Apple

An architecture diagram taken from Apple's differential privacy paper

Apple uses local DP to collect some data from end-user devices running iOS or macOS. The process is documented in a high-level overview document and a detailed paper. All features use \(\varepsilon\)-DP; the values of the privacy parameter are listed below, with a privacy unit of user-day.

  • QuickType suggestions learns previously-unknown words typed by sufficiently many users, using \(\varepsilon=16\).
  • Emoji suggestions calculates which emojis are most popular among users, using \(\varepsilon=4\).
  • Lookup hints collects data on actions taken from iOS Search suggestions. (I think. It's not very explicit.) It uses \(\varepsilon=8\).
  • Health Type Usage estimates which health types are most used in the HealthKit app, using \(\varepsilon=2\).
  • Safari Energy Draining Domains and Safari Crashing Domains collect data on web domains: which domains are most likely to cause high energy consumption or crashes, respectively. Both features use a common budget of \(\varepsilon=8\).
  • Safari Autoplay Intent Detection collects data about websites that auto-play videos with sound: in which of these domains are users most likely to mute vs. keep playing the video? It uses \(\varepsilon=16\).

The documented privacy unit is each data collection event. Devices send a limited number of such events per day, so I translated all guarantees to use a privacy unit of user-day. Apple also does some de-identification and shuffling (see Section 3.2.2 of their paper); taking this into account would presumably lead to tighter central DP guarantees.
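
To make the translation concrete: under basic sequential composition, per-event guarantees simply add up across the events a device sends in a day. Here is a minimal sketch of that arithmetic; the numbers in it are illustrative assumptions, not values from Apple's documentation.

```python
# Minimal sketch: turning a per-event epsilon-DP guarantee into a per-day
# guarantee via basic sequential composition. The numbers are illustrative
# assumptions, not values taken from Apple's documentation.

def per_day_epsilon(epsilon_per_event: float, max_events_per_day: int) -> float:
    """Basic sequential composition: budgets add up across events."""
    return epsilon_per_event * max_events_per_day

# Hypothetical feature submitting at most 2 events per day, each protected
# with epsilon = 4: the user-day guarantee is epsilon = 8.
print(per_day_epsilon(epsilon_per_event=4, max_events_per_day=2))  # 8
```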

Facebook

Full URLs Data Set

The Full URLs Data Set provides data on user interactions with web pages shared on Facebook. The privacy unit is each individual action: this can be e.g. "Alice shared URL foo.com" or "Bob viewed a post containing URL bar.org". For each type of action, the privacy parameter is chosen to protect 99% of users with \((\varepsilon,\delta)\)-DP, for \(\varepsilon=0.45\) and \(\delta=10^{-5}\). Across all metrics, 96.6% of users are protected with \((\varepsilon,\delta)\)-DP with \(\varepsilon=1.453\) and \(\delta=10^{-5}\).

Behind the scenes, this uses \(\rho\)-zero-concentrated DP, with \(\rho=0.0052\) for 99% of users for each action type, and an overall \(\rho=0.0728\) for 96.6% of users. The paper refers to two additional DP operations:

  • URLs that have not been shared by enough users (according to a DP count) are discarded;
  • the algorithm also calculates the 99th percentile of each action in a DP way.

It does not quantify the privacy budget used for these two operations.
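
If you want to relate the zCDP parameters to the \((\varepsilon,\delta)\) figures above, here is a small sketch using the standard conversion bound \(\varepsilon \le \rho + 2\sqrt{\rho\ln(1/\delta)}\). This bound is not tight: it gives slightly larger values than the published ones, which presumably rely on a sharper conversion.

```python
import math

def zcdp_to_approx_dp(rho: float, delta: float) -> float:
    """Standard (non-tight) conversion from rho-zCDP to (epsilon, delta)-DP."""
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

# Per-action-type budget: rho = 0.0052 at delta = 1e-5 gives epsilon ≈ 0.49,
# a bit above the published 0.45 (which uses a tighter conversion).
print(zcdp_to_approx_dp(0.0052, 1e-5))

# Overall budget: rho = 0.0728 at delta = 1e-5 gives epsilon ≈ 1.90,
# versus the published 1.453.
print(zcdp_to_approx_dp(0.0728, 1e-5))
```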

Movement Range Maps

An animated map of the "Stay Put" metric in Facebook's Movement Range Maps

The Movement Range Maps quantify the changes in mobility of Facebook users during the COVID-19 pandemic. There are two metrics: how much users move during each day, and how many people are generally staying at home. Each metric uses a daily budget of \(\varepsilon=1\), so the overall privacy budget is \(\varepsilon=2\) with user-day as the privacy unit.

The blog post also mentions that regions with fewer than 300 users are omitted. This process doesn't appear to be done in a DP way.

Google

All three data releases listed here use Google's open-source libraries.

Community Mobility Reports

The Community Mobility Reports quantify changes in mobility patterns during the COVID-19 pandemic: how many people went to their workplace or to specific kinds of public places, and how long people spent at home. Each metric uses \(\varepsilon=0.44\) per day, and each user contributes to at most six metrics per day. Thus, the total privacy budget is \(\varepsilon=2.64\), with user-day as a privacy unit.

The paper also mentions spending additional privacy budget to update the way the metrics are computed. This additional budget isn't quantified exactly.
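
As an illustration of how a per-metric daily budget like \(\varepsilon=0.44\) might be spent, here is a generic sketch of a Laplace mechanism applied to a count with bounded per-user contributions. This is only a sketch of the general technique: the actual pipelines use Google's open-source libraries and include additional steps (contribution bounding, thresholding, and so on).

```python
import numpy as np

def dp_count(true_count: float, epsilon: float, max_contribution: int = 1) -> float:
    """Laplace mechanism for a count to which each user contributes at most
    `max_contribution` (so the sensitivity is max_contribution)."""
    scale = max_contribution / epsilon
    return true_count + np.random.laplace(loc=0.0, scale=scale)

# Hypothetical example: visits to one category of places in one region on one
# day, with each user counted at most once, using the per-metric budget.
print(dp_count(true_count=1234, epsilon=0.44))
```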

Search Trends Symptoms Dataset

An animated visualization of searches for Fever in the US through 2020, using Google's Search Trends Symptoms Dataset

The Search Trends Symptoms Dataset measures the volume of Google searches related to a variety of symptoms. It uses \(\varepsilon=1.68\), with a user-day privacy unit.

Vaccination Search Insights

The Vaccination Search Insights quantify trends in Google searches related to COVID-19 vaccination. It uses \((\varepsilon,\delta)\)-DP with \(\varepsilon=2.19\) and \(\delta=10^{-5}\), with user-day as a privacy unit.

LinkedIn

Labor Market Insights

The Labor Market Insights measure trends in people changing their occupation on LinkedIn. There are three types of reports.

  • Who is hiring? lists the companies that are hiring the most. It uses \((\varepsilon,\delta)\)-DP to protect each hiring event (a LinkedIn user changing their occupation), with \(\varepsilon=14.4\) and \(\delta=1.2\cdot10^{-9}\).
  • What jobs are available? enumerates the job titles that most people are being hired for. It also uses \((\varepsilon,\delta)\)-DP to protect each hiring event, with \(\varepsilon=14.4\) and \(\delta=1.2\cdot10^{-9}\).
  • What skills are needed? lists the most popular skills for the jobs above. It protects each LinkedIn user's skills information during a single month with \(\varepsilon=0.3\) and \(\delta=3\cdot10^{-10}\).

This suggests a total of \((\varepsilon,\delta)\)-DP with \(\varepsilon=28.8\) and \(\delta=2.4\cdot10^{-9}\) for hiring events, and \(\varepsilon=0.3\) and \(\delta=3\cdot10^{-10}\) for skills information during a single month. However, there are many subtleties involved in this analysis, and it's very possible to interpret the paper differently.

  1. The privacy parameters listed in the paper are three times smaller. However, each report covers 3 months of data, and reports are published monthly: a single hiring event will appear in three distinct reports.
  2. For What skills are needed?, each monthly report looks back at 5 years of data. So if skill data for a user doesn't change during a 5-year period, the total budget eventually reaches \(\varepsilon=6\) and \(\delta=6\cdot10^{-9}\).
  3. Adding the \(\varepsilon\) and \(\delta\) values together, like I did, is simple, but only gives loose bounds on the overall privacy budget (the sketch after this list spells out the arithmetic). We can probably find tighter bounds using advanced composition theorems or other privacy accounting methods.
  4. The paper also indicates that 95% of people in the dataset have at most one hiring event in a 3-month period.
  5. The What skills are needed? report also uses a non-DP pre-processing step. This makes it technically impossible to provide an exact DP guarantee.
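
To make points 1 to 3 concrete, here is the naive accounting that produces the totals quoted above: \(\varepsilon\) and \(\delta\) values are simply summed across the reports a given record appears in. The per-report parameters below are one third of the totals in the bullet list above, following the reading in points 1 and 2.

```python
# Naive sequential composition: epsilons and deltas simply add up. This
# reproduces the loose totals above, under the reading in points 1 and 2
# (per-report parameters are one third of the totals in the bullet list).

def compose(eps: float, delta: float, k: int) -> tuple[float, float]:
    """Basic composition of k mechanisms, each (eps, delta)-DP."""
    return k * eps, k * delta

# A hiring event appears in 3 monthly reports, for each of the 2 report types.
print(compose(4.8, 4e-10, k=3 * 2))   # (28.8, 2.4e-09)

# Skills data appears in the 3 reports covering a given month...
print(compose(0.1, 1e-10, k=3))       # (0.3, 3e-10)
# ...and, if it never changes, in every monthly report over the 5-year lookback.
print(compose(0.1, 1e-10, k=60))      # (6.0, 6e-09)
```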

Audience Engagements API

An architecture diagram from LinkedIn's Audience Engagements API paper

The Audience Engagements API is the only interactive query system in this list. It allows marketers to get information about LinkedIn users engaging with their content. Each query satisfies \((\varepsilon,\delta)\)-DP with \(\varepsilon=0.15\) and \(\delta=10^{-10}\), with a user as the privacy unit. Each analyst can send multiple queries, but a monthly cap limits how many: the total \((\varepsilon,\delta)\) budget is \(\varepsilon=34.9\) and \(\delta=7\cdot10^{-9}\), with a privacy unit of user-month-analyst.

The system also implements additional measures to prevent averaging attacks: new data is loaded daily, and seeded noise is used so the same query on the same day will always return the same answer.
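
The seeded-noise idea is easy to illustrate: derive the randomness deterministically from the query and the day, so that repeating a query does not produce fresh noise that could be averaged away. The sketch below shows the general technique only; the names and parameters are made up and do not reflect LinkedIn's implementation.

```python
import hashlib
import numpy as np

def seeded_laplace_count(true_count: float, epsilon: float,
                         query_id: str, day: str) -> float:
    """Laplace noise seeded from (query, day): re-running the same query on the
    same day returns the same answer, which prevents averaging attacks.
    (A real system would also mix a secret key into the seed.)"""
    seed_bytes = hashlib.sha256(f"{query_id}|{day}".encode()).digest()
    rng = np.random.default_rng(int.from_bytes(seed_bytes[:8], "big"))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Same query, same day: identical noisy answers.
print(seeded_laplace_count(1000, 0.15, "clicks on campaign 42", "2021-06-01"))
print(seeded_laplace_count(1000, 0.15, "clicks on campaign 42", "2021-06-01"))
```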

Microsoft

The U.S. Broadband Coverage Dataset quantifies the percentage of users having access to high-speed Internet across the US. It uses \(\varepsilon\)-DP with \(\varepsilon=0.2\); the privacy unit is a user. The data was privatized using OpenDP SmartNoise.

OhmConnect

The Energy Differential Privacy project enables sharing of smart meter data. In one project, Recurve helped OhmConnect share data from their virtual power plant. This project uses \((\varepsilon,\delta)\)-DP with \(\varepsilon=4.72\) and \(\delta=5.06\cdot10^{-9}\), with user as a privacy unit. The project uses both custom open-source code and Google's open-source DP libraries.

The privacy parameters appearing in the technical paper are different. The accounting uses amplification by sampling, with a sampling factor of \(\eta=0.124\). However, the paper converts a pre-amplification \(\varepsilon_{orig}=6.8\) into \(\varepsilon=\eta\cdot\varepsilon_{orig}=0.843\). The correct formula is \(\varepsilon=\log\left(1+\eta\left(e^{\varepsilon_{orig}}-1\right)\right)\) (see Theorem 9 in summary of results), which gives \(\varepsilon=4.72\). The \(\delta\) listed above is also amplified (with \(\delta=\eta\cdot\delta_{orig}\)); the one reported in the paper is not.

Note that the amplification result assumes uniformly random sampling with replacement. But the paper also mentions a stratified sampling methodology, which is slightly different: it's unclear whether the amplification result still applies. If not, then the privacy parameters are \(\varepsilon=6.8\) and \(\delta=4.08\cdot10^{-8}\).
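
Both conversions are easy to check numerically. The sketch below compares the incorrect linear scaling with the amplification-by-sampling formula, using the paper's pre-amplification parameters.

```python
import math

def amplified_epsilon(eps_orig: float, sampling_rate: float) -> float:
    """Privacy amplification by subsampling: log(1 + eta * (exp(eps) - 1))."""
    return math.log(1 + sampling_rate * (math.exp(eps_orig) - 1))

eta, eps_orig, delta_orig = 0.124, 6.8, 4.08e-8

print(eta * eps_orig)                    # ≈ 0.843: the incorrect linear scaling
print(amplified_epsilon(eps_orig, eta))  # ≈ 4.72: the correct formula
print(eta * delta_orig)                  # ≈ 5.06e-9: the amplified delta
```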

United States Census Bureau

Post-Secondary Employment Outcomes

The Post-Secondary Employment Outcomes provide data about the earnings and employment of college graduates. The technical documentation mentions two statistics, each using \(\varepsilon\)-DP with \(\varepsilon=1.5\), for a total privacy budget of \(\varepsilon=3\). The privacy unit is a person in the dataset, and the methods are described in detail in this paper.

2020 Census Redistricting Data

A screenshot from the 2020 Census Demographic Data Map Viewer

The 2020 Census Redistricting Data contain US population data and demographic information. It is protected with \((\varepsilon,\delta)\)-DP with \(\varepsilon=19.61\) and \(\delta=10^{-10}\), where the privacy unit is a person in the dataset. This uses custom code that was published on GitHub.

The privacy accounting is done with \(\rho\)-zero-concentrated DP, with a global budget of \(\rho=2.7\).

Other deployments

This list is almost certainly incomplete. Again, don't hesitate to reach out if you'd like me to add or correct something!

  • Apple and Google's Exposure Notification framework has an analytics component that uses shuffled DP. The paper mentions a local \(\varepsilon=8\) and corresponding central values of \(\varepsilon\) depending on how many users participate and on the central \(\delta\) chosen. However, it does not specify the privacy unit, the number of aggregations, nor the minimal number of participating users.
  • Cuebiq's Mobility Dashboards (Mobility Index and Evacuation Rates) surface differentially private data. This presentation mentions \(\varepsilon=1\) and \(\delta=10^{-10}\) per-aggregation, but does not specify what the privacy unit is, nor how many aggregations are performed. It uses OpenDP SmartNoise.
  • Google mentions using DP in two Google Maps features: the first quantifies how busy public places are during the day, the second which of a restaurant's dishes are most popular. It does not specify the privacy parameters used nor the exact method used to generate the data.
  • Google shared mobility data with researchers, using DP to anonymize it. The resulting paper mentions \((\varepsilon,\delta)\)-DP with \(\varepsilon=0.66\) and \(\delta=2.1\cdot10^{-29}\), but does not specify explicitly what the privacy unit is.
  • Google's RAPPOR used to collect browsing information in Google Chrome with local DP. It is now deprecated.
  • The Internal Revenue Service and the U.S. Department of Education, helped by Tumult Labs, used DP to publish college graduate income summaries. The data was published on the College Scorecard website. The project is outlined in this post, but no specific privacy parameters are given.
  • Microsoft collects telemetry data in Windows using local DP. No information is given about privacy parameters.
  • Microsoft's Assistive AI automatically suggests replies to messages in Office tools. It provides \((\varepsilon,\delta)\)-DP with \(\varepsilon=4\) and \(\delta<10^{-7}\), but does not specify what the privacy unit is.
  • Microsoft also mentions using DP in Workplace Analytics: this allows managers to see data about their team's interactions with workplace tools. No specific information about privacy parameters is given.
  • The US Census Bureau published OnTheMap in 2008: this was the first-ever real-world deployment of DP. It provides statistics on where US workers are employed and where they live. The DP process is described in a paper, but I haven't found the privacy parameters published anywhere.

There are (many) other examples of companies and organizations saying they use DP. I only added them here if they point to a specific project or feature.

Finally, many scientific papers report experimental results on real datasets. Most don't mention whether the system was deployed. I did not attempt to list those.

Caveats & comments

What's a user?

Most of these projects have user as part of their privacy unit. This can mean slightly different things depending on the project: a device (for telemetry collection), an account (for online services), a household (for smart meter data), and so on. This means that an individual who uses multiple devices or accounts on the same online service might get weaker privacy guarantees. This subtlety is not always made explicit.

Replacement vs. addition/removal

In differential privacy, the definition of neighboring datasets comes in two flavors. Do you change the data of one person? Or do you add or remove one person entirely? This subtlety is also not always explicit, and I've ignored it in the list above.

Comparing projects

You should not use this list to make broad statements or comparisons about the privacy posture of different organizations. Differential privacy parameters are a very small part of the story, even for these specific projects. How was the data collected? How long is it kept? How sensitive is it? Who has access to the input and output data? Answering these questions is crucial to put each DP deployment and its parameters in context.

In addition, different privacy units make simple comparisons fairly meaningless. Even across time periods, the semantics are subtle. As an example, consider two DP processes.

  • Process \(A\) uses a privacy unit of user-day with \(\varepsilon_A=0.2\).
  • Process \(B\) uses a privacy unit of user-month with \(\varepsilon_B=3\).

Can we simply multiply \(\varepsilon_A\) by \(30\) to compare it to \(\varepsilon_B\)? Well, not really. The data of a user during a single day is protected by Process \(A\) with \(\varepsilon_A\), which is better than what Process \(B\) can guarantee for that day (at most \(\varepsilon_B\)). But the data of an entire month is only protected by Process \(A\) with \(30\varepsilon_A=6\), so for monthly data, Process \(B\) has the better guarantees. And this is without even considering better privacy accounting methods, which could give tighter parameters for the monthly guarantees of Process \(A\).
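
Here is the same comparison as a tiny worked example, assuming basic composition over 30 days (better accounting might tighten Process \(A\)'s monthly figure).

```python
# Comparing a user-day guarantee with a user-month guarantee, assuming basic
# composition over 30 days.

eps_a_day = 0.2    # Process A: privacy unit is user-day
eps_b_month = 3.0  # Process B: privacy unit is user-month

# Protection of a single day of data:
print(eps_a_day)       # 0.2 -> Process A gives the stronger guarantee here
print(eps_b_month)     # 3.0 -> the best Process B can promise for one day

# Protection of a whole month of data:
print(30 * eps_a_day)  # 6.0 -> now Process B (3.0) gives the stronger guarantee
print(eps_b_month)     # 3.0
```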


Thanks to Ashwin Machanavajjhala, Erik Tobenek, Lars Vilhuber, Marc Paré, and Tancrède Lepoint for their helpful comments and suggestions.

All opinions here are my own, not my employer's.   |   Feedback on these posts are very welcome! Please reach out via e-mail (se.niatnofsed@neimad) or Twitter (@TedOnPrivacy) for comments and suggestions.   |   Interested in deploying formal anonymization methods? My colleagues and I at Tumult Labs can help. Contact me at oi.tlmt@neimad, and let's chat!