Ted is writing things

On privacy, research, and privacy research.

Local vs. central differential privacy

— updated

This post is part of a series on differential privacy. Check out the table of contents to see the other articles!

When people talk about differential privacy, they don't always have the same thing in mind. People agree on the definition, but context also matters a lot. There are two "flavors" of differential privacy: the central model (or global model) and the local model. In this post, I'll first explain the difference between the two options. Then, I'll give a glimpse of an exciting new idea that combines them.

What do I mean by "context"? In abstract terms, a differentially private algorithm transforms an input into an output.

Diagram "raw input" → "anonymized output", with "magic" on the arrow.

The main difference will be: who has access to the raw input?

Central differential privacy

In the central model (or global model) of differential privacy, a central aggregator has access to the real data. What is this "aggregator"? Typically, it's a service or a research organization collecting data about individuals. In this model, each user sends their data to this aggregator without noise. The aggregator takes this data, and transforms it with a differentially private mechanism.

Diagram "users" → "aggregator" → "anonymized output", with "magic" on the second arrow.

The differentially private mechanism is only applied once, at the end of the process. The aggregator can then e.g. publish the result or share it with third parties.

This model has one big advantage: accuracy. In the central model, you usually don't need to add a lot of noise to get valuable results with a low \(\varepsilon\). Remember when I explained how to release statistics or histograms using differential privacy? These examples were using this central model. It worked pretty well: only a little noise was needed to hide someone in a count.

Where's the catch, then? Well, the central aggregator needs to know the real data. In the scenarios above, we added noise to real counts. This is only possible if we know the true numbers in the first place… To enable this, each user has to trust the aggregator enough to share data with it. That might be difficult: the aggregator can be an untrusted company or government. Also, with the central model, all the data is collected in one place. It increases the risk of catastrophic failure, for example if the aggregator gets hacked and leaks all the data.

The most famous real-world example of the central model is probably the US Census. In 2020, the US Census will use differential privacy to anonymize the data before publication. This is pretty exciting! You can read more about it here.

Local differential privacy

What's the alternative, then? It's the local model of differential privacy. In this model, there is still an aggregator, but they no longer have access to the real data. Instead, each user applies a differentially private mechanism to their own data. And they only send their data to the aggregator once it's already anonymized.

Diagram "users" → "aggregator" → "anonymized output", with "magic" on the first arrow.

After collecting this noisy data, the aggregator can compute some statistics, and publish them. This last step doesn't need to be differentially private: the data is anonymous to begin with. In theory, the aggregator could publish the entire dataset they collected.

The big advantage of this model is that it no longer requires trust. Since each user is protecting their own data, they're safe even if the aggregator is malicious. This makes the local model well-suited to situations where trust is difficult to get. And we already saw an example of this! Remember the survey about drug use that used randomized response to gather data. The scheme allowed subjects to answer honestly without admitting to breaking the law. This is a typical application of the local model.

Can you guess the drawback of this model? Since each user must add noise to their own data, the total noise is much larger. You typically need many more users than in the central model to get useful results. To mitigate this problem, practical applications often use high values of \(\varepsilon\).

Besides randomized response, the most famous example of this model is probably RAPPOR. This clever scheme was invented to collect differentially private data in Google Chrome. Another example, a bit more recent, is the mechanism that Apple uses to collect data on the iOS keyboard.

The best of both worlds

Until very recently, there was no middle ground between the two options. The choice was binary: either accept a much larger level of noise, or collect raw data. This is starting to change, thanks to recent work on a novel type of architecture. This new design is called ESA: Encode, Shuffle, Analyze. In the two previous models, there was only two types of participants: users and aggregator. By contrast, the ESA architecture has three elements.

  • The encoder is a fancy name to say "user". It collects the data, encrypts it twice, and passes it to the shuffler.
  • The shuffler is an intermediary process. First, it removes identifiers, and groups similar pieces of data together. Then, it passes these pieces of data to the analyzer if there are enough of them.
  • The analyzer actually decrypts the data, and computes the statistics we're interested in.

Diagram "users" → "shuffler" → "analyzer" → "anonymized output", with "magic" on the shuffler and analyzer boxes.

In ESA, the data is encrypted twice, in two layers. The shuffler can only decrypt the first layer. It contains the user IDs, and something called "group ID". This group ID describes what kind of data this is, but not what is the actual value of the data. For example, group ID could be a label like "app latency for iOS", while the actual data would be the latency value.

The shuffler then groups all group IDs together and counts how many users are in each group. If there are enough users in a group, it passes them all to the analyzer. The analyzer can then decrypt the second layer of the data, and compute the output.

Having two layers allows to separate the data and its metadata. The shuffler can see IDs, but not the actual data. Meanwhile, the analyzer only sees the data in batches, and cannot know who sent what. The magic comes with tweaking what the shuffler does. You add some randomness in the process, in well-chosen places… And it ensures that the final output is differentially private.

This nice property comes from the separation between shuffler and analyzer. Of course, if the same entity runs both, that benefit disappears. This design can still be useful! In particular, catastrophic failures are much less likely. But from a trust perspective, it becomes like the central model. So there are two options:

  • A different organization can run the shuffler as a service. In principle, a nonprofit like the EFF could play this role.
  • The shuffler can run on secure hardware, using e.g. remote attestation. This way, the client can verify that the shuffler does what it's supposed to.

Both options are probably a few years away, but I find them pretty exciting. They could, in theory, bring the best of both worlds: the low noise levels of the central model, and the trustless aspect of the local model. If you want to learn more about it, you can check out the paper that introduced this design.

Interested in learning more about differential privacy? Head over to the table of contents of this series to see its other posts. Or you can directly go to the next article in the series, which is somewhat paradoxical: it explores what it means for an algorithm to not be differentially private.

All opinions here are my own, not my employer's.   |   Feedback on these posts is very welcome! Please reach out via e-mail (se.niatnofsed@neimad) or Twitter (@TedOnPrivacy) for comments and suggestions.   |   Interested in deploying formal anonymization methods? My colleagues and I at Tumult Labs can help. Contact me at oi.tlmt@neimad, and let's chat!