Part of a series on differential privacy. You might want to start with the previous articles below!
- Why differential privacy is awesome presents a non-technical explanation of the definition.
- Differential privacy in (a bit) more detail introduces the formal definition, with very little math.
- Differential privacy in practice (easy version) explains how to make simple statistics differentially private.
- Almost differential privacy describes how to publish private histograms without knowing the categories in advance.
When people talk about differential privacy, they don't always have the same thing in mind. People agree on the definition, but context also matters a lot. There are two "flavors" of differential privacy: the global model and the local model. In this post, I'll first explain the difference between the two options. Then, I'll give a glimpse of an exciting new idea that combines them.
What do I mean by "context"? In abstract terms, a differentially private algorithm transforms an input into an output.
The main difference will be: who has access to the raw input?
Global differential privacy
In the global model of differential privacy, a central aggregator has access to the real data. What is this "aggregator"? Typically, it's a service or a research organization collecting data about individuals. In this model, each user sends their data to this aggregator without noise. The aggregator takes this data, and transforms it with a differentially private mechanism.
The differentially private mechanism is only applied once, at the end of the process. The aggregator can then e.g. publish the result or share it with third parties.
This model has one big advantage: accuracy. In the global model, you usually don't need to add a lot of noise to get valuable results with a low \(\varepsilon\). Remember when I explained how to release statistics or histograms using differential privacy? These examples were using this global model. It worked pretty well: only a little noise was needed to hide someone in a count.
Where's the catch, then? Well, the central aggregator needs to know the real data. In the scenarios above, we added noise to real counts. This is only possible if we know the true numbers in the first place… To enable this, each user has to trust the aggregator enough to share data with it. That might be difficult: the aggregator can be an untrusted company or government. Also, with the global model, all the data is collected in one place. It increases the risk of catastrophic failure, for example if the aggregator gets hacked and leaks all the data.
The most famous real-world example of the global model is probably the US Census. In 2020, the US Census will use differential privacy to anonymize the data before publication. This is pretty exciting! You can read more about it here.
Local differential privacy
What's the alternative, then? It's the local model of differential privacy. In this model, there is still an aggregator, but they no longer have access to the real data. Instead, each user applies a differentially private mechanism to their own data. And they only send their data to the aggregator once it's already anonymized.
After collecting this noisy data, the aggregator can compute some statistics, and publish them. This last step doesn't need to be differentially private: the data is anonymous to begin with. In theory, the aggregator could publish the entire dataset they collected.
The big advantage of this model is that it no longer requires trust. Since each user is protecting their own data, they're safe even if the aggregator is malicious. This makes the local model well-suited to situations where trust is difficult to get. And we already saw an example of this! Remember the survey about drug use that used randomized response to gather data. The scheme allowed subjects to answer honestly without admitting to breaking the law. This is a typical application of the local model.
Can you guess the drawback of this model? Since each user must add noise to their own data, the total noise is much larger. You typically need many more users than in the global model to get useful results. To mitigate this problem, practical applications often use high values of \(\varepsilon\).
Besides randomized response, the most famous example of this model is probably RAPPOR. This clever scheme was invented to collect differentially private data in Google Chrome. Another example, a bit more recent, is the mechanism that Apple uses to collect data on the iOS keyboard.
The best of both worlds
Until very recently, there was no middle ground between the two options. The choice was binary: either accept a much larger level of noise, or collect raw data. This is starting to change, thanks to recent work on a novel type of architecture. This new design is called ESA: Encode, Shuffle, Analyze. In the two previous models, there was only two types of participants: users and aggregator. By contrast, the ESA architecture has three elements.
- The encoder is a fancy name to say "user". It collects the data, encrypts it twice, and passes it to the shuffler.
- The shuffler is an intermediary process. First, it removes identifiers, and groups similar pieces of data together. Then, it passes these pieces of data to the analyzer if there are enough of them.
- The analyzer actually decrypts the data, and computes the statistics we're interested in.
In ESA, the data is encrypted twice, in two layers. The shuffler can only decrypt the first layer. It contains the user IDs, and something called "group ID". This group ID describes what kind of data this is, but not what is the actual value of the data. For example, group ID could be a label like "app latency for iOS", while the actual data would be the latency value.
The shuffler then groups all group IDs together and counts how many users are in each group. If there are enough users in a group, it passes them all to the analyzer. The analyzer can then decrypt the second layer of the data, and compute the output.
Having two layers allows to separate the data and its metadata. The shuffler can see IDs, but not the actual data. Meanwhile, the analyzer only sees the data in batches, and cannot know who sent what. The magic comes with tweaking what the shuffler does. You add some randomness in the process, in well-chosen places… And it ensures that the final output is differentially private.
This nice property comes from the separation between shuffler and analyzer. Of course, if the same entity runs both, that benefit disappears. This design can still be useful! In particular, catastrophic failures are much less likely. But from a trust perspective, it becomes like the global model. So there are two options:
- A different organization can run the shuffler as a service. In principle, a nonprofit like the EFF could play this role.
- The shuffler can run on secure hardware, using e.g. remote attestation. This way, the client can verify that the shuffler does what it's supposed to.
Both options are probably a few years away, but I find them pretty exciting. They could, in theory, bring the best of both worlds: the low noise levels of the global model, and the trustless aspect of the local model. If you want to learn more about it, you can check out the paper that introduced this design.
What comes next
What will we be learning next? I'm not sure yet. I recently wrote a research paper about differential privacy variants, so I'll cover that here at some point. Apart from that, here are a few ideas…
- Differentially private algorithms to publish things that are more complex than counts or sums. (Medians are interesting!)
- The trade-off between privacy and utility, and how to give people an intuition about accuracy loss. (How to talk to data scientists when you're a privacy person.)
- The unexpected challenges that occur when using differential privacy in practice. (It's like crypto: so many opportunities to shoot yourself in the foot!)
If you have opinions about what you want to learn next, let me know — my contact info is below.