Local vs. central differential privacy
This post is part of a series on differential privacy. Check out the table of contents to see the other articles!
When people talk about differential privacy, they don't always have the same thing in mind. People agree on the definition, but context also matters a lot. There are three main settings in which differential privacy can be deployed: the central model (or global model), the local model, and the distributed model. In this post, I'll outline these three options, and the differences between them.
First, what do I mean by "context", or "setting"? When you only look at the abstract definition, differential privacy is simple: it applies to a function that transforms an input into an output.
The question we'll be answering in this post is: where is differential privacy applied? Who do we protect the data from, and who has access to the raw data?
Central differential privacy
In the central model (or global model) of differential privacy, a central aggregator has access to the real data. What is this "aggregator"? Typically, it's a service or a research organization collecting data about individuals. In this model, each user sends their data to this aggregator without noise. The aggregator takes this data, and transforms it with a differentially private mechanism.
The differentially private mechanism is only applied once, at the end of the process. The aggregator can then e.g. publish the result or share it with third parties.
This model has one big advantage: accuracy. In the central model, you usually don't need to add a lot of noise to get valuable results with a low \(\varepsilon\). Remember when I explained how to release statistics or histograms using differential privacy? Those examples used this central model, and it worked pretty well: only a little noise was needed to hide someone in a count.
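To make this concrete, here is a minimal sketch of a central-model count, assuming the Laplace mechanism and a counting query of sensitivity 1; the function name and data are made up for illustration.

```python
import numpy as np

def central_dp_count(records, predicate, epsilon):
    """Central model: the aggregator sees the raw records, computes the
    true count, and adds Laplace noise once before publishing."""
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1 and the Laplace scale is 1/epsilon.
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: how many users are over 40, with epsilon = 0.1?
ages = [23, 45, 67, 31, 52, 48, 29, 61]
print(central_dp_count(ages, lambda age: age > 40, epsilon=0.1))
```

Even with a fairly small \(\varepsilon\) like 0.1, the noise has a standard deviation of about 14: enough to hide any single individual, but negligible for counts computed over thousands of users.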
Where's the catch, then? Well, the central aggregator needs to know the real data. In the scenarios above, we added noise to real counts. This is only possible if we know the true numbers in the first place… To enable this, each user has to trust the aggregator enough to share data with it. That might be difficult: the aggregator can be an untrusted company or government. Also, with the central model, all the data is collected in one place. This increases the risk of catastrophic failure, for example if the aggregator gets hacked and leaks all the data.
The most famous real-world example of the central model is probably the US Census: in 2020, the Census Bureau will use differential privacy to anonymize the data before publication. This is pretty exciting! You can read more about it here.
Local differential privacy
What's the alternative, then? A major contender is the local model of differential privacy. In this model, there is still an aggregator, but it no longer has access to the real data. Instead, each user applies a differentially private mechanism to their own data. They only send their data to the aggregator once it's already anonymized.
After collecting this noisy data, the aggregator can compute some statistics, and publish them. This last step doesn't need to be differentially private: the data is anonymous to begin with. In theory, the aggregator could publish the entire dataset they collected.
The big advantage of this model is that it no longer requires trust. Since each user is protecting their own data, they're safe even if the aggregator is malicious. This makes the local model well-suited to situations where trust is difficult to get. And we already saw an example of this! Remember the survey about drug use that used randomized response to gather data? The scheme allowed subjects to answer honestly without admitting to breaking the law. This is a typical application of the local model.
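As a refresher, here is a minimal sketch of the user-side mechanism, assuming the classic coin-flip variant of randomized response (answer truthfully with probability 1/2, otherwise answer uniformly at random); the function name is mine, not a standard API.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Local model: each user randomizes their own answer
    before it ever leaves their device."""
    if random.random() < 0.5:
        return true_answer           # answer honestly half the time
    return random.random() < 0.5     # otherwise, report a fair coin flip

# The aggregator only ever receives this noisy bit.
noisy_answer = randomized_response(True)
```

With these parameters, the probability of reporting "yes" is at most 3 times larger for someone whose true answer is "yes" than for someone whose true answer is "no", so this satisfies local differential privacy with \(\varepsilon = \ln 3 \approx 1.1\).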
Can you guess the drawback of this model? Since each user must add noise to their own data, the total noise is much larger. You typically need many more users than in the central model to get useful results. To mitigate this problem, practical applications often use high values of \(\varepsilon\).
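To see why, here is a hedged sketch of the aggregator's side for the randomized response scheme above, with made-up numbers: the aggregator can debias the noisy answers, but the resulting estimate is only accurate when the number of users is large.

```python
import random

def estimate_proportion(noisy_answers, p: float = 0.5) -> float:
    """Invert the randomization: if q is the observed fraction of "yes"
    answers, then q = p * true + (1 - p) / 2, so solve for true."""
    q = sum(noisy_answers) / len(noisy_answers)
    return (q - (1 - p) / 2) / p

# Simulate 100,000 users, 30% of whom would truthfully answer "yes",
# each applying randomized response locally before reporting.
p = 0.5
true_answers = [random.random() < 0.3 for _ in range(100_000)]
noisy = [t if random.random() < p else random.random() < 0.5 for t in true_answers]
print(estimate_proportion(noisy, p))  # close to 0.3, but noisy for small samples
```

The error introduced by the privacy mechanism shrinks only like \(1/\sqrt{n}\) here, whereas in the central model it shrinks like \(1/n\): this gap is exactly why local-model deployments need very large user bases, high values of \(\varepsilon\), or both.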
Besides randomized response, the most famous example of this model is probably RAPPOR. This clever scheme was invented to collect differentially private data in Google Chrome. Another example, a bit more recent, is the mechanism that Apple uses to collect data on the iOS keyboard.
Distributed differential privacy: somewhere in between
The choice between central and local differential privacy can seem harsh: either accept a much larger level of noise, or collect raw data. That's why researchers looked for compromises, and came up with in-between solutions. These models are collectively called distributed differential privacy. They try to get the best of both worlds.
The common idea behind all these models is to add an intermediary step to data collection. The data is not sent directly from users to the aggregator. Instead, some distributed protocol is run, and the aggregator only sees its output.
There are a few different ways to implement this core idea.
- Some system architectures, like Encode-Shuffle-Analyze, introduce a new system component called a shuffler. It removes identifiers, groups similar data points, and returns them in a random order.
- Other systems use cryptographic protocols to aggregate data before the server sees it: the aggregator only learns, say, the sum of all values, not individual values (a toy sketch of this idea follows the list). This can be used, for example, as part of a federated learning pipeline.
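To illustrate the second approach, here is a toy sketch of secure aggregation based on additive secret sharing, assuming two non-colluding servers; the names and constants are mine, and real protocols (and real federated learning pipelines) are considerably more sophisticated.

```python
import random

MODULUS = 2**32  # all arithmetic is done modulo a large constant

def split_into_shares(value: int, num_servers: int) -> list[int]:
    """Split a user's value into random shares that sum to the value
    (mod MODULUS). Any strict subset of the shares looks uniformly
    random, so no single server learns anything about the value."""
    shares = [random.randrange(MODULUS) for _ in range(num_servers - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

# Three users, each splitting their private value between two servers.
values = [12, 7, 30]
shares_per_user = [split_into_shares(v, 2) for v in values]

# Each server adds up the shares it received: individually meaningless numbers.
server_totals = [sum(shares[i] for shares in shares_per_user) % MODULUS
                 for i in range(2)]

# Combining the two partial sums reveals only the total, never individual values.
print(sum(server_totals) % MODULUS)  # 49
```

In an actual distributed DP deployment, each user would typically also add a small amount of noise to their value before splitting it into shares, so that even the final sum satisfies differential privacy.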
Note that the communication between users and aggregator can go both ways. This is because some distributed DP mechanisms are interactive: the protocol requires several rounds of two-way communication between users and aggregator.
Not all distributed DP protocols have the same trust assumptions. Is the aggregator assumed to be honest-but-curious, or actively malicious? Are there multiple participants, and are they assumed not to collude? Do the guarantees only rely on standard cryptographic assumptions? Or do they only depend on the security of hardware components? Different implementations will have different answers to these questions.
Research on distributed DP explores the space of trade-offs between trust and accuracy. It sheds light on even more trade-offs: protocols must also keep their computational and bandwidth costs low. The many possibilities and challenges have led to a flourishing research area. Distributed DP holds a ton of potential for use cases where collecting raw data isn't a viable option!
Interested in learning more about differential privacy? Head over to the table of contents of this series to see its other posts. Or you can directly go to the next article in the series, which is somewhat paradoxical: it explores what it means for an algorithm to not be differentially private.