<p><small>Ted is writing things · <a href="https://desfontain.es/privacy/">desfontain.es/privacy</a></small></p>
<h1>Choosing things privately with the exponential mechanism</h1>
<p><em>Damien Desfontaines, 2023-10-09</em></p>
<p>A simple introduction to an essential building block for differential privacy: how to select a value among many.</p><p><small>
<span class='notlettrine'>T</span>his post is part of a <a href="friendly-intro-to-differential-privacy.html">series on differential
privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see the other
articles!</p>
<p></small></p>
<hr>
<p><span class='lettrine'>T</span><strong>he</strong> goal of most differential privacy
mechanisms is to publish <em>statistics</em>: numbers giving some information about
groups of people. But to build more complex DP mechanisms, we sometimes need a
different kind of building block. In this article, we won't be adding noise to
numbers. Instead, we'll make a <em>choice</em> among multiple options, in a
differentially private way.</p>
<h1 id="a-simple-example">A simple example</h1>
<p>Let's say that we're designing a poll to pick the best science-fiction books
published in 2020. First, we'll collect a big list of books published that year.
Then, we ask people to select the books they liked. Each person can pick as many
books as they want, and we want to select the book with the most votes. If there is
a tie, we select the winner randomly.</p>
<p>How do we publish this in a privacy-preserving way? Let's take a look at the
books with the most votes in our voting results.</p>
<p><img alt="A bar chart showing the number of votes on 6 books; the x-axis ends in a
dotted line, suggesting some books are not being shown. "Network Effect" has 50
votes, "Hench" and "The Hidden Girl" have 49 votes, "The Relentless Moon" has 47
votes, "Axiom's End" and "Riot Baby" have 46 votes." src="https://desfontain.es/privacy/images/book-votes.svg"></p>
<p>We can't simply publish the true answer without randomness. Otherwise, we would
be publishing the most popular book (here, <em>Network Effect</em>) 100% of the time.
But imagine that we add a single person, who only votes for one of the
second-most popular books.</p>
<p><img alt="The same bar chart as before, except "The Hidden Girl" now has 50 votes, and
its bar is a slightly darker shade of blue." src="https://desfontain.es/privacy/images/book-votes-tie.svg"></p>
<p>In that case, we would want to release either one or the other with equal
probability. This is a violation of differential privacy: 50% of the time, we
publish <em>The Hidden Girl</em>, and this couldn't happen without this added user!</p>
<p><img alt="A bar chart as before with the same books in the x-axis; now the y-axis is
labeled "Probability of being selected as winner". In light blue, legended
"Original data", there is only one bar, at probability 1, for "Network Effect".
In dark blue, legended "With one added vote", there are two bars at probability
0.5: one for "Network Effect" and one for "The Hidden Girl". The bars for "The
Hidden Girls" are labeled 0 and 0.5, in red, and circled; red text above the
circle says "DP violation!"." src="https://desfontain.es/privacy/images/book-selection-dp-violation.svg"></p>
<p>To avoid this, we need to add more randomness in our process. How do we do that
exactly?</p>
<p>Well, we already know how to publish <a href="differential-privacy-in-practice.html#many-things">histograms</a> with differential privacy:
we add well-calibrated Laplace noise to each of the statistics. So why don't we
just do that? If we make the entire histogram private, we can release all of it.
In particular, we can see which book has the highest noisy vote count, and
declare it the winner.</p>
<p>But we have a problem here. The noise has to be scaled by the number of
statistics that each user can contribute to. In our setting, a particularly
enthusiastic user could vote for <em>all the books</em>. If our list has 10,000 books
to choose from, then we have to multiply the noise scale by 10,000. This seems…
not great. And it also feels unnecessary: we don't <em>want</em> to release the entire
histogram, we only want to pick a winner. Could we use that fact to inject less
noise into the process?</p>
<p>Let's try to think about what a good strategy would look like. Say we are using
<span class="math">\(\varepsilon=\ln(2)\)</span>. Let's look at the votes for the most popular books again.</p>
<p><img alt="The same bar chart as the first one, showing the number of votes on 6 books.
The x-axis ends in a dotted line, suggesting some books are not being shown.
"Network Effect" has 50 votes, "Hench" and "The Hidden Girl" have 49 votes, "The
Relentless Moon" has 47 votes, "Axiom's End" and "Riot Baby" have 46
votes." src="https://desfontain.es/privacy/images/book-votes.svg"></p>
<p>The true winner is <em>Network Effect</em>. So we want to select this true answer with
some probability, hopefully as high as possible.</p>
<p><img alt="A bar chart showing the probability distribution of the book selected as a
winner. It has a single bar for "Network Effect", all other bars are empty and
replaced by question marks." src="https://desfontain.es/privacy/images/book-selection-single-probability.svg"></p>
<p>Now, what is the probability of selecting one of the second-best choices? They
both have one fewer vote than the winner. They're <em>not</em> the correct answer, so
we want to select them with as small a probability as possible. But we're also
constrained by our differential privacy guarantee.</p>
<p>Imagine that we add a single new person to the data, and they vote for <em>Hench</em>.
Then, we should be selecting <em>Network Effect</em> and <em>Hench</em> with equal
probability.</p>
<p><img alt="The same bar chart as earlier, with an additional grey & dotted-line bar for
"Hench", of equal height to the one for "Network
Effect"." src="https://desfontain.es/privacy/images/book-selection-hypothetical.svg"></p>
<p>With DP, we must select <em>Hench</em> with similar probability as in this hypothetical
scenario. How similar? We chose <span class="math">\(e^\varepsilon=2\)</span>, so there can be a factor of
at most 2 between these probabilities. We want it to be as small as possible, so
let's make it exactly half.</p>
<p><img alt="The same bar chart as earlier, with an additional blue bar for "Hench", whose
height is half that of the grey bar. A red arrow labeled "Selection probability
divided by e^ε = 2" goes from the grey bar to the blue
bar." src="https://desfontain.es/privacy/images/book-selection-two-probabilities.svg"></p>
<p>The exact same reasoning holds for <em>The Hidden Girl</em>, which has as many votes as
<em>Hench</em>.</p>
<p><img alt="A bar chart showing the probabilities of the three books with most votes.
"Network effect" has the biggest bar, "Hench" and "The Hidden Girl" each have a
bar whose height is half, all other books have a question
mark." src="https://desfontain.es/privacy/images/book-selection-three-probabilities.svg"></p>
<p>What about one of the books that is a little further away from the winner? <em>The
Relentless Moon</em>, for example, is 3 votes short: we would need three more votes
to get to the winning probability. If we can add one person at a time, we need
<em>three steps</em> to get there.</p>
<p><img alt="The number of votes for each book; with three arrows on top of the bar for
"The Relentless Moon", labeled "3 more votes needed to reach the
winner"" src="https://desfontain.es/privacy/images/book-votes-three-steps.svg"></p>
<p>And each time we add one person, we have to respect the differential privacy
constraint: we can at most double the probability of selecting <em>The Relentless
Moon</em>. To arrive there after these three steps, we need to start from at least
<span class="math">\(\left(\frac{1}{2}\right)^3=1/8\)</span> of the maximum probability.</p>
<p><img alt="A bar chart showing the probabilities of the four books with most votes.
"Network effect" has the biggest bar, "Hench" and "The Hidden Girl" each have a
bar whose height is half, "The Relentless Moon" has a bar whose height is 8
times smaller, the other two books have a question
mark." src="https://desfontain.es/privacy/images/book-selection-four-probabilities.svg"></p>
<p>We can repeat this idea and draw our full probability distribution. For each
book, if it's <span class="math">\(k\)</span> votes short of the true winner, its selection probability
should be <span class="math">\(1/2^k\)</span> of the true winner.</p>
<p><img alt="A bar chart showing the probabilities of all the books displayed on the chart.
"Network effect" has the biggest bar, "Hench" and "The Hidden Girl" each have a
bar whose height is half, "The Relentless Moon" has a bar whose height is 8
times smaller, "Axiom's End" and "Riot Baby" each have a bar whose height is 16
times smaller." src="https://desfontain.es/privacy/images/book-selection-all-probabilities.svg"></p>
<p>Achieving this for an arbitrary <span class="math">\(\varepsilon\)</span> is straightforward: the
probability of selecting a book <span class="math">\(i\)</span> with <span class="math">\(k_i\)</span> votes should be proportional to
<span class="math">\(\exp\left(\varepsilon \cdot k_i\right)\)</span>. After normalizing to make the
probabilities sum to <span class="math">\(1\)</span>, we get the following formula:</p>
<div class="math">$$
\mathbb{P}\left[\text{We choose book }i\right] =
\frac{\exp\left(\varepsilon \cdot k_i\right)}{\sum_j \exp\left(\varepsilon \cdot k_j\right)}.
$$</div>
<p>We call this DP procedure the <em>exponential mechanism</em>.</p>
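<p>Here is a minimal sketch of this procedure in Python. The vote counts are the hypothetical ones from the charts above, and <code>exponential_mechanism</code> is an illustrative name, not a function from any particular library:</p>

```python
import math
import random

def exponential_mechanism(scores, epsilon):
    """Pick a key with probability proportional to exp(epsilon * score).

    This matches the formula above, which assumes vote-count-like scores
    whose sensitivity is 1.
    """
    # Subtracting the max score before exponentiating avoids overflow;
    # it only rescales all weights, so the distribution is unchanged.
    top = max(scores.values())
    weights = {k: math.exp(epsilon * (s - top)) for k, s in scores.items()}
    total = sum(weights.values())
    r = random.uniform(0, total)
    for book, w in weights.items():
        r -= w
        if r <= 0:
            return book
    return book  # guard against floating-point rounding

votes = {"Network Effect": 50, "Hench": 49, "The Hidden Girl": 49,
         "The Relentless Moon": 47, "Axiom's End": 46, "Riot Baby": 46}
winner = exponential_mechanism(votes, epsilon=math.log(2))
```

<p>With ε = ln(2), a book that is k votes short of the leader is selected with a probability 2<sup>k</sup> times smaller than the leader's, matching the charts above.</p>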
<h1 id="a-generic-statement-and-a-simple-optimization">A generic statement and a simple optimization</h1>
<p>The example above is very simple: each book is simply associated with the number
of votes it received. But the exponential mechanism is much more generic, and we
can use it in more complex settings. Let's say we have a database <span class="math">\(D\)</span>, and we
have to choose between many items <span class="math">\(O_1\)</span>, <span class="math">\(O_2\)</span>, and so on. We assume that each
item <span class="math">\(O_i\)</span> has a <em>score</em> <span class="math">\(s_i(D)\)</span>, which depends on the dataset. Let <span class="math">\(\Delta\)</span> be
the <em>sensitivity</em> of the scoring function: the maximum change to <span class="math">\(s_i(D)\)</span> when
one person is added to (or removed from) <span class="math">\(D\)</span>, for all <span class="math">\(i\)</span>. Then the exponential
mechanism <span class="math">\(\mathcal{M}\)</span> is defined as:</p>
<div class="math">$$
\mathbb{P}\left[\mathcal{M}(D) = O_i\right] =
\frac{\exp\left(\varepsilon\cdot\frac{s_i(D)}{2\Delta}\right)}{\sum_j \exp\left(\varepsilon\cdot\frac{s_j(D)}{2\Delta}\right)}.
$$</div>
<p>Proving that it satisfies ε-DP is very easy — the proof in the <a href="http://kunaltalwar.org/papers/expmech.pdf">original paper</a>
is just 3 lines long! Try to come up with it by yourself. Or you can also click
here: <button id="toggleProof1"></button></p>
<div id="proof1" style="display: none; border-left: double; padding-left: 10px">
<p>If you add or remove a single person from <span class="math">\(D\)</span>, the numerator of the fraction
above changes by at most a factor of <span class="math">\(e^{\varepsilon/2}\)</span>. Each term in the
denominator does the same, so the denominator also changes at most by the same
factor. So the entire probability changes by a factor of at most
<span class="math">\(e^\varepsilon\)</span>, which is exactly what we want to satisfy <span class="math">\(\varepsilon\)</span>-DP.</p>
</div>
<p>In our example above, the <span class="math">\(O_i\)</span> are the books, and the score of each book is its
number of votes. Adding or removing one person modifies the scores by at most
one, so <span class="math">\(\Delta=1\)</span>, and we get the same…</p>
<p>Wait a second. We're not getting the same thing! There's a factor of <span class="math">\(2\)</span> in the
generic formula that we didn't have in our voting example. So if we were using
the generic formula, we would get worse utility: the probability of selecting
the winner would be smaller. Can we get rid of this multiplicative factor?</p>
<p>The answer is yes, because our scores are <em>monotonic</em>: adding a user can
only make scores larger (or leave them unchanged), never smaller, and removing a
user can only make them smaller or leave them unchanged. That's a common special
case, and in that case, you can remove the <span class="math">\(2\)</span> factor:</p>
<div class="math">$$
\mathbb{P}\left[\mathcal{M}(D) = O_i\right] =
\frac{\exp\left(\varepsilon\cdot\frac{s_i(D)}{\Delta}\right)}{\sum_j \exp\left(\varepsilon\cdot\frac{s_j(D)}{\Delta}\right)}.
$$</div>
<p>Again, this is quite easy to prove, especially if you've understood the previous
proof. <button id="toggleProof2"></button></p>
<div id="proof2" style="display: none; border-left: double; padding-left: 10px">
<p>If you add or remove a single person from <span class="math">\(D\)</span>, the numerator of the fraction
above changes by at most a factor of <span class="math">\(e^{\varepsilon}\)</span>. Each term in the
denominator will do the same, <em>in the same direction</em>: it's impossible for the
numerator to increase and for the denominator to decrease. So the maximum
multiplicative change to the probability is <span class="math">\(e^\varepsilon\)</span>.</p>
</div>
<p>Note that here, we assumed that we're protecting the addition or removal of a
single person in the dataset. If we want to protect <em>any change</em> in a single
person's votes instead, the privacy analysis changes: someone could add one vote
to a book and remove a vote from another. The scores would no longer be monotonic,
and we would need to pay the <span class="math">\(2\)</span> factor in our formula.</p>
<h1 id="more-results">More results?!</h1>
<p>The exponential mechanism is a central building block in differential privacy.
It's been studied from many different angles, so there is a lot to say about it.
This blog post is long enough already, but here are a few ✨ selected facts ✨.
Follow the links if you'd like to learn more!</p>
<ul>
<li>The exponential mechanism can be <a href="https://timvieira.github.io/blog/post/2014/07/31/gumbel-max-trick/">implemented</a> in a simple way:
add noise from a <a href="https://en.wikipedia.org/wiki/Gumbel_distribution">Gumbel distribution</a> to each score, and choose
the item with the highest noisy score.</li>
<li>Its privacy guarantees can be finely analyzed using a notion called <a href="https://differentialprivacy.org/exponential-mechanism-bounded-range/">bounded
range</a>. This allows you to prove that an exponential mechanism
calibrated for <span class="math">\(\varepsilon\)</span>-DP also satisfies <span class="math">\(\rho\)</span>-<a href="renyi-dp-zero-concentrated-dp.html">zCDP</a> with
<span class="math">\(\rho=\frac{1}{8}\varepsilon^2\)</span>: a lot better than the typical conversion of
<span class="math">\(\rho=\frac{1}{2}\varepsilon^2\)</span>.</li>
<li>Using the exponential mechanism several times? Don't use regular composition
theorems! Instead, using the special structure of this mechanism can get you
<a href="https://arxiv.org/abs/1909.13830">tighter results</a>.</li>
<li>You can do better than the exponential mechanism, with a mechanism called
<em>Permute-and-Flip</em>. Its <a href="https://arxiv.org/abs/2010.12603">original definition</a> is somewhat complicated,
but people found a <a href="https://arxiv.org/abs/2105.07260">nice characterization</a> afterwards: add noise
from a <a href="https://en.wikipedia.org/wiki/Geometric_distribution">geometric distribution</a> to each score, and pick the highest
noisy score.</li>
<li>However, the exponential mechanism still retains one advantage: it can also be
used when the space of possible choices is <em>continuous</em>, like "every real
number between 0 and 1"<sup id="fnref:continuous"><a class="footnote-ref" href="#fn:continuous">1</a></sup>. This is very useful, for example to
compute the median of values in a dataset.</li>
</ul>
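<p>The Gumbel-noise implementation from the first bullet above can be sketched as follows. This is a toy illustration, not any particular library's API; the noise scale Δ/ε is for the monotonic case, and 2Δ/ε would be used otherwise:</p>

```python
import math
import random

def gumbel_max(scores, epsilon, sensitivity=1.0):
    """Sample from the exponential mechanism by adding Gumbel noise of
    scale sensitivity/epsilon to each score and taking the argmax.

    (For non-monotonic scores, the scale is 2*sensitivity/epsilon.)
    """
    scale = sensitivity / epsilon
    noisy = {}
    for item, score in scores.items():
        # Standard Gumbel noise via inverse CDF sampling.
        u = random.random() or 1e-300  # avoid log(0)
        noisy[item] = score + scale * (-math.log(-math.log(u)))
    return max(noisy, key=noisy.get)
```

<p>Taking the argmax of <code>score + (Δ/ε)·Gumbel</code> selects each item with probability proportional to exp(ε·score/Δ), which is exactly the distribution of the exponential mechanism.</p>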
<hr>
<p><small>
I'm thankful to Daniel Simmons-Marengo and PeoriaBummer for helpful feedback on this post.
</small></p>
<script type="text/javascript">
var defaultButton = 'Show me the proof';
var button1 = document.getElementById('toggleProof1');
button1.innerHTML = defaultButton
button1.addEventListener('click', function (event) {
button1.innerHTML = button1.innerHTML == defaultButton ? 'Hide the proof' : defaultButton;
proof1 = document.getElementById('proof1');
proof1.style.display = proof1.style.display == 'none' ? 'block' : 'none';
});
var button2 = document.getElementById('toggleProof2');
button2.innerHTML = defaultButton
button2.addEventListener('click', function (event) {
button2.innerHTML = button2.innerHTML == defaultButton ? 'Hide the proof' : defaultButton;
proof2 = document.getElementById('proof2');
proof2.style.display = proof2.style.display == 'none' ? 'block' : 'none';
});
</script>
<div class="footnote">
<hr>
<ol>
<li id="fn:continuous">
<p>In that case, the <em>density</em> of the probability distribution on
<span class="math">\(x\)</span> must be proportional to
<span class="math">\(\exp\left(\varepsilon\cdot\frac{s_x(D)}{2\Delta}\right)\)</span>. The normalization
factor (in the denominator) also becomes an integral instead of a discrete
sum. <a class="footnote-backref" href="#fnref:continuous" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
<h1>Mapping privacy-enhancing technologies to your use cases</h1>
<p><em>Damien Desfontaines, 2023-05-04 (updated 2023-09-23)</em></p>
<p>A guide listing common privacy-enhancing technologies, and explaining which problem each one solves.</p><p><small>
<span class='notlettrine'>T</span>his post is part of a <a href="friendly-intro-to-differential-privacy.html">series on differential
privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see the other
articles!</p>
<p> This article was first published on the <a href="https://www.tmlt.io/resources/mapping-privacy-enhancing-technologies-to-your-use-cases">Tumult Labs blog</a>; its
copyright is owned by Tumult Labs.</p>
<p></small></p>
<hr>
<p>Say you're working on a new project involving sensitive data — for example,
adding a new feature to a healthcare app. This feature brings new privacy
concerns that you're trying to grapple with. Maybe your lawyers aren’t feeling
great about the compliance story of the app you're building. Maybe you want to
make strong statements to users of the feature, about how you will handle their
data. Maybe you’re afraid that sensitive user data might leak in unexpected
ways. You’ve been hearing about advances in privacy technologies, and you
wonder: should I look into one of those to see if it could solve my problem?</p>
<p>You've come to the right place. In this blog post, we'll walk you through a few
key data handling use cases, each involving significant privacy challenges.
We'll then map various privacy-enhancing technologies (PETs) to those use cases.
Spoiler alert: the overall map of use cases and PETs will look like this:</p>
<p><center>
<img alt="A diagram describing multiple data operations, and the associated
privacy-enhancing technologies. First, multiple people have arrows pointing to a
database, this is labeled "Collecting data privately: secure aggregation, local
differential privacy". Multiple databases have arrows pointing towards the same
place, this is labeled "Joining data privately: multi-party computation,
confidential computing". These arrows are pointing towards a graphical
representation of a computation, this is labeled "Computing on data privately:
homomorphic encryption, confidential computing". Finally, an arrow goes from
this visual computation towards a stylized graph; this is labeled "Sharing data
privately: differential privacy"." src="https://desfontain.es/privacy/images/pets-diagram.png">
</center></p>
<p>To better understand these challenges, we will make the adversarial model
explicit as we discuss each use case. This means answering two questions:</p>
<ul>
<li>Who has access to the raw, privacy-sensitive data?</li>
<li>Who are we protecting against; who must not be able to access the raw data?</li>
</ul>
<p>In each diagram, we will label the entities with access to the data with a ✅,
and the adversaries with a ❌.</p>
<p>Let’s go through each of these categories of use cases one by one.</p>
<h1 id="collecting-data-privately">Collecting data privately</h1>
<p>For this use case, your goal is to <em>collect</em> data from individual users of your
app. For example, let's say that you want to measure some metric related to
health information among your user base. But there's a catch: you don’t want to
collect personal data. Instead, you want to be able to tell your users: "I am
not collecting data about you — I am only learning information about large
groups of users."</p>
<p><center>
<img alt="A diagram representing private data collection. Four smiley faces representing
users are on the left, and each has an arrow pointing to a database icon on the
right. Green check marks are next to each smiley face, and a "forbidden" sign is
next to the database icon. The diagram is labeled "Collecting data privately:
secure aggregation, local differential
privacy"" src="https://desfontain.es/privacy/images/collecting-data-privately.png">
</center></p>
<p>The adversarial model is as follows.</p>
<ul>
<li>Only individual users have access to their own raw data.</li>
<li>You — the organization collecting the data — must not be able to see
individual data points.</li>
</ul>
<p>Note that in the diagram above, each user has access to their own data, but
presumably not the data from other users.</p>
<p>Can you still learn something about aggregate user behavior in this context?
Perhaps surprisingly, the answer is yes! There are two main privacy technologies
that can address this use case.</p>
<ul>
<li><em>Secure aggregation</em><sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup> consists in hiding each individual value using
cryptographic techniques. These encrypted data points are then combined to
compute the aggregate result.</li>
<li><em><a href="local-global-differential-privacy.html">Local differential privacy</a></em> consists in adding random noise to
each individual data point. This noise hides the data of each person… but
combining many data points can still reveal larger trends.</li>
</ul>
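<p>To make local differential privacy concrete, here is a sketch of <em>randomized response</em>, one of the simplest local DP mechanisms, for a yes/no question. It is an illustrative example, not necessarily what any given system deploys:</p>

```python
import math
import random

def randomized_response(true_bit, epsilon):
    """Report the true bit with probability e^eps / (e^eps + 1),
    and the flipped bit otherwise. Each report is eps-locally-DP."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return true_bit if random.random() < p_truth else 1 - true_bit

def estimate_proportion(reports, epsilon):
    """Debias the noisy reports to estimate the true fraction of 1s,
    using E[report] = q*(2p-1) + (1-p) where q is the true fraction."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)
```

<p>No single report says much about one person, but averaging many reports recovers the population-level trend; the noise in each report is the accuracy cost that secure aggregation helps reduce.</p>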
<p>Both technologies can work together, and complement each other well. Local
differential privacy provides formal guarantees on the output, at a heavy cost
in accuracy. But combining it with secure aggregation can avoid most of this
accuracy cost, and boost utility while preserving strong guarantees.</p>
<p><em>Federated learning</em> is a common use case for these techniques. With this
machine learning technique, model training happens on each user’s device. This
can be better for privacy than sending the raw data to a central server… but
model updates from each user can still leak sensitive information! Using secure
aggregation and/or local differential privacy mitigates this risk.</p>
<h1 id="computing-on-data-privately">Computing on data privately</h1>
<p>For this use case, your goal is to have a partner <em>run computations</em> on your
sensitive data, but hide the data from this partner. For example, in our
healthcare app story, let’s say you collected some sensitive data through the
app. A partner company has built a prediction model that you want to use on this
data. You want them to run their model on your data, but you don’t want them to
be able to access your data directly.</p>
<p><center>
<img alt="A diagram representing private data computation. A database icon is on the
left, and an arrow goes from it to an icon representing a computation, on the
right. A green check mark is under the database icon, and a "forbidden" sign is
below the computation sign. The diagram is labeled "Computing on data privately:
homomorphic encryption, confidential
computing"." src="https://desfontain.es/privacy/images/computing-on-data-privately.png">
</center></p>
<p>The adversarial model is as follows.</p>
<ul>
<li>You – the institution collecting the data – have access to the sensitive data.</li>
<li>The organization performing the computation must not be able to access this data.</li>
</ul>
<p>Two main technologies address this use case.</p>
<ul>
<li><em>Homomorphic encryption</em> consists in encrypting the data before performing the
computation. The organization must adapt its computation to work on encrypted
data. Then, they send you back the result in encrypted form, and you can
decrypt it to see the result.</li>
<li><em>Confidential computing</em><sup id="fnref2:2"><a class="footnote-ref" href="#fn:2">2</a></sup> is a hardware-based approach to encrypt data
while in use. It can be combined with remote attestation: this technique
allows you to verify that only the code that you have approved is running on
your data.</li>
</ul>
<p>The guarantee offered by homomorphic encryption is stronger: you do not need to
trust that the hardware is correctly secured. However, these stronger guarantees
come at a cost: homomorphic encryption often has a very large performance
overhead.</p>
<h1 id="joining-data-privately">Joining data privately</h1>
<p>For this use case, your goal is to <em>combine</em> your data with the data from other
organizations. For example, in our healthcare app, you might want to count how
many of your users also use another app, made by a different company. Or you
want to measure correlations between metrics in both apps. But like before, you
don’t want anybody else accessing your data directly. And you don’t want to see
the data from the other organizations, either!</p>
<p><center>
<img alt="A diagram representing private data joins. Three database icons are on the
left, the first one being larger than others. Arrows point from each database
icon to a "computation" icon. A green check mark is under the first bigger
database icon; forbidden signs are below the other two databases, and the
computation icon. The diagram is labeled "Joining data privately: multi-party
computation, confidential
computing"." src="https://desfontain.es/privacy/images/joining-data-privately.png">
</center></p>
<p>The adversarial model is as follows.</p>
<ul>
<li>You have access to your sensitive data (and only yours).</li>
<li>Other organizations must not be able to access it. The platform running the
computation (if any) must also not be able to access it.</li>
</ul>
<p>What about the output of the computation — who can access it? It depends. In
some cases, all participating organizations can access the results. In others,
only some organizations can see them.</p>
<p>There are two main technologies that address this use case.</p>
<ul>
<li><em>Secure multi-party computation</em> consists in each participant first encrypting
their own data. Then, participants use a cryptographic protocol to compute the
metric of interest.</li>
<li><em>Confidential computing</em><sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup> uses hardware modules to encrypt data while in use.
Like before, it works best when combined with remote attestation: then, every
participant can verify that only approved code is running on their data.</li>
</ul>
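<p>For a flavor of how secure multi-party computation can work, here is a toy sketch of <em>additive secret sharing</em>, one classic building block. Real protocols are far more involved, and the modulus below is an arbitrary illustrative choice:</p>

```python
import random

MODULUS = 2**61 - 1  # illustrative; any sufficiently large modulus works

def share(secret, n=3):
    """Split a secret into n random-looking additive shares.
    Any n-1 of the shares together reveal nothing about the secret."""
    shares = [random.randrange(MODULUS) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    """Recover the secret by summing all shares modulo MODULUS."""
    return sum(shares) % MODULUS

def add_shares(a, b):
    """Parties add their shares pointwise: the result is a sharing of
    the sum, computed without anyone seeing the others' inputs."""
    return [(x + y) % MODULUS for x, y in zip(a, b)]
```

<p>Each party holds one share of each input; only when all shares of the final result are combined does the joint answer appear.</p>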
<p>Note that these techniques are sometimes not enough to protect the original
data: the result of the computation can in itself leak something sensitive about
the data! And this is the perfect transition for our next use case…</p>
<h1 id="sharing-data-privately">Sharing data privately</h1>
<p>Finally, for this use case, your goal is to analyze your data, and share some
insights about it. Here, sharing can mean very different things.</p>
<ul>
<li><em>Internal sharing</em>: employees from another department of your organization
might want to use your app metrics to inform the design of a different
product. However, sharing personal data would require explicit consent in your
privacy policy: your compliance story requires that you correctly anonymize
metrics, even for internal use.</li>
<li><em>External sharing</em>: researchers from a partner university might want to use
data from your app for a scientific study. Your goal is to share insights with
them, without allowing them to see individual information.</li>
<li><em>Publication</em>: you might want to show some aggregated metrics in the app
itself as part of a feature. In this case, all users of your app can see these
metrics: it’s critical that they don't inadvertently reveal private
information. </li>
</ul>
<p>Removing identifiers is, of course, <a href="trustworthy-anonymization.html">not enough to mitigate privacy
risk</a>. How do you enable such use cases without revealing individual
information?</p>
<p><center>
<img alt="A diagram representing differentially private sharing. A database icon is on
the left, an arrow points to a stylized bar chart and line chart. A green check
mark is below the database icon, a "forbidden" sign is below the chart icon. The
diagram is labeled "Sharing data privately: differential
privacy"." src="https://desfontain.es/privacy/images/joining-data-privately.png">
</center></p>
<p>The adversarial model is as follows.</p>
<ul>
<li>You have access to the sensitive raw data.</li>
<li>People who can see the shared data cannot use it to learn information about
individuals.</li>
</ul>
<p>There is one main technology that addresses this use case. If you're reading
this <a href="friendly-intro-to-differential-privacy.html">blog post series</a>, you certainly know what it is: differential
privacy<sup id="fnref:3"><a class="footnote-ref" href="#fn:3">3</a></sup>. It adds statistical noise to aggregated information and provides
strong privacy guarantees. You can use differential privacy for different kinds
of data releases:</p>
<ul>
<li>statistics or other aggregated analyses on the original dataset;</li>
<li>machine learning models trained on the sensitive data;</li>
<li>or synthetic data, which has the same format as the original data.</li>
</ul>
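<p>As a concrete illustration, here is a minimal sketch of the core idea in plain
Python (not any particular DP library): releasing a count with Laplace noise. A
count changes by at most 1 when one person is added or removed (sensitivity 1),
so noise with scale <span class="math">\(1/\varepsilon\)</span> is enough.</p>

```python
import math
import random

def dp_count(records, epsilon: float) -> float:
    """Release a differentially private count of the records.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon suffices.
    """
    scale = 1.0 / epsilon
    # Sample centered Laplace noise via inverse-CDF sampling.
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return len(records) + noise

# A noisy count of 100 records: close to 100, yet no individual is revealed.
print(dp_count(["person"] * 100, epsilon=1.0))
```

<p>Production releases involve more care than this sketch (floating-point-safe
noise sampling, budget accounting across many statistics), which is what
dedicated DP libraries handle.</p>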
<p>This is what my colleagues and I at <a href="https://tmlt.io">Tumult Labs</a> are focusing on,
building <a href="https://tmlt.dev">open-source software</a> and providing solutions tailored to
our customers' needs. <a href="http://tmlt.io/connect">Reach out</a> if that sounds like something you
could use!</p>
<h1 id="final-comments">Final comments</h1>
<p>Handling sensitive data comes with many challenges. In this blog post, I've
listed a few major use cases, and the privacy technologies that address them. I
omitted some other privacy-enhancing technologies, for two distinct reasons.</p>
<ul>
<li>Some approaches for the use cases we’ve seen do not provide any robust privacy
guarantee. For example, some providers address the "joining data privately"
use case without provable guarantees: instead, they simply present themselves
as trusted third-parties. The situation is similar for the "sharing data
privately" use case: some providers focus on ad hoc anonymization techniques.
These do not make it possible to formally quantify privacy risk, and often
<a href="trustworthy-anonymization.html">fail in practice</a>.</li>
<li>Some technologies address more niche or infrequent use cases. For example,
<em>zero-knowledge proofs</em> are mainly useful in cryptocurrency/blockchain
applications. <em>Private information retrieval</em> lets clients query a database
without the server learning which records they are accessing. And there are
others: privacy technology is a big space, with
constant innovation.</li>
</ul>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>Sometimes called <em>federated analytics</em>. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>The term "confidential computing" has several synonyms and related
concepts.</p>
<ul>
<li><em>Trusted execution environments</em> refer to the hardware modules used in
confidential computing.</li>
<li><em>Trusted computing</em> uses the same kind of hardware modules as confidential
computing. But in trusted computing, the context is different: end users,
rather than organizations, are running the hardware module on their
devices. Digital rights management is a common use case for this setting.</li>
<li><em>Data cleanroom</em> is a more generic term for confidential computing: it
also includes more ad hoc solutions that do not use trusted hardware
modules.</li>
</ul>
<p><a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a><a class="footnote-backref" href="#fnref2:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>Which, here, is used as a shortcut for <a href="local-global-differential-privacy.html"><em>central differential
privacy</em></a>. This isn't the most explicit terminology, but it is common in
practice. <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
</ol>
</div>What anonymization techniques can you trust?2023-03-10T00:00:00+01:002023-03-10T00:00:00+01:00Damien Desfontainestag:desfontain.es,2023-03-10:/privacy/trustworthy-anonymization.html<p>An overview of legacy techniques used to anonymize data, how they fail, and what we can learn from these failures.</p><p><small>
<span class='notlettrine'>T</span>his post is part of a <a href="friendly-intro-to-differential-privacy.html">series on differential
privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see the other
articles!</p>
<p> This article was first published on the <a href="https://www.tmlt.io/resources/what-anonymization-techniques-can-you-trust">Tumult Labs blog</a>; its
copyright is owned by Tumult Labs.</p>
<p></small></p>
<hr>
<p>Let's say that we have some sensitive data, for example about people visiting a
hospital. We would like to share it with a partner in an anonymous way: the goal
is to make sure that the released data does not reveal anything about any one
individual. What techniques are available for this use case?</p>
<h1 id="randomize-identifiers">Randomize identifiers</h1>
<p>Obviously, if we leave names, or public user identifiers in our data (like
people's telephone numbers or email addresses), then that's not going to be
anonymous. So here is a first idea: let's hide this information! By replacing
e.g. names with random numbers, identities are no longer obvious. This is called
<em>pseudonymization</em> (or sometimes <em>tokenization</em>): identifiers are replaced with
<em>pseudonyms</em> (or <em>tokens</em>). These pseudonyms are consistent: the same original
identity is always replaced by the same pseudonym.</p>
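<p>In code, pseudonymization is often implemented with a keyed hash, so that the
same identity always yields the same pseudonym. A sketch, with a made-up key and
identifiers:</p>

```python
import hashlib
import hmac

SECRET_KEY = b"example-secret-key"  # hypothetical; must be kept secret

def pseudonymize(identifier: str) -> str:
    """Replace an identifier with a consistent pseudonym.

    HMAC with a secret key (rather than a plain hash) prevents anyone
    without the key from recomputing or brute-forcing the mapping.
    """
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

# The same original identity always maps to the same pseudonym.
assert pseudonymize("Taylor Lewis") == pseudonymize("Taylor Lewis")
assert pseudonymize("Taylor Lewis") != pseudonymize("Jordan Smith")
```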
<p><center>
<img alt="A diagram showing the process of randomizing identifiers. A table has headers
Name, Date of birth, ZIP code, Visit date, and Visit reason; a single row has
values Taylor Lewis, 1987, 14217, 2022-03-30, and Flu. An arrow goes from this
table to another table with the same columns headers and values, except the Name
column has been replaced with Pseudonym, and the pseudonym value is a string of
numbers." src="https://desfontain.es/privacy/images/randomizing-identifiers.png">
</center></p>
<p>Unfortunately, "no longer obvious" is <em>very</em> different from "impossible to
figure out". Randomizing identifiers often fails to protect the privacy of the
people in the data. This can happen because the randomization process itself is
insecure. A good example is the New York taxi database data. The randomization
process was done in a naive way… and this allowed researchers to
<a href="https://www.theguardian.com/technology/2014/jun/27/new-york-taxi-details-anonymised-data-researchers-warn">reverse-engineer license plates</a> from pseudonyms.</p>
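<p>The core problem with naive, unkeyed hashing is that a small identifier space
can be enumerated exhaustively. A toy sketch, using a made-up four-character
plate format:</p>

```python
import hashlib
import string

def naive_pseudonym(plate: str) -> str:
    # Unkeyed hashing, similar in spirit to the flawed taxi data release.
    return hashlib.md5(plate.encode()).hexdigest()

# An attacker simply hashes every possible plate ("AA00" through "EE99" in
# this toy format) and inverts every pseudonym with a lookup table.
lookup = {
    naive_pseudonym(a + b + c + d): a + b + c + d
    for a in "ABCDE"
    for b in "ABCDE"
    for c in string.digits
    for d in string.digits
}
print(lookup[naive_pseudonym("AC37")])  # AC37: the pseudonym is reversed
```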
<p>But there is a more fundamental reason why such schemes are unsafe: it's
impossible to know for sure what can be used to re-identify someone. Direct
identifiers are not the only thing that can be used to find out someone's
identity. A famous example is the release of AOL search queries. AOL data
scientists randomized all the identifiers. But the data itself was problematic:
what you search for reveals a lot about you! <a href="https://www.nytimes.com/2006/08/09/technology/09aol.html">It only took a few days for
journalists to reidentify people</a>, using only their search queries.</p>
<p>Even worse, otherwise-innocuous data can become identifying when combined with
additional information. The Netflix Prize dataset provides a striking example of
this fact. Netflix published pseudonymized data containing only movie ratings.
These do not seem identifying… and yet, researchers could <a href="https://www.wired.com/2007/12/why-anonymous-data-sometimes-isnt/">combine them with
public reviews</a> and recover users' identities. </p>
<h1 id="remove-identifiers-altogether">Remove identifiers altogether</h1>
<p>If pseudonymization doesn't work, what about <em>de-identification</em>? Instead of
replacing direct identifiers with random numbers, we could redact them
altogether. This technique, sometimes called <em>masking</em>, is very common.</p>
<p><center>
<img alt="The same diagram as previously, except the Name column is also present in the
second table, but the value for this column has been replaced by
"(REDACTED)"." src="https://desfontain.es/privacy/images/removing-identifiers.png">
</center> </p>
<p>Unfortunately, masking provides little extra protection. The previous problem
still applies: how can we know what information to redact and what to keep? Time
and time again, data owners underestimate the reidentifiability of their data.</p>
<p>The Massachusetts state government gave us a first example of this phenomenon.
In the 1990s, they released medical data about hospital visits, with names
redacted. But this patient data contained key demographic information: ZIP
codes, dates of birth, and sex. And these are enough to identify a large
fraction of the population! <a href="https://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-ruin/">Including the then-governor of
Massachusetts…</a> More than a little embarrassing. With more
demographic attributes, reidentification risk climbs as high as
<a href="https://techcrunch.com/2019/07/24/researchers-spotlight-the-lie-of-anonymous-data/">99.98%</a>.</p>
<blockquote>
<p>"Isn't this sort of obvious?" – <a href="https://www.wired.com/2007/12/why-anonymous-data-sometimes-isnt/">Wired, 2007</a></p>
</blockquote>
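<p>The mechanics of this kind of attack are easy to demonstrate. With fake data
mimicking the Massachusetts release, a handful of demographic columns already
singles out every record:</p>

```python
from collections import Counter

# Fake "de-identified" hospital records: names removed, demographics kept.
records = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "flu"},
    {"zip": "02138", "dob": "1962-01-12", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "dob": "1945-07-31", "sex": "M", "diagnosis": "diabetes"},
]

combos = Counter((r["zip"], r["dob"], r["sex"]) for r in records)

# Every (ZIP, date of birth, sex) combination appears exactly once: anyone who
# knows a target's demographics (say, from a voter roll) finds their diagnosis.
assert all(count == 1 for count in combos.values())
```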
<p>A lot of data turns out to be identifying, besides demographic information.
<a href="https://archive.nytimes.com/bits.blogs.nytimes.com/2015/01/29/with-a-few-bits-of-data-researchers-identify-anonymous-people/">Credit card metadata</a>, <a href="http://edition.cnn.com/2013/03/26/tech/mobile/mobile-gps-privacy-study/">location information</a>, or <a href="https://www.sciencenews.org/article/ai-identify-anonymous-data-phone-neural-network">social
interactions</a> can be just as revealing. The problem is profound: there
is no way to know what a malicious person might use to reidentify records in our
data. The only safe choice is to redact all the data, which is not very useful.</p>
<h1 id="apply-rule-based-techniques">Apply rule-based techniques</h1>
<p>Since simpler techniques fail, we could try more complicated heuristics. Many of
them appear in the literature, and are still in use today:</p>
<ul>
<li>adding some random perturbation to individual values;</li>
<li>making some attributes less granular;</li>
<li>suppressing records with rare values;</li>
<li>and a myriad of others.</li>
</ul>
<p>These techniques might seem less naive, but they still don't provide a robust
guarantee.</p>
<p><center> <img alt="The same diagram as previously, except that in addition to redacting
the Name column, the Date of birth column has been replaced with 1987-04-17 (a
slightly different date than the original one), and the ZIP code column has been
replaced with 14***." src="https://desfontain.es/privacy/images/rule-based-techniques.png">
</center> 

The most striking example is probably the release of <a href="https://www.zdnet.com/article/re-identification-possible-with-australian-de-identified-medicare-and-pbs-open-data/">Medicare records in
Australia</a>. To limit reidentifiability risk, the organization went one step
further than simply removing identifiers: they perturbed some attributes, and
shifted all dates by a random amount. The only data left was medical
information, which didn't seem like something an attacker could know about! But
for famous people, like politicians or athletes, some of this data can be
public! This allows reidentification, and retrieval of additional private
medical data.</p>
<p>Even privacy notions from the scientific literature can fail to protect
sensitive data. The first and most famous of these definitions is probably
<a href="k-anonymity.html">k-anonymity</a>. Its intuition seems convincing: each individual is
"hidden in a group" of other people with the same characteristics. Sadly,
despite this intuition, k-anonymity fails at providing a good level of
protection: <a href="https://techxplore.com/news/2022-10-kind-downcoding-flaws-anonymizing.html">downcoding attacks</a> succeed at reidentifying people in
data releases.</p>
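<p>For reference, checking k-anonymity itself is straightforward, which is part
of its appeal. The sketch below (with illustrative column names) verifies that
every quasi-identifier combination is shared by at least <span class="math">\(k\)</span> records:</p>

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k: int) -> bool:
    """True if every quasi-identifier combination appears at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) >= k

records = [
    {"zip": "142**", "age": "40-49", "reason": "flu"},
    {"zip": "142**", "age": "40-49", "reason": "injury"},
    {"zip": "142**", "age": "20-29", "reason": "flu"},
]
assert is_k_anonymous(records, ["zip"], k=3)             # everyone shares a ZIP bucket
assert not is_k_anonymous(records, ["zip", "age"], k=2)  # one person is alone in "20-29"
```

<p>As the downcoding attacks show, passing this check is not the same thing as
providing actual protection.</p>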
<h1 id="aggregate-the-data">Aggregate the data</h1>
<p>Looking at each individual record to figure out what to redact or randomize
doesn't seem to work. What if we aggregate multiple records
together, instead? Surely releasing statistics across many people should be
safe?</p>
<p><center>
<img alt="A diagram similar to before, but with multiple rows in both tables. In the
first table, each row represents different people, and three rows contain fake
data about three distinct people. The second table has columns Visit date, Visit
reason, Count, and Average age, and multiple rows with different fake values,
representing statistics for different visit dates and
reasons." src="https://desfontain.es/privacy/images/aggregating-data.png">
</center></p>
<p>Sadly, this is still not the case: there are multiple ways that individual
information can be retrieved from aggregated data. One of these ways uses the
correlations present in the data. Consider a dataset counting how many people
were in specific areas over time. This doesn't seem very identifying… Except
human mobility data tends to be predictable: people travel approximately in the
same direction between two points. This creates correlations, which attackers
can then exploit: researchers managed to <a href="https://blog.acolyer.org/2017/05/15/trajectory-recovery-from-ash-user-privacy-is-not-preserved-in-aggregated-mobility-data/">retrieve individual
trajectories</a> from such an aggregated dataset.</p>
<p>And there is another complication: it is often possible to combine multiple
statistics and retrieve individual records. This technique is called a
reconstruction attack. The most prominent example was carried out by the <a href="us-census-reconstruction-attack.html">U.S. Census
Bureau on the 2010 Decennial Census</a>. The results speak for themselves! Worse
still, reconstruction attacks are <a href="https://www.pnas.org/doi/10.1073/pnas.2218605120">improving over time</a>… so they
could become even more of a risk in the future.</p>
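<p>The simplest instance of this is a <em>differencing attack</em>, where just
two aggregate statistics pin down one person's value (the numbers below are made
up):</p>

```python
# Two innocuous-looking aggregates over the same dataset...
total_payroll_all = 4_810_000           # sum of salaries, all 50 employees
total_payroll_minus_newest = 4_650_000  # sum over the 49 employees hired earlier

# ...differ by exactly one person's salary.
newest_hire_salary = total_payroll_all - total_payroll_minus_newest
print(newest_hire_salary)  # 160000
```

<p>Full reconstruction attacks generalize this idea: with enough published
statistics, a solver can recover most of the underlying records.</p>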
<h1 id="what-do-these-attacks-have-in-common">What do these attacks have in common?</h1>
<p>Let's take a step back and look at all these failures of bad anonymization
techniques. Are there some themes we can discern?</p>
<ul>
<li><strong>Data is often more identifiable than it seems.</strong> Even a few
innocuous-looking pieces of information can be enough to identify someone. And
people tend to underestimate what data can be used to reidentify people in a
dataset.</li>
<li><strong>Auxiliary data is a dangerous unknown variable.</strong> Information that seems
secret might be public for certain individuals, or become known to attackers
thanks to an unrelated data breach.</li>
<li><strong>Even "obviously safe" data releases are at risk.</strong> Successful attacks happen
even on datasets that seem well-protected, like aggregated statistics.</li>
<li><strong>Attacks improve over time, in unpredictable ways.</strong> Mitigating only known
attacks, or performing empirical privacy checks, is not enough: using e.g.
newer AI techniques or more powerful hardware can break legacy protections.
<br><br></li>
</ul>
<h1 id="what-to-do-then">What to do, then?</h1>
<p>These failures of legacy techniques prove that we need something better. So,
when does an anonymization method deserve our trust? It should at least address
the four points in the previous section:</p>
<ul>
<li>it should avoid making assumptions on what is identifiable or secret in the
data;</li>
<li>it should be resistant to auxiliary data — its guarantee should hold no matter
what an attacker might already know;</li>
<li>it should provide a mathematical guarantee that doesn't rely on subjective
intuition;</li>
<li>and it should protect against possible future attacks, not just ones known
today.</li>
</ul>
<p>It turns out that this is exactly what <a href="friendly-intro-to-differential-privacy.html">differential privacy</a> provides.</p>
<ul>
<li>It makes no assumptions on what a potential attacker might use in the data.</li>
<li>Its guarantees do not depend on what auxiliary data the attacker has access to.</li>
<li>It provides a quantifiable, provable guarantee about the worst-case privacy risk.</li>
<li>And this guarantee holds for all possible attacks, so the guarantee is future-proof.</li>
</ul>
<p>It has a host of other benefits, too. For example, it can quantify the total
privacy cost of <em>multiple</em> data releases. It also offers much more flexibility:
many kinds of data transformation and analyses can be performed with
differential privacy.</p>Is differential privacy the right fit for your problem?2022-07-18T00:00:00+02:002022-07-18T00:00:00+02:00Damien Desfontainestag:desfontain.es,2022-07-18:/privacy/litmus-test-differential-privacy.html<p>Some data publication or sharing use cases are well-suited to the use of differential privacy, while some aren’t. In this blog post, we give a litmus test allowing you to quickly distinguish between the two.</p><p><small>
<span class='notlettrine'>T</span>his post is part of a <a href="friendly-intro-to-differential-privacy.html">series on differential
privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see the other
articles!</p>
<p> This article was first published on the <a href="https://www.tmlt.io/research/is-differential-privacy-the-right-fit-for-your-problem">Tumult Labs blog</a>; its
copyright is owned by Tumult Labs.</p>
<p></small></p>
<hr>
<p><span class='lettrine'>S</span>ay you have some sensitive data, like a batch of
financial information about first-time home buyers applying for mortgages in
Chicago. You would like to publish this data, or share it with third parties,
for example to facilitate economic research. This financial data has sensitive
information about individuals, so you need to make sure that you’re not
revealing personal data.</p>
<p>So far, this seems like a perfect use case for <a href="friendly-intro-to-differential-privacy.html">differential privacy</a>
(DP): publishing trends without revealing information about individuals is
exactly what it was designed for. You know that DP will successfully protect
this individual data – its guarantees apply regardless of the data distribution.
But you might still be wondering: will I succeed in publishing useful data? Will
it be accurate enough for the people who will use it?</p>
<p>In this blog post, I’ll help you get an initial idea of whether differential
privacy can work for you, using a simple litmus test. Spoiler alert: the process
looks like this.</p>
<p><center>
<img alt="A flowchart representing the litmus test described in the article. It starts
with a question: "How will the data be used?". There are three options.
"I don't know, it's hard to say" leads to "It's complicated. Let's learn more
about the use case first!". "For robust analyses that don't depend too much on
individual data points" leads to "Differential privacy will probably work for
you!". And "For analyses that are very sensitive to tiny differences in the
input data" leads to a second question: "Can you make these analyses more
robust?". If "Yes", then this goes to the same "DP will probably work" box as
earlier. If "Not really", this leads to "Differential privacy will likely not be
a good fit"." src="https://desfontain.es/privacy/images/litmus-test-differential-privacy.png">
</center></p>
<h1 id="a-simple-litmus-test">A simple litmus test</h1>
<p>Consider the decisions people will make based on the published data, and ask
yourself the following question.</p>
<p><center><strong>Can small changes in the original data lead to completely different
decision outcomes?</strong></center></p>
<p>Take the financial data scenario involving first-time home buyer data. Suppose
that a single home-buyer's info was removed from the dataset – would that change
the analysis you are doing on the data? If the analysis is about median value of
mortgages in Chicago overall, probably not. But if the analysis is about the
maximum value of mortgages in just one ZIP code, then removing that maximum
value might change the result by quite a lot!</p>
<p>There are three possible answers to this question.</p>
<ul>
<li>The results of the analysis <strong>do not depend too much on small changes in the
data</strong>. In this case, we say that the data analysis is <strong>robust</strong>, and
<strong>differential privacy will likely work for you</strong>.</li>
<li>The analysis might be <strong>very sensitive to small changes</strong>. In this case, it’s
worth asking: can we make the analysis more robust? If not, then
<strong>differential privacy is likely not a good fit</strong>.</li>
<li>Finally, it might not be clear what the data will be used for, and whether
these analyses will be robust. Then, we need to answer this question first,
and learn more about the use case.</li>
</ul>
<p>Let’s look more closely at these three options.</p>
<h1 id="robust-analyses-well-suited-to-differential-privacy">Robust analyses: well-suited to differential privacy</h1>
<p><em>Robust</em> analyses are those that do not depend too much on individual changes in
the data. Many common data analyses are robust; in particular, almost all
applications that aim at capturing trends fall in that category. For example, if
you are…</p>
<ul>
<li>… estimating large population sizes (> 100)</li>
<li>… understanding correlations between features in a large dataset</li>
<li>… producing usage metrics for a service with many users</li>
<li>… computing statistics over large groups</li>
</ul>
<p>… then the result of these analyses won’t be impacted by very small changes in
the data.</p>
<p><strong>In that case, differential privacy will likely work for you</strong>. Robust analyses
are a particularly good fit for DP techniques: you will likely be able to
generate high-quality data with strong privacy protections. The decisions made
using the DP data will closely resemble those that would have been made on the
true data.</p>
<p>This makes sense: DP is all about adding small amounts of jitter to computations
to hide the data of single individuals. But DP doesn’t need a lot of jitter: the
perturbation’s magnitude is similar to the impact of a single person. If a
single person is unlikely to change the result of future data analyses…
differential privacy probably won’t change it too much, either.</p>
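<p>A quick back-of-the-envelope computation shows why robust analyses tolerate
this jitter well. For a count released with <span class="math">\(\varepsilon = 1\)</span>, the expected
magnitude of the Laplace noise is 1, so the relative error shrinks as groups
grow:</p>

```python
epsilon = 1.0
expected_noise = 1.0 / epsilon  # expected |Laplace(1/epsilon)| noise on a count

for true_count in (10, 100, 10_000):
    relative_error = expected_noise / true_count
    print(f"count={true_count:>6}  typical relative error = {relative_error:.2%}")
```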
<p>Note that this litmus test tells you about feasibility. It doesn’t always mean
that deploying DP will be very easy. Some use cases, like machine learning, or
situations where you want to release a lot of statistics, can be tricky. In any
case, my colleagues & I at <a href="https://tmlt.io">Tumult Labs</a> can help! Don’t
hesitate to <a href="https://tmlt.io/connect">reach out</a>.</p>
<h1 id="analyses-that-are-sensitive-to-small-changes-in-the-data">Analyses that are sensitive to small changes in the data</h1>
<p>Some analyses are very sensitive to the data of single individuals: a change in
a single person’s data can change the outcome drastically! This typically
happens in three cases.</p>
<h4 id="small-populations">Small populations</h4>
<p>Suppose that you are trying to publish the average mortgage value for a specific
ZIP code where there are only a few first-time home buyers – say, fewer than 10. In
this case, an individual change might have a large impact on the average!</p>
<p>In situations such as this one, individual changes can have a large impact on
the decisions made with the data. In this case, the noise added by differential
privacy is also likely to change the result of the analysis. This will often be
unacceptable: DP will not be a good fit.</p>
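<p>A tiny numerical example (with made-up values) shows how fragile small-group
averages are: with only five home buyers, a single person can move the statistic
by almost 20%, so DP noise of comparable magnitude would swamp the signal.</p>

```python
# Mortgage values for the five first-time buyers in one ZIP code (fake data).
mortgages = [310_000, 285_000, 640_000, 295_000, 300_000]

avg_all = sum(mortgages) / len(mortgages)
avg_without_largest = sum(m for m in mortgages if m != 640_000) / (len(mortgages) - 1)

print(avg_all)              # 366000.0
print(avg_without_largest)  # 297500.0
```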
<h4 id="finding-outlier-individuals">Finding outlier individuals</h4>
<p>Suppose that you are trying to find which people had mortgages that were
significantly above the average in their area. In applications like this one,
the goal is to detect outlier behavior. This is at odds with the fundamental
goal of differential privacy: hiding information about all individuals,
including outliers! In this kind of scenario, another approach might be
needed.</p>
<h4 id="preserving-linkability">Preserving linkability</h4>
<p>Suppose that you want to enable other people to run analyses joining your data
with their own data, at the level of each individual. In that case, you need a
one-to-one relationship between people in the original data and in the output
data. This is also at odds with differential privacy: you cannot hide
who is present in the sensitive dataset and also preserve linkability. Small
changes in the data will be clearly visible, since one user will or will not be
part of the output.</p>
<h4 id="making-the-analysis-more-robust">Making the analysis more robust</h4>
<p>When the analysis is sensitive to small changes in the data, it is worth asking:
could we change that? Can we reformulate the problem in a more robust way? Doing
so can often be doubly beneficial, and lead to privacy <em>and</em> utility
improvements.</p>
<p>Say that the published data will be used to determine the impact of age on the
rejection rate for mortgages. A first approach would be to release rejection
rates, grouped by age. But some values of age are rarer than others: we might
have many data points where the age is 40, but only a handful where the age is
20.</p>
<p>For these outlier values, small changes in the data might lead to large changes.
But we are not interested in specific age values, only in the global
relationship between age and rejection rate. Thus, we could change our strategy
to publish data for age ranges, so each statistic comes from more data, and is
more robust to small changes. This would make the released data more
trustworthy, and the publication process more amenable to differential privacy.</p>
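<p>The reformulation itself can be as simple as a bucketing function (the
ten-year ranges here are just one possible choice):</p>

```python
def age_bucket(age: int) -> str:
    """Map an exact age to a ten-year range, so each published statistic
    aggregates more people and is more robust to individual changes."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

print(age_bucket(20))  # 20-29
print(age_bucket(47))  # 40-49
```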
<p>For use cases that appear to require linkability, ask yourself the question: can
we perform the join between datasets before computing the statistics of
interest? If so, then using differential privacy might be an option.</p>
<h1 id="what-if-i-dont-know-how-the-data-will-be-used">What if I don’t know how the data will be used?</h1>
<p>Sometimes, the question from our litmus test might be difficult to answer: what
decisions will be made based on the published data? You might know that other
people want access to this data, but not know exactly what they will want to do
with it.</p>
<p>The right thing to do, then, is to try and understand more about their use case.
The more you know what they want to do, the easier it will be to design a
solution that works for them. This is both to answer our question about
feasibility, and to help craft the requirements for a possible DP-based
solution. The more you understand the requirements of your stakeholders, the
happier you and they will be with the released data.</p>
<hr>
<p><small></p>
<p>Thanks to Ashwin Machanavajjhala, Gerome Miklau, and Nicole Le for helpful
feedback on this post.</p>
<p></small></p>Research post: Differential privacy under partial knowledge2022-06-27T00:00:00+02:002022-06-27T00:00:00+02:00Damien Desfontainestag:desfontain.es,2022-06-27:/privacy/partial-knowledge.html<p>What happens to differential privacy if you model a weaker adversary with only partial knowledge over the input data?</p><p><strong>Differential</strong> privacy (DP) makes very few assumptions on the attacker. The
definition holds for all pairs of databases <span class="math">\(D_1\)</span> and <span class="math">\(D_2\)</span> that differ in one
person. This means that even if the attacker knows <em>everyone</em> in the database,
except one person, they can't get information about this person. Or, rather, the
probabilistic information they can get is bounded by <span class="math">\(e^\varepsilon\)</span>.</p>
<p>When explaining DP to people for the first time, this "perfect knowledge"
assumption often raises eyebrows. This seems overkill, right? If someone knows
<em>everyone</em> in your database except one person… This probably means that they
have direct access to your raw data. And in that case, you've already lost! It
no longer matters how you're anonymizing the data later.</p>
<p>This intuition was central to my PhD proposal, back in early 2016. Then, I was
observing two different worlds. In academia, researchers focused their efforts
on differential privacy. Definitions like <a href="k-anonymity.html"><span class="math">\(k\)</span>-anonymity</a> were
a thing of the past. In industry, however, DP was still largely ignored. People
thought it was a neat theoretical idea, but pretty much unusable in practice.
Its strong assumptions were unrealistic, and the idea of adding noise to data
was too painful.</p>
<p>So, I thought: I'm trying to start a <a href="part-time-phd.html">part-time PhD</a>,
splitting my weeks between industry and academia. Can I work towards bridging
this gap? What if we considered DP, but made its assumptions a little weaker…
Would it be possible to prove something about older notions like <span class="math">\(k\)</span>-anonymity?
They might not be as robust as DP, but they might be good enough sometimes,
right? Could one, for example, quantify their guarantees using the same
<a href="differential-privacy-in-more-detail.html">Bayesian interpretation</a> as with DP?</p>
<p>This line of study eventually led to <a href="https://arxiv.org/abs/1905.00650">this paper</a>, a collaboration with
<a href="https://mohammadi.eu/">Esfandiar Mohammadi</a>, Elisabeth Krahmer, and my PhD advisor <a href="https://people.inf.ethz.ch/basin/">David
Basin</a>. We studied this natural question: what if an attacker only has
<em>partial</em> knowledge over the data? How can we formalize this assumption? How
does the DP definition change, and can we use this variant to prove interesting
results?</p>
<p>This basic idea is not new: the formalism we used was based on work by <a href="https://eprint.iacr.org/2011/487.pdf">Bhaskar
et al.</a>, published in 2011. To capture the attacker's uncertainty, we
model the input data by a <em>probability distribution</em>, denoted <span class="math">\(\theta\)</span>. And
instead of comparing the output of the mechanism <span class="math">\(A\)</span> on two fixed databases, we
compare it on two samples from this distribution.</p>
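<p>Concretely, this condition can be sketched as follows (my paraphrase of the
idea, not necessarily the exact statement from the paper): for all sets of
outputs <span class="math">\(S\)</span>, all indices <span class="math">\(i\)</span>, and all pairs of possible records
<span class="math">\(a\)</span> and <span class="math">\(b\)</span>,</p>
<div class="math">$$\mathbb{P}_{D\sim\theta}\left[A(D)\in S\mid D_i=a\right]\le e^\varepsilon\cdot\mathbb{P}_{D\sim\theta}\left[A(D)\in S\mid D_i=b\right].$$</div>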
<p><center>
<svg version="1.1"
viewBox="0.0 0.0 1154.6010498687665 478.09711286089237"
aria-label="Diagram illustrating the definition of DP with partial knowledge. A cloud on the left is labeled "Data-generating distribution θ". Two arrows start from it: going to two databases of different colors; the arrows have separate labels "Conditioned on record i being X", with X being two rows of distinct colors. The databases both have an arrow going to a silly "magic" gif, and this has an arrow going to a cloud labeled "Data". A double arrow labeled "basically the same" points to the two outputs."
fill="none"
stroke="none"
stroke-linecap="square"
stroke-miterlimit="10"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns="http://www.w3.org/2000/svg">
<clipPath id="p.0">
<path d="m0 0l1154.6011 0l0 478.0971l-1154.6011 0l0 -478.0971z"
clip-rule="nonzero" />
</clipPath>
<g clip-path="url(#p.0)">
<path fill="#000000" fill-opacity="0.0"
d="m0 0l1154.6011 0l0 478.0971l-1154.6011 0z"
fill-rule="evenodd" />
<path fill="#f6cd4c"
d="m896.73236 65.35537l0 0c-1.8811646 -14.366959 4.295044 -28.589382 15.907898 -36.632164c11.612793 -8.042784 26.62555 -8.495398 38.667786 -1.1657829l0 0c4.265686 -8.353548 12.073242 -14.121096 21.060913 -15.558073c8.987732 -1.4369774 18.099976 1.6253939 24.580444 8.260807l0 0c3.63385 -7.573864 10.768982 -12.662533 18.873474 -13.460281c8.104492 -0.7977495 16.031372 2.808321 20.967773 9.538591l0 0c6.5650635 -8.028276 17.010254 -11.408514 26.815918 -8.678066c9.805664 2.7304478 17.210571 11.081169 19.010498 21.438759l0 0c8.043335 2.2800636 14.743286 8.076288 18.368774 15.891115c3.6254883 7.814831 3.8209229 16.881641 0.53564453 24.857868l0 0c7.92041 10.712936 9.773071 24.9878 4.8669434 37.49746c-4.90625 12.509666 -15.834351 21.374687 -28.706177 23.286812c-0.09082031 11.740753 -6.286621 22.513947 -16.19934 28.167084c-9.912842 5.6531525 -21.994629 5.3035126 -31.588745 -0.9141693l0 0c-4.086426 14.0616 -15.588745 24.40799 -29.537292 26.569153c-13.948547 2.161148 -27.842834 -4.250351 -35.680054 -16.464508l0 0c-9.606812 6.020401 -21.134155 7.7546844 -31.98169 4.8116455c-10.847595 -2.9430542 -20.101624 -10.315521 -25.674622 -20.45433l0 0c-9.816895 1.1938782 -19.308533 -4.091919 -23.76416 -13.23407c-4.4556274 -9.142136 -2.9268188 -20.194473 3.8277588 -27.671783l0 0c-8.756958 -5.3563843 -13.225281 -15.9852295 -11.074951 -26.34401c2.1503906 -10.35878 10.432129 -18.100159 20.526672 -19.18731z"
fill-rule="evenodd" />
<path fill="#000000" fill-opacity="0.0"
d="m887.08655 111.440414l0 0c4.1324463 2.5276947 8.906372 3.6743164 13.680786 3.2859116m6.25354 37.620857c2.0531616 -0.24969482 4.065674 -0.77845764 5.9856567 -1.5726471m51.667725 17.214676c-1.4440918 -2.2505646 -2.6531372 -4.6555023 -3.6065674 -7.173813m68.82648 -2.931244l0 0c0.7449951 -2.5636444 1.2277832 -5.202179 1.4400635 -7.8715973m46.34607 -19.38034c0.09667969 -12.499863 -6.7348633 -23.944855 -17.56018 -29.41903m41.399292 -31.36393c-1.7530518 4.256439 -4.4294434 8.032265 -7.819214 11.031433m-11.083496 -51.78153l0 0c0.29870605 1.7188435 0.43688965 3.4635239 0.41271973 5.209854m-46.238403 -17.970028l0 0c-1.6376953 2.002655 -2.9869385 4.240597 -4.0057373 6.644142m-35.835876 -2.7233276l0 0c-0.8727417 1.8189983 -1.5244751 3.7438087 -1.9401245 5.730093m-43.701965 1.5673256l0 0c2.5481567 1.5509415 4.9055176 3.4176655 7.0203247 5.559164m-61.595276 32.238983l0 0c0.25927734 1.9801407 0.6689453 3.9358673 1.2251587 5.848877"
fill-rule="evenodd" />
<path fill="#7c7ce0"
d="m736.979 81.62992l114.703186 0l0 -13.535431l16.5094 27.070862l-16.5094 27.07087l0 -13.535431l-114.703186 0z"
fill-rule="evenodd" />
<path fill="#000000" fill-opacity="0.0"
d="m932.9249 68.09998l118.8031 0l0 54.128647l-118.8031 0z"
fill-rule="evenodd" />
<g transform="matrix(0.15289973753280842 0.0 0.0 0.15290577427821522 932.9249446194225 68.09998320209974)">
<clipPath id="p.1">
<path d="m0 -5.684342E-14l777.0 0l0 354.0l-777.0 0z"
clip-rule="evenodd" />
</clipPath>
<image clip-path="url(#p.1)" fill="#000" width="777.0"
height="354.0" x="0.0" y="0.0" preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/data.png"/>
</g>
<path fill="#f6cd4c"
d="m896.73236 350.73334l0 0c-1.8811646 -14.366974 4.295044 -28.589386 15.907898 -36.63217c11.612793 -8.042786 26.62555 -8.495392 38.667786 -1.165802l0 0c4.265686 -8.353546 12.073242 -14.121094 21.060913 -15.558044c8.987732 -1.4369812 18.099976 1.6253662 24.580444 8.260803l0 0c3.63385 -7.573883 10.768982 -12.662537 18.873474 -13.460297c8.104492 -0.79776 16.031372 2.808319 20.967773 9.538605l0 0c6.5650635 -8.02829 17.010254 -11.408508 26.815918 -8.67807c9.805664 2.7304382 17.210571 11.081177 19.010498 21.438751l0 0c8.043335 2.2800598 14.743286 8.076294 18.368774 15.891113c3.6254883 7.8148193 3.8209229 16.881653 0.53564453 24.85788l0 0c7.92041 10.712921 9.773071 24.987793 4.8669434 37.497467c-4.90625 12.509644 -15.834351 21.374664 -28.706177 23.286804c-0.09082031 11.740753 -6.286621 22.513947 -16.19934 28.167084c-9.912842 5.653137 -21.994629 5.3034973 -31.588745 -0.9141846l0 0c-4.086426 14.061615 -15.588745 24.40799 -29.537292 26.569153c-13.948547 2.1611633 -27.842834 -4.2503357 -35.680054 -16.464508l0 0c-9.606812 6.0204163 -21.134155 7.7546997 -31.98169 4.8116455c-10.847595 -2.9430542 -20.101624 -10.315521 -25.674622 -20.454315l0 0c-9.816895 1.1938782 -19.308533 -4.091919 -23.76416 -13.23407c-4.4556274 -9.142151 -2.9268188 -20.194489 3.8277588 -27.671783l0 0c-8.756958 -5.3563843 -13.225281 -15.9852295 -11.074951 -26.344025c2.1503906 -10.358765 10.432129 -18.100159 20.526672 -19.187317z"
fill-rule="evenodd" />
<path fill="#000000" fill-opacity="0.0"
d="m887.08655 396.81836l0 0c4.1324463 2.52771 8.906372 3.6743164 13.680786 3.2859192m6.25354 37.62085c2.0531616 -0.24969482 4.065674 -0.7784424 5.9856567 -1.5726318m51.667725 17.21466c-1.4440918 -2.2505493 -2.6531372 -4.655487 -3.6065674 -7.1737976m68.82648 -2.931244l0 0c0.7449951 -2.5636597 1.2277832 -5.202179 1.4400635 -7.8716125m46.34607 -19.38034c0.09667969 -12.499847 -6.7348633 -23.944855 -17.56018 -29.419037m41.399292 -31.363922c-1.7530518 4.256439 -4.4294434 8.032257 -7.819214 11.031433m-11.083496 -51.781525l0 0c0.29870605 1.7188416 0.43688965 3.4635315 0.41271973 5.2098694m-46.238403 -17.970032l0 0c-1.6376953 2.002655 -2.9869385 4.2406006 -4.0057373 6.6441345m-35.835876 -2.7233276l0 0c-0.8727417 1.8190002 -1.5244751 3.743805 -1.9401245 5.7301025m-43.701965 1.5673218l0 0c2.5481567 1.5509338 4.9055176 3.4176636 7.0203247 5.5591736m-61.595276 32.238983l0 0c0.25927734 1.980133 0.6689453 3.935852 1.2251587 5.848877"
fill-rule="evenodd" />
<path fill="#000000" fill-opacity="0.0"
d="m932.9249 353.47794l118.8031 0l0 54.12863l-118.8031 0z"
fill-rule="evenodd" />
<g transform="matrix(0.15289973753280842 0.0 0.0 0.15290577427821522 932.9249446194225 353.4779359580052)">
<clipPath id="p.2">
<path d="m0 0l777.0 0l0 354.0l-777.0 0z"
clip-rule="evenodd" />
</clipPath>
<image clip-path="url(#p.2)" fill="#000" width="777.0"
height="354.0" x="0.0" y="0.0" preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/data.png"/>
</g>
<path fill="#db4437"
d="m959.42914 190.68767l32.897644 -32.897644l32.897583 32.897644l-16.448792 0l0 94.3307l16.448792 0l-32.897583 32.897644l-32.897644 -32.897644l16.448792 0l0 -94.3307z"
fill-rule="evenodd" />
<path fill="#db4437"
d="m837.7743 209.5912c51.51703 -13.457993 103.03412 13.458008 154.55115 0c51.51709 -13.457993 103.03418 13.458008 154.55115 0l0 56.523636c-51.516968 13.458008 -103.03406 -13.458008 -154.55115 0c-51.51703 13.458008 -103.03412 -13.458008 -154.55115 0z"
fill-rule="evenodd" />
<path fill="#000000"
d="m890.6053 252.63303q-1.484375 -0.96875 -1.75 -1.953125q-0.25 -0.96875 0.671875 -1.546875q0.21875 -0.140625 0.8125 -0.03125q0.609375 0.09375 0.921875 0.40625q0.703125 0.625 1.578125 0.78125q0.890625 0.15625 2.171875 -0.1875q1.453125 -0.453125 2.234375 -1.125q0.78125 -0.6875 1.75 -1.859375q0.734375 -0.9375 1.015625 -1.828125q0.296875 -0.90625 0.28125 -1.65625q-0.015625 -0.75 -0.359375 -1.125q-0.875 -1.21875 -2.15625 -1.671875q-1.28125 -0.46875 -3.359375 0.078125q-0.234375 0.03125 -0.71875 0.140625q-0.484375 0.109375 -0.875 0.203125q-0.375 0.078125 -0.40625 0.078125q-0.25 0.0625 -0.390625 0.0625q-0.140625 0 -0.328125 0.328125q-0.1875 0.328125 -0.578125 1.359375q-0.375 1.015625 -1.140625 3.125q-0.71875 2.125 -1.234375 3.015625q-0.5 0.875 -0.875 0.796875q-0.375 -0.09375 -0.859375 -0.75q-0.28125 -0.5 -0.609375 -1.078125q-0.328125 -0.59375 0.015625 -1.21875q0.171875 -0.3125 0.265625 -0.5625q0.109375 -0.265625 0.28125 -0.921875q0.171875 -0.515625 0.671875 -1.8125q0.515625 -1.3125 1.203125 -3.0625q0.6875 -1.75 1.453125 -3.609375q0.765625 -1.875 1.46875 -3.53125q0.71875 -1.671875 1.203125 -2.828125q0.484375 -1.0625 0.578125 -1.546875q0.109375 -0.484375 -0.0625 -0.53125q0 -0.171875 -0.109375 -0.328125q-0.09375 -0.15625 -0.0625 -0.296875q0 -0.125 0.609375 -0.359375q0.609375 -0.25 1.296875 -0.453125q0.703125 -0.21875 1.015625 -0.15625q0.171875 0 0.40625 -0.015625q0.25 -0.015625 0.25 -0.15625q0.09375 0 0.640625 -0.0625q0.5625 -0.078125 1.21875 0q2.328125 0.203125 3.8125 1.328125q1.5 1.125 1.46875 3.59375q-0.046875 0.59375 -0.5 1.53125q-0.4375 0.921875 -1.125 1.859375q-0.671875 0.9375 -1.4375 1.625q-0.4375 0.390625 -0.953125 0.796875q-0.5 0.390625 -0.859375 0.65625q-0.359375 0.25 -0.359375 0.25q0 0 0.265625 0.109375q0.28125 0.09375 0.59375 0.203125q1.59375 0.59375 2.515625 1.5q0.921875 0.890625 1.296875 2.484375q0.140625 0.515625 0.015625 1.28125q-0.125 0.765625 -0.328125 1.515625q-0.203125 0.734375 -0.453125 1.265625q-0.234375 0.234375 -0.375 0.5q-0.140625 0.25 -0.140625 
0.25q0 0.078125 -0.28125 0.546875q-0.265625 0.46875 -0.65625 1.03125q-0.375 0.546875 -0.71875 0.953125q-0.359375 0.28125 -1.0 0.78125q-0.640625 0.5 -1.390625 0.984375q-0.734375 0.5 -1.390625 0.734375q-0.96875 0.421875 -2.171875 0.578125q-1.1875 0.15625 -2.265625 0.03125q-1.078125 -0.125 -1.703125 -0.5zm2.5 -14.390625q0 0 0.5625 -0.109375q0.578125 -0.125 1.265625 -0.28125q0.703125 -0.15625 1.046875 -0.296875q0.171875 -0.109375 0.5 -0.28125q0.328125 -0.171875 0.671875 -0.515625q0.5 -0.453125 1.046875 -1.015625q0.546875 -0.578125 1.078125 -1.234375q0.546875 -0.65625 0.9375 -1.421875q0.796875 -1.25 0.859375 -1.796875q0.078125 -0.5625 -0.375 -1.015625q-0.3125 -0.375 -0.875 -0.5625q-0.546875 -0.203125 -1.1875 -0.21875q-0.640625 -0.015625 -1.234375 0.15625q-0.453125 0.140625 -0.703125 0.328125q-0.234375 0.1875 -0.234375 0.578125q0 0.3125 -0.1875 0.640625q-0.1875 0.328125 -0.359375 0.5q-0.046875 0.25 -0.234375 0.46875q-0.1875 0.21875 -0.1875 0.21875q0 0.140625 -0.4375 0.859375q-0.421875 0.703125 -0.921875 2.015625q-0.609375 1.5 -0.828125 2.15625q-0.203125 0.65625 -0.203125 0.828125zm12.890869 11.15625q-0.140625 -0.265625 -0.609375 -0.546875q-0.46875 -0.28125 -0.46875 -0.421875q0 -0.140625 -0.15625 -0.421875q-0.15625 -0.296875 -0.296875 -0.4375q-0.171875 -0.203125 0.109375 -1.015625q0.28125 -0.828125 0.828125 -2.0q0.5625 -1.1875 1.21875 -2.390625q0.28125 -0.28125 0.546875 -0.734375q0.28125 -0.453125 0.421875 -0.578125q0 -0.140625 0.09375 -0.265625q0.109375 -0.125 0.25 -0.125l0.28125 -0.546875q0.0625 -0.140625 0.421875 -0.546875q0.375 -0.421875 0.890625 -0.921875q0.53125 -0.515625 1.0625 -0.953125q0.53125 -0.453125 0.875 -0.671875q0.734375 -0.484375 1.515625 -0.4375q0.78125 0.03125 1.328125 0.484375q0.109375 0.09375 0.46875 0.40625q0.375 0.3125 0.828125 0.703125q0.46875 0.375 0.828125 0.6875l1.28125 1.140625l-0.59375 1.734375q-0.453125 1.390625 -0.234375 2.4375q0.234375 1.03125 0.65625 1.796875q0.171875 0.375 0.453125 0.59375q0.296875 0.203125 0.78125 0.265625q0.28125 
0.03125 0.515625 0.203125q0.25 0.171875 0.25 0.5625q0 0.828125 -0.296875 1.203125q-0.28125 0.359375 -0.640625 0.421875q-1.140625 0.140625 -2.1875 -0.234375q-1.03125 -0.390625 -1.859375 -1.984375q-0.140625 -0.203125 -0.375 -0.765625q-0.21875 -0.578125 -0.25 -0.765625q-0.140625 0.1875 -0.375 0.453125q-0.21875 0.25 -0.421875 0.453125q-2.46875 2.5 -4.21875 3.25q-1.75 0.734375 -2.921875 -0.03125zm1.90625 -2.875q0.25 0.046875 0.765625 -0.25q0.515625 -0.296875 1.390625 -1.0q1.171875 -1.0 1.9375 -1.65625q0.765625 -0.65625 1.484375 -1.703125l0.765625 -1.25q-0.34375 -0.796875 -0.625 -1.03125q-0.28125 -0.25 -0.546875 -0.25q-0.453125 0 -1.078125 0.578125q-0.625 0.578125 -1.40625 1.671875q-0.78125 1.078125 -1.75 2.71875q-0.453125 0.75 -0.703125 1.46875q-0.234375 0.703125 -0.234375 0.703125zm14.550476 3.8125q-1.171875 -0.484375 -1.71875 -1.015625q-0.53125 -0.546875 -0.421875 -0.953125q0.0625 -0.203125 0.234375 -0.375q0.171875 -0.1875 0.78125 -0.15625q0.609375 0.015625 1.890625 0.328125q0.390625 0 0.765625 -0.03125q0.375 -0.046875 0.703125 -0.109375q0.328125 -0.078125 0.546875 -0.171875q0.171875 -0.078125 0.09375 -0.1875q-0.0625 -0.125 -0.234375 -0.265625q-0.140625 -0.171875 -0.3125 -0.21875q-0.171875 -0.0625 -0.3125 -0.0625q-0.0625 -0.171875 -0.296875 -0.34375q-0.21875 -0.171875 -0.53125 -0.390625q-0.765625 -0.609375 -1.390625 -1.09375q-0.625 -0.5 -0.484375 -0.734375q0 -0.140625 0 -0.15625q0 -0.015625 0 -0.015625q-0.25 0 -0.453125 -0.671875q-0.203125 -0.6875 -0.203125 -1.234375q0 -0.703125 0.296875 -1.390625q0.296875 -0.6875 0.75 -1.3125q0.46875 -0.625 0.953125 -1.109375q0.671875 -0.625 1.40625 -1.234375q0.75 -0.609375 1.625 -1.09375q0.890625 -0.484375 1.96875 -0.765625q0.484375 -0.09375 1.109375 -0.09375q0.625 0 0.90625 0.171875q0.265625 -0.171875 0.6875 0.09375q0.421875 0.25 0.828125 0.703125q0.421875 0.453125 0.625 0.90625q0 0.40625 -0.015625 1.015625q-0.015625 0.609375 -0.09375 1.0625q-0.0625 0.4375 -0.3125 0.4375q0 0 -0.09375 0.03125q-0.078125 0.015625 -0.078125 0.1875q0 
0.34375 -0.515625 0.796875q-0.65625 0.453125 -1.25 0.734375q-0.578125 0.265625 -0.890625 0.03125q-0.21875 -0.140625 -0.390625 -0.28125q-0.171875 -0.140625 -0.3125 -0.34375q0 0 -0.09375 -0.15625q-0.078125 -0.15625 -0.078125 -0.296875q0 -0.140625 0.09375 -0.265625q0.109375 -0.125 0.109375 -0.125q0.484375 -0.4375 0.75 -0.8125q0.265625 -0.390625 0.359375 -0.6875q0.109375 -0.296875 0.109375 -0.46875q0 -0.203125 -0.09375 -0.25q-0.078125 -0.0625 -0.359375 -0.0625q-0.171875 0 -0.75 0.28125q-0.5625 0.28125 -1.1875 0.6875q-0.625 0.390625 -0.96875 0.703125q-0.140625 0.234375 -0.203125 0.3125q-0.046875 0.0625 -0.171875 0.0625q0 -0.140625 -0.1875 0.03125q-0.171875 0.171875 -0.421875 0.484375q-0.234375 0.3125 -0.484375 0.65625q-0.234375 0.34375 -0.375 0.5625q-0.03125 0.3125 -0.15625 0.578125q-0.109375 0.25 -0.078125 0.53125q0.03125 0.34375 0.34375 0.65625q0.3125 0.3125 0.90625 0.796875q1.203125 0.90625 1.8125 1.421875q0.609375 0.515625 0.84375 0.828125q0.25 0.3125 0.3125 0.625q0.59375 1.078125 0.140625 1.859375q-0.453125 0.78125 -1.375 1.1875q-0.359375 0.171875 -1.234375 0.328125q-0.859375 0.15625 -1.8125 0.140625q-0.953125 -0.015625 -1.609375 -0.296875zm12.7547 -0.234375q-0.828125 0 -1.4375 -0.875q-0.609375 -0.890625 -0.734375 -2.140625q0.03125 -0.28125 0.078125 -0.796875q0.046875 -0.515625 0.125 -1.078125q0.140625 -0.65625 0.375 -1.6875q0.25 -1.046875 0.515625 -2.09375q0.28125 -1.0625 0.484375 -1.84375q0.21875 -0.78125 0.25 -0.921875q0.203125 -0.484375 0.9375 -0.234375q0.734375 0.234375 1.25 1.03125q0.203125 0.28125 0.25 0.65625q0.0625 0.375 -0.09375 1.109375q-0.15625 0.734375 -0.578125 2.140625q-0.375 1.390625 -0.578125 2.25q-0.1875 0.84375 -0.140625 1.78125q-0.140625 1.59375 -0.28125 2.171875q-0.140625 0.5625 -0.421875 0.53125zm3.1875 -14.390625q-0.140625 0.109375 -0.265625 0.09375q-0.109375 -0.015625 -0.390625 -0.1875q-0.234375 -0.046875 -0.609375 -0.328125q-0.359375 -0.296875 -0.625 -0.609375q-0.25 -0.3125 -0.25 -0.390625q-0.03125 -0.34375 0.046875 -1.09375q0.09375 
-0.765625 0.375 -1.4375q0.296875 -0.6875 0.828125 -0.71875q0.140625 0 0.59375 0.140625q0.46875 0.140625 1.0 0.34375q0.265625 0.515625 0.5 1.359375q0.25 0.828125 0.0625 1.609375q-0.1875 0.765625 -1.265625 1.21875zm4.1988525 14.140625q-0.765625 0 -1.125 -0.0625q-0.359375 -0.078125 -0.546875 -0.203125q-0.171875 -0.140625 -0.4375 -0.390625q-0.140625 -0.28125 -0.5 -0.703125q-0.34375 -0.4375 -0.46875 -0.578125q-0.140625 -0.515625 -0.09375 -1.421875q0.046875 -0.90625 0.28125 -1.9375q0.25 -1.046875 0.609375 -1.875q0.125 -0.28125 0.234375 -0.546875q0.109375 -0.28125 0.109375 -0.28125q0 -0.171875 0.421875 -0.828125q0.4375 -0.65625 0.96875 -1.390625q0.546875 -0.734375 0.890625 -1.046875q0.25 -0.265625 0.75 -0.578125q0.5 -0.3125 0.875 -0.453125q0.390625 -0.140625 1.046875 -0.359375q0.65625 -0.234375 1.171875 -0.234375q0.484375 0 1.03125 0.21875q0.5625 0.203125 0.953125 0.71875q0.625 0.625 0.796875 1.5q0.171875 0.859375 0.03125 1.484375q-0.171875 0.6875 -0.703125 1.125q-0.515625 0.4375 -1.109375 0.453125q-0.578125 0.015625 -1.03125 -0.46875q-0.140625 -0.28125 -0.140625 -0.46875q0 -0.1875 0.140625 -0.578125q0.140625 -0.484375 0.140625 -0.859375q0 -0.390625 -0.3125 -0.390625q-0.453125 0 -0.734375 0.265625q-0.265625 0.265625 -0.828125 0.84375q-0.234375 0.28125 -0.625 0.8125q-0.375 0.515625 -0.75 1.125q-0.359375 0.59375 -0.625 1.21875q-0.390625 1.140625 -0.484375 1.484375q-0.078125 0.328125 -0.078125 0.328125q0 0.125 -0.03125 0.25q-0.03125 0.125 -0.03125 0.265625q0 0.40625 0.484375 0.65625q0.375 0.234375 0.625 0.21875q0.265625 -0.015625 1.03125 -0.359375q0.5625 -0.28125 1.0625 -0.71875q0.5 -0.453125 1.046875 -0.8125q0.390625 -0.234375 0.609375 -0.40625q0.234375 -0.171875 0.234375 -0.03125q0 0.140625 0.09375 0.140625q0.109375 0 0.25 -0.140625q0.203125 -0.28125 0.515625 -0.21875q0.3125 0.046875 0.5625 0.1875q0.265625 0.125 0.265625 0.234375q-0.0625 0.3125 0.140625 0.515625q0 0.140625 -0.09375 0.375q-0.078125 0.21875 -0.21875 0.359375q-0.171875 0.234375 -0.5 0.6875q-0.328125 0.453125 
-0.609375 0.796875q-0.265625 0.34375 -0.265625 0.34375q-0.140625 -0.03125 -0.265625 0.078125q-0.125 0.09375 -0.265625 0.234375q-0.265625 0.25 -0.515625 0.375q-0.234375 0.109375 -0.375 0.109375q-0.25 0.140625 -0.375 0.1875q-0.109375 0.03125 -0.109375 0.171875q-0.359375 0.34375 -1.265625 0.46875q-0.890625 0.109375 -1.859375 0.109375zm11.476257 -0.34375q-0.421875 0.109375 -0.9375 -0.234375q-0.515625 -0.359375 -0.9375 -0.875q-0.421875 -0.515625 -0.515625 -0.9375q-0.140625 -0.71875 0.078125 -1.78125q0.234375 -1.0625 0.78125 -2.25q0.5625 -1.203125 1.390625 -2.359375q0.828125 -1.15625 1.859375 -2.09375q1.046875 -0.9375 2.265625 -1.453125q0.515625 -0.28125 0.71875 -0.359375q0.21875 -0.09375 0.421875 0q0.203125 0.078125 0.734375 0.359375q0.484375 0.140625 0.984375 0.328125q0.5 0.1875 0.875 0.703125q0.28125 0.515625 0.421875 1.046875q0.140625 0.515625 0.234375 1.171875q0.28125 0.875 0.203125 1.140625q-0.0625 0.25 -0.203125 0.390625q-0.265625 0.34375 -0.125 0.34375q0 0 -0.03125 0.125q-0.015625 0.109375 -0.15625 0.359375q-0.234375 0.765625 -0.359375 1.4375q-0.125 0.671875 0.015625 1.203125q0.28125 0.75 0.75 0.890625q0.46875 0.140625 1.234375 -0.625q0.375 -0.265625 0.859375 -0.359375q0.484375 -0.09375 0.484375 0.1875q0 0 0.109375 0.09375q0.109375 0.078125 0.234375 0.078125q0.171875 0 0.140625 0.328125q-0.03125 0.328125 -0.140625 0.75q-0.09375 0.421875 -0.265625 0.6875q-0.53125 0.53125 -0.53125 0.703125q0 0 -0.125 0.125q-0.109375 0.109375 -0.390625 0.359375q-0.625 0.375 -1.578125 0.375q-0.953125 0 -1.46875 -0.375q-0.515625 -0.390625 -1.03125 -0.9375q-0.5 -0.5625 -0.640625 -1.046875l-0.171875 -0.34375l-1.0 0.65625q-0.375 0.390625 -0.5625 0.578125q-0.171875 0.1875 -0.296875 0.328125q-0.109375 0.140625 -0.234375 0.25q-0.125 0.09375 -0.25 0.09375q-0.140625 0 -0.25 0q-0.109375 0 -0.109375 0q0 0.140625 -0.484375 0.375q-0.484375 0.21875 -1.078125 0.390625q-0.578125 0.171875 -0.921875 0.171875zm0.71875 -2.90625q0.21875 0 0.640625 -0.265625q0.4375 -0.265625 0.953125 -0.640625q0.53125 
-0.375 0.953125 -0.75q0.4375 -0.390625 0.640625 -0.640625q0.140625 -0.125 0.328125 -0.4375q0.203125 -0.3125 0.4375 -0.5625q0.796875 -0.90625 1.0 -1.4375q0.21875 -0.53125 -0.171875 -1.15625q-0.140625 -0.390625 -0.34375 -0.421875q-0.203125 -0.03125 -0.859375 0.25q-0.421875 0.234375 -0.90625 0.796875q-0.484375 0.546875 -0.984375 1.25q-0.5 0.6875 -0.921875 1.421875q-0.421875 0.71875 -0.6875 1.3125q-0.25 0.578125 -0.25 0.9375q0 0.125 0.078125 0.234375q0.09375 0.109375 0.09375 0.109375zm11.989746 4.578125q-0.453125 -0.1875 -0.734375 -0.546875q-0.265625 -0.359375 -0.296875 -1.046875q-0.03125 -0.453125 0.03125 -1.328125q0.0625 -0.890625 0.25 -1.796875q0 -0.078125 0.015625 -0.15625q0.015625 -0.09375 0.015625 -0.125q0.03125 -0.140625 0.0625 -0.421875q0.03125 -0.296875 0.078125 -0.484375q0.0625 -0.203125 0.0625 -0.03125l0.203125 -0.9375q0.140625 -0.625 0.3125 -1.28125q0.171875 -0.65625 0.328125 -1.15625q0.15625 -0.5 0.203125 -0.578125q0 0 0.109375 -0.390625q0.125 -0.40625 0.265625 -0.8125q0.546875 -1.734375 1.046875 -2.875q0.515625 -1.15625 0.890625 -1.671875q0.03125 -0.09375 0.09375 -0.25q0.078125 -0.15625 0.125 -0.265625q0 -0.375 0.046875 -0.546875q0.046875 -0.1875 0.15625 -0.3125q0.0625 -0.140625 0.203125 -0.359375q0.140625 -0.234375 0.3125 -0.578125l0.140625 -0.390625q0.203125 -0.453125 0.515625 -1.046875q0.3125 -0.609375 0.453125 -0.921875q0.109375 -0.3125 0.421875 -0.75q0.3125 -0.4375 0.6875 -0.5625l-0.03125 -0.046875q0.171875 -0.3125 0.65625 -0.5q0.484375 -0.1875 0.796875 -0.1875q0.484375 0 0.734375 0.28125q0.265625 0.265625 0.265625 0.75q0 0.53125 -0.015625 0.765625q-0.015625 0.21875 -0.25 0.484375l0.03125 0q0.0625 -0.0625 -0.046875 0.21875q-0.09375 0.265625 -0.171875 0.375q-0.0625 0.171875 -0.15625 0.390625q-0.078125 0.203125 -0.1875 0.296875q-0.09375 0.140625 -0.546875 0.96875q-0.453125 0.8125 -1.015625 1.90625q-0.546875 1.078125 -1.046875 2.109375q-0.5 1.015625 -0.71875 1.578125l-0.140625 0.515625q-0.0625 0.140625 -0.0625 0.203125q-0.109375 0.25 -0.21875 
0.5q-0.09375 0.234375 -0.203125 0.40625l-0.546875 2.140625l-0.140625 0.734375l-0.3125 0.796875l-0.203125 0.65625q-0.078125 0.203125 -0.109375 0.453125q-0.03125 0.234375 -0.140625 0.34375q0 0.03125 -0.015625 0.09375q-0.015625 0.046875 -0.015625 0.078125q-0.140625 0.5625 -0.46875 1.5625q-0.328125 1.0 -0.390625 2.21875q0 0.03125 -0.03125 0.078125q-0.015625 0.03125 -0.015625 0.0625q-0.09375 0.71875 -0.421875 1.109375q-0.328125 0.40625 -0.859375 0.28125zm7.3875732 -0.671875q-0.65625 0.171875 -1.1875 -0.359375q-0.515625 -0.53125 -0.71875 -1.5625q-0.203125 -1.046875 0.0625 -2.328125q0.171875 -0.421875 0.28125 -0.8125q0.109375 -0.40625 0.109375 -0.65625q0.0625 -0.34375 0.265625 -1.015625q0.21875 -0.671875 0.453125 -1.390625q0.25 -0.734375 0.46875 -1.3125q0.234375 -0.578125 0.296875 -0.703125q0.140625 0 0.234375 -0.25q0.109375 -0.25 0.109375 -0.484375q0 -0.28125 0.125 -0.625q0.125 -0.34375 0.265625 -0.484375q0 -0.171875 0.078125 -0.390625q0.09375 -0.234375 0.09375 -0.375q0 -0.140625 0.078125 -0.265625q0.09375 -0.140625 0.09375 -0.28125q0.234375 -0.453125 0.625 -1.4375q0.40625 -1.0 0.875 -2.203125q0.46875 -1.21875 0.890625 -2.390625q0.421875 -1.1875 0.671875 -2.0q0.265625 -0.8125 0.265625 -0.984375q0 -0.28125 0.203125 -0.515625q0.203125 -0.25 0.484375 -0.25q0.109375 -0.09375 0.484375 -0.140625q0.390625 -0.0625 0.53125 -0.0625q0 0 0.125 0.140625q0.140625 0.140625 0.421875 0.40625q0.28125 0.25 0.53125 0.5625q0.265625 0.3125 0.265625 0.71875q0 0.453125 -0.125 0.875q-0.109375 0.421875 -0.421875 1.125q-0.3125 0.703125 -0.984375 2.09375q-0.265625 0.703125 -0.578125 1.3125q-0.3125 0.59375 -0.3125 0.734375q0 0.140625 -0.125 0.421875q-0.125 0.265625 -0.265625 0.6875q-0.125 0.28125 -0.375 0.828125q-0.25 0.53125 -0.515625 1.140625q-0.25 0.59375 -0.421875 1.03125q-0.171875 0.4375 -0.171875 0.5q0 0.140625 -0.015625 0.28125q-0.015625 0.140625 -0.15625 0.28125q-0.140625 0 -0.15625 0.1875q-0.015625 0.1875 -0.015625 0.1875q0 0.34375 -0.5625 1.828125q-0.546875 1.90625 -1.015625 
3.515625q-0.46875 1.59375 -0.46875 2.5625q0 0.828125 -0.265625 1.28125q-0.25 0.453125 -0.53125 0.578125zm3.6063232 8.53125q-0.765625 0.140625 -1.046875 0.03125q-0.265625 -0.09375 -0.765625 -0.34375q-0.375 -0.28125 -0.546875 -0.421875q-0.171875 -0.125 -0.25 -0.265625q-0.34375 -0.734375 -0.09375 -1.078125q0.234375 -0.34375 0.703125 -0.4375q0.46875 -0.078125 0.8125 -0.078125q0.25 0.140625 0.8125 0.140625q0.578125 0 0.71875 -0.140625q0 0 0.328125 -0.203125q0.328125 -0.1875 0.890625 -0.28125q1.0625 -0.53125 1.6875 -1.0625q0.625 -0.53125 1.390625 -1.8125q0.3125 -0.5 0.65625 -1.140625q0.34375 -0.640625 0.484375 -1.265625q0.28125 -0.6875 0.640625 -1.859375q0.375 -1.1875 0.640625 -2.34375q0.28125 -1.15625 0.3125 -1.75q0.21875 -1.9375 0.796875 -3.75q0.59375 -1.828125 0.9375 -3.875q0.203125 -0.484375 0.3125 -0.5625q0.109375 -0.09375 0.453125 -0.09375q0.375 0 0.890625 0.234375q0.53125 0.21875 0.765625 0.59375q0.359375 0.65625 0.328125 1.546875q-0.015625 0.875 -0.453125 1.546875q-0.359375 0.484375 -0.625 1.5q-0.25 1.015625 -0.484375 2.546875q-0.1875 0.765625 -0.375 1.796875q-0.1875 1.015625 -0.390625 2.265625q-0.140625 0.34375 -0.390625 1.0625q-0.234375 0.703125 -0.3125 1.109375q-0.03125 0.140625 -0.203125 0.546875q-0.171875 0.421875 -0.34375 0.8125q-0.171875 0.40625 -0.25 0.546875q-1.65625 3.359375 -3.828125 4.84375q-2.15625 1.5 -4.203125 1.640625zm2.90625 -9.8125q-0.375 0 -0.78125 -0.40625q-0.390625 -0.421875 -0.53125 -0.65625q-0.140625 -0.359375 -0.53125 -0.671875q-0.375 -0.3125 -0.65625 -0.75q-0.265625 -0.453125 -0.125 -1.109375q0.125 -0.453125 0.40625 -1.375q0.28125 -0.921875 0.65625 -2.046875q0.390625 -1.125 0.78125 -2.234375q0.40625 -1.109375 0.75 -1.96875q0.34375 -0.875 0.59375 -1.21875q0.203125 -0.28125 0.65625 -0.40625q0.453125 -0.140625 0.890625 -0.078125q0.453125 0.0625 0.671875 0.421875q0.375 0.484375 0.421875 0.953125q0.0625 0.46875 -0.5 1.46875q-0.140625 0.375 -0.34375 0.8125q-0.203125 0.4375 -0.203125 0.578125q-0.046875 0.0625 -0.140625 0.46875q-0.078125 
0.390625 -0.1875 0.734375q-0.09375 0.34375 -0.203125 0.34375q-0.125 0.140625 -0.28125 0.734375q-0.15625 0.59375 -0.28125 1.3125q-0.125 0.703125 -0.15625 1.234375q-0.03125 0.53125 0.09375 0.53125q1.078125 0 2.078125 -0.671875q1.015625 -0.671875 2.1875 -2.265625l0.3125 3.25q-0.484375 0.25 -0.90625 0.515625q-0.40625 0.25 -0.75 0.59375q-0.109375 0.140625 -0.5 0.390625q-0.375 0.234375 -0.734375 0.4375q-0.359375 0.1875 -0.46875 0.1875q0 0 -0.09375 0.015625q-0.078125 0.015625 -0.078125 0.15625q0 0.0625 -0.453125 0.265625q-0.453125 0.1875 -0.9375 0.328125q-0.484375 0.125 -0.65625 0.125zm21.626465 2.984375q-1.703125 -0.03125 -2.171875 -0.953125q-0.453125 -0.921875 -0.25 -2.296875q0.203125 -0.671875 0.25 -1.203125q0.0625 -0.53125 0.296875 -0.671875q0 0 0.046875 -0.09375q0.0625 -0.109375 -0.046875 -0.21875q0 -0.453125 0.3125 -1.25q0.078125 -0.203125 0.1875 -0.5q0.125 -0.296875 0.125 -0.296875q-0.171875 -0.28125 0 -0.453125q0.171875 -0.171875 0.359375 -0.328125q0.203125 -0.15625 0.03125 -0.359375q-0.078125 -0.140625 0.15625 -0.640625q0.25 -0.5 0.5625 -1.046875q0.3125 -0.5625 0.421875 -0.875q0 -0.078125 0.171875 -0.484375q0.171875 -0.421875 0.40625 -0.96875q0.25 -0.5625 0.46875 -1.015625q0.234375 -0.453125 0.296875 -0.59375q0.359375 -0.171875 -0.265625 -0.234375q-0.625 -0.0625 -1.328125 -0.03125q-1.0625 -0.03125 -1.71875 -0.09375q-0.640625 -0.078125 -1.046875 -0.328125q-0.421875 -0.484375 -0.109375 -1.15625q0.3125 -0.671875 1.015625 -0.71875q0.4375 -0.03125 0.703125 -0.03125q0.265625 0 0.6875 -0.046875q0.4375 -0.0625 1.375 -0.15625l1.765625 -0.140625l0.109375 -0.59375q0.265625 -0.453125 0.375 -0.578125q0.109375 -0.140625 0.140625 -0.296875q0.03125 -0.15625 0.109375 -0.578125q0 0 0.3125 -0.5625q0.3125 -0.578125 0.375 -1.203125q0.421875 -0.6875 0.734375 -1.265625q0.3125 -0.578125 0.375 -0.703125q0.59375 -0.875 1.15625 -0.9375q0.578125 -0.078125 1.09375 0.265625q0.3125 0.171875 0.59375 0.4375q0.28125 0.265625 0.34375 0.671875q0.0625 0.390625 -0.3125 1.015625q-0.0625 0.140625 
-0.203125 0.5q-0.140625 0.359375 -0.21875 0.609375q-0.203125 0.234375 -0.5 0.78125q-0.296875 0.53125 -0.390625 1.078125q-0.28125 0.59375 -0.359375 0.734375q-0.0625 0.140625 -0.203125 0.140625q0.078125 0.34375 0.421875 0.328125q0.34375 -0.015625 1.140625 -0.046875q0.65625 -0.078125 0.921875 -0.15625q0.265625 -0.09375 0.609375 0.078125q0.171875 0.25 0.1875 0.71875q0.015625 0.46875 -0.125 0.90625q-0.140625 0.421875 -0.375 0.53125q-0.171875 0 -0.71875 0.1875q-0.53125 0.1875 -1.328125 0.21875l-1.84375 0.171875l-0.578125 1.1875q-0.25 0.453125 -0.421875 0.8125q-0.171875 0.359375 -0.25 0.4375q0 0.203125 -0.21875 0.625q-0.21875 0.40625 -0.46875 0.890625q-0.203125 0.59375 -0.46875 1.078125q-0.25 0.484375 -0.25 0.484375q-0.140625 0 -0.140625 0.203125q0.171875 0.421875 -0.109375 0.625q-0.171875 0.109375 -0.296875 0.296875q-0.125 0.1875 -0.125 0.1875q0.140625 0.140625 0.046875 0.359375q-0.078125 0.203125 -0.21875 0.4375q0 0.28125 -0.15625 0.8125q-0.15625 0.515625 -0.328125 0.96875q0.03125 0.609375 0.0625 0.796875q0.046875 0.171875 0.1875 0.125q0.171875 0.046875 0.484375 0.015625q0.3125 -0.046875 0.453125 -0.046875q0.515625 -0.234375 -0.484375 2.421875q-0.5625 0.875 -1.125 1.4375q-0.546875 0.578125 -0.71875 0.578125zm9.596252 -3.390625q-0.234375 0.03125 -0.53125 -0.1875q-0.296875 -0.234375 -0.53125 -0.40625q-0.359375 -0.421875 -0.5 -0.734375q-0.125 -0.3125 -0.125 -0.6875q0 -0.375 0.0625 -0.90625q0.171875 -0.546875 0.40625 -1.484375q0.25 -0.9375 0.59375 -2.015625q0.078125 -0.40625 0.265625 -0.96875q0.1875 -0.578125 0.390625 -1.109375q0.21875 -0.546875 0.3125 -0.828125q0.25 -0.859375 0.5 -1.65625q0.265625 -0.796875 0.640625 -1.828125q0.390625 -1.046875 0.953125 -2.65625q0.75 -2.078125 1.125 -3.109375q0.390625 -1.046875 0.5625 -1.484375q0.171875 -0.453125 0.25 -0.734375q0 -0.28125 0.390625 -0.53125q0.40625 -0.265625 0.71875 -0.265625q0.484375 0.03125 0.921875 0.578125q0.453125 0.53125 0.625 1.28125q0.1875 0.734375 -0.234375 1.4375q-0.546875 1.234375 -1.09375 2.78125q-0.53125 
1.546875 -1.0625 3.078125q-0.515625 1.515625 -1.03125 2.796875q-0.03125 0.171875 -0.234375 0.59375q-0.1875 0.40625 -0.328125 0.6875q-0.125 0.28125 -0.25 0.53125q-0.125 0.234375 -0.125 0.234375q-0.03125 0.203125 -0.140625 0.59375q-0.09375 0.375 -0.203125 0.65625q-0.03125 0.171875 -0.109375 0.453125q-0.0625 0.265625 -0.09375 0.546875q0.171875 -0.203125 0.484375 -0.5q0.3125 -0.296875 0.796875 -0.75q0.546875 -0.484375 0.953125 -0.828125q0.421875 -0.34375 0.84375 -0.625q0.375 -0.28125 0.703125 -0.515625q0.34375 -0.25 0.34375 -0.25q0.0625 -0.09375 0.140625 -0.234375q0.09375 -0.140625 0.203125 -0.171875q0.0625 -0.078125 0.234375 -0.1875q0.171875 -0.125 0.34375 -0.265625q0.3125 -0.203125 0.65625 -0.34375q0.359375 -0.140625 0.734375 -0.203125q0.390625 -0.078125 0.734375 -0.078125q0.40625 0 0.859375 0.328125q0.453125 0.328125 0.828125 0.921875q0.359375 0.453125 0.4375 1.0625q0.09375 0.59375 0.015625 1.609375q0 0.265625 -0.15625 1.03125q-0.15625 0.765625 -0.21875 1.140625l-0.3125 1.1875l0.28125 0q0.375 -0.140625 0.734375 -0.03125q0.37506104 0.09375 0.37506104 0.09375q0.171875 0.109375 0.328125 0.25q0.15625 0.140625 0.296875 0.203125q0.203125 0.28125 -0.078125 0.796875q-0.28125 0.515625 -0.62506104 0.734375q-0.828125 0.546875 -1.46875 0.6875q-0.640625 0.140625 -1.328125 -0.25q-0.671875 -0.4375 -0.890625 -1.0625q-0.21875 -0.625 -0.1875 -1.703125q0.140625 -1.53125 0.171875 -2.203125q0.03125 -0.671875 -0.34375 -0.671875q-0.421875 0 -1.140625 0.40625q-0.703125 0.390625 -1.578125 1.0625q-0.859375 0.671875 -1.703125 1.5q-0.84375 0.8125 -1.515625 1.640625q-0.40625 0.515625 -0.890625 1.015625q-0.484375 0.5 -1.1875 0.546875zm21.736145 0.625q-1.046875 0.578125 -2.203125 0.671875q-1.15625 0.078125 -2.171875 -0.3125q-1.0 -0.40625 -1.625 -1.203125q-0.34375 -0.4375 -0.65625 -1.421875q-0.3125 -1.0 -0.109375 -2.703125q0.21875 -1.59375 0.984375 -3.21875q0.78125 -1.625 1.921875 -2.875q1.15625 -1.25 2.5 -1.703125q0.484375 -0.171875 0.984375 -0.25q0.515625 -0.09375 1.34375 -0.015625q0.796875 
0.09375 1.421875 0.390625q0.625 0.296875 0.90625 0.640625q0.546875 0.5625 0.71875 1.078125q0.171875 0.515625 0.140625 1.0q-0.171875 1.109375 -0.984375 2.390625q-0.8125 1.28125 -2.265625 1.90625q-0.59375 0.25 -1.390625 0.328125q-0.796875 0.0625 -2.1875 -0.21875q-0.828125 -0.171875 -1.09375 0.171875q-0.25 0.34375 -0.21875 1.359375q0.0625 0.609375 0.28125 1.078125q0.234375 0.46875 0.859375 0.640625q0.796875 0.140625 1.140625 0.03125q0.34375 -0.125 0.875 -0.375q0.265625 -0.171875 0.53125 -0.375q0.265625 -0.203125 0.515625 -0.40625q0.265625 -0.21875 0.4375 -0.328125q0.171875 -0.140625 0.34375 -0.171875q0.171875 -0.03125 0.34375 -0.03125q0.390625 0.03125 0.734375 0.421875q0.34375 0.375 0.453125 0.71875q0 0.28125 -0.359375 0.78125q-0.34375 0.5 -0.921875 1.046875q-0.5625 0.53125 -1.25 0.953125zm-0.453125 -7.734375q0.25 -0.109375 0.703125 -0.484375q0.46875 -0.375 0.84375 -0.8125q0.390625 -0.4375 0.390625 -0.671875q0 -0.390625 -0.5 -0.65625q-0.5 -0.28125 -1.296875 -0.109375q-0.28125 0.0625 -0.671875 0.375q-0.375 0.3125 -0.75 0.71875q-0.375 0.390625 -0.65625 0.765625q-0.28125 0.359375 -0.390625 0.5625q0.140625 0.171875 0.5625 0.28125q0.421875 0.09375 0.921875 0.09375q0.5 0 0.84375 -0.0625zm16.977661 8.625q-1.109375 0 -2.03125 -0.5q-0.90625 -0.5 -1.46875 -1.234375q-0.546875 -0.75 -0.546875 -1.453125q0 -0.3125 0.109375 -0.4375q0.125 -0.140625 0.375 -0.140625q0.171875 0.03125 0.40625 0.171875q0.25 0.140625 0.5 0.3125q0.546875 0.375 1.203125 0.484375q0.65625 0.09375 1.171875 -0.09375q0.53125 -0.1875 0.59375 -0.609375q0 -0.140625 -0.140625 -0.578125q-0.125 -0.453125 -0.34375 -0.875q-0.859375 -1.28125 -1.078125 -2.4375q-0.203125 -1.171875 -0.265625 -2.34375q0 -0.484375 0.4375 -1.09375q0.453125 -0.609375 1.125 -1.1875q0.71875 -0.625 1.59375 -1.171875q0.890625 -0.5625 1.75 -0.9375q0.875 -0.390625 1.46875 -0.453125q0.625 0 1.328125 0.046875q0.71875 0.046875 0.953125 0.25q0.484375 0.390625 0.890625 1.265625q0.421875 0.859375 0.421875 1.4375q0 0.28125 -0.296875 0.6875q-0.28125 0.390625 
-0.53125 0.875q-0.140625 0.25 -0.25 0.53125q-0.09375 0.265625 -0.234375 0.546875q-0.203125 0.484375 -0.75 0.40625q-0.53125 -0.09375 -0.984375 -0.5625q-0.453125 -0.46875 -0.453125 -1.09375q0 -0.515625 0.3125 -0.890625q0.3125 -0.390625 0.3125 -0.53125q0 -0.0625 -0.0625 -0.09375q-0.0625 -0.046875 -0.25 -0.046875q-0.234375 0 -0.78125 0.203125q-0.53125 0.1875 -1.125 0.53125q-0.578125 0.34375 -1.0625 0.828125q-0.34375 0.390625 -0.578125 0.828125q-0.21875 0.421875 -0.21875 0.90625q0 0.484375 0.1875 1.125q0.1875 0.640625 0.4375 1.0625q0.0625 0.0625 0.21875 0.359375q0.15625 0.296875 0.359375 0.71875q1.28125 2.625 0.4375 3.90625q-0.84375 1.28125 -3.140625 1.28125zm11.957886 -0.6875q-0.28125 0.0625 -0.734375 0.046875q-0.4375 -0.015625 -0.765625 -0.09375q-0.328125 -0.09375 -0.328125 -0.203125q0 -0.171875 -0.171875 -0.171875q0 0 -0.390625 -0.28125q-0.375 -0.296875 -0.6875 -0.4375q-0.3125 -0.484375 -0.40625 -1.125q-0.078125 -0.640625 0.40625 -2.28125q0.03125 -1.109375 0.609375 -2.328125q0.59375 -1.234375 1.5 -2.40625q0.90625 -1.1875 1.875 -2.09375q0.96875 -0.921875 1.734375 -1.375q0.484375 -0.484375 1.296875 -0.515625q0.8125 -0.046875 1.4375 0.296875q0.34375 0.1875 0.75 0.703125q0.421875 0.515625 0.796875 1.09375q0.390625 0.5625 0.609375 1.015625q0.09375 0.375 0.171875 0.671875q0.09375 0.296875 0.125 0.71875q0.046875 0.40625 -0.0625 1.25q-0.171875 0.625 -0.296875 1.265625q-0.125 0.640625 -0.125 1.296875q-0.0625 0.65625 0.15625 0.9375q0.234375 0.265625 0.6875 0.03125q0.40625 -0.203125 0.859375 -0.046875q0.453125 0.15625 0.640625 0.5625q0.1875 0.390625 -0.125 1.0q-0.203125 0.453125 -0.65625 0.875q-0.453125 0.421875 -1.015625 0.6875q-0.546875 0.25 -1.0625 0.25q-0.375 0 -0.765625 -0.296875q-0.375 -0.296875 -0.6875 -0.8125q-0.3125 -0.515625 -0.515625 -1.140625l-0.21875 -0.796875l-1.234375 1.140625q-0.453125 0.484375 -0.890625 0.890625q-0.4375 0.390625 -0.53125 0.5625q-0.171875 0.3125 -0.6875 0.5625q-0.5 0.234375 -1.296875 0.546875zm0.9375 -3.328125q0.59375 -0.34375 1.015625 
-0.828125q0.4375 -0.484375 0.84375 -1.09375q0.421875 -0.609375 0.921875 -1.296875q0.578125 -0.796875 0.84375 -1.28125q0.265625 -0.484375 0.390625 -1.046875q-0.125 -0.375 -0.375 -0.8125q-0.25 -0.4375 -0.484375 -0.46875q-0.484375 0 -1.171875 0.59375q-0.671875 0.578125 -1.46875 1.6875q-0.140625 0.1875 -0.34375 0.5625q-0.203125 0.375 -0.40625 0.703125q-0.1875 0.328125 -0.25 0.4375q-0.109375 0.140625 -0.296875 0.640625q-0.1875 0.5 -0.328125 1.015625q-0.140625 0.5 -0.140625 0.703125q0 0.203125 0.046875 0.453125q0.0625 0.234375 0.328125 0.3125q0.28125 0.0625 0.875 -0.28125zm11.990234 2.671875q-0.65625 -0.28125 -1.0625 -0.84375q-0.390625 -0.578125 -0.390625 -1.40625q0 -0.359375 0.140625 -1.1875q0.140625 -0.828125 0.375 -1.859375q0.25 -1.046875 0.484375 -1.953125q0.34375 -0.625 0.46875 -1.25q0.125 -0.625 0.125 -0.75q0.0625 -0.421875 0.234375 -1.109375q0.1875 -0.703125 0.375 -1.265625q0.1875 -0.578125 0.25 -0.640625q0.1875 -0.171875 0.59375 -0.203125q0.421875 -0.046875 0.828125 0.078125q0.421875 0.125 0.5625 0.46875q0.375 0.40625 0.421875 1.25q0.0625 0.828125 -0.3125 1.9375q-0.3125 0.6875 -0.625 1.625q-0.3125 0.9375 -0.546875 1.71875q-0.21875 0.78125 -0.25 1.0625l-1.25 -0.046875q0.578125 -0.546875 1.09375 -1.15625q0.53125 -0.609375 0.78125 -0.953125q0.578125 -0.796875 1.015625 -1.328125q0.4375 -0.546875 1.125 -1.3125q0.6875 -0.796875 1.484375 -1.125q0.796875 -0.328125 1.1875 -0.328125q0.765625 0.140625 1.359375 0.78125q0.609375 0.640625 0.75 1.890625q0 0.484375 -0.09375 1.140625q-0.078125 0.65625 -0.078125 0.65625q0 0 0.5625 -0.6875q0.578125 -0.6875 1.546875 -1.5625q0.65625 -0.484375 1.28125 -0.5625q0.625 -0.09375 1.109375 0.421875q0.90625 0.484375 1.3125 1.703125q0.421875 1.21875 0.28125 2.734375q-0.09375 0.90625 0.015625 1.453125q0.125 0.53125 0.265625 0.9375q0.453125 0.625 0.140625 1.09375q-0.3125 0.46875 -0.546875 0.46875q-0.359375 0 -1.046875 -0.140625q-0.6875 -0.140625 -1.265625 -0.484375q-0.578125 -0.34375 -0.578125 -0.9375l0 -2.140625q0 -0.765625 -0.046875 
-1.125q-0.046875 -0.375 -0.1875 -0.46875q-0.375 -0.21875 -0.9375 0.3125q-0.546875 0.515625 -1.203125 1.46875q-0.640625 0.953125 -1.359375 2.125q-0.65625 1.078125 -1.078125 1.3125q-0.40625 0.21875 -0.796875 -0.03125q-0.765625 -0.234375 -0.953125 -0.65625q-0.1875 -0.4375 -0.046875 -1.0q0.171875 -0.59375 0.421875 -1.515625q0.265625 -0.9375 0.5 -1.828125q0.25 -0.890625 0.328125 -1.296875q-0.140625 -0.28125 -0.46875 -0.1875q-0.328125 0.078125 -0.6875 0.328125q-0.40625 0.265625 -0.875 0.578125q-0.46875 0.3125 -1.09375 1.328125q-0.421875 0.65625 -0.9375 1.4375q-0.515625 0.78125 -0.953125 1.40625q-0.4375 0.609375 -0.703125 0.859375q-0.25 0.625 -0.453125 0.78125q-0.203125 0.15625 -0.59375 -0.015625zm20.235107 0.828125q-1.34375 -0.546875 -1.875 -1.796875q-0.515625 -1.25 -0.265625 -3.328125q0.0625 -1.078125 -0.078125 -1.4375q-0.125 -0.375 -0.515625 -0.71875q-0.09375 -0.140625 -0.484375 -0.46875q-0.375 -0.328125 -0.34375 -0.5q-0.03125 -0.234375 0.09375 -0.59375q0.140625 -0.375 0.25 -0.40625q0.25 -0.109375 0.5 -0.25q0.265625 -0.140625 0.40625 -0.0625q0.140625 0 0.421875 0.171875q0.296875 0.171875 0.296875 0.3125q0.140625 0.234375 0.4375 -0.203125q0.296875 -0.453125 0.640625 -1.15625q0.34375 -0.546875 0.859375 -1.03125q0.53125 -0.484375 1.0625 -0.875q0.546875 -0.40625 0.921875 -0.609375q0.171875 -0.078125 0.671875 -0.21875q0.5 -0.140625 1.046875 -0.21875q0.5625 -0.09375 0.84375 -0.0625q0.625 0.046875 1.3125 0.53125q0.703125 0.484375 1.078125 0.890625q0.453125 0.3125 0.609375 0.953125q0.15625 0.640625 0.125 1.34375q-0.03125 0.6875 -0.28125 1.171875q-0.109375 0.34375 -0.734375 0.96875q-0.609375 0.625 -1.359375 1.203125q-0.75 0.5625 -1.265625 0.78125q-0.453125 0.09375 -1.078125 0.15625q-0.625 0.046875 -1.171875 0q-0.53125 -0.0625 -0.734375 -0.15625q-0.234375 -0.140625 -0.390625 0.09375q-0.15625 0.21875 -0.265625 1.1875q-0.03125 0.375 0.015625 0.65625q0.046875 0.28125 0.125 0.421875q0.234375 0.40625 0.46875 0.609375q0.25 0.1875 0.921875 0.109375q0.65625 -0.0625 1.328125 
-0.8125q0.671875 -0.75 1.234375 -1.203125q0.140625 -0.234375 0.59375 -0.3125q0.46875 -0.09375 1.140625 0.078125q0.375 0.234375 0.359375 0.765625q-0.015625 0.515625 -0.578125 1.515625q-0.09375 0.25 -0.5625 0.71875q-0.46875 0.46875 -1.015625 0.9375q-0.53125 0.46875 -0.875 0.671875q-1.046875 0.546875 -2.046875 0.515625q-1.0 -0.03125 -1.84375 -0.34375zm2.578125 -7.796875q0.515625 -0.140625 0.875 -0.28125q0.359375 -0.140625 0.703125 -0.453125q0.359375 -0.3125 0.8125 -0.921875q0.125 -0.390625 0.140625 -0.65625q0.03125 -0.28125 -0.046875 -0.359375q-0.28125 -0.28125 -0.609375 -0.1875q-0.328125 0.09375 -0.703125 0.359375q-0.25 0.109375 -0.6875 0.4375q-0.421875 0.328125 -0.828125 0.734375q-0.390625 0.390625 -0.625 0.71875q-0.21875 0.328125 -0.078125 0.40625q-0.0625 0.140625 0.390625 0.171875q0.46875 0.03125 0.65625 0.03125z"
fill-rule="nonzero" />
<path fill="#7c7ce0"
d="m736.979 367.00525l114.703186 0l0 -13.535431l16.5094 27.070862l-16.5094 27.070862l0 -13.535431l-114.703186 0z"
fill-rule="evenodd" />
<path shape-rendering="crispEdges" fill="#b6d7a8"
d="m335.4777 27.61155l30.595795 0l0 22.398949l-30.595795 0l0 -22.398949z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#b6d7a8"
d="m366.0735 27.61155l30.595795 0l0 22.398949l-30.595795 0l0 -22.398949z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#b6d7a8"
d="m396.66928 27.61155l30.595825 0l0 22.398949l-30.595825 0l0 -22.398949z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#b6d7a8"
d="m335.4777 50.010498l30.595795 0l0 22.398949l-30.595795 0l0 -22.398949z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#b6d7a8"
d="m366.0735 50.010498l30.595795 0l0 22.398949l-30.595795 0l0 -22.398949z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#b6d7a8"
d="m396.66928 50.010498l30.595825 0l0 22.398949l-30.595825 0l0 -22.398949z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#b6d7a8"
d="m335.4777 72.40945l30.595795 0l0 22.398949l-30.595795 0l0 -22.398949z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#b6d7a8"
d="m366.0735 72.40945l30.595795 0l0 22.398949l-30.595795 0l0 -22.398949z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#b6d7a8"
d="m396.66928 72.40945l30.595825 0l0 22.398949l-30.595825 0l0 -22.398949z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#b6d7a8"
d="m335.4777 94.808395l30.595795 0l0 22.398956l-30.595795 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#b6d7a8"
d="m366.0735 94.808395l30.595795 0l0 22.398956l-30.595795 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#b6d7a8"
d="m396.66928 94.808395l30.595825 0l0 22.398956l-30.595825 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#6aa84f"
d="m335.4777 117.20735l30.595795 0l0 22.398941l-30.595795 0l0 -22.398941z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#6aa84f"
d="m366.0735 117.20735l30.595795 0l0 22.398941l-30.595795 0l0 -22.398941z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#6aa84f"
d="m396.66928 117.20735l30.595825 0l0 22.398941l-30.595825 0l0 -22.398941z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#b6d7a8"
d="m335.4777 139.6063l30.595795 0l0 22.398956l-30.595795 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#b6d7a8"
d="m366.0735 139.6063l30.595795 0l0 22.398956l-30.595795 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#b6d7a8"
d="m396.66928 139.6063l30.595825 0l0 22.398956l-30.595825 0l0 -22.398956z"
fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m335.4777 27.112862l0 89.5958" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m335.4777 116.708664l0 22.398941" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m335.4777 139.1076l0 23.396332" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m366.0735 27.112862l0 89.5958" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m366.0735 116.708664l0 22.398941" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m366.0735 139.1076l0 23.396332" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m396.66928 27.112862l0 89.5958" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m396.66928 116.708664l0 22.398941" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m396.66928 139.1076l0 23.396332" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m427.2651 27.112862l0 89.5958" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m427.2651 116.708664l0 22.398941" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m427.2651 139.1076l0 23.396332" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m334.979 27.61155l92.78479 0" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m334.979 50.010498l92.78479 0" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m334.979 72.40945l92.78479 0" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m334.979 94.808395l92.78479 0" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m334.979 117.20735l92.78479 0" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m334.979 139.6063l92.78479 0" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m334.979 162.00525l92.78479 0" fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#d5a6bd"
d="m335.4777 313.88977l30.595795 0l0 22.398956l-30.595795 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#d5a6bd"
d="m366.0735 313.88977l30.595795 0l0 22.398956l-30.595795 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#d5a6bd"
d="m396.66928 313.88977l30.595825 0l0 22.398956l-30.595825 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#d5a6bd"
d="m335.4777 336.28873l30.595795 0l0 22.398926l-30.595795 0l0 -22.398926z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#d5a6bd"
d="m366.0735 336.28873l30.595795 0l0 22.398926l-30.595795 0l0 -22.398926z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#d5a6bd"
d="m396.66928 336.28873l30.595825 0l0 22.398926l-30.595825 0l0 -22.398926z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#d5a6bd"
d="m335.4777 358.68765l30.595795 0l0 22.398956l-30.595795 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#d5a6bd"
d="m366.0735 358.68765l30.595795 0l0 22.398956l-30.595795 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#d5a6bd"
d="m396.66928 358.68765l30.595825 0l0 22.398956l-30.595825 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#d5a6bd"
d="m335.4777 381.0866l30.595795 0l0 22.398956l-30.595795 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#d5a6bd"
d="m366.0735 381.0866l30.595795 0l0 22.398956l-30.595795 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#d5a6bd"
d="m396.66928 381.0866l30.595825 0l0 22.398956l-30.595825 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#c27ba0"
d="m335.4777 403.48557l30.595795 0l0 22.398956l-30.595795 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#c27ba0"
d="m366.0735 403.48557l30.595795 0l0 22.398956l-30.595795 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#c27ba0"
d="m396.66928 403.48557l30.595825 0l0 22.398956l-30.595825 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#d5a6bd"
d="m335.4777 425.88452l30.595795 0l0 22.398956l-30.595795 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#d5a6bd"
d="m366.0735 425.88452l30.595795 0l0 22.398956l-30.595795 0l0 -22.398956z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#d5a6bd"
d="m396.66928 425.88452l30.595825 0l0 22.398956l-30.595825 0l0 -22.398956z"
fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m335.4777 313.39108l0 89.595795" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m335.4777 402.98688l0 22.398956" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m335.4777 425.38583l0 23.396332" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m366.0735 313.39108l0 89.595795" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m366.0735 402.98688l0 22.398956" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m366.0735 425.38583l0 23.396332" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m396.66928 313.39108l0 89.595795" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m396.66928 402.98688l0 22.398956" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m396.66928 425.38583l0 23.396332" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m427.2651 313.39108l0 89.595795" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m427.2651 402.98688l0 22.398956" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m427.2651 425.38583l0 23.396332" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m334.979 313.88977l92.78479 0" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m334.979 336.28873l92.78479 0" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m334.979 358.68765l92.78479 0" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m334.979 381.0866l92.78479 0" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m334.979 403.48557l92.78479 0" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m334.979 425.88452l92.78479 0" fill-rule="nonzero" />
<path stroke="#9e9e9e" stroke-width="1.0" stroke-linecap="butt"
d="m334.979 448.28348l92.78479 0" fill-rule="nonzero" />
<path fill="#7c7ce0"
d="m213.92123 148.87953l94.19919 -54.37874l-6.0551147 -10.48819l24.903168 13.591339l-0.68270874 28.361412l-6.0551147 -10.48819l-94.19917 54.378746z"
fill-rule="evenodd" />
<path fill="#7c7ce0"
d="m226.03221 295.8483l94.19917 54.378754l6.0551147 -10.48819l0.68270874 28.36142l-24.903168 13.591339l6.0551147 -10.48819l-94.19917 -54.378754z"
fill-rule="evenodd" />
<path fill="#fce5cd"
d="m29.88197 214.84023l0 0c-1.8811607 -15.205673 4.295059 -30.258377 15.907867 -38.770676c11.612804 -8.512299 26.625553 -8.991333 38.66779 -1.233841l0 0c4.265732 -8.841202 12.073257 -14.94545 21.060959 -16.466324c8.987701 -1.5208588 18.099937 1.7202911 24.58043 8.743057l0 0c3.63385 -8.016006 10.768982 -13.401733 18.873474 -14.246063c8.104507 -0.8443146 16.031372 2.9722748 20.967758 10.095444l0 0c6.5650787 -8.496948 17.0103 -12.074524 26.815933 -9.184677c9.805649 2.8898468 17.21054 11.728058 19.010529 22.690308l0 0c8.04335 2.4131622 14.743271 8.54776 18.36876 16.818802c3.625473 8.271042 3.8208466 17.867157 0.53564453 26.309021l0 0c7.920334 11.338333 9.773102 26.446533 4.8668823 39.686493c-4.9062347 13.239929 -15.834351 22.622498 -28.706192 24.64624c-0.09072876 12.426147 -6.286545 23.828247 -16.19934 29.811432c-9.912781 5.9831543 -21.994614 5.613098 -31.588638 -0.9675598l0 0c-4.0865326 14.882507 -15.588776 25.832886 -29.537323 28.120209c-13.948563 2.287323 -27.842873 -4.498474 -35.680084 -17.42566l0 0c-9.606781 6.3718567 -21.134125 8.207367 -31.981682 5.0925293c-10.847565 -3.1148682 -20.101604 -10.917725 -25.674595 -21.648407l0 0c-9.816925 1.2635498 -19.308535 -4.3308105 -23.76418 -14.006653c-4.4556427 -9.675842 -2.926818 -21.373383 3.8277225 -29.2872l0 0c-8.756951 -5.6690674 -13.225276 -16.918411 -11.074918 -27.881912c2.1503592 -10.963516 10.432122 -19.156815 20.526672 -20.307434z"
fill-rule="evenodd" />
<path fill="#000000" fill-opacity="0.0"
d="m20.236162 263.61563l0 0c4.132433 2.675232 8.906361 3.888794 13.680807 3.4777222m6.253498 39.817078c2.0531883 -0.26428223 4.0657043 -0.8239136 5.9856415 -1.6644592m51.667767 18.219635c-1.4440765 -2.381958 -2.6531372 -4.9272766 -3.6065674 -7.5925903m68.82647 -3.1023865l0 0c0.7450409 -2.7132874 1.227768 -5.5058594 1.4401093 -8.331116m46.34607 -20.511719c0.096588135 -13.229584 -6.7349396 -25.342712 -17.560226 -31.13646m41.399292 -33.194885c-1.7531128 4.5049286 -4.4294586 8.501175 -7.819153 11.67543m-11.083603 -54.80443l0 0c0.29870605 1.8191833 0.43695068 3.6657257 0.41281128 5.5139923m-46.238495 -19.019073l0 0c-1.6376495 2.1195679 -2.9868774 4.488159 -4.0056305 7.032013m-35.835968 -2.882309l0 0c-0.8727417 1.9251862 -1.5244446 3.9623566 -1.940094 6.0646057m-43.701965 1.6588135l0 0c2.5481339 1.6414795 4.9054947 3.6171875 7.0203094 5.8836975m-61.595264 34.121033l0 0c0.25927162 2.0957336 0.6689358 4.165634 1.2251568 6.190323"
fill-rule="evenodd" />
<path stroke="#000000" stroke-width="1.0"
stroke-linejoin="round" stroke-linecap="butt"
d="m29.88197 214.84023l0 0c-1.8811607 -15.205673 4.295059 -30.258377 15.907867 -38.770676c11.612804 -8.512299 26.625553 -8.991333 38.66779 -1.233841l0 0c4.265732 -8.841202 12.073257 -14.94545 21.060959 -16.466324c8.987701 -1.5208588 18.099937 1.7202911 24.58043 8.743057l0 0c3.63385 -8.016006 10.768982 -13.401733 18.873474 -14.246063c8.104507 -0.8443146 16.031372 2.9722748 20.967758 10.095444l0 0c6.5650787 -8.496948 17.0103 -12.074524 26.815933 -9.184677c9.805649 2.8898468 17.21054 11.728058 19.010529 22.690308l0 0c8.04335 2.4131622 14.743271 8.54776 18.36876 16.818802c3.625473 8.271042 3.8208466 17.867157 0.53564453 26.309021l0 0c7.920334 11.338333 9.773102 26.446533 4.8668823 39.686493c-4.9062347 13.239929 -15.834351 22.622498 -28.706192 24.64624c-0.09072876 12.426147 -6.286545 23.828247 -16.19934 29.811432c-9.912781 5.9831543 -21.994614 5.613098 -31.588638 -0.9675598l0 0c-4.0865326 14.882507 -15.588776 25.832886 -29.537323 28.120209c-13.948563 2.287323 -27.842873 -4.498474 -35.680084 -17.42566l0 0c-9.606781 6.3718567 -21.134125 8.207367 -31.981682 5.0925293c-10.847565 -3.1148682 -20.101604 -10.917725 -25.674595 -21.648407l0 0c-9.816925 1.2635498 -19.308535 -4.3308105 -23.76418 -14.006653c-4.4556427 -9.675842 -2.926818 -21.373383 3.8277225 -29.2872l0 0c-8.756951 -5.6690674 -13.225276 -16.918411 -11.074918 -27.881912c2.1503592 -10.963516 10.432122 -19.156815 20.526672 -20.307434z"
fill-rule="evenodd" />
<path stroke="#000000" stroke-width="1.0"
stroke-linejoin="round" stroke-linecap="butt"
d="m20.236162 263.61563l0 0c4.132433 2.675232 8.906361 3.888794 13.680807 3.4777222m6.253498 39.817078c2.0531883 -0.26428223 4.0657043 -0.8239136 5.9856415 -1.6644592m51.667767 18.219635c-1.4440765 -2.381958 -2.6531372 -4.9272766 -3.6065674 -7.5925903m68.82647 -3.1023865l0 0c0.7450409 -2.7132874 1.227768 -5.5058594 1.4401093 -8.331116m46.34607 -20.511719c0.096588135 -13.229584 -6.7349396 -25.342712 -17.560226 -31.13646m41.399292 -33.194885c-1.7531128 4.5049286 -4.4294586 8.501175 -7.819153 11.67543m-11.083603 -54.80443l0 0c0.29870605 1.8191833 0.43695068 3.6657257 0.41281128 5.5139923m-46.238495 -19.019073l0 0c-1.6376495 2.1195679 -2.9868774 4.488159 -4.0056305 7.032013m-35.835968 -2.882309l0 0c-0.8727417 1.9251862 -1.5244446 3.9623566 -1.940094 6.0646057m-43.701965 1.6588135l0 0c2.5481339 1.6414795 4.9054947 3.6171875 7.0203094 5.8836975m-61.595264 34.121033l0 0c0.25927162 2.0957336 0.6689358 4.165634 1.2251568 6.190323"
fill-rule="evenodd" />
<path fill="#000000"
d="m46.73521 236.81306q-1.03125 0 -2.46875 -0.359375q-0.265625 -0.0625 -0.453125 -0.296875q-0.171875 -0.25 -0.171875 -0.546875q0 -0.140625 0.015625 -0.203125l2.671875 -12.46875q0.078125 -0.453125 0.453125 -0.75q0.375 -0.296875 0.828125 -0.296875q2.46875 0 4.328125 0.703125q1.875 0.6875 2.921875 2.046875q1.046875 1.359375 1.046875 3.328125q0 2.578125 -1.171875 4.59375q-1.171875 2.015625 -3.265625 3.140625q-2.09375 1.109375 -4.734375 1.109375zm0.46875 -2.078125q1.953125 0 3.421875 -0.765625q1.484375 -0.78125 2.28125 -2.1875q0.796875 -1.421875 0.796875 -3.28125q0 -2.125 -1.5 -3.203125q-1.484375 -1.078125 -3.984375 -1.28125l-2.234375 10.59375q0.6875 0.125 1.21875 0.125zm12.960831 1.890625q-1.484375 0 -2.375 -0.90625q-0.890625 -0.921875 -0.890625 -2.484375q0 -1.703125 0.921875 -3.390625q0.9375 -1.703125 2.453125 -2.78125q1.515625 -1.078125 3.15625 -1.078125q0.9843788 0 1.7343788 0.265625q0.765625 0.265625 1.484375 0.65625l-1.875 8.875q-0.109375 0.484375 -0.453125 0.765625q-0.3281288 0.296875 -0.7812538 0.296875q-0.359375 0 -0.59375 -0.203125q-0.21875 -0.203125 -0.21875 -0.546875q0 -0.140625 0.03125 -0.234375l0.0625 -0.296875q-0.625 0.53125 -1.328125 0.796875q-0.6875 0.265625 -1.328125 0.265625zm0.515625 -1.953125q0.65625 0 1.28125 -0.265625q0.640625 -0.28125 1.15625 -0.640625l1.2343788 -5.5625q-0.5781288 -0.25 -1.2812538 -0.25q-1.0 0 -1.9375 0.765625q-0.9375 0.765625 -1.546875 1.921875q-0.59375 1.140625 -0.59375 2.234375q0 0.953125 0.421875 1.375q0.4375 0.421875 1.265625 0.421875zm8.959736 1.984375q-0.40625 0 -0.65625 -0.296875q-0.234375 -0.3125 -0.15625 -0.71875l1.625 -7.6875l-1.796875 0q-0.375 0 -0.609375 -0.234375q-0.21875 -0.234375 -0.21875 -0.578125q0 -0.453125 0.40625 -0.796875q0.40625 -0.34375 0.921875 -0.34375l1.75 0l0.484375 -2.3125q0.078125 -0.40625 0.453125 -0.71875q0.375 -0.328125 0.78125 -0.328125q0.390625 0 0.609375 0.25q0.21875 0.234375 0.21875 0.578125q0 0.15625 -0.015625 0.21875l-0.484375 2.3125l1.8125 0q0.359375 0 0.578125 0.21875q0.21875 0.21875 
0.21875 0.5625q0 0.46875 -0.390625 0.828125q-0.390625 0.34375 -0.859375 0.34375l-1.8125 0l-1.625 7.6875q-0.078125 0.4375 -0.453125 0.734375q-0.375 0.28125 -0.78125 0.28125zm8.568314 -0.03125q-1.484375 0 -2.375 -0.90625q-0.890625 -0.921875 -0.890625 -2.484375q0 -1.703125 0.921875 -3.390625q0.9375 -1.703125 2.453125 -2.78125q1.515625 -1.078125 3.15625 -1.078125q0.984375 0 1.734375 0.265625q0.765625 0.265625 1.484375 0.65625l-1.875 8.875q-0.109375 0.484375 -0.453125 0.765625q-0.328125 0.296875 -0.78125 0.296875q-0.359375 0 -0.59375 -0.203125q-0.21875 -0.203125 -0.21875 -0.546875q0 -0.140625 0.03125 -0.234375l0.0625 -0.296875q-0.625 0.53125 -1.328125 0.796875q-0.6875 0.265625 -1.328125 0.265625zm0.515625 -1.953125q0.65625 0 1.28125 -0.265625q0.640625 -0.28125 1.15625 -0.640625l1.234375 -5.5625q-0.578125 -0.25 -1.28125 -0.25q-1.0 0 -1.9375 0.765625q-0.9375 0.765625 -1.546875 1.921875q-0.59375 1.140625 -0.59375 2.234375q0 0.953125 0.421875 1.375q0.4375 0.421875 1.265625 0.421875zm8.303482 -2.515625q-0.34375 0 -0.578125 -0.234375q-0.234375 -0.234375 -0.234375 -0.578125q0 -0.46875 0.40625 -0.828125q0.40625 -0.375 0.828125 -0.375l4.796875 0q0.359375 0 0.578125 0.234375q0.234375 0.21875 0.234375 0.578125q0 0.453125 -0.40625 0.828125q-0.390625 0.375 -0.828125 0.375l-4.796875 0zm10.004486 8.0625q-1.265625 0 -2.453125 -0.3125q-0.28125 -0.0625 -0.453125 -0.28125q-0.15625 -0.203125 -0.15625 -0.5q0 -0.46875 0.390625 -0.859375q0.40625 -0.375 0.875 -0.375q0.296875 0 0.78125 0.109375q0.78125 0.15625 1.40625 0.15625q1.140625 0 1.84375 -0.5625q0.703125 -0.546875 1.046875 -1.71875l0.125 -0.421875q-1.1875 0.984375 -2.453125 0.984375q-1.515625 0 -2.46875 -0.96875q-0.9375 -0.96875 -0.9375 -2.6875q0 -1.671875 0.921875 -3.265625q0.921875 -1.59375 2.40625 -2.5625q1.484375 -0.984375 3.0625 -0.984375q0.875 0 1.6875 0.4375q0.140625 -0.296875 0.40625 -0.46875q0.28125 -0.1875 0.578125 -0.1875q0.46875 0 0.6875 0.1875q0.21875 0.171875 0.21875 0.53125q0 0.265625 -0.03125 0.40625l-2.09375 
8.921875q-0.515625 2.171875 -1.953125 3.296875q-1.4375 1.125 -3.4375 1.125zm1.234375 -5.78125q1.140625 0 2.0 -0.890625q0.875 -0.890625 1.234375 -2.578125l0.5625 -2.578125q-0.265625 -0.171875 -0.671875 -0.265625q-0.40625 -0.109375 -0.78125 -0.109375q-0.984375 0 -1.9375 0.59375q-0.9375 0.59375 -1.53125 1.59375q-0.578125 1.0 -0.578125 2.15625q0 0.984375 0.4375 1.53125q0.453125 0.546875 1.265625 0.546875zm11.682938 2.1875q-1.96875 0 -3.140625 -1.0625q-1.171875 -1.0625 -1.171875 -2.9375q0 -1.65625 0.859375 -3.1875q0.859375 -1.53125 2.265625 -2.484375q1.40625 -0.96875 2.984375 -0.96875q1.171875 0 1.9375 0.4375q0.78125 0.4375 1.140625 1.078125q0.375 0.640625 0.375 1.234375q0 0.453125 -0.21875 0.765625q-0.203125 0.296875 -0.65625 0.53125l-6.421875 3.296875q0.078125 0.65625 0.765625 1.015625q0.6875 0.359375 1.53125 0.359375q1.390625 0 2.453125 -0.90625q0.4375 -0.34375 0.796875 -0.34375q0.359375 0 0.578125 0.203125q0.234375 0.203125 0.234375 0.515625q0 0.296875 -0.109375 0.515625q-0.109375 0.21875 -0.390625 0.453125q-0.671875 0.609375 -1.765625 1.046875q-1.09375 0.4375 -2.046875 0.4375zm3.171875 -7.828125q-0.140625 -0.46875 -0.578125 -0.703125q-0.421875 -0.234375 -1.0625 -0.234375q-1.359375 0 -2.53125 1.171875q-0.5625 0.5625 -0.890625 1.265625q-0.328125 0.703125 -0.375 1.25l5.4375 -2.75zm3.7434845 7.828125q-0.40625 0 -0.65625 -0.296875q-0.25 -0.296875 -0.15625 -0.6875l1.828125 -8.65625q0.09375 -0.421875 0.46875 -0.71875q0.390625 -0.296875 0.8125 -0.296875q0.34375 0 0.5625 0.21875q0.234375 0.21875 0.234375 0.5625q0 0.15625 -0.03125 0.234375l-0.34375 1.625q0.875 -1.28125 1.9375 -1.953125q1.0625 -0.671875 2.140625 -0.671875q1.234375 0 1.8125 0.609375q0.59375 0.609375 0.59375 2.03125q0 1.328125 -0.421875 3.25l-0.8125 3.734375q-0.078125 0.40625 -0.421875 0.71875q-0.34375 0.296875 -0.765625 0.296875q-0.40625 0 -0.65625 -0.296875q-0.25 -0.3125 -0.15625 -0.703125l0.75 -3.546875q0.40625 -1.890625 0.40625 -2.828125q0 -0.734375 -0.234375 -1.0q-0.234375 -0.28125 -0.703125 
-0.28125q-0.640625 0 -1.40625 0.546875q-0.75 0.546875 -1.515625 1.546875q-0.671875 0.875 -1.046875 1.703125q-0.375 0.828125 -0.625 2.0l-0.375 1.796875q-0.09375 0.46875 -0.421875 0.765625q-0.328125 0.296875 -0.796875 0.296875zm15.159348 0q-1.96875 0 -3.140625 -1.0625q-1.171875 -1.0625 -1.171875 -2.9375q0 -1.65625 0.859375 -3.1875q0.859375 -1.53125 2.265625 -2.484375q1.40625 -0.96875 2.984375 -0.96875q1.171875 0 1.9375 0.4375q0.78125 0.4375 1.140625 1.078125q0.375 0.640625 0.375 1.234375q0 0.453125 -0.21875 0.765625q-0.203125 0.296875 -0.65625 0.53125l-6.421875 3.296875q0.078125 0.65625 0.765625 1.015625q0.6875 0.359375 1.53125 0.359375q1.390625 0 2.453125 -0.90625q0.4375 -0.34375 0.796875 -0.34375q0.359375 0 0.578125 0.203125q0.234375 0.203125 0.234375 0.515625q0 0.296875 -0.109375 0.515625q-0.109375 0.21875 -0.390625 0.453125q-0.671875 0.609375 -1.765625 1.046875q-1.09375 0.4375 -2.046875 0.4375zm3.171875 -7.828125q-0.140625 -0.46875 -0.578125 -0.703125q-0.421875 -0.234375 -1.0625 -0.234375q-1.359375 0 -2.53125 1.171875q-0.5625 0.5625 -0.890625 1.265625q-0.328125 0.703125 -0.375 1.25l5.4375 -2.75zm9.9466095 -2.859375q0.78125 0 1.234375 0.390625q0.453125 0.375 0.453125 0.9375q0 0.25 -0.078125 0.6875q-0.078125 0.421875 -0.15625 0.71875q-0.109375 0.390625 -0.4375 0.671875q-0.328125 0.265625 -0.765625 0.265625q-0.375 0 -0.59375 -0.1875q-0.21875 -0.1875 -0.21875 -0.5q0 -0.140625 0.078125 -0.375q0.078125 -0.234375 0.078125 -0.328125q0 -0.25 -0.3125 -0.25q-0.453125 0 -1.15625 0.453125q-0.6875 0.4375 -1.171875 0.953125q-0.703125 0.703125 -0.9375 1.921875l-0.96875 4.34375q-0.078125 0.390625 -0.453125 0.6875q-0.375 0.296875 -0.765625 0.296875q-0.390625 0 -0.671875 -0.296875q-0.265625 -0.3125 -0.171875 -0.6875l1.828125 -8.625q0.09375 -0.4375 0.46875 -0.734375q0.390625 -0.296875 0.8125 -0.296875q0.46875 0 0.671875 0.28125q0.203125 0.28125 0.140625 0.65625l-0.109375 0.578125q0.671875 -0.75 1.5 -1.15625q0.84375 -0.40625 1.703125 -0.40625zm4.1740265 10.6875q-1.484375 0 -2.375 
-0.90625q-0.890625 -0.921875 -0.890625 -2.484375q0 -1.703125 0.921875 -3.390625q0.9375 -1.703125 2.453125 -2.78125q1.515625 -1.078125 3.15625 -1.078125q0.984375 0 1.734375 0.265625q0.765625 0.265625 1.484375 0.65625l-1.875 8.875q-0.109375 0.484375 -0.453125 0.765625q-0.328125 0.296875 -0.78125 0.296875q-0.359375 0 -0.59375 -0.203125q-0.21875 -0.203125 -0.21875 -0.546875q0 -0.140625 0.03125 -0.234375l0.0625 -0.296875q-0.625 0.53125 -1.328125 0.796875q-0.6875 0.265625 -1.328125 0.265625zm0.515625 -1.953125q0.65625 0 1.28125 -0.265625q0.640625 -0.28125 1.15625 -0.640625l1.234375 -5.5625q-0.578125 -0.25 -1.28125 -0.25q-1.0 0 -1.9375 0.765625q-0.9375 0.765625 -1.546875 1.921875q-0.59375 1.140625 -0.59375 2.234375q0 0.953125 0.421875 1.375q0.4375 0.421875 1.265625 0.421875zm8.959732 1.984375q-0.40625 0 -0.65625 -0.296875q-0.234375 -0.3125 -0.15625 -0.71875l1.625 -7.6875l-1.796875 0q-0.375 0 -0.609375 -0.234375q-0.21875 -0.234375 -0.21875 -0.578125q0 -0.453125 0.40625 -0.796875q0.40625 -0.34375 0.921875 -0.34375l1.75 0l0.484375 -2.3125q0.078125 -0.40625 0.453125 -0.71875q0.375 -0.328125 0.78125 -0.328125q0.390625 0 0.609375 0.25q0.21875 0.234375 0.21875 0.578125q0 0.15625 -0.015625 0.21875l-0.484375 2.3125l1.8125 0q0.359375 0 0.578125 0.21875q0.21875 0.21875 0.21875 0.5625q0 0.46875 -0.390625 0.828125q-0.390625 0.34375 -0.859375 0.34375l-1.8125 0l-1.625 7.6875q-0.078125 0.4375 -0.453125 0.734375q-0.375 0.28125 -0.78125 0.28125zm8.73027 -11.5625q-0.34375 0 -0.578125 -0.21875q-0.234375 -0.234375 -0.234375 -0.578125q0 -0.078125 0.03125 -0.25l0.203125 -0.796875q0.078125 -0.40625 0.453125 -0.703125q0.390625 -0.296875 0.78125 -0.296875q0.34375 0 0.578125 0.25q0.234375 0.25 0.234375 0.578125q0 0.140625 -0.03125 0.203125l-0.203125 0.8125q-0.109375 0.421875 -0.46875 0.71875q-0.34375 0.28125 -0.765625 0.28125zm-2.546875 11.53125q-0.359375 0 -0.609375 -0.234375q-0.25 -0.234375 -0.25 -0.59375q0 -0.125 0.03125 -0.21875l1.890625 -8.484375q0.09375 -0.421875 0.46875 -0.734375q0.390625 
-0.3125 0.8125 -0.3125q0.390625 0 0.609375 0.234375q0.21875 0.21875 0.21875 0.578125q0 0.15625 -0.015625 0.234375l-1.890625 8.484375q-0.09375 0.4375 -0.46875 0.75q-0.375 0.296875 -0.796875 0.296875zm5.1671295 0q-0.40625 0 -0.65625 -0.296875q-0.25 -0.296875 -0.15625 -0.6875l1.828125 -8.65625q0.09375 -0.421875 0.46875 -0.71875q0.390625 -0.296875 0.8125 -0.296875q0.34375 0 0.5625 0.21875q0.234375 0.21875 0.234375 0.5625q0 0.15625 -0.03125 0.234375l-0.34375 1.625q0.875 -1.28125 1.9375 -1.953125q1.0625 -0.671875 2.140625 -0.671875q1.234375 0 1.8125 0.609375q0.59375 0.609375 0.59375 2.03125q0 1.328125 -0.421875 3.25l-0.8125 3.734375q-0.078125 0.40625 -0.421875 0.71875q-0.34375 0.296875 -0.765625 0.296875q-0.40625 0 -0.65625 -0.296875q-0.25 -0.3125 -0.15625 -0.703125l0.75 -3.546875q0.40625 -1.890625 0.40625 -2.828125q0 -0.734375 -0.234375 -1.0q-0.234375 -0.28125 -0.703125 -0.28125q-0.640625 0 -1.40625 0.546875q-0.75 0.546875 -1.515625 1.546875q-0.671875 0.875 -1.046875 1.703125q-0.375 0.828125 -0.625 2.0l-0.375 1.796875q-0.09375 0.46875 -0.421875 0.765625q-0.328125 0.296875 -0.796875 0.296875zm13.268723 3.59375q-1.265625 0 -2.453125 -0.3125q-0.28125 -0.0625 -0.453125 -0.28125q-0.15625 -0.203125 -0.15625 -0.5q0 -0.46875 0.390625 -0.859375q0.40625 -0.375 0.875 -0.375q0.296875 0 0.78125 0.109375q0.78125 0.15625 1.40625 0.15625q1.140625 0 1.84375 -0.5625q0.703125 -0.546875 1.046875 -1.71875l0.125 -0.421875q-1.1875 0.984375 -2.453125 0.984375q-1.515625 0 -2.46875 -0.96875q-0.9375 -0.96875 -0.9375 -2.6875q0 -1.671875 0.921875 -3.265625q0.921875 -1.59375 2.40625 -2.5625q1.484375 -0.984375 3.0625 -0.984375q0.875 0 1.6875 0.4375q0.140625 -0.296875 0.40625 -0.46875q0.28125 -0.1875 0.578125 -0.1875q0.46875 0 0.6875 0.1875q0.21875 0.171875 0.21875 0.53125q0 0.265625 -0.03125 0.40625l-2.09375 8.921875q-0.515625 2.171875 -1.953125 3.296875q-1.4375 1.125 -3.4375 1.125zm1.234375 -5.78125q1.140625 0 2.0 -0.890625q0.875 -0.890625 1.234375 -2.578125l0.5625 -2.578125q-0.265625 -0.171875 
-0.671875 -0.265625q-0.40625 -0.109375 -0.78125 -0.109375q-0.984375 0 -1.9375 0.59375q-0.9375 0.59375 -1.53125 1.59375q-0.578125 1.0 -0.578125 2.15625q0 0.984375 0.4375 1.53125q0.453125 0.546875 1.265625 0.546875z"
fill-rule="nonzero" />
<path fill="#000000"
d="m59.146053 262.60995q-1.328125 0 -2.15625 -0.890625q-0.8125 -0.90625 -0.8125 -2.53125q0 -1.765625 0.765625 -3.4218903q0.765625 -1.671875 2.125 -2.71875q1.375 -1.046875 3.0625 -1.046875q1.203125 0 2.1250038 0.640625l0.859375 -3.453125q0.109375 -0.421875 0.46875 -0.734375q0.375 -0.3125 0.8125 -0.3125q0.375 0 0.578125 0.25q0.203125 0.25 0.234375 0.59375q0 0.078125 -0.046875 0.25l-2.921875 12.406265q-0.109375 0.4375 -0.4687538 0.71875q-0.359375 0.265625 -0.796875 0.265625q-0.375 0 -0.59375 -0.203125q-0.21875 -0.21875 -0.21875 -0.5625q0 -0.171875 0.03125 -0.25l0.109375 -0.4375q-0.625 0.640625 -1.46875 1.046875q-0.84375 0.390625 -1.6875 0.390625zm0.5625 -1.921875q0.75 0 1.4375 -0.328125q0.703125 -0.328125 1.171875 -0.828125q0.46875 -0.5 0.59375 -1.015625l0.921875 -3.9218903q-0.34375 -0.265625 -0.890625 -0.46875q-0.546875 -0.203125 -1.09375 -0.203125q-1.09375 0 -1.921875 0.734375q-0.8125 0.734375 -1.234375 1.8906403q-0.421875 1.15625 -0.421875 2.328125q0 0.890625 0.375 1.359375q0.390625 0.453125 1.0625 0.453125zm11.019733 -9.593765q-0.34375 0 -0.578125 -0.21875q-0.234375 -0.234375 -0.234375 -0.578125q0 -0.078125 0.03125 -0.25l0.203125 -0.796875q0.078125 -0.40625 0.453125 -0.703125q0.390625 -0.296875 0.78125 -0.296875q0.34375 0 0.578125 0.25q0.234375 0.25 0.234375 0.578125q0 0.140625 -0.03125 0.203125l-0.203125 0.8125q-0.109375 0.421875 -0.46875 0.71875q-0.34375 0.28125 -0.765625 0.28125zm-2.546875 11.531265q-0.359375 0 -0.609375 -0.234375q-0.25 -0.234375 -0.25 -0.59375q0 -0.125 0.03125 -0.21875l1.890625 -8.48439q0.09375 -0.421875 0.46875 -0.734375q0.390625 -0.3125 0.8125 -0.3125q0.390625 0 0.609375 0.234375q0.21875 0.21875 0.21875 0.578125q0 0.15625 -0.015625 0.234375l-1.890625 8.48439q-0.09375 0.4375 -0.46875 0.75q-0.375 0.296875 -0.796875 0.296875zm7.3546295 0q-1.046875 0 -1.859375 -0.40625q-0.796875 -0.421875 -1.234375 -1.0625q-0.4375 -0.640625 -0.4375 -1.28125q0 -0.46875 0.375 -0.84375q0.390625 -0.375 0.859375 -0.375q0.359375 0 0.546875 0.203125q0.203125 0.1875 
0.234375 0.5q0.078125 0.671875 0.546875 1.0625q0.484375 0.375 1.3125 0.375q0.859375 0 1.4375 -0.453125q0.59375 -0.453125 0.59375 -1.0q0 -0.453125 -0.375 -0.71875q-0.359375 -0.265625 -1.328125 -0.578125q-1.40625 -0.46875 -2.046875 -1.109375q-0.640625 -0.640625 -0.640625 -1.6093903q0 -0.953125 0.59375 -1.71875q0.59375 -0.765625 1.578125 -1.1875q0.984375 -0.4375 2.125 -0.4375q1.578125 0 2.375 0.5625q0.8125 0.5625 0.8125 1.640625q0 0.828125 -0.40625 1.328125q-0.40625 0.50001526 -1.0 0.50001526q-0.390625 0 -0.625 -0.20314026q-0.21875 -0.21875 -0.21875 -0.5625q0 -0.15625 0.125 -0.390625q0.140625 -0.234375 0.140625 -0.40625q0 -0.671875 -1.5 -0.671875q-0.984375 0 -1.484375 0.328125q-0.5 0.328125 -0.5 0.8125q0 0.4375 0.375 0.71875q0.375 0.265625 1.484375 0.65626526q1.390625 0.46875 1.984375 1.03125q0.59375 0.546875 0.59375 1.53125q0 0.96875 -0.609375 1.84375q-0.609375 0.875 -1.640625 1.40625q-1.015625 0.515625 -2.1875 0.515625zm7.9394836 0.03125q-0.40625 0 -0.65625 -0.296875q-0.234375 -0.3125 -0.15625 -0.71875l1.625 -7.6875153l-1.796875 0q-0.375 0 -0.609375 -0.234375q-0.21875 -0.234375 -0.21875 -0.578125q0 -0.453125 0.40625 -0.796875q0.40625 -0.34375 0.921875 -0.34375l1.75 0l0.484375 -2.3125q0.078125 -0.40625 0.453125 -0.71875q0.375 -0.328125 0.78125 -0.328125q0.390625 0 0.609375 0.25q0.21875 0.234375 0.21875 0.578125q0 0.15625 -0.015625 0.21875l-0.484375 2.3125l1.8125 0q0.359375 0 0.578125 0.21875q0.21875 0.21875 0.21875 0.5625q0 0.46875 -0.390625 0.828125q-0.390625 0.34375 -0.859375 0.34375l-1.8125 0l-1.625 7.6875153q-0.078125 0.4375 -0.453125 0.734375q-0.375 0.28125 -0.78125 0.28125zm12.370895 -10.718765q0.78125 0 1.234375 0.390625q0.453125 0.375 0.453125 0.9375q0 0.25 -0.078125 0.6875q-0.078125 0.421875 -0.15625 0.71875q-0.109375 0.390625 -0.4375 0.671875q-0.328125 0.265625 -0.765625 0.265625q-0.375 0 -0.59375 -0.1875q-0.21875 -0.1875 -0.21875 -0.5q0 -0.140625 0.078125 -0.375q0.078125 -0.234375 0.078125 -0.328125q0 -0.25 -0.3125 -0.25q-0.453125 0 -1.15625 
0.453125q-0.6875 0.4375 -1.171875 0.953125q-0.703125 0.70314026 -0.9375 1.9218903l-0.96875 4.34375q-0.078125 0.390625 -0.453125 0.6875q-0.375 0.296875 -0.765625 0.296875q-0.390625 0 -0.671875 -0.296875q-0.265625 -0.3125 -0.171875 -0.6875l1.828125 -8.625015q0.09375 -0.4375 0.46875 -0.734375q0.390625 -0.296875 0.8125 -0.296875q0.46875 0 0.671875 0.28125q0.203125 0.28125 0.140625 0.65625l-0.109375 0.578125q0.671875 -0.75 1.5 -1.15625q0.84375 -0.40625 1.703125 -0.40625zm4.954651 -0.84375q-0.34375 0 -0.578125 -0.21875q-0.234375 -0.234375 -0.234375 -0.578125q0 -0.078125 0.03125 -0.25l0.203125 -0.796875q0.078125 -0.40625 0.453125 -0.703125q0.390625 -0.296875 0.78125 -0.296875q0.34375 0 0.578125 0.25q0.234375 0.25 0.234375 0.578125q0 0.140625 -0.03125 0.203125l-0.203125 0.8125q-0.109375 0.421875 -0.46875 0.71875q-0.34375 0.28125 -0.765625 0.28125zm-2.546875 11.531265q-0.359375 0 -0.609375 -0.234375q-0.25 -0.234375 -0.25 -0.59375q0 -0.125 0.03125 -0.21875l1.890625 -8.48439q0.09375 -0.421875 0.46875 -0.734375q0.390625 -0.3125 0.8125 -0.3125q0.390625 0 0.609375 0.234375q0.21875 0.21875 0.21875 0.578125q0 0.15625 -0.015625 0.234375l-1.890625 8.48439q-0.09375 0.4375 -0.46875 0.75q-0.375 0.296875 -0.796875 0.296875zm5.1358795 -0.015625q-0.40625 0 -0.640625 -0.28125q-0.234375 -0.296875 -0.140625 -0.6875l2.640625 -12.39064q0.09375 -0.421875 0.4375 -0.6875q0.359375 -0.265625 0.78125 -0.265625q0.421875 0 0.65625 0.296875q0.234375 0.296875 0.15625 0.65625l-0.875 4.078125q0.71875 -0.578125 1.5625 -0.875q0.84375 -0.296875 1.703125 -0.296875q1.515625 0 2.25 0.90625q0.75 0.890625 0.75 2.421875q0 1.0468903 -0.375 2.2656403q-0.375 1.203125 -0.9375 2.125q-0.78125 1.21875 -1.9375 1.921875q-1.140625 0.703125 -2.53125 0.703125q-0.71875 0 -1.390625 -0.203125q-0.671875 -0.21875 -1.03125 -0.484375l-0.046875 0.171875q-0.0625 0.28125 -0.359375 0.453125q-0.296875 0.171875 -0.671875 0.171875zm3.53125 -2.03125q0.9375 0 1.609375 -0.390625q0.671875 -0.390625 1.171875 -1.1875q0.40625 -0.640625 0.65625 
-1.515625q0.265625 -0.875 0.265625 -1.5468903q0 -1.859375 -1.625 -1.859375q-1.109375 0 -2.078125 0.703125q-0.96875 0.703125 -1.171875 1.6562653l-0.796875 3.6875q1.0 0.453125 1.96875 0.453125zm9.562393 2.109375q-1.171875 0 -1.765625 -0.59375q-0.578125 -0.59375 -0.578125 -2.0q0 -1.328125 0.4375 -3.359375q0.4375 -2.0468903 1.0 -3.6875153q0.359375 -1.078125 1.234375 -1.078125q0.40625 0 0.609375 0.1875q0.21875 0.1875 0.21875 0.53125q0 0.109375 -0.078125 0.453125q-0.5625 1.78125 -0.953125 3.4062653q-0.390625 1.609375 -0.390625 2.8125q0 0.734375 0.203125 1.078125q0.21875 0.328125 0.6875 0.328125q0.75 0 1.53125 -0.34375q0.796875 -0.359375 1.15625 -0.625q0.28125 -0.203125 0.375 -0.359375q0.109375 -0.15625 0.15625 -0.40625l1.34375 -6.0937653q0.09375 -0.421875 0.4375 -0.6875q0.359375 -0.28125 0.78125 -0.28125q0.421875 0 0.65625 0.296875q0.234375 0.28125 0.15625 0.671875l-1.84375 8.67189q-0.078125 0.421875 -0.453125 0.71875q-0.375 0.296875 -0.78125 0.296875q-0.375 0 -0.640625 -0.234375q-0.25 -0.25 -0.234375 -0.59375l0.03125 -0.34375q-0.515625 0.53125 -1.40625 0.890625q-0.875 0.34375 -1.890625 0.34375zm10.245438 -0.03125q-0.40625 0 -0.65625 -0.296875q-0.234375 -0.3125 -0.15625 -0.71875l1.625 -7.6875153l-1.796875 0q-0.375 0 -0.609375 -0.234375q-0.21875 -0.234375 -0.21875 -0.578125q0 -0.453125 0.40625 -0.796875q0.40625 -0.34375 0.921875 -0.34375l1.75 0l0.484375 -2.3125q0.078125 -0.40625 0.453125 -0.71875q0.375 -0.328125 0.78125 -0.328125q0.390625 0 0.609375 0.25q0.21875 0.234375 0.21875 0.578125q0 0.15625 -0.015625 0.21875l-0.484375 2.3125l1.8125 0q0.359375 0 0.578125 0.21875q0.21875 0.21875 0.21875 0.5625q0 0.46875 -0.390625 0.828125q-0.390625 0.34375 -0.859375 0.34375l-1.8125 0l-1.625 7.6875153q-0.078125 0.4375 -0.453125 0.734375q-0.375 0.28125 -0.78125 0.28125zm8.73027 -11.562515q-0.34375 0 -0.578125 -0.21875q-0.234375 -0.234375 -0.234375 -0.578125q0 -0.078125 0.03125 -0.25l0.203125 -0.796875q0.078125 -0.40625 0.453125 -0.703125q0.390625 -0.296875 0.78125 -0.296875q0.34375 0 
0.578125 0.25q0.234375 0.25 0.234375 0.578125q0 0.140625 -0.03125 0.203125l-0.203125 0.8125q-0.109375 0.421875 -0.46875 0.71875q-0.34375 0.28125 -0.765625 0.28125zm-2.546875 11.531265q-0.359375 0 -0.609375 -0.234375q-0.25 -0.234375 -0.25 -0.59375q0 -0.125 0.03125 -0.21875l1.890625 -8.48439q0.09375 -0.421875 0.46875 -0.734375q0.390625 -0.3125 0.8125 -0.3125q0.390625 0 0.609375 0.234375q0.21875 0.21875 0.21875 0.578125q0 0.15625 -0.015625 0.234375l-1.890625 8.48439q-0.09375 0.4375 -0.46875 0.75q-0.375 0.296875 -0.796875 0.296875zm7.8546295 0q-1.578125 0 -2.53125 -1.046875q-0.953125 -1.0625 -0.953125 -2.71875q0 -1.625 0.921875 -3.2343903q0.921875 -1.609375 2.390625 -2.625q1.46875 -1.03125 3.03125 -1.03125q1.046875 0 1.828125 0.5q0.796875 0.5 1.21875 1.359375q0.421875 0.859375 0.421875 1.9375q0 1.6093903 -0.921875 3.2187653q-0.90625 1.59375 -2.390625 2.625q-1.46875 1.015625 -3.015625 1.015625zm0.375 -1.890625q0.921875 0 1.796875 -0.6875q0.875 -0.703125 1.421875 -1.765625q0.5625 -1.078125 0.5625 -2.09375q0 -1.0000153 -0.5 -1.6562653q-0.484375 -0.671875 -1.296875 -0.671875q-0.90625 0 -1.78125 0.734375q-0.875 0.71875 -1.421875 1.8281403q-0.546875 1.09375 -0.546875 2.125q0 0.96875 0.46875 1.578125q0.46875 0.609375 1.296875 0.609375zm7.942856 1.890625q-0.40625 0 -0.65625 -0.296875q-0.25 -0.296875 -0.15625 -0.6875l1.828125 -8.656265q0.09375 -0.421875 0.46875 -0.71875q0.390625 -0.296875 0.8125 -0.296875q0.34375 0 0.5625 0.21875q0.234375 0.21875 0.234375 0.5625q0 0.15625 -0.03125 0.234375l-0.34375 1.625q0.875 -1.28125 1.9375 -1.953125q1.0625 -0.671875 2.140625 -0.671875q1.234375 0 1.8125 0.609375q0.59375 0.609375 0.59375 2.03125q0 1.328125 -0.421875 3.2500153l-0.8125 3.734375q-0.078125 0.40625 -0.421875 0.71875q-0.34375 0.296875 -0.765625 0.296875q-0.40625 0 -0.65625 -0.296875q-0.25 -0.3125 -0.15625 -0.703125l0.75 -3.546875q0.40625 -1.890625 0.40625 -2.8281403q0 -0.734375 -0.234375 -1.0q-0.234375 -0.28125 -0.703125 -0.28125q-0.640625 0 -1.40625 0.546875q-0.75 0.546875 
-1.515625 1.5468903q-0.671875 0.875 -1.046875 1.703125q-0.375 0.828125 -0.625 2.0l-0.375 1.796875q-0.09375 0.46875 -0.421875 0.765625q-0.328125 0.296875 -0.796875 0.296875z"
fill-rule="nonzero" />
<path fill="#000000"
d="m168.41536 257.4537q0 -3.6718903 1.796875 -7.0781403q1.8125 -3.40625 4.78125 -3.40625q1.921875 0 3.078125 1.296875q1.15625 1.296875 1.15625 4.046875q0 3.625 -1.765625 7.0468903q-1.765625 3.40625 -4.78125 3.40625q-1.921875 0 -3.09375 -1.40625q-1.171875 -1.40625 -1.171875 -3.90625zm3.625 -3.8593903l4.09375 0q0.1875 -1.3125 0.1875 -2.0q0 -2.1875 -1.46875 -2.1875q-1.765625 0 -2.8125 4.1875zm-0.5 2.5468903q-0.1875 1.1875 -0.1875 2.015625q0 0.953125 0.359375 1.5625q0.359375 0.609375 1.078125 0.609375q1.859375 0 2.828125 -4.1875l-4.078125 0z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#6aa84f"
d="m179.14699 80.04855l30.595795 0l0 19.199478l-30.595795 0l0 -19.199478z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#6aa84f"
d="m209.74278 80.04855l30.595795 0l0 19.199478l-30.595795 0l0 -19.199478z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#6aa84f"
d="m240.33858 80.04855l30.59581 0l0 19.199478l-30.59581 0l0 -19.199478z"
fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m179.14699 79.549866l0 20.196854" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m209.74278 79.549866l0 20.196854" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m240.33858 79.549866l0 20.196854" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m270.9344 79.549866l0 20.196854" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m178.6483 80.04855l92.784775 0" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m178.6483 99.24803l92.784775 0" fill-rule="nonzero" />
<path fill="#000000" fill-opacity="0.0"
d="m147.54355 31.879265l154.99213 0l0 55.590553l-154.99213 0z"
fill-rule="evenodd" />
<path fill="#000000"
d="m170.7249 55.32464q-1.484375 0 -2.734375 -0.671875q-1.25 -0.671875 -1.984375 -1.9375q-0.71875 -1.28125 -0.71875 -3.0q0 -1.8125 0.84375 -3.453125q0.84375 -1.65625 2.203125 -2.65625q1.359375 -1.0 2.765625 -1.0q0.59375 0 1.1875 0.1875q0.59375 0.1875 1.0625 0.46875q0.21875 -0.234375 0.546875 -0.234375q0.375 0 0.609375 0.265625q0.25 0.25 0.25 0.59375l0 1.203125q0 0.375 -0.265625 0.65625q-0.265625 0.265625 -0.65625 0.265625q-0.375 0 -0.640625 -0.21875q-0.25 -0.21875 -0.25 -0.515625l0 -0.140625q-0.6875 -0.71875 -1.96875 -0.71875q-0.9375 0 -1.828125 0.75q-0.890625 0.75 -1.453125 1.984375q-0.546875 1.234375 -0.546875 2.5625q0 1.171875 0.484375 2.046875q0.484375 0.859375 1.296875 1.328125q0.8125 0.453125 1.796875 0.453125q1.8125 0 2.984375 -1.5625q0.171875 -0.203125 0.34375 -0.28125q0.171875 -0.078125 0.390625 -0.078125q0.390625 0 0.65625 0.25q0.28125 0.25 0.28125 0.671875q0 0.328125 -0.171875 0.5625q-0.765625 1.078125 -1.9375 1.65625q-1.15625 0.5625 -2.546875 0.5625zm9.571777 0q-1.0625 0 -1.953125 -0.578125q-0.875 -0.578125 -1.390625 -1.578125q-0.5 -1.0 -0.5 -2.171875q0 -1.328125 0.59375 -2.46875q0.609375 -1.140625 1.609375 -1.828125q1.015625 -0.6875 2.171875 -0.6875q1.0625 0 1.9375 0.609375q0.890625 0.59375 1.40625 1.578125q0.515625 0.984375 0.515625 2.125q0 1.34375 -0.59375 2.5q-0.59375 1.140625 -1.625 1.828125q-1.015625 0.671875 -2.171875 0.671875zm0 -1.671875q0.65625 0 1.25 -0.453125q0.609375 -0.453125 0.953125 -1.1875q0.359375 -0.75 0.359375 -1.59375q0 -0.703125 -0.28125 -1.328125q-0.28125 -0.625 -0.765625 -1.0q-0.484375 -0.390625 -1.0625 -0.390625q-0.65625 0 -1.234375 0.453125q-0.5625 0.453125 -0.90625 1.21875q-0.328125 0.75 -0.328125 1.609375q0 0.84375 0.28125 1.4375q0.296875 0.59375 0.75 0.921875q0.46875 0.3125 0.984375 0.3125zm7.301529 1.671875q-0.359375 0 -0.625 -0.25q-0.265625 -0.265625 -0.265625 -0.609375l0 -7.578125q0 -0.359375 0.265625 -0.625q0.28125 -0.265625 0.65625 -0.265625q0.359375 0 0.609375 0.265625q0.25 0.265625 0.25 0.625l0 1.421875q0.53125 
-1.109375 1.328125 -1.703125q0.796875 -0.59375 1.75 -0.59375q1.046875 0 1.65625 0.484375q0.625 0.484375 0.90625 1.609375q0.28125 1.109375 0.28125 3.0625l-0.015625 3.265625q0 0.359375 -0.25 0.625q-0.25 0.265625 -0.609375 0.265625q-0.359375 0 -0.640625 -0.265625q-0.265625 -0.265625 -0.265625 -0.609375l0 -3.09375q0 -1.453125 -0.125 -2.234375q-0.125 -0.796875 -0.390625 -1.078125q-0.265625 -0.296875 -0.71875 -0.296875q-0.546875 0 -1.109375 0.484375q-0.5625 0.46875 -1.046875 1.34375q-0.421875 0.765625 -0.59375 1.5q-0.171875 0.71875 -0.171875 1.75l0 1.5625q0 0.421875 -0.234375 0.6875q-0.234375 0.25 -0.640625 0.25zm11.974091 -0.015625q-0.953125 0 -1.71875 -0.484375q-0.75 -0.5 -1.1875 -1.453125q-0.4375 -0.953125 -0.4375 -2.28125q0 -1.609375 0.5 -2.734375q0.515625 -1.140625 1.40625 -1.71875q0.890625 -0.59375 2.078125 -0.59375q0.921875 0 1.984375 0.421875l0.09375 -2.890625q0.03125 -0.375 0.296875 -0.640625q0.265625 -0.28125 0.640625 -0.28125q0.359375 0 0.625 0.28125q0.265625 0.265625 0.25 0.671875l-0.25 10.859375q-0.03125 0.375 -0.296875 0.625q-0.25 0.234375 -0.625 0.234375q-0.359375 0 -0.609375 -0.203125q-0.25 -0.203125 -0.25 -0.546875l0 -0.328125q-0.484375 0.453125 -1.140625 0.765625q-0.65625 0.296875 -1.359375 0.296875zm0.15625 -1.671875q0.625 0 1.171875 -0.265625q0.5625 -0.28125 0.875 -0.671875q0.3125 -0.390625 0.328125 -0.734375l0.078125 -3.671875q-0.328125 -0.21875 -0.859375 -0.390625q-0.515625 -0.1875 -1.0 -0.1875q-1.140625 0 -1.71875 0.859375q-0.578125 0.84375 -0.578125 2.46875q0 1.34375 0.453125 1.96875q0.46875 0.625 1.25 0.625zm7.847946 -8.40625q-0.359375 0 -0.625 -0.28125q-0.265625 -0.28125 -0.25 -0.625l0.015625 -0.6875q0.015625 -0.359375 0.28125 -0.625q0.265625 -0.265625 0.609375 -0.265625q0.359375 0 0.625 0.28125q0.28125 0.28125 0.265625 0.625l-0.03125 0.703125q-0.015625 0.375 -0.265625 0.625q-0.25 0.25 -0.625 0.25zm-0.09375 10.09375q-0.375 0 -0.640625 -0.265625q-0.265625 -0.265625 -0.265625 -0.640625l0.09375 -7.4375q0 -0.359375 0.265625 -0.625q0.265625 -0.28125 
0.640625 -0.28125q0.390625 0 0.65625 0.28125q0.265625 0.265625 0.265625 0.625l-0.09375 7.4375q0 0.375 -0.28125 0.640625q-0.265625 0.265625 -0.640625 0.265625zm5.2365875 0.015625q-0.34375 0 -0.625 -0.25q-0.265625 -0.265625 -0.265625 -0.625l0 -6.65625l-1.46875 0q-0.375 0 -0.625 -0.265625q-0.265625 -0.28125 -0.265625 -0.625q0 -0.421875 0.28125 -0.671875q0.265625 -0.25 0.640625 -0.25l1.4375 0l0 -2.0q0 -0.34375 0.265625 -0.625q0.265625 -0.28125 0.625 -0.28125q0.40625 0 0.65625 0.28125q0.25 0.265625 0.25 0.625l0 2.0l1.578125 0q0.375 0 0.640625 0.265625q0.265625 0.265625 0.265625 0.65625q0 0.34375 -0.265625 0.625q-0.265625 0.265625 -0.640625 0.265625l-1.578125 0l0 6.65625q0 0.359375 -0.265625 0.625q-0.265625 0.25 -0.640625 0.25zm5.5098724 -10.109375q-0.359375 0 -0.625 -0.28125q-0.265625 -0.28125 -0.25 -0.625l0.015625 -0.6875q0.015625 -0.359375 0.28125 -0.625q0.265625 -0.265625 0.609375 -0.265625q0.359375 0 0.625 0.28125q0.28125 0.28125 0.265625 0.625l-0.03125 0.703125q-0.015625 0.375 -0.265625 0.625q-0.25 0.25 -0.625 0.25zm-0.09375 10.09375q-0.375 0 -0.640625 -0.265625q-0.265625 -0.265625 -0.265625 -0.640625l0.09375 -7.4375q0 -0.359375 0.265625 -0.625q0.265625 -0.28125 0.640625 -0.28125q0.390625 0 0.65625 0.28125q0.265625 0.265625 0.265625 0.625l-0.09375 7.4375q0 0.375 -0.28125 0.640625q-0.265625 0.265625 -0.640625 0.265625zm6.8615875 0q-1.0625 0 -1.953125 -0.578125q-0.875 -0.578125 -1.390625 -1.578125q-0.5 -1.0 -0.5 -2.171875q0 -1.328125 0.59375 -2.46875q0.609375 -1.140625 1.609375 -1.828125q1.015625 -0.6875 2.171875 -0.6875q1.0625 0 1.9375 0.609375q0.890625 0.59375 1.40625 1.578125q0.515625 0.984375 0.515625 2.125q0 1.34375 -0.59375 2.5q-0.59375 1.140625 -1.625 1.828125q-1.015625 0.671875 -2.171875 0.671875zm0 -1.671875q0.65625 0 1.25 -0.453125q0.609375 -0.453125 0.953125 -1.1875q0.359375 -0.75 0.359375 -1.59375q0 -0.703125 -0.28125 -1.328125q-0.28125 -0.625 -0.765625 -1.0q-0.484375 -0.390625 -1.0625 -0.390625q-0.65625 0 -1.234375 0.453125q-0.5625 0.453125 -0.90625 
1.21875q-0.328125 0.75 -0.328125 1.609375q0 0.84375 0.28125 1.4375q0.296875 0.59375 0.75 0.921875q0.46875 0.3125 0.984375 0.3125zm7.301529 1.671875q-0.359375 0 -0.625 -0.25q-0.265625 -0.265625 -0.265625 -0.609375l0 -7.578125q0 -0.359375 0.265625 -0.625q0.28125 -0.265625 0.65625 -0.265625q0.359375 0 0.609375 0.265625q0.25 0.265625 0.25 0.625l0 1.421875q0.53125 -1.109375 1.328125 -1.703125q0.796875 -0.59375 1.75 -0.59375q1.046875 0 1.65625 0.484375q0.625 0.484375 0.90625 1.609375q0.28125 1.109375 0.28125 3.0625l-0.015625 3.265625q0 0.359375 -0.25 0.625q-0.25 0.265625 -0.609375 0.265625q-0.359375 0 -0.640625 -0.265625q-0.265625 -0.265625 -0.265625 -0.609375l0 -3.09375q0 -1.453125 -0.125 -2.234375q-0.125 -0.796875 -0.390625 -1.078125q-0.265625 -0.296875 -0.71875 -0.296875q-0.546875 0 -1.109375 0.484375q-0.5625 0.46875 -1.046875 1.34375q-0.421875 0.765625 -0.59375 1.5q-0.171875 0.71875 -0.171875 1.75l0 1.5625q0 0.421875 -0.234375 0.6875q-0.234375 0.25 -0.640625 0.25zm13.255341 0q-1.3125 0 -2.375 -0.5625q-1.046875 -0.578125 -1.65625 -1.625q-0.59375 -1.0625 -0.59375 -2.5q0 -1.171875 0.53125 -2.234375q0.53125 -1.078125 1.484375 -1.734375q0.96875 -0.65625 2.203125 -0.65625q1.140625 0 1.953125 0.46875q0.8125 0.453125 1.21875 1.09375q0.40625 0.625 0.40625 1.125q0 0.25 -0.15625 0.46875q-0.140625 0.21875 -0.421875 0.390625l-5.0 2.875q0.265625 0.578125 0.921875 0.90625q0.671875 0.3125 1.484375 0.3125q1.09375 0 1.859375 -0.8125q0.296875 -0.296875 0.609375 -0.296875q0.375 0 0.625 0.25q0.25 0.234375 0.25 0.59375q0 0.296875 -0.21875 0.5625q-0.53125 0.578125 -1.40625 0.984375q-0.859375 0.390625 -1.71875 0.390625zm1.328125 -6.84375q-0.21875 -0.390625 -0.640625 -0.578125q-0.40625 -0.203125 -0.953125 -0.203125q-1.0625 0 -1.78125 0.734375q-0.71875 0.71875 -0.84375 1.78125q-0.03125 0.28125 -0.03125 0.671875l4.25 -2.40625zm6.5508423 6.828125q-0.953125 0 -1.71875 -0.484375q-0.75 -0.5 -1.1875 -1.453125q-0.4375 -0.953125 -0.4375 -2.28125q0 -1.609375 0.5 -2.734375q0.515625 -1.140625 1.40625 
-1.71875q0.890625 -0.59375 2.078125 -0.59375q0.921875 0 1.9843597 0.421875l0.09375 -2.890625q0.03125 -0.375 0.296875 -0.640625q0.265625 -0.28125 0.640625 -0.28125q0.359375 0 0.625 0.28125q0.265625 0.265625 0.25 0.671875l-0.25 10.859375q-0.03125 0.375 -0.296875 0.625q-0.25 0.234375 -0.625 0.234375q-0.359375 0 -0.609375 -0.203125q-0.24998474 -0.203125 -0.24998474 -0.546875l0 -0.328125q-0.484375 0.453125 -1.140625 0.765625q-0.65625 0.296875 -1.359375 0.296875zm0.15625 -1.671875q0.625 0 1.171875 -0.265625q0.5625 -0.28125 0.875 -0.671875q0.3125 -0.390625 0.328125 -0.734375l0.07810974 -3.671875q-0.32810974 -0.21875 -0.85935974 -0.390625q-0.515625 -0.1875 -1.0 -0.1875q-1.140625 0 -1.71875 0.859375q-0.578125 0.84375 -0.578125 2.46875q0 1.34375 0.453125 1.96875q0.46875 0.625 1.25 0.625zm16.723953 1.6875q-1.0625 0 -1.953125 -0.578125q-0.875 -0.578125 -1.390625 -1.578125q-0.5 -1.0 -0.5 -2.171875q0 -1.328125 0.59375 -2.46875q0.609375 -1.140625 1.609375 -1.828125q1.015625 -0.6875 2.171875 -0.6875q1.0625 0 1.9375 0.609375q0.890625 0.59375 1.40625 1.578125q0.515625 0.984375 0.515625 2.125q0 1.34375 -0.59375 2.5q-0.59375 1.140625 -1.625 1.828125q-1.015625 0.671875 -2.171875 0.671875zm0 -1.671875q0.65625 0 1.25 -0.453125q0.609375 -0.453125 0.953125 -1.1875q0.359375 -0.75 0.359375 -1.59375q0 -0.703125 -0.28125 -1.328125q-0.28125 -0.625 -0.765625 -1.0q-0.484375 -0.390625 -1.0625 -0.390625q-0.65625 0 -1.234375 0.453125q-0.5625 0.453125 -0.90625 1.21875q-0.328125 0.75 -0.328125 1.609375q0 0.84375 0.28125 1.4375q0.296875 0.59375 0.75 0.921875q0.46875 0.3125 0.984375 0.3125zm7.3015137 1.671875q-0.359375 0 -0.625 -0.25q-0.265625 -0.265625 -0.265625 -0.609375l0 -7.578125q0 -0.359375 0.265625 -0.625q0.28125 -0.265625 0.65625 -0.265625q0.359375 0 0.609375 0.265625q0.25 0.265625 0.25 0.625l0 1.421875q0.53125 -1.109375 1.328125 -1.703125q0.796875 -0.59375 1.75 -0.59375q1.046875 0 1.65625 0.484375q0.625 0.484375 0.90625 1.609375q0.28125 1.109375 0.28125 3.0625l-0.015625 3.265625q0 0.359375 
-0.25 0.625q-0.25 0.265625 -0.609375 0.265625q-0.359375 0 -0.640625 -0.265625q-0.265625 -0.265625 -0.265625 -0.609375l0 -3.09375q0 -1.453125 -0.125 -2.234375q-0.125 -0.796875 -0.390625 -1.078125q-0.265625 -0.296875 -0.71875 -0.296875q-0.546875 0 -1.109375 0.484375q-0.5625 0.46875 -1.046875 1.34375q-0.421875 0.765625 -0.59375 1.5q-0.171875 0.71875 -0.171875 1.75l0 1.5625q0 0.421875 -0.234375 0.6875q-0.234375 0.25 -0.640625 0.25z"
fill-rule="nonzero" />
<path fill="#000000"
d="m170.81418 73.24464q-0.359375 0 -0.640625 -0.25q-0.265625 -0.265625 -0.265625 -0.609375l0 -7.546875q0 -0.375 0.265625 -0.640625q0.265625 -0.265625 0.640625 -0.265625q0.375 0 0.640625 0.265625q0.28125 0.265625 0.28125 0.640625l0 0.546875q0.90625 -1.484375 2.5 -1.484375q0.9375 0 1.359375 0.53125q0.421875 0.515625 0.421875 1.234375l0 0.5625q0 0.375 -0.25 0.625q-0.25 0.25 -0.625 0.25q-0.453125 0 -0.671875 -0.25q-0.203125 -0.265625 -0.203125 -0.640625l0 -0.234375q0 -0.3125 -0.40625 -0.3125q-0.34375 0 -0.78125 0.34375q-0.4375 0.328125 -0.71875 0.703125q-0.390625 0.484375 -0.515625 0.96875q-0.109375 0.484375 -0.109375 1.15625l-0.03125 3.546875q0 0.34375 -0.28125 0.609375q-0.265625 0.25 -0.609375 0.25zm10.067841 0q-1.3125 0 -2.375 -0.5625q-1.046875 -0.578125 -1.65625 -1.625q-0.59375 -1.0625 -0.59375 -2.5q0 -1.171875 0.53125 -2.234375q0.53125 -1.078125 1.484375 -1.734375q0.96875 -0.65625 2.203125 -0.65625q1.140625 0 1.953125 0.46875q0.8125 0.453125 1.21875 1.09375q0.40625 0.625 0.40625 1.125q0 0.25 -0.15625 0.46875q-0.140625 0.21875 -0.421875 0.390625l-5.0 2.875q0.265625 0.578125 0.921875 0.90625q0.671875 0.3125 1.484375 0.3125q1.09375 0 1.859375 -0.8125q0.296875 -0.296875 0.609375 -0.296875q0.375 0 0.625 0.25q0.25 0.234375 0.25 0.59375q0 0.296875 -0.21875 0.5625q-0.53125 0.578125 -1.40625 0.984375q-0.859375 0.390625 -1.71875 0.390625zm1.328125 -6.84375q-0.21875 -0.390625 -0.640625 -0.578125q-0.40625 -0.203125 -0.953125 -0.203125q-1.0625 0 -1.78125 0.734375q-0.71875 0.71875 -0.84375 1.78125q-0.03125 0.28125 -0.03125 0.671875l4.25 -2.40625zm7.2852173 6.84375q-1.125 0 -2.0625 -0.53125q-0.921875 -0.546875 -1.46875 -1.484375q-0.546875 -0.953125 -0.546875 -2.0625q0 -1.28125 0.59375 -2.484375q0.609375 -1.21875 1.578125 -1.984375q0.96875 -0.765625 2.046875 -0.765625q0.984375 0 1.59375 0.3125q0.609375 0.3125 1.1875 0.984375q0.28125 0.296875 0.28125 0.65625q0 0.34375 -0.265625 0.59375q-0.25 0.25 -0.625 0.25q-0.25 0 -0.390625 -0.0625q-0.140625 -0.078125 -0.328125 -0.296875q-0.3125 
-0.359375 -0.625 -0.546875q-0.296875 -0.1875 -0.8125 -0.1875q-0.578125 0 -1.171875 0.5q-0.578125 0.5 -0.9375 1.3125q-0.359375 0.8125 -0.359375 1.671875q0 0.75 0.3125 1.296875q0.328125 0.546875 0.84375 0.84375q0.53125 0.296875 1.109375 0.296875q0.75 0 1.21875 -0.28125q0.484375 -0.28125 0.859375 -0.765625q0.34375 -0.46875 0.796875 -0.46875q0.359375 0 0.609375 0.25q0.265625 0.25 0.265625 0.5625q0 0.375 -0.203125 0.640625q-0.53125 0.71875 -1.46875 1.234375q-0.9375 0.515625 -2.03125 0.515625zm8.534058 0q-1.0625 0 -1.953125 -0.578125q-0.875 -0.578125 -1.390625 -1.578125q-0.5 -1.0 -0.5 -2.171875q0 -1.328125 0.59375 -2.46875q0.609375 -1.140625 1.609375 -1.828125q1.015625 -0.6875 2.171875 -0.6875q1.0625 0 1.9375 0.609375q0.890625 0.59375 1.40625 1.578125q0.515625 0.984375 0.515625 2.125q0 1.34375 -0.59375 2.5q-0.59375 1.140625 -1.625 1.828125q-1.015625 0.671875 -2.171875 0.671875zm0 -1.671875q0.65625 0 1.25 -0.453125q0.609375 -0.453125 0.953125 -1.1875q0.359375 -0.75 0.359375 -1.59375q0 -0.703125 -0.28125 -1.328125q-0.28125 -0.625 -0.765625 -1.0q-0.484375 -0.390625 -1.0625 -0.390625q-0.65625 0 -1.234375 0.453125q-0.5625 0.453125 -0.90625 1.21875q-0.328125 0.75 -0.328125 1.609375q0 0.84375 0.28125 1.4375q0.296875 0.59375 0.75 0.921875q0.46875 0.3125 0.984375 0.3125zm7.317154 1.671875q-0.359375 0 -0.640625 -0.25q-0.265625 -0.265625 -0.265625 -0.609375l0 -7.546875q0 -0.375 0.265625 -0.640625q0.265625 -0.265625 0.640625 -0.265625q0.375 0 0.640625 0.265625q0.28125 0.265625 0.28125 0.640625l0 0.546875q0.90625 -1.484375 2.5 -1.484375q0.9375 0 1.359375 0.53125q0.421875 0.515625 0.421875 1.234375l0 0.5625q0 0.375 -0.25 0.625q-0.25 0.25 -0.625 0.25q-0.453125 0 -0.671875 -0.25q-0.203125 -0.265625 -0.203125 -0.640625l0 -0.234375q0 -0.3125 -0.40625 -0.3125q-0.34375 0 -0.78125 0.34375q-0.4375 0.328125 -0.71875 0.703125q-0.390625 0.484375 -0.515625 0.96875q-0.109375 0.484375 -0.109375 1.15625l-0.03125 3.546875q0 0.34375 -0.28125 0.609375q-0.265625 0.25 -0.609375 0.25zm8.786591 
-0.015625q-0.953125 0 -1.71875 -0.484375q-0.75 -0.5 -1.1875 -1.453125q-0.4375 -0.953125 -0.4375 -2.28125q0 -1.609375 0.5 -2.734375q0.515625 -1.140625 1.40625 -1.71875q0.890625 -0.59375 2.078125 -0.59375q0.921875 0 1.984375 0.421875l0.09375 -2.890625q0.03125 -0.375 0.296875 -0.640625q0.265625 -0.28125 0.640625 -0.28125q0.359375 0 0.625 0.28125q0.265625 0.265625 0.25 0.671875l-0.25 10.859375q-0.03125 0.375 -0.296875 0.625q-0.25 0.234375 -0.625 0.234375q-0.359375 0 -0.609375 -0.203125q-0.25 -0.203125 -0.25 -0.546875l0 -0.328125q-0.484375 0.453125 -1.140625 0.765625q-0.65625 0.296875 -1.359375 0.296875zm0.15625 -1.671875q0.625 0 1.171875 -0.265625q0.5625 -0.28125 0.875 -0.671875q0.3125 -0.390625 0.328125 -0.734375l0.078125 -3.671875q-0.328125 -0.21875 -0.859375 -0.390625q-0.515625 -0.1875 -1.0 -0.1875q-1.140625 0 -1.71875 0.859375q-0.578125 0.84375 -0.578125 2.46875q0 1.34375 0.453125 1.96875q0.46875 0.625 1.25 0.625z"
fill-rule="nonzero" />
<path fill="#000000"
d="m229.91962 63.150887q-0.296875 0 -0.5 -0.1875q-0.203125 -0.203125 -0.203125 -0.5q0 -0.078125 0.03125 -0.21875l0.171875 -0.6875q0.0625 -0.359375 0.390625 -0.625q0.34375 -0.265625 0.6875 -0.265625q0.296875 0 0.5 0.21875q0.203125 0.21875 0.203125 0.515625q0 0.109375 -0.015625 0.171875l-0.1875 0.703125q-0.09375 0.375 -0.40625 0.625q-0.296875 0.25 -0.671875 0.25zm-2.21875 10.09375q-0.3125 0 -0.53125 -0.203125q-0.21875 -0.203125 -0.21875 -0.515625q0 -0.125 0.015625 -0.1875l1.671875 -7.4375q0.0625 -0.359375 0.390625 -0.625q0.34375 -0.28125 0.71875 -0.28125q0.34375 0 0.53125 0.203125q0.203125 0.1875 0.203125 0.5q0 0.140625 -0.015625 0.203125l-1.671875 7.4375q-0.078125 0.375 -0.40625 0.640625q-0.3125 0.265625 -0.6875 0.265625z"
fill-rule="nonzero" />
<path fill="#000000"
d="m239.81346 73.22901q-0.375 0 -0.625 -0.234375q-0.234375 -0.25 -0.234375 -0.609375l0 -10.828125q0 -0.375 0.25 -0.609375q0.265625 -0.234375 0.640625 -0.234375q0.359375 0 0.609375 0.234375q0.265625 0.234375 0.265625 0.609375l0 3.65625q0.546875 -0.609375 1.1875 -0.859375q0.65625 -0.265625 1.4375 -0.265625q0.9375 0 1.703125 0.484375q0.765625 0.46875 1.21875 1.40625q0.453125 0.921875 0.453125 2.25q0 2.390625 -1.09375 3.65625q-1.09375 1.25 -3.0 1.25q-0.53125 0 -1.125 -0.171875q-0.578125 -0.1875 -0.890625 -0.40625l0 0.109375q0 0.1875 -0.234375 0.375q-0.234375 0.1875 -0.5625 0.1875zm2.671875 -1.84375q1.3125 0 1.890625 -0.8125q0.59375 -0.828125 0.59375 -2.34375q0 -1.15625 -0.453125 -1.75q-0.4375 -0.59375 -1.203125 -0.59375q-0.703125 0 -1.3125 0.265625q-0.59375 0.25 -0.9375 0.703125q-0.34375 0.4375 -0.34375 0.984375l0 3.21875q0.953125 0.328125 1.765625 0.328125zm10.278915 1.859375q-1.3125 0 -2.375 -0.5625q-1.046875 -0.578125 -1.65625 -1.625q-0.59375 -1.0625 -0.59375 -2.5q0 -1.171875 0.53125 -2.234375q0.53125 -1.078125 1.484375 -1.734375q0.96875 -0.65625 2.203125 -0.65625q1.140625 0 1.953125 0.46875q0.8125 0.453125 1.21875 1.09375q0.40625 0.625 0.40625 1.125q0 0.25 -0.15625 0.46875q-0.140625 0.21875 -0.421875 0.390625l-5.0 2.875q0.265625 0.578125 0.921875 0.90625q0.671875 0.3125 1.484375 0.3125q1.09375 0 1.859375 -0.8125q0.296875 -0.296875 0.609375 -0.296875q0.375 0 0.625 0.25q0.25 0.234375 0.25 0.59375q0 0.296875 -0.21875 0.5625q-0.53125 0.578125 -1.40625 0.984375q-0.859375 0.390625 -1.71875 0.390625zm1.328125 -6.84375q-0.21875 -0.390625 -0.640625 -0.578125q-0.40625 -0.203125 -0.953125 -0.203125q-1.0625 0 -1.78125 0.734375q-0.71875 0.71875 -0.84375 1.78125q-0.03125 0.28125 -0.03125 0.671875l4.25 -2.40625zm4.8164673 -3.25q-0.359375 0 -0.625 -0.28125q-0.265625 -0.28125 -0.25 -0.625l0.015625 -0.6875q0.015625 -0.359375 0.28125 -0.625q0.265625 -0.265625 0.609375 -0.265625q0.359375 0 0.625 0.28125q0.28125 0.28125 0.265625 0.625l-0.03125 0.703125q-0.015625 0.375 -0.265625 
0.625q-0.25 0.25 -0.625 0.25zm-0.09375 10.09375q-0.375 0 -0.640625 -0.265625q-0.265625 -0.265625 -0.265625 -0.640625l0.09375 -7.4375q0 -0.359375 0.265625 -0.625q0.265625 -0.28125 0.640625 -0.28125q0.390625 0 0.65625 0.28125q0.265625 0.265625 0.265625 0.625l-0.09375 7.4375q0 0.375 -0.28125 0.640625q-0.265625 0.265625 -0.640625 0.265625zm4.517853 0q-0.359375 0 -0.625 -0.25q-0.265625 -0.265625 -0.265625 -0.609375l0 -7.578125q0 -0.359375 0.265625 -0.625q0.28125 -0.265625 0.65625 -0.265625q0.359375 0 0.609375 0.265625q0.25 0.265625 0.25 0.625l0 1.421875q0.53125 -1.109375 1.328125 -1.703125q0.796875 -0.59375 1.75 -0.59375q1.046875 0 1.65625 0.484375q0.625 0.484375 0.90625 1.609375q0.28125 1.109375 0.28125 3.0625l-0.015625 3.265625q0 0.359375 -0.25 0.625q-0.25 0.265625 -0.609375 0.265625q-0.359375 0 -0.640625 -0.265625q-0.265625 -0.265625 -0.265625 -0.609375l0 -3.09375q0 -1.453125 -0.125 -2.234375q-0.125 -0.796875 -0.390625 -1.078125q-0.265625 -0.296875 -0.71875 -0.296875q-0.546875 0 -1.109375 0.484375q-0.5625 0.46875 -1.046875 1.34375q-0.421875 0.765625 -0.59375 1.5q-0.171875 0.71875 -0.171875 1.75l0 1.5625q0 0.421875 -0.234375 0.6875q-0.234375 0.25 -0.640625 0.25zm12.427216 3.140625q-1.078125 0 -2.203125 -0.265625q-0.296875 -0.078125 -0.5 -0.328125q-0.1875 -0.234375 -0.1875 -0.546875q0 -0.375 0.265625 -0.640625q0.265625 -0.25 0.625 -0.25q0.21875 0 0.59375 0.09375q0.078125 0.015625 0.46875 0.078125q0.390625 0.0625 0.921875 0.0625q1.921875 0 2.0625 -1.921875l0.03125 -0.453125q-0.3125 0.4375 -0.84375 0.65625q-0.515625 0.203125 -1.109375 0.203125q-0.984375 0 -1.859375 -0.484375q-0.859375 -0.5 -1.390625 -1.4375q-0.515625 -0.9375 -0.515625 -2.21875q0 -1.3125 0.578125 -2.46875q0.59375 -1.15625 1.609375 -1.84375q1.015625 -0.703125 2.25 -0.703125q0.6875 0 1.546875 0.34375q0.15625 -0.53125 0.859375 -0.53125q0.390625 0 0.640625 0.21875q0.25 0.21875 0.234375 0.703125l-0.1875 7.875q-0.03125 1.921875 -1.09375 2.890625q-1.0625 0.96875 -2.796875 0.96875zm0.015625 -5.0625q1.03125 0 
1.609375 -0.734375q0.578125 -0.734375 0.578125 -2.28125l0 -2.265625q-0.265625 -0.140625 -0.640625 -0.234375q-0.359375 -0.09375 -0.6875 -0.09375q-0.765625 0 -1.40625 0.40625q-0.640625 0.390625 -1.03125 1.109375q-0.375 0.703125 -0.375 1.578125q0 1.140625 0.5625 1.828125q0.5625 0.6875 1.390625 0.6875z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#c27ba0"
d="m179.14699 399.0315l30.595795 0l0 19.199463l-30.595795 0l0 -19.199463z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#c27ba0"
d="m209.74278 399.0315l30.595795 0l0 19.199463l-30.595795 0l0 -19.199463z"
fill-rule="nonzero" />
<path shape-rendering="crispEdges" fill="#c27ba0"
d="m240.33858 399.0315l30.59581 0l0 19.199463l-30.59581 0l0 -19.199463z"
fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m179.14699 398.5328l0 20.196838" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m209.74278 398.5328l0 20.196838" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m240.33858 398.5328l0 20.196838" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m270.9344 398.5328l0 20.196838" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m178.6483 399.0315l92.784775 0" fill-rule="nonzero" />
<path stroke="#434343" stroke-width="1.0" stroke-linecap="butt"
d="m178.6483 418.23096l92.784775 0" fill-rule="nonzero" />
<path fill="#000000" fill-opacity="0.0"
d="m147.54355 350.8632l154.99213 0l0 55.590546l-154.99213 0z"
fill-rule="evenodd" />
<path fill="#000000"
d="m170.7249 374.30856q-1.484375 0 -2.734375 -0.671875q-1.25 -0.671875 -1.984375 -1.9375q-0.71875 -1.28125 -0.71875 -3.0q0 -1.8125 0.84375 -3.453125q0.84375 -1.65625 2.203125 -2.65625q1.359375 -1.0 2.765625 -1.0q0.59375 0 1.1875 0.1875q0.59375 0.1875 1.0625 0.46875q0.21875 -0.234375 0.546875 -0.234375q0.375 0 0.609375 0.265625q0.25 0.25 0.25 0.59375l0 1.203125q0 0.375 -0.265625 0.65625q-0.265625 0.265625 -0.65625 0.265625q-0.375 0 -0.640625 -0.21875q-0.25 -0.21875 -0.25 -0.515625l0 -0.140625q-0.6875 -0.71875 -1.96875 -0.71875q-0.9375 0 -1.828125 0.75q-0.890625 0.75 -1.453125 1.984375q-0.546875 1.234375 -0.546875 2.5625q0 1.171875 0.484375 2.046875q0.484375 0.859375 1.296875 1.328125q0.8125 0.453125 1.796875 0.453125q1.8125 0 2.984375 -1.5625q0.171875 -0.203125 0.34375 -0.28125q0.171875 -0.078125 0.390625 -0.078125q0.390625 0 0.65625 0.25q0.28125 0.25 0.28125 0.671875q0 0.328125 -0.171875 0.5625q-0.765625 1.078125 -1.9375 1.65625q-1.15625 0.5625 -2.546875 0.5625zm9.571777 0q-1.0625 0 -1.953125 -0.578125q-0.875 -0.578125 -1.390625 -1.578125q-0.5 -1.0 -0.5 -2.171875q0 -1.328125 0.59375 -2.46875q0.609375 -1.140625 1.609375 -1.828125q1.015625 -0.6875 2.171875 -0.6875q1.0625 0 1.9375 0.609375q0.890625 0.59375 1.40625 1.578125q0.515625 0.984375 0.515625 2.125q0 1.34375 -0.59375 2.5q-0.59375 1.140625 -1.625 1.828125q-1.015625 0.671875 -2.171875 0.671875zm0 -1.671875q0.65625 0 1.25 -0.453125q0.609375 -0.453125 0.953125 -1.1875q0.359375 -0.75 0.359375 -1.59375q0 -0.703125 -0.28125 -1.328125q-0.28125 -0.625 -0.765625 -1.0q-0.484375 -0.390625 -1.0625 -0.390625q-0.65625 0 -1.234375 0.453125q-0.5625 0.453125 -0.90625 1.21875q-0.328125 0.75 -0.328125 1.609375q0 0.84375 0.28125 1.4375q0.296875 0.59375 0.75 0.921875q0.46875 0.3125 0.984375 0.3125zm7.301529 1.671875q-0.359375 0 -0.625 -0.25q-0.265625 -0.265625 -0.265625 -0.609375l0 -7.578125q0 -0.359375 0.265625 -0.625q0.28125 -0.265625 0.65625 -0.265625q0.359375 0 0.609375 0.265625q0.25 0.265625 0.25 0.625l0 1.421875q0.53125 
-1.109375 1.328125 -1.703125q0.796875 -0.59375 1.75 -0.59375q1.046875 0 1.65625 0.484375q0.625 0.484375 0.90625 1.609375q0.28125 1.109375 0.28125 3.0625l-0.015625 3.265625q0 0.359375 -0.25 0.625q-0.25 0.265625 -0.609375 0.265625q-0.359375 0 -0.640625 -0.265625q-0.265625 -0.265625 -0.265625 -0.609375l0 -3.09375q0 -1.453125 -0.125 -2.234375q-0.125 -0.796875 -0.390625 -1.078125q-0.265625 -0.296875 -0.71875 -0.296875q-0.546875 0 -1.109375 0.484375q-0.5625 0.46875 -1.046875 1.34375q-0.421875 0.765625 -0.59375 1.5q-0.171875 0.71875 -0.171875 1.75l0 1.5625q0 0.421875 -0.234375 0.6875q-0.234375 0.25 -0.640625 0.25zm11.974091 -0.015625q-0.953125 0 -1.71875 -0.484375q-0.75 -0.5 -1.1875 -1.453125q-0.4375 -0.953125 -0.4375 -2.28125q0 -1.609375 0.5 -2.734375q0.515625 -1.140625 1.40625 -1.71875q0.890625 -0.59375 2.078125 -0.59375q0.921875 0 1.984375 0.421875l0.09375 -2.890625q0.03125 -0.375 0.296875 -0.640625q0.265625 -0.28125 0.640625 -0.28125q0.359375 0 0.625 0.28125q0.265625 0.265625 0.25 0.671875l-0.25 10.859375q-0.03125 0.375 -0.296875 0.625q-0.25 0.234375 -0.625 0.234375q-0.359375 0 -0.609375 -0.203125q-0.25 -0.203125 -0.25 -0.546875l0 -0.328125q-0.484375 0.453125 -1.140625 0.765625q-0.65625 0.296875 -1.359375 0.296875zm0.15625 -1.671875q0.625 0 1.171875 -0.265625q0.5625 -0.28125 0.875 -0.671875q0.3125 -0.390625 0.328125 -0.734375l0.078125 -3.671875q-0.328125 -0.21875 -0.859375 -0.390625q-0.515625 -0.1875 -1.0 -0.1875q-1.140625 0 -1.71875 0.859375q-0.578125 0.84375 -0.578125 2.46875q0 1.34375 0.453125 1.96875q0.46875 0.625 1.25 0.625zm7.847946 -8.40625q-0.359375 0 -0.625 -0.28125q-0.265625 -0.28125 -0.25 -0.625l0.015625 -0.6875q0.015625 -0.359375 0.28125 -0.625q0.265625 -0.265625 0.609375 -0.265625q0.359375 0 0.625 0.28125q0.28125 0.28125 0.265625 0.625l-0.03125 0.703125q-0.015625 0.375 -0.265625 0.625q-0.25 0.25 -0.625 0.25zm-0.09375 10.09375q-0.375 0 -0.640625 -0.265625q-0.265625 -0.265625 -0.265625 -0.640625l0.09375 -7.4375q0 -0.359375 0.265625 -0.625q0.265625 -0.28125 
0.640625 -0.28125q0.390625 0 0.65625 0.28125q0.265625 0.265625 0.265625 0.625l-0.09375 7.4375q0 0.375 -0.28125 0.640625q-0.265625 0.265625 -0.640625 0.265625zm5.2365875 0.015625q-0.34375 0 -0.625 -0.25q-0.265625 -0.265625 -0.265625 -0.625l0 -6.65625l-1.46875 0q-0.375 0 -0.625 -0.265625q-0.265625 -0.28125 -0.265625 -0.625q0 -0.421875 0.28125 -0.671875q0.265625 -0.25 0.640625 -0.25l1.4375 0l0 -2.0q0 -0.34375 0.265625 -0.625q0.265625 -0.28125 0.625 -0.28125q0.40625 0 0.65625 0.28125q0.25 0.265625 0.25 0.625l0 2.0l1.578125 0q0.375 0 0.640625 0.265625q0.265625 0.265625 0.265625 0.65625q0 0.34375 -0.265625 0.625q-0.265625 0.265625 -0.640625 0.265625l-1.578125 0l0 6.65625q0 0.359375 -0.265625 0.625q-0.265625 0.25 -0.640625 0.25zm5.5098724 -10.109375q-0.359375 0 -0.625 -0.28125q-0.265625 -0.28125 -0.25 -0.625l0.015625 -0.6875q0.015625 -0.359375 0.28125 -0.625q0.265625 -0.265625 0.609375 -0.265625q0.359375 0 0.625 0.28125q0.28125 0.28125 0.265625 0.625l-0.03125 0.703125q-0.015625 0.375 -0.265625 0.625q-0.25 0.25 -0.625 0.25zm-0.09375 10.09375q-0.375 0 -0.640625 -0.265625q-0.265625 -0.265625 -0.265625 -0.640625l0.09375 -7.4375q0 -0.359375 0.265625 -0.625q0.265625 -0.28125 0.640625 -0.28125q0.390625 0 0.65625 0.28125q0.265625 0.265625 0.265625 0.625l-0.09375 7.4375q0 0.375 -0.28125 0.640625q-0.265625 0.265625 -0.640625 0.265625zm6.8615875 0q-1.0625 0 -1.953125 -0.578125q-0.875 -0.578125 -1.390625 -1.578125q-0.5 -1.0 -0.5 -2.171875q0 -1.328125 0.59375 -2.46875q0.609375 -1.140625 1.609375 -1.828125q1.015625 -0.6875 2.171875 -0.6875q1.0625 0 1.9375 0.609375q0.890625 0.59375 1.40625 1.578125q0.515625 0.984375 0.515625 2.125q0 1.34375 -0.59375 2.5q-0.59375 1.140625 -1.625 1.828125q-1.015625 0.671875 -2.171875 0.671875zm0 -1.671875q0.65625 0 1.25 -0.453125q0.609375 -0.453125 0.953125 -1.1875q0.359375 -0.75 0.359375 -1.59375q0 -0.703125 -0.28125 -1.328125q-0.28125 -0.625 -0.765625 -1.0q-0.484375 -0.390625 -1.0625 -0.390625q-0.65625 0 -1.234375 0.453125q-0.5625 0.453125 -0.90625 
1.21875q-0.328125 0.75 -0.328125 1.609375q0 0.84375 0.28125 1.4375q0.296875 0.59375 0.75 0.921875q0.46875 0.3125 0.984375 0.3125zm7.301529 1.671875q-0.359375 0 -0.625 -0.25q-0.265625 -0.265625 -0.265625 -0.609375l0 -7.578125q0 -0.359375 0.265625 -0.625q0.28125 -0.265625 0.65625 -0.265625q0.359375 0 0.609375 0.265625q0.25 0.265625 0.25 0.625l0 1.421875q0.53125 -1.109375 1.328125 -1.703125q0.796875 -0.59375 1.75 -0.59375q1.046875 0 1.65625 0.484375q0.625 0.484375 0.90625 1.609375q0.28125 1.109375 0.28125 3.0625l-0.015625 3.265625q0 0.359375 -0.25 0.625q-0.25 0.265625 -0.609375 0.265625q-0.359375 0 -0.640625 -0.265625q-0.265625 -0.265625 -0.265625 -0.609375l0 -3.09375q0 -1.453125 -0.125 -2.234375q-0.125 -0.796875 -0.390625 -1.078125q-0.265625 -0.296875 -0.71875 -0.296875q-0.546875 0 -1.109375 0.484375q-0.5625 0.46875 -1.046875 1.34375q-0.421875 0.765625 -0.59375 1.5q-0.171875 0.71875 -0.171875 1.75l0 1.5625q0 0.421875 -0.234375 0.6875q-0.234375 0.25 -0.640625 0.25zm13.255341 0q-1.3125 0 -2.375 -0.5625q-1.046875 -0.578125 -1.65625 -1.625q-0.59375 -1.0625 -0.59375 -2.5q0 -1.171875 0.53125 -2.234375q0.53125 -1.078125 1.484375 -1.734375q0.96875 -0.65625 2.203125 -0.65625q1.140625 0 1.953125 0.46875q0.8125 0.453125 1.21875 1.09375q0.40625 0.625 0.40625 1.125q0 0.25 -0.15625 0.46875q-0.140625 0.21875 -0.421875 0.390625l-5.0 2.875q0.265625 0.578125 0.921875 0.90625q0.671875 0.3125 1.484375 0.3125q1.09375 0 1.859375 -0.8125q0.296875 -0.296875 0.609375 -0.296875q0.375 0 0.625 0.25q0.25 0.234375 0.25 0.59375q0 0.296875 -0.21875 0.5625q-0.53125 0.578125 -1.40625 0.984375q-0.859375 0.390625 -1.71875 0.390625zm1.328125 -6.84375q-0.21875 -0.390625 -0.640625 -0.578125q-0.40625 -0.203125 -0.953125 -0.203125q-1.0625 0 -1.78125 0.734375q-0.71875 0.71875 -0.84375 1.78125q-0.03125 0.28125 -0.03125 0.671875l4.25 -2.40625zm6.5508423 6.828125q-0.953125 0 -1.71875 -0.484375q-0.75 -0.5 -1.1875 -1.453125q-0.4375 -0.953125 -0.4375 -2.28125q0 -1.609375 0.5 -2.734375q0.515625 -1.140625 1.40625 
-1.71875q0.890625 -0.59375 2.078125 -0.59375q0.921875 0 1.9843597 0.421875l0.09375 -2.890625q0.03125 -0.375 0.296875 -0.640625q0.265625 -0.28125 0.640625 -0.28125q0.359375 0 0.625 0.28125q0.265625 0.265625 0.25 0.671875l-0.25 10.859375q-0.03125 0.375 -0.296875 0.625q-0.25 0.234375 -0.625 0.234375q-0.359375 0 -0.609375 -0.203125q-0.24998474 -0.203125 -0.24998474 -0.546875l0 -0.328125q-0.484375 0.453125 -1.140625 0.765625q-0.65625 0.296875 -1.359375 0.296875zm0.15625 -1.671875q0.625 0 1.171875 -0.265625q0.5625 -0.28125 0.875 -0.671875q0.3125 -0.390625 0.328125 -0.734375l0.07810974 -3.671875q-0.32810974 -0.21875 -0.85935974 -0.390625q-0.515625 -0.1875 -1.0 -0.1875q-1.140625 0 -1.71875 0.859375q-0.578125 0.84375 -0.578125 2.46875q0 1.34375 0.453125 1.96875q0.46875 0.625 1.25 0.625zm16.723953 1.6875q-1.0625 0 -1.953125 -0.578125q-0.875 -0.578125 -1.390625 -1.578125q-0.5 -1.0 -0.5 -2.171875q0 -1.328125 0.59375 -2.46875q0.609375 -1.140625 1.609375 -1.828125q1.015625 -0.6875 2.171875 -0.6875q1.0625 0 1.9375 0.609375q0.890625 0.59375 1.40625 1.578125q0.515625 0.984375 0.515625 2.125q0 1.34375 -0.59375 2.5q-0.59375 1.140625 -1.625 1.828125q-1.015625 0.671875 -2.171875 0.671875zm0 -1.671875q0.65625 0 1.25 -0.453125q0.609375 -0.453125 0.953125 -1.1875q0.359375 -0.75 0.359375 -1.59375q0 -0.703125 -0.28125 -1.328125q-0.28125 -0.625 -0.765625 -1.0q-0.484375 -0.390625 -1.0625 -0.390625q-0.65625 0 -1.234375 0.453125q-0.5625 0.453125 -0.90625 1.21875q-0.328125 0.75 -0.328125 1.609375q0 0.84375 0.28125 1.4375q0.296875 0.59375 0.75 0.921875q0.46875 0.3125 0.984375 0.3125zm7.3015137 1.671875q-0.359375 0 -0.625 -0.25q-0.265625 -0.265625 -0.265625 -0.609375l0 -7.578125q0 -0.359375 0.265625 -0.625q0.28125 -0.265625 0.65625 -0.265625q0.359375 0 0.609375 0.265625q0.25 0.265625 0.25 0.625l0 1.421875q0.53125 -1.109375 1.328125 -1.703125q0.796875 -0.59375 1.75 -0.59375q1.046875 0 1.65625 0.484375q0.625 0.484375 0.90625 1.609375q0.28125 1.109375 0.28125 3.0625l-0.015625 3.265625q0 0.359375 
-0.25 0.625q-0.25 0.265625 -0.609375 0.265625q-0.359375 0 -0.640625 -0.265625q-0.265625 -0.265625 -0.265625 -0.609375l0 -3.09375q0 -1.453125 -0.125 -2.234375q-0.125 -0.796875 -0.390625 -1.078125q-0.265625 -0.296875 -0.71875 -0.296875q-0.546875 0 -1.109375 0.484375q-0.5625 0.46875 -1.046875 1.34375q-0.421875 0.765625 -0.59375 1.5q-0.171875 0.71875 -0.171875 1.75l0 1.5625q0 0.421875 -0.234375 0.6875q-0.234375 0.25 -0.640625 0.25z"
fill-rule="nonzero" />
<path fill="#000000"
d="m170.81418 392.22855q-0.359375 0 -0.640625 -0.25q-0.265625 -0.265625 -0.265625 -0.609375l0 -7.546875q0 -0.375 0.265625 -0.640625q0.265625 -0.265625 0.640625 -0.265625q0.375 0 0.640625 0.265625q0.28125 0.265625 0.28125 0.640625l0 0.546875q0.90625 -1.484375 2.5 -1.484375q0.9375 0 1.359375 0.53125q0.421875 0.515625 0.421875 1.234375l0 0.5625q0 0.375 -0.25 0.625q-0.25 0.25 -0.625 0.25q-0.453125 0 -0.671875 -0.25q-0.203125 -0.265625 -0.203125 -0.640625l0 -0.234375q0 -0.3125 -0.40625 -0.3125q-0.34375 0 -0.78125 0.34375q-0.4375 0.328125 -0.71875 0.703125q-0.390625 0.484375 -0.515625 0.96875q-0.109375 0.484375 -0.109375 1.15625l-0.03125 3.546875q0 0.34375 -0.28125 0.609375q-0.265625 0.25 -0.609375 0.25zm10.067841 0q-1.3125 0 -2.375 -0.5625q-1.046875 -0.578125 -1.65625 -1.625q-0.59375 -1.0625 -0.59375 -2.5q0 -1.171875 0.53125 -2.234375q0.53125 -1.078125 1.484375 -1.734375q0.96875 -0.65625 2.203125 -0.65625q1.140625 0 1.953125 0.46875q0.8125 0.453125 1.21875 1.09375q0.40625 0.625 0.40625 1.125q0 0.25 -0.15625 0.46875q-0.140625 0.21875 -0.421875 0.390625l-5.0 2.875q0.265625 0.578125 0.921875 0.90625q0.671875 0.3125 1.484375 0.3125q1.09375 0 1.859375 -0.8125q0.296875 -0.296875 0.609375 -0.296875q0.375 0 0.625 0.25q0.25 0.234375 0.25 0.59375q0 0.296875 -0.21875 0.5625q-0.53125 0.578125 -1.40625 0.984375q-0.859375 0.390625 -1.71875 0.390625zm1.328125 -6.84375q-0.21875 -0.390625 -0.640625 -0.578125q-0.40625 -0.203125 -0.953125 -0.203125q-1.0625 0 -1.78125 0.734375q-0.71875 0.71875 -0.84375 1.78125q-0.03125 0.28125 -0.03125 0.671875l4.25 -2.40625zm7.2852173 6.84375q-1.125 0 -2.0625 -0.53125q-0.921875 -0.546875 -1.46875 -1.484375q-0.546875 -0.953125 -0.546875 -2.0625q0 -1.28125 0.59375 -2.484375q0.609375 -1.21875 1.578125 -1.984375q0.96875 -0.765625 2.046875 -0.765625q0.984375 0 1.59375 0.3125q0.609375 0.3125 1.1875 0.984375q0.28125 0.296875 0.28125 0.65625q0 0.34375 -0.265625 0.59375q-0.25 0.25 -0.625 0.25q-0.25 0 -0.390625 -0.0625q-0.140625 -0.078125 -0.328125 
-0.296875q-0.3125 -0.359375 -0.625 -0.546875q-0.296875 -0.1875 -0.8125 -0.1875q-0.578125 0 -1.171875 0.5q-0.578125 0.5 -0.9375 1.3125q-0.359375 0.8125 -0.359375 1.671875q0 0.75 0.3125 1.296875q0.328125 0.546875 0.84375 0.84375q0.53125 0.296875 1.109375 0.296875q0.75 0 1.21875 -0.28125q0.484375 -0.28125 0.859375 -0.765625q0.34375 -0.46875 0.796875 -0.46875q0.359375 0 0.609375 0.25q0.265625 0.25 0.265625 0.5625q0 0.375 -0.203125 0.640625q-0.53125 0.71875 -1.46875 1.234375q-0.9375 0.515625 -2.03125 0.515625zm8.534058 0q-1.0625 0 -1.953125 -0.578125q-0.875 -0.578125 -1.390625 -1.578125q-0.5 -1.0 -0.5 -2.171875q0 -1.328125 0.59375 -2.46875q0.609375 -1.140625 1.609375 -1.828125q1.015625 -0.6875 2.171875 -0.6875q1.0625 0 1.9375 0.609375q0.890625 0.59375 1.40625 1.578125q0.515625 0.984375 0.515625 2.125q0 1.34375 -0.59375 2.5q-0.59375 1.140625 -1.625 1.828125q-1.015625 0.671875 -2.171875 0.671875zm0 -1.671875q0.65625 0 1.25 -0.453125q0.609375 -0.453125 0.953125 -1.1875q0.359375 -0.75 0.359375 -1.59375q0 -0.703125 -0.28125 -1.328125q-0.28125 -0.625 -0.765625 -1.0q-0.484375 -0.390625 -1.0625 -0.390625q-0.65625 0 -1.234375 0.453125q-0.5625 0.453125 -0.90625 1.21875q-0.328125 0.75 -0.328125 1.609375q0 0.84375 0.28125 1.4375q0.296875 0.59375 0.75 0.921875q0.46875 0.3125 0.984375 0.3125zm7.317154 1.671875q-0.359375 0 -0.640625 -0.25q-0.265625 -0.265625 -0.265625 -0.609375l0 -7.546875q0 -0.375 0.265625 -0.640625q0.265625 -0.265625 0.640625 -0.265625q0.375 0 0.640625 0.265625q0.28125 0.265625 0.28125 0.640625l0 0.546875q0.90625 -1.484375 2.5 -1.484375q0.9375 0 1.359375 0.53125q0.421875 0.515625 0.421875 1.234375l0 0.5625q0 0.375 -0.25 0.625q-0.25 0.25 -0.625 0.25q-0.453125 0 -0.671875 -0.25q-0.203125 -0.265625 -0.203125 -0.640625l0 -0.234375q0 -0.3125 -0.40625 -0.3125q-0.34375 0 -0.78125 0.34375q-0.4375 0.328125 -0.71875 0.703125q-0.390625 0.484375 -0.515625 0.96875q-0.109375 0.484375 -0.109375 1.15625l-0.03125 3.546875q0 0.34375 -0.28125 0.609375q-0.265625 0.25 -0.609375 
0.25zm8.786591 -0.015625q-0.953125 0 -1.71875 -0.484375q-0.75 -0.5 -1.1875 -1.453125q-0.4375 -0.953125 -0.4375 -2.28125q0 -1.609375 0.5 -2.734375q0.515625 -1.140625 1.40625 -1.71875q0.890625 -0.59375 2.078125 -0.59375q0.921875 0 1.984375 0.421875l0.09375 -2.890625q0.03125 -0.375 0.296875 -0.640625q0.265625 -0.28125 0.640625 -0.28125q0.359375 0 0.625 0.28125q0.265625 0.265625 0.25 0.671875l-0.25 10.859375q-0.03125 0.375 -0.296875 0.625q-0.25 0.234375 -0.625 0.234375q-0.359375 0 -0.609375 -0.203125q-0.25 -0.203125 -0.25 -0.546875l0 -0.328125q-0.484375 0.453125 -1.140625 0.765625q-0.65625 0.296875 -1.359375 0.296875zm0.15625 -1.671875q0.625 0 1.171875 -0.265625q0.5625 -0.28125 0.875 -0.671875q0.3125 -0.390625 0.328125 -0.734375l0.078125 -3.671875q-0.328125 -0.21875 -0.859375 -0.390625q-0.515625 -0.1875 -1.0 -0.1875q-1.140625 0 -1.71875 0.859375q-0.578125 0.84375 -0.578125 2.46875q0 1.34375 0.453125 1.96875q0.46875 0.625 1.25 0.625z"
fill-rule="nonzero" />
<path fill="#000000"
d="m229.91962 382.1348q-0.296875 0 -0.5 -0.1875q-0.203125 -0.203125 -0.203125 -0.5q0 -0.078125 0.03125 -0.21875l0.171875 -0.6875q0.0625 -0.359375 0.390625 -0.625q0.34375 -0.265625 0.6875 -0.265625q0.296875 0 0.5 0.21875q0.203125 0.21875 0.203125 0.515625q0 0.109375 -0.015625 0.171875l-0.1875 0.703125q-0.09375 0.375 -0.40625 0.625q-0.296875 0.25 -0.671875 0.25zm-2.21875 10.09375q-0.3125 0 -0.53125 -0.203125q-0.21875 -0.203125 -0.21875 -0.515625q0 -0.125 0.015625 -0.1875l1.671875 -7.4375q0.0625 -0.359375 0.390625 -0.625q0.34375 -0.28125 0.71875 -0.28125q0.34375 0 0.53125 0.203125q0.203125 0.1875 0.203125 0.5q0 0.140625 -0.015625 0.203125l-1.671875 7.4375q-0.078125 0.375 -0.40625 0.640625q-0.3125 0.265625 -0.6875 0.265625z"
fill-rule="nonzero" />
<path fill="#000000"
d="m239.81346 392.21292q-0.375 0 -0.625 -0.234375q-0.234375 -0.25 -0.234375 -0.609375l0 -10.828125q0 -0.375 0.25 -0.609375q0.265625 -0.234375 0.640625 -0.234375q0.359375 0 0.609375 0.234375q0.265625 0.234375 0.265625 0.609375l0 3.65625q0.546875 -0.609375 1.1875 -0.859375q0.65625 -0.265625 1.4375 -0.265625q0.9375 0 1.703125 0.484375q0.765625 0.46875 1.21875 1.40625q0.453125 0.921875 0.453125 2.25q0 2.390625 -1.09375 3.65625q-1.09375 1.25 -3.0 1.25q-0.53125 0 -1.125 -0.171875q-0.578125 -0.1875 -0.890625 -0.40625l0 0.109375q0 0.1875 -0.234375 0.375q-0.234375 0.1875 -0.5625 0.1875zm2.671875 -1.84375q1.3125 0 1.890625 -0.8125q0.59375 -0.828125 0.59375 -2.34375q0 -1.15625 -0.453125 -1.75q-0.4375 -0.59375 -1.203125 -0.59375q-0.703125 0 -1.3125 0.265625q-0.59375 0.25 -0.9375 0.703125q-0.34375 0.4375 -0.34375 0.984375l0 3.21875q0.953125 0.328125 1.765625 0.328125zm10.278915 1.859375q-1.3125 0 -2.375 -0.5625q-1.046875 -0.578125 -1.65625 -1.625q-0.59375 -1.0625 -0.59375 -2.5q0 -1.171875 0.53125 -2.234375q0.53125 -1.078125 1.484375 -1.734375q0.96875 -0.65625 2.203125 -0.65625q1.140625 0 1.953125 0.46875q0.8125 0.453125 1.21875 1.09375q0.40625 0.625 0.40625 1.125q0 0.25 -0.15625 0.46875q-0.140625 0.21875 -0.421875 0.390625l-5.0 2.875q0.265625 0.578125 0.921875 0.90625q0.671875 0.3125 1.484375 0.3125q1.09375 0 1.859375 -0.8125q0.296875 -0.296875 0.609375 -0.296875q0.375 0 0.625 0.25q0.25 0.234375 0.25 0.59375q0 0.296875 -0.21875 0.5625q-0.53125 0.578125 -1.40625 0.984375q-0.859375 0.390625 -1.71875 0.390625zm1.328125 -6.84375q-0.21875 -0.390625 -0.640625 -0.578125q-0.40625 -0.203125 -0.953125 -0.203125q-1.0625 0 -1.78125 0.734375q-0.71875 0.71875 -0.84375 1.78125q-0.03125 0.28125 -0.03125 0.671875l4.25 -2.40625zm4.8164673 -3.25q-0.359375 0 -0.625 -0.28125q-0.265625 -0.28125 -0.25 -0.625l0.015625 -0.6875q0.015625 -0.359375 0.28125 -0.625q0.265625 -0.265625 0.609375 -0.265625q0.359375 0 0.625 0.28125q0.28125 0.28125 0.265625 0.625l-0.03125 0.703125q-0.015625 0.375 -0.265625 
0.625q-0.25 0.25 -0.625 0.25zm-0.09375 10.09375q-0.375 0 -0.640625 -0.265625q-0.265625 -0.265625 -0.265625 -0.640625l0.09375 -7.4375q0 -0.359375 0.265625 -0.625q0.265625 -0.28125 0.640625 -0.28125q0.390625 0 0.65625 0.28125q0.265625 0.265625 0.265625 0.625l-0.09375 7.4375q0 0.375 -0.28125 0.640625q-0.265625 0.265625 -0.640625 0.265625zm4.517853 0q-0.359375 0 -0.625 -0.25q-0.265625 -0.265625 -0.265625 -0.609375l0 -7.578125q0 -0.359375 0.265625 -0.625q0.28125 -0.265625 0.65625 -0.265625q0.359375 0 0.609375 0.265625q0.25 0.265625 0.25 0.625l0 1.421875q0.53125 -1.109375 1.328125 -1.703125q0.796875 -0.59375 1.75 -0.59375q1.046875 0 1.65625 0.484375q0.625 0.484375 0.90625 1.609375q0.28125 1.109375 0.28125 3.0625l-0.015625 3.265625q0 0.359375 -0.25 0.625q-0.25 0.265625 -0.609375 0.265625q-0.359375 0 -0.640625 -0.265625q-0.265625 -0.265625 -0.265625 -0.609375l0 -3.09375q0 -1.453125 -0.125 -2.234375q-0.125 -0.796875 -0.390625 -1.078125q-0.265625 -0.296875 -0.71875 -0.296875q-0.546875 0 -1.109375 0.484375q-0.5625 0.46875 -1.046875 1.34375q-0.421875 0.765625 -0.59375 1.5q-0.171875 0.71875 -0.171875 1.75l0 1.5625q0 0.421875 -0.234375 0.6875q-0.234375 0.25 -0.640625 0.25zm12.427216 3.140625q-1.078125 0 -2.203125 -0.265625q-0.296875 -0.078125 -0.5 -0.328125q-0.1875 -0.234375 -0.1875 -0.546875q0 -0.375 0.265625 -0.640625q0.265625 -0.25 0.625 -0.25q0.21875 0 0.59375 0.09375q0.078125 0.015625 0.46875 0.078125q0.390625 0.0625 0.921875 0.0625q1.921875 0 2.0625 -1.921875l0.03125 -0.453125q-0.3125 0.4375 -0.84375 0.65625q-0.515625 0.203125 -1.109375 0.203125q-0.984375 0 -1.859375 -0.484375q-0.859375 -0.5 -1.390625 -1.4375q-0.515625 -0.9375 -0.515625 -2.21875q0 -1.3125 0.578125 -2.46875q0.59375 -1.15625 1.609375 -1.84375q1.015625 -0.703125 2.25 -0.703125q0.6875 0 1.546875 0.34375q0.15625 -0.53125 0.859375 -0.53125q0.390625 0 0.640625 0.21875q0.25 0.21875 0.234375 0.703125l-0.1875 7.875q-0.03125 1.921875 -1.09375 2.890625q-1.0625 0.96875 -2.796875 0.96875zm0.015625 -5.0625q1.03125 0 
1.609375 -0.734375q0.578125 -0.734375 0.578125 -2.28125l0 -2.265625q-0.265625 -0.140625 -0.640625 -0.234375q-0.359375 -0.09375 -0.6875 -0.09375q-0.765625 0 -1.40625 0.40625q-0.640625 0.390625 -1.03125 1.109375q-0.375 0.703125 -0.375 1.578125q0 1.140625 0.5625 1.828125q0.5625 0.6875 1.390625 0.6875z"
fill-rule="nonzero" />
<path fill="#7c7ce0"
d="m435.10236 81.62992l114.703186 0l0 -13.535431l16.5094 27.070862l-16.5094 27.07087l0 -13.535431l-114.703186 0z"
fill-rule="evenodd" />
<path fill="#7c7ce0"
d="m435.10236 367.00525l114.703186 0l0 -13.535431l16.5094 27.070862l-16.5094 27.070862l0 -13.535431l-114.703186 0z"
fill-rule="evenodd" />
<path fill="#000000" fill-opacity="0.0"
d="m574.15094 24.150354l154.99207 0l0 142.02911l-154.99207 0z"
fill-rule="evenodd" />
<g transform="matrix(0.5636076115485564 0.0 0.0 0.5636076115485564 574.1509186351706 24.150355118110234)">
<clipPath id="p.3">
<path d="m-1.1368684E-13 0l275.0 0l0 252.0l-275.0 0z"
clip-rule="evenodd" />
</clipPath>
<image clip-path="url(#p.3)" fill="#000" width="275.0"
height="252.0" x="0.0" y="0.0" preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/magic.gif"/>
</g>
<path fill="#000000" fill-opacity="0.0"
d="m574.15094 307.6451l154.99207 0l0 142.02911l-154.99207 0z"
fill-rule="evenodd" />
<g transform="matrix(0.5636076115485564 0.0 0.0 0.5636076115485564 574.1509186351706 307.6451057742782)">
<clipPath id="p.4">
<path d="m-1.1368684E-13 -1.1368684E-13l275.0 0l0 252.0l-275.0 0z"
clip-rule="evenodd" />
</clipPath>
<image clip-path="url(#p.4)" fill="#000" width="275.0"
height="252.0" x="0.0" y="0.0" preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/magic.gif"/>
</g>
</g>
</svg>
</center></p>
<p>This <span class="math">\(\theta\)</span> captures both what the attacker knows and doesn't know. The more
uncertainty the attacker has, the "more random" this probability distribution
is. How do we choose <span class="math">\(\theta\)</span>? Great question. We'll come back to it a little
later.</p>
<p>Now: what did we find out?</p>
<h2 id="some-positive-results">Some positive results</h2>
<p>Let's take the very simple case of a referendum, where everybody answers a
question by <em>Yes</em> or <em>No</em>. The attacker is uncertain about whether some people
<span class="math">\(i\)</span> vote <em>Yes</em> or <em>No</em>. This is captured by having each vote be random: <em>Yes</em>
with probability <span class="math">\(p_i\)</span> and <em>No</em> with probability <span class="math">\(1-p_i\)</span>. And we capture "how
uncertain" the attacker is by having a certain number of <span class="math">\(p_i\)</span> be "not too close
to 0 or 1".</p>
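<p>As a toy illustration, here is a minimal Python sketch of this model (the function name and the numbers are mine, not from the paper): each vote is an independent Bernoulli draw, and all the attacker observes is the published total.</p>

```python
import random

def simulate_attacker_view(probabilities, trials=10_000):
    """Sample the distribution of the published Yes total, as seen by
    an attacker whose belief about vote i is Bernoulli(p_i)."""
    counts = {}
    for _ in range(trials):
        total = sum(1 for p in probabilities if random.random() < p)
        counts[total] = counts.get(total, 0) + 1
    return counts

# The attacker is certain about two votes (p = 1 or 0), and maximally
# uncertain about twenty others (p = 0.5, "not too close to 0 or 1").
prior = [1.0, 0.0] + [0.5] * 20
distribution = simulate_attacker_view(prior)
```

<p>The more of the <span class="math">\(p_i\)</span> sit away from 0 and 1, the more spread out this distribution gets.</p>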
<p>This modeling is the same as <a href="https://eprint.iacr.org/2011/487.pdf">prior work</a>, but we get better bounds
for <span class="math">\(\varepsilon\)</span> and <span class="math">\(\delta\)</span>. This is due to a neat trick: we can reuse the
results from <a href="https://arxiv.org/abs/1903.02837">amplification by shuffling</a>! The context is different,
but the underlying math is the same: take a bunch of small random things, mix
them together, and you get a good amount of randomness in total. That's
Theorem 1 in the <a href="https://arxiv.org/abs/1905.00650">paper</a>, if you want to follow along. And it can easily be
extended to settings where people are choosing among more than two options
(Corollary 1).</p>
<p>We continue by showing that <em>thresholding</em> can provide some guarantees, under
the right assumptions. Of course, thresholding only has an effect for <em>rare</em>
events: when all the <span class="math">\(p_i\)</span> are very small, and we only publish results if the
total number of <em>Yes</em> is larger than a threshold. Then, depending on the exact
parameters, we can get reasonably small values of <span class="math">\(\varepsilon\)</span> and <span class="math">\(\delta\)</span>.
Again, this captures an intuitive phenomenon: if very few people are in a group,
and we suppress this group from the results entirely, then this can
provide some protection (Theorem 4).</p>
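<p>Concretely, the thresholding mechanism itself is nothing fancy — a minimal sketch (the threshold value and counts are made up):</p>

```python
def publish_with_threshold(yes_count, threshold):
    """Release the count only if it reaches the threshold;
    otherwise, suppress the result entirely."""
    return yes_count if yes_count >= threshold else None

# Rare event: only 3 people answered Yes, so nothing is published.
suppressed = publish_with_threshold(3, threshold=20)
# Common event: the count clears the threshold and is released as-is.
released = publish_with_threshold(47, threshold=20)
```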
<p>Putting these two results together, we get what we were looking for: a formal
interpretation of the privacy of <span class="math">\(k\)</span>-anonymity… Or, at least, of the simplest
possible <span class="math">\(k\)</span>-anonymity scheme, with a long list of assumptions. The high-level
intuition is as follows. For each group in the output:</p>
<ul>
<li>either there are many (more than <span class="math">\(k\)</span>) people in this group, and then the
attacker will have some uncertainty, because they don't know everyone there;</li>
<li>or there aren't that many people in the group, but then the count isn't
published, which protects the people in it.</li>
</ul>
<p>This is Theorem 5 in the <a href="https://arxiv.org/abs/1905.00650">paper</a>. Mission accomplished! Right?</p>
<p>Well… Not really. For this theorem to hold, we have to make a number of weird
assumptions. The "uncertainty distribution" <span class="math">\(\theta\)</span> has to have <em>exactly</em> the
right shape. There needs to be a "special group" whose count we don't publish,
regardless of the number of people in it. And as we'll see in the rest of this
article, the definition of privacy itself is more than a little shaky.</p>
<h2 id="a-distinction-between-active-and-passive-attackers">A distinction between active and passive attackers</h2>
<p>A first difficulty we identified is the need to distinguish between
<em>active</em> and <em>passive</em> attackers. When the attacker has partial knowledge of the
input data, this can mean two things.</p>
<ul>
<li>The attacker can inject some fake records in our data: they're <em>active</em>.</li>
<li>Or the attacker can get information on some records, but not influence them:
they're <em>passive</em>.</li>
</ul>
<p>Both situations can be realistic in different scenarios. For example, say you
call people on the phone to take part in a survey. Then, an attacker might
know some participants, but not be able to answer in their place. But what if
you're releasing statistics about the use of an online service? Then, an
attacker might be able to spin up bots to interact with it, and create
artificial data.</p>
<p>With "real" differential privacy, this doesn't matter. Giving the attacker
control over all the records doesn't change anything: you have to reason about
the worst-case scenario anyway, so you end up with the same definition.</p>
<p>But with partial knowledge, you do get two distinct versions of the definition,
depending on the attacker's capability. Some results, like the one about
thresholding, only hold in the passive version of the definition. That makes
sense: if the attacker can inject fake records, they can artificially boost low
numbers to make them go above the threshold. In this case, thresholding as a
privacy mitigation is pointless.</p>
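<p>To make the attack concrete, here is a hypothetical sketch (all the numbers are invented): the attacker injects enough fake <em>Yes</em> records to guarantee publication, then subtracts everything they know or control.</p>

```python
def publish_with_threshold(yes_count, threshold):
    return yes_count if yes_count >= threshold else None

def recover_target_vote(target_vote, threshold=20, known_yes=4):
    """Hypothetical active attack: inject `threshold` fake Yes records
    so the count gets published whatever the target voted, then
    subtract everything the attacker knows or controls."""
    fake_records = threshold
    published = publish_with_threshold(
        known_yes + fake_records + target_vote, threshold)
    return published - known_yes - fake_records

# The threshold no longer hides anything: the attack recovers the
# secret vote exactly, for either value.
assert recover_target_vote(1) == 1
assert recover_target_vote(0) == 0
```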
<p>In fact, this concern isn't just theoretical. In a <a href="https://sciendo.com/article/10.2478/popets-2019-0013">paper investigating
Facebook's ad targeting system</a>, researchers used a similar method: by
carefully controlling the data and queries, they negated a threshold-based
protection.</p>
<h2 id="problems-with-dependencies-in-the-data">Problems with dependencies in the data</h2>
<p>Second, you might have raised your eyebrows at one of the hypotheses we made to
prove our initial results: all records had to be independent. This assumption
turns out to be both very important and quite brittle.</p>
<ul>
<li>If we don't make this assumption, then the math gets difficult <em>fast</em>. You
need to find a way to model possible dependencies, and take them into account
when calculating the privacy guarantees. I'm not sure anybody has found a
tractable way to do this. I tried and failed to get anything convincing.</li>
<li>But if this assumption is false, and the attacker can use some knowledge about
the correlations in the data… Then everything breaks down. The privacy
guarantees no longer hold, and maybe the attacker can get all the data they
want.</li>
</ul>
<p>We also found an additional subtlety. When modeling the attacker's partial
knowledge, it is <em>crucial</em> to distinguish what they know from what they want to
achieve. If there are correlations between their partial knowledge and the
sensitive information they are trying to find out… Then the definition is
meaningless: you end up in situations where an attacker with <em>more</em> knowledge
of the data is <em>less</em> powerful than one with less.</p>
<p>We're not the first to point out such problems: <a href="https://people.reed.edu/~agroce/FOCS2013.pdf">Bassily et al.</a> showed that
in certain cases, the original notion can be buggy, and proposed an alternative.
But we showed that this alternative can <em>also</em> have fundamental problems. This
separation between partial knowledge and sensitive information seems essential
to fix them.</p>
<p>This requirement, though, makes these definitions really tricky to use in
practice. We have to think hard about what the attacker wants to achieve, and
what they might know about the data. And if we're wrong, then everything we
proved might be meaningless.</p>
<h2 id="difficulties-making-composition-work">Difficulties making composition work</h2>
<p>Finally, these variants of differential privacy don't <a href="differential-privacy-awesomeness.html#composition">compose</a> very well.
Imagine a database of referendum votes, where each person is associated with
their vote. Then, consider the two following queries (without noise).</p>
<ul>
<li>The count of everybody who voted <em>Yes</em>.</li>
<li>The count of everybody who voted <em>Yes</em>, excluding a specific user <span class="math">\(i\)</span>.</li>
</ul>
<p>Each query, on its own, might be considered private in the partial knowledge
setting: if the attacker has uncertainty over many votes, each query result
won't give a lot of information about individual votes. But, of course,
combining both results gives you the vote of user <span class="math">\(i\)</span>. So, under
composition, the privacy loss doesn't degrade smoothly, like it does for
differential privacy. Instead, it can very quickly become catastrophic.</p>
<p>Of course, this counterexample is a little artificial. It's natural to wonder:
can we find a simple condition on queries allowing us to prove a composition
result? We investigated, and found possible ways of achieving composition… But
nothing was really natural and convincing.</p>
<p>The problem is, again, correlations: if queries give correlated results, then
the math breaks down. That's bad news: the queries can't touch the same data,
otherwise, the results are correlated. What if the queries look at different
columns of a dataset? You need one more assumption: the different columns in the
data must be uncorrelated. And that's pretty unrealistic.</p>
<p>One option is a little more viable: databases that constantly get new data. This
situation is common in practical scenarios. And it seems natural to assume that
this new data might <em>add</em> to the attacker's uncertainty… So, if we require that
each new query gets some "fresh" data, we can get some composition results.
They're quite preliminary for now, but maybe worth investigating further.</p>
<h1 id="conclusion-perspectives">Conclusion & perspectives</h1>
<p>I'm happy about some of the progress we made on this problem. We found important
issues with prior work, and proposed a more robust definitional framework. The
link we established with shuffled DP is interesting, and somewhat promising. I'm
hopeful that both aspects might end up being useful to folks doing further
research in this area. </p>
<p>But while the promise of utilizing an attacker's uncertainty in DP is alluring,
a closer look revealed big challenges with this line of work. I'm not sure I'm
optimistic about making this idea work well enough for real-world use cases. The
assumptions that are necessary to make this work are too brittle and
unrealistic. The math seems to get messy too fast.</p>
<p>While I was working on this research project, I was also working on building
infrastructure for differential privacy. And over time, I became convinced that
the gap between DP theory and practice was <em>usability</em>. We'll get to widespread
adoption by building better tools and doing a better job helping people use
them. I even made a whole <a href="bottom-up-differential-privacy.html">talk</a> about this idea since then.</p>
<p>This work also changed my mind about the assumptions behind differential
privacy. They're not unrealistic, or too strong: instead, they seem necessary to
get the basic properties you want out of a convincing notion of privacy. They
might be an over-approximation of the capabilities of realistic attackers… But
reducing these assumptions is dangerous, costs a lot, and doesn't buy much.</p>
<p>If you've found this post interesting, or if you disagree with some of the
points I made in it, let me know! My contact info is at the footer of this page,
and I'd love to hear more perspectives about this problem.</p>
<hr>
<p><small>Thanks to <a href="https://mohammadi.eu/">Esfandiar Mohammadi</a> and <a href="https://people.inf.ethz.ch/basin/">David Basin</a> for
their helpful feedback on drafts of this post.</small></p>
A bottom-up approach to making differential privacy ubiquitous2022-03-09T00:00:00+01:002022-03-09T00:00:00+01:00Damien Desfontainestag:desfontain.es,2022-03-09:/privacy/bottom-up-differential-privacy.html<p>This post contains the slides and speaker notes for an invited talk I delivered at PPAI-22.</p><p>This post is a transcript of an invited talk I delivered to
<a href="https://aaai-ppai22.github.io/">PPAI-22</a>. It was also published on the <a href="https://www.tmlt.io/resources">Tumult
Labs website</a>. <a href="https://users.cs.duke.edu/~ashwin/">Ashwin Machanavajjhala</a>, <a href="https://people.cs.umass.edu/~miklau/">Gerome
Miklau</a>, <a href="https://www.linkedin.com/in/philip-bohannon-88624a2">Philip Bohannon</a>, and <a href="https://www.linkedin.com/in/samuel-haney-47a16819b">Sam Haney</a> contributed to
these slides.</p>
<hr>
<p>Hi everybody! Here is a graph counting the number of academic papers related to
differential privacy, over time.</p>
<p><center>
<img alt="Line graph labeled "year" on the horizontal axis, and "publications" on the
vertical axis. It goes up from 0 in 2006 to approximately 4300 in
2021." src="https://desfontain.es/privacy/images/ppai-22-talk-00.png">
</center> </p>
<p>In academia, differential privacy essentially won. There is broad agreement, at
least among computer scientists, that this is the notion of choice to formally
bound the privacy leakage when publishing data. Differential privacy has become
the default tool that people use to quantify trade-offs between privacy and
accuracy.</p>
<p>The field is growing every year, with exciting new domains of application,
empirical improvements, and theoretical advances.</p>
<p>For comparison, here is a graph showing the number of <a href="real-world-differential-privacy.html">real-world
deployments</a> that I could find public information about. </p>
<p><center>
<img alt="The same graph, except the vertical axis is labeled "deployments", and the line is completely flat, at 0." src="https://desfontain.es/privacy/images/ppai-22-talk-01.png">
</center> </p>
<p>As you can see… Ooops! Sorry. I forgot to change the scale of the vertical axis.</p>
<p><center>
<img alt="The same graph, but the vertical axis now goes from 0 to 10. Besides a single
point at 1 in 2008, the line starts going up in 2016, and reaches 6 in
2021." src="https://desfontain.es/privacy/images/ppai-22-talk-02.png">
</center> </p>
<p>OK, now we’re seeing something… It’s not much, though. We’re still at a stage
where I can list all public deployments of differential privacy in a <a href="real-world-differential-privacy.html">single
blog post</a>.</p>
<p>I know what you’re going to say, though. There might not be many use cases, but
some of these are <em>really big</em>. </p>
<p><center>
<img alt="A slide containing the logos for the following organizations: the U.S. Census
Bureau, Google, Facebook, Microsoft, LinkedIn, the Internal Revenue Service,
Apple, and OhmConnect." src="https://desfontain.es/privacy/images/ppai-22-talk-03.png">
</center> </p>
<p>The <a href="https://www.census.gov/programs-surveys/decennial-census/decade/2020/planning-management/process/disclosure-avoidance.html">2020 Decennial Census</a>! Mobility data from around the globe to
<a href="https://arxiv.org/abs/2004.04145">help combat COVID-19</a>! <a href="https://www.microsoft.com/en-us/research/publication/collecting-telemetry-data-privately/">Telemetry collection</a> from <a href="https://docs-assets.developer.apple.com/ml-research/papers/learning-with-privacy-at-scale.pdf">billions of
devices</a>!</p>
<p>If we look at who is deploying DP, one thing in common for almost all of these
organizations is that they’re <em>large</em>. They can afford to invest in, or contract
with, specialized science and engineering teams to help them roll out this
technology.</p>
<p>Special mention to <a href="https://www.ohmconnect.com/">OhmConnect</a>, the only exception I could find to
this rule! They’re a startup sharing <a href="https://assets.website-files.com/5cb0a177570549b5f11b9550/5ffddb83b5ea5d67f5c43661_Quantifying%20The%20OhmConnect%20Virtual%20Power%20Plant%20During%20the%20California%20Blackouts.pdf">smart meter data</a> to increase
power grid reliability.</p>
<p>But the problem that differential privacy solves isn’t limited to these
massively large organizations: smaller organizations also have data sharing and
publishing needs! Everyone could benefit from using strong anonymization
techniques, not just these giants.</p>
<p>This is the question I’m here to talk about today. How do we bridge that gap?
How do we make differential privacy ubiquitous?</p>
<p><center>
<img alt="The introductory slide of a talk titled "A bottom-up approach to making
differential privacy ubiquitous". The slide contains author information (Damien
Desfontaines, @TedOnPrivacy), affiliation (the Tumult Labs logo), and a
copyright notice. The text is in white, the image behind is a series of
perturbed purple lines on a black
background." src="https://desfontain.es/privacy/images/ppai-22-talk-04.png">
</center> </p>
<p>I’m <a href="/serious.html">Damien</a>, and I work as a scientist for <a href="https://tmlt.io">Tumult
Labs</a>. We’re a startup trying to make widespread adoption of
differential privacy into a reality.</p>
<p>In this presentation, I’ll outline a <em>bottom-up</em> approach for reaching that
goal.</p>
<p>What do I mean by “bottom-up”? Well, first, here’s what a “top-down” approach
could look like.</p>
<p><center>
<img alt="A slide split in two. On the left, the title is "Top-down", and lists three
bullet points: "Lobby decision makers", "Get DP into standards, laws, internal
best practices…", and "Adoption follows". On the right, the slide is
empty." src="https://desfontain.es/privacy/images/ppai-22-talk-05-01.png">
</center> </p>
<ul>
<li>First, we lobby decision-makers: we convince executives, regulators, standard
committees, etc., that differential privacy should be the notion of choice for
anonymizing data.</li>
<li>Then, once DP has become a requirement in different places, like standards,
regulations, internal best practices, etc.…</li>
<li>People adopt it because they have to.</li>
</ul>
<p>This comes with many challenges.</p>
<ul>
<li>First, writing good policy documents and guidance is very difficult. We would
need to answer questions like “how to choose parameters”, which are already
tricky for specific use cases, but even harder to decide on in generic terms.</li>
<li>Second, people won’t sign off on a technology unless they’re convinced it can
work in practice. Differential privacy needs to prove itself in the field, in
sufficiently many cases, in each vertical it can be applied to, before that
happens.</li>
<li>Finally, when privacy/security efforts are compliance-oriented, implementation
can often be people doing the bare minimum. That might not be too great.</li>
</ul>
<p>To be clear: this outreach work with key decision-makers is still valuable, and
worth doing! But this isn’t what we’re focusing on in the immediate future.</p>
<p>Instead, we’re pushing for a bottom-up approach.</p>
<p><center>
<img alt="The same slide as before, with the right part filled in. The title is
"Bottom-up", and lists three bullet points: "Get data scientists excited",
"Train them to become DP practitioners and advocates", and "Make deployment
super easy"." src="https://desfontain.es/privacy/images/ppai-22-talk-05-02.png">
</center> </p>
<ul>
<li>First, we create excitement among data scientists and engineers, across the
industry.</li>
<li>Then, we create a well-lit path that makes it super easy for these people to
go from “I’m curious what this technology can do” to “I know exactly how it
will work for my use case!”</li>
<li>And we create tools to make this entire process super easy, all the way to
deployment.</li>
</ul>
<p>Sign-off from decision-makers happens at the end of the process, not at the
beginning. By that time, all they need to do is confirm that it works.</p>
<p>Note that as far as I know, this is what happened for most of the deployments
that I could find described publicly. There wasn’t an executive giving the order
to use differential privacy, out of the blue. Instead, individual teams of
engineers and scientists built prototypes, showed that it worked in practice,
did internal advocacy, and eventually got the go-ahead.</p>
<p>So, how do we enable many more people and organizations to get to this point?</p>
<p>We have a vision, and we need your help.</p>
<p>We said that the way to get adoption is to make usable tools for differential
privacy, and train people to use them. By the end, we want thousands of
engineers and data analysts to become DP practitioners. What does that learning
path look like in practice?</p>
<p><center>
<img alt="A slide containing a horizontal arrow labeled "Time invested / level of
expertise gained.", with 8 boxes describing different points on the line:
"Downloading and installing an open-source library", "Following tutorials",
"Generating DP data for the first time", "Iterating to optimize privacy-accuracy
tradeoffs", "Selecting parameters", "Getting sign-off from decision makers",
"Figuring out operational issues, preparing to deploy", and a rocket ship
emoji." src="https://desfontain.es/privacy/images/ppai-22-talk-06.png">
</center> </p>
<ul>
<li>First, people might hear about differential privacy, and decide to give it a
try, using an open-source tool.</li>
<li>They’ll follow tutorials to get the hang of it…</li>
<li>… and maybe reach a point where they’re giving it a first try on their own
data.</li>
<li>Then, they’ll probably need to optimize privacy-accuracy trade-offs…</li>
<li>… and if they’re convinced that this is workable, start thinking of which
parameters would make sense for their use case.</li>
<li>Once they get the sign-off from their hierarchy…</li>
<li>… they will need to do a bunch of operational deployment stuff …</li>
<li>… and end up shipping a differentially private data release.</li>
</ul>
<p><center>
<img alt="A graph where the horizontal axis is labeled "Time invested / level of
expertise gained.", the vertical axis is labeled "Number of users", and a dashed
vertical line near the right of the graph is labeled "Expertise necessary to
ship a given production use case". The OhmConnect logo is on the right side of
this dashed line." src="https://desfontain.es/privacy/images/ppai-22-talk-07.png">
</center> </p>
<p>Our goal is to get people to that point on the right, where they can deploy DP
to production.</p>
<p>As I mentioned earlier, I could find one small company that went further than
this line. We want to get to many more such examples.</p>
<p>In fact, we’re going to visually represent how many people there are at each
step of the process.</p>
<p><center>
<img alt="The same graph as earlier, with a line starting halfway to the vertical axis,
and going down in cliffs, reaching the horizontal line before the "shipping to
production" indicator. Each cliff is labeled: "I don’t understand half the words
in this interface & docs!", "What are all these new parameters I need to
specify? I’ve never needed that in SQL!", "Wow, my initial results are basically
pure noise. What do I do about it?!", and "I can’t predict nor explain how
accurate my output data is going to be. I can’t ship
this!"" src="https://desfontain.es/privacy/images/ppai-22-talk-08.png">
</center> </p>
<p>Today, the curve might look like this. Even though there might be some initial
interest, almost nobody ends up crossing that line we’re interested in. So why
is that?</p>
<p>We don’t know the answer for sure, but we’ve heard of a lot of hurdles that
people encounter when trying to roll out differential privacy. Here are a few of
them.</p>
<ul>
<li>First off, people might be immediately put off by how complex the tooling
looks. If the interface and documentation look like they were designed for people
who already know what they’re doing, they might simply think “OK, I’m not the
target audience”, and give up before even trying it out.</li>
<li>Second, DP comes with additional requirements: things like group-by keys, or
clamping bounds, are new concepts that people never had to think about before.
People might think: why is it so hard to do even basic things that would take
me 3 lines in SQL? And drop off, thinking that it’s just going to get worse
from there.</li>
<li>If people reach the point of trying it out on their own data, the initial
results might be absolute garbage, because the strategy is extremely
sub-optimal. This can be demoralizing, and make people feel like they won't
ever make this work.</li>
<li>Even if the results end up looking kind of reasonable when plotted on a graph,
this might not be enough. People need stronger guarantees on how accurate the
data is, and if the tool doesn’t provide this, this might also be a hard
blocker.</li>
</ul>
<p><center>
<img alt="The same graph as earlier, but this time, the line starts higher, and only
goes down a little, before crossing the "deployment" line a little higher than
half-way. A circle marks the intersection
point." src="https://desfontain.es/privacy/images/ppai-22-talk-09.png">
</center> </p>
<p>Instead, this is where we want to be. We won’t ever make the line entirely flat;
that’s normal: every software project loses users in the learning process. But
our goal is to maximize the number of people reaching deployment. We want the
intersection point to be as high as possible.</p>
<p>OK, so how do we do it?</p>
<p><center>
<img alt="The same graph as earlier, with an arrow pointing up next to the beginning of
the line. This early stage is labeled "Marketing efforts, blog posts,
outreach…"." src="https://desfontain.es/privacy/images/ppai-22-talk-10.png">
</center> </p>
<p>First, we make the line start higher. We want to get as many people as we can who know that differential privacy exists, and have an idea of what problem it solves. Further than that, we want to get them excited about trying it out.</p>
<p>Some of this is the job of companies like mine: we need to do a good job at
marketing this technology.</p>
<p>But researchers can also help there: we need many more resources that are
accessible to beginners, like blog posts! These serve the dual purpose of
helping people learn, and of making our field more widely known to the public.</p>
<p>We also need y’all to contribute to the public discussions around data privacy,
beyond research papers. There can be many examples of that kind of work.</p>
<ul>
<li>Opinion pieces in scientific publications, like the ACM magazines, or in
newspapers, can do wonders to raise awareness.</li>
<li>Participating in events or workshops discussing adjacent problems can be a
great way to broaden your research horizons, and to socialize with people outside of your
usual research community. Attending non-academic events around data privacy,
in particular, can be eye-opening.</li>
<li>Finally, getting into the contact list of a tech journalist whose work you
follow is easier than you think! Reach out and let them know that you’re happy
to comment on technical topics in your area of expertise, and they’ll be happy
to take you up on the offer some time.</li>
</ul>
<p>All of these can have a major impact. Thanks to everyone who is already doing
this kind of work today! We need even more.</p>
<p>OK, once we’ve made that line start as high as we can, what do we do next?</p>
<p><center>
<img alt="The same graph as earlier, with a dashed angle showing that the line must
initially go down not too fast. This is labeled "Simple interfaces, great
learning docs"" src="https://desfontain.es/privacy/images/ppai-22-talk-11.png">
</center> </p>
<p>We make sure that we lose as few people as we can in the initial learning
stages. We make that curve as flat as possible, avoiding those cliffs from
earlier. We make the learning process as smooth as we can.</p>
<p>To do that, we need interfaces that are super simple to use, and a great
onboarding experience. Let me give you a sneak peek of what the interface looks
like on our platform right now.</p>
<p><center>
<img alt="A Python code snippet.
session = Session.from_dataframe(
dataframe=private_data,
source_id="my_data",
privacy_budget=PureDPBudget(1.7),
)
query = (
QueryBuilder("my_data")
.filter("age > 42")
.groupby(zip_codes)
.median("income", low=0, high=10**6)
)" src="https://desfontain.es/privacy/images/ppai-22-talk-12.png">
</center> </p>
<p>Our platform is built in Python, and runs on top of Spark, so we can scale to
very large datasets.</p>
<ol>
<li>To use it, you start by defining a <em>session</em>. This session encapsulates your
data, given as a Spark dataframe, and gives you clear privacy guarantees.
Here, the library promises you that everything downstream of this session
will satisfy differential privacy, with ε=1.7.</li>
<li>Then, you write a <em>query</em>, using a Spark-like query language. Here, the query
filters the records to only keep the individuals older than 42, then we group
by zip codes, and we compute the median income for each zip code.</li>
<li>Finally, you <em>evaluate</em> the query using a portion of your privacy budget,
here, 0.8. The result is a regular Spark dataframe. We could, later, evaluate
further queries, as long as we don’t spend more budget than was initially
allocated.</li>
</ol>
<p>There are a couple of things that will still seem unfamiliar to data scientists
without prior experience with differential privacy: the way we specify group-by
keys, for example, or clamping bounds. We’re working hard to make these even
simpler and more accessible.</p>
<p>That’s what we’re doing. Now, what can the academic community do to help users
in this initial learning stage?</p>
<p><center>
<img alt="The same graph as earlier, with the dashed angle. This time, the label
shows "usable algorithms" in bold, in addition to "Simple interfaces, great
learning docs"." src="https://desfontain.es/privacy/images/ppai-22-talk-13.png">
</center> </p>
<p>One way is by thinking about usability when designing new techniques to achieve
DP. How many choices will a user have to think about before using a given
mechanism?</p>
<p>Suppose, for example, that you found a novel way to compute quantiles. It’s
better than the state of the art, but it introduces new hyperparameters: for
example, you need to discretize the data first, and the user can choose the
granularity. Can you recommend a good default for this new parameter? If there
is no universally reasonable choice, can you automatically and privately select
it based on the data, using some portion of the budget?</p>
<p>Doing this will make it much more likely that people can use your fancy
algorithms, even in the early stages of the process.</p>
<p>Ok, so that was the initial learning stage. What comes next?</p>
<p><center>
<img alt="The same graph as earlier, with the dashed angle further on the right,
closer to the deployment line, indicating that the line should become flat. The
label says "advanced features"." src="https://desfontain.es/privacy/images/ppai-22-talk-14.png">
</center> </p>
<p>As people start moving out of the learning phase and closer to the deployment
stage, we need to keep supporting them. Again, we want that curve to stay flat.
We don’t want people to drop off just before the finish line.</p>
<p>Once people start trying to ship their initial idea to production, they start
having more complex needs that require advanced features. For example, in one of
the data releases we’re working on with the US Census Bureau, the goal is to
publish statistics on various population groups, depending on geographic and
racial characteristics. These groups can vary tremendously in size: in some
cases, we can only publish total counts with reasonable accuracy, while in
others, we also want to split these groups into more fine-grained categories.</p>
<p>This requires an adaptive algorithm. Let me show you what it looks like in our
interface. It’s going to be a little more complicated than the previous example,
but don't worry, I’ll walk you through it step by step.</p>
<p><center>
<img alt="Three code snippets with accompanying visuals. //
budget_10 = RhoZCDP(total_budget / 10.)
budget_90 = RhoZCDP(total_budget * 9./10.)
This is represented by a pie chart splitting a disc in 1/10 and 9/10. //
histogram = session.evaluate(
QueryBuilder("data")
.groupby(geo_races)
.count(),
privacy_budget=budget_10,
)
This is represented by a histogram, and uses the 1/10 part of the privacy budget
pie. //
hist_with_category = histogram.withColumn(
"under_threshold",
"IF(count < 42, 'true', 'false')",
)
This is represented by the same histogram, with a horizontal dashed line
determining whether each bucket is above or below the threshold, and marking
buckets with different colors accordingly." src="https://desfontain.es/privacy/images/ppai-22-talk-15.png">
</center> </p>
<p>First, we set aside 10% of our total budget. Here, we use zero-concentrated DP,
because each individual will contribute to many statistics, so we’re using
Gaussian noise and tight privacy accounting methods. Note that here, switching
to another privacy definition is as simple as changing the privacy budget type:
the framework is extensible enough to make this kind of operation very easy.</p>
<p>We then use that budget to compute, for each population group, a total count of
people. Here, we group by geography and race/ethnicity combinations.</p>
<p>Then, we augment these results by checking, for each of these groups, whether
the count is below or above a certain threshold. Later on, we will want to do
different things depending on the value of this column.</p>
<p><center>
<img alt="Two code snippets with accompanying visuals. //
session.create_view(
QueryBuilder("data")
.join_public(hist_with_category),
"data_with_category",
)
This is represented by the same histogram split in two as earlier. //
budget_10 = RhoZCDP(total_budget / 10.)
budget_90 = RhoZCDP(total_budget * 9./10.)
This is represented by a pie chart splitting a disc in 1/10 and 9/10.
new_sessions = session.partition_and_create(
"data_with_category",
privacy_budget=budget_90,
attr_name="under_threshold",
splits={
"total": "true",
"detail": "false"
},
)
This is represented by two arrows leading to two separate histograms, one with
the buckets above the threshold, one with the buckets under the
threshold. This uses the 9/10 part of the privacy budget
pie." src="https://desfontain.es/privacy/images/ppai-22-talk-16.png">
</center> </p>
<p>At this point, we have a table that tells us, for each group, whether the noisy
count of people is above or below a threshold.</p>
<ul>
<li>We join our secret data with this augmented table. Each
individual record is now associated with additional information telling us
whether the group they’re a part of has a count below or above the threshold.</li>
<li>And once we have this new, augmented private data set, we partition the
session into two sessions, depending on the value of this column. One session
will have part of the data, the other will have the rest. These sessions are
allocated a given privacy budget; here, we use the entirety of the privacy
budget we have left. That budget is depleted from the original session, and
transferred to the new sessions.</li>
</ul>
<p><center>
<img alt="Two code snippets with accompanying visuals. //
total_session = new_sessions["total"]
total_counts = total_session.evaluate(
QueryBuilder("data_with_category")
.groupby(geo_races)
.count(),
privacy_budget=budget_90,
)
This is represented by the smaller histogram changing a little bit, using the
9/10 part of the budget. //
detail_session = new_sessions["detail"]
detail_counts = detail_session.evaluate(
QueryBuilder("data_with_category")
.groupby(geo_races * age)
.count(),
privacy_budget=budget_90,
)
This is represented by the larger histogram, where each bucket is split in three
sub-buckets, using the 9/10 part of the
budget." src="https://desfontain.es/privacy/images/ppai-22-talk-17.png">
</center> </p>
<p>Now, we have two sessions, so we can do different things in each.</p>
<ul>
<li>For the records that are in small groups, we only compute the total counts,
with the rest of the privacy budget. This is the same aggregation as earlier,
but with more precise results.</li>
<li>And for the records that are in bigger groups, we compute the counts at a more
granular level, also including age. Here, the multiplication operator in the
groupby corresponds to taking the cross-product of groupby keys for different
attributes.</li>
</ul>
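The adaptive pattern described above — threshold the noisy totals, then publish coarse counts for small groups and finer counts for large ones — can be sketched in plain Python. This is a simplified toy with made-up groups, thresholds, and noise scales, not the production algorithm or the Tumult API:

```python
import random

def noisy_count(true_count, sigma):
    """Count with Gaussian noise, as used under zero-concentrated DP."""
    return true_count + random.gauss(0.0, sigma)

# Toy records: (population_group, age_bucket). Group sizes vary a lot.
records = [("A", "0-17")] * 3 + [("B", "18-64")] * 200 + [("B", "65+")] * 150
groups = {g for g, _ in records}

# Step 1: spend a small slice of the budget on rough per-group totals.
rough_totals = {g: noisy_count(sum(1 for r in records if r[0] == g), sigma=2.0)
                for g in groups}

# Step 2: partition the groups by thresholding the *noisy* totals.
THRESHOLD = 42
small_groups = {g for g, c in rough_totals.items() if c < THRESHOLD}
large_groups = groups - small_groups

# Step 3a: small groups only get a total count, recomputed with the rest
# of the budget (so with less noise than the rough pass).
total_counts = {g: noisy_count(sum(1 for r in records if r[0] == g), sigma=0.5)
                for g in small_groups}

# Step 3b: large groups get finer-grained counts, split by age bucket.
detail_counts = {
    (g, a): noisy_count(sum(1 for r in records if r == (g, a)), sigma=0.5)
    for g in large_groups
    for a in {age for gg, age in records if gg == g}
}
```

With these toy numbers, group A (3 people) only gets a total count, while group B (350 people) gets per-age-bucket counts.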
<p>Of course, the actual algorithm is a lot more complicated. We actually split the
data in four different levels of granularity, so we have three different
thresholds, ages are bucketed, we use more demographic information, and we
compute a lot more things. But hopefully, this gives you an idea of what
real-world use cases can require, and what kind of advanced features our
platform can support.</p>
<p>OK, so that’s an example of what we’re doing to support people at this stage.
Can the academic community also help flatten this curve, and make it easier for
people to reach deployment?</p>
<p><center>
<img alt="The same graph as before the code snippets, with the dashed angle at the same
place as before. The label now says "and so. many. open problems." in addition
to "advanced features"." src="https://desfontain.es/privacy/images/ppai-22-talk-18.png">
</center> </p>
<p>Yes. Yes, yes, yes. In this path towards deployment, there are so many open
problems. People routinely need things that don’t exist yet. Our customers are
constantly asking us for very reasonable things, natural requests that… turn out
to be open science problems. This is why we’re <a href="https://tmlt.io/careers">hiring scientists</a>, by
the way.</p>
<p>Let me give you a distilled list of areas where we desperately need more
progress in research and engineering.</p>
<p><center>
<img alt="A slide titled "Open problems", listing four areas: "Explainability /
Transparency", "Decision support", "Fitness-for-use", and "Operational
challenges"." src="https://desfontain.es/privacy/images/ppai-22-talk-19.png">
</center> </p>
<p>First, explainability and transparency. Releasing private error measures along
with the DP output is easy for simple mechanisms, but still hard for things like
quantiles, or when clamping bounds are involved. More generally, can we explain
to non-expert users what was done to the data? Can we give them a summary they
can understand and use in later analyses? Finally, DP algorithms can introduce
biases in the data – can we make these transparent, and allow data users to take
them into account? Explainability and transparency are absolutely critical to
build trust, and trust is key to adoption. We, as a field, need to have better
answers for these questions.</p>
<p>A second one is decision support tools. Dashboards and visualizations that allow
people to understand the privacy/accuracy trade-offs in their data, and
fine-tune parameters, are critical. In our experience, this is often what makes
people “get it”, and makes them feel like they can actually use this tech. This
is a promising area of research for visualization and usable privacy folks, but
there are also complicated algorithmic questions here: how do we do that
efficiently?</p>
<p>When people want to generate DP data, they don’t want to specify a budget:
instead, it would be much nicer if they could decide what level of data quality
is fit-for-use – good enough for their use case – and specify that as input to
the algorithm. Note that these data quality measures are often interpreted as
the error of a single noisy estimate or parameter. But in real use cases, it can
be a lot more complex: for example, will the relative ranking of items based on
noisy estimates be approximately correct?</p>
<p>Finally, operational aspects of DP are critical. How do we keep a good
accounting of the privacy loss over time, for data releases that happen every
day or week? How can we validate that the DP data is correct before publishing
it? How should we handle failures? How do we detect drifts in accuracy, and how
should we handle these alerts? Work on these topics is starting to emerge in
academia, but there is a lot more to be done.</p>
<p><center>
<img alt="The same graph as before, but the dashed angle has been replaced with an arrow
starting at the dashed deployment line, and pointing to the left. It is labeled
"More data!"." src="https://desfontain.es/privacy/images/ppai-22-talk-20.png">
</center> </p>
<p>Back to our curve. One last thing we can do is lower the level of expertise
necessary to ship DP to production as much as possible. The faster people
get there, the less likely they are to drop off.</p>
<p>This first requires more data: what do people actually need to do? Once we know,
we can build the advanced features that people need, and build interfaces that
make them easier to use.</p>
<p>The call to action here is: if you know of more people using differential
privacy in practice, try convincing them to communicate about this! Even when
there isn’t novel science involved, it’s still worth telling the world what you
did, and ideally, why you did it. This way, we can learn what problems people
encounter in practice, and which issues are most pressing to solve to
increase adoption. As an added bonus, communicating about your use case for
differential privacy is a great way to foster trust among stakeholders, and to
convince other people to also try using DP!</p>
<p><center>
<img alt="A slide split in two, titled "Two more ways to help". On the left, it says
"Beta-test the Tumult Platform!", and links to tmlt.io/connect. On the right,
"Join our team!", with a link to
tmlt.io/careers." src="https://desfontain.es/privacy/images/ppai-22-talk-21.png">
</center> </p>
<p>There are two more ways you can help.</p>
<ul>
<li>If the little code snippets I showed sounded interesting, and you’d like to
play with our platform and give us feedback, <a href="https://tmlt.io/connect">let us know</a>! We’re
happy to give you a preview before our open-source launch later this year.</li>
<li>Finally, if you’d like to apply your research skills to hard, impactful
real-world problems, and work with a great team, <a href="https://tmlt.io/careers">drop us a line</a>!
We’re looking for scientists in Europe and in the US.</li>
</ul>
<p><center>
<img alt="An outro slide, saying "Thanks ♥", displaying author information, the
Tumult Labs logo, and the two links from the previous
slide." src="https://desfontain.es/privacy/images/ppai-22-talk-22.png">
</center> </p>
<p>Thanks again for the invitation and for attending this presentation! I'm looking
forward to your questions, and I'm also happy to continue the conversation by
email or via Twitter.</p>Averaging risk: Rényi DP & zero-concentrated DP2022-02-20T00:00:00+01:002022-11-13T00:00:00+01:00Damien Desfontainestag:desfontain.es,2022-02-20:/privacy/renyi-dp-zero-concentrated-dp.html<p>Averaging the privacy loss random variable across outputs can be useful: introducing Rényi DP, and zero-concentrated DP.</p><p><small>
<span class='notlettrine'>T</span>his post is part of a <a href="friendly-intro-to-differential-privacy.html">series on differential
privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see the other
articles!</p>
<p></small></p>
<hr>
<p><span class='lettrine'>R</span><strong>emember</strong> the <a href="privacy-loss-random-variable.html">privacy loss random
variable</a> (PLRV)? We saw that it described all values that the privacy
leakage could take, and how likely each one was. And we saw that
<span class="math">\(\varepsilon\)</span>-DP was a <em>worst-case</em> property: the PLRV must <em>always</em> be lower
than <span class="math">\(\varepsilon\)</span>. By contrast, we described <span class="math">\((\varepsilon,\delta)\)</span>-DP as an
"almost worst-case" notion: it left a little wiggle room for rare events with
a privacy loss larger than <span class="math">\(\varepsilon\)</span>.</p>
<p>In this article, we'll use the PLRV in yet another interesting way. Instead of
only looking at its extreme values, we'll look at the <em>average</em> of the PLRV.
This will give us an intuitive explanation of two useful variants of DP: <em>Rényi
differential privacy</em>, and <em>zero-concentrated differential privacy</em>.</p>
<h1 id="basic-idea">Basic idea</h1>
<p>Recall the core intuition behind the privacy loss random variable (PLRV). Say
our secret mechanism <span class="math">\(A\)</span> returns an output <span class="math">\(O\)</span>. The attacker is trying to find
out whether the input was <span class="math">\(D_1\)</span> or <span class="math">\(D_2\)</span>, where both options differ in a single
person's data. The PLRV <span class="math">\(\mathcal{L}_{D_1,D_2}(O)\)</span> was kind of the
"<a href="privacy-loss-random-variable.html">actual</a> <span class="math">\(\varepsilon\)</span> value" for this attacker:
<span class="math">\(e^{\mathcal{L}_{D_1,D_2}(O)}\)</span> was the advantage they gain when observing output
<span class="math">\(O\)</span>. This gave us a reformulation of <span class="math">\(\varepsilon\)</span>-DP, as an <em>upper bound</em> on
this value. If <span class="math">\(A\)</span> is <span class="math">\(\varepsilon\)</span>-DP, then for all possible choices for <span class="math">\(D_1\)</span>
and <span class="math">\(D_2\)</span>, and all possible outputs <span class="math">\(O\)</span>:</p>
<div class="math">$$
\mathcal{L}_{D_1,D_2}(O)\le\varepsilon.
$$</div>
<p>How do we transform this worst-case property into an average case definition?
Two options come to mind.</p>
<ol>
<li>We could average the privacy loss across possible databases <span class="math">\(D_1\)</span> and <span class="math">\(D_2\)</span>…</li>
<li>… or we could average it across possible outputs <span class="math">\(O\)</span>.</li>
</ol>
<p>The first option turns out to be a Bad Idea™, for subtle reasons that I'm not
going to go into here<sup id="fnref:partial"><a class="footnote-ref" href="#fn:partial">1</a></sup>. The second option, however, makes sense. This
is the same "kind" of relaxation as <span class="math">\((\varepsilon,\delta)\)</span>-DP: there's a small
probability that something goes worse than we'd like. Importantly, this
probability <em>doesn't depend on the attacker</em>. It only comes from the algorithm's
randomness, and doesn't require additional assumptions.</p>
<p>There is a significant difference, however. In <span class="math">\((\varepsilon,\delta)\)</span>-DP, there
can be a small probability (at most <span class="math">\(\delta\)</span>) of <em>infinite</em> privacy loss. When
we're averaging the privacy loss, that's no longer allowed. If the privacy loss
is very low 99.99% of the time, but is infinite with probability 0.01%… then
its average is still going to be infinite.</p>
<p>So bounding the average PLRV will be a way of relaxing DP, but without allowing
infinitely bad events. <em>Arbitrarily</em> bad events can still happen, but only with
vanishing probability. Let's formalize this.</p>
<h1 id="renyi-differential-privacy">Rényi differential privacy</h1>
<p>Here's a first attempt at capturing this intuition of averaging risk. For every
pair of databases <span class="math">\(D_1\)</span> and <span class="math">\(D_2\)</span> differing in a single record, we require that:</p>
<div class="math">$$
\underset{O\sim A(D_1)}{\mathbb{E}} \left[\mathcal{L}_{D_1,D_2}(O)\right]\le\varepsilon.
$$</div>
<p>Here, <span class="math">\(\mathbb{E}\)</span> is the <a href="https://en.wikipedia.org/wiki/Expected_value"><em>expected value</em></a>: you weigh each possible event
by its probability. A very bad event can be acceptable if it happens almost
never. This seems to capture our initial intuition.</p>
<p>One problem with this approach, though, is that we're not really averaging the
right thing. The advantage that a <a href="differential-privacy-in-more-detail.html#quantifying">Bayesian attacker</a> can gain is
<span class="math">\(e^\varepsilon\)</span>, not <span class="math">\(\varepsilon\)</span>! So by averaging the privacy loss, we're not
really averaging the risk. Let's show this with a little visualization. The
following plot shows the attacker gain of a made-up mechanism: how much will the
<a href="differential-privacy-in-more-detail.html#quantifying">betting odds</a> change, depending on the random output?</p>
<p><center class="largeimg">
<img alt="Graph showing a made-up distribution for the attacker's gain. With 30% probability, it has value 150, then decreases in a discrete, but exponential, manner." src="https://desfontain.es/privacy/images/example-exp-plrv.svg">
</center> </p>
<p>With this distribution, the expected value of <span class="math">\(\mathcal{L}\)</span> (the "average
<span class="math">\(\varepsilon\)</span>") is around <span class="math">\(3.4\)</span>. This translates to an attacker gain of about
30.</p>
<p><center class="largeimg">
<img alt="Same graph as above, with a red dashed line at y=30. It looks too low to represent the average value." src="https://desfontain.es/privacy/images/average-plrv.svg">
</center> </p>
<p>That doesn't look right. The actual risk is often a <em>lot</em> larger than the "average"!
If we want to average out the risk, we should take the mean of <span class="math">\(e^\mathcal{L}\)</span>
instead. This would be closer to the intuition. The requirement would become…</p>
<div class="math">$$
\underset{O\sim A(D_1)}{\mathbb{E}} \left[e^{\mathcal{L}_{D_1,D_2}(O)}\right] \le e^\varepsilon.
$$</div>
<p>This makes more sense: this can be seen as the arithmetic mean of the risk. If
we plot it, the corresponding <span class="math">\(e^\varepsilon\)</span> is the average of the blue line.</p>
<p><center class="largeimg">
<img alt="Graph with the made-up distribution, with a red dashed line at y=68.5" src="https://desfontain.es/privacy/images/average-exp-plrv.svg">
</center> </p>
<p>This still feels a bit arbitrary. Why not take a different averaging function?
Large values of the privacy loss correspond to the worst events. These are
particularly scary: maybe we want to give them more "weight"? We could do this
using a <a href="https://en.wikipedia.org/wiki/Root_mean_square">quadratic mean</a>, for
example. We would then require something like this:</p>
<div class="math">$$
\underset{O\sim A(D_1)}{\mathbb{E}} \left[\left(e^{\mathcal{L}_{D_1,D_2}(O)}\right)^2\right] \le \left(e^\varepsilon\right)^2.
$$</div>
<p>This gives us a larger average than before.</p>
<p><center class="largeimg">
<img alt="Same graph as before, with a red dashed line at y=90.4" src="https://desfontain.es/privacy/images/average-square-exp-plrv.svg">
</center> </p>
<p>Let's generalize this. To decide <a href="https://en.wikipedia.org/wiki/Generalized_mean">which kind of averaging
function</a> to use, we'll
introduce a parameter <span class="math">\(\alpha\)</span>.</p>
<div class="math">$$
\underset{O\sim A(D_1)}{\mathbb{E}} \left[\left(e^{\mathcal{L}_{D_1,D_2}(O)}\right)^{(\alpha-1)}\right] \le \left(e^\varepsilon\right)^{(\alpha-1)}.
$$</div>
<p>This is <em>Rényi differential privacy</em>. If <span class="math">\(A\)</span> satisfies the above inequality for
all choices of <span class="math">\(D_1\)</span> and <span class="math">\(D_2\)</span>, we say it's <span class="math">\((\alpha,\varepsilon)\)</span>-Rényi
differentially private.</p>
<p>Some special values of <span class="math">\(\alpha\)</span> correspond to common averaging functions.</p>
<ul>
<li><span class="math">\(\alpha\rightarrow1\)</span> bounds the arithmetic mean of <span class="math">\(\mathcal{L}\)</span> or,
equivalently, the <a href="https://en.wikipedia.org/wiki/Geometric_mean">geometric mean</a> of <span class="math">\(e^\mathcal{L}\)</span>;</li>
<li><span class="math">\(\alpha=2\)</span> bounds the <a href="https://en.wikipedia.org/wiki/Arithmetic_mean">arithmetic mean</a> of <span class="math">\(e^\mathcal{L}\)</span>;</li>
<li><span class="math">\(\alpha=3\)</span> bounds the <a href="https://en.wikipedia.org/wiki/Root_mean_square">quadratic mean</a> of <span class="math">\(e^\mathcal{L}\)</span>;</li>
<li><span class="math">\(\alpha=4\)</span> bounds the <a href="https://en.wikipedia.org/wiki/Cubic_mean">cubic mean</a> of <span class="math">\(e^\mathcal{L}\)</span>;</li>
<li>and it's also possible to pick <span class="math">\(\alpha=\infty\)</span>, which bounds the maximum value
of <span class="math">\(e^\mathcal{L}\)</span>: it's then equivalent to <span class="math">\(\varepsilon\)</span>-DP.</li>
</ul>
<p>Let's visualize these options using our previous example.</p>
<p><center class="largeimg">
<img alt="Same graph as before, with four red dashed lines labeled alpha=1, 2, 3, and infinity. The infinity line is at 150, corresponding to the maximum of the blue line." src="https://desfontain.es/privacy/images/multiple-averages-plrv.svg">
</center> </p>
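<p>To make these averaging functions concrete, here is a minimal numeric sketch. The discrete privacy loss distribution below is made up for illustration (it is not the distribution from the plots); the function computes the smallest epsilon satisfying the Rényi DP inequality for a given alpha.</p>

```python
import math

# A made-up discrete privacy loss distribution, as (privacy loss, probability)
# pairs. Illustrative numbers only -- not the distribution from the plots.
plrv = [(5.0, 0.30), (4.0, 0.25), (3.0, 0.20), (2.0, 0.15), (1.0, 0.10)]

def renyi_epsilon(alpha):
    """Smallest epsilon such that E[(e^L)^(alpha-1)] <= (e^epsilon)^(alpha-1),
    i.e. the log of the (alpha-1)-generalized mean of the attacker gain e^L."""
    if alpha == 1.0:
        # Limit case: the arithmetic mean of the privacy loss itself.
        return sum(p * loss for loss, p in plrv)
    moment = sum(p * math.exp((alpha - 1) * loss) for loss, p in plrv)
    return math.log(moment) / (alpha - 1)

for alpha in [1.0, 2.0, 3.0, 100.0]:
    print(f"alpha = {alpha}: epsilon = {renyi_epsilon(alpha):.3f}")
```

<p>As alpha grows, the bound climbs towards the worst case, here the maximum privacy loss of 5.</p>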
<p>Rényi DP, <a href="https://arxiv.org/abs/1702.07476">invented</a> by Ilya Mironov, has a bunch of neat properties.
In particular, it composes nicely, <a href="differential-privacy-awesomeness.html#composition">just like DP</a>. If a mechanism <span class="math">\(A\)</span>
is <span class="math">\((\alpha,\varepsilon_1)\)</span>-Rényi DP and a mechanism <span class="math">\(A'\)</span> is
<span class="math">\((\alpha,\varepsilon_2)\)</span>-Rényi DP, then releasing the output of both will be
<span class="math">\((\alpha,\varepsilon_1+\varepsilon_2)\)</span>-Rényi DP.</p>
<h1 id="zero-concentrated-differential-privacy">Zero-concentrated differential privacy</h1>
<p>So Rényi DP is pretty neat, but it involves an additional parameter <span class="math">\(\alpha\)</span>.
That's a bit annoying. Choosing <span class="math">\(\varepsilon\)</span> was already difficult. Having to
make a new decision about "how to average the risk" seems even harder. Yet, this
idea of averaging the privacy loss is pretty natural. Ideally, we would like to
keep this intuition, but have a single parameter.</p>
<p>What if we covered all possible values of <span class="math">\(\alpha\)</span> at once? Larger <span class="math">\(\alpha\)</span>
values put more weight on bad events: the "average" also gets larger as <span class="math">\(\alpha\)</span>
grows. So what if we put a bound on the average… but have this bound grow with
<span class="math">\(\alpha\)</span>? This seems like a good idea. But now, the question becomes: how <em>fast</em>
should it grow? There are a lot of increasing functions. But a logarithm doesn't
exactly behave like an exponential!</p>
<p>Since we have a choice, we can think of what other things we'd like from a
single-parameter definition. We saw that <a href="gaussian-noise.html">Gaussian noise</a> was a neat tool
to design DP mechanisms: it would be nice to describe its privacy guarantee in a
simple way with our new definition. Composition is also important, and if
possible, a <em>simple</em> composition result would be better.</p>
<p>To sum up, we're looking for a formulation that:</p>
<ol>
<li>has a single parameter,</li>
<li>corresponds to a larger <span class="math">\(\varepsilon\)</span> for growing values of <span class="math">\(\alpha\)</span>,</li>
<li>describes the guarantee of Gaussian noise in a simple & precise way,</li>
<li>and has a simple composition guarantee.</li>
</ol>
<p>That's exactly what zero-concentrated differential privacy (zCDP) provides.
<a href="https://arxiv.org/abs/1605.02065">Introduced</a> by Mark Bun & Thomas Steinke, it can be interpreted in
simple terms: given a single parameter <span class="math">\(\rho\)</span>, the <span class="math">\(\varepsilon\)</span> corresponding
to each <span class="math">\(\alpha\)</span> must be at most <span class="math">\(\rho\alpha\)</span>. In the formalism above, the
mechanism is <span class="math">\(\rho\)</span>-zCDP if:</p>
<div class="math">$$
\underset{O\sim A(D_1)}{\mathbb{E}} \left[\left(e^{\mathcal{L}_{D_1,D_2}(O)}\right)^{(\alpha-1)}\right] \le \left(e^{\rho\alpha}\right)^{(\alpha-1)}.
$$</div>
<p>It's easy to verify that it matches all the requirements above.</p>
<ol>
<li>The single parameter <span class="math">\(\rho\)</span> corresponds to the arithmetic average of the
privacy loss. (Or, equivalently, to the geometric average of the
<span class="math">\(e^\mathcal{L}\)</span>.)</li>
<li>It guarantees that the relationship between <span class="math">\(\alpha\)</span> and <span class="math">\(\varepsilon\)</span> is at
most linear, which is very simple.</li>
<li>It describes the <a href="gaussian-noise.html">Gaussian mechanism</a> beautifully. Suppose that the
statistics you're computing have a <span class="math">\(L^2\)</span> sensitivity of <span class="math">\(\Delta\)</span>, and that
you add Gaussian noise of variance <span class="math">\(\sigma^2\)</span> to the result. Then the
result satisfies <span class="math">\(\rho\)</span>-zCDP, with <span class="math">\(\rho=\frac{\Delta^2}{2\sigma^2}\)</span>. So much
nicer than the <a href="gaussian-noise.html#formula">formula</a> giving the <span class="math">\((\varepsilon,\delta)\)</span>-DP guarantee!</li>
<li>And composition is a breeze. If a mechanism is <span class="math">\(\rho_1\)</span>-zCDP and another is
<span class="math">\(\rho_2\)</span>-zCDP, then publishing the result of both is
<span class="math">\(\left(\rho_1+\rho_2\right)\)</span>-zCDP.</li>
</ol>
<p>The last two points are super useful to analyze <em>multiple</em> Gaussian mechanisms
at once: we can look at them separately, and add their corresponding <span class="math">\(\rho\)</span>
values. This works even if they use very different noise magnitudes. And the
resulting guarantee is much more precise than if we'd done the accounting with
<span class="math">\((\varepsilon,\delta)\)</span>-DP.</p>
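<p>This accounting can be sketched in a few lines; the sensitivities and noise magnitudes below are made up for illustration, not taken from any specific deployment.</p>

```python
def gaussian_zcdp_rho(l2_sensitivity, sigma):
    """zCDP parameter of the Gaussian mechanism: rho = Delta^2 / (2 sigma^2)."""
    return l2_sensitivity ** 2 / (2 * sigma ** 2)

# Two Gaussian mechanisms with very different noise magnitudes; under zCDP
# composition, their rho values simply add up.
rho_1 = gaussian_zcdp_rho(1.0, 10.0)   # 0.005
rho_2 = gaussian_zcdp_rho(5.0, 50.0)   # 0.005
rho_total = rho_1 + rho_2              # releasing both is 0.01-zCDP
```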
<p>These nice properties are why zCDP has been used in practice for some
high-profile use cases, like the 2020 U.S. Census <a href="https://www2.census.gov/programs-surveys/decennial/2020/program-management/data-product-planning/2010-demonstration-data-products/ppmf20210608/2021-06-08-privacy-loss_budgetallocation.pdf">Redistricting Data</a>.
If you want to release a lot of statistics, you might benefit from using this
notion in your privacy analysis as well.</p>
<h1 id="tldr">tl;dr</h1>
<p>Can we describe all the definitions we've seen so far in a tweet-length summary?
Here's an attempt.</p>
<ul>
<li><span class="math">\(\varepsilon\)</span>-DP: the absolute worst case is <span class="math">\(\varepsilon\)</span>.</li>
<li><span class="math">\(\left(\varepsilon,\delta\right)\)</span>-DP: the worst case is <span class="math">\(\varepsilon\)</span>, almost
always.</li>
<li><span class="math">\(\left(\alpha,\varepsilon\right)\)</span>-Rényi DP: the average case is <span class="math">\(\varepsilon\)</span>,
and <span class="math">\(\alpha\)</span> tells you which average function to use.</li>
<li><span class="math">\(\rho\)</span>-zCDP: many <span class="math">\(\left(\alpha,\varepsilon\right)\)</span>-Rényi DP guarantees at
once, well-chosen for convenience.</li>
</ul>
<p>Simple, right?</p>
<hr>
<p><small></p>
<p>Note: I optimized this article for simplicity. I tried to find the simplest
possible intuition for these notions, and made up a neat storyline to introduce
one after the other. This came at a cost in historical accuracy. If your main
goal was getting an intuitive understanding of these definitions, then you can
stop here. If you're also interested in learning about the history of these
notions, keep reading.</p>
<p>Contrary to the story above, zero-concentrated DP was introduced <em>before</em> Rényi
DP. This work itself built on a prior definition, <a href="https://arxiv.org/abs/1603.01887"><em>concentrated DP</em></a>,
invented by Dwork and Rothblum. This prior notion says that if you take the PLRV
and subtract its mean, you get a distribution that's "smaller than a Gaussian".</p>
<p>Concentrated DP was a fruitful notion, used to prove tighter composition
theorems for <span class="math">\((\varepsilon,\delta)\)</span>-DP. It also described the privacy properties
of the Gaussian mechanism in a neater way. But it also had some shortcomings: it
was not closed under post-processing, and its formulation was fairly complex.
This is what zero-concentrated DP was introduced to fix: it formalized the "PLRV
is smaller than a Gaussian" intuition in a simpler way, keeping the advantages
without the problems.</p>
<p>The original goal of both notions was to get better composition results, not to
average risk. Rényi DP, introduced afterwards, followed this line of research.
Making the parameter <span class="math">\(\alpha\)</span> explicit was a way of getting more flexibility in the
privacy analysis, in particular for machine learning use cases.</p>
<p></small></p>
<hr>
<p><small>Thanks to Anthony Caruso for letting me know about a mistake in a
previous version of this post.</small></p>
<div class="footnote">
<hr>
<ol>
<li id="fn:partial">
<p><a href="https://differentialprivacy.org/average-case-dp/">This blog post</a>
gives a few examples of what this can look like, and the dangers of doing so. <a class="footnote-backref" href="#fnref:partial" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>A list of real-world uses of differential privacy2021-10-01T00:00:00+02:002023-09-22T00:00:00+02:00Damien Desfontainestag:desfontain.es,2021-10-01:/privacy/real-world-differential-privacy.html<p>A list of practical deployments of differential privacy, along with their privacy parameters.</p><p><small>
<span class='notlettrine'>T</span>his post is part of a <a href="friendly-intro-to-differential-privacy.html">series on differential
privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see the other
articles!</p>
<p></small></p>
<hr>
<p><span class='lettrine'>T</span><strong>his</strong> article is a list of real-world
deployments of differential privacy, along with their privacy parameters. One
day, we might have a proper <a href="https://journalprivacyconfidentiality.org/index.php/jpc/article/view/689">Epsilon Registry</a>, but in the meantime…</p>
<p>First, some notes.</p>
<ul>
<li>The main list only includes projects with a publicly documented value of the
privacy parameters, including what the <a href="why-not-differential-privacy.html#privacy-units">privacy unit</a> is. Projects
that don't publish this information, but mention using DP, are listed at the
end.</li>
<li>All use cases use <a href="local-global-differential-privacy.html#central">central DP</a> unless specified otherwise.</li>
<li>The list is sorted by alphabetical order of the organization publishing the
data.</li>
<li>When a project uses open-source differential privacy tooling, I added a link
to it.</li>
<li>I also added some caveats and general comments at the end of this post.</li>
</ul>
<p>If you'd like to add or correct something, please let me know! My contact info
is at the bottom of this page.</p>
<h1 id="apple">Apple</h1>
<p><center>
<img alt="An architecture diagram taken from Apple's differential privacy paper" src="https://desfontain.es/privacy/images/apple-dp-diagram.png">
</center></p>
<p>Apple uses <a href="local-global-differential-privacy.html#local">local DP</a> to collect some data from end-user devices running
iOS or macOS. The process is documented in a <a href="https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf">high-level overview
document</a> and a <a href="https://docs-assets.developer.apple.com/ml-research/papers/learning-with-privacy-at-scale.pdf">detailed paper</a>. All use
<span class="math">\(\varepsilon\)</span>-DP; the values of the privacy parameter are described below, with
a privacy unit of user-day.</p>
<ul>
<li><em>QuickType suggestions</em> learns previously-unknown words typed by sufficiently
many users, using <span class="math">\(\varepsilon=16\)</span>.</li>
<li><em>Emoji suggestions</em> calculates which emojis are most popular among users,
using <span class="math">\(\varepsilon=4\)</span>.</li>
<li><em>Lookup hints</em> collects data on actions taken from iOS <a href="https://support.apple.com/guide/iphone/search-with-iphone-iph3c511548/ios">Search
suggestions</a>. (I think. It's not very explicit.) It uses
<span class="math">\(\varepsilon=8\)</span>.</li>
<li><em>Health Type Usage</em> estimates which health types are most used in the
HealthKit app, using <span class="math">\(\varepsilon=2\)</span>.</li>
<li><em>Safari Energy Draining Domains</em> and <em>Safari Crashing Domains</em> collect data on
web domains: which domains are most likely to cause high energy consumption or
crashes, respectively. Both features use a common budget of <span class="math">\(\varepsilon=8\)</span>.</li>
<li><em>Safari Autoplay Intent Detection</em> collects data about websites that auto-play
videos with sound: in which of these domains are users most likely to mute vs.
keep playing the video? It uses <span class="math">\(\varepsilon=16\)</span>.</li>
</ul>
<p><button class="toggleDetails"></button></p>
<div style="display: none; border-left: double; padding-left: 10px">
<p>The documented privacy unit is each data collection event. The devices send a
limited number of such events per day: I translated all guarantees to use a
privacy unit of user-day. Apple also does some de-identification and <a href="local-global-differential-privacy.html#shuffling">shuffling</a>
(see Section 3.2.2 of <a href="https://docs-assets.developer.apple.com/ml-research/papers/learning-with-privacy-at-scale.pdf">their paper</a>). Taking this into account
would presumably lead to tighter central DP guarantees.</p>
</div>
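<p>Apple's actual collection mechanisms (sketch-based frequency estimation over hashed values) are considerably more elaborate than this, but the simplest way to see what an <span class="math">\(\varepsilon\)</span>-local-DP collection step looks like is randomized response, sketched here with illustrative numbers.</p>

```python
import math
import random

def randomized_response(true_bit, epsilon, rng=random):
    """Report the true bit with probability e^eps / (e^eps + 1), and the
    flipped bit otherwise; each report satisfies epsilon-local DP."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return true_bit if rng.random() < p_truth else 1 - true_bit

def debias(reported_mean, epsilon):
    """Unbiased estimate of the true mean, computed from the noisy reports."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return (reported_mean - (1 - p)) / (2 * p - 1)
```

<p>The aggregator only ever sees the randomized bits, yet can still recover a population-level estimate by de-biasing the reported mean.</p>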
<h1 id="facebook">Facebook</h1>
<h4 id="full-urls-data-set">Full URLs Data Set</h4>
<p>The <a href="https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/TDOAPG/DGSAMS&version=6.2"><em>Full URLs Data Set</em></a> provides data on user interactions with web
pages shared on Facebook. The privacy unit is each individual action: this can be
e.g. "Alice shared URL foo.com", or "Bob viewed a post containing URL bar.org".
For each type of action, the privacy parameter is chosen to protect 99% of users
with <span class="math">\((\varepsilon,\delta)\)</span>-DP, for <span class="math">\(\varepsilon=0.45\)</span> and <span class="math">\(\delta=10^{-5}\)</span>.
Across all metrics, 96.6% of users are protected with <span class="math">\((\varepsilon,\delta)\)</span>-DP
with <span class="math">\(\varepsilon=1.453\)</span> and <span class="math">\(\delta=10^{-5}\)</span>.
<button class="toggleDetails"></button></p>
<div style="display: none; border-left: double; padding-left: 10px">
<p>Behind the scenes, this uses <span class="math">\(\rho\)</span>-<a href="renyi-dp-zero-concentrated-dp.html">zero-concentrated DP</a>, with
<span class="math">\(\rho=0.0052\)</span> for 99% of users for each action type, and an overall <span class="math">\(\rho=0.0728\)</span>
for 96.6% of users. The paper refers to two additional DP operations:</p>
<ul>
<li>URLs that have not been shared by enough users (according to a DP count) are
discarded;</li>
<li>the algorithm also calculates the 99% percentile of each action in a DP way.</li>
</ul>
<p>It does not quantify the privacy budget used for these two operations.</p>
</div>
<h4 id="movement-range-maps">Movement Range Maps</h4>
<p><center>
<img alt="An animated map of the "Stay Put" metric in Facebook's Movement Range Maps" src="https://desfontain.es/privacy/images/stay-put-map-fb.gif">
</center></p>
<p>The <a href="https://research.fb.com/blog/2020/06/protecting-privacy-in-facebook-mobility-data-during-the-covid-19-response/"><em>Movement Range Maps</em></a> quantify the changes in mobility of
Facebook users during the COVID-19 pandemic. There are two metrics: how much
their users move during each day, and how many people are generally staying at
home. Each metric uses a daily budget of <span class="math">\(\varepsilon=1\)</span>, so the overall privacy
budget is <span class="math">\(\varepsilon=2\)</span> with user-day as a privacy unit.
<button class="toggleDetails"></button></p>
<div style="display: none; border-left: double; padding-left: 10px">
<p>The <a href="https://research.fb.com/blog/2020/06/protecting-privacy-in-facebook-mobility-data-during-the-covid-19-response/">blog post</a> also mentions that regions with fewer than 300 users
are omitted. This process doesn't appear to be done in a DP way.</p>
</div>
<h1 id="google">Google</h1>
<h4 id="community-mobility-reports">Community Mobility Reports</h4>
<p><center>
<img alt="Two graphs comparing time spent in residential vs. workspace places compared to a baseline" src="https://desfontain.es/privacy/images/community-mobility-reports.png">
</center></p>
<p>The <a href="https://arxiv.org/abs/2004.04145"><em>Community Mobility Reports</em></a> quantify changes in mobility patterns
during the COVID-19 pandemic: how many people went to their workplace or to
specific kinds of public places, and how long people spent at home. Each metric
uses <span class="math">\(\varepsilon=0.44\)</span> per day, and each user contributes to at most six
metrics per day. Thus, the total privacy budget is <span class="math">\(\varepsilon=2.64\)</span>, with
user-day as a privacy unit. The data was made differentially private using
<a href="https://github.com/google/differential-privacy">GoogleDP</a><sup id="fnref:googledp"><a class="footnote-ref" href="#fn:googledp">1</a></sup>.
<button class="toggleDetails"></button></p>
<div style="display: none; border-left: double; padding-left: 10px">
<p>The <a href="https://arxiv.org/abs/2004.04145">paper</a> also mentions that additional privacy budget was used to
update the way the metrics are computed. This additional budget isn't quantified
exactly.</p>
</div>
<h4 id="next-word-prediction-model-on-gboard">Next-word prediction model on Gboard</h4>
<p><center>
<img alt="A diagram showing model updates in parameter space, comparing DP-FTRL with DP-SGD and true gradients." src="https://desfontain.es/privacy/images/dp-ftrl.png">
</center></p>
<p>Google uses <a href="https://en.wikipedia.org/wiki/Federated_learning">federated learning</a> along with DP to build the <a href="https://ai.googleblog.com/2022/02/federated-learning-with-formal.html">next-word
prediction model</a> for the Spanish-language version of Gboard, a
virtual keyboard application for Android. It uses <span class="math">\((\varepsilon,\delta)\)</span>-DP with
<span class="math">\(\varepsilon=6.92\)</span> and <span class="math">\(\delta=10^{-5}\)</span>, with a user as a privacy unit. The
model was trained using <a href="https://www.tensorflow.org/federated">TensorFlow Federated</a> and <a href="https://www.tensorflow.org/responsible_ai/privacy/guide">TensorFlow
Privacy</a>.
<button class="toggleDetails"></button></p>
<div style="display: none; border-left: double; padding-left: 10px">
<p>The privacy guarantees are reported using <span class="math">\(\rho\)</span>-<a href="renyi-dp-zero-concentrated-dp.html">zero-concentrated DP</a>,
with a budget of <span class="math">\(\rho=0.81\)</span>.</p>
</div>
<h4 id="search-trends-symptoms-dataset">Search Trends Symptoms Dataset</h4>
<p><center>
<img alt="An animated visualization of searches for Fever in the US through 2020, using Google's Search Trends Symptoms Dataset" src="https://desfontain.es/privacy/images/google-sst-map.gif">
</center></p>
<p>The <a href="https://arxiv.org/abs/2009.01265"><em>Search Trends Symptoms Dataset</em></a> measures the volume of Google
searches related to a variety of symptoms. It uses <span class="math">\(\varepsilon=1.68\)</span>, with a
user-day privacy unit; the release was generated using <a href="https://github.com/google/differential-privacy">GoogleDP</a>.</p>
<h4 id="urban-mobility-data">Urban mobility data</h4>
<p><center>
<img alt="Figure 1 from the paper linked below, showing visualizations of mobility and location hotspots in 7 large cities" src="https://desfontain.es/privacy/images/urban-mobility-data.webp">
</center></p>
<p>Google <a href="https://ai.googleblog.com/2019/11/new-insights-into-human-mobility-with.html">shared mobility data</a> with researchers, using DP to
anonymize it. The resulting <a href="https://www.nature.com/articles/s41467-019-12809-y">paper</a> says that this data sharing
scheme used <span class="math">\((\varepsilon,\delta)\)</span>-DP with <span class="math">\(\varepsilon=0.66\)</span> and
<span class="math">\(\delta=2.1\cdot10^{-29}\)</span>. The privacy unit is whether a given user made a trip
from one location to another during one week, where both locations are fixed
areas of size <span class="math">\(\approx1.3\)</span>km².
<button class="toggleDetails"></button></p>
<div style="display: none; border-left: double; padding-left: 10px">
<p>The privacy unit was not explicitly given in the original paper, which could
have given the impression that the data release used a user-level privacy unit.
Other researchers <a href="https://www.nature.com/articles/s41467-021-27566-0">pointed this out</a>, after which the original
authors published a <a href="https://www.nature.com/articles/s41467-021-27567-z">clarification</a> making the privacy
guarantees more explicit.</p>
</div>
<h4 id="vaccination-search-insights">Vaccination Search Insights</h4>
<p>The <a href="https://arxiv.org/abs/2107.01179"><em>Vaccination Search Insights</em></a> quantify trends in Google searches
related to COVID-19 vaccination. It uses <span class="math">\((\varepsilon,\delta)\)</span>-DP with
<span class="math">\(\varepsilon=2.19\)</span> and <span class="math">\(\delta=10^{-5}\)</span>, with user-day as a privacy unit; the
data was generated using <a href="https://github.com/google/differential-privacy">GoogleDP</a>.
<button class="toggleDetails"></button></p>
<div style="display: none; border-left: double; padding-left: 10px">
<p>A later <a href="https://blog.research.google/2023/05/differentially-private-clustering-for.html">blog post</a> suggests that the query classification mentioned as
pre-processing in the <a href="https://arxiv.org/abs/2107.01179">technical paper</a> is also done in a DP way. The
privacy budget involved in this step is not public, and is not counted towards
the reported guarantees. The code for this step is also part of
<a href="https://github.com/google/differential-privacy">GoogleDP</a>.</p>
</div>
<h1 id="linkedin">LinkedIn</h1>
<h4 id="audience-engagements-api">Audience Engagements API</h4>
<p><center>
<img alt="An architecture diagram from LinkedIn's Audience Engagements API paper" src="https://desfontain.es/privacy/images/linkedin-dp-api-diagram.png">
</center></p>
<p>The <a href="https://arxiv.org/abs/2002.05839"><em>Audience Engagements API</em></a> is the only interactive query system in
this list. It allows marketers to get information about LinkedIn users engaging
with their content. Each query satisfies <span class="math">\((\varepsilon,\delta)\)</span>-DP with
<span class="math">\(\varepsilon=0.15\)</span> and <span class="math">\(\delta=10^{-10}\)</span>, with a user as a privacy unit. Each
analyst can send multiple queries, but a monthly cap limits how many: the total
<span class="math">\((\varepsilon,\delta)\)</span> budget is <span class="math">\(\varepsilon=34.9\)</span> and <span class="math">\(\delta=7\cdot10^{-9}\)</span>,
with a privacy unit of user-month-analyst.
<button class="toggleDetails"></button></p>
<div style="display: none; border-left: double; padding-left: 10px">
<p>The system also implements additional measures to prevent averaging attacks: new
data is loaded daily, and seeded noise is used so the same query on the same day
will always return the same answer.</p>
</div>
<h4 id="labor-market-insights">Labor Market Insights</h4>
<p>The <a href="https://arxiv.org/abs/2010.13981"><em>Labor Market Insights</em></a> measure trends in people changing their
occupation on LinkedIn. There are three types of reports.</p>
<ul>
<li><em>Who is hiring?</em> lists the companies that are hiring the most. It uses
<span class="math">\((\varepsilon,\delta)\)</span>-DP to protect each hiring event (a LinkedIn user
changing their occupation), with <span class="math">\(\varepsilon=14.4\)</span> and
<span class="math">\(\delta=1.2\cdot10^{-9}\)</span>.</li>
<li><em>What jobs are available?</em> enumerates the job titles that most people are
being hired for. It also uses <span class="math">\((\varepsilon,\delta)\)</span>-DP to protect each hiring
event, with <span class="math">\(\varepsilon=14.4\)</span> and <span class="math">\(\delta=1.2\cdot10^{-9}\)</span>.</li>
<li><em>What skills are needed?</em> lists the most popular skills for the jobs above. It
protects each LinkedIn user's skills information during a single month with
<span class="math">\(\varepsilon=0.3\)</span> and <span class="math">\(\delta=3\cdot10^{-10}\)</span>.</li>
</ul>
<p>This suggests a total budget of <span class="math">\(\varepsilon=28.8\)</span> and <span class="math">\(\delta=2.4\cdot10^{-9}\)</span> for
hiring events, and <span class="math">\(\varepsilon=0.3\)</span> and <span class="math">\(\delta=3\cdot10^{-10}\)</span> for skill
information during a single month. However, this analysis involves many
subtleties, and the paper can reasonably be interpreted differently.
<button class="toggleDetails"></button></p>
<div style="display: none; border-left: double; padding-left: 10px">
<ol>
<li>The privacy parameters listed in the <a href="https://arxiv.org/abs/2010.13981">paper</a> are three times smaller.
However, each report covers 3 months of data, and reports are published
monthly: a single hiring event will appear in three distinct reports.</li>
<li>For <em>What skills are needed?</em>, each monthly report looks back at 5 years of
data. So if skill data for a user doesn't change during a 5-year period, the
total budget eventually reaches <span class="math">\(\varepsilon=6\)</span> and <span class="math">\(\delta=6\cdot10^{-9}\)</span>.</li>
<li>Adding the <span class="math">\(\varepsilon\)</span> and <span class="math">\(\delta\)</span> values together, like I did, is simple,
but only gives loose bounds on the overall privacy budget. We can probably
find tighter bounds using advanced composition theorems or other methods for
privacy accounting.</li>
<li>The paper also indicates that 95% of people in the dataset have at most one
hiring event in a 3-month period.</li>
<li>The <em>What skills are needed?</em> report also uses a non-DP pre-processing step.
This makes it technically impossible to provide an exact DP guarantee.</li>
</ol>
</div>
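<p>The totals above can be reproduced with basic sequential composition, under
which the <span class="math">\(\varepsilon\)</span> and <span class="math">\(\delta\)</span> values of the individual releases simply add
up. This is a sketch of my interpretation of the paper, not its own accounting:</p>

```python
# Per-report parameters (each report covers 3 months); the paper lists
# values three times smaller than the per-event totals discussed above.
eps_report, delta_report = 14.4 / 3, 1.2e-9 / 3

# A single hiring event appears in 3 successive monthly reports...
eps_type, delta_type = 3 * eps_report, 3 * delta_report

# ...and contributes to both "Who is hiring?" and "What jobs are available?":
eps_total, delta_total = 2 * eps_type, 2 * delta_type

print(round(eps_total, 1))  # 28.8
```

As noted above, this simple addition only gives loose bounds; advanced
composition would yield tighter parameters.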
<h1 id="microsoft">Microsoft</h1>
<h4 id="global-victim-perpetrator-synthetic-dataset">Global victim-perpetrator synthetic dataset</h4>
<p>Microsoft collaborated with the <a href="https://www.iom.int/">International Organization for Migration</a>
to publish the <a href="https://www.ctdatacollaborative.org/global-victim-perpetrator-synthetic-dataset"><em>Global Victim-Perpetrator Synthetic Dataset</em></a>, which
provides information about victims and perpetrators of trafficking. The release
uses <span class="math">\((\varepsilon,\delta)\)</span>-DP with <span class="math">\(\varepsilon=12\)</span> and
<span class="math">\(\delta=5.8\cdot10^{-6}\)</span>; the privacy unit is a victim in the original dataset.
It uses <a href="https://github.com/microsoft/synthetic-data-showcase">custom code</a> to generate the data.</p>
<h4 id="telemetry-collection-in-windows">Telemetry collection in Windows</h4>
<p>Microsoft collects <a href="https://www.microsoft.com/en-us/research/publication/collecting-telemetry-data-privately/">telemetry data in Windows</a>. The process used to
get information about how much time users spend using particular apps uses
<a href="local-global-differential-privacy.html#local">local DP</a>, with <span class="math">\(\varepsilon=1.672\)</span>, and a privacy unit of user-6-hours.</p>
<h4 id="us-broadband-coverage-dataset">U.S. Broadband Coverage Dataset</h4>
<p><center>
<img alt="A map of the US where each postal code is colored according to the fraction of devices using broadband" src="https://desfontain.es/privacy/images/broadband-coverage.png">
</center></p>
<p>The <a href="https://arxiv.org/abs/2103.14035"><em>U.S. Broadband Coverage Dataset</em></a> quantifies the percentage of users
having access to high-speed Internet across the US. It uses <span class="math">\(\varepsilon\)</span>-DP
with <span class="math">\(\varepsilon=0.2\)</span>; the privacy unit is a user. The data was privatized
using <a href="https://smartnoise.org/">OpenDP SmartNoise</a>.</p>
<h1 id="ohmconnect">OhmConnect</h1>
<p><center>
<img alt="A screenshot of a UI visualizing the impact of DP on queries returning average possible energy savings among a group" src="https://desfontain.es/privacy/images/energy-dp.png">
</center></p>
<p>The <a href="https://edp.recurve.com/"><em>Energy Differential Privacy</em></a> project enables sharing of smart meter
data. In one <a href="https://assets.website-files.com/5cb0a177570549b5f11b9550/5ffddb83b5ea5d67f5c43661_Quantifying%20The%20OhmConnect%20Virtual%20Power%20Plant%20During%20the%20California%20Blackouts.pdf">project</a>, <a href="https://www.recurve.com/">Recurve</a> helped <a href="https://www.ohmconnect.com">OhmConnect</a> share data from
their virtual power plant. This project uses <span class="math">\((\varepsilon,\delta)\)</span>-DP with
<span class="math">\(\varepsilon=4.72\)</span> and <span class="math">\(\delta=5.06\cdot10^{-9}\)</span>, with user as a privacy unit.
The project uses both <a href="https://github.com/recurve-inc/eeprivacy">custom open-source code</a> and Google's
<a href="https://github.com/google/differential-privacy">open-source DP libraries</a>. <button class="toggleDetails"></button></p>
<div style="display: none; border-left: double; padding-left: 10px">
<p>The privacy parameters appearing in the <a href="https://assets.website-files.com/5cb0a177570549b5f11b9550/5ffddb83b5ea5d67f5c43661_Quantifying%20The%20OhmConnect%20Virtual%20Power%20Plant%20During%20the%20California%20Blackouts.pdf">technical paper</a> are
different. The accounting uses amplification by sampling, with a sampling factor
of <span class="math">\(\eta=0.124\)</span>. However, the paper converts a pre-amplification
<span class="math">\(\varepsilon_{orig}=6.8\)</span> into <span class="math">\(\varepsilon=\eta\cdot\varepsilon_{orig}=0.843\)</span>.
The correct formula is
<span class="math">\(\varepsilon=\log\left(1+\eta\left(e^{\varepsilon_{orig}}-1\right)\right)\)</span> (see
Theorem 9 in this <a href="https://arxiv.org/abs/1807.01647">summary of results</a>), which gives <span class="math">\(\varepsilon=4.72\)</span>.
The <span class="math">\(\delta\)</span> listed above is also amplified (with <span class="math">\(\delta=\eta\delta_{orig}\)</span>);
the one reported in the paper is not.</p>
<p>Note that the amplification result assumes uniformly random sampling with
replacement. But the paper also mentions a stratified sampling methodology,
which is slightly different: it's unclear whether the amplification result still
applies. If not, then the privacy parameters are <span class="math">\(\varepsilon=6.8\)</span> and
<span class="math">\(\delta=4.08\cdot10^{-8}\)</span>.</p>
</div>
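<p>For reference, the amplification-by-sampling computation described above can
be sketched as follows (the function name is mine; it assumes uniformly random
sampling):</p>

```python
import math

def amplified_epsilon(eps_orig, eta):
    """Privacy amplification by uniform subsampling with sampling
    probability eta: eps = log(1 + eta * (exp(eps_orig) - 1))."""
    return math.log(1 + eta * (math.exp(eps_orig) - 1))

eta = 0.124
print(round(amplified_epsilon(6.8, eta), 2))  # 4.72, vs. the naive eta * 6.8 = 0.84
delta_amplified = eta * 4.08e-8               # ~5.06e-9, as listed above
```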
<h1 id="united-states-census-bureau">United States Census Bureau</h1>
<h4 id="post-secondary-employment-outcomes">Post-Secondary Employment Outcomes</h4>
<p>The <a href="https://lehd.ces.census.gov/data/pseo_experimental.html"><em>Post-Secondary Employment Outcomes</em></a> provide data about the earnings
and employment of college graduates. The <a href="https://lehd.ces.census.gov/doc/PSEOTechnicalDocumentation.pdf">technical documentation</a>
mentions two statistics using <span class="math">\(\varepsilon\)</span>-DP with <span class="math">\(\varepsilon=1.5\)</span>, for a
total privacy budget of <span class="math">\(\varepsilon=3\)</span>. The privacy unit is a person in the
dataset, and the methods are described in detail in <a href="https://journalprivacyconfidentiality.org/index.php/jpc/article/view/722">this paper</a>.</p>
<h4 id="2020-census-redistricting-data">2020 Census Redistricting Data</h4>
<p><center>
<img alt="A screenshot from the 2020 Census Demographic Data Map Viewer" src="https://desfontain.es/privacy/images/census-map-viewer.png">
</center></p>
<p>The <a href="https://www.census.gov/newsroom/press-releases/2021/2020-census-key-parameters.html"><em>2020 Census Redistricting Data</em></a> contain demographic
information about the U.S. population. It is protected with
<span class="math">\((\varepsilon,\delta)\)</span>-DP with <span class="math">\(\varepsilon=13.64\)</span> and <span class="math">\(\delta=10^{-5}\)</span>, where
the privacy unit is a person in the dataset. This uses custom code that was
<a href="https://github.com/uscensusbureau/DAS_2020_Redistricting_Production_Code">published on GitHub</a>.
<button class="toggleDetails"></button></p>
<div style="display: none; border-left: double; padding-left: 10px">
<p>The <a href="https://www.census.gov/programs-surveys/decennial-census/decade/2020/planning-management/process/disclosure-avoidance/protections-by-data-product.html#redistricting-data">privacy accounting</a> is done with
<span class="math">\(\rho\)</span>-<a href="renyi-dp-zero-concentrated-dp.html">zero-concentrated DP</a>, with a global budget of <span class="math">\(\rho=2.63\)</span>; the
official document reports privacy parameters of <span class="math">\(\varepsilon=19.61\)</span> and
<span class="math">\(\delta=10^{-10}\)</span>. </p>
</div>
<h4 id="2020-census-demographic-and-housing-characteristics-file">2020 Census Demographic and Housing Characteristics File</h4>
<p>As part of the 2020 Decennial Census, the U.S. Census Bureau is publishing the
<em>Demographic and Housing Characteristics</em> File (DHC), which provides data about
demographic and household characteristics of the U.S. population. The first
part of this release, the <a href="https://www.census.gov/newsroom/press-releases/2023/2020-census-detailed-dhc-file-a.html"><em>Detailed DHC-A</em></a>, was published with
<span class="math">\((\varepsilon,\delta)\)</span>-DP with <span class="math">\(\varepsilon=49.95\)</span> and <span class="math">\(\delta=10^{-5}\)</span>, using
<a href="https://tmlt.dev">Tumult Analytics</a>.
<button class="toggleDetails"></button></p>
<div style="display: none; border-left: double; padding-left: 10px">
<p>The <a href="https://www2.census.gov/programs-surveys/decennial/2020/technical-documentation/complete-tech-docs/detailed-demographic-and-housing-characteristics-file-a/2020census-detailed-dhc-a-techdoc.pdf#page=60">privacy accounting</a> is done with
<span class="math">\(\rho\)</span>-<a href="renyi-dp-zero-concentrated-dp.html">zero-concentrated DP</a>, with a global budget of <span class="math">\(\rho=19.776\)</span>. </p>
</div>
<h1 id="wikimedia-foundation">Wikimedia Foundation</h1>
<p>The <a href="https://wikimediafoundation.org/">Wikimedia Foundation</a>, helped by <a href="https://tmlt.io">Tumult Labs</a>, published
statistics about how many distinct users visited each Wikipedia page on each
day, from each country. The <a href="https://diff.wikimedia.org/2023/06/21/new-dataset-uncovers-wikipedia-browsing-habits-while-protecting-users/">data publication</a> also covers other
<a href="https://meta.wikimedia.org/wiki/Our_projects">Wikimedia projects</a>, and is split into three parts.</p>
<ul>
<li>Data from July 1st, 2015 to February 8th, 2017 is protected with
<span class="math">\(\varepsilon\)</span>-DP with <span class="math">\(\varepsilon=1\)</span>, the privacy unit being 300 page views
per day.</li>
<li>Data from February 9th, 2017 to February 5th, 2023 is protected with
<span class="math">\(\varepsilon\)</span>-DP with <span class="math">\(\varepsilon=1\)</span>, the privacy unit being 30 page views
per day.</li>
<li>Data from February 6th, 2023 onwards is protected with
<span class="math">\((\varepsilon,\delta)\)</span>-DP with <span class="math">\(\varepsilon=0.85\)</span> and <span class="math">\(\delta=10^{-5}\)</span>, with a
user-day privacy unit.</li>
</ul>
<p>A <a href="https://arxiv.org/abs/2308.16298">technical paper</a> explains why different privacy units and
privacy budgets are used for different periods.
<button class="toggleDetails"></button></p>
<div style="display: none; border-left: double; padding-left: 10px">
<p>For the most recent data, the privacy accounting is done with
<span class="math">\(\rho\)</span>-<a href="renyi-dp-zero-concentrated-dp.html">zero-concentrated DP</a> with a budget of <span class="math">\(\rho=0.015\)</span>, which also
corresponds to <span class="math">\(\varepsilon=1\)</span> for <span class="math">\(\delta=10^{-7}\)</span>. </p>
</div>
<h1 id="other-deployments">Other deployments</h1>
<p>This list is almost certainly incomplete. Again, don't hesitate to reach out if
you'd like me to add or correct something!</p>
<ul>
<li>Apple and Google's <a href="https://covid19.apple.com/contacttracing">Exposure Notification framework</a> has an
<a href="https://covid19-static.cdn-apple.com/applications/covid19/current/static/contact-tracing/pdf/ENPA_White_Paper.pdf">analytics</a> component that uses <a href="local-global-differential-privacy.html#shuffling">shuffled DP</a>. The
paper mentions a local <span class="math">\(\varepsilon=8\)</span> and corresponding central values of
<span class="math">\(\varepsilon\)</span> depending on how many users participate and on the central
<span class="math">\(\delta\)</span> chosen. However, it does not specify the privacy unit, the number of
aggregations, nor the minimal number of participating users.</li>
<li>Google mentions using DP in two <a href="https://developers.googleblog.com/2019/09/enabling-developers-and-organizations.html">Google Maps features</a>: the first
quantifies how busy public places are during the day, the second which
restaurant's dishes are most popular. It does not specify the privacy
parameters used nor the exact method used to generate the data.</li>
<li>Google's <a href="https://security.googleblog.com/2014/10/learning-statistics-with-privacy-aided.html">RAPPOR</a> used to collect browsing information in Google Chrome with
<a href="local-global-differential-privacy.html#local">local DP</a>. It is now <a href="https://github.com/chromium/chromium/blob/72ceeed2ebcd505b8d8205ed7354e862b871995e/chrome/browser/prefs/browser_prefs.cc#L509">deprecated</a>.</li>
<li>Google mentions using DP and federated learning to train models to improve
<a href="https://ai.googleblog.com/2023/03/distributed-differential-privacy-for.html">text selection and copying</a> on Android. The deployment uses
distributed DP, which provides similar guarantees to <a href="local-global-differential-privacy.html#local">local DP</a>, with
additional assumptions about the adversary (which must be honest-but-curious).
The value of <span class="math">\(\varepsilon\)</span> is reported to be "in the hundreds", but not
precisely specified; the privacy unit is also not reported.</li>
<li>The <a href="https://www.irs.gov/">Internal Revenue Service</a> and the <a href="https://www.ed.gov/">U.S. Department of
Education</a>, helped by <a href="https://tmlt.io">Tumult Labs</a>, used DP to publish college
graduate income summaries. The data was generated using <a href="https://tmlt.dev">Tumult
Analytics</a> and published on the <a href="https://collegescorecard.ed.gov/">College Scorecard</a>
website. The project is outlined in <a href="https://www.tmlt.io/research/how-tumult-labs-helped-irs-support-educational-accountability-with-differential-privacy">this post</a>, but no specific
privacy parameters are given.</li>
<li>Microsoft's <a href="https://www.microsoft.com/en-us/research/group/msai/articles/assistive-ai-makes-replying-easier-2/">Assistive AI</a> automatically suggests replies to messages in
Office tools. It provides <span class="math">\((\varepsilon,\delta)\)</span>-DP with <span class="math">\(\varepsilon=4\)</span> and
<span class="math">\(\delta<10^{-7}\)</span>, but does not specify what the privacy unit is.<ul>
<li>A separate <a href="https://www.microsoft.com/en-us/research/blog/privacy-preserving-machine-learning-maintaining-confidentiality-and-preserving-trust/">blog post</a> by Microsoft suggests that this choice of
<span class="math">\(\varepsilon=4\)</span> is a policy standard across use cases for differentially
private machine learning, and applies to the data of each user over a
period of 6 months.</li>
</ul>
</li>
<li>Microsoft also mentions using DP in <a href="https://download.microsoft.com/download/D/1/F/D1F0DFF5-8BA9-4BDF-8924-7816932F6825/Differential_Privacy_for_Everyone.pdf">Workplace Analytics</a>: this allows
managers to see data about their team's interactions with workplace tools. No
specific information about privacy parameters is given.</li>
<li><a href="https://spectus.ai/">Spectus</a> published a <a href="https://spectus.ai/social-impact/evacuation-dashboard/">dashboard</a> containing DP metrics about
mobility trends during Hurricane <a href="https://en.wikipedia.org/wiki/Hurricane_Irma">Irma</a>, and the page suggests that they
generated similar datasets for other natural disasters. The
<a href="https://spectus.ai/wp-content/uploads/2022/10/Spectus_DPWhitepaper_v01b.pdf">whitepaper</a> mentions that <a href="https://smartnoise.org/">OpenDP SmartNoise</a> was
used to generate four <span class="math">\(\varepsilon\)</span>-DP metrics for a total <span class="math">\(\varepsilon=10\)</span>;
the privacy unit is not specified.</li>
<li>The U.S. Census Bureau published <a href="https://lehd.ces.census.gov/applications/help/onthemap.html#!what_is_onthemap">OnTheMap</a> in 2008: this was the
first-ever real-world deployment of DP. It provides statistics on where US
workers are employed and where they live. The DP process is described in
this <a href="https://lehd.ces.census.gov/doc/help/ICDE08_conference_0768.pdf">paper</a>, but I haven't found the privacy parameters published
anywhere.</li>
<li>The U.S. Census Bureau published <a href="https://www.census.gov/topics/business-economy/disclosure/data/tables/cbp-privacy-demonstration-tables.html">demonstration tables</a> for their
County Business Patterns data product, providing information about business
establishments in the US. The <a href="https://www.census.gov/topics/business-economy/disclosure/about.html">About page</a> mentions the use of
<span class="math">\(\rho\)</span>-<a href="renyi-dp-zero-concentrated-dp.html">zero-concentrated DP</a> with <span class="math">\(\rho=12.058\)</span>, which gives
<span class="math">\(\varepsilon=35.62\)</span> for <span class="math">\(\delta=10^{-5}\)</span>, but does not specify the privacy
unit.</li>
</ul>
<p>There are (many) other examples of companies and organizations saying they use
DP. I only added them here if they point to a specific project or feature.</p>
<p>Finally, many scientific papers report experimental results on real datasets.
Most don't mention whether the system was deployed. I did not attempt to list
those.</p>
<h1 id="caveats-comments">Caveats & comments</h1>
<h4 id="whats-a-user">What's a user?</h4>
<p>Most of these projects have <em>user</em> as part of their privacy unit. This can mean
slightly different things depending on the project: a device (for telemetry
collection), an account (for online services), a household (for smart meter
data), and so on. This means that an individual who uses multiple devices or
accounts on the same online service might get weaker privacy guarantees. This
subtlety is not always made explicit.</p>
<h4 id="replacement-vs-additionremoval">Replacement vs. addition/removal</h4>
<p>In differential privacy, the definition of the two neighboring datasets can be
of two types. Do you <em>change</em> the data of one person? Or do you <em>add</em> or
<em>remove</em> a user? This subtlety is also not always explicit, and I've ignored it
in the list above.</p>
<h4 id="zero-concentrated-differential-privacy">Zero-concentrated differential privacy</h4>
<p>Multiple data releases use <a href="renyi-dp-zero-concentrated-dp.html">zero-concentrated DP</a> to do the privacy budget
accounting. Some report guarantees using this definition, others convert the
guarantees to <span class="math">\((\varepsilon, \delta)\)</span>-DP in communication materials. To make the
comparison easier, I converted all these guarantees to <span class="math">\((\varepsilon,\delta)\)</span>-DP
with <span class="math">\(\delta=10^{-5}\)</span>, even when the reported <span class="math">\(\delta\)</span> is different.
<button class="toggleDetails"></button></p>
<div style="display: none; border-left: double; padding-left: 10px">
<p>The conversion was done using Proposition 1.3 in <a href="https://arxiv.org/abs/1605.02065">this paper</a>:
<span class="math">\(\varepsilon=\rho+2\sqrt{\rho\ln\left(1/\delta\right)}\)</span>. There are tighter (and
significantly more complex) conversion results available in the literature, so
the actual <span class="math">\(\varepsilon\)</span> will be slightly smaller than reported here.</p>
</div>
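<p>As a sanity check, this conversion reproduces several of the
<span class="math">\(\varepsilon\)</span> values quoted in this post (a minimal sketch; the function name is
mine):</p>

```python
import math

def zcdp_to_dp(rho, delta=1e-5):
    """Convert a rho-zCDP guarantee to (epsilon, delta)-DP using
    Proposition 1.3 of arXiv:1605.02065:
    eps = rho + 2 * sqrt(rho * ln(1/delta))."""
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

print(round(zcdp_to_dp(2.63), 2))         # 2020 Census Redistricting Data: 13.64
print(round(zcdp_to_dp(19.776), 2))       # Detailed DHC-A: 49.95
print(round(zcdp_to_dp(12.058), 2))       # County Business Patterns: 35.62
print(round(zcdp_to_dp(0.015, 1e-7), 2))  # Wikimedia pageviews (delta=1e-7): 1.0
```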
<h4 id="comparing-projects">Comparing projects</h4>
<p>You should not use this list to make broad statements or comparisons about the
privacy posture of different organizations. Differential privacy parameters are
a very small part of the story, even for these specific projects. How was the
data collected? How long is it kept? How sensitive is it? Who has access to the
input and output data? Answering these questions is crucial to put each DP
deployment and its parameters in context. </p>
<p>In addition, different privacy units also make simple comparisons fairly
meaningless. Even across time periods, the semantics are subtle. As an example,
consider two DP processes.</p>
<ul>
<li>Process <span class="math">\(A\)</span> uses a privacy unit of user-day with <span class="math">\(\varepsilon_A=0.2\)</span>.</li>
<li>Process <span class="math">\(B\)</span> uses a privacy unit of user-month with <span class="math">\(\varepsilon_B=3\)</span>.</li>
</ul>
<p>Can we simply multiply <span class="math">\(\varepsilon_A\)</span> by <span class="math">\(30\)</span> to compare it to <span class="math">\(\varepsilon_B\)</span>?
Well, not really. The data of a user <em>during a single day</em> is protected by
Process <span class="math">\(A\)</span> with <span class="math">\(\varepsilon_A\)</span>, which is better than what Process <span class="math">\(B\)</span> can
guarantee (at most <span class="math">\(\varepsilon_B\)</span>). But the data <em>of an entire month</em> is only
protected by Process <span class="math">\(A\)</span> with <span class="math">\(30\varepsilon_A=6\)</span>, so Process <span class="math">\(B\)</span> has better
guarantees at that scale. And this doesn't even account for better privacy
accounting methods, which could give tighter parameters for the monthly
guarantees of Process <span class="math">\(A\)</span>.</p>
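<p>The comparison boils down to simple arithmetic, using basic composition over
30 days (a deliberately coarse sketch; tighter accounting would shrink Process
<span class="math">\(A\)</span>'s monthly bound):</p>

```python
eps_daily_A = 0.2    # Process A: user-day privacy unit
eps_monthly_B = 3.0  # Process B: user-month privacy unit

# Over a single day, Process A's guarantee is stronger:
assert eps_daily_A < eps_monthly_B  # B only bounds one day by its monthly budget

# Over a 30-day month, basic composition bounds Process A's guarantee:
eps_monthly_A = 30 * eps_daily_A
print(eps_monthly_A)  # 6.0 > 3.0, so Process B is stronger at the monthly scale
```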
<hr>
<p><small></p>
<p>Thanks to Ashwin Machanavajjhala, Erik Taubenek, Kai Yao, Lars Vilhuber,
Lorraine Wong, Marc Paré, Rodrigo Racanicci, Sergey Yekhanin, and Tancrède
Lepoint for their helpful comments and suggestions.</p>
<p></small></p>
<script type="text/javascript">
// Toggle buttons: clicking a "More details" button flips its label and
// shows/hides the details element that follows the button's parent.
var defaultButton = 'More details';
var buttons = document.getElementsByClassName('toggleDetails');
for (var i = 0; i < buttons.length; i++) {
buttons[i].innerHTML = defaultButton;
buttons[i].addEventListener('click', function (event) {
this.innerHTML = this.innerHTML == defaultButton ? 'Fewer details' : defaultButton;
// Declare with var to avoid leaking an implicit global.
var details = this.parentElement.nextElementSibling;
details.style.display = details.style.display == 'none' ? 'block' : 'none';
});
}
</script>
<div class="footnote">
<hr>
<ol>
<li id="fn:googledp">
<p>The project name in the GitHub repository is "Google's differential
privacy libraries"; most of the academic literature uses "GoogleDP" to refer
to it, so I reuse the abbreviation here. <a class="footnote-backref" href="#fnref:googledp" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>A friendly, non-technical introduction to differential privacy2021-09-27T00:00:00+02:002023-09-23T00:00:00+02:00Damien Desfontainestag:desfontain.es,2021-09-27:/privacy/friendly-intro-to-differential-privacy.html<p>An introduction and table of contents for my beginner-friendly blog post series about differential privacy.</p><p><strong>Differential</strong> privacy is getting a lot of attention lately. Companies and
governments are starting to <a href="real-world-differential-privacy.html">publish data</a> anonymized with this
notion. Universities are offering <a href="http://www.gautamkamath.com/CS860-fa2020.html">courses</a> about it. Statisticians are getting
acquainted with this new approach to protecting data. Open-source organizations
are publishing tooling to make differential privacy easier to use.</p>
<p>So, you might be wondering: what's the hype all about? What even is differential
privacy? What makes it so special? How does it work in practice? And, perhaps
more importantly, can I understand it without having to read a bunch of
complicated equations?</p>
<p>The good news is: you've come to the right place. Welcome to my friendly blog
post series about differential privacy! It provides simple explanations for the
core concepts behind differential privacy. It is meant for a wide, non-technical
audience: it doesn't assume any prior knowledge, uses as little math as
possible, and illustrates everything with simple examples and diagrams.</p>
<p><center>
<svg
role="img"
aria-label="Same diagram as before, duplicated, with the bottom line missing one person in the database. A double arrow labeled "basically the same" points to the two outputs."
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
version="1.1"
viewBox="0 0 734.41882 358.09845"
stroke-miterlimit="10"
id="svg4123"
sodipodi:docname="full-diagram.svg"
inkscape:version="0.92.2pre0 (973e216, 2017-07-25)"
width="734.41882"
height="358.09845"
style="fill:none;stroke:none;stroke-linecap:square;stroke-miterlimit:10">
<title>Same diagram as before, duplicated, with the bottom line missing one person in the database. A double arrow labeled "basically the same" points to the two outputs.</title>
<desc>Same diagram as before, duplicated, with the bottom line missing one person in the database. A double arrow labeled "basically the same" points to the two outputs.</desc>
<metadata
id="metadata4129">
<rdf:RDF>
<cc:Work
rdf:about="">
<dc:format>image/svg+xml</dc:format>
<dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
<dc:title></dc:title>
</cc:Work>
</rdf:RDF>
</metadata>
<defs
id="defs4127">
<clipPath
id="p.3-4">
<path
d="M 0,0 H 275 V 252 H 0 Z"
id="path4077-0"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
</defs>
<sodipodi:namedview
pagecolor="#ffffff"
bordercolor="#666666"
borderopacity="1"
objecttolerance="10"
gridtolerance="10"
guidetolerance="10"
inkscape:pageopacity="0"
inkscape:pageshadow="2"
inkscape:window-width="1918"
inkscape:window-height="1078"
id="namedview4125"
showgrid="false"
inkscape:zoom="1.3111111"
inkscape:cx="532.33439"
inkscape:cy="127.30279"
inkscape:window-x="0"
inkscape:window-y="0"
inkscape:window-maximized="0"
inkscape:current-layer="g4121"
fit-margin-top="0"
fit-margin-left="0"
fit-margin-right="0"
fit-margin-bottom="0"
viewbox-x="82"
viewbox-width="650" />
<clipPath
id="p.0">
<path
d="M 0,0 H 960 V 720 H 0 Z"
id="path4036"
inkscape:connector-curvature="0"
style="clip-rule:nonzero" />
</clipPath>
<g
clip-path="url(#p.0)"
id="g4121"
transform="translate(-76.086586,-124.34907)">
<path
d="M 0,0 H 960 V 720 H 0 Z"
id="path4039"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="m 76.08921,124.34908 h 223.2992 V 267.11813 H 76.08921 Z"
id="path4041"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<g
transform="matrix(0.22307612,0,0,0.22307664,76.089214,124.34908)"
id="g4048">
<clipPath
id="p.1">
<path
d="M 0,0 H 1001 V 640 H 0 Z"
id="path4043"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
<image
clip-path="url(#p.1)"
width="1001"
height="640"
x="0"
y="0"
preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/quiestce.jpeg"
id="image4046"
style="fill:#000000" />
</g>
<path
d="M 76.086586,339.67847 H 299.38579 V 482.44751 H 76.086586 Z"
id="path4050"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<g
transform="matrix(0.22307612,0,0,0.22307664,76.086589,339.67847)"
id="g4057">
<clipPath
id="p.2">
<path
d="M 0,0 H 1001 V 640 H 0 Z"
id="path4052"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
<image
clip-path="url(#p.2)"
width="1001"
height="640"
x="0"
y="0"
preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/quiestce.jpeg"
id="image4055"
style="fill:#000000" />
</g>
<path
d="m 604.81974,178.70273 v 0 c -1.51233,-10.41394 3.45307,-20.72311 12.78925,-26.55296 9.33618,-5.82984 21.40576,-6.15791 31.08722,-0.84502 v 0 c 3.42944,-6.0551 9.70636,-10.23573 16.93207,-11.27733 7.2257,-1.04159 14.55151,1.17818 19.76159,5.98789 v 0 c 2.92145,-5.48995 8.65778,-9.17848 15.1734,-9.75675 6.51569,-0.57824 12.88855,2.03563 16.85718,6.9141 v 0 c 5.27802,-5.81934 13.67554,-8.2695 21.55884,-6.29033 7.8833,1.97917 13.83649,8.03222 15.28363,15.53996 v 0 c 6.46649,1.65271 11.8529,5.85413 14.76764,11.51872 2.91473,5.66462 3.07178,12.23673 0.43066,18.01831 v 0 c 6.36756,7.76532 7.85712,18.11251 3.91272,27.18018 -3.9444,9.06767 -12.7301,15.49351 -23.07849,16.87953 -0.0729,8.51033 -5.05414,16.31931 -13.02356,20.41701 -7.96942,4.09772 -17.68268,3.84427 -25.39587,-0.66263 -3.2854,10.1926 -12.53272,17.6922 -23.74671,19.25873 -11.21405,1.56653 -22.38446,-3.08087 -28.68524,-11.93434 -7.72339,4.3639 -16.99084,5.621 -25.71185,3.48773 -8.72095,-2.13329 -16.16077,-7.47725 -20.64124,-14.8264 v 0 c -7.89233,0.86539 -15.52319,-2.96604 -19.10534,-9.59276 -3.58216,-6.62671 -2.35303,-14.63803 3.07733,-20.05798 v 0 c -7.04022,-3.8826 -10.63251,-11.58696 -8.90375,-19.09555 1.72882,-7.50861 8.38696,-13.11998 16.50256,-13.90799 z"
id="path4059"
inkscape:connector-curvature="0"
style="fill:#f6cd4c;fill-rule:evenodd" />
<path
d="m 597.06497,212.10764 v 0 c 3.32227,1.83222 7.16028,2.66335 10.99872,2.38181 m 5.02759,27.2696 c 1.65063,-0.18098 3.26861,-0.56427 4.81219,-1.13993 m 41.53858,12.47812 c -1.16095,-1.63134 -2.133,-3.37456 -2.89954,-5.19996 m 55.3335,-2.12474 v 0 c 0.59893,-1.85826 0.98706,-3.77081 1.15777,-5.70575 m 37.26019,-14.04791 c 0.0776,-9.06056 -5.41461,-17.3565 -14.11767,-21.32448 m 33.2832,-22.73425 c -1.40942,3.08528 -3.56109,5.8222 -6.28625,7.99617 m -8.91071,-37.53403 v 0 c 0.24011,1.24591 0.35126,2.51055 0.33185,3.77639 m -37.17371,-13.02565 v 0 c -1.31659,1.45164 -2.4013,3.07382 -3.22033,4.81604 m -28.81055,-1.97402 v 0 c -0.70166,1.31851 -1.22559,2.71372 -1.55975,4.15347 m -35.13446,1.13608 v 0 c 2.04858,1.12421 3.94384,2.47731 5.64404,4.02959 m -49.5199,23.36855 v 0 c 0.20844,1.4353 0.53784,2.85292 0.98499,4.23957"
id="path4061"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="m 636.30257,211.8428 h -0.1875 q -0.78125,0 -1.25,-0.48437 -0.46875,-0.48438 -0.46875,-1.17188 0,-0.46875 0.5,-2.92187 l 1.375,-7.10938 q 0.46875,-2.42187 2.0625,-13.4375 l 0.39063,-2.70312 q 0.14062,-1.01563 1.10937,-1.84375 0.96875,-0.84375 1.78125,-0.84375 0.4375,0 2.9375,1.3125 2.70313,1.45312 3.20313,1.59375 5.09375,1.875 8.375,5.6875 3.29687,3.8125 3.29687,8.23437 0,4.04688 -2.40625,7.92188 -2.39062,3.875 -5.78125,5.71875 -3.39062,1.84375 -8.67187,1.84375 -1.53125,0 -3.46875,-0.53125 -1.9375,-0.51563 -2.79688,-1.26563 z m 6,-25.40625 -1.64062,11.20313 -2,10.67187 q 0.0625,0.0469 0.14062,0.0781 1.84375,1.32812 4.375,1.32812 5.01563,0 7.375,-1.79687 2.35938,-1.79688 3.60938,-4.21875 1.25,-2.42188 1.25,-4.92188 0,-2.51562 -1.40625,-4.64062 -1.39063,-2.125 -3.76563,-3.57813 -2.35937,-1.46875 -7.9375,-4.125 z m 32.16406,24.28125 q -4.21875,2.03125 -6.01562,2.03125 -7.29688,0 -7.29688,-6.95312 0,-6.46875 4.07813,-10.46875 4.09375,-4 8.98437,-4 1.89063,0 3.90625,0.96875 2.03125,0.96875 2.03125,2.25 0,0.79687 -0.53125,1.25 -0.39062,0.76562 -1.10937,4.40625 -0.70313,3.64062 -0.70313,5.25 0,1.96875 1.01563,5.42187 l 0.0781,0.3125 q -0.4375,1.82813 -2.17187,1.82813 -0.29688,0 -1.09375,-0.79688 -0.79688,-0.79687 -1.17188,-1.5 z m 1.42188,-15.1875 q -1.21875,-0.73437 -1.95313,-0.73437 -3.25,0 -6.07812,2.95312 -2.82813,2.9375 -2.82813,7.3125 0,4.29688 3.71875,4.29688 2.90625,0 5.46875,-2.15625 0.3125,-8.1875 1.67188,-11.67188 z m 16.45312,-0.34375 q -0.23437,0.9375 -0.60937,2.1875 -1.64063,5.42188 -1.64063,8.26563 0,3.35937 1.5625,3.35937 1.54688,0 2.28125,-0.95312 1.01563,-1.32813 1.875,-1.32813 0.65625,0 1.25,0.48438 0.59375,0.48437 0.59375,1.26562 0,1.71875 -2.15625,2.95313 -2.15625,1.21875 -4.29687,1.21875 -5.03125,0 -5.03125,-6.28125 0,-3.4375 1.54687,-8.39063 0.40625,-1.3125 0.78125,-2.625 -4.25,-0.20312 -4.85937,-0.70312 -0.60938,-0.5 -0.60938,-1.5625 0,-0.85938 0.57813,-1.42188 0.59375,-0.57812 1.60937,-0.57812 l 4.03125,0.32812 q 0.125,-0.51562 0.1875,-1.17187 
0.0312,-0.67188 0.125,-1.48438 l 0.34375,-2.42187 q 0.10938,-0.79688 0.71875,-1.34375 0.625,-0.54688 1.46875,-0.54688 1.64063,0 1.64063,2.625 0,0.76563 -0.23438,1.98438 l -0.39062,2.35937 q 2.60937,-0.32812 2.78125,-0.32812 2.03125,0 2.79687,0.40625 0.76563,0.39062 0.76563,1.46875 0,0.95312 -0.65625,1.54687 -0.65625,0.57813 -1.57813,0.57813 l -1.92187,-0.0781 q -1.45313,0 -2.95313,0.1875 z m 21.44141,15.53125 q -4.21875,2.03125 -6.01563,2.03125 -7.29687,0 -7.29687,-6.95312 0,-6.46875 4.07812,-10.46875 4.09375,-4 8.98438,-4 1.89062,0 3.90625,0.96875 2.03125,0.96875 2.03125,2.25 0,0.79687 -0.53125,1.25 -0.39063,0.76562 -1.10938,4.40625 -0.70312,3.64062 -0.70312,5.25 0,1.96875 1.01562,5.42187 l 0.0781,0.3125 q -0.4375,1.82813 -2.17188,1.82813 -0.29687,0 -1.09375,-0.79688 -0.79687,-0.79687 -1.17187,-1.5 z m 1.42187,-15.1875 q -1.21875,-0.73437 -1.95312,-0.73437 -3.25,0 -6.07813,2.95312 -2.82812,2.9375 -2.82812,7.3125 0,4.29688 3.71875,4.29688 2.90625,0 5.46875,-2.15625 0.3125,-8.1875 1.67187,-11.67188 z"
id="path4063"
inkscape:connector-curvature="0"
style="fill:#000000;fill-rule:nonzero" />
<path
d="m 604.81711,394.03214 v 0 c -1.51233,-10.41397 3.45307,-20.72311 12.78925,-26.55298 9.33618,-5.82983 21.40576,-6.1579 31.08722,-0.845 v 0 c 3.42944,-6.05511 9.70636,-10.23575 16.93207,-11.27734 7.2257,-1.0416 14.55151,1.17819 19.76159,5.98788 v 0 c 2.92145,-5.48993 8.65778,-9.17847 15.1734,-9.75674 6.51569,-0.57825 12.88855,2.03564 16.85718,6.91409 v 0 c 5.27802,-5.81934 13.67554,-8.2695 21.55884,-6.29031 7.8833,1.97915 13.83649,8.03222 15.28363,15.53994 v 0 c 6.46649,1.65271 11.8529,5.85413 14.76764,11.51874 2.91473,5.66461 3.07178,12.23673 0.43066,18.01831 v 0 c 6.36756,7.76532 7.85712,18.11249 3.91272,27.18018 -3.9444,9.06766 -12.7301,15.4935 -23.07849,16.87952 -0.0729,8.51034 -5.05414,16.3193 -13.02356,20.41702 -7.96942,4.09769 -17.68268,3.84427 -25.39587,-0.66266 -3.2854,10.19263 -12.53272,17.69223 -23.74671,19.25876 -11.21405,1.56653 -22.38446,-3.08087 -28.68524,-11.93436 -7.72339,4.36392 -16.99084,5.621 -25.71185,3.48773 -8.72095,-2.13327 -16.16077,-7.47723 -20.64124,-14.82638 v 0 c -7.89233,0.86539 -15.52319,-2.96607 -19.10534,-9.59277 -3.58216,-6.62671 -2.35303,-14.63804 3.07733,-20.05799 v 0 c -7.04022,-3.8826 -10.63257,-11.58694 -8.90375,-19.09555 1.72882,-7.5086 8.38696,-13.11996 16.50256,-13.90799 z"
id="path4065"
inkscape:connector-curvature="0"
style="fill:#f6cd4c;fill-rule:evenodd" />
<path
d="m 597.06237,427.43705 v 0 c 3.32227,1.83218 7.16028,2.66333 10.99872,2.38181 m 5.02759,27.26959 c 1.65063,-0.181 3.26861,-0.56427 4.81219,-1.13992 m 41.53858,12.47811 c -1.16095,-1.63134 -2.133,-3.37457 -2.89954,-5.19998 m 55.3335,-2.12472 v 0 c 0.59893,-1.85825 0.98706,-3.77082 1.15777,-5.70575 m 37.26019,-14.04792 c 0.0776,-9.06054 -5.41461,-17.3565 -14.11767,-21.32449 m 33.2832,-22.73425 c -1.40942,3.0853 -3.56109,5.82223 -6.28625,7.99618 m -8.91071,-37.53402 v 0 c 0.24011,1.24591 0.35126,2.51056 0.33185,3.77639 m -37.17371,-13.02566 v 0 c -1.31659,1.45163 -2.4013,3.07382 -3.22033,4.81604 m -28.81055,-1.974 v 0 c -0.70166,1.31848 -1.22559,2.71372 -1.55975,4.15347 m -35.13446,1.13608 v 0 c 2.04858,1.12421 3.94384,2.4773 5.64404,4.02957 m -49.5199,23.36856 v 0 c 0.20844,1.43531 0.53784,2.85291 0.98499,4.23957"
id="path4067"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="m 636.29997,427.17219 h -0.1875 q -0.78125,0 -1.25,-0.48437 -0.46875,-0.48438 -0.46875,-1.17188 0,-0.46875 0.5,-2.92187 l 1.375,-7.10938 q 0.46875,-2.42187 2.0625,-13.4375 l 0.39063,-2.70312 q 0.14062,-1.01563 1.10937,-1.84375 0.96875,-0.84375 1.78125,-0.84375 0.4375,0 2.9375,1.3125 2.70313,1.45312 3.20313,1.59375 5.09375,1.875 8.375,5.6875 3.29687,3.8125 3.29687,8.23437 0,4.04688 -2.40625,7.92188 -2.39062,3.875 -5.78125,5.71875 -3.39062,1.84375 -8.67187,1.84375 -1.53125,0 -3.46875,-0.53125 -1.9375,-0.51563 -2.79688,-1.26563 z m 6,-25.40625 -1.64062,11.20313 -2,10.67187 q 0.0625,0.0469 0.14062,0.0781 1.84375,1.32812 4.375,1.32812 5.01563,0 7.375,-1.79687 2.35938,-1.79688 3.60938,-4.21875 1.25,-2.42188 1.25,-4.92188 0,-2.51562 -1.40625,-4.64062 -1.39063,-2.125 -3.76563,-3.57813 -2.35937,-1.46875 -7.9375,-4.125 z m 32.16406,24.28125 q -4.21875,2.03125 -6.01562,2.03125 -7.29688,0 -7.29688,-6.95312 0,-6.46875 4.07813,-10.46875 4.09375,-4 8.98437,-4 1.89063,0 3.90625,0.96875 2.03125,0.96875 2.03125,2.25 0,0.79687 -0.53125,1.25 -0.39062,0.76562 -1.10937,4.40625 -0.70313,3.64062 -0.70313,5.25 0,1.96875 1.01563,5.42187 l 0.0781,0.3125 q -0.4375,1.82813 -2.17187,1.82813 -0.29688,0 -1.09375,-0.79688 -0.79688,-0.79687 -1.17188,-1.5 z m 1.42188,-15.1875 q -1.21875,-0.73437 -1.95313,-0.73437 -3.25,0 -6.07812,2.95312 -2.82813,2.9375 -2.82813,7.3125 0,4.29688 3.71875,4.29688 2.90625,0 5.46875,-2.15625 0.3125,-8.1875 1.67188,-11.67188 z m 16.45312,-0.34375 q -0.23437,0.9375 -0.60937,2.1875 -1.64063,5.42188 -1.64063,8.26563 0,3.35937 1.5625,3.35937 1.54688,0 2.28125,-0.95312 1.01563,-1.32813 1.875,-1.32813 0.65625,0 1.25,0.48438 0.59375,0.48437 0.59375,1.26562 0,1.71875 -2.15625,2.95313 -2.15625,1.21875 -4.29687,1.21875 -5.03125,0 -5.03125,-6.28125 0,-3.4375 1.54687,-8.39063 0.40625,-1.3125 0.78125,-2.625 -4.25,-0.20312 -4.85937,-0.70312 -0.60938,-0.5 -0.60938,-1.5625 0,-0.85938 0.57813,-1.42188 0.59375,-0.57812 1.60937,-0.57812 l 4.03125,0.32812 q 0.125,-0.51562 0.1875,-1.17187 
0.0312,-0.67188 0.125,-1.48438 l 0.34375,-2.42187 q 0.10938,-0.79688 0.71875,-1.34375 0.625,-0.54688 1.46875,-0.54688 1.64063,0 1.64063,2.625 0,0.76563 -0.23438,1.98438 l -0.39062,2.35937 q 2.60937,-0.32812 2.78125,-0.32812 2.03125,0 2.79687,0.40625 0.76563,0.39062 0.76563,1.46875 0,0.95312 -0.65625,1.54687 -0.65625,0.57813 -1.57813,0.57813 l -1.92187,-0.0781 q -1.45313,0 -2.95313,0.1875 z m 21.44141,15.53125 q -4.21875,2.03125 -6.01563,2.03125 -7.29687,0 -7.29687,-6.95312 0,-6.46875 4.07812,-10.46875 4.09375,-4 8.98438,-4 1.89062,0 3.90625,0.96875 2.03125,0.96875 2.03125,2.25 0,0.79687 -0.53125,1.25 -0.39063,0.76562 -1.10938,4.40625 -0.70312,3.64062 -0.70312,5.25 0,1.96875 1.01562,5.42187 l 0.0781,0.3125 q -0.4375,1.82813 -2.17188,1.82813 -0.29687,0 -1.09375,-0.79688 -0.79687,-0.79687 -1.17187,-1.5 z m 1.42187,-15.1875 q -1.21875,-0.73437 -1.95312,-0.73437 -3.25,0 -6.07813,2.95312 -2.82812,2.9375 -2.82812,7.3125 0,4.29688 3.71875,4.29688 2.90625,0 5.46875,-2.15625 0.3125,-8.1875 1.67187,-11.67188 z"
id="path4069"
inkscape:connector-curvature="0"
style="fill:#000000;fill-rule:nonzero" />
<path
d="m 299.38846,189.04987 h 70.36065 v -6.68504 l 8.40525,13.37008 -8.40525,13.37007 v -6.68502 h -70.36065 z"
id="path4071"
inkscape:connector-curvature="0"
style="fill:#7c7ce0;fill-rule:evenodd;stroke-width:0.79288208" />
<path
d="M 427.88773,353.66272 H 553.16596 V 468.46329 H 427.88773 Z"
id="path4075"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<g
transform="matrix(0.45555722,0,0,0.45555774,379.83689,353.66273)"
id="g4082">
<clipPath
id="p.3">
<path
d="M 0,0 H 275 V 252 H 0 Z"
id="path4077"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
<image
clip-path="url(#p.3)"
width="275"
height="252"
x="0"
y="0"
preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/magic.gif"
id="image4080"
style="fill:#000000" />
</g>
<path
d="m 660.00147,264.20093 21.66931,-21.66928 21.66932,21.66928 h -10.83466 v 85.7638 h 10.83466 l -21.66932,21.66928 -21.66931,-21.66928 h 10.83466 v -85.7638 z"
id="path4088"
inkscape:connector-curvature="0"
style="fill:#db4437;fill-rule:evenodd" />
<path
d="m 660.00147,264.20093 21.66931,-21.66928 21.66932,21.66928 h -10.83466 v 85.7638 h 10.83466 l -21.66932,21.66928 -21.66931,-21.66928 h 10.83466 v -85.7638 z"
id="path4090"
inkscape:connector-curvature="0"
style="fill-rule:evenodd;stroke:#666666;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round" />
<path
d="m 554.53672,285.04017 c 42.66144,-12.09317 85.32288,12.0932 127.98438,0 42.66143,-12.09317 85.32287,12.0932 127.98431,0 l -1.70062,50.79135 c -42.66144,12.09317 -85.32288,-12.09317 -127.98432,0 -42.66143,12.09317 -85.32294,-12.09317 -127.98437,0 z"
id="path4092"
inkscape:connector-curvature="0"
style="fill:#db4437;fill-rule:evenodd" />
<path
d="m 587.72404,321.45273 q -0.84375,-0.5 -1.26562,-1.0625 -0.40625,-0.5625 -0.40625,-1.07812 0,-0.67188 0.67187,-1.07813 0.125,-0.0625 0.35938,-0.0625 0.3125,0 0.67187,0.10938 0.375,0.10937 0.57813,0.29687 0.82812,0.76563 1.98437,0.76563 0.5,0 1.46875,-0.21875 1.25,-0.39063 1.96875,-0.95313 0.71875,-0.57812 1.70313,-1.79687 0.60937,-0.79688 0.90625,-1.60938 0.3125,-0.82812 0.3125,-1.5 0,-0.73437 -0.34375,-1.15625 -0.64063,-0.85937 -1.4375,-1.28125 -0.78125,-0.4375 -1.82813,-0.4375 -0.9375,0 -1.82812,0.25 l -1.85938,0.39063 q -0.125,0.0312 -0.34375,0.0469 -0.20312,0.0156 -0.59375,0.95313 -0.375,0.92187 -1.29687,3.51562 -0.67188,1.875 -1.125,2.70313 -0.45313,0.8125 -0.79688,0.8125 -0.39062,0 -0.82812,-0.70313 -0.0937,-0.15625 -0.40625,-0.64062 -0.29688,-0.48438 -0.29688,-0.9375 0,-0.21875 0.15625,-0.53125 0.15625,-0.29688 0.26563,-0.57813 0.10937,-0.29687 0.25,-0.79687 0.40625,-1.25 2.40625,-6.20313 2,-4.96875 3.125,-7.5 0.79687,-1.82812 0.48437,-1.92187 0,-0.0937 -0.10937,-0.28125 -0.0937,-0.20313 -0.0625,-0.29688 0,-0.1875 1.14062,-0.5625 1.14063,-0.39062 1.54688,-0.32812 h 0.20312 q 0.40625,0 0.40625,-0.15625 0.0625,0 0.59375,-0.0625 0.53125,-0.0625 1.14063,0 2.26562,0.1875 3.57812,1.29687 1.3125,1.09375 1.28125,3.23438 -0.0625,0.84375 -0.92187,2.28125 -0.85938,1.4375 -1.89063,2.35937 -0.51562,0.45313 -1.17187,0.95313 -0.65625,0.5 -0.84375,0.625 l 0.79687,0.28125 q 1.5,0.57812 2.32813,1.42187 0.84375,0.82813 1.1875,2.26563 0.0781,0.3125 0.0781,0.57812 0,0.70313 -0.25,1.59375 -0.23437,0.89063 -0.53125,1.5625 -0.1875,0.20313 -0.3125,0.42188 -0.125,0.21875 -0.15625,0.28125 -0.0312,0.14062 -0.57812,0.98437 -0.54688,0.84375 -0.96875,1.35938 -2.10938,1.79687 -3.48438,2.29687 -1.34375,0.60938 -3.20312,0.60938 -1.65625,0 -2.45313,-0.51563 z m 2.29688,-13.26562 q 0.32812,-0.0781 1.28125,-0.28125 0.95312,-0.20313 1.375,-0.35938 0.0625,-0.0312 0.39062,-0.20312 0.34375,-0.1875 0.70313,-0.54688 2.01562,-1.875 2.8125,-3.39062 0.79687,-1.3125 0.79687,-1.84375 0,-0.39063 
-0.34375,-0.75 -0.60937,-0.70313 -2.01562,-0.70313 -0.60938,0 -1.03125,0.14063 -0.45313,0.125 -0.65625,0.29687 -0.20313,0.17188 -0.20313,0.53125 0,0.25 -0.17187,0.5625 -0.15625,0.29688 -0.34375,0.48438 -0.0312,0.20312 -0.17188,0.39062 -0.14062,0.1875 -0.21875,0.25 0,0.0781 -0.4375,0.875 -0.42187,0.79688 -0.79687,1.78125 -0.96875,2.28125 -0.96875,2.76563 z m 11.90643,10.29687 q -0.0937,-0.1875 -0.54688,-0.5 -0.45312,-0.32812 -0.45312,-0.39062 0,-0.14063 -0.14063,-0.375 -0.14062,-0.23438 -0.26562,-0.4375 -0.26563,-0.25 0.34375,-1.65625 0.60937,-1.40625 1.64062,-3.32813 0.1875,-0.1875 0.48438,-0.67187 0.3125,-0.48438 0.40625,-0.54688 0,-0.125 0.0937,-0.23437 0.0937,-0.10938 0.23437,-0.10938 l 0.25,-0.51562 q 0.125,-0.29688 1.25,-1.375 1.125,-1.09375 1.76563,-1.46875 0.59375,-0.42188 1.3125,-0.42188 0.76562,0 1.29687,0.45313 l 0.875,0.73437 q 0.95313,0.79688 1.07813,0.92188 l 1.1875,1.0625 -0.54688,1.59375 q -0.28125,0.85937 -0.28125,1.64062 0,1.20313 0.67188,2.26563 0.15625,0.34375 0.42187,0.54687 0.28125,0.1875 0.73438,0.25 0.70312,0.125 0.70312,0.70313 0,0.70312 -0.26562,1.0625 -0.25,0.34375 -0.60938,0.4375 -0.5,0.0781 -0.70312,0.0781 -0.92188,0 -1.6875,-0.48437 -0.76563,-0.48438 -1.34375,-1.64063 -0.125,-0.1875 -0.34375,-0.6875 -0.20313,-0.51562 -0.23438,-0.71875 -0.21875,0.32813 -0.73437,0.84375 -3.29688,3.32813 -5.34375,3.32813 -0.70313,0 -1.25,-0.35938 z m 1.76562,-2.65625 q 0.5,0.0625 1.98438,-1.15625 0.15625,-0.125 0.34375,-0.29687 0.1875,-0.17188 0.42187,-0.375 0.89063,-0.73438 1.39063,-1.21875 0.5,-0.5 1,-1.20313 l 0.70312,-1.15625 q -0.3125,-0.70312 -0.57812,-0.9375 -0.25,-0.25 -0.5,-0.25 -0.64063,0 -1.60938,1.14063 -0.95312,1.14062 -2.29687,3.4375 -0.35938,0.57812 -0.57813,1.20312 -0.21875,0.625 -0.28125,0.8125 z m 14.94391,3.8125 q -0.78125,0 -1.59375,-0.32812 -0.8125,-0.3125 -1.375,-0.78125 -0.5625,-0.46875 -0.5625,-0.84375 0,-0.39063 0.35938,-0.48438 l 0.15625,-0.125 q 0.0625,-0.0312 0.25,-0.0312 0.14062,-0.0312 0.32812,0.0156 0.1875,0.0469 
0.3125,0.0469 0.23438,0.0625 0.67188,0.125 l 0.84375,0.0937 q 0.64062,0 1.1875,-0.125 0.5625,-0.125 0.5625,-0.28125 0,-0.10937 -0.25,-0.32812 -0.25,-0.21875 -0.54688,-0.21875 0,-0.125 -0.98437,-0.85938 -0.54688,-0.42187 -1.04688,-0.82812 -0.5,-0.42188 -0.5,-0.625 l 0.0312,-0.0625 v -0.15625 q -0.21875,0 -0.42187,-0.60938 -0.1875,-0.60937 -0.1875,-1.1875 0,-1.07812 0.95312,-2.39062 0.96875,-1.3125 2.39063,-2.35938 1.4375,-1.0625 2.71875,-1.45312 0.46875,-0.21875 1.40625,-0.21875 0.60937,0 0.82812,0.125 l 0.25,-0.0625 q 0.39063,0 1,0.64062 0.60938,0.64063 0.76563,1.01563 0,1.0625 -0.0625,1.70312 -0.0625,0.64063 -0.32813,0.64063 -0.0312,0 -0.0937,0.0312 -0.0625,0.0312 -0.0625,0.15625 0,0.45312 -0.78125,1 -0.78125,0.53125 -1.42187,0.53125 l -0.125,-0.0312 q -0.20313,-0.0312 -0.57813,-0.375 -0.35937,-0.35938 -0.35937,-0.70313 0,-0.0937 0.375,-0.51562 0.39062,-0.42188 0.65625,-0.8125 0.28125,-0.40625 0.28125,-0.85938 0,-0.15625 -0.10938,-0.20312 -0.10937,-0.0469 -0.4375,-0.0469 -1.75,0.79688 -2.51562,1.5 -0.10938,0.20313 -0.1875,0.29688 -0.0781,0.0937 -0.20313,0.0937 0,0 0,-0.0156 0,-0.0156 -0.0312,-0.0156 -0.0625,0 -0.42187,0.375 -0.34375,0.35937 -0.65625,0.84375 -0.29688,0.46875 -0.35938,0.89062 l -0.0625,0.0937 v 0.15625 q -0.0312,0.0625 -0.0312,0.26563 0,0.28125 0.25,0.59375 0.25,0.29687 1.03125,0.84375 1.4375,1.09375 1.95313,1.625 0.53125,0.51562 0.67187,0.9375 0.28125,0.57812 0.28125,1.01562 0,1.09375 -1.14062,1.6875 -1.125,0.59375 -3.07813,0.59375 z m 13.21564,-13.79687 q -0.10937,0.0937 -0.20312,0.0937 -0.125,0 -0.40625,-0.1875 -0.32813,-0.0312 -0.85938,-0.54688 -0.51562,-0.51562 -0.51562,-0.67187 -0.0312,-0.39063 0.0625,-1.10938 0.0937,-0.71875 0.35937,-1.28125 0.28125,-0.57812 0.71875,-0.60937 0.23438,0.0312 0.60938,0.14062 0.39062,0.10938 0.875,0.29688 0.59375,1.21875 0.59375,2.07812 0,1.28125 -1.23438,1.79688 z m -2.95312,13.28125 q -0.76563,0 -1.32813,-0.79688 -0.5625,-0.79687 -0.6875,-1.98437 0.125,-1.1875 0.1875,-1.73438 0.35938,-1.5625 0.89063,-3.6875 
0.53125,-2.14062 0.625,-2.35937 0.15625,-0.28125 0.5,-0.28125 0.35937,0 0.78125,0.26562 0.4375,0.26563 0.73437,0.75 0.25,0.35938 0.25,0.79688 0,0.67187 -0.64062,2.82812 -0.4375,1.59375 -0.57813,2.26563 -0.125,0.67187 -0.0937,1.4375 -0.125,1.4375 -0.25,1.98437 -0.125,0.54688 -0.39062,0.51563 z m 6.8189,-0.21875 q -0.95312,0 -1.25,-0.10938 -0.28125,-0.125 -0.70312,-0.5 -0.0937,-0.23437 -0.40625,-0.60937 -0.29688,-0.39063 -0.48438,-0.57813 -0.0937,-0.35937 -0.0937,-0.82812 0,-0.84375 0.23438,-1.95313 0.23437,-1.125 0.59375,-2.04687 0.125,-0.26563 0.20312,-0.48438 0.0781,-0.21875 0.10938,-0.28125 0,-0.23437 0.8125,-1.39062 0.82812,-1.17188 1.3125,-1.625 0.21875,-0.25 0.67187,-0.53125 0.46875,-0.29688 0.82813,-0.42188 0.0937,-0.0312 0.78125,-0.28125 0.6875,-0.26562 1.26562,-0.26562 0.51563,0 1,0.20312 0.5,0.20313 0.82813,0.65625 0.82812,0.82813 0.82812,2.17188 0,0.26562 -0.0625,0.57812 -0.17187,0.64063 -0.67187,1.04688 -0.48438,0.39062 -1.03125,0.39062 -0.54688,0 -0.92188,-0.40625 -0.125,-0.32812 -0.125,-0.42187 0,-0.0937 0.125,-0.54688 0.125,-0.57812 0.125,-0.73437 0,-0.40625 -0.28125,-0.40625 -0.39062,0 -0.625,0.1875 -0.21875,0.1875 -0.82812,0.82812 -0.40625,0.45313 -0.96875,1.29688 -0.5625,0.84375 -0.875,1.60937 -0.29688,0.875 -0.40625,1.25 -0.10938,0.35938 -0.10938,0.42188 0,0.125 -0.0312,0.23437 -0.0312,0.10938 -0.0312,0.25 0,0.21875 0.0937,0.34375 0.0937,0.125 0.34375,0.26563 0.0781,0.0312 0.23438,0.125 0.15625,0.0937 0.3125,0.0937 0.20312,0 0.54687,-0.15625 0.35938,-0.17188 0.45313,-0.20313 0.40625,-0.1875 1.17187,-0.82812 l 0.78125,-0.57813 q 0.21875,-0.15625 0.39063,-0.26562 0.1875,-0.10938 0.28125,-0.15625 0.0937,-0.0469 0.0937,0.0156 0,0.125 0.0937,0.125 0.0937,0 0.21875,-0.125 0.125,-0.20312 0.45312,-0.20312 0.25,0 0.51563,0.15625 0.28125,0.14062 0.28125,0.23437 l -0.0312,0.15625 q 0,0.15625 0.15625,0.32813 0,0.125 -0.0781,0.32812 -0.0781,0.20313 -0.20312,0.34375 -0.32813,0.46875 -0.73438,0.98438 -0.39062,0.51562 -0.54687,0.70312 h -0.0312 q -0.15625,0 
-0.45313,0.29688 -0.46875,0.4375 -0.82812,0.4375 -0.125,0.0625 -0.29688,0.15625 -0.15625,0.0781 -0.15625,0.17187 -0.53125,0.54688 -2.875,0.54688 z m 10.59815,-0.32813 q -0.54688,0.125 -1.3125,-0.57812 -0.76563,-0.70313 -0.90625,-1.3125 -0.0625,-0.3125 -0.0625,-0.54688 0,-1.23437 0.79687,-3.01562 0.8125,-1.78125 2.17188,-3.34375 1.35937,-1.5625 2.98437,-2.26563 0.67188,-0.35937 0.85938,-0.35937 0.20312,0 0.875,0.35937 0.64062,0.15625 1.01562,0.34375 0.39063,0.1875 0.70313,0.60938 0.42187,0.82812 0.60937,2.04687 0.20313,0.67188 0.20313,0.95313 0,0.26562 -0.20313,0.45312 -0.0312,0.0312 -0.10937,0.125 -0.0781,0.0937 -0.0781,0.15625 0,0.0469 0.0625,0.0469 0,0.0312 -0.0312,0.14062 -0.0312,0.10938 -0.125,0.29688 -0.39062,1.21875 -0.39062,1.89062 0,0.21875 0.0625,0.54688 0.32812,0.82812 0.8125,0.82812 0.4375,0 1.01562,-0.57812 0.48438,-0.3125 0.96875,-0.3125 0.28125,0 0.28125,0.15625 0.0312,0.0312 0.125,0.0937 0.0937,0.0625 0.1875,0.0625 0.14063,0 0.14063,0.29687 0,0.25 -0.125,0.67188 -0.10938,0.40625 -0.26563,0.67187 -0.21875,0.21875 -0.35937,0.39063 -0.125,0.15625 -0.125,0.25 l -0.46875,0.4375 q -0.60938,0.35937 -1.45313,0.35937 -0.85937,0 -1.375,-0.35937 -0.5,-0.34375 -0.95312,-0.875 -0.45313,-0.53125 -0.57813,-0.9375 l -0.15625,-0.32813 -0.9375,0.60938 -0.79687,0.82812 q -0.25,0.32813 -0.4375,0.32813 h -0.32813 q 0,0.21875 -0.90625,0.54687 -0.89062,0.3125 -1.39062,0.3125 z m 0.67187,-2.6875 q 0.34375,0 1.4375,-0.8125 1.10938,-0.8125 1.5,-1.29687 0.125,-0.15625 0.32813,-0.45313 0.21875,-0.3125 0.375,-0.46875 0.51562,-0.60937 0.75,-0.96875 0.25,-0.375 0.25,-0.73437 0,-0.25 -0.23438,-0.70313 -0.125,-0.39062 -0.375,-0.39062 -0.14062,0 -0.73437,0.23437 -0.60938,0.34375 -1.4375,1.4375 -0.8125,1.07813 -1.42188,2.20313 -0.60937,1.125 -0.60937,1.64062 0,0.0937 0.0625,0.1875 0.0781,0.0937 0.10937,0.125 z m 11.06427,4.21875 q -0.48437,-0.15625 -0.71875,-0.57812 -0.23437,-0.40625 -0.23437,-1.20313 0,-1.3125 0.25,-2.5625 l 0.0312,-0.25 q 0.0312,-0.20312 0.0937,-0.625 0.0625,-0.4375 
0.0937,-0.25 l 0.20312,-0.85937 q 0.21875,-0.89063 0.42188,-1.65625 0.21875,-0.78125 0.34375,-1.125 l 0.34375,-1.125 q 0.96875,-3.10938 1.79687,-4.1875 0.0312,-0.125 0.0937,-0.23438 0.0625,-0.125 0.0937,-0.25 0,-0.51562 0.20313,-0.79687 0.28125,-0.48438 0.46875,-0.85938 l 0.14062,-0.35937 0.89063,-1.82813 q 0.0937,-0.3125 0.39062,-0.6875 0.3125,-0.39062 0.625,-0.53125 l -0.0312,-0.0312 q 0.17188,-0.25 0.59375,-0.4375 0.4375,-0.20312 0.75,-0.20312 0.42188,0 0.67188,0.26562 0.26562,0.25 0.26562,0.70313 v 0.3125 q 0,0.29687 -0.0469,0.45312 -0.0469,0.15625 -0.21875,0.39063 h 0.0312 q 0.0469,0.0312 -0.0469,0.20312 -0.0781,0.17188 -0.14062,0.32813 -0.21875,0.57812 -0.3125,0.64062 -0.23438,0.29688 -1.48438,2.71875 -1.23437,2.40625 -1.59375,3.32813 l -0.125,0.48437 -0.0625,0.20313 q -0.35937,0.76562 -0.39062,0.82812 l -0.51563,1.98438 -0.125,0.67187 -0.28125,0.73438 -0.1875,0.60937 q -0.0781,0.15625 -0.125,0.42188 -0.0469,0.25 -0.10937,0.3125 l -0.0312,0.15625 q -0.0312,0.125 -0.21875,0.73437 -0.29688,0.90625 -0.42188,1.5 -0.125,0.57813 -0.15625,1.25 l -0.0312,0.14063 q -0.0937,0.70312 -0.40625,1.04687 -0.29687,0.35938 -0.78125,0.21875 z m 6.82099,-0.59375 q -0.0937,0.0312 -0.26563,0.0312 -0.67187,0 -1.125,-0.76562 -0.4375,-0.78125 -0.4375,-1.96875 0,-0.625 0.125,-1.23438 0.35938,-0.9375 0.35938,-1.34375 0.0937,-0.60937 0.65625,-2.1875 0.5625,-1.59375 0.71875,-1.92187 0.125,0 0.21875,-0.20313 0.0937,-0.20312 0.0937,-0.46875 0,-0.28125 0.10937,-0.57812 0.125,-0.3125 0.25,-0.4375 0,-0.0937 0.0781,-0.34375 0.0781,-0.26563 0.0781,-0.35938 0,-0.0625 0.0781,-0.23437 0.0781,-0.1875 0.0781,-0.28125 0.51563,-0.98438 1.79688,-4.45313 1.28125,-3.48437 1.28125,-3.85937 0,-0.26563 0.1875,-0.48438 0.20312,-0.21875 0.45312,-0.21875 0.0937,-0.10937 0.40625,-0.15625 0.32813,-0.0469 0.51563,-0.0469 l 0.51562,0.51562 q 0.35938,0.32813 0.54688,0.57813 0.1875,0.25 0.1875,0.60937 0,0.57813 -0.20313,1.14063 -0.20312,0.54687 -1.20312,2.625 -0.0625,0.17187 -0.51563,1.15625 -0.3125,0.64062 
-0.3125,0.73437 0,0.0937 -0.14062,0.40625 -0.125,0.29688 -0.21875,0.625 -0.25,0.51563 -0.8125,1.8125 -0.5625,1.29688 -0.5625,1.42188 0,0.34375 -0.15625,0.5 -0.0937,0 -0.125,0.15625 -0.0312,0.14062 -0.0312,0.20312 0,0.3125 -0.51563,1.70313 l -0.28125,0.92187 q -0.60937,2.07813 -0.85937,3.04688 -0.23438,0.95312 -0.23438,1.625 0,0.70312 -0.23437,1.15625 -0.21875,0.4375 -0.5,0.57812 z m 13.97729,-12.23437 q 0.29688,0.57812 0.29688,1.3125 0,0.9375 -0.42188,1.53125 -0.3125,0.45312 -0.5625,1.35937 -0.23437,0.89063 -0.45312,2.39063 -0.29688,1.21875 -0.71875,3.75 -0.5,1.34375 -0.625,2.01562 -0.0469,0.1875 -0.3125,0.79688 -0.26563,0.60937 -0.4375,0.95312 -1.4375,2.92188 -3.40625,4.39063 -1.96875,1.46875 -4.01563,1.59375 -0.51562,0.0625 -0.60937,0.0625 -0.28125,0 -0.53125,-0.0937 -0.23438,-0.0937 -0.51563,-0.25 -0.64062,-0.45313 -0.73437,-0.64063 -0.17188,-0.32812 -0.17188,-0.64062 0,-0.48438 0.45313,-0.65625 0.45312,-0.17188 1.03125,-0.17188 0.21875,0.125 0.76562,0.125 0.54688,0 0.64063,-0.125 0.0625,-0.0625 0.375,-0.21875 0.32812,-0.14062 0.73437,-0.23437 1.0625,-0.51563 1.625,-1.03125 0.5625,-0.5 1.23438,-1.625 0.79687,-1.21875 1.04687,-2.20313 0.42188,-1.0625 1,-3.14062 l -0.0312,0.0312 q -0.17187,0.1875 -0.8125,0.5625 -0.625,0.375 -0.76562,0.375 -0.0312,0 -0.0937,0.0312 -0.0625,0.0312 -0.0625,0.125 0,0.125 -0.82813,0.40625 -0.82812,0.26562 -1.0625,0.26562 -0.3125,0 -0.6875,-0.34375 -0.35937,-0.35937 -0.53125,-0.65625 -0.125,-0.28125 -0.46875,-0.57812 -0.39062,-0.375 -0.57812,-0.67188 -0.1875,-0.3125 -0.1875,-0.73437 0,-0.21875 0.0312,-0.34375 0.375,-1.4375 1.40625,-4.42188 1.04687,-3 1.53125,-3.73437 0.125,-0.20313 0.46875,-0.32813 0.35937,-0.125 0.71875,-0.125 0.64062,0 0.85937,0.375 0.39063,0.54688 0.39063,0.9375 0,0.4375 -0.45313,1.3125 -0.125,0.34375 -0.32812,0.76563 -0.1875,0.42187 -0.1875,0.51562 -0.0312,0.125 -0.1875,0.78125 -0.15625,0.65625 -0.28125,0.65625 -0.20313,0.1875 -0.4375,1.32813 -0.23438,1.14062 -0.23438,1.8125 0,0.375 0.0937,0.375 0.85938,0 
1.65625,-0.45313 0.79688,-0.46875 1.67188,-1.5 0.125,-0.79687 0.35937,-1.71875 0.25,-0.9375 0.34375,-1.25 0.57813,-1.98437 0.76563,-3.20312 0.1875,-0.45313 0.28125,-0.53125 0.0937,-0.0781 0.42187,-0.0781 0.39063,0 0.84375,0.23438 0.46875,0.21875 0.6875,0.53125 z m 11.99976,13.79687 q -1.25,-0.0312 -1.78125,-0.59375 -0.51563,-0.5625 -0.51563,-1.57812 0,-0.26563 0.0625,-0.84375 0.125,-0.4375 0.1875,-0.89063 0.15625,-0.70312 0.32813,-0.82812 0.0312,-0.0312 0.0625,-0.125 0.0312,-0.10938 -0.0625,-0.17188 0,-0.40625 0.28125,-1.14062 0.0625,-0.23438 0.15625,-0.45313 0.0937,-0.21875 0.14062,-0.28125 -0.0781,-0.0781 -0.0781,-0.20312 0,-0.15625 0.17187,-0.3125 0.1875,-0.15625 0.26563,-0.28125 0.0781,-0.14063 -0.0156,-0.26563 -0.0625,-0.15625 0.60938,-1.4375 0.42187,-0.82812 0.45312,-0.9375 -0.0312,-0.0937 0.45313,-1.17187 0.5,-1.09375 0.79687,-1.64063 0.28125,-0.15625 -0.23437,-0.21875 -0.51563,-0.0625 -1.25,-0.0312 -1.04688,-0.0312 -1.625,-0.10937 -0.57813,-0.0937 -0.9375,-0.28125 -0.21875,-0.21875 -0.21875,-0.57813 0,-0.40625 0.29687,-0.75 0.3125,-0.35937 0.76563,-0.39062 0.82812,-0.0312 2.5625,-0.23438 l 1.625,-0.125 0.0937,-0.54687 q 0.15625,-0.28125 0.29687,-0.48438 0.15625,-0.21875 0.20313,-0.375 0.0469,-0.15625 0.0781,-0.48437 0.0937,-0.125 0.32813,-0.625 0.25,-0.51563 0.3125,-1 0.35937,-0.60938 0.65625,-1.17188 0.29687,-0.5625 0.375,-0.65625 0.5625,-0.85937 1.26562,-0.85937 0.39063,0 0.8125,0.25 0.40625,0.29687 0.625,0.51562 0.23438,0.21875 0.23438,0.60938 0,0.28125 -0.29688,0.82812 -0.125,0.25 -0.375,1.03125 -0.21875,0.28125 -0.48437,0.78125 -0.25,0.5 -0.34375,0.9375 -0.0625,0.15625 -0.20313,0.4375 -0.125,0.26563 -0.17187,0.3125 -0.0469,0.0469 -0.14063,0.0469 0.0312,0.20312 0.125,0.26562 0.0937,0.0625 0.32813,0.0625 0.15625,0 0.39062,-0.0312 0.25,-0.0312 0.59375,-0.0312 0.32813,-0.0312 0.5625,-0.0937 0.23438,-0.0625 0.46875,-0.0625 0.125,0 0.375,0.0937 0.15625,0.1875 0.15625,0.73438 0,0.40625 -0.125,0.78125 -0.125,0.375 -0.3125,0.46875 -0.15625,0.0312 
-0.70312,0.1875 -0.54688,0.15625 -1.1875,0.1875 l -1.70313,0.15625 -0.53125,1.09375 q -0.23437,0.39062 -0.39062,0.75 -0.15625,0.34375 -0.21875,0.40625 0,0.15625 -0.26563,0.64062 -0.25,0.48438 -0.375,0.76563 -0.17187,0.45312 -0.39062,0.89062 -0.21875,0.42188 -0.28125,0.54688 -0.0937,0 -0.125,0.0469 -0.0156,0.0469 -0.0156,0.14063 l 0.0781,0.26562 q 0,0.15625 -0.17188,0.3125 -0.125,0.0625 -0.23437,0.23438 -0.10938,0.15625 -0.14063,0.21875 0.0625,0.0625 0.0625,0.15625 0,0.23437 -0.21875,0.57812 0,0.54688 -0.45312,1.625 0.0312,0.57813 0.0625,0.75 0.0312,0.15625 0.15625,0.125 h 0.15625 l 0.70312,-0.0312 q 0.23438,-0.0937 0.10938,0.48438 -0.10938,0.5625 -0.54688,1.75 -0.48437,0.73437 -1.01562,1.29687 -0.51563,0.5625 -0.6875,0.5625 z m 8.84204,-3.14062 q -0.57813,-0.0937 -1.0625,-0.64063 -0.48438,-0.54687 -0.48438,-1.25 0,-0.76562 0.96875,-3.9375 0.15625,-0.82812 0.70313,-2.14062 0.1875,-0.54688 0.21875,-0.64063 0.57812,-1.92187 1.57812,-4.70312 l 0.28125,-0.76563 q 1.125,-3.04687 1.34375,-3.71875 l 0.51563,-1.40625 q 0,-0.25 0.35937,-0.48437 0.375,-0.25 0.625,-0.25 0.51563,0 0.98438,0.67187 0.48437,0.67188 0.51562,1.40625 0,0.39063 -0.15625,0.70313 -0.73437,1.60937 -1.82812,4.96875 -0.46875,1.40625 -0.95313,2.67187 -0.48437,1.26563 -0.76562,1.8125 l -0.35938,0.70313 -0.4375,1.6875 q 0,0.0937 -0.0312,0.17187 -0.0312,0.0625 -0.0312,0.0937 0,0.0937 0.0312,0.0937 l 0.53125,-0.54687 q 0.9375,-0.92188 2.21875,-1.89063 1.04688,-0.73437 1.07813,-0.82812 0.0312,-0.0937 0.26562,-0.29688 l 0.3125,-0.25 q 1.03125,-0.73437 2.21875,-0.73437 0.34375,0 0.82813,0.35937 0.48437,0.34375 0.8125,0.96875 0.34375,0.625 0.34375,1.32813 v 0.25 0.23437 q 0,0.79688 -0.3125,2.09375 -0.29688,1.29688 -0.32813,1.48438 0.15625,0 0.46875,-0.0625 0.48438,0 0.6875,0.0937 0.21875,0.0937 0.375,0.26562 0.15625,0.15625 0.28125,0.1875 0.0625,0.0625 0.0625,0.1875 0,0.73438 -0.79687,1.28125 -0.79688,0.54688 -1.53125,0.60938 -0.89063,0 -1.4375,-0.67188 -0.54688,-0.67187 -0.54688,-1.64062 0,-0.375 0.0312,-0.75 
0.0312,-0.39063 0.0625,-0.6875 0.0937,-0.64063 0.0937,-1.01563 0,-0.42187 -0.0781,-0.5625 -0.0781,-0.14062 -0.26563,-0.14062 -0.64062,0 -2.23437,1.17187 -1.57813,1.15625 -3.20313,3.01563 -1.03125,1.5 -1.95312,1.5 z m 17.8548,1.21875 q -1.82813,0 -2.95313,-1.01563 -1.10937,-1.03125 -1.10937,-2.92187 0,-0.73438 0.0937,-1.625 0.0625,-0.78125 0.625,-2.125 0.5625,-1.34375 1.53125,-2.60938 0.96875,-1.28125 2.28125,-1.89062 0.60937,-0.29688 1.03125,-0.40625 0.42187,-0.10938 0.95312,-0.10938 1.3125,0 2.125,0.48438 0.82813,0.46875 1.15625,1.07812 0.34375,0.60938 0.34375,1.0625 v 0.15625 0.0937 q 0,0.39063 -0.45312,1.42188 -0.45313,1.01562 -1.42188,1.92187 -0.96875,0.89063 -2.51562,0.98438 -0.76563,0 -1.48438,-0.14063 -0.71875,-0.15625 -0.90625,-0.15625 -0.48437,0 -0.60937,0.35938 -0.125,0.34375 -0.125,1.1875 0.0937,0.89062 0.48437,1.25 0.40625,0.34375 1.17188,0.34375 0.39062,0 0.89062,-0.21875 0.51563,-0.26563 1.21875,-0.8125 0.39063,-0.3125 0.60938,-0.4375 0.21875,-0.125 0.54687,-0.125 0.28125,0 0.67188,0.39062 0.39062,0.375 0.42187,0.71875 -0.0781,0.39063 -0.71875,1.15625 -0.625,0.76563 -1.65625,1.375 -1.03125,0.60938 -2.20312,0.60938 z m 1.10937,-7.71875 q 0.60938,0 0.84375,-0.125 0.3125,-0.15625 0.96875,-0.8125 0.65625,-0.65625 0.6875,-0.90625 0,-0.32813 -0.28125,-0.53125 -0.26562,-0.20313 -0.73437,-0.28125 h -0.35938 q -0.42187,0 -1.20312,0.70312 -0.78125,0.6875 -1.35938,1.57813 0.0312,0.15625 0.48438,0.26562 0.45312,0.10938 0.95312,0.10938 z m 16.38513,7.90625 q -0.98437,0 -1.85937,-0.4375 -0.85938,-0.45313 -1.375,-1.14063 -0.5,-0.6875 -0.5,-1.35937 0,-0.54688 0.4375,-0.54688 0.29687,0.0625 0.67187,0.35938 0.48438,0.28125 0.8125,0.4375 0.34375,0.15625 0.70313,0.15625 0.53125,0 0.9375,-0.17188 0.40625,-0.17187 0.46875,-0.53125 0,-0.125 -0.14063,-0.54687 -0.125,-0.4375 -0.3125,-0.79688 -0.70312,-1.04687 -0.95312,-2.04687 -0.23438,-1.01563 -0.29688,-2.35938 0,-0.73437 1.0625,-1.75 1.07813,-1.03125 2.5,-1.82812 1.4375,-0.8125 2.32813,-0.90625 1.76562,0 2.10937,0.29687 
0.45313,0.34375 0.82813,1.14063 0.39062,0.79687 0.39062,1.34375 0,0.23437 -0.21875,0.57812 -0.20312,0.32813 -0.26562,0.42188 l -0.28125,0.4375 -0.21875,0.48437 q -0.0781,0.15625 -0.21875,0.48438 -0.14063,0.3125 -0.25,0.39062 -0.10938,0.0625 -0.34375,0.0156 -0.53125,0 -0.98438,-0.46875 -0.45312,-0.48438 -0.45312,-1.0625 0,-0.40625 0.28125,-0.82813 0.29687,-0.42187 0.29687,-0.48437 0,-0.125 -0.28125,-0.125 -0.42187,0 -1.29687,0.42187 -0.85938,0.40625 -1.53125,1.09375 -0.67188,0.6875 -0.67188,1.53125 0,0.4375 0.17188,1.03125 0.1875,0.59375 0.40625,0.98438 0.26562,0.25 0.75,1.51562 0.5,1.26563 0.5,2 0,1.125 -0.84375,1.70313 -0.84375,0.5625 -2.35938,0.5625 z m 11.04377,-0.64063 q -0.3125,0.0625 -0.57813,0.0625 -0.40625,0 -0.76562,-0.0781 -0.34375,-0.0781 -0.34375,-0.20313 0,-0.0937 -0.0469,-0.125 -0.0312,-0.0312 -0.125,-0.0312 -0.0937,-0.0937 -0.4375,-0.32812 -0.32812,-0.25 -0.54687,-0.34375 -0.39063,-0.54688 -0.39063,-1.1875 0,-0.60938 0.39063,-1.95313 0.0312,-1.25 0.90625,-2.79687 0.875,-1.54688 2.125,-2.875 1.25,-1.32813 2.25,-1.90625 0.46875,-0.48438 1.34375,-0.48438 0.67187,0 1.1875,0.28125 0.40625,0.26563 1.03125,1.09375 0.625,0.82813 0.9375,1.5 0.0781,0.23438 0.1875,0.625 0.10937,0.375 0.10937,0.92188 l -0.0625,0.89062 q -0.39062,1.3125 -0.39062,2.35938 -0.0625,0.45312 0.0625,0.73437 0.14062,0.26563 0.35937,0.26563 0.15625,0 0.34375,-0.0937 0.20313,-0.0937 0.45313,-0.0937 0.42187,0 0.70312,0.25 0.29688,0.25 0.29688,0.64062 0,0.3125 -0.17188,0.60938 -0.28125,0.64062 -1.0625,1.15625 -0.78125,0.5 -1.45312,0.5 -0.54688,0 -1.04688,-0.5625 -0.5,-0.57813 -0.78125,-1.51563 l -0.1875,-0.73437 -1.15625,1.0625 -0.73437,0.73437 q -0.51563,0.51563 -0.57813,0.60938 -0.15625,0.25 -0.5625,0.48437 -0.40625,0.21875 -1.26562,0.53125 z m 0.85937,-3.0625 q 0.57813,-0.32812 0.96875,-0.79687 0.40625,-0.48438 1.01563,-1.34375 0.14062,-0.20313 0.26562,-0.40625 0.125,-0.20313 0.3125,-0.4375 0.54688,-0.73438 0.78125,-1.17188 0.25,-0.45312 0.375,-0.96875 -0.125,-0.42187 -0.375,-0.78125 
-0.23437,-0.375 -0.42187,-0.40625 -0.90625,0 -2.4375,2.10938 -0.15625,0.23437 -0.5,0.8125 -0.32813,0.5625 -0.42188,0.76562 -0.17187,0.21875 -0.4375,1.07813 -0.26562,0.84375 -0.26562,1.09375 0,0.70312 0.40625,0.70312 0.32812,0 0.73437,-0.25 z m 25.32318,0.92188 q 0.21875,0.32812 0.21875,0.60937 0,0.32813 -0.23437,0.57813 -0.21875,0.25 -0.375,0.25 -0.76563,0 -1.71875,-0.35938 -0.9375,-0.375 -0.9375,-1.07812 v -1.98438 q 0,-1.28125 -0.21875,-1.46875 -0.48438,-0.25 -1.28125,0.71875 -0.79688,0.96875 -1.95313,2.89063 -0.45312,0.70312 -0.78125,1.01562 -0.3125,0.29688 -0.5625,0.29688 -0.21875,0 -0.39062,-0.125 -0.98438,-0.28125 -0.98438,-1.125 0,-0.0937 0.0625,-0.40625 0.28125,-1 0.60938,-2.20313 0.34375,-1.21875 0.54687,-2.09375 -0.0937,-0.1875 -0.32812,-0.1875 -0.3125,0 -0.73438,0.32813 -0.125,0.0937 -0.46875,0.29687 -0.32812,0.20313 -0.67187,0.54688 -0.32813,0.32812 -0.6875,0.90625 -0.125,0.23437 -0.4375,0.67187 -1.60938,2.4375 -1.95313,2.75 -0.17187,0.45313 -0.3125,0.64063 -0.14062,0.17187 -0.32812,0.17187 -0.125,0 -0.32813,-0.10937 -0.60937,-0.28125 -0.98437,-0.8125 -0.35938,-0.53125 -0.35938,-1.26563 0,-0.46875 0.28125,-1.90625 0.29688,-1.45312 0.65625,-2.70312 0.25,-0.5 0.39063,-1.0625 0.14062,-0.5625 0.14062,-0.78125 0.10938,-0.60938 0.39063,-1.64063 0.28125,-1.04687 0.42187,-1.15625 0.1875,-0.1875 0.70313,-0.1875 0.34375,0 0.65625,0.14063 0.32812,0.125 0.45312,0.375 0.42188,0.48437 0.42188,1.3125 0,0.76562 -0.3125,1.625 -0.32813,0.73437 -0.8125,2.15625 0,0 0.0625,-0.0781 0.78125,-1.07812 1.98437,-2.42187 0.57813,-0.64063 1.29688,-0.98438 0.71875,-0.35937 1.17187,-0.35937 0.76563,0.125 1.29688,0.75 0.53125,0.625 0.65625,1.70312 0,0.39063 -0.0625,0.95313 -0.0625,0.5625 -0.0937,0.71875 0.125,-0.15625 0.65625,-0.78125 0.53125,-0.625 1.29687,-1.29688 0.70313,-0.54687 1.3125,-0.54687 0.5,0 0.89063,0.42187 0.70312,0.40625 1.09375,1.34375 0.40625,0.92188 0.40625,2.10938 0,0.40625 -0.0312,0.64062 -0.0312,0.21875 -0.0312,0.64063 0,0.4375 0.0937,0.84375 0.10937,0.40625 
0.20312,0.71875 z m 4.44238,2.3125 q -1.0625,-0.42188 -1.5625,-1.29688 -0.48437,-0.89062 -0.48437,-2.29687 0,-0.3125 0.0625,-1.15625 0.0312,-0.25 0.0312,-0.59375 0,-0.54688 -0.125,-0.8125 -0.125,-0.28125 -0.45313,-0.57813 -0.0937,-0.0937 -0.45312,-0.40625 -0.34375,-0.32812 -0.3125,-0.48437 v -0.0937 q 0,-0.21875 0.10937,-0.48438 0.10938,-0.28125 0.21875,-0.34375 0.0937,-0.0312 0.375,-0.1875 0.29688,-0.17187 0.45313,-0.10937 0.125,0 0.39062,0.17187 0.28125,0.15625 0.28125,0.28125 0.1875,0.39063 0.98438,-1.25 0.45312,-0.67187 1.21875,-1.32812 0.76562,-0.65625 1.40625,-1 0.23437,-0.10938 1.01562,-0.28125 0.78125,-0.17188 1.23438,-0.17188 0.5,0 1.1875,0.42188 0.6875,0.40625 1.14062,0.89062 0.67188,0.48438 0.67188,1.82813 0,0.79687 -0.25,1.375 -0.125,0.46875 -1.26563,1.45312 -1.14062,0.96875 -1.84375,1.26563 -0.70312,0.15625 -1.5625,0.15625 -0.875,0 -1.1875,-0.15625 -0.21875,-0.125 -0.375,0.10937 -0.14062,0.23438 -0.23437,1.07813 l -0.0312,0.3125 q 0,0.35937 0.15625,0.67187 0.23437,0.39063 0.42187,0.54688 0.1875,0.15625 0.57813,0.15625 l 0.28125,-0.0312 q 0.35937,-0.0312 0.75,-0.34375 0.40625,-0.32813 0.95312,-0.875 0.57813,-0.54688 0.67188,-0.64063 0.15625,-0.3125 0.85937,-0.3125 0.28125,0 0.73438,0.0937 0.32812,0.25 0.32812,0.64063 0,0.54687 -0.51562,1.46875 -0.15625,0.32812 -0.98438,1.09375 -0.8125,0.76562 -1.29687,1.04687 -0.85938,0.48438 -1.90625,0.48438 -0.8125,0 -1.67188,-0.3125 z m 2.375,-7.20313 q 0.76563,-0.20312 1.1875,-0.46875 0.4375,-0.26562 1.01563,-1.07812 0.25,-0.76563 0.0937,-0.92188 -0.1875,-0.1875 -0.45313,-0.1875 -0.3125,0 -0.76562,0.34375 -0.3125,0.0937 -0.875,0.59375 -0.5625,0.5 -0.95313,0.96875 -0.375,0.45313 -0.21875,0.54688 -0.0625,0.10937 0.3125,0.15625 0.39063,0.0469 0.65625,0.0469 z"
id="path4094"
inkscape:connector-curvature="0"
style="fill:#000000;fill-rule:nonzero" />
<path
d="M 427.809,138.3307 H 553.08723 V 253.13125 H 427.809 Z"
id="path4100"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="M 379.751,138.43701 H 505.04235 V 253.24016 H 379.751 Z"
id="path4109"
inkscape:connector-curvature="0"
style="fill:#00fdc8;fill-rule:evenodd" />
<path
d="M 379.751,138.43701 H 505.04235 V 253.24016 H 379.751 Z"
id="path4111"
inkscape:connector-curvature="0"
style="fill-rule:evenodd;stroke:#666666;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round" />
<path
d="m 299.38583,404.37924 h 70.36065 v -6.68504 l 8.40525,13.37008 -8.40525,13.37007 v -6.68502 h -70.36065 z"
id="path4071-3"
inkscape:connector-curvature="0"
style="fill:#7c7ce0;fill-rule:evenodd;stroke-width:0.79288208" />
<path
d="m 506.71874,190.09462 h 70.36065 v -6.68504 l 8.40525,13.37008 -8.40525,13.37007 v -6.68502 h -70.36065 z"
id="path4071-6"
inkscape:connector-curvature="0"
style="fill:#7c7ce0;fill-rule:evenodd;stroke-width:0.79288208" />
<path
d="m 506.71875,404.41666 h 70.36065 v -6.68504 l 8.40525,13.37008 -8.40525,13.37007 v -6.68502 h -70.36065 z"
id="path4071-6-7"
inkscape:connector-curvature="0"
style="fill:#7c7ce0;fill-rule:evenodd;stroke-width:0.79288208" />
<flowRoot
xml:space="preserve"
id="flowRoot4272"
style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:26.66666603px;line-height:1.25;font-family:'Comic Sans MS';-inkscape-font-specification:'Comic Sans MS';text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none"><flowRegion
id="flowRegion4274"
style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:26.66666603px;font-family:'Comic Sans MS';-inkscape-font-specification:'Comic Sans MS';text-align:center;text-anchor:middle"><rect
id="rect4276"
width="125.97456"
height="90.782555"
x="379.06781"
y="162.45763"
style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:26.66666603px;font-family:'Comic Sans MS';-inkscape-font-specification:'Comic Sans MS';text-align:center;text-anchor:middle" /></flowRegion><flowPara
id="flowPara4278">Some process</flowPara></flowRoot> <g
transform="matrix(0.45555722,0,0,0.45555774,379.751,138.43701)"
id="g4082-1"
inkscape:transform-center-x="-34.322034"
inkscape:transform-center-y="102.9661">
<clipPath
id="clipPath4292">
<path
d="M 0,0 H 275 V 252 H 0 Z"
id="path4290"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
<image
clip-path="url(#p.3-4)"
width="275"
height="252"
x="0"
y="0"
preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/magic.gif"
id="image4080-3"
style="fill:#000000" />
</g>
<rect
style="fill:#ffffff"
id="rect148"
width="28.983051"
height="45.762711"
x="212.03391"
y="388.22034" />
</g>
</svg>
</center></p>
<p>Sounds interesting? Excellent! Start with these two articles.</p>
<ul>
<li><a href="differential-privacy-awesomeness.html">Why differential privacy is awesome</a> presents a high-level,
non-technical description of the definition. Then, it explains what makes it
so great, and why people are so excited about it.</li>
<li><a href="differential-privacy-in-more-detail.html">Differential privacy in (a bit) more detail</a> introduces the formal
definition. It uses very little math, and explains every single concept needed
to understand it.</li>
</ul>
<p>Then, this blog post series splits into two branches. You can read one or the
other in any order, depending on what you're most interested in.</p>
<p>The first branch is about the <em>how</em>: what techniques can you use to achieve
differential privacy? It's a <em>little bit</em> technical, though I still keep it as
simple as I can. If that doesn't sound interesting, feel free to skip over and
go directly to the next section!</p>
<ul>
<li><a href="differential-privacy-in-practice.html">Differential privacy in practice (easy version)</a> explains how to
make simple statistics differentially private: counts, sums, histograms… In
many cases, you don't need much more!</li>
<li><a href="almost-differential-privacy.html">Almost differential privacy</a> describes how to publish private
histograms without knowing the categories in advance. It also introduces an
important variant: <span class="math">\((\varepsilon,\delta)\)</span>-differential privacy.</li>
<li><a href="privacy-loss-random-variable.html">The privacy loss random variable</a> introduces this central concept in
differential privacy. Then, it uses it to explain the real meaning of <span class="math">\(\delta\)</span>
in <span class="math">\((\varepsilon,\delta)\)</span>-differential privacy.</li>
<li><a href="gaussian-noise.html">The magic of Gaussian noise</a> introduces Gaussian noise and its
shiny properties. In particular, it explains why Gaussian noise is so useful
when releasing many statistics.</li>
<li><a href="more-useful-results-dp.html">Getting more useful results with differential privacy</a> presents five
simple techniques to improve the utility of your anonymized data. If you're
rolling out your first differential privacy pipeline and it's not performing
well, this is the article for you!</li>
<li><a href="renyi-dp-zero-concentrated-dp.html">Averaging risk: Rényi DP & zero-concentrated DP</a> introduces two
variants of differential privacy. By averaging the privacy loss, these notions
allow for much better privacy analysis in a number of common use cases.</li>
<li><a href="choosing-things-privately.html">Choosing things privately with the exponential mechanism</a> explains
a fundamental building block of differential privacy: a mechanism to privately
select the best option among many.</li>
</ul>
<p>The second branch of this series is about the <em>why</em>. In which contexts can
differential privacy be used? Why do organizations decide to adopt it? What
policy questions does it raise? The articles in this branch are accessible to
non-technical folks, and are all self-contained. You can read them in any order
you like!</p>
<ul>
<li><a href="local-global-differential-privacy.html">Local vs. central differential privacy</a> presents the two main <em>models</em>
of differential privacy, depending on who the attacker is. <em>Publishing</em>
provably anonymized data is very different from <em>collecting</em> it, and this post
explains the difference.</li>
<li><a href="why-not-differential-privacy.html">Why not differential privacy?</a> explores what it means for an
algorithm to <em>not</em> be differentially private. Don't understand why people
insist on using differential privacy, even when it seems overkill? This is the
resource for you.</li>
<li><a href="us-census-reconstruction-attack.html">Demystifying the US Census Bureau's reconstruction attack</a>
explains why the government agency decided to use differential privacy. They
ran an attack on the data they published in 2010, and the results are quite
scary.</li>
<li><a href="noisy-data.html">Don't worry, your data's noisy</a> makes the case that <em>all</em> data is
noisy, even without differential privacy. In fact, the noise used in
differential privacy is the best kind of noise: the one you can quantify.</li>
<li><a href="litmus-test-differential-privacy.html">Is differential privacy the right fit for your problem?</a> gives a
simple litmus test to quickly get an idea of whether a given use case is
well-suited to using differential privacy.</li>
<li><a href="trustworthy-anonymization.html">What anonymization techniques can you trust?</a> lists some common legacy
techniques used to anonymize data, explains how they fail, and draws some
lessons from these historical failures.</li>
<li><a href="privacy-enhancing-technologies.html">Mapping privacy-enhancing technologies to your use cases</a> situates
differential privacy in the wider context of privacy-enhancing technologies,
explaining which kind of data privacy challenge each one addresses.</li>
</ul>
<p>Finally, one article lists the <a href="real-world-differential-privacy.html">known real-world deployments of DP</a>,
along with their privacy parameters.</p>
<p>This series isn't finished. I have a list of future articles I'd like to write…
and I'm adding new ideas to this list faster than I'm writing blog posts! If
you're looking for further things to read on differential privacy, you can do
two things.</p>
<ul>
<li>You can check out this <a href="differential-privacy-reading-list.html">reading list</a> I curated. I particularly recommend it if
you're looking for more formal content: textbooks with mathematical proofs,
scientific papers, etc.</li>
<li>You can follow me on <a href="https://hachyderm.io/@tedted">Mastodon</a> or subscribe to this blog's <a href="feed.xml">RSS feed</a> to
keep updated about future posts.</li>
</ul>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Joining Tumult Labs2021-08-05T00:00:00+02:002021-08-05T00:00:00+02:00Damien Desfontainestag:desfontain.es,2021-08-05:/privacy/joining-tumult-labs.html<p>I'm joining Tumult Labs, a startup focused on differential privacy. Here's why I'm excited about it!</p><p><strong>I'm</strong> joining <a href="https://www.tmlt.io">Tumult Labs</a>, a startup focused on making
<a href="differential-privacy-awesomeness.html">differential privacy</a> ubiquitous. I'm incredibly excited about taking DP to
the next stage, and I think Tumult Labs is a fantastic place to do that. Let me
tell you why!</p>
<h1 id="why-differential-privacy">Why differential privacy?</h1>
<p><a href="differential-privacy-awesomeness.html">Differential privacy</a> (DP) makes it possible to safely publish statistics
about sensitive data. Today, using DP in practice isn't very easy, but potential
applications are everywhere. I'm hopeful that by making this tech more usable
and more useful, we can make progress on three fronts.</p>
<h4 id="making-existing-practices-safer">Making existing practices safer</h4>
<p>Today, many organizations compute statistics about people and share them with
third parties. Government agencies publish information about their citizens.
Online platforms give content creators data about their audience. Hospitals
share patient data with researchers. And of course, a myriad of commercial
applications exist too.</p>
<p>All this data is often more revealing than people expect: anonymization is
difficult to get right, and easy to get wrong. Problems happen <a href="https://www.smh.com.au/technology/australians-health-records-unwittingly-exposed-20171218-p4yxt2.html">all</a>
<a href="https://www.wired.com/2007/12/why-anonymous-data-sometimes-isnt/">the</a> <a href="https://www.canberratimes.com.au/story/6330850/vic-transport-data-not-anonymous-enough/">time</a>. Worse, these problems are especially harmful to
folks in <a href="https://www.vox.com/recode/22587248/grindr-app-location-data-outed-priest-jeffrey-burrill-pillar-data-harvesting">marginalized populations</a>. Only releasing aggregate data,
or using clever-looking mitigations, isn't enough. The US Census <a href="us-census-reconstruction-attack.html">reconstruction
attack</a> made that very clear.</p>
<p>Worse, these are only the problems we hear about. Many organizations generate
and archive statistical data, without ever sharing it with outsiders. What if
they misclassify this data as "anonymized", even though reidentification is
possible? This can have painful compliance implications, even though it might
not make headlines.</p>
<p>Differential privacy is the only tool we know that prevents these problems. The
math works: DP provides <em>provable</em> guarantees. The main obstacle to adoption is
usability. So let's work on that, and make all these practices much safer!</p>
<h4 id="enabling-new-use-cases">Enabling new use cases</h4>
<p>These existing practices are only the tip of the iceberg. How many applications
never see the light of day, for privacy and compliance reasons? Governments want
to do more <a href="https://en.wikipedia.org/wiki/Open_data">open data</a>, but don't want to breach their constituents'
privacy. Organizations want to share data with researchers, without risking
privacy incidents. Valuable data gets deleted, even though it could have been
shared with others, or partly archived.</p>
<p>Robust and accessible anonymization tooling will enable many such use cases. It
will lower the cost of both compliance and respectful privacy practices. Today,
DP is still a shiny new concept that requires considerable investment to use:
only large organizations can afford it. In a few years, we can turn DP into a
standard industry best practice, like encryption at rest or audit trails. I
can't wait for us to get there!</p>
<h4 id="improving-incentive-structures">Improving incentive structures</h4>
<p>What happens once it gets dead simple to use provable anonymization techniques?
Enabling the use cases I mentioned before is one thing. But there might also be
rippling beneficial effects, much more impactful in the long run.</p>
<p>Here's an example. Today, large companies can see value in sharing data with
researchers. But the choice is theirs. What if they don't see the point? Or
worse, what if they're afraid of what the research might uncover? Nothing stops
them from keeping the data secret. Pushing these companies to be more
transparent about their impact is difficult: privacy can be a convenient excuse
to keep damaging information secret.</p>
<p>Once differential privacy is the new standard, and using it is easy, this excuse
goes away. It will be easier to demand openness from companies holding massive
amounts of data. This could happen, for example, through legislation. Today,
regulators who want to mandate such transparency face a difficult task: how to
do so without creating requirements that conflict with data protection
regulations? Differential privacy could change the equation.</p>
<p>And legislation isn't the only path: changing <em>cultural norms</em> can have a
massive impact as well. The easier it is to share data, the more organizations
will. Over time, the perception of what is and isn't legitimate to keep secret
will shift. And such a global shift towards openness can create a lot of
pressure!</p>
<p>So that's why I'm excited about differential privacy. Now, why do I think
Tumult Labs is a great place for making it successful?</p>
<h1 id="why-tumult-labs">Why Tumult Labs?</h1>
<p>Tumult Labs was founded by three professors, who have been working on DP for
more than a decade. They worked on <a href="https://scholar.google.com/scholar?as_q=&as_epq=%22differential+privacy%22&as_oq=%22ashin+machanavajjhala%22+%22michael+hay%22+%22gerome+miklau%22">foundational theory</a> <em>and</em> on DP's
<a href="http://www.cse.psu.edu/~duk17/papers/PrivacyOnTheMap.pdf">first-ever real-world deployment</a>. Now, they're advising the US
Census Bureau on some of their <a href="https://arxiv.org/pdf/2107.10659.pdf">most complex challenges</a>.</p>
<p>Their work on real-world deployments led them to a realization I'm quite
familiar with: if we want to reach wide adoption of DP, we have to invest in
building excellent tooling. So they've been assembling a <a href="https://www.tmlt.io/company">talented
team</a> to do exactly that: build what their clients
need into a single general-purpose platform. And I was very impressed by what
they <a href="https://users.cs.duke.edu/~ashwin/pubs/Ektelo-SIGMOD2018.pdf">published</a> about this work! It shows both deep technical insight
and a focus on building things in a modular and reusable way. I can't wait to
dig into this tech some more, and learn from the folks behind it.</p>
<p>Finally, I'm also eager to tackle the personal challenges that await. Being a
part of a small startup should be a much-needed change of pace. Helping external
clients will be new and pretty exciting. I'll also be Tumult's first employee in
Europe, but not the last: I'm thrilled about the idea of growing the team there.
Speaking of which, if the mission resonates with you, check out our <a href="https://www.tmlt.io/careers">careers</a>
page, or drop me a line directly!</p>Don't worry, your data's noisy2021-07-27T00:00:00+02:002021-07-27T00:00:00+02:00Damien Desfontainestag:desfontain.es,2021-07-27:/privacy/noisy-data.html<p>Your data was already noisy before I got a chance to add noise to it! Here's why you shouldn't panic, and also what you should do about it.</p><p><small>
<span class='notlettrine'>T</span>his post is part of a <a href="friendly-intro-to-differential-privacy.html">series on differential
privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see the other
articles!</p>
<p></small></p>
<hr>
<p><span class='lettrine'>H</span><strong>ere</strong> is a cold, hard, inescapable truth: your
data has noise in it. No, we're not talking about differential privacy (yet)!
Nobody added random numbers to your statistics so far. But still, your data is
noisy. It's wrong. It's not 100% accurate. It's uncertain. Worse, there are <em>two
kinds</em> of uncertainty.</p>
<h1 id="two-kinds-of-uncertainty">Two kinds of uncertainty</h1>
<p>Let's start with the first kind. Here's a diagram.</p>
<p><center>
<img alt="The picture is split in two. On the left, a point is labeled "What you think
your data looks like". On the right, the same point with a confidence interval
(a line that ends with small perpendicular lines) is labeled "What your data
actually looks like"." src="https://desfontain.es/privacy/images/point-vs-confidence-interval.svg">
</center></p>
<p>The point on the left is how most people think about statistics. Precise numbers
that correspond to exact truths. Take, for example, a database with eye color
information in it. Here is a statistic: « there are 4217 people with brown eyes
in this database ». What did you learn about the world when reading this
statistic? One answer could be « well, I've learned that the dataset contains
4217 people with brown eyes ». That sounds reasonable enough. Right?</p>
<p>Wrong.</p>
<p>At best, you've learned that the number of people with brown eyes in the
database is <em>around</em> 4217. It might be the best estimate you've got. But is this
statistic 100% accurate? Would you bet money on it being 100% accurate? What
would it take for you to be willing to take such a bet?</p>
<p>I can hear your objections already. « Whoa, wait a second. What does 100%
accurate mean? How is eye color defined exactly? How do we double-check the
number? And how was this statistic generated, anyway? Did each person answer a
question about their own eye color? Or did someone else do the classifying? Is
eye color information available for 100% of the people in the database? » And so
on. You probably have many other valid questions. Behind those questions lies
the deep, uncomfortable truth: this statistic is almost certainly noisy.</p>
<p>At best, you need to add error bars if you want to represent this statistic in
an honest way. This is what happens on the right side of the previous picture.
We've got an exact number, represented by a point, and a <em>confidence interval</em>
(or <em>error bars</em>) around this point. This statistic no longer says « there are
exactly 4217 people with brown eyes in this database ». Instead, it says
something more complex, but more accurate. « With 95% certainty, there are
between 4209 and 4226 brown-eyed people in the database. The best estimate
we've got is 4217. »</p>
<p>Wait a second.</p>
<p><center>
<img alt="A scene from Futurama, where Bender picks up a cigar and smokes it, while
Hermes and Zoidberg look at the scene from the background. Hermes' line appears
in a subtitle: "That just raises further
questions!"" src="https://desfontain.es/privacy/images/further-questions.gif">
</center></p>
<p>Right. How was this confidence interval computed? What do we mean by 95%
certainty? What is the uncertainty capturing? Did we miss some sources of
uncertainty? Is there uncertainty in this uncertainty estimate<sup id="fnref:uncertainty"><a class="footnote-ref" href="#fn:uncertainty">1</a></sup>?</p>
<p>This leads me to my second point, even more devastating than the first. Your
data is noisy, and also, you probably don't even know <em>how much noise</em> is in it.
The statistics you have are best guesses, not much more.</p>
<p><center>
<img alt="A diagram similar to the previous one, split in two. On the left, the same
confidence interval as above is labeled "What you wish your data looked like".
On the right, the confidence interval has been replaced by a dotted-line
representing unknown uncertainty, and labeled "What your data actually looks
like"." src="https://desfontain.es/privacy/images/confidence-interval-fuzzy-uncertainty.svg">
</center></p>
<p>The figure on the right represents this unknown uncertainty. You suspect that
the number you have isn't too far removed from the actual number. But you can't
quantify how far.</p>
<p>Let's take a closer look at both kinds of uncertainty. Quantifiable
uncertainties can take several forms.</p>
<ul>
<li>Your data might be a uniform sample of a larger population. In this case, you
can calculate the
<a href="https://en.wikipedia.org/wiki/Sampling_error">sampling error</a> of your
statistics.</li>
<li>A large-scale data collection process might miss a few events. This happens,
for example, when servers crash, or network problems occur. But you might be
able to estimate how often that happens, and say the uncertainty is below e.g.
1%.</li>
<li>The accuracy of certain algorithms can be measured. For example, you train
machine learning algorithms on a training set, and evaluate them on a test
set.</li>
<li>If humans are labelling data by hand, you can have several analysts label
the same data. Measuring how often they agree gives you a way to estimate
the uncertainty.</li>
</ul>
<p>After quantifying the uncertainty, you can take it into account in the analyses
you run. For example, you can propagate the error bars. Or only return results
if they are very likely to still be valid <em>despite</em> the
sources of error.</p>
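<p>To make "propagate the error bars" concrete, here is a minimal sketch (not from this post; the function name is mine) of the standard rule for combining error bars: independent standard errors add in quadrature.</p>

```python
import math

def combine_errors(sigmas):
    """Standard error of a sum of independent noisy values.

    Independent errors add in quadrature: sigma_total = sqrt(sum(sigma_i^2)).
    """
    return math.sqrt(sum(s * s for s in sigmas))

# Error bar on the sum of two counts with standard errors of 3 and 4.
print(combine_errors([3.0, 4.0]))  # → 5.0
```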
<p>Meanwhile, unquantifiable uncertainties can also come from several places.</p>
<ul>
<li>In practice, statistical sampling is often not uniform: your statistics might
suffer from <a href="https://en.wikipedia.org/wiki/Selection_bias">selection bias</a> or
<a href="https://en.wikipedia.org/wiki/Survivorship_bias">survivorship bias</a>. You can
guesstimate the effect of those on your data, but it's hard to be 100% sure.</li>
<li>When you ask people questions, their answers might not be accurate. This is
called <a href="https://en.wikipedia.org/wiki/Reporting_bias">reporting bias</a>. There
are many reasons why it can happen, and the overall impact is also difficult
to estimate.</li>
<li>People sometimes dig into data until they find something interesting to
report. This practice, called
<a href="https://en.wikipedia.org/wiki/Data_dredging">data dredging</a>, generates
good-looking, but completely meaningless results. This is made worse by the
fact that positive results are
<a href="https://en.wikipedia.org/wiki/Publication_bias">more likely to get published</a>.
Both phenomena lead to real problems in practice! They create inherent
uncertainty in the results reported in the scientific literature. And again,
it's hard to quantify this uncertainty.</li>
</ul>
<p>Sometimes, the situation is even worse than that. People might use ad hoc
protection techniques that add some fuzziness to the data, and <em>not tell you
about it</em>. Or they might give you an idea, but no details. This is what the US
Census Bureau did for their 2000 and 2010 releases: they randomly swapped
records, but didn't publish any details about how that procedure worked. Sadly,
it <a href="us-census-reconstruction-attack.html">didn't even succeed</a> in protecting the data. But it did make
the entire data noisy in ways nobody could find out, nor take into account.</p>
<p>Often, both quantifiable and unquantifiable effects end up affecting your data.
So you end up with something like this.</p>
<p><center>
<img alt="A diagram combining a confidence interval (in blue) and dotted-line unknown
uncertainty (in brown), around the same point. It's labeled "What your data ends
up looking like"." src="https://desfontain.es/privacy/images/combined-uncertainties.svg">
</center></p>
<p>The quantifiable uncertainty, for which you can draw confidence intervals, is in
blue. But there is still some unquantifiable uncertainty, here in brown: you
should take the whole thing, including the error bars, with a grain of salt.</p>
<p>You probably see what's coming next. What if we add noise to the statistics, to
get <a href="differential-privacy-awesomeness.html">differential privacy</a> guarantees?</p>
<h1 id="adding-differential-privacy-to-the-mix">Adding differential privacy to the mix</h1>
<p>Differential privacy is typically obtained by adding <em>noise</em> to statistics. We
pick a number at random according to some distribution, and add it to the
statistics. This distribution isn't secret, only the random number is. So this
is the nice kind of uncertainty: the one we can quantify. If you already have
some quantifiable uncertainty, you can combine both, and get a single confidence
interval. And of course, the whole thing might still be uncertain in ways we
can't quite estimate.</p>
<p><center>
<img alt="Another diagram split in two. On the left, the same picture as the previous
one (with both a confidence interval and unknown uncertainty) is labeled "Before
DP". On the right, the same picture is duplicated twice, with an "approximately
equal" sign between both duplicates; one duplicate has an additional confidence
interval in red, the other has both confidence intervals combined into a purple,
larger one. The right picture is labeled "After
DP"." src="https://desfontain.es/privacy/images/uncertainty-before-after-dp.svg">
</center></p>
<p>We didn't change the situation much. We only made the confidence interval a
little bit wider. The quantifiable uncertainty increased a little, and we could
quantify how much.</p>
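<p>This widening can be sketched in a few lines of code. The sketch below is mine, not the blog's actual mechanism: it adds Laplace noise to a sensitivity-1 count and reports the 95% confidence interval that the (public) noise distribution implies.</p>

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # The difference of two independent Exp(1) draws is Laplace(0, 1).
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def dp_count(true_count: int, epsilon: float):
    """Release a count with epsilon-DP Laplace noise (sensitivity 1),
    plus a 95% confidence interval for this quantifiable noise."""
    scale = 1.0 / epsilon
    noisy = true_count + laplace_noise(scale)
    # For Laplace noise, P(|noise| > t) = exp(-t / scale), so the
    # 95% margin is scale * ln(1 / 0.05), about 3 * scale.
    margin = scale * math.log(1 / 0.05)
    return noisy, (noisy - margin, noisy + margin)

noisy, (lo, hi) = dp_count(4217, epsilon=1.0)
# With epsilon=1, the released value is within ±3 of the truth 95% of the time.
```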
<p>Some DP algorithms are more complex, and add noise to data in more creative
ways. Computing the uncertainty for these algorithms can be difficult. But most
building blocks can be analyzed to derive confidence intervals. This might be
a bit of a pain, so in an ideal world, you don't have to do this by hand: the
tooling you use does it for you.</p>
<p>Sometimes, though, you get the not-so-nice kind of uncertainty. For example, you
might need to <a href="differential-privacy-in-practice.html#counting-things">limit the contribution</a> of each person in your database.
This can take several forms, like <em>clamping</em> or <em>subsampling</em>. This creates
additional error, or uncertainty, on top of the noise itself. But this one is
hard to quantify: in theory, someone could contribute a billion things, and we
would only count 5 of them. In that case, the error due to this single outlier
would be huge. </p>
<p>If you want to quantify this effect, you have two options. The people publishing
the data can tell you the magnitude of data loss due to clamping. If that didn't
happen, you can make some reasonable assumptions on these outliers: how many
there are, and how much data was dropped. In both cases, note this clamping is
often <em>positive</em> for data quality: robust statistical analyses shouldn't be
sensitive to huge outliers.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Differential privacy's effect on data isn't as world-ending as you might think.
Your data didn't go from perfect to noisy. It was noisy all along! DP only made
it a little more so. And you can quantify the effect of this new noise, which is
nice. (At least, nicer than some of the uncertainty sources that were already
there.)</p>
<p>This might come off as a surprise. You might have been using your data as if it
was a perfect source of absolute truth. Not the most scientific approach, but
maybe that was good enough for your application. In that case… you can probably
continue doing the same with DP data? If the people who designed the DP
process did a decent job, the statistics shouldn't be too far from the real
data. Unless you were looking at very small statistics, of course. But in that
case, you were looking at pure statistical noise in the first place.</p>
<p>If you were already taking uncertainty into account, DP doesn't change much. It
only adds a new, quantifiable element to this uncertainty. With one caveat: you
need to know exactly what was done to the data. Privacy parameters aren't enough
for you to guess. You need to know the shape and magnitude of the noise, and all
other ways in which the data was altered.</p>
<p>This also means that DP practitioners should give you this information. People
creating and implementing DP algorithms share part of this responsibility, too.
Novel algorithms should come with a way to calculate arbitrary confidence
intervals. Tools should return uncertainty information along with the output:
nobody wants to compute it by hand. </p>
<p>In a few words: transparency is key. It brings trust. It makes the data more
useful. And it can show that the accuracy impact of DP is smaller than people
might think!</p>
<hr>
<p>For more musings on differential privacy, head over to the <a href="friendly-intro-to-differential-privacy.html">table of
contents</a> of this blog post series.</p>
<hr>
<p><small>I'm thankful to Cynthia Dwork and danah boyd for their helpful feedback
on drafts of this blog post.</small></p>
<div class="footnote">
<hr>
<ol>
<li id="fn:uncertainty">
<p>This is not a silly question! And it can go further. I have seen
physicists calculating confidence intervals for confidence interval bounds
<em>for confidence interval bounds</em>. It is a real thing that very serious
scientists sometimes do. <a class="footnote-backref" href="#fnref:uncertainty" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>Getting more useful results with differential privacy2021-06-16T00:00:00+02:002021-07-14T00:00:00+02:00Desfontaines, Damien and Simmons-Marengo, Danieltag:desfontain.es,2021-06-16:/privacy/more-useful-results-dp.html<p>A few generic pieces of advice on how to get better utility out of your differentially private aggregations.</p><p><small>
<span class='notlettrine'>T</span>his post is part of a <a href="friendly-intro-to-differential-privacy.html">series on differential
privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see the other
articles!</p>
<p>This post was co-written by Daniel Simmons-Marengo and myself.
</small></p>
<hr>
<p><span class='lettrine'>S</span><strong>o</strong> you decided to use differential privacy
(DP) to publish or share some statistical data. You ran your first <a href="https://github.com/google/differential-privacy/tree/main/privacy-on-beam">pipeline</a> or
<a href="https://github.com/google/differential-privacy/tree/main/examples/zetasql">query</a><sup id="fnref:osstools"><a class="footnote-ref" href="#fn:osstools">1</a></sup>, all excited, but then… You're getting useless
results. Maybe all your data seems to have disappeared. Or the statistics look
very unreliable: the noise completely drowns out the useful signal. Don't lose
hope! This situation is common the first time people try to use differential
privacy. And chances are that you can fix it with a few simple changes.</p>
<p>In this post, we'll cover five basic strategies to improve the utility of your
anonymized data. These are far from the only tricks you can use, but they're a
good place to start. And none of them requires you to sacrifice any privacy
guarantees.</p>
<h1 id="aggregate-as-much-data-as-coarsely-as-possible">Aggregate as much data, as coarsely, as possible</h1>
<p>DP algorithms produce better results when run over more data. Remember: the
noise they introduce is proportional to the contribution of a single person. It
doesn't depend on the size of the input data. So, the more people you have in a
statistic, the smaller the <em>relative</em> noise will be. Individual contributions
will "vanish into the crowd", and so will the added uncertainty.</p>
<p>Increasing the total amount of input data will improve utility, but you may not
be able to get more data. Luckily, there are other ways to take advantage of
this property. What matters is the amount of data that contributes to each
<em>statistic</em> you produce. In other words, the finer you slice and dice your data,
the worse your utility will be. If you can, slice the data into coarser
partitions. For example, calculate weekly statistics rather than daily
statistics. Or aggregate your data per-country rather than per-city. You get the
idea.</p>
<p><center>
<img alt="Visual aid showing the data represented as a square cut into four big pieces
on the left, and the same square cut into sixteen pieces on the right, with an
arrow going from left to right between both
diagrams." src="https://desfontain.es/privacy/images/aggregate-more-coarsely.svg">
</center></p>
<p>Another common trick is to slice by fewer dimensions at the same time. Suppose
that your query calculates the number of visitors by country and language. Do
you need the combination of both dimensions? Many combinations of country and
language are rare, and will only have a few visitors. Instead, calculating them
separately might work better: you will get more users in each statistic, so the
overall impact of the noise might be lower.</p>
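<p>The effect of coarser slicing can be checked numerically. Assuming sensitivity-1 counts with Laplace noise of scale 1/ε (the counts below are illustrative, not real data):</p>

```python
def relative_noise(count: int, epsilon: float) -> float:
    # The expected absolute Laplace noise equals its scale, 1/epsilon,
    # regardless of how many people are in the statistic.
    return (1.0 / epsilon) / count

# Same budget, same population: weekly counts see 7x less relative noise.
daily = relative_noise(count=100, epsilon=0.1)   # 0.10 → 10% relative noise
weekly = relative_noise(count=700, epsilon=0.1)  # ≈0.014 → about 1.4%
```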
<h1 id="minimize-the-number-of-aggregations">Minimize the number of aggregations</h1>
<p>DP bounds the total amount you can learn about any individual from a series of
data releases. Every statistic you calculate reveals <em>some</em> information about
individuals in the data. To bound the global privacy cost of a data release, you
have to set a <a href="differential-privacy-in-practice.html#multiple-statistics"><em>privacy budget</em></a>. This is the total <span class="math">\(\varepsilon\)</span> and
<span class="math">\(\delta\)</span> cost of your set of calculations.</p>
<p>Each statistic spends part of this privacy budget. So if you have a fixed
privacy budget, and you want to calculate more statistics, each one must reveal
less. That means more noise must be added to it. Limiting how much noise is
needed to protect many statistics is an active area of research. The scientific
literature is full of clever tricks to that end. But the best solution is often
the simplest one: calculate fewer statistics.</p>
<p><center>
<img alt="Visual aid showing many round circles representing aggregations, and an arrow
pointing to the right with fewer
circles." src="https://desfontain.es/privacy/images/minimize-the-number-of-aggregations.svg">
</center></p>
<p>OK, this is a bit abstract. How can you decrease the number of statistics you
calculate, in practice? Here are some common strategies.</p>
<ul>
<li>Remove metrics. For example, if you're calculating both the number of page
views and the number of unique visitors… Could you, instead, use only one of
the two?</li>
<li>Remove dimensions. Do you need to calculate the number of visitors per country
<em>and</em> per language? Or would only one of the two get you the information you
need?</li>
<li>Remove time periods. Do you need to calculate the number of unique
visitors in the past day, week, month and year? Or would one or two of these
statistics be enough?</li>
<li>Remove "sliding windows". What's a sliding window? Suppose that every day, you
calculate e.g. the number of visits in the past week. In that case, each data
point will count towards seven separate statistics… Would calculating that
metric only once a week do the trick, instead?</li>
</ul>
<p><center>
<img alt="Visual aid showing fourteen successive days, with many arrows on the top
covering all possible one-week periods within these fourteen days. An arrow
points to the right, where there are only two arrows for two successive
weeks." src="https://desfontain.es/privacy/images/remove-sliding-windows.svg">
</center></p>
<h1 id="split-the-privacy-budget-unevenly">Split the privacy budget unevenly</h1>
<p>This trick is related to the previous one. Suppose that you reduced the number
of aggregations, but you still have several. The idea is to split your total
privacy budget <em>unevenly</em> between them. Say your total privacy budget is
<span class="math">\(\varepsilon=1\)</span> and you have five statistics. You don't have to allocate
<span class="math">\(\varepsilon=0.2\)</span> to each of them! You could instead use <span class="math">\(\varepsilon=0.8\)</span> for
one statistic, and <span class="math">\(\varepsilon=0.05\)</span> for all others. </p>
<p>Splitting the privacy budget unevenly is useful in two common situations.</p>
<ol>
<li>You care much more about the accuracy of some statistics than others. In that
case, you might want to allocate a bigger portion of the privacy budget to
the most important ones.</li>
<li>Some of your statistics are more fine-grained than others. For example,
suppose that you calculate both daily and weekly statistics. On average,
weekly statistics will have 7 times more data than daily statistics: you
could use a budget that is 7 times smaller for them. In doing so, the
relative impact of the noise will be about the same for both.</li>
</ol>
<p><center>
<img alt="Visual aid showing one ε represented as a rectangle split in four equal parts
labeled ε₁, ε₂, ε₃ and ε₄ on the left, and an arrow pointing right to the same
rectangle split in four parts, with the same labels but very different
sizes." src="https://desfontain.es/privacy/images/split-the-privacy-budget-unevenly.svg">
</center></p>
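<p>An uneven split under basic sequential composition can be sketched as follows (the epsilons just need to sum to the total budget; the function names are mine):</p>

```python
def split_budget(total_epsilon: float, weights: list[float]) -> list[float]:
    """Split a privacy budget proportionally to per-statistic weights.

    Under basic composition, the per-statistic epsilons sum to the total.
    """
    total = sum(weights)
    return [total_epsilon * w / total for w in weights]

# One important statistic, four minor ones, total budget epsilon = 1.
epsilons = split_budget(1.0, [0.8, 0.05, 0.05, 0.05, 0.05])
scales = [1.0 / e for e in epsilons]  # Laplace scale for sensitivity-1 counts
# The important statistic gets noise of scale 1.25; the others, scale 20.
```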
<h1 id="set-aggressive-contribution-bounds">Set aggressive contribution bounds</h1>
<p>Most DP algorithms bound each individual's contribution to each statistic.
For example, if you're counting the number of visits per web page, you need to
bound:</p>
<ul>
<li>the number of different pages each individual can contribute to;</li>
<li>and the number of times each individual can visit each page.</li>
</ul>
<p>How should you pick these bounds? It’s tempting to use generous bounds that will
cover any conceivable input. But this is usually bad for utility: the magnitude
of the added noise grows with the bounds. There is a tradeoff between two
sources of error:</p>
<ul>
<li>larger bounds will lose less data, but require more noise;</li>
<li>smaller bounds might lose more data, but require less noise.</li>
</ul>
<p>Often, reducing the level of noise on your whole dataset is worth truncating a
few outliers. The best cut-off depends on the distribution of your dataset: the
95th percentile might work well for one dataset, while another might do better
with the 98th. In most use cases though, you’ll reach optimal utility when part
of your dataset exceeds the bounds.</p>
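<p>The tradeoff can be compared directly in a toy model. The sketch below is a simplified illustration (each person contributes one number to a sum; the distribution and names are made up): it adds the bias lost to clamping to the expected absolute Laplace noise, whose scale is bound/ε because the sensitivity equals the bound.</p>

```python
def expected_error(contributions, bound, epsilon):
    """Rough total error of a DP sum with per-person contribution clamping.

    Bias from values clipped down to the bound, plus the expected
    absolute Laplace noise (scale = bound / epsilon).
    """
    bias = sum(max(c - bound, 0) for c in contributions)
    noise = bound / epsilon
    return bias + noise

data = [1] * 95 + [3] * 4 + [100]  # one huge outlier
loose = expected_error(data, bound=100, epsilon=0.1)  # no clipping, huge noise
tight = expected_error(data, bound=3, epsilon=0.1)    # clips the outlier
# Clamping the outlier costs 97 in bias but saves 970 in expected noise.
```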
<p>Note that some systems don't make you specify these bounds. Instead, they can be
automatically calculated by an auxiliary DP algorithm<sup id="fnref:approxbounds"><a class="footnote-ref" href="#fn:approxbounds">2</a></sup>. But this
operation uses some part of the privacy budget! In that case, if you specify the
bounds by hand instead, you can save that part of the budget, and get less noisy
results.</p>
<p><center>
<img alt="Visual aid showing two histograms of contribution size per user, with an arrow
pointing from the left to the right histogram. A horizontal line representing
the contribution bound splits the histogram in two parts, the top part (which
represents the part of the contribution that is dropped) is of a paler color.
The bound is high on the left and lower on the
right." src="https://desfontain.es/privacy/images/lower-contribution-bounds.svg">
</center></p>
<h1 id="use-public-partitions">Use public partitions</h1>
<p>Most DP pipelines produce many statistics, grouped by <em>partitions</em>. Partitions
are like the buckets in a histogram: in the example where we count the number of
visits per web page, each page is a partition. By default, pipelines only
release statistics for <em>some</em> partitions in the dataset. Typically, the
partitions with the most people in them are kept, and the ones with few users are
dropped. This process can reveal information about individuals: it must be done
in a DP way<sup id="fnref:ps"><a class="footnote-ref" href="#fn:ps">3</a></sup>. Like before, we need to use part of the privacy budget for
this, and add extra noise to the statistics to compensate.</p>
<p>You can skip this step by listing all the partitions you want to appear in your
output <em>before</em> you run the query. If you do so using non-private data, it is no
longer necessary to choose partitions in a DP manner. This allows you to save
budget, and return more partitions.</p>
<p>There is a downside: all partitions that you specified will appear in the
output, even if they have little or no data in the dataset. In that case, they
can be extremely noisy. Still, if you can list the partitions you want in
advance, this is often an excellent technique. You can see how it works in
practice in e.g. the <a href="https://codelabs.developers.google.com/codelabs/privacy-on-beam#4">Privacy on Beam codelab</a>. Note that you're
not allowed to look at the private data to build your list of public partitions!</p>
<p><center>
<img alt="Visual aid showing two histograms of number of people per partitions, with an
arrow pointing from the left to the right histogram. A horizontal line labeled
"threshold" splits the left histogram in two: the partitions below the threshold
are paler, to represent them being removed from the output. The right histogram
has no threshold, however additional partitions are added to the output, to
represent the public partitions that are not present in the data but are still
present in the output." src="https://desfontain.es/privacy/images/use-public-partitions.svg">
</center></p>
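<p>As a minimal sketch (my own toy model, not the API of the tooling linked above), releasing counts over a pre-declared public partition list might look like this:</p>

```python
import random

def laplace(scale: float) -> float:
    # The difference of two independent Exp(1) draws is Laplace(0, 1).
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def release(counts: dict, public_partitions, epsilon: float) -> dict:
    """DP counts over a fixed, data-independent partition list.

    Every public partition appears in the output (noisy, even with no
    data); partitions present in the data but absent from the list are
    silently dropped, so their existence is never revealed.
    """
    return {p: counts.get(p, 0) + laplace(1.0 / epsilon)
            for p in public_partitions}

visits = {"/home": 120, "/about": 3, "/secret-draft": 1}
out = release(visits, ["/home", "/about", "/contact"], epsilon=1.0)
# "/contact" appears (pure noise around 0); "/secret-draft" never does.
```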
<h1 id="more-clever-ideas">More clever ideas</h1>
<p>There are a myriad other techniques out there to squeeze more utility out of DP pipelines. Most of them are more complex than the ones listed in this post. Some might be the topic of future blog posts! In the meantime, here are three other, more generic suggestions.</p>
<ul>
<li>You can try looking up your problem in your search engine of choice, and add
"differential privacy". This will often dig up relevant literature. Unless
you're lucky, you won't find a readily available implementation. But you might
get valuable ideas or insights!</li>
<li>You can send a friendly message to one of the communities out there working on
DP tooling. Between <a href="https://groups.google.com/g/dp-open-source-users">Google</a>, <a href="https://www.openmined.org/">OpenMined</a>, or <a href="https://opendp.org/">OpenDP</a>, someone might be happy
to help! (And try to convince you to use their tooling :D)</li>
<li>You can also try to think about how sensitive your problem is to individual
changes in the data. If a single person changes their data, will you come to
different decisions or results? If yes, your use case might be fundamentally
incompatible with DP, and no clever trick will fix that.</li>
</ul>
<hr>
<p>Want to keep learning more about differential privacy? Head over to the <a href="friendly-intro-to-differential-privacy.html">table
of contents</a> of this blog post series to see its other articles.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:osstools">
<p>You did not roll your own tooling, right? Doing so is generally
unwise; implementing DP correctly is much trickier than you'd expect. There
are some excellent open-source tools out there that you should use instead.
Like <a href="https://github.com/google/differential-privacy">these libraries</a> that my
team at Google published! <a class="footnote-backref" href="#fnref:osstools" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:approxbounds">
<p>Section 5.1.1 of <a href="https://arxiv.org/pdf/1909.01917.pdf">our paper</a> describes such an
algorithm, implemented e.g. <a href="https://github.com/google/differential-privacy/blob/main/cc/algorithms/approx-bounds.h">here</a>. <a class="footnote-backref" href="#fnref:approxbounds" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:ps">
<p>This operation is described in a <a href="almost-differential-privacy.html">previous blog post</a>, and is
also the topic of a <a href="https://arxiv.org/abs/2006.03684">paper I co-authored</a>. <a class="footnote-backref" href="#fnref:ps" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Demystifying the US Census Bureau's reconstruction attack2021-05-26T00:00:00+02:002021-07-26T00:00:00+02:00Damien Desfontainestag:desfontain.es,2021-05-26:/privacy/us-census-reconstruction-attack.html<p>The US Census is moving to differential privacy, after running a successful privacy attack on their 2010 release. Let's look at this attack in more detail!</p><p><small>
<span class='notlettrine'>T</span>his post is part of a <a href="friendly-intro-to-differential-privacy.html">series on differential
privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see the other
articles!</p>
<p></small></p>
<hr>
<p><span class='lettrine'>E</span><strong>very</strong> 10 years, the <a href="https://en.wikipedia.org/wiki/United_States_Census_Bureau">US Census Bureau</a>
asks every American household a series of questions. How many people are living
here? How old are they? What is their race and ethnicity? What is their
relationship to each other?</p>
<p>The collected answers are of very high quality, for two reasons. First, <em>all</em>
households are required by law to respond truthfully to these questions. Second,
the US Census Bureau has a legal duty to keep the answers secret for 72 years
after each survey. Both aspects are key to convincing everyone to answer truthfully.
Appealing to people's sense of civic duty also helps!</p>
<p>What is the data used for, then? The Census Bureau aggregates it and publishes
statistics about the US population. A <em>lot</em> of statistics: in 2010, it released
over 150 <em>billion</em> statistics about the US population. These statistics then
have many applications: scientific research, informing public policy,
determining election districts, etc.</p>
<p>Confidentiality is central to the Census Bureau's mission.
<a href="https://www.law.cornell.edu/uscode/text/13/9">By law</a>, they have to make sure
that nobody can use their data to find out information about individuals.
Disclosing such information even comes with
<a href="https://www.law.cornell.edu/uscode/text/13/214">criminal penalties</a>! This has
led to the creation of an entire field of study: <em>statistical disclosure
control</em>. It predates even the oldest definition of privacy mentioned on this
blog (<a href="k-anonymity.html"><em>k</em>-anonymity</a>).</p>
<p>How did statistical disclosure control work in practice? Before 1990, the method
was pretty blunt: they removed the tables with fewer than five individuals or
households in them. Then, from 1990 to 2010, the Census Bureau used a more
complex technique called <em>swapping</em>. Swapping works in two steps. First, it
selects households at random in small geographic areas: those are the ones most
at risk of re-identification. Then, it exchanges records between these
households and others before generating the statistics. The details of this
swapping process were secret, to make it harder for people to design attacks.</p>
<p>Nowadays, the Census Bureau is moving towards formal notions: the statistics
published for the 2020 Census will be differentially private. If you've read the
<a href="differential-privacy-awesomeness.html">previous articles</a> of this blog, this might not surprise you.
Differential privacy (DP) is designed to solve this exact problem: publishing
statistics without revealing information about individuals. But this choice was
far from obvious! Adding noise to statistics is quite scary for data
users<sup id="fnref:users"><a class="footnote-ref" href="#fn:users">1</a></sup>. Using DP for such a complex release is also unprecedented.</p>
<p>So, what convinced the Census Bureau to take this decision? Their scientists ran
an <em>attack</em> on some of the 2010 data, to better understand the privacy risks.
And they realized that the attack was much more successful than they expected.
The attack is simple, not very expensive, and pretty scary. The scientists then
measured how much more swapping it would take for the attack to fail… and
observed that the utility loss would be catastrophic. Older techniques like
suppression were also ineffective. The only way to mitigate the risks <em>and</em>
limit accuracy loss was differential privacy.</p>
<p>This choice, of course, has other benefits. Differential privacy provides
<a href="differential-privacy-awesomeness.html#quantify">quantifiable guarantees</a>. It <a href="differential-privacy-awesomeness.html#composition">composes well</a>, and
protects even against <a href="why-not-differential-privacy.html">very powerful attackers</a>. It's also good for
transparency: for the first time, the Census Bureau will be able to publish the
details of their process. But these nice theoretical properties weren't the main
factor in their decision. Instead, the choice of DP was <em>pragmatic</em>: it worked
best to prevent realistic attacks without sacrificing too much utility.</p>
<p>In this blog post, we'll take a look at the details of this compelling attack,
at the core of this decision. The attack has two stages: <em>reconstruction</em> and
<em>reidentification</em>.</p>
<h1 id="reconstruction">Reconstruction</h1>
<p>The first step in the attack is to reconstruct records, using statistical data.
The statistical data is as follows. In each geographic area (like <a href="https://en.wikipedia.org/wiki/Census_block">Census
blocks</a>), we can get the answers to
questions like: how many people of age 47 live there? How many people between
the ages of 25 and 29 self-identify as white? How many households with 2 adults
and 2 children are there?</p>
<p>How can we use this kind of information and reconstruct the original records?
Let's take a simplified example. Say that a hypothetical Census block has four
people in it.</p>
<table>
<thead>
<tr>
<th style="text-align: center;">Age</th>
<th style="text-align: center;">Race<sup id="fnref:race"><a class="footnote-ref" href="#fn:race">2</a></sup></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">17</td>
<td style="text-align: center;">White</td>
</tr>
<tr>
<td style="text-align: center;">17</td>
<td style="text-align: center;">Asian</td>
</tr>
<tr>
<td style="text-align: center;">43</td>
<td style="text-align: center;">White</td>
</tr>
<tr>
<td style="text-align: center;">47</td>
<td style="text-align: center;">Asian</td>
</tr>
</tbody>
</table>
<p>Now, suppose that we have the following statistical data about this Census
block.</p>
<ol>
<li>There are four people in total.</li>
<li>Two of these people have age 17.</li>
<li>Two of these people self-identify as White.</li>
<li>Two of these people self-identify as Asian.</li>
<li>The average age of people who self-identify as White is 30.</li>
<li>The average age of people who self-identify as Asian is 32.</li>
</ol>
<p>This data is statistical in nature: these numbers are all <em>aggregated</em> over
several people (here, two). Yet, it's not difficult to guess what the original
table looks like based on the numbers.</p>
<p>Not obvious yet? Take the two people aged 17. Points 1, 3 and 4 tell us that:</p>
<ul>
<li>either they both self-identify as White,</li>
<li>or they both self-identify as Asian,</li>
<li>or one of them self-identifies as White and the other as Asian.</li>
</ul>
<p>The first option is impossible: if they both self-identified as White, then
their average age should be 17, not 30 (point 5). The second option is also
impossible, for the same reason (with point 6). So the third option is correct.
We now know the first two records, and we can find the age of both others using
the average age of each race group. It's like a fun puzzle!</p>
<p>That's the idea behind reconstruction attacks: taking statistical information
and using simple reasoning to reverse-engineer the original records. Of course,
when we have billions of statistics to work with, we don't do this by hand.
Instead, we convert the data into a massive <a href="https://en.wikipedia.org/wiki/System_of_linear_equations">system of equations</a>.
Each piece of individual information is an unknown variable. The data gives us
the relationships between them. Then, we can use a big computer to find a
possible solution to this large system. This solution, in turn, gives us
reconstructed records.</p>
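<p>To make this concrete, here is a toy sketch (not the Census Bureau's actual
method, which used an industrial solver) that brute-forces the four-record
example above from the six published statistics:</p>

```python
from itertools import combinations, product

# The six published statistics about the toy Census block.
COUNT = {"White": 2, "Asian": 2}      # statistics 3 and 4 (they sum to statistic 1)
AVG_AGE = {"White": 30, "Asian": 32}  # statistics 5 and 6
MAX_AGE = 115

solutions = set()
# Statistic 2: exactly two people are aged 17. Call the other two ages a and b.
for a, b in product(range(MAX_AGE + 1), repeat=2):
    if a > b or a == 17 or b == 17:
        continue
    ages = [17, 17, a, b]
    # Try every way of choosing which two of the four records are White.
    for white in combinations(range(4), 2):
        races = ["White" if i in white else "Asian" for i in range(4)]
        # Statistics 5 and 6: each race group's ages must sum to
        # its count times its average age.
        if all(sum(ag for ag, r in zip(ages, races) if r == race)
               == AVG_AGE[race] * COUNT[race] for race in COUNT):
            solutions.add(tuple(sorted(zip(ages, races))))

print(solutions)
# → {((17, 'Asian'), (17, 'White'), (43, 'White'), (47, 'Asian'))}
```

<p>The search finds exactly one set of records compatible with the statistics:
the original table. Real attacks encode the same constraints as a (much bigger)
integer program instead of enumerating candidates.</p>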
<p>So, the team at the Census Bureau did exactly that, with statistical data from
the 2010 Census. They transformed it into many equations, and used <a href="https://www.gurobi.com/">Gurobi</a> to
reconstruct the raw data. The records they obtained matched <strong>46%</strong> of the
original records exactly. That's pretty impressive! Especially since only a
small fraction of the statistics were used in the attack (6.2 billion out of 150
billion). Swapping was not enough to prevent accurate reconstruction.</p>
<p>At first glance, that result looks pretty scary. But let's take a step back: how
bad is it really? What does this 46% number actually tell us?</p>
<h1 id="how-bad-is-reconstruction">How bad is reconstruction?</h1>
<p>Imagine that a given Census block has a particularly homogeneous population: out
of 100 inhabitants, 95 self-identify as White, and are evenly distributed
between the ages of 20 and 39<sup id="fnref:homogeneous"><a class="footnote-ref" href="#fn:homogeneous">3</a></sup>. The other 5% of people in this
block do not belong to this demographic, and we don't have any information about
them.</p>
<p><center>
<img alt="Visual representation of a hypothetical homogeneous population. A large block
is labeled "Majority population, Self-Identifies as White, Ages evenly
distributed between 20 and 39"; a smaller block underneath is labeled "Minority
population, no demographic information"." src="https://desfontain.es/privacy/images/homogeneous-population.svg">
</center></p>
<p>Can we "reconstruct" this dataset with high accuracy? Yes, and it will be easier
than we expect: simply use the data on the majority group, and ignore the
minority population. In practice, for each age between 20 and 39, we output 5
records with this age, and White as a race.</p>
<p>Given the statistics on this block, it's likely that our naive guess is pretty
close to the truth. After all, only 5% of people don't belong to the majority
demographic, and we know the distribution of the other 95%. Recall the success
metric from the previous section: the percentage of matching records between
real data and reconstructed data. According to this metric, our naive strategy
performs very well! Accuracy is about 95%, if the age distribution is very
uniform.</p>
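<p>A quick simulation (with hypothetical data matching the description above)
shows how well this naive strategy scores on that metric:</p>

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical block matching the description: 95 people self-identify as
# White with ages roughly uniform in 20..39; 5 people from a minority
# group, about whom we know nothing (modeled here with arbitrary values).
truth = [(random.randrange(20, 40), "White") for _ in range(95)]
truth += [(random.randrange(0, 100), "Other") for _ in range(5)]

# Naive "reconstruction": five White records for each age from 20 to 39.
guess = [(age, "White") for age in range(20, 40) for _ in range(5)]

# Success metric: how many reconstructed records match a real record.
# Counter intersection pairs up identical records between the two lists.
matches = sum((Counter(truth) & Counter(guess)).values())
print(f"{matches} of 100 records matched")
```

<p>Depending on how the random ages fall, the naive guess matches roughly 85 to
95 of the 100 records, without revealing anything about any individual.</p>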
<p>Has this process revealed sensitive information? Well… not really, right? All
reconstructed records are identical across many people. So they don't seem to
reveal very sensitive information… And the people in demographic minorities are
safe from the attack.</p>
<p>It's also not clear yet how problematic these reconstructed records are. All we
get is a list of records that are compatible with the published statistics. But
how do we know which ones are actually correct? And how can we use them to learn
something new and sensitive about individuals?</p>
<p>These questions show that on its own, the 46% number from the Census Bureau
doesn't tell us much. But the Census Bureau didn't stop at reconstruction. The
attack has a second step, <em>re-identification</em>, and this one gets much scarier
results. Let's take a look at how it works.</p>
<h1 id="re-identification">Re-identification</h1>
<p>In the Census Bureau attack, the reconstruction step outputs records with five
characteristics: Census block, age, sex, race, and ethnicity. The idea of the
re-identification attack is simple. First, they buy the kind of <em>commercial
data</em> that an attacker could have access to. Second, they <em>link</em> this
information with reconstructed records.</p>
<p>This "commercial data" is exactly what it sounds like: information about people,
compiled by random companies, and made available to anyone who pays for it (or
<a href="https://en.wikipedia.org/wiki/2017_Equifax_data_breach">steals it</a>). These
companies, called "data brokers", are largely unregulated in the US. Their
privacy practices are about as awful as you can imagine<sup id="fnref:brokers"><a class="footnote-ref" href="#fn:brokers">4</a></sup>. For their
attack, the Census Bureau obtained data from five different companies. Their goal was to
simulate what an attacker would likely have access to in 2010.</p>
<p>These commercial datasets typically contain people's names, associated with
demographic information: location of residence, age (or date of birth), and
sex<sup id="fnref:sex"><a class="footnote-ref" href="#fn:sex">5</a></sup>. We will use these three characteristics to <em>re-identify</em>
reconstructed records.</p>
<p>The technique is simple. We look at each record of the commercial dataset, one
after the other, and try to match its characteristics with a reconstructed
record. Is there a single reconstructed record with the same location, age, and
sex? If so, we link the two records together. Here is a visualization of that
process (with made-up data). The commercial dataset is in blue, the
reconstructed Census records are in green.</p>
<p><center>
<img alt="Visual representation of a join between a record from a commercial dataset,
and a reconstructed Census record. The commercial record has a name, James Link
(labeled Identity), a postal address, a date of birth, and a sex. The
reconstructed Census record has five fields: Census block ID, age, sex, race,
and ethnicity. Location, age and sex form a "join key", and allow both records
to be linked. The race and ethnicity are labeled "newly learned
information"." src="https://desfontain.es/privacy/images/re-identification.svg">
</center></p>
<p>Linking datasets in this way achieves two goals.</p>
<ol>
<li>It confirms that the reconstructed record corresponds to a specific
individual. In fact, it allows us to <em>re-identify</em> the reconstructed record.
Here, the commercial data tells us that this person is James Link.</li>
<li>It gives us <em>more</em> information about this person, which the commercial data
didn't have. Here, we learn James Link's race and ethnicity.</li>
</ol>
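<p>A minimal sketch of this linkage step, using made-up records and field names
in the spirit of the illustration above:</p>

```python
from collections import defaultdict

# Made-up reconstructed Census records: (block, age, sex, race, ethnicity).
reconstructed = [
    ("block-42", 34, "F", "Asian", "Not Hispanic"),
    ("block-42", 29, "M", "Black", "Hispanic"),
    ("block-42", 34, "M", "White", "Not Hispanic"),
    ("block-42", 34, "M", "Asian", "Not Hispanic"),  # same key as the previous one
]

# Made-up commercial records: (name, block, age, sex).
commercial = [
    ("James Link", "block-42", 34, "F"),
    ("Jane Doe", "block-42", 29, "M"),
    ("John Roe", "block-42", 34, "M"),  # ambiguous: two candidate records
]

# Index reconstructed records by the join key: (block, age, sex).
by_key = defaultdict(list)
for block, age, sex, race, ethnicity in reconstructed:
    by_key[(block, age, sex)].append((race, ethnicity))

# Link each commercial record to a reconstructed one, but only when a
# *single* reconstructed record shares the join key.
links = {}
for name, block, age, sex in commercial:
    candidates = by_key[(block, age, sex)]
    if len(candidates) == 1:
        links[name] = candidates[0]

print(links)
# → {'James Link': ('Asian', 'Not Hispanic'), 'Jane Doe': ('Black', 'Hispanic')}
```

<p>The real attack adds fuzzy matching on top of this exact join, but the
principle is the same: each unambiguous match attaches a name to a reconstructed
record, and reveals the fields the commercial data didn't have.</p>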
<p>How do we quantify the success of this attack? We can look at two numbers.
First, how many records can be linked between datasets in this way? Second, out
of these linked records, how many are <em>accurate</em>? After all, some of these
re-identifications might be wrong: both datasets might be incomplete or
inaccurate. The percentage of correct answers is the <em>precision</em> of the attack:</p>
<div class="math">$$
\text{precision} = \frac{\text{true re-identifications}}{\text{number of linked records}}.
$$</div>
<p>So how does the attack perform? Census Bureau scientists linked 138 million
records between both datasets. And their average precision was 38%. This means
that the attack <strong>successfully re-identified 52 million records</strong>. Scary!</p>
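<p>These (rounded) numbers fit together as follows:</p>

```python
linked = 138_000_000  # records linked between the two datasets
precision = 0.38      # fraction of those links that are correct

print(f"{linked * precision / 1e6:.0f} million correct re-identifications")
# → 52 million correct re-identifications
```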
<p>One could argue that the attacker can't know <em>which</em> records are correctly
re-identified. Some of them will be false positives! After all, 38% aren't great
odds. But the analysis from Census Bureau scientists doesn't stop there. Two
further considerations make the attack even more compelling.</p>
<ol>
<li>The precision of re-identification goes <em>up</em> for people in small Census
blocks: it's <strong>72%</strong> on the smallest Census blocks (in which 8 million people
live). This makes sense: statistics across a few people are more revealing
than aggregates over large groups. It's not surprising, but it's still bad
news: it means that folks in minority populations are more at risk of being
re-identified. But disclosure avoidance is precisely trying to protect these
people!</li>
<li>The precision also goes up if one has better-quality data. Census Bureau
scientists use a neat trick to find worst-case guarantees: they use the raw
Census data itself as an approximation of the best possible data an attacker
could find. Using this, they show that the global precision increases to 75%,
and even goes up to <strong>97%</strong> on small Census blocks.</li>
</ol>
<p>Is this assumption of high-quality data unrealistic? No, for two reasons.</p>
<ul>
<li>The commercial data used in the attack is what someone could have had access
to in 2010 — more than 10 years ago. Data brokers have much better data
available for sale by now.</li>
<li>The re-identification step only requires high-quality data <em>about our
targets</em>. The attack works just fine even if we're trying to re-identify a
single person, or a handful of people. With high-quality information about
them, the precision goes up to these worst-case numbers.</li>
</ul>
<p>There's nothing preventing businesses from running this attack on their
employees or customers. In this kind of context, access to high-quality
data isn't an optimistic assumption: it's a given.</p>
<h1 id="conclusion">Conclusion</h1>
<p>This attack is bad news.</p>
<p>It proves two points beyond doubt. First, anyone can reconstruct Census records.
Many of these reconstructed records are accurate, especially in small Census
blocks. This does not require massive computing power: you can run
<a href="https://www.nytimes.com/2018/12/05/upshot/to-reduce-privacy-risks-the-census-plans-to-report-less-accurate-data.html">smaller-scale attacks</a>
in minutes on your average laptop. Second, high-quality data about some people
is enough to re-identify their Census records. This re-identification has high
precision, and reveals previously-unknown information about these people.</p>
<p>In the attack, the attacker could learn race and ethnicity information. Data
brokers might want to do this to augment their datasets, for example. But the
risks can get even more tangible. Remember: the attack only used a small
fraction of all published statistics. The targeted tables had only demographic
information. What if someone were to attack household tables instead? This could
likely reveal intimate details about the people you live with. Do you have a
partner of the same sex? Children of a different race? Anyone with basic
information about you — employer, acquaintance, data broker… — might find out.</p>
<p>Some of that information is particularly sensitive, and otherwise unavailable
via data brokers. For example, reconstruction also works on children's data,
which is illegal to sell in the US. Someone could combine household
reconstruction with re-identification, and locate kids of a specific race, sex
and age range in a given area. No need to spell out the possible harms this
could enable.</p>
<p>Risks to individuals can translate to risks to data quality in the long run.
Suppose Census data can no longer be considered confidential. How will that
impact people's willingness to answer future Census surveys? Minority groups are
already <a href="../PDFs/CensusBarriersReport.pdf">more likely to have privacy concerns</a> about their answers. This
will get worse if the Census Bureau can no longer guarantee the confidentiality
of the data.</p>
<p>This explains the Census Bureau's decision to move to differential privacy.
Everything we described is just one possible attack, and it used only a fraction
of the released data. The Census Bureau must protect against all other attacks that people
might come up with in the future! And that's exactly what differential privacy
provides: provable privacy guarantees, even against attacks that haven't been
invented yet.</p>
<hr>
<p>If you'd like to learn more about differential privacy, I have good news for
you: this blog also contains an entire <a href="friendly-intro-to-differential-privacy.html">article series</a> about this
notion, introducing it in friendly, non-technical terms. Head over to the <a href="friendly-intro-to-differential-privacy.html">table
of contents</a> to see the other posts!</p>
<hr>
<p><small></p>
<p>Nice, you made it all the way to the end of this article! Here are some more
links and disclaimers.</p>
<p>This article is almost entirely sourced using documents from a <a href="https://www.brennancenter.org/our-work/court-cases/alabama-v-us-dept-commerce">lawsuit</a>. The
main one is this <a href="../PDFs/AlabamaLawsuitDocuments.pdf">declaration from John Abowd</a>, the Census Bureau's
chief scientist. It explains the history and context behind the use of DP for
the 2020 Census. It starts on page 85 of <a href="../PDFs/AlabamaLawsuitDocuments.pdf">this PDF</a>, and Appendix B
(starting on page 147) describes the attack in more detail. Some of the numbers
also come from Abowd's <a href="../PDFs/AbowdSupplementalDeclaration.pdf">supplemental declaration</a>.</p>
<p>This last declaration was filed in response to a filing by a couple of
demographers. They make an <a href="../PDFs/RugglesReconstructionExperiment.pdf">argument</a> similar to the one outlined in
the second part of this article: you can reconstruct records by picking them
randomly, so reconstruction doesn't mean anything. Hopefully, this post managed
to convince you that this argument has two flaws: it ignores the higher risks
for minority groups, and it fails to address re-identification, the second part
of the attack.</p>
<p>Still hungry for more legal filings? Many leading data privacy experts filed an
<a href="../PDFs/CensusDifferentialPrivacyAmicusBrief.pdf">amicus brief</a> supporting the use of DP for the 2020 Census. It's clear,
concise, and makes a lot of great points. Worth a read!</p>
<p>In case this wasn't obvious, this article made a <em>lot</em> of simplifications. I
rounded all numbers so they would look nicer. I optimized the reconstruction
example for clarity and fun, not for accuracy: in particular, the Census Bureau
doesn't actually release statistics like "average age". The real attack doesn't
use only a system of equations for reconstruction, but inequalities as well; I
also suppose that there is some optimization done, but the details aren't
public. The linking attack has some additional logic to do fuzzy matching. And
I'm sure I made other shortcuts along the way. If you're looking for more
accurate information, you will probably find it in the documents linked above.</p>
<p>There's a lot more to say about <a href="https://www.census.gov/content/dam/Census/library/working-papers/2018/adrm/Disclosure%20Avoidance%20Techniques%20for%20the%201970-2010%20Censuses.pdf">swapping</a>, too. For example, it has surprising
negative effects on data quality! Since folks in demographic minorities in each
location are more likely to have their record swapped, the process biases the
data: it makes all areas seem more homogeneous than they actually are. Recall
that the details of swapping are secret: data users couldn't quantify such
effects to take them into account in their analysis! Interestingly, the attack
run by Census Bureau scientists ignored swapping entirely. A more clever attack
might take it into account, and attempt to reverse it. This could make the
attack even more accurate, especially for folks in demographic minorities.</p>
<p>Finally: I have not been involved in any way with Census work. I'm thankful to
Aloni Cohen, Cynthia Dwork, Thomas Steinke, Kunal Talwar, and Yuan Yuan Zheng
for helpful comments and suggestions on drafts of this post. Of course, if there
are inaccuracies or errors left, that's entirely on me — please let me know if
you find any!</p>
<p></small></p>
<div class="footnote">
<hr>
<ol>
<li id="fn:users">
<p>The people <em>using the data</em>: scientists, people drawing electoral
district boundaries, public agencies, businesses, etc. <a class="footnote-backref" href="#fnref:users" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:race">
<p>We're using the
<a href="https://www.census.gov/topics/population/race/about.html">classification from the Census</a>
here. Census data also has a separate
<a href="https://www.census.gov/topics/population/hispanic-origin/about.html">Hispanic origin</a>
field, called "ethnicity". Don't ask me to explain this, I have no idea. <a class="footnote-backref" href="#fnref:race" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:homogeneous">
<p>This hypothetical Census block is basically a tech company. <a class="footnote-backref" href="#fnref:homogeneous" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:brokers">
<p>To learn more about this industry,
<a href="https://www.eff.org/wp/behind-the-one-way-mirror#Data-brokers">this EFF paper</a>
is a pretty solid resource. <a class="footnote-backref" href="#fnref:brokers" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:sex">
<p>Or gender, depending on the dataset. These are
<a href="https://en.wikipedia.org/wiki/Sex_and_gender_distinction">not the same thing</a>,
but the Census asks about sex, so we'll assume the commercial datasets use the
same notion. This inevitably introduces errors and feels a bit icky. <a class="footnote-backref" href="#fnref:sex" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Why not differential privacy?2021-03-30T00:00:00+02:002021-09-27T00:00:00+02:00Damien Desfontainestag:desfontain.es,2021-03-30:/privacy/why-not-differential-privacy.html<p>What does it mean for an algorithm to <em>not</em> be differentially private?</p><p><small>
<span class='notlettrine'>T</span>his post is part of a <a href="friendly-intro-to-differential-privacy.html">series on differential
privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see the other
articles!</p>
<p></small></p>
<hr>
<p><span class='lettrine'>S</span><strong>ome</strong> algorithms are not differentially
private, but still claim to perform anonymization. Such mechanisms are common,
both in the academic literature and in industry. Explanations on why they still
preserve some notion of privacy vary.</p>
<ul>
<li>They might include some ad hoc protections against entire classes of attacks.</li>
<li>They might aggregate the data to a point where the statistics "obviously" seem
safe.</li>
<li>They might use some other metric for data leakage, like entropy or mutual
information.</li>
</ul>
<p>How can we get an idea of how robust these proposals are? In this post, I'll
suggest a somewhat provocative approach: we'll try to analyze them <em>in the
language of differential privacy</em>. We're going to ask the following question:
<em>why</em> isn't a given mechanism differentially private?</p>
<p>I'll need a straw man to get the discussion going. Meet Paille.</p>
<p><center>
<img alt="A drawing of a friendly straw man" src="https://desfontain.es/privacy/images/paille.png">
</center></p>
<p>Paille (it's pronounced <span class="math">\(\pi\)</span>) has an algorithm <span class="math">\(A\)</span>. They believe that <span class="math">\(A\)</span>
performs anonymization: it protects the data of individuals in its input. Their
line of argument starts with: </p>
<blockquote>
<p>It's not differentially private, <em>but</em>… [insert long explanation here]</p>
</blockquote>
<p>Rather than focusing on the explanation itself, let's dig into <em>why</em> that
algorithm is not DP. First, what does it mean for an algorithm to not be DP?
Let's take <a href="differential-privacy-in-more-detail.html">the definition of differential privacy</a> and negate it. If
<span class="math">\(A\)</span> isn't DP, then there are databases <span class="math">\(D_1\)</span> and <span class="math">\(D_2\)</span> differing in only one
individual, such that the ratio:
</p>
<div class="math">$$\frac{\mathbb{P}\left[A(D_1)=O\right]}{\mathbb{P}\left[A(D_2)=O\right]}$$</div>
<p>
gets arbitrarily large for varying possible outputs <span class="math">\(O\)</span>. Remember: we called
this ratio the <a href="privacy-loss-random-variable.html">privacy loss</a> before.</p>
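<p>As a point of comparison, here is a sketch of what boundedness looks like for
a DP mechanism. For a count with Laplace noise of scale <span class="math">\(1/\varepsilon\)</span>, the log of the
ratio above stays within <span class="math">\([-\varepsilon, \varepsilon]\)</span> for every output, whereas for a deterministic
count the ratio is 0 or infinite. (The databases and outputs below are made up
for illustration.)</p>

```python
import math

def laplace_pdf(x, mu, scale):
    """Density of the Laplace distribution centered at mu."""
    return math.exp(-abs(x - mu) / scale) / (2 * scale)

# Counts on two neighboring databases differ by at most 1 (sensitivity 1).
count_d1, count_d2 = 100, 101
epsilon = math.log(3)
scale = 1 / epsilon

# Log of the probability ratio at various outputs O.
# Every printed loss lies between -epsilon and +epsilon (about ±1.099).
for o in [95, 100, 100.5, 101, 106]:
    loss = math.log(laplace_pdf(o, count_d1, scale)
                    / laplace_pdf(o, count_d2, scale))
    print(f"O = {o}: privacy loss = {loss:+.3f}")
```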
<p>Suppose that an attacker is hesitating between <span class="math">\(D_1\)</span> and <span class="math">\(D_2\)</span>: they know all
the database, except the data of one single individual. Then it's possible that
by looking at the output of <span class="math">\(A\)</span>, the attacker knows <em>for sure</em> what the data of
this person is.</p>
<p>That sounds… not great. In fact, that sounds exactly like what we were trying to
avoid. Why is this OK? Oh, wait, I have an idea. What if this only happens <em>very
rarely</em>?</p>
<h1 id="averaging-the-privacy-loss-across-outputs">Averaging the privacy loss across outputs</h1>
<p>Differential privacy is a <em>worst-case</em> property: it must hold for <em>every</em>
possible output <span class="math">\(O\)</span>. So if there's the slightest chance that the ratio of
probabilities is unbounded, we don't get DP. Yet, we might be able to say
« unless we're extraordinarily unlucky, the DP property holds ». In fact, we've
<a href="almost-differential-privacy.html">done this before</a>, when we introduced <span class="math">\((\varepsilon,\delta)\)</span>-DP. That
could be good enough!</p>
<p>We saw that <span class="math">\((\varepsilon,\delta)\)</span>-DP allows a small possibility of catastrophic
failure: the privacy loss can sometimes be <em>infinite</em>. To avoid this, we can
<em>average</em> the privacy loss across all possible outputs instead. Some variants of
DP even let us choose what <em>kind</em> of averaging function we want to use<sup id="fnref:renyi"><a class="footnote-ref" href="#fn:renyi">1</a></sup>.</p>
<p>So, Paille, is this what's happening here? Do we have differential privacy for
almost all possible outputs? Or is the average privacy loss bounded by some
reasonable value?</p>
<blockquote>
<p>Eh… not exactly. The privacy loss can be really large even if we average it
across all possible outputs.</p>
</blockquote>
<p>Oh, OK. Well, let's see what else could happen. What if, instead of averaging
the privacy loss across outputs, we average it across people?</p>
<h1 id="averaging-the-privacy-loss-across-people">Averaging the privacy loss across people</h1>
<p>Differential privacy gives the <em>same</em> protection to all individuals. The
guarantees on the privacy loss apply to everyone. Is that always necessary? In
some cases, it might be reasonable to say that some people need more privacy
protection than others. For example, folks from at-risk populations might need a
smaller <span class="math">\(\varepsilon\)</span> than majority groups.</p>
<p>Another possible approach is to protect <em>almost all</em> individuals, without
specifying <em>which ones</em>. To do so, we first need to model the population
according to a probability distribution. Then, we say « with high probability,
someone sampled from this distribution is protected ». Unlucky people might not
get any protection, but they are hopefully very rare<sup id="fnref:random"><a class="footnote-ref" href="#fn:random">2</a></sup>.</p>
<p>This is a bit like <span class="math">\((\varepsilon,\delta)\)</span>-DP: there is a small chance that
things go wrong. We could, instead, average the privacy loss <em>across people</em>.
Like before, it would avoid the possibility of infinite risk for some
individuals. This is much less robust than the previous class of definitions,
though. First, some people might <em>never</em> get good protection, if their data is
unusual. Second, it requires us to model our population with a probability
distribution. This is hard to do! And if our model is wrong, more folks might
be at risk than we expected.</p>
<p>Still, though, it's something. Paille, does your algorithm <span class="math">\(A\)</span> behave this way?</p>
<blockquote>
<p>Hmmm… no. It seems that the privacy loss is very large for more than a few
individuals. So averaging it doesn't bring much.</p>
</blockquote>
<p>Arg. Well… If you're not protecting individuals, maybe you're protecting some
other property?</p>
<h1 id="changing-the-protected-property">Changing the protected property</h1>
<p><a name="privacy-units"></a></p>
<p>With DP, the attacker tries to distinguish between databases differing in <em>one
person</em>. This means that we protect everything about any single individual.
Sometimes, though, getting to this level of protection seems like an impossible
task.</p>
<p>For example, suppose the input database is growing over time: every day, we get
new data from the users of an app. We want to publish an anonymized version of
this daily data every day. Each daily release might be differentially private…
But the total privacy loss of a given user over time is unbounded: the same
person might use the app every day for a long time.</p>
<p>This is better than nothing, though: we can still claim that we're protecting
all contributions of each user in every single day. Capturing this idea is easy:
we can redefine "neighboring datasets" to differ in the data of a single person
in a single day.</p>
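<p>One way to write this down (a sketch, with notation chosen just for this example): see each record as a triple <span class="math">\((u, t, x)\)</span>, where <span class="math">\(u\)</span> identifies a user, <span class="math">\(t\)</span> a day, and <span class="math">\(x\)</span> some data. Two databases <span class="math">\(D_1\)</span> and <span class="math">\(D_2\)</span> are then neighbors if there is a single pair <span class="math">\((u, t)\)</span> such that they agree on all records except those of user <span class="math">\(u\)</span> on day <span class="math">\(t\)</span>. The DP inequality itself stays exactly the same.</p>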
<p>We can also extend this idea to other properties that we want to protect. Maybe
finding out that someone is in our database might not be that sensitive. But
finding out the value of a specific field might be problematic! In this case, we
can adapt the definition of DP, and have the two databases differ only in this
field for a single individual.</p>
<p>Paille, can you capture the properties of your algorithm <span class="math">\(A\)</span> this way? If it's
too hard to get formal privacy guarantees for individuals, can you do it for
smaller "units of privacy"?</p>
<blockquote>
<p>Erm… it doesn't look like it. Even when the "unit of privacy" is smaller, the
privacy loss is still too high to be meaningful.</p>
</blockquote>
<p>Well, this doesn't look great. But let's persevere and try one last thing.
What if we assume the attacker is uncertain about the initial dataset?</p>
<h1 id="assuming-a-weaker-attacker">Assuming a weaker attacker</h1>
<p>When using DP, we compare the output of the algorithm on two databases that
differ in a single user. Implicitly, we assume that the attacker knows the data
of <em>everyone else</em> in the database. What if we relax this assumption?</p>
<p>Doing this seems reasonable. After all, the only realistic way an attacker could
know about everyone in a database is by having direct access to the database…
And then there's not much left to protect. Some variants of DP attempt to
formalize this idea. To do this, they capture the attacker's uncertainty using a
<em>probability distribution</em>. The neighboring databases are no longer fixed:
they're sampled from this distribution, conditioned on the data of a specific
user.</p>
<p>The variants of differential privacy obtained this way<sup id="fnref:noiseless"><a class="footnote-ref" href="#fn:noiseless">3</a></sup> have two
major problems.</p>
<ul>
<li>First, they don't <a href="differential-privacy-awesomeness.html#composition">compose</a>. Say two algorithms are "private" if
an attacker has limited background knowledge. Each output, in isolation,
doesn't leak too much information. Both outputs <em>combined</em>, though, might not
be private at all, even under the same assumption.</li>
<li>Second, these variants need us to model the database as a probability
distribution. This distribution is supposed to capture the attacker's
uncertainty… So you have to put yourself in the shoes of each possible
attacker and model their knowledge somehow. This is difficult and very
brittle: if you misjudge their knowledge even slightly, all privacy properties
might break down.</li>
</ul>
<p>Because of this<sup id="fnref:reasons"><a class="footnote-ref" href="#fn:reasons">4</a></sup>, assuming a weaker attacker can be kind of a dangerous
road. Paille, does your algorithm <span class="math">\(A\)</span> satisfy one of these variants? It
wouldn't be enough to fully convince me: I'd also need to take a long look at the
underlying assumptions, and at how you're using it in practice. Nonetheless,
it'd be a start, and it'd be better than nothing, I guess.</p>
<blockquote>
<p>Well, let me check. Modeling the attacker's uncertainty is difficult, but…
doing that doesn't give me convincing results either. I can make unrealistic
assumptions on my data, and then it sort of works. But if I try to model the
attacker in a more realistic way, I don't get great numbers at all.</p>
</blockquote>
<p>That's… unfortunate.</p>
<h1 id="whats-left">What's left?</h1>
<p>Let's recap what we know about Paille's algorithm <span class="math">\(A\)</span>. If we negate all the
relaxations we've seen so far, what do we have left?</p>
<p>An attacker who looks at the output of <span class="math">\(A\)</span>:</p>
<ul>
<li>can retrieve very <em>fine-grained</em> information</li>
<li>about <em>many individuals</em></li>
<li>even if the attacker is not particularly <em>lucky</em></li>
<li>and only has <em>limited knowledge</em> about the data.</li>
</ul>
<p>This is <strong>not good</strong>! But this is the direct conclusion of the discussion so
far. Paille's mechanism not being DP didn't seem so bad at first: after all, DP
is quite a high bar. But if we can't say <em>anything</em> about <span class="math">\(A\)</span> in the language of
DP, even if we relax the definition a lot, then this is pretty damning. No need
to dive deep into the original rationale for why <span class="math">\(A\)</span> might be safe: we just
showed it isn't.</p>
<p>Or, rather, we are unable to show that it is. This will be the last resort of
people defending their custom anonymization method: « I can't prove that it's
safe, but I still argue that it is. Prove me wrong! Show me an attack that
works. » Reversing the burden of proof this way is, of course, a red flag. If
you're anonymizing my data, you should have to convince me that what you're
doing is safe, not the other way around.</p>
<p>Further, experience shows that if someone does find an attack, that won't be
enough to end the debate. In practice, people slap a patch or two on their
algorithm, and go right back to proclaiming its safety. The history of computer
security is littered with such examples: people patch systems after an attack is
discovered, but shortly after, a minor change to the attack proves successful.
The early days of data privacy were no different. I hope that we learn from this
past, and focus future efforts on stronger notions with provable guarantees!</p>
<p>So, next time you encounter a non-DP algorithm… Why don't you insist that its
authors explain to you <em>why</em> it isn't DP?</p>
<h1 id="final-note">Final note</h1>
<p>There are many more variants and extensions of DP beyond those mentioned in this
post. In fact, a colleague and I wrote a whole <a href="https://arxiv.org/abs/1906.01337">survey paper</a> about it!
In this paper, we classify all these variants, list their properties, and
provide intuitions for each. For a short overview of this work, you can check
out the <a href="https://www.youtube.com/watch?v=P2GyJYb9FOc">recording of the talk</a> I gave about it at <a href="https://petsymposium.org/">PETS</a> last
summer.</p>
<hr>
<p>The next article in this branch is about the US Census Bureau's <a href="us-census-reconstruction-attack.html">reconstruction
attack</a>. Or you can also go to the <a href="friendly-intro-to-differential-privacy.html">table of contents</a>
of this blog post series to pick something else to read!</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:renyi">
<p>This is <a href="https://arxiv.org/abs/1702.07476">Rényi DP</a>, a definition
often used for machine learning applications. Its additional parameter
<span class="math">\(\alpha\)</span> determines which averaging function is used: <span class="math">\(\alpha=1\)</span> bounds the
<a href="https://en.wikipedia.org/wiki/Geometric_mean">geometric mean</a> of the ratio,
<span class="math">\(\alpha=2\)</span> bounds the <a href="https://en.wikipedia.org/wiki/Arithmetic_mean">arithmetic mean</a>,
<span class="math">\(\alpha=3\)</span> bounds the <a href="https://en.wikipedia.org/wiki/Root_mean_square">quadratic mean</a>,
etc. <a class="footnote-backref" href="#fnref:renyi" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:random">
<p>This is <a href="https://arxiv.org/abs/1112.2680">random DP</a>, though a couple
of other variants do essentially the same thing. <a class="footnote-backref" href="#fnref:random" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:noiseless">
<p>Like <a href="https://eprint.iacr.org/2011/487">noiseless privacy</a>, named
this way because even algorithms that don't add any noise to data can be
considered private under this variant. <a class="footnote-backref" href="#fnref:noiseless" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:reasons">
<p>… and because of other complicated technical reasons I won't get
into here, though I co-authored
<a href="https://arxiv.org/abs/1905.00650">an entire paper</a> about these problems. <a class="footnote-backref" href="#fnref:reasons" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Converting my PhD thesis into HTML2021-03-04T00:00:00+01:002023-12-28T00:00:00+01:00Damien Desfontainestag:desfontain.es,2021-03-04:/privacy/latex-to-html.html<p>A story of false hopes, perseverance, pain, and futility.</p><p>Finishing a PhD is a weird emotional experience. All the hard work, the joys,
the pains, the pulled hairs, everything gets condensed into a <a href="/thesis.pdf">scary-looking
PDF</a> and then you're just… done? What? This makes no sense
whatsoever. Or rather, this makes sense on paper, but then you feel this weird
sense of grief somehow. And you're not quite at the acceptance stage yet. So
instead, you decide to deal with those feelings in a perfectly normal and
healthy way, and you embark on a journey to compile said thesis <a href="/thesis/">into a series
of HTML pages</a>.</p>
<p>HTML, by the way, is a much better way of disseminating information than PDF.
Pretty much all recent scientific research is recorded in PDF files, for
historical reasons that are largely irrelevant today. PDFs are difficult to
browse, impossible to read on a phone, uncomfortable to read on a tablet,
hostile to screen readers, impractical for search engines, and the list goes on.
It's just a terrible format, unless you're trying to print things on paper.
Printing things is a perfectly reasonable thing to do, but that's really not the
main use case we should be optimizing for.</p>
<p>Anyway. I <a href="/thesis/">converted my thesis to HTML</a> and this is my story. A story
of false hopes, perseverance, pain, and futility. I hope this can be useful to
other people, as a guide on how to do this for your own thesis or large &
complex LaTeX documents, or as an encouragement to do something better with your
time instead.</p>
<h1 id="false-hopes">False hopes</h1>
<p>"Convert LaTeX to HTML", I type in my search engine of choice. Ooooh, I have
options! There's <a href="https://pandoc.org/">pandoc</a>, <a href="https://ctan.org/pkg/lwarp">lwarp</a>, <a href="https://dlmf.nist.gov/LaTeXML/">LaTeXML</a>, <a href="https://tug.org/tex4ht/">TeX4ht</a>, and probably others.
This looks excellent. Converting LaTeX to HTML is clearly a problem that other
people have already solved for me before. I will just have to run an existing
tool, and iron out the kinks.</p>
<p>I download the tools in question, run them on my thesis, and look at the initial
results. <a href="https://tug.org/tex4ht/">TeX4ht</a> fires off a bunch of compilation
errors and warnings, but it outputs something that kinda looks reasonable from a
distance. All others fail completely. So I go, "OK, let's try to fix the TeX4ht
problems, to get a feeling for how difficult this is". It turns out not to be
too difficult to fix the most common issues: <a href="https://tex.stackexchange.com/questions/tagged/tex4ht">LaTeX Stack
Exchange</a> answers most of
my questions, so I make progress. I also notice that there is a nice-looking
build system for TeX4ht called <a href="https://github.com/michal-h21/make4ht">make4ht</a>,
which looks really nifty; I imagine it's going to be similar to
<a href="https://mg.readthedocs.io/latexmk.html">latexmk</a>, which I love.</p>
<p>So, things are going alright. I make progress. Here are some of the problems I
found at first and how I fixed them.</p>
<ul>
<li>A bunch of packages or commands don't make much sense in an HTML context: page
breaks, PDF anchors, page numbers, floats, landscape layouts, margins or other
types of spacing… Some of them (like <code>floatrow</code>) throw compilation errors;
most are simply ignored. I made a pass at all the packages I used and removed
the ones that were obviously irrelevant for HTML.</li>
<li>One special case is <code>longtable</code>: since a regular table can be as long as you
need it to be in HTML, you also don't need it. Replacing it by a regular
<code>tabular</code>, and <code>ThreePartTable</code> (from <code>threeparttablex</code>) by the regular
<code>threeparttable</code> fixed the problem.</li>
<li>Importing an image originally stored in a PDF rendered it into a tiny
unreadable thumbnail. Adding a "config file" with some <a href="https://tex.stackexchange.com/a/46210">dark magic in
it</a> did the trick. </li>
<li>SVGs generated with <code>tikzpicture</code> were very wrong (missing text, blank
graphs…). Apparently the "driver" included in htlatex is not good, but for
some reason it's <a href="https://tex.stackexchange.com/a/488741">still in use</a>.
Including the line that calls a different driver wasn't enough: even though
the file was already present on my system, I still got some bugs (text not at
the right place). Importing the file <a href="https://github.com/michal-h21/dvisvgm4ht/blob/master/pgfsys-dvisvgm4ht.def">directly from
GitHub</a>
worked.</li>
<li>Some commands don't work for reasons I didn't really understand, but are
easily fixable: for example, <code>\notin</code> works fine with <code>pdflatex</code>, but <code>tex4ht</code>
complains about it. Replacing it by <code>\not{\in}</code> everywhere fixes it.</li>
<li>Each footnote is, by default, put in its own separate HTML file. It gets fixed
by creating a <code>.make4ht</code> file that contains something like:
<pre><code>settings_add {
  tex4ht_sty_par = ",fn-in"
}</code></pre>
This tells <code>make4ht</code> to pass additional arguments (here, <code>fn-in</code>) to <code>tex4ht</code>,
which change its behavior. There are <a href="https://www.kodymirus.cz/tex4ht-doc/texfourhtOptions.html">many available
options</a>.</li>
</ul>
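<p>For reference, the <code>longtable</code> swap amounts to something like this (a sketch, with made-up table contents):</p>
<pre><code>% Before (useful for page breaks in PDF, unnecessary in HTML):
% \begin{longtable}{ll} … \end{longtable}
% After:
\begin{threeparttable}
  \begin{tabular}{ll}
    a & b \\
    c & d \\
  \end{tabular}
  \begin{tablenotes}
    \item An example note.
  \end{tablenotes}
\end{threeparttable}</code></pre>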
<p>I should probably have noticed the early warning signs. One is that the default
behavior often makes zero sense: for example, this footnote problem… who would
want footnotes in separate HTML files when all the rest is in a single HTML
file? Why is that a reasonable thing to do?</p>
<p>Also, compilation errors don't give you a clear picture of what actually goes
wrong. LaTeX is bad at this in general, but TeX4ht is definitely worse. The
error messages are often classical LaTeX errors like <code>! Extra }, or forgotten
\endgroup</code>, but that's almost never the <em>actual</em> error, since the same file
compiles fine into PDF. So looking up error messages online doesn't help.
Instead, I fixed those early problems by bisecting the error, or by asking the
internet how to do a certain thing.</p>
<p>Still, I'm making quick progress. I wonder things like "can I put the different
sections on different HTML pages rather than having one monolithic document" and
find out that all I need to do is pass an option to TeX4ht and it works. The
option is unbelievably badly named: to tell it "make one page per subsection",
you tell it "3", because that's three subdivision levels (chapter / section /
subsection). Yes, I really mean "3". The option has no other name. You just pass
a single number to the command line.</p>
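<p>With the options so far combined, the build command ends up looking something like this (a sketch; the file name is hypothetical, and the exact option list depends on your setup):</p>
<pre><code># footnotes inlined ("fn-in"), one page per subsection ("3")
make4ht -u -d output thesis.tex "fn-in,3"</code></pre>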
<p>But whatever. It works. I make progress. I invest time fixing things. Surely, if
I just spend a few more hours of fixing things, I'll be done. The sunk cost
fallacy starts taking its hold on me. I don't notice a thing. </p>
<h1 id="perseverance">Perseverance</h1>
<p>I start stumbling into some issues that are more difficult to fix. The first big
one is how equations display. By default, TeX4ht converts each equation into an
image, and includes the image in the HTML file. I imagine it's pretty awful for
accessibility, and it's also really ugly. The images are low-quality, stand out
in the middle of text, zooming in or out is a visual nightmare. After some
testing, I decide that the best solution is to pass the <code>mathml</code> option to
<code>tex4ht</code>, and pass the <code>html5+mathjaxnode</code> option to <code>make4ht</code> to tell it to
post-process all of the pages with <code>mathjax-node-page</code>, which converts the
MathML equations into… prettier-looking equations I think. I don't exactly
understand how it works, but MathML alone is ugly, and this is pretty. Ship it.
This requires me to install Node.js, which, urgh, but whatever.</p>
<p>I realize only afterwards that this package is deprecated, and that make4ht's
<a href="https://github.com/michal-h21/make4ht">GitHub repository</a> recommends using the
<code>mjcli</code> option instead. That option isn't recognized on my machine, probably
because I don't have a recent enough version. What I have works, so I don't look
further.</p>
<p>I also start cleaning up my build process. And this is where I start noticing
some behaviors of these tools that are kind of really wrong and frustrating for
no reason.</p>
<ul>
<li>One example is the <code>-d</code> option of <code>make4ht</code>, which is supposed to tell it "put
all output files in this specific subdirectory". This option is lying to you.
The files are <em>copied</em> over to this directory, and only <em>some</em> of them. So
your working directory is still cluttered with intermediate files,
logs, and HTML files.</li>
<li>I initially thought that it would be kind of like <code>latexmk</code>, running the
compilation commands multiple times until it gets the bibliography references
right. It does not do that. You have to do it manually.</li>
<li>When you realize you didn't compile what you wanted to, pressing <code>ctrl-c</code>
doesn't seem to stop the process. It does, however, make the command-line
output hang. So you have to close the terminal and open a new one again.</li>
</ul>
<p>None of these things is a huge deal-breaker. I am still making progress. I also
fix a bunch of other problems that start looking more like real weird bugs than
understandable annoyances.</p>
<ul>
<li><code>\autoref</code> did not work. I tried pretty hard to fix it, and finally gave up
and changed all the <code>\autoref</code>s into regular <code>\ref</code>s using <code>sed</code>.</li>
<li>LaTeX expressions that are perfectly fine according to <code>pdflatex</code>, like
<code>a_\text{b}</code> or <code>a^\mycommand{b}</code> (where <code>\mycommand</code> is a custom command),
failed to compile. This could be fixed by adding brackets: <code>a_{\text{b}}</code>
works, as does <code>a^{\mycommand{b}}</code>. Alas, fixing all compilation problems
isn't enough: simple expressions like <code>e^\eps</code>, where <code>\eps</code> is defined as
simply an alias of <code>\varepsilon</code>, <em>compile</em> fine, but <em>display</em> incorrectly,
so they must also be detected and changed to e.g. <code>e^{\eps}</code>.</li>
<li>But wait, this gets even worse: expressions like <code>e^{\eps}</code> are fine in text,
but if they are put in <em>macros</em>, then they no longer work. Sometimes. To solve
that final problem, I replaced all <code>_</code> and <code>^</code> in my macros by <code>\sb</code> and
<code>\sp</code>. Gross.</li>
<li>The itemized list of <code>tablenotes</code> in <code>threeparttable</code> environments did not
correctly put line breaks between items. You have to add line breaks manually.</li>
<li>Speaking of tables, <code>multirow</code> doesn't work. A workaround is to use <code>\newline</code>
within cells. There is probably a better option.</li>
<li>Having multiple <code>\tikzpicture</code> commands in a single figure resulted in really
weird visual bugs, without a compilation error: only a single picture being
shown, random text in absurd places. Putting each <code>\tikzpicture</code> in its own
cell in a <code>tabular</code> environment is a quick workaround. There is probably a
better option (<code>subfigure</code> with the right arguments maybe?).</li>
<li><code>\hat{D}</code> looked reasonable, while <code>\hat{O}</code> displayed like the French Ô in
equations. Whyyyy. I fixed it by using <code>\hat{{O}}</code>. No clue why it works nor
why it happened in the first place.</li>
<li>Regular parentheses in equations are automatically sized to the biggest thing
on the same line. So if you have an equation like
<span class="math">\(f(x)=\frac{tallthing}{alsotallthing}\)</span>, the parentheses around the <span class="math">\(x\)</span> are
comically large. You need to replace all these by <code>\left(x\right)</code> to get the
correct behavior.</li>
<li>Having a <code>cases*</code> environment nested inside of an <code>align*</code> environment failed
to compile. Replacing the <code>align*</code> environment by <code>\[ … \]</code> compiles, but the
line breaks within the <code>cases*</code> environment are ignored. I solved it by using
a <code>matrix*</code> environment instead (with the <code>[l]</code> option for correct alignment),
surrounded by <code>\left\{...\right.</code> to emulate the big left bracket.</li>
</ul>
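<p>The last workaround deserves a sketch (with made-up equation contents). Instead of:</p>
<pre><code>\begin{align*}
f(x) = \begin{cases*}
  0 & if $x < 0$ \\
  1 & otherwise
\end{cases*}
\end{align*}</code></pre>
<p>I ended up writing:</p>
<pre><code>\[
f(x) = \left\{\begin{matrix*}[l]
  0 & \text{if } x < 0 \\
  1 & \text{otherwise}
\end{matrix*}\right.
\]</code></pre>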
<p>This is where I started doing some really ugly things to get around such bugs.
Using <code>grep</code> and <code>sed</code> to do large-scale changes, or doing gross things like
replacing horizontal spaces by non-breaking spaces, became routine. At that
point though, I was in too deep to reconsider my choices. So I kept going, even
as the bugs got progressively more arcane.</p>
<h1 id="pain">Pain</h1>
<p>The serious problems happened as I was trying to figure out how to get the table
of contents working as expected. It seemed to be truncated for no reason, with
very weird errors on the command line, referencing some intermediate files. I
bisected it to a <code>%</code> symbol in the caption of a figure. You read that right: I had
a correctly-escaped <code>%</code> in the caption of one of my figures; it compiled and
displayed perfectly fine, but it broke the regular table of contents. Not the
"list of figures", mind you! I didn't even have a list of figures!</p>
<p>Another problem was with chapter- or section-specific tables of contents, which
are a good thing to have when everything is separated across many HTML pages.
Sadly, they sometimes had the wrong sections or subsections in them; Section 4.2
would have a few subsections from Section 4.3 in its table of contents. I tried
for a while to make a minimal working example to figure out where the problem
came from, but the behavior didn't look very deterministic, so I gave up and simply
removed these altogether.</p>
<p>Captions also have their share of bizarre, non-deterministic bugs. For example,
using a formula like <code>\left[a\middle|b\right]</code> inside of a caption made
compilation fail. Removing the <code>\middle</code> part, which does not cause any issue
anywhere else, fixes it. Except that macros also sometimes fail to display the
desired formula inside captions, with e.g. a subscript being ignored. But the
exact same code without a macro would work fine, or the same macro outside a
<code>\caption{}</code> would also work fine. Bizarre stuff.</p>
<p>Eventually, I stopped trying to fix the bugs, and simply learned to work around
them, by either removing the thing entirely, or post-processing the output. This
happened, uh, a number of times.</p>
<ul>
<li>Using <code>\intertext</code> between lines of an <code>align*</code> equation, a trick to keep
equations aligned even when you put a paragraph of text between them, resulted
in the entire thing being ridiculously shifted to the right. I solved it by
changing the <code>\intertext</code> into a normal paragraph.</li>
<li>Algorithms from the <code>algorithm2e</code> package display really strangely. Removing
line numbers kind of helps, but it's not great, and the <a href="https://tex.stackexchange.com/a/292712">official
advice</a> seems to be "convert it as an
image", which, ew. I only used this environment once, so I simply converted it
into a listing.</li>
<li>The TeX4ht config file did not work as expected. The internet tells me adding
lines starting with <code>\Configure{@HEAD}</code> is supposed to add corresponding lines
in the <code><head></code> element of the generated HTML files, and you add multiple such
lines to add multiple elements. There are plenty of examples online of this
pattern being used. Somehow, on my machine, only the first such command was
added to <code><head></code>, the others appeared in the <code><body></code> instead (which, of
course, does not have the expected semantics). After a few hours trying to
debug this, I trashed that whole idea and, instead, made a Python script that
replaced the beginning and the end of each HTML page entirely.</li>
<li>A series of underscores got added after some of the citations at the beginning
of each chapter. I added a few lines to my Python script to get rid of them
without even trying to understand where that particular weirdness came from.</li>
</ul>
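<p>The core of that kind of post-processing script can be sketched like this (the replacement header and names here are hypothetical, not the ones I actually used):</p>
<pre><code>import re

# Hypothetical replacement header; the real one matched the rest of my website.
NEW_HEAD = "<head><meta charset='utf-8'><title>Thesis</title></head>"

def fix_page(html):
    # Throw away whatever TeX4ht generated between <head> and </head>,
    # and substitute our own header wholesale.
    return re.sub(r"<head>.*?</head>", NEW_HEAD, html, flags=re.DOTALL)

# The real script looped over every generated HTML file; here, a tiny demo:
page = "<html><head><script>junk</script></head><body>content</body></html>"
fixed = fix_page(page)</code></pre>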
<p>The CSS part of this whole build process is also broken in interesting ways.
Two style files are generated: a main one that I think is part of TeX4ht, and
another one added by <code>mathjax-node-page</code>.</p>
<ul>
<li>The main CSS has the same rules repeated many times for no reason. It also
has styles that are "obviously wrong": <code>class="indent"</code> ends up <em>disabling</em>
the text indentation, while there are elements with the "noindent" class,
which isn't defined anywhere in the CSS, so those elements inherit the global behavior (which is
"add an indentation" on my website).</li>
<li>The mathjax CSS is fine, but the build process copies it over to the output
directory <em>every time a file is generated</em>. But when the file doesn't contain
any equation, that CSS is empty instead! So if that's the case for the <em>last</em>
file generated by the build process, its empty CSS file overwrites the correct
CSS file and all of a sudden, the equations look terrible. I fixed it by
manually adding the "right" CSS in a fixed place.</li>
</ul>
<h1 id="futility">Futility</h1>
<p>So, it's <a href="/thesis">done</a>. I'm pretty happy about how it looks. The whole
exercise was entirely futile, of course: it's not like anyone will, y'know,
actually <em>read</em> the damn thing. But I'm weirdly glad it exists.</p>
<p>Obviously though, I'm not at all impressed by the road that was needed to get to
this point. It's infuriating that doing something like this was so hard. LaTeX
is the main way scientific research gets written up. HTML might be the main
format used by pretty much everyone on the planet to consume written content.
Why is converting one to the other not a solved problem?</p>
<p>Of course, we know why. Incentives in academia are irremediably broken, so we're
stuck with old practices, bad formats, a lack of funds for projects that would
make everyone's life better, and a structural impossibility to do much about it.
My friend a3nm <a href="https://a3nm.net/work/research/wrong/">lays out all of these root
causes</a> much better than I possibly
could, and this LaTeX-to-HTML story is a good illustration. Imagine that we
lived in a world where it was <em>trivial</em> to make beautiful web pages out of
scientific papers. Wouldn't that encourage more researchers to share their work
more widely? Wouldn't that create whole new categories of readership, given that
most people consume content on their phones? If HTML were the default format for
research, would more people realize how <em>ridiculous</em> it is that paywalled
research papers are still a thing in 2021?</p>
<p>Anyway. I'm complaining, but I still want to finish off on a positive note: the
people who are actually doing the work of building and maintaining this tooling
are <em>heroes</em>. The many bugs and annoyances I complained about should in no way
be interpreted as a criticism of the authors of the software. Converting LaTeX
to HTML is absurdly hard because LaTeX was never designed for such a thing,
because the input language is forever stuck in the 80's, and because the
complexity of the package ecosystem is out of control. The more you dive into
how these converters work, the more you realize that the fact that they work at
all is actually pretty impressive! Massive respect to folks like <a href="https://tex.stackexchange.com/users/2891/michal-h21">Michal
Hoftich</a>, who are creating
software that solves a fundamentally difficult problem <em>and</em> spending massive
amounts of time and energy answering people's questions. Genuinely inspiring. </p>
<p>I hope that some day, that kind of work can be properly funded and rewarded. I
don't really know how we get there.</p>
<h1 id="additional-thoughts-added-in-december-2023">Additional thoughts (added in December 2023)</h1>
<p>Time has passed since I originally wrote this blog post, and a few things
happened since.</p>
<ul>
<li>I presented my thesis and this blog post as an exhibit at a workshop called
<a href="https://rethinkingmlpapers.github.io/papers/">Rethinking ML Papers</a>. I gave a short talk about it, and a
recording is now available <a href="https://www.youtube.com/watch?v=vznwJU-WdUo">on YouTube</a>.</li>
<li><a href="https://kwarc.info/people/dginev/">Deyan Ginev</a>, one of the maintainers of <a href="https://dlmf.nist.gov/LaTeXML/">LaTeXML</a>, reached out to me
to tell me that they've landed patches to avoid fatal errors during
conversion. Versions of LaTeXML from 2022 onwards now produce a partial output
when run on the original sources of my PhD thesis. He's now involved in the
<a href="https://ar5iv.labs.arxiv.org/">ar5iv</a> project, whose goal is to convert all papers on <a href="https://arxiv.org">arXiv</a>
into HTML. Super cool progress!</li>
<li><a href="https://bdtechconcepts.com/">Brian Dunn</a>, the main author of <a href="https://ctan.org/pkg/lwarp">lwarp</a>, reached out to me to ask for
the original sources of my PhD thesis, and understand more about what problems
I originally encountered. He then <em>fixed all the issues in lwarp</em> until my
thesis compiled without issue (on v0.897 and above). How impressive is this!
My original blog post said that the folks who build & maintain conversion
software are heroes, and I could not have been more right.</li>
</ul>
<p>This made me want to understand more about the technical complexity of this kind
of work, and get a better overview of the different tools that are out there
(something I wish I'd done at the <em>beginning</em> of this project rather than at the
end). I found some good discussions available online, for example
<a href="https://tug.org/pipermail/texhax/2016-March/022231.html">here</a> or
<a href="https://tug.org/pipermail/tex4ht/2016q1/001399.html">there</a>. It made me realize
that there are some profound differences with how different tools tackle the
problem.</p>
<p>If I had to do this again, I would probably use <a href="https://ctan.org/pkg/lwarp">lwarp</a>. I like its
straightforward technical approach: it uses LaTeX itself to parse the source
files and directly generate HTML. This means it can't handle unknown LaTeX
packages that implement brand new things… but also that it's less likely to lead
to super arcane errors, and that adding support for new packages is easier. It
also tries to suggest alternative packages when it encounters an unsupported
one, allowing users to solve the error at the source. I would also be cautiously
optimistic about my ability to patch lwarp itself if necessary.</p>
<p>If you've had some experience doing big LaTeX-to-HTML conversion projects like
the one described in this blog, let me know!</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>The magic of Gaussian noise2020-11-15T00:00:00+01:002021-03-17T00:00:00+01:00Damien Desfontainestag:desfontain.es,2020-11-15:/privacy/gaussian-noise.html<p>Why is Gaussian noise a popular choice to make statistics and machine learning models differentially private?</p><p><small>
<span class='notlettrine'>T</span>his post is part of a <a href="friendly-intro-to-differential-privacy.html">series on differential
privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see the other
articles!</p>
<p></small></p>
<hr>
<p><span class='lettrine'>P</span><strong>reviously</strong>, we used Gaussian noise to explain
the <a href="privacy-loss-random-variable.html">real meaning of <span class="math">\(\delta\)</span></a> in <span class="math">\((\varepsilon,\delta)\)</span>-differential
privacy. One question was left unanswered: why would anyone use Gaussian noise
in the first place? The guarantees it provides aren't as strong: it gives
<span class="math">\((\varepsilon,\delta)\)</span>-DP with <span class="math">\(\delta>0\)</span>, while Laplace noise provides pure
<span class="math">\(\varepsilon\)</span>-DP. This blog post gives an answer to this question, and describes
the situations in which Gaussian noise excels.</p>
<h1 id="gaussian-noise-is-nice">Gaussian noise is nice</h1>
<p>A first advantage of Gaussian noise is that the distribution itself behaves
nicely. It's called the <a href="https://en.wikipedia.org/wiki/Normal_distribution"><em>normal</em> distribution</a> for a reason: it has
convenient properties, and is very widely used in natural and social sciences.
People often use it to model random variables whose actual distribution is
unknown. If you sum many independent random variables, you <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">end up with a normal
distribution</a>. And these are just a few of the <a href="https://en.wikipedia.org/wiki/Normal_distribution#Properties">many
properties</a> of this fundamental distribution. Thus, most data
analysts and scientists are already familiar with Gaussian noise. It's
convenient when you release anonymized statistics: analysts don't need to learn
too many new concepts to understand what you're doing to protect the data.</p>
<p>A second advantage is that the Gaussian distribution has nice, <em>thin</em> tails. The
vast majority of its probability mass is focused around its mean. Take a normal
distribution with mean 0 and standard deviation <span class="math">\(\sigma\)</span>. The <a href="https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule">68–95–99.7
rule</a> says that a random variable sampled from this distribution will be:</p>
<ul>
<li>in <span class="math">\([-\sigma,\sigma]\)</span> with 68% probability;</li>
<li>in <span class="math">\([-2\sigma,2\sigma]\)</span> with 95% probability;</li>
<li>and in <span class="math">\([-3\sigma,3\sigma]\)</span> with 99.7% probability.</li>
</ul>
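These probabilities can be checked directly from the Gaussian cumulative distribution function. A quick sketch in Python (the helper name is mine):

```python
import math

def prob_within(k: float) -> float:
    """P(|X| <= k*sigma) for a Gaussian with mean 0 and standard deviation sigma."""
    # For a standard normal variable Z, P(|Z| <= k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

print(prob_within(1))  # ~0.6827
print(prob_within(2))  # ~0.9545
print(prob_within(3))  # ~0.9973
```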
<p>It gets even better as you go further away from the mean. The probability that
the random variable is <em>outside</em> <span class="math">\([-k\sigma,k\sigma]\)</span> decreases faster than
<span class="math">\(e^{-k^2/2}\)</span>. In practice, you're rarely surprised by the values that a Gaussian
distribution takes. Even if you sample <span class="math">\(1,000,000\)</span> values, they are all probably
going to be within <span class="math">\([-5\sigma,5\sigma]\)</span>.</p>
<p>Laplace, by comparison, isn't quite as nice. Its tails decrease exponentially fast,
but that's still much slower than Gaussian tails. Suppose you sample <span class="math">\(1,000,000\)</span>
values from a Laplace distribution of standard deviation <span class="math">\(\sigma\)</span>. On average,
<strong>849</strong> of them will be outside <span class="math">\([-5\sigma,5\sigma]\)</span>.</p>
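That 849 figure can be recovered analytically rather than by sampling: a Laplace distribution with standard deviation <span class="math">\(\sigma\)</span> has scale parameter <span class="math">\(\sigma/\sqrt{2}\)</span>, so the probability of landing outside <span class="math">\([-5\sigma,5\sigma]\)</span> is <span class="math">\(e^{-5\sqrt{2}}\)</span>; the Gaussian equivalent uses the complementary error function. A small sketch:

```python
import math

n = 1_000_000

# Laplace with std sigma has scale b = sigma / sqrt(2),
# so P(|X| > k*sigma) = exp(-k*sigma/b) = exp(-k*sqrt(2)).
laplace_outside = n * math.exp(-5 * math.sqrt(2))

# Gaussian: P(|X| > k*sigma) = erfc(k / sqrt(2)).
gaussian_outside = n * math.erfc(5 / math.sqrt(2))

print(round(laplace_outside))  # ~849 expected outliers
print(gaussian_outside)        # well below 1
```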
<p>OK, so Gaussian noise is nice. But that does not change a simple fact: to get
a comparable level of privacy for a single statistic, Laplace is much better.
Assume that we're adding noise to a simple count, of sensitivity <span class="math">\(1\)</span>. This graph
compares the Laplace noise needed to get <span class="math">\(\varepsilon=\ln(3)\)</span>, and Gaussian noise
needed to get <span class="math">\(\varepsilon=\ln(3)\)</span> and <span class="math">\(\delta=10^{-5}\)</span>.</p>
<p><center>
<img alt="Graph showing a Laplace distribution with scale 1/ln(3) and a Gaussian distribution of standard deviation 3.426, both centered on 0" src="https://desfontain.es/privacy/images/laplace-gaussian-ln-3.svg">
</center></p>
<p>Despite its weaker privacy guarantees, the Gaussian distribution is much
flatter. Its standard deviation is over 3.4, while Laplace's is less than 1.3.
Thus, much more noise will need to be added, and analysts care a lot about
minimizing the noise. Why, then, would Gaussian noise be a good option? The answer
is simple: because it gets better when you're publishing a lot of statistics.</p>
<h1 id="from-one-to-many-statistics">From one to many statistics</h1>
<p>In most of our <a href="differential-privacy-in-practice.html">previous examples</a>, we assumed that each individual
appeared in a <em>single</em> statistic. This case is common, for example when
partitioning people based on demographic information. But in many applications,
this assumption does not hold. Imagine, for example, that you want to answer the
question: « what types of specialized physicians did people visit in the past 10
years? »</p>
<p>Assume we're working in the <a href="local-global-differential-privacy.html">central model</a>. We have a dataset of
〈patient ID, specialist type〉 pairs, and each record corresponds to an
individual visiting a specialized physician (cardiologist, dermatologist,
radiologist, etc.). We want to count the number of unique patient IDs per
specialty.</p>
<p>Note that each patient can only influence each count <em>once</em>. We count
<em>distinct</em> patient IDs: if you visit dermatologists 10 times, you will only add 1
to the "dermatologist" count. However, a single patient might visit many types of
specialized physicians. There are <a href="https://en.wikipedia.org/wiki/Medical_specialty">many kinds of
specialties</a>, and a single
patient might influence all the counts. Say there are 50 of them.</p>
<p>How to make these counts differentially private? A first solution is to <a href="differential-privacy-in-practice.html#multiple-statistics"><em>split
the privacy budget</em></a> across all the counts. Here, we can
split our privacy budget in 50. If we want to achieve <span class="math">\(\varepsilon=1\)</span>, we
compute <span class="math">\(\varepsilon'=1/50=0.02\)</span>, and add Laplace noise of scale
<span class="math">\(1/\varepsilon'=50\)</span> to all the counts.</p>
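As a sketch (with made-up counts, and a helper function of my own since the standard library has no Laplace sampler), splitting the budget looks like this:

```python
import math
import random

def laplace_noise(scale):
    """Sample from Laplace(0, scale) via inverse CDF sampling."""
    u = random.random() - 0.5
    sign = -1.0 if u < 0 else 1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

epsilon_total = 1.0
k = 50                                 # number of specialties
epsilon_per_count = epsilon_total / k  # 0.02
scale = 1.0 / epsilon_per_count        # Laplace scale = 50 (sensitivity 1)

true_counts = [1200, 340, 87]          # hypothetical per-specialty counts
noisy_counts = [c + laplace_noise(scale) for c in true_counts]
```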
<p>Unfortunately, this is a lot of noise.</p>
<p><center>
<img alt="Graph showing a Laplace distribution with scale 50, centered on 0" src="https://desfontain.es/privacy/images/laplace-0.02.svg">
</center></p>
<p>Fortunately, this is exactly the kind of situation in which Gaussian noise
shines. When a single patient can impact <span class="math">\(k\)</span> distinct statistics, we need to
scale Laplace noise by <span class="math">\(k\)</span>. By contrast, Gaussian noise must only be scaled by
<span class="math">\(\sqrt{k}\)</span>. Comparing the two gives a much more flattering view of the power of
Gaussian noise.</p>
<p><center>
<img alt="Graph showing a Laplace distribution with scale 50, and a Gaussian distribution of standard deviation 26.38, both centered on 0" src="https://desfontain.es/privacy/images/gaussian-laplace-50.svg">
</center></p>
<p>OK, so that's the general idea. Now, <em>why</em> does that happen? How come
composition doesn't seem to behave in the same way for Laplace and Gaussian? To
understand this better, we'll first introduce the concept of <em>sensitivity</em>.</p>
<h1 id="different-kinds-of-sensitivities">Different kinds of sensitivities</h1>
<p>Consider our example above. For each type of specialized physician, we count the
people who consulted with one. But we won't consider this histogram as 50
different counting queries. Instead, we'll consider it as a <em>single</em> function,
with values in <span class="math">\(\mathbb{N}^{50}\)</span>. It outputs a <em>vector</em>: a list of 50
coordinates, each of which corresponds to a fixed specialty. How to make such a
function <span class="math">\(f\)</span> differentially private? We'll add noise, scaled by the function's
<em>sensitivity</em>.</p>
<p>The sensitivity of a function measures how much its output can change when you
add one record in the database. If the function returns a single number, we
measure the <em>absolute value</em> of the difference: the sensitivity of <span class="math">\(f\)</span> is the
maximum value of <span class="math">\(\left|f\left(D_1\right)-f\left(D_2\right)\right|\)</span>. We already
encountered the sensitivity before: when <a href="differential-privacy-in-practice.html#counting-things">counting things</a>, if each
patient can change the statistic by more than <span class="math">\(1\)</span>, we needed to scale the noise
accordingly. The same happened for sums.</p>
<p>Here, the function <span class="math">\(f\)</span> returns a vector. How do we measure the distance between
two vectors? We have a few options. We could use the <a href="https://en.wikipedia.org/wiki/Taxicab_geometry">Manhattan
distance</a>, or the <a href="https://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a>, or even <a href="https://en.wikipedia.org/wiki/Norm_(mathematics)#p-norm">weirder
stuff</a>. As it turns out, the distance we need to use depends on which noise
function we're adding. Laplace noise is scaled by the <span class="math">\(L^1\)</span> sensitivity, itself
based on the Manhattan distance. Here is its definition, denoting by <span class="math">\(f_i(D)\)</span>
the <span class="math">\(i\)</span>-th coordinate of <span class="math">\(f(D)\)</span>:</p>
<div class="math">$$
\Delta_1(f) = \max \sum_{i=1}^{50} \left|f_i\left(D_1\right)-f_i\left(D_2\right)\right|
$$</div>
<p style="text-indent: 0em">where the <span class="math">\(\max\)</span> is taken over <span class="math">\(D_1\)</span>
and <span class="math">\(D_2\)</span> differing in a single record. This is easy to understand: you just sum
the sensitivities for each coordinate. For our function <span class="math">\(f\)</span>, we have
<span class="math">\(\Delta_1(f)=50\)</span>: Laplace noise needs to be scaled by 50. You might have noticed
that this is equivalent to using simple composition: the scale of Laplace noise
is <span class="math">\(\Delta_1/\varepsilon\)</span>, so dividing <span class="math">\(\varepsilon\)</span> by <span class="math">\(50\)</span> is the same as
considering all coordinates together.</p>
<p>By contrast, Gaussian noise needs to be scaled by the <span class="math">\(L^2\)</span> sensitivity. This
type of sensitivity is based on the Euclidean distance, and defined by:</p>
<div class="math">$$
\Delta_2(f) = \max \sqrt{\sum_{i=1}^{50} {\left|f_i\left(D_1\right)-f_i\left(D_2\right)\right|}^2}
$$</div>
<p style="text-indent: 0em">still taking the <span class="math">\(\max\)</span> over <span class="math">\(D_1\)</span> and
<span class="math">\(D_2\)</span> differing in a single record. This formula might look complicated, but the
Euclidean distance is a simple concept: it's the length of a straight line
between two points. If you only have two dimensions, this formula might be
reminiscent of the <a href="https://en.wikipedia.org/wiki/Pythagorean_theorem">Pythagorean theorem</a>!</p>
<p>The standard deviation <span class="math">\(\sigma\)</span> of Gaussian noise will be proportional to
<span class="math">\(\Delta_2(f)\)</span>. Let's compute this value for our function. Each patient can
change each count by at most one, and a single patient can influence all 50 counts. Thus:</p>
<div class="math">$$
\Delta_2(f) = \sqrt{\sum_{i=1}^{50} 1^2} = \sqrt{50} \approx 7.07.
$$</div>
<p>The noise scales with the <em>square root</em> of the number of counts. This is key to
Gaussian's superiority in such situations: the <span class="math">\(L^2\)</span> sensitivity grows much more
slowly than the <span class="math">\(L^1\)</span> sensitivity. As a result, scaling the noise by the
sensitivity hurts accuracy much less. Of course, using Gaussian noise gives you
<span class="math">\((\varepsilon,\delta)\)</span>-DP, not pure DP, so there is still a tradeoff. As we saw
in the <a href="privacy-loss-random-variable.html">previous article</a>, this isn't a super scary <span class="math">\(\delta\)</span>, so it's
generally worth it.</p>
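The two sensitivities above are easy to compute in code; here is a small sketch (the helper names are mine) for our histogram, where each patient changes each of the 50 counts by at most 1:

```python
import math

def l1_sensitivity(per_coordinate_change):
    """L1 sensitivity: sum of per-coordinate changes (Manhattan distance)."""
    return sum(abs(c) for c in per_coordinate_change)

def l2_sensitivity(per_coordinate_change):
    """L2 sensitivity: root of the sum of squares (Euclidean distance)."""
    return math.sqrt(sum(c * c for c in per_coordinate_change))

# Our histogram: 50 counts, each changed by at most 1 per patient.
changes = [1] * 50
print(l1_sensitivity(changes))  # 50
print(l2_sensitivity(changes))  # ~7.07
```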
<p><a name="formula"></a></p>
<p><a href="https://arxiv.org/abs/1805.06530">This paper</a> (Theorem 8) gives the exact
formula for calibrating Gaussian noise depending on <span class="math">\(\Delta_2(f)\)</span>, <span class="math">\(\varepsilon\)</span>
and <span class="math">\(\delta\)</span>. You need to pick <span class="math">\(\sigma\)</span> such that the following equality holds:</p>
<div class="math">$$
g\left(\frac{\Delta_2(f)}{\sigma},\varepsilon\right) = \delta
$$</div>
<p style="text-indent: 0em">where <span class="math">\(g\)</span> is a complicated function. As
you can see, multiplying <span class="math">\(\Delta_2(f)\)</span> and <span class="math">\(\sigma\)</span> by the same factor has no effect:
<span class="math">\(\sigma\)</span> is proportional to <span class="math">\(\Delta_2(f)\)</span>. There is no analytic form for
<span class="math">\(\sigma\)</span>, but since <span class="math">\(g\)</span> is monotonic, you can use e.g. a binary search to
approximate its value. If you want to know the exact formula, click here:
<button id="toggleMath"></button>.</p>
<div id="math" style="display: none; border-left: double; padding-left: 10px">
<p>OK. You need to satisfy the inequality:</p>
<p>
<div class="math">$$
\Phi\left(\frac{\Delta_2(f)}{2\sigma}-\frac{\varepsilon\sigma}{\Delta_2(f)}\right)
- e^\varepsilon \Phi\left(-\frac{\Delta_2(f)}{2\sigma}-\frac{\varepsilon\sigma}{\Delta_2(f)}\right)
\le \delta
$$</div>
</p>
<p style="text-indent: 0em">where <span class="math">\(\Phi\)</span> is the <a href="https://en.wikipedia.org/wiki/Normal_distribution#Cumulative_distribution_function">cumulative
distribution function</a> of a Gaussian distribution of mean <span class="math">\(0\)</span> and variance
<span class="math">\(1\)</span>. This is an inequality, but in practice, you want to get as close as
possible to equality, to add as little noise as possible. Implementing this
correctly on floating-point machines is, as usual, pretty tricky, <a href="https://github.com/google/differential-privacy/blob/0e95b99af5fedff95b8cab3739009bbc1366c068/go/noise/gaussian_noise.go#L300">here's an
example</a>.</p>
<p>This formula, called the <em>Analytic Gaussian mechanism</em>, is different from the
more famous Gaussian mechanism introduced in <a href="https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf">this paper</a>
(Appendix A). The analytic version is tighter, and more importantly, it works
for arbitrary values of <span class="math">\(\varepsilon\)</span>. The formula in the original Gaussian
mechanism, by contrast, only works for <span class="math">\(\varepsilon<1\)</span>.</p>
</div>
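That binary search can be sketched as follows (the helper names are mine; the standard normal CDF is written in terms of <code>math.erf</code>, and the loop exploits the fact that the left-hand side of the inequality decreases as <span class="math">\(\sigma\)</span> grows):

```python
import math

def gaussian_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def delta_for_sigma(sigma, sensitivity, epsilon):
    """Left-hand side of the analytic Gaussian mechanism inequality."""
    a = sensitivity / (2.0 * sigma)
    b = epsilon * sigma / sensitivity
    return gaussian_cdf(a - b) - math.exp(epsilon) * gaussian_cdf(-a - b)

def calibrate_sigma(sensitivity, epsilon, delta, lo=1e-6, hi=1e6):
    """Smallest sigma achieving (epsilon, delta)-DP, via binary search."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if delta_for_sigma(mid, sensitivity, epsilon) > delta:
            lo = mid  # too much privacy loss: need more noise
        else:
            hi = mid
    return hi

print(calibrate_sigma(1.0, math.log(3), 1e-5))  # roughly 3.4 for these parameters
```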
<p>Now, why do these two noise distributions work so differently? Rather than
proving this formally, here is a visual intuition. Let's look at the density
function of Gaussian and Laplace noise, in two dimensions. The first is
Gaussian, the second is Laplace.</p>
<p><center>
<img alt="Graph showing a two-dimensional Gaussian distribution" src="https://desfontain.es/privacy/images/gaussian-2d.svg">
<img alt="Graph showing a two-dimensional Laplace distribution" src="https://desfontain.es/privacy/images/laplace-2d.svg">
</center></p>
<p>As you can see, Gaussian noise has a circular shape, while Laplace noise has a
square shape. How indistinguishable are two points, when noise is added to both
of them? With Gaussian noise, it depends on their distance from each other <em>in
a straight line</em>. By contrast, with Laplace, it depends on how far they are in
<a href="https://en.wikipedia.org/wiki/Taxicab_geometry">Manhattan distance</a>.</p>
<p>In conclusion, whether to use Laplace or Gaussian noise depends on two things:</p>
<ul>
<li>whether we are OK with a non-zero <span class="math">\(\delta\)</span>;</li>
<li>how many statistics a single individual can influence.</li>
</ul>
<p>The first point is clear: if we want <span class="math">\(\delta=0\)</span>, we can't use Gaussian noise.
Let's quantify the second point. If a single person can impact at most one
statistic, Laplace is better. If they can impact many, Gaussian is better. Where
does the boundary lie? The following graph answers this question, comparing both
mechanisms by their standard deviation. We pick <span class="math">\(\varepsilon=1\)</span> and
<span class="math">\(\delta=10^{-5}\)</span>, and we assume that each person can influence each statistic at
most once.</p>
<p><center>
<img alt="Graph comparing the standard deviation of the noise" src="https://desfontain.es/privacy/images/laplace-gaussian-comparison.svg">
</center></p>
<p>For these values of <span class="math">\(\varepsilon\)</span> and <span class="math">\(\delta\)</span>, Gaussian noise is better if each
individual can influence 8 statistics or more. Of course, with different privacy
parameters, the result might differ. But as the impact of a single individual
grows, Gaussian noise will always end up being the least noisy choice.</p>
<h1 id="further-uses">Further uses</h1>
<p>What if each person can influence each statistic <em>differently</em>? Suppose, for
example, that we count the number of <em>visits</em> to each type of physician. A
single patient can add many visits to a single count. Worse, the maximum number
of visits per patient can vary across physician types. Can we still use Gaussian
noise? The answer (hat tip to <a href="https://crypto.stackexchange.com/q/85581">Mark</a>) is <em>yes</em>: you can scale down
each statistic so the sensitivity becomes <span class="math">\(1\)</span>, add noise to them, then scale
them back up. This makes Gaussian noise even more powerful: if you compute many
statistics on the same data, you can use the Gaussian mechanism to reduce the
noise magnitude, even if the statistics are completely unrelated.</p>
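The rescaling trick can be sketched like this (the per-statistic sensitivities and counts are made up, and the noise multiplier is a placeholder, not a calibrated value):

```python
import math
import random

# Hypothetical per-statistic sensitivities: the max number of visits a
# single patient can add to each physician type's count.
sensitivities = [3.0, 10.0, 1.0]
true_counts = [120.0, 45.0, 300.0]  # made-up visit counts

# Dividing each statistic by its own sensitivity gives every coordinate
# sensitivity 1, so the vector's L2 sensitivity is sqrt(k).
k = len(sensitivities)
sigma = 3.5 * math.sqrt(k)  # 3.5 is a placeholder noise multiplier;
                            # calibrate it with the analytic Gaussian mechanism

# Scale down, add Gaussian noise, scale back up.
noisy_counts = [
    (c / s + random.gauss(0.0, sigma)) * s
    for c, s in zip(true_counts, sensitivities)
]
```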
<p>Finally, Gaussian noise is heavily used in differentially private machine
learning. The fundamental reason is the same. Consider methods like <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">stochastic
gradient descent</a>, a popular algorithm in machine learning. At each
iteration of this method, each data point influences many coordinates of a
vector. To make it differentially private, we need to add noise to all
coordinates. Thus, Gaussian noise is a good choice, for the exact same reason.
Machine learning is full of more Gaussian-related goodness, but this article is
long enough already.</p>
<p>Maybe we'll come back to it in future posts! The <a href="more-useful-results-dp.html">next article</a>, though,
tackles a much simpler problem: what do you do when you try using differential
privacy on your data, but the results aren't accurate enough?</p>
<hr>
<p><small>Thanks to Frank McSherry and Antoine Amarilli for their helpful
comments.</small></p>
<script type="text/javascript">
var button = document.getElementById('toggleMath');
var defaultButton = 'show me the scary math';
button.innerHTML = defaultButton
button.addEventListener('click', function (event) {
button.innerHTML = button.innerHTML == defaultButton ? 'hide the scary math' : defaultButton;
math = document.getElementById('math');
math.style.display = math.style.display == 'none' ? 'block' : 'none';
});
</script>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>The privacy loss random variable2020-03-06T00:00:00+01:002022-07-23T00:00:00+02:00Damien Desfontainestag:desfontain.es,2020-03-06:/privacy/privacy-loss-random-variable.html<p>What does <span class="math">\(\delta\)</span> <em>really</em> mean in <span class="math">\((\varepsilon,\delta)\)</span>-differential privacy? Let's explain this using a central concept: the privacy loss random variable.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script><p><small>
<span class='notlettrine'>T</span>his post is part of a <a href="friendly-intro-to-differential-privacy.html">series on differential
privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see the other
articles!</p>
<p></small></p>
<hr>
<p><span class='lettrine'>R</span><strong>emember</strong> the notion of « <a href="almost-differential-privacy.html">almost</a> »
differential privacy? We changed the <a href="differential-privacy-in-more-detail.html">original</a> definition to add a new
parameter, <span class="math">\(\delta\)</span>. We said that <span class="math">\(\delta\)</span> was « the probability that something
goes wrong ». This was a bit of a shortcut: this nice and easy intuition is
sometimes not exactly accurate. In this post, I'll do two things. I'll introduce
a crucial concept in differential privacy: the « privacy loss random variable ».
Then, I'll use it to explain what <span class="math">\(\delta\)</span> <em>really</em> means.</p>
<p>Friendly heads-up: this post has slightly more math than the rest of this
series. But don't worry! I made it as nice and visual as I could, with graphs
instead of equations. All the equations are in a proof hidden by default.</p>
<h1 id="the-privacy-loss-random-variable">The privacy loss random variable</h1>
<p>Recall the setting of the definition of <span class="math">\(\varepsilon\)</span>-DP (short for differential
privacy). The attacker tries to distinguish between two databases <span class="math">\(D_1\)</span> and
<span class="math">\(D_2\)</span>, which differ by only one record. If a mechanism <span class="math">\(A\)</span> is <span class="math">\(\varepsilon\)</span>-DP,
then <span class="math">\(A\left(D_1\right)\)</span> and <span class="math">\(A\left(D_2\right)\)</span> will return output <span class="math">\(O\)</span> with
similar probability:</p>
<div class="math">$$
\mathbb{P}[A(D_1)=O] \le e^\varepsilon\cdot\mathbb{P}[A(D_2)=O].
$$</div>
<p>The inequality also holds in the other direction, but the relation between <span class="math">\(D_1\)</span>
and <span class="math">\(D_2\)</span> is symmetric, so we only write this one inequality, to simplify.</p>
<p>We said before that the <span class="math">\(\varepsilon\)</span> in <span class="math">\(\varepsilon\)</span>-DP was the <em><a href="differential-privacy-awesomeness.html#quantify">maximal
knowledge gain</a></em> of the attacker. We defined this knowledge gain in <a href="differential-privacy-in-more-detail.html#quantifying">Bayesian
terms</a>, where the attacker is trying to guess whether the real database <span class="math">\(D\)</span> is <span class="math">\(D_1\)</span>
or <span class="math">\(D_2\)</span>. We saw that <span class="math">\(\varepsilon\)</span> bounds the evolution of betting odds. For
each <span class="math">\(O\)</span>, we had:</p>
<div class="math">$$
\frac{\mathbb{P}\left[D=D_1\mid A(D)=O\right]}{\mathbb{P}\left[D=D_2\mid A(D)=O\right]} \le
e^\varepsilon\cdot\frac{\mathbb{P}\left[D=D_1\right]}{\mathbb{P}\left[D=D_2\right]}
$$</div>
<p>What if we don't just want to bound this quantity, but calculate it for a given
output <span class="math">\(O\)</span>? Let us define:</p>
<div class="math">$$
\mathcal{L}_{D_1,D_2}(O)
= \ln\frac{
\frac{\mathbb{P}\left[D=D_1\mid A(D)=O\right]}{\mathbb{P}\left[D=D_2\mid A(D)=O\right]}
}{
\frac{\mathbb{P}\left[D=D_1\right]}{\mathbb{P}\left[D=D_2\right]}
}.
$$</div>
<p>This formula looks scary, but the intuition behind it is pretty simple. The
denominator corresponds to the <em>initial</em> betting odds for <span class="math">\(D_1\)</span> vs. <span class="math">\(D_2\)</span>: how
likely one option is compared to the other, before looking at the result of the
mechanism. In Bayesian terms, this is called the "prior". Meanwhile, the
numerator of the fraction is the betting odds <em>afterwards</em> — the "posterior".
Differential privacy guarantees that <span class="math">\(\mathcal{L}_{D_1,D_2}(O)\le\varepsilon\)</span>
for all <span class="math">\(O\)</span>.</p>
<p><a href="https://arbital.com/p/bayes_rule/?l=1zq">Bayes' rule</a> allows us to reformulate
this quantity:</p>
<div class="math">$$
\mathcal{L}_{D_1,D_2}(O)
=
\ln\left(\frac{\mathbb{P}\left[A(D_1)=O\right]}{\mathbb{P}\left[A(D_2)=O\right]}\right).
$$</div>
<p>This is called the <em>privacy loss random variable</em> (PLRV for short).
<strong>Intuitively, the PLRV is the « actual <span class="math">\(\varepsilon\)</span> value » for a specific
output <span class="math">\(O\)</span>.</strong> Why is it a random variable? Because typically, we consider
<span class="math">\(\mathcal{L}_{D_1,D_2}(O)\)</span> when <span class="math">\(O\)</span> is drawn from <span class="math">\(A(D_1)\)</span>, where <span class="math">\(D_1\)</span> is
assumed to be the "real" database.</p>
<p>OK, this is very abstract. We need a concrete example.</p>
<h1 id="a-concrete-example">A concrete example</h1>
<p>Suppose that we're counting the number of people with blue eyes in the dataset.
We make this differentially private by adding <a href="differential-privacy-in-practice.html">Laplace noise</a> of scale
<span class="math">\(1/\ln(3)\)</span>, to get <span class="math">\(\varepsilon=\ln(3)\)</span>. The attacker hesitates between two
possible datasets: one with <span class="math">\(1000\)</span> blue-eyed people, the other with <span class="math">\(1001\)</span>. The
<em>real</em> number is <span class="math">\(1000\)</span>, but the attacker doesn't know that. The two
distributions look like this:</p>
<p><center>
<img alt="Graph showing two Laplace distributions with scale 1/ln(3), centered on 1000 and 1001" src="https://desfontain.es/privacy/images/two-laplace-ln-3.svg">
</center> </p>
<p>Let's consider three possible outputs of the mechanism, given the "real"
database is <span class="math">\(D_1\)</span>. We represent them below as <span class="math">\(O_1\)</span>, <span class="math">\(O_2\)</span>, and <span class="math">\(O_3\)</span>.</p>
<p><center>
<img alt="Graph showing the previous Laplace distributions, with three points O1, O2 and O3 marked respectively at x=999, x=1000.5 and x=1003" src="https://desfontain.es/privacy/images/two-laplace-ln-3-three-points.svg">
</center> </p>
<p>Say the attacker is very uncertain: initially, they give equal probabilities to
<span class="math">\(D_1\)</span> and <span class="math">\(D_2\)</span>. What are they going to think once we give them the output of
the mechanism?</p>
<ul>
<li>If we return <span class="math">\(O_1\)</span>, the attacker is starting to suspect that the real database
is <span class="math">\(D_1\)</span>. There's a larger chance to get that output if <span class="math">\(D=D_1\)</span> than if
<span class="math">\(D=D_2\)</span>. How much larger? Exactly 3 times larger: the attacker's knowledge is
tripled.</li>
<li>If we return <span class="math">\(O_2\)</span>, the attacker is like: ¯\_(ツ)_/¯. This is not giving
them much information. This output could have come from <span class="math">\(D_1\)</span>, but it could
just as well have come from <span class="math">\(D_2\)</span>. The attacker's knowledge doesn't change.</li>
<li>If we return <span class="math">\(O_3\)</span>, the attacker is getting <em>tricked</em> with wrong information.
They will think it's more likely that the real database is <span class="math">\(D_2\)</span>. Their
"knowledge" is divided by 3.</li>
</ul>
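<p>These three numbers can be checked directly from the densities. Here is a quick sketch of my own (the helper names are mine, not from this series), computing <span class="math">\(e^\mathcal{L}\)</span> for Laplace noise of scale <span class="math">\(1/\ln(3)\)</span>:</p>

```python
import math

EPSILON = math.log(3)   # target privacy parameter
SCALE = 1 / EPSILON     # Laplace noise of scale 1/ln(3)

def laplace_pdf(x, mu, b=SCALE):
    """Density of Laplace noise centered on mu, with scale b."""
    return math.exp(-abs(x - mu) / b) / (2 * b)

def privacy_loss(o, mu1=1000.0, mu2=1001.0):
    """PLRV: log-likelihood ratio of output o under D1 (1000) vs D2 (1001)."""
    return math.log(laplace_pdf(o, mu1) / laplace_pdf(o, mu2))

# O1 triples the attacker's knowledge, O2 leaves it unchanged,
# O3 divides it by 3.
for o, expected in [(999.0, 3.0), (1000.5, 1.0), (1003.0, 1 / 3)]:
    assert abs(math.exp(privacy_loss(o)) - expected) < 1e-9

# The loss stays within [-ln(3), ln(3)] everywhere:
# Laplace noise of scale 1/ln(3) is ln(3)-DP.
assert all(abs(privacy_loss(990 + k / 100)) <= EPSILON + 1e-9 for k in range(2000))
```

The same two-line `privacy_loss` works for any pair of noise distributions with known densities, which is all the next sections need.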
<p>Let's look at all possible outputs <span class="math">\(O\)</span> of <span class="math">\(A(D_1)\)</span>, and <em>order</em> them. We'll put the
ones that help the attacker most first, and look at the value of
<span class="math">\(\mathcal{L}_{D_1,D_2}(O)\)</span>. Let's call this <span class="math">\(\mathcal{L}\)</span>, for short, and plot
it.</p>
<p><center>
<img alt="Graph showing the PLRV for the Laplace distribution depending on the output" src="https://desfontain.es/privacy/images/plrv-laplace.svg">
</center> </p>
<p>This is why Laplace noise is so nice: look at this neat horizontal line. Oh my
god. It even has a straight diagonal. It never goes above
<span class="math">\(\varepsilon\approx1.1\)</span>: a beautiful visual proof that Laplace noise gives
<span class="math">\(\varepsilon\)</span>-DP.</p>
<p>Let's change the graph above to more accurately represent that <span class="math">\(\mathcal{L}\)</span> is
a <em>random variable</em>. On the <span class="math">\(x\)</span>-axis, we represent all events according to their
probability. We're also more interested in <span class="math">\(\exp(\mathcal{L})\)</span>, so let's plot
that instead of <span class="math">\(\mathcal{L}\)</span>.</p>
<p><center>
<img alt="Graph showing the exponential of the PLRV for the Laplace distribution, where the x-axis represents the probability space" src="https://desfontain.es/privacy/images/exp-plrv-laplace.svg">
</center> </p>
<p>Now, what if you were using some other type of noise? Say, from a <a href="https://en.wikipedia.org/wiki/Normal_distribution">normal
distribution</a>? It would make data analysts happier: Laplace noise is weird to
them; it never shows up in the real world. Normal distributions, by contrast,
are familiar and friendly. A lot of natural data distributions can be modeled
with them.</p>
<p>In the context of differential privacy, the normal distribution is called
« Gaussian noise ». Let's try to add Gaussian noise, of variance
<span class="math">\(\sigma^2=3\)</span>:</p>
<p><center>
<img alt="Graph showing two normal distributions with variance 3, centered on 1000 and 1001" src="https://desfontain.es/privacy/images/gaussian-2.svg">
</center> </p>
<p>OK, looks reasonable, now let's see what <span class="math">\(e^\mathcal{L}\)</span> looks like:</p>
<p><center>
<img alt="Graph showing the exponential of the PLRV for the normal distribution, where the x-axis represents the probability space" src="https://desfontain.es/privacy/images/exp-plrv-gaussian.svg">
</center> </p>
<p>Ew. Look at this line going up to infinity on the left side. Gross. We can't
just draw a line at <span class="math">\(e^\varepsilon\)</span> and say "everything is underneath". What do
we do, then? We <a href="almost-differential-privacy.html">cheat</a>, and use a <span class="math">\(\delta\)</span>.</p>
<h1 id="delta-and-the-plrv"><span class="math">\(\delta\)</span> and the PLRV</h1>
<p>In a <a href="almost-differential-privacy.html">previous article</a>, we said that the <span class="math">\(\delta\)</span> in
<span class="math">\((\varepsilon,\delta)\)</span>-DP is the probability that something terrible happens.
What does that mean in the context of Gaussian noise? First, we pick an
arbitrary <span class="math">\(\varepsilon\)</span>, say, <span class="math">\(\varepsilon=\ln(3)\)</span>. Then, we look at how likely
it is for <span class="math">\(e^\mathcal{L}\)</span> to be above the <span class="math">\(e^\varepsilon=3\)</span> line. It's easy to do:
the <span class="math">\(x\)</span>-axis is the probability space, so we can simply measure the width of the
bad events.</p>
<p><center>
<img alt="Same graph, but with δ marked at x=0.05, where the curve is approximately equal to 3" src="https://desfontain.es/privacy/images/exp-plrv-gaussian-delta.svg">
</center> </p>
<p>This simple intuition is correct: this mechanism is <span class="math">\((\ln(3),\delta_1)\)</span>-DP, with
<span class="math">\(\delta_1\approx0.054\)</span>. But it misses an important subtlety. Let's zoom in on
the part where things go wrong, and consider two possible outputs.</p>
<p><center>
<img alt="Same graph, zoomed on the "bad events" part before 0.05, with two points O1 and O2 marked respectively at x=0.045 and x=0.002" src="https://desfontain.es/privacy/images/exp-plrv-gaussian-two-bad-events.svg">
</center> </p>
<p>Returning <span class="math">\(O_1\)</span> is not great: <span class="math">\(e^\mathcal{L}>e^\varepsilon\)</span>. But it's not
<em>terrible</em>: the privacy loss is only a tiny bit larger than we'd hope. Returning
<span class="math">\(O_2\)</span>, however, is scary news: <span class="math">\(e^\mathcal{L}\)</span> is huge. Intuitively, <span class="math">\(O_2\)</span> leaks
much more information than <span class="math">\(O_1\)</span>.</p>
<p>With our way of quantifying <span class="math">\(\delta\)</span>, we don't account for this. We only measure
the <span class="math">\(x\)</span>-axis. What we count is <em>whether</em> <span class="math">\(e^\mathcal{L}\)</span> is above the line, not
<em>how much</em> it's above the line. For each bad event of probability <span class="math">\(p\)</span>, we're
adding <span class="math">\(p\times1\)</span> to the <span class="math">\(\delta\)</span>. A finer approach is to <em>weigh</em> the bad events
by "how bad they are". We want to give a "weight" of <span class="math">\(\approx1\)</span> to the very bad
events, and a weight of <span class="math">\(\approx0\)</span> to the "not too bad" ones.</p>
<p>To do this, we transform the curve above in two ways. First, we
take the <em>inverse</em> of the curve: very bad events are now close to <span class="math">\(0\)</span> instead of
very large. Second, we <em>normalize</em> the curve by taking the ratio
<span class="math">\(e^\varepsilon/e^\mathcal{L}\)</span>. This way, events that are "not too bad" are close
to <span class="math">\(1\)</span>.</p>
<p><center>
<img alt="Plotting exp(ε)/exp(PLRV) and highlighting the area under 1" src="https://desfontain.es/privacy/images/exp-plrv-gaussian-inversed.svg">
</center> </p>
<p>This allows us to consider the <em>area</em> between the curve and the <span class="math">\(y=1\)</span> line. When
<span class="math">\(\mathcal{L}\)</span> is very large, the inverse is close to <span class="math">\(0\)</span>, so the distance to <span class="math">\(1\)</span>
is almost 1. And when <span class="math">\(\mathcal{L}\)</span> is close to <span class="math">\(\varepsilon\)</span>, the ratio is close to one,
and the distance is almost 0. <em>Very bad</em> events count more than <em>sort of bad</em>
events.</p>
<p>This is the tighter, exact characterization of <span class="math">\(\delta\)</span>. In
<span class="math">\((\varepsilon,\delta)\)</span>-DP, the <span class="math">\(\delta\)</span> is the area highlighted above. It is the
mass of all possible bad events, <em>weighted</em> by how likely they are and how bad
they are. This tells us that the mechanism is <span class="math">\((\ln(3),\delta_2)\)</span>-DP with
<span class="math">\(\delta_2\approx0.011\)</span>, a much better characterization than before.</p>
<p>The typical definition of <span class="math">\((\varepsilon,\delta)\)</span>-DP doesn't use this complicated
formulation. A mechanism <span class="math">\(A\)</span> is <span class="math">\((\varepsilon,\delta)\)</span>-DP if for any neighboring
<span class="math">\(D_1\)</span> and <span class="math">\(D_2\)</span>, and any set <span class="math">\(S\)</span> of possible outputs:</p>
<div class="math">$$
\mathbb{P}[A(D_1)\in S] \le e^\varepsilon\cdot\mathbb{P}[A(D_2)\in S]+\delta.
$$</div>
<p>This definition is equivalent to the previous characterization. If you want to
see the proof of that, click here: <button id="toggleProof"></button></p>
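<p>Before the proof, a tiny numeric sanity check may help. The sketch below is my own illustration (the distributions and <span class="math">\(\varepsilon\)</span> are arbitrary): it takes a toy mechanism with three outputs, computes the smallest <span class="math">\(\delta\)</span> satisfying the definition by brute force over every output set <span class="math">\(S\)</span>, and checks that it matches the weighted-bad-events characterization.</p>

```python
import math
from itertools import chain, combinations

EPS = math.log(2)  # toy epsilon; e^eps = 2
# Toy output distributions of A(D1) and A(D2) over three outputs.
P1 = {"a": 0.5, "b": 0.4, "c": 0.1}
P2 = {"a": 0.2, "b": 0.5, "c": 0.3}

# Smallest delta satisfying the definition: maximize over all output sets S.
outputs = list(P1)
all_sets = chain.from_iterable(
    combinations(outputs, r) for r in range(len(outputs) + 1)
)
delta_def = max(
    sum(P1[o] for o in s) - math.exp(EPS) * sum(P2[o] for o in s)
    for s in all_sets
)

# Weighted characterization: mass of bad events, weighted by how bad they are.
delta_weighted = sum(
    p1 * max(0.0, 1.0 - math.exp(EPS) * P2[o] / p1) for o, p1 in P1.items()
)

assert abs(delta_def - delta_weighted) < 1e-9  # the two views agree
assert abs(delta_def - 0.1) < 1e-9  # only "a" is "bad": P1[a] > 2 * P2[a]
```

Here only output <code>"a"</code> satisfies <span class="math">\(\mathbb{P}[A(D_1)=O] > e^\varepsilon\cdot\mathbb{P}[A(D_2)=O]\)</span>, so it alone contributes to <span class="math">\(\delta\)</span>, exactly as the proof argues.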
<div id="proof" style="display: none; border-left: double; padding-left: 10px">
<p>Fix a mechanism <span class="math">\(A\)</span> and an <span class="math">\(\varepsilon\ge0\)</span>. There is always a <span class="math">\(\delta\)</span> such that <span class="math">\(A\)</span>
is <span class="math">\((\varepsilon,\delta)\)</span>-DP: with <span class="math">\(\delta=1\)</span>, it holds trivially and meaninglessly.
The interesting question is: what is the <em>smallest</em> possible <span class="math">\(\delta\)</span> such that
<span class="math">\(A\)</span> is <span class="math">\((\varepsilon,\delta)\)</span>-DP? For each possible set of outputs <span class="math">\(S\)</span>, we can
compute:</p>
<p>
<div class="math">$$
\delta_S = \mathbb{P}[A(D_1)\in S] - e^\varepsilon\cdot\mathbb{P}[A(D_2)\in S]
$$</div>
</p>
<p>The definition is satisfied iff <span class="math">\(\delta_S\le\delta\)</span> for all <span class="math">\(S\)</span>. So we have:</p>
<p>
<div class="math">$$
\delta = \max_{S} \left(\mathbb{P}[A(D_1)\in S] - e^\varepsilon\cdot\mathbb{P}[A(D_2)\in S]\right).
$$</div>
</p>
<p>It is easy to notice that we only care about the outputs <span class="math">\(O\)</span> such that:</p>
<p>
<div class="math">$$
\mathbb{P}[A(D_1)=O] > e^\varepsilon\cdot\mathbb{P}[A(D_2)=O].
$$</div>
</p>
<p>Including any other output would make <span class="math">\(\delta_S\)</span> <em>smaller</em>, not larger. So the set <span class="math">\(S\)</span>
that maximizes the quantity above is:</p>
<p>
<div class="math">$$
S_\max = \left\{O \mid \mathbb{P}[A(D_1)=O] > e^\varepsilon\cdot\mathbb{P}[A(D_2)=O]\right\}.
$$</div>
</p>
<p>We can convert this to:</p>
<p>
<div class="math">$$
\begin{align}
\delta
& = \mathbb{P}[A(D_1)\in S_\max] - e^\varepsilon\cdot\mathbb{P}[A(D_2)\in S_\max] \\
& = \sum_{O\in S_\max} \left(\mathbb{P}[A(D_1)=O] - e^\varepsilon\cdot\mathbb{P}[A(D_2)=O]\right) \\
& = \sum_{O\in S_\max} \mathbb{P}[A(D_1)=O] \left(1 - \frac{e^\varepsilon}{e^{\mathcal{L}_{D_1,D_2}(O)}}\right).
\end{align}
$$</div>
</p>
<p>Now, instead of summing only <span class="math">\(O\in S_\max\)</span>, we could sum all possible <span class="math">\(O\)</span>, and
nullify the ones that aren't in <span class="math">\(S_\max\)</span>.</p>
<p>
<div class="math">$$
\delta
= \sum_{O} \mathbb{P}[A(D_1)=O] \max\left(0, 1 - \frac{e^\varepsilon}{e^{\mathcal{L}_{D_1,D_2}(O)}}\right).
$$</div>
</p>
<p>Now, this is an expected value:</p>
<p>
<div class="math">$$
\delta
= \mathbb{E}_{O\sim A(D_1)} \left[ \max\left(0, 1 - \frac{e^\varepsilon}{e^{\mathcal{L}_{D_1,D_2}(O)}}\right)\right]
$$</div>
</p>
<p style="text-indent: 0em">and this formula corresponds exactly to
the area between the curve above and <span class="math">\(1\)</span>.</p>
</div>
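<p>For Gaussian noise, both characterizations have closed forms, so we can also compare them numerically. The sketch below is my own (function names are mine, and the parameters are illustrative rather than the exact ones used for the figures): it computes the "blunt" <span class="math">\(\delta\)</span> — the probability that <span class="math">\(\mathcal{L}>\varepsilon\)</span> — and the weighted <span class="math">\(\delta\)</span>, cross-checks the latter against the expectation formula, and verifies that weighting can only make <span class="math">\(\delta\)</span> smaller.</p>

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gaussian_deltas(epsilon, sigma, sensitivity=1.0):
    """Two characterizations of delta for the Gaussian mechanism.

    With D1 centered at 0 and D2 at `sensitivity`, the PLRV of an output o is
    L(o) = (sensitivity**2 - 2 * sensitivity * o) / (2 * sigma**2),
    so L(o) > epsilon exactly when o < t below.
    """
    t = sensitivity / 2.0 - epsilon * sigma**2 / sensitivity
    blunt = phi(t / sigma)  # P[L > epsilon]: the width of the bad events
    # Weighted version, in closed form: P1[o < t] - e^eps * P2[o < t].
    weighted = phi(t / sigma) - math.exp(epsilon) * phi((t - sensitivity) / sigma)
    return blunt, weighted

def weighted_delta_numeric(epsilon, sigma, sensitivity=1.0, n=50000):
    """E[max(0, 1 - e^(eps - L))] over o ~ N(0, sigma^2), by midpoint rule."""
    lo, hi = -12.0 * sigma, 12.0 * sigma
    step = (hi - lo) / n
    total = 0.0
    for k in range(n):
        o = lo + (k + 0.5) * step
        p1 = math.exp(-o * o / (2.0 * sigma**2)) / (sigma * math.sqrt(2.0 * math.pi))
        L = (sensitivity**2 - 2.0 * sensitivity * o) / (2.0 * sigma**2)
        total += p1 * max(0.0, 1.0 - math.exp(epsilon - L)) * step
    return total

for sigma in (1.0, math.sqrt(2.0), math.sqrt(3.0)):
    blunt, weighted = gaussian_deltas(math.log(3), sigma)
    assert weighted <= blunt  # weighting bad events can only shrink delta
    assert abs(weighted - weighted_delta_numeric(math.log(3), sigma)) < 1e-4
```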
<h1 id="what-about-infinity-values">What about infinity values?</h1>
<p>Using Gaussian noise, all possible values of <span class="math">\(\mathcal{L}\)</span> are <em>finite</em>. But for
some mechanisms <span class="math">\(A\)</span>, there are outputs <span class="math">\(O\)</span> such that <span class="math">\(\mathbb{P}[A(D_1)=O]>0\)</span>,
but <span class="math">\(\mathbb{P}[A(D_2)=O]=0\)</span>. In that case, <span class="math">\(\mathcal{L}(O)=\infty\)</span>. This kind
of output is called a <em>distinguishing event</em>. If we return a distinguishing
event, the attacker immediately finds out that <span class="math">\(D\)</span> is <span class="math">\(D_1\)</span> and not <span class="math">\(D_2\)</span>. This
is the case for the "thresholding" example we looked at <a href="almost-differential-privacy.html">previously</a>.</p>
<p>Our interpretation of <span class="math">\(\delta\)</span> captures this nicely. Since we inverted the
curve, if <span class="math">\(\mathcal{L}=\infty\)</span>, we simply have <span class="math">\(e^\varepsilon/e^\mathcal{L}=0\)</span>.
The distance to <span class="math">\(1\)</span> is exactly <span class="math">\(1\)</span>, so we count these events with maximal
weight. The graph looks like this:</p>
<p><center>
<img alt="Plotting exp(ε)/exp(PLRV) and highlighting the area under 1 when that function is 0 below 0.006 and 1 everywhere else" src="https://desfontain.es/privacy/images/exp-plrv-distinguishing.svg">
</center> </p>
<p>In that case, <span class="math">\(\delta_1=\delta_2\)</span>: all "bad" events are worst-case events. For
such a mechanism, the two characterizations of <span class="math">\(\delta\)</span> are the same.</p>
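<p>In code, this case only needs a guard for the division by zero: when <span class="math">\(\mathbb{P}[A(D_2)=O]=0\)</span>, the weight is exactly <span class="math">\(1\)</span>. A small sketch of my own (the <span class="math">\(0.006\)</span> mass is an arbitrary illustration, not taken from the figure's actual data):</p>

```python
import math

EPS = math.log(3)
# A toy mechanism where "reveal" is a distinguishing event:
# it can only ever happen when the real database is D1.
P1 = {"ok": 0.994, "reveal": 0.006}
P2 = {"ok": 1.0,   "reveal": 0.0}

def weight(o):
    """Contribution weight of output o: max(0, 1 - e^eps / e^L)."""
    if P2[o] == 0.0:
        return 1.0  # L is infinite, so e^eps / e^L = 0: maximal weight
    return max(0.0, 1.0 - math.exp(EPS) * P2[o] / P1[o])

delta_weighted = sum(P1[o] * weight(o) for o in P1)
delta_blunt = sum(
    p1 for o, p1 in P1.items()
    if P2[o] == 0.0 or p1 > math.exp(EPS) * P2[o]
)

# All bad events are worst-case events, so the two characterizations agree.
assert abs(delta_weighted - 0.006) < 1e-12
assert abs(delta_weighted - delta_blunt) < 1e-12
```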
<h1 id="final-note">Final note</h1>
<p>You might be wondering: why use Gaussian noise at all if it requires <span class="math">\(\delta>0\)</span>?</p>
<p>This is an excellent question. I'm glad you asked it, because it is exactly the
topic of the <a href="gaussian-noise.html">next blog post</a> in this series. Or you can, as always,
select another article to read next in the <a href="friendly-intro-to-differential-privacy.html">table of contents</a>!</p>
<hr>
<p><small>
Thanks to <a href="http://www0.cs.ucl.ac.uk/staff/s.meiser/">Sebastian Meiser</a>, who
wrote the <a href="https://eprint.iacr.org/2018/277.pdf">reference paper</a> about the
subtleties with <span class="math">\(\delta\)</span>. It makes for excellent reading if you want to dig a
bit deeper into this. Thanks also to Antoine Amarilli for proofreading this
blog post, and to Ivan Habernal for detecting a mistake in an earlier version.
</small></p>
<script type="text/javascript">
var button = document.getElementById('toggleProof');
var defaultButton = 'Show me the proof';
button.innerHTML = defaultButton;
button.addEventListener('click', function (event) {
button.innerHTML = button.innerHTML == defaultButton ? 'Hide the proof' : defaultButton;
var proof = document.getElementById('proof');
proof.style.display = proof.style.display == 'none' ? 'block' : 'none';
});
</script>
A reading list on differential privacy2019-09-20T00:00:00+02:002019-09-25T00:00:00+02:00Damien Desfontainestag:desfontain.es,2019-09-20:/privacy/differential-privacy-reading-list.html<p>A list of papers to check out to learn more about differential privacy.</p><p><small>
<span class='notlettrine'>T</span>his post is part of a <a href="friendly-intro-to-differential-privacy.html">series on differential
privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see the other
articles!</p>
<p></small></p>
<hr>
<p><span class='lettrine'>S</span><strong>omeone</strong> recently asked me for reading
suggestions to learn more about differential privacy (DP). I thought that the
answer was worth posting somewhere. So, here it is: a list of material that I
often recommend to people who want to dive into the field.</p>
<p>Before I start, two remarks. First, most of these are going to be research
papers. There's a lack of accessible material on differential privacy. I hope it
gets better over time; I'll keep this list updated as I discover more good
stuff. Second, this list is going to be <em>very</em> subjective. I'm more interested
in some problems than others. I like beautiful theory and very practical
applications. I'm less enthusiastic about the research in-between. This list
might not be the right one for you. If you're a student starting a research
project, ask your advisor what you should read! They'll be a better source of
reading material than a random guy on the Internet.</p>
<h1 id="introductory-and-reference-material">Introductory and reference material</h1>
<p>There are fewer links than I'd like in the category « suitable for a
non-technical audience ». That's exactly why I'm writing these blog posts…
Still, some people have created great material.</p>
<p>I recently stumbled upon a fantastic 12-minute video called <a href="https://www.youtube.com/watch?v=pT19VwBAqKA">Protecting Privacy
with MATH</a>. It's done by <a href="https://en.wikipedia.org/wiki/MinutePhysics">MinutePhysics</a>, and it's friendly,
understandable, yet very accurate. I love that it explains the basics of
reconstruction attacks! It provides a great motivation for differential privacy.</p>
<p>For more in-depth material, I recommend the <a href="https://scholarship.law.vanderbilt.edu/jetlaw/vol21/iss1/4/">Primer for a Non-Technical
Audience</a>. It's written by folks at the Harvard <a href="https://privacytools.seas.harvard.edu/">Privacy Tools
Project</a>, and is perfect if you don't want heavy math. I prefer it to
the typical recommendation for a reference book on DP, nicknamed the « Privacy
Book ». Its real name is <a href="https://www.cis.upenn.edu/~aaroth/privacybook.html">The Algorithmic Foundations of Differential
Privacy</a>. One of the authors is Cynthia Dwork, one of the creators
of DP, so it's a good way of discovering the original motivations… But I would
mostly recommend Sections 1 to 3, not the whole thing.</p>
<p>Ah, on the topic of things written by the creators of DP! <a href="https://github.com/frankmcsherry/blog">Frank McSherry's
blog</a> makes for a fun read (look for the articles tagged « Privacy »).</p>
<h1 id="local-and-shuffled-model">Local and shuffled model</h1>
<p>Excited by my blog post about the different models of DP? When it comes
to the local model, the <a href="https://arxiv.org/abs/1407.6981">RAPPOR</a> paper is a classic. It explains how Google
Chrome collected useful statistics without learning information about
individuals. The follow-up paper, <a href="https://arxiv.org/abs/1710.00901">Prochlo</a>, introduced the shuffled model in a
very practical context. To understand its theoretical guarantees, I recommend
the later <a href="https://arxiv.org/abs/1903.02837">Privacy Blanket</a> paper: it provides a very nice explanation
of the intuition behind it<sup id="fnref:shuffling"><a class="footnote-ref" href="#fn:shuffling">1</a></sup>.</p>
<h1 id="rolling-out-differential-privacy">Rolling out differential privacy</h1>
<p>Moving on to applied views of central DP for statistics. Google colleagues and I
recently published a paper about a <a href="https://arxiv.org/abs/1909.01917">differentially private SQL
engine</a>. This isn't the first work in that direction<sup id="fnref:sql"><a class="footnote-ref" href="#fn:sql">2</a></sup>, but I'm still
pretty proud of it: we worked hard to make it easy to understand and mostly
self-contained! You can see an updated version in Section 4.2 of my <a href="https://desfontain.es/thesis/">PhD
thesis</a>. Further in the thesis, you can also find a discussion of
<a href="https://desfontain.es/thesis/OperationalAspectsOfAnonymization.html">operational aspects of DP</a>, a topic that doesn't get as much
attention as it probably should.</p>
<p>Want more about rolling out DP in practice? The <a href="https://arxiv.org/abs/1809.02201">Issues Encountered Deploying
Differential Privacy</a> paper, by the US Census Bureau, is a must-read. I
strongly relate to the problems it raises. For an even more fun and absurd
implementation problem, read about the <a href="https://www.microsoft.com/en-us/research/publication/on-significance-of-the-least-significant-bits-for-differential-privacy/">floating-point attack</a> on
the Laplace mechanism: it's a classic.</p>
<h1 id="theory-of-differential-privacy">Theory of differential privacy</h1>
<p>There are natural theory questions raised by DP. For example « how many queries
can we answer privately and accurately on the same database? » <a href="https://privacytools.seas.harvard.edu/publications/complexity-differential-privacy">The Complexity
of Differential Privacy</a> is a survey of fundamental results on
accuracy/privacy trade-offs. This being theory, almost everything in there is
asymptotic, of course.</p>
<p>The theory around the <em>privacy loss distribution</em> is beautiful. It gives a
global view of the privacy leakage of a mechanism. It's less simple, but it
makes much more sense than only considering a pair of (ε,δ) parameters. To learn
about it, I recommend the <a href="https://eprint.iacr.org/2018/820">Privacy Loss Classes</a> paper. The theory has
practical consequences, too: it enables neat ways of estimating the evolution of
a privacy budget over many compositions. This is explored in the
<a href="https://eprint.iacr.org/2017/1034">Privacy Buckets</a> paper, which is
also worth a read.</p>
<p>Speaking of interesting reformulations of DP, the hypothesis testing formulation
is also super cool. The <a href="https://arxiv.org/abs/1905.02383">Gaussian Differential Privacy</a> paper does a
great job at explaining it. It also gives you nice results on <em>amplification by
sampling</em>, a shiny and useful tool in practice<sup id="fnref:sampling"><a class="footnote-ref" href="#fn:sampling">3</a></sup>.</p>
<p>Finally, there are interesting debates around how to interpret the guarantee
provided by DP. Everyone understands the math, but how do we translate it in
practice? There's a long-standing debate about data correlations and adversary
strength. On that topic, the <a href="https://arxiv.org/abs/1710.05899">Differential Privacy as a Causal Property</a>
paper is fantastic. It clarifies this debate, links to all relevant references,
and proposes a neat way to solve this issue conceptually.</p>
<h1 id="differentially-private-machine-learning">Differentially private machine learning</h1>
<p>I don't know much about machine learning, but it felt weird to not mention it at
all in an article like this. So I asked my colleague <a href="https://ai.google/research/people/PeterKairouz">Peter Kairouz</a> to
suggest some material! The following suggestions are from him. I was glad to see
that there aren't only scientific papers in there!</p>
<p>First, start with this pretty great 2-hour <a href="https://www.ece.rutgers.edu/~asarwate/nips2017/">video tutorial</a> (with
available slides and historical references). If you then want to actually test
this in practice, try out <a href="https://github.com/tensorflow/privacy">TensorFlow Privacy</a>! For a nice introduction to
it, check out this <a href="http://www.cleverhans.io/privacy/2019/03/26/machine-learning-with-differential-privacy-in-tensorflow.html">blog post</a>, by one of the TensorFlow Privacy
authors. It explains how to run DP-SGD with it, step by step and with a lot of
additional references.</p>
<p>The paper that introduced DP-SGD is a classic: <a href="https://arxiv.org/abs/1607.00133">Deep Learning with Differential
Privacy</a>. Another method for the same problem, DSSGD, appeared earlier in
<a href="https://www.comp.nus.edu.sg/~reza/files/Shokri-CCS2015.pdf">Privacy-Preserving Deep Learning</a><sup>(PDF)</sup>. Yet another
alternative approach, more generic and surprisingly understandable, is called
PATE. It's described in a <a href="http://www.cleverhans.io/privacy/2018/04/29/privacy-and-machine-learning.html">blog post</a>, which comes with references to the
original papers.</p>
<p>Most of these methods rely on tight accounting of the privacy budget. A common
tool for that is <a href="https://arxiv.org/abs/1702.07476">Rényi Differential Privacy</a>, a DP variant that
considers the <em>averaged</em> privacy loss. Another is <a href="https://arxiv.org/abs/1808.06651">Amplification by
Iteration</a>, which studies mechanisms that iterate over the data
multiple times, without releasing intermediary results. Both techniques enable
better composition results.</p>
<p>Deep learning isn't the only interesting ML application of differential privacy.
Methods also exist for <a href="https://arxiv.org/abs/1507.02482">Ordinary Least Squares</a>, or for <a href="https://arxiv.org/abs/0912.0071">Empirical Risk
Minimization</a>. This last method admits <a href="https://arxiv.org/abs/1405.7085">Efficient Algorithms and Tight
Error Bounds</a>. In general, <a href="https://arxiv.org/abs/1908.09970">Convex Optimization</a> problems are a nice
fit for DP methods.</p>
<h1 id="something-missing">Something missing?</h1>
<p>If you feel this reading list is missing something, please let me know! I'd like
this post to be a living resource. So I'm enthusiastic about adding new
material, especially if it's more approachable than scientific papers. My
contact info is in the footer of this page.</p>
<hr>
<p><small>Thanks to Antoine Amarilli, Úlfar Erlingsson, Peter Kairouz, and Rob
Yoder for their helpful comments and suggestions.</small></p>
<div class="footnote">
<hr>
<ol>
<li id="fn:shuffling">
<p>Amplification by shuffling was initially formalized in the <a href="https://arxiv.org/abs/1811.12469">eponymous
paper</a>. <a class="footnote-backref" href="#fnref:shuffling" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:sql">
<p>The first system to do generic differentially private computations was
<a href="https://www.microsoft.com/en-us/research/project/privacy-integrated-queries-pinq/">PINQ</a>, and a major follow-up was <a href="https://arxiv.org/abs/1706.09479">Flex</a>. Both papers are worth a read too! <a class="footnote-backref" href="#fnref:sql" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:sampling">
<p>For a slightly more exhaustive and historical view of results in
that area, Section 4 of the <a href="https://arxiv.org/abs/1807.01647">Amplification by Subsampling</a> paper
(and the corresponding references) is a solid resource. <a class="footnote-backref" href="#fnref:sampling" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
</ol>
</div>« What does a privacy engineer do, anyway? »2019-08-13T00:00:00+02:002018-08-24T00:00:00+02:00Damien Desfontainestag:desfontain.es,2019-08-13:/privacy/privacy-engineer.html<p>Besides having a super cool job title, what is it like being a privacy engineer?</p><p><strong>This</strong> is another one of <a href="part-time-phd.html">those posts</a> where I answer a
question I get regularly. What is it like being a privacy engineer? Besides a
cool job title, what does the job actually entail?</p>
<p>Before I even start, I need to point out a few major caveats of this post.</p>
<ul>
<li>My experience is not representative of privacy engineering as a whole. I'm
sure it's different across companies, and even across people within the same
company.</li>
<li>I have a particularly narrow view. I work on a sub-field of privacy
(anonymization), and I've never led a team or hired privacy engineers.</li>
<li>"Privacy engineer" is not even my official job title. I still have my initial
job title at Google: software engineer (<a href="part-time-phd.html">part-time</a>).
Privacy engineering is definitely the job I do, though — they don't care
much about job titles. I might switch at some point.</li>
<li>Like in all other entries from this blog, all opinions are mine and mine only.
I'm only talking about my own experiences and feelings, and this post isn't
vetted by my employers.</li>
</ul>
<p>So, what do I do, on a daily basis, while working as a privacy engineer? I would
split the job responsibilities in three broad categories.</p>
<h2 id="consulting-with-teams">Consulting with teams</h2>
<p>A large part of my time is spent helping product teams get privacy right.</p>
<p>Back when I did generic privacy reviews, I checked future product launches for a
wide variety of things. Will users understand what happens with the data they
share? Is the data appropriately protected in transit and at rest? Can the
product be misused and allow bad people to do evil<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>? Is the system collecting
only what is required for the product's functionality? Will it have a harmful
behavior in specific cases, or will it protect at-risk users properly<sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup>? Are
deletions handled correctly? </p>
<p>Some of this is compliance: making sure that the product aligns with existing
policies and regulations is important. This often isn't the focus point, though.
Everything that can be ticked off a checklist is usually pretty straightforward.
What's most complex and interesting is to identify what can go wrong in specific
scenarios.</p>
<p>How do I do this in practice? I read through design docs, slide decks and
sometimes code or demos to understand the product. I ask for additional
documentation when it's lacking, and poke at the system to see where it could
fail in unexpected and problematic ways. Then, I'll communicate the findings to
the team, and help them correct possible issues.</p>
<p>Nowadays, I'm in a group focused on anonymization, so I have a narrower focus.
Product teams come to us when they need to anonymize data, and we help them get
it right. We make sure they understand what they need to do and how to do it.
Then, we give them the green light once we converge on a solution we're
comfortable with.</p>
<p>This process is much easier when teams consult with privacy folks as soon as
possible. I love to be involved in early design discussions! It's in everyone's
interest. It avoids making choices that we'll regret later, which can save a lot
of engineering time. And if the team does the right thing in the first place,
that makes my job much easier at review time!</p>
<p>This part of the job requires lots of empathy. It's necessary to relate with
users, and understand what will create issues. It's also crucial to build
productive and respectful relationships with product teams. Luckily for me, it's
something that can be learned and improved over time. I wasn't very good at it
at first!</p>
<h2 id="building-things">Building things</h2>
<p>This is why the job title has « engineer » in it!</p>
<p>Finding issues in products is only the first step. Can we automate some of these
investigation methods? Make sure certain classes of problems don't happen in the
future? Detect failures early to prevent them from causing harm? These follow-up
questions can lead to impactful engineering projects.</p>
<ul>
<li><em>Technical improvements to processes</em>: this is not a phenomenon limited to
privacy. When engineers run processes, they'll detect inefficiencies, and
identify automation opportunities. Building tooling to assist with
checklist-type things is often a good idea. Time is better spent focusing on
the complex and unique aspects of consultations!</li>
<li><em>Improving infrastructure</em>: baking privacy into your tech stack is an
excellent investment. Suppose that some vetted storage system takes care of
encryption and deletion correctly. You no longer need to worry about these
aspects in a product that uses this system. That's a great way to save time
and avoid problems! Privacy engineers are uniquely positioned to notice when
this is worth doing.</li>
<li><em>Monitoring</em>: how do you check that products continue to behave as expected
over time? Some problems might arise after the privacy review. Bugs happen.
Code evolves over time. Changes might appear harmless, but have unintended
consequences. Catching potential failures with automated monitoring before
they harm anyone is very rewarding. And again, to know what to look for and
how to detect privacy issues, you need domain experts.</li>
</ul>
<p>Sometimes, it makes sense for a privacy engineer to take on such projects on
their own. This is especially true when there is a lot of specific expertise
involved, which is the case for my group: a large chunk of our work is about
building tools to make anonymization easier and safer. Other times, building a new
thing yourself is not the optimal move. It might make more sense to collaborate
with existing teams, or influence their roadmap.</p>
<p>This part of the job is like a classic software engineering job: writing design
docs, getting resources, writing and reviewing code, improving documentation,
writing tests… In my experience, though, there's often more communication
involved than on a typical project. The problems tackled are often very
horizontal: pain points shared across an entire organization, a new rule that
applies across the board… In these cases, maintaining technical alignment
between related efforts is crucial. Privacy engineers are in the perfect central
position to help with this aspect.</p>
<h2 id="setting-policy">Setting policy</h2>
<p>Privacy engineers try to make their organization do the right thing. But
who decides what the « right thing » is in the first place?</p>
<p>Turns out, it's also part of the job. Privacy engineers take high-level goals or
regulations and translate them into concrete, actionable requirements. Typical
privacy principles are broad and vague in nature, and have many interpretations.
This isn't great: people will have some <em>creative</em> ideas about how to put them
in application.</p>
<p>Being at the interface between non-technical stakeholders and engineers is
tricky, but necessary: if nobody does this job, it's not going to end well. And
it goes both ways! Helping policy makers understand what makes sense from a
technical perspective is crucial. Otherwise, you end up with inapplicable rules,
or counter-productive efforts.</p>
<p>In addition to that, policies don't all come from top-down regulations or
principles. It's frequent to stumble upon questions for which there is no
existing guidance. You often have to make judgment calls, and when you do, it's
important to document these decisions: this is the only way to keep them
consistent across products and over time. And doing that is, in effect, setting
an unofficial policy. Depending on how generalizable it is, it might be worth
turning it into official guidance.</p>
<p>I find policy work <em>way</em> more complex than consulting or engineering work. It
involves long discussions with a wide variety of people: lawyers, executives,
engineers, product managers… And it's critical to get right. Spending the time
to write great guidance is an investment that pays back many times over. And
inconsistent or inapplicable policies can have a huge damaging impact<sup id="fnref:3"><a class="footnote-ref" href="#fn:3">3</a></sup>. So
it's sometimes frustrating, but also very challenging and rewarding.</p>
<p>In practice, what does policy work look like? Many meetings, long discussions on
docs, unending email threads. Yum!</p>
<h2 id="all-the-other-stuff">All the other stuff</h2>
<p>I lied! There's a fourth category. It's a catch-all, for all the extra
responsibilities besides the core ones.</p>
<ul>
<li><em>Education</em>: you can't make every person in your organization a privacy
expert, but it doesn't hurt to try! Giving talks and doing outreach is useful
for many reasons. It helps orient people in the right general direction when
designing new products. It increases their awareness of potential issues, and
makes them more likely to consult with experts early. It's also a great way to
recruit! People you reach this way might later join your team, or become local
privacy experts in their own team.</li>
<li><em>Proactive investigations</em>: poking at existing products outside of structured
consultations can be worthwhile. Especially if nobody has looked at them in a
while…</li>
<li><em>User advocacy</em>: privacy engineers try to make sure products aren't harmful to
marginalized communities. As such, they have a duty to speak up when that's
the case, even if it's not technically about privacy<sup id="fnref:4"><a class="footnote-ref" href="#fn:4">4</a></sup>.</li>
<li><em>Incident response</em>: bad stuff happens, and you need to have a process to make
sure some people are ready to put out fires! Privacy engineers can be in such
incident response roles. I never had such a role, though, so I don't know what
it's like.</li>
<li><em>External outreach</em>: few privacy engineers do external outreach, even though
they're in a unique position to do so. We could do better in that area, and
collaborate more with academia and civil society. Luckily, most excellent
folks are actively <a href="https://www.usenix.org/conference/pepr19">improving</a> the
situation on that front!</li>
<li><em>And a bunch of other engineering-related things</em>. You sometimes need to
analyze data to quantify and prioritize issues. You want to keep educating
yourself on new developments in the privacy space. You also need to pay
attention to public discourse; it's crucial to better understand users, and
anticipate new regulations. You sometimes want to run user studies to align
your UIs with user expectations.</li>
</ul>
<p>I'm sure that I'm forgetting some aspects. You might be a privacy engineer and
this list might not feel very familiar. If that's the case, drop me a line! I'd
love to understand other perspectives on that topic. I hope you agree with me on
one thing, though: being a privacy engineer is challenging, fun and rewarding!</p>
<p>Now, there's a related question I also get sometimes: « being a privacy engineer
sounds awesome, how can I become one ? ». Unfortunately, I don't have a great
answer to that one. For me, it was pretty random. I joined Google as a software
engineer, switched to the privacy team because it sounded fun, and learned
everything I know there. I've heard good things about a <a href="http://privacy.cs.cmu.edu/">Master's degree at
CMU</a>… But apart from that, I don't know of many
educational opportunities. I'm not sure how to hunt for privacy engineering
jobs, either. I hope someone else writes a good answer to that question! Because
we definitely need more privacy engineers in industry.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>There are lots of sub-questions here! Can a domestic abuser use the
product to spy on their spouse? Can criminals take advantage of it to run
online scams? Can political actors use it to spread disinformation? <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>Will your system deadname trans people, or unmask political activists? Are
privacy-critical surfaces accessible and understandable to people with
disabilities? Can the product amplify hurtful or triggering content? Needless
to say, this is why you need your privacy engineering team to be diverse. <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>Dealing with
<a href="https://en.wikipedia.org/wiki/Technical_debt">technical debt</a> in software
engineering is no easy feat… But "policy debt" is many times worse and more
difficult to pay back. Significantly changing existing policies is <em>hard</em>. <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p>Things like ethics or ML fairness deserve specific processes and technical
expertise. They're like security and privacy, complex and crucial to get
right. But when these processes don't exist, they fall by default under the
privacy umbrella. So existing practitioners have a duty to keep an eye on this
sort of thing in the meantime. <a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
</ol>
</div>Local vs. central differential privacy2019-06-27T00:00:00+02:002021-09-30T00:00:00+02:00Damien Desfontainestag:desfontain.es,2019-06-27:/privacy/local-global-differential-privacy.html<p>Differential privacy is used in two very distinct contexts. Come learn about the distinction between the two, and interesting new directions that combine them!</p><p><small>
<span class='notlettrine'>T</span>his post is part of a <a href="friendly-intro-to-differential-privacy.html">series on differential
privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see the other
articles!</p>
<p></small></p>
<hr>
<p><span class='lettrine'>W</span><strong>hen</strong> people talk about differential privacy,
they don't always have the same thing in mind. People agree on the
<a href="differential-privacy-in-more-detail.html">definition</a>, but context also matters a <em>lot</em>. There are two "flavors"
of differential privacy: the <strong>central model</strong> (or <strong>global model</strong>) and the
<strong>local model</strong>. In this post, I'll first explain the difference between the two
options. Then, I'll give a glimpse of an exciting new idea that combines them.</p>
<p>What do I mean by "context"? In abstract terms, a differentially private
algorithm transforms an input into an output.</p>
<p><center>
<img alt="Diagram "raw input" → "anonymized output", with "magic" on the arrow." src="https://desfontain.es/privacy/images/input-output-diagram.svg">
</center></p>
<p>The main difference will be: who has access to the raw input?</p>
<h1 id="central-differential-privacy">Central differential privacy</h1>
<p><a name="central"></a></p>
<p>In the <em>central model</em> (or <em>global model</em>) of differential privacy, a <em>central
aggregator</em> has access to the real data. What is this "aggregator"? Typically,
it's a service or a research organization collecting data about individuals. In
this model, each user sends their data to this aggregator without noise. The
aggregator takes this data, and transforms it with a differentially private
mechanism.</p>
<p><center>
<img alt="Diagram "users" → "aggregator" → "anonymized output", with "magic" on the second arrow." src="https://desfontain.es/privacy/images/global-dp-diagram.svg">
</center></p>
<p>The differentially private mechanism is only applied <em>once</em>, at the end of the
process. The aggregator can then e.g. publish the result or share it with
third parties.</p>
<p>This model has one big advantage: <strong>accuracy</strong>. In the central model, you usually
don't need to add a lot of noise to get valuable results with a low
<span class="math">\(\varepsilon\)</span>. Remember when I explained how to release <a href="differential-privacy-in-practice.html">statistics</a>
or <a href="almost-differential-privacy.html">histograms</a> using differential privacy? These examples were using
this central model. It worked pretty well: only a <a href="differential-privacy-in-practice.html#laplace">little noise</a> was
needed to hide someone in a count.</p>
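<p>As a concrete sketch of what the aggregator does in the central model (the function names here are made up for illustration): it computes the exact count from the raw data, then adds a single draw of Laplace noise scaled to the count's sensitivity before publishing.</p>

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon):
    """Release a count with epsilon-DP in the central model.

    A count has sensitivity 1: adding or removing one person changes
    it by at most 1, so Laplace(1/epsilon) noise is enough.
    """
    return true_count + laplace_noise(1.0 / epsilon)

# The aggregator sees the raw data, counts once, and adds one draw
# of noise to the final result:
noisy_total = private_count(true_count=10_000, epsilon=0.5)
```

<p>Note that only one noise draw is added for the whole dataset, which is why the relative error stays small no matter how many users contributed.</p>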
<p>Where's the catch, then? Well, the central aggregator needs to know the real
data. In the scenarios above, we added noise to <em>real counts</em>. This is only
possible if we know the true numbers in the first place… To enable this, <strong>each
user has to trust the aggregator</strong> enough to share data with it. That might be
difficult: the aggregator can be an untrusted company or government. Also, with
the central model, all the data is collected in one place. It increases the risk
of catastrophic failure, for example if the aggregator gets hacked and leaks all
the data.</p>
<p>The most famous real-world example of the central model is probably the US
Census. In 2020, the US Census will use differential privacy to anonymize the
data before publication. This is pretty exciting! You can read more about it
<a href="https://www.census.gov/newsroom/blogs/random-samplings/2019/02/census_bureau_adopts.html">here</a>.</p>
<h1 id="local-differential-privacy">Local differential privacy</h1>
<p><a name="local"></a></p>
<p>What's the alternative, then? It's the <em>local model</em> of differential privacy. In
this model, there is still an aggregator, but they no longer have access to the
real data. Instead, each user applies a differentially private mechanism to
their <em>own</em> data. And they only send their data to the aggregator once it's
already anonymized.</p>
<p><center>
<img alt="Diagram "users" → "aggregator" → "anonymized output", with "magic" on the first arrow." src="https://desfontain.es/privacy/images/local-dp-diagram.svg">
</center></p>
<p>After collecting this noisy data, the aggregator can compute some statistics,
and publish them. This last step doesn't need to be differentially private: the
data is anonymous to begin with. In theory, the aggregator could publish the
entire dataset they collected.</p>
<p>The big advantage of this model is that it <strong>no longer requires trust</strong>. Since
each user is protecting their own data, they're safe even if the aggregator is
malicious. This makes the local model well-suited to situations where trust is
difficult to get. And we already saw an example of this! Remember the survey
about drug use that used <a href="differential-privacy-in-more-detail.html#rr">randomized response</a> to gather data. The scheme
allowed subjects to answer honestly without admitting to breaking the law. This
is a typical application of the local model.</p>
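<p>Here is a sketch of the classic two-coin-flip variant of randomized response (not necessarily the exact protocol of that survey): each respondent flips a coin, answers honestly on heads, and otherwise answers at random. The aggregator can still debias the collected answers to estimate the true rate.</p>

```python
import random

def randomized_response(truth):
    """Answer honestly with probability 1/2, else answer at random.

    A 'yes' report is at most 3 times more likely for a true 'yes'
    (3/4) than for a true 'no' (1/4), giving epsilon = ln(3).
    """
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

def debias(noisy_answers):
    """Estimate the true 'yes' rate p from noisy reports.

    Each report is 'yes' with probability 0.5*p + 0.25,
    so p = 2 * (observed_rate - 0.25).
    """
    observed = sum(noisy_answers) / len(noisy_answers)
    return 2.0 * (observed - 0.25)
```

<p>The debiasing step is why the local model needs many users: the estimate is unbiased, but its variance shrinks only as the population grows.</p>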
<p>Can you guess the drawback of this model? Since each user must add noise to
their own data, <strong>the total noise is much larger</strong>. You typically need many more
users than in the central model to get useful results. To mitigate this problem,
practical applications often use high values of <span class="math">\(\varepsilon\)</span>.</p>
<p>Besides randomized response, the most famous example of this model is probably
<a href="https://github.com/google/rappor">RAPPOR</a>. This clever scheme was invented to collect differentially private data
in Google Chrome. Another example, a bit more recent, is the <a href="https://machinelearning.apple.com/2017/12/06/learning-with-privacy-at-scale.html">mechanism</a>
that Apple uses to collect data on the iOS keyboard.</p>
<h1 id="the-best-of-both-worlds">The best of both worlds</h1>
<p><a name="shuffling"></a></p>
<p>Until very recently, there was no middle ground between the two options. The
choice was binary: either accept a much larger level of noise, or collect raw
data. This is starting to change, thanks to recent work on a novel type of
architecture. This new design is called <em>ESA</em>: Encode, Shuffle, Analyze. In the
two previous models, there were only two types of participants: users and
aggregator. By contrast, the ESA architecture has three elements.</p>
<ul>
<li>The <em>encoder</em> is a fancy name to say "user". It collects the data, encrypts it
twice, and passes it to the shuffler.</li>
<li>The <em>shuffler</em> is an intermediary process. First, it removes identifiers, and
groups similar pieces of data together. Then, it passes these pieces of data
to the analyzer if there are enough of them.</li>
<li>The <em>analyzer</em> actually decrypts the data, and computes the statistics we're
interested in.</li>
</ul>
<p><center>
<img alt="Diagram "users" → "shuffler" → "analyzer" → "anonymized output", with "magic" on the shuffler and analyzer boxes." src="https://desfontain.es/privacy/images/esa-diagram.svg">
</center></p>
<p>In ESA, the data is encrypted twice, in two layers. The shuffler can only
decrypt the first layer. It contains the user IDs, and something called "group
ID". This group ID describes what <em>kind</em> of data this is, but not the
actual <em>value</em> of the data. For example, the group ID could be a label like "app
latency for iOS", while the actual data would be the latency value.</p>
<p>The shuffler then groups all group IDs together and counts how many users are in
each group. If there are enough users in a group, it passes them all to the
analyzer. The analyzer can then decrypt the second layer of the data, and
compute the output.</p>
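<p>Ignoring the cryptography, the shuffler's core logic can be sketched in a few lines (names like <code>shuffle_and_threshold</code> are hypothetical, for illustration only): drop user IDs, batch payloads by group ID, and forward only the groups with enough users, in random order.</p>

```python
import random
from collections import defaultdict

def shuffle_and_threshold(reports, min_group_size):
    """Toy shuffler: strip user IDs, batch payloads by group ID,
    shuffle each batch, and forward only large-enough batches.

    Each report is a (user_id, group_id, payload) tuple; only the
    payloads of sufficiently popular groups reach the analyzer.
    """
    groups = defaultdict(list)
    for user_id, group_id, payload in reports:
        groups[group_id].append(payload)  # user_id is dropped here
    batches = {}
    for group_id, payloads in groups.items():
        if len(payloads) >= min_group_size:
            random.shuffle(payloads)  # hide arrival order
            batches[group_id] = payloads
    return batches
```

<p>In the real design the payloads stay encrypted under the analyzer's key, so the shuffler forwards opaque blobs; this sketch only shows the grouping and thresholding.</p>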
<p>Having two layers makes it possible to separate the data from its metadata. The shuffler can
see IDs, but not the actual data. Meanwhile, the analyzer only sees the data in
batches, and cannot know who sent what. The magic comes with tweaking what the
shuffler does. You add some randomness in the process, in well-chosen places…
And it ensures that the final output is differentially private.</p>
<p>This nice property comes from the separation between shuffler and analyzer. Of
course, if the same entity runs both, that benefit disappears. This design can
still be useful! In particular, catastrophic failures are much less likely. But
from a trust perspective, it becomes like the central model. So there are two
options:</p>
<ul>
<li>A different organization can run the shuffler as a service. In principle, a
nonprofit like the <a href="https://www.eff.org/">EFF</a> could play this role.</li>
<li>The shuffler can run on secure hardware, using e.g. <a href="https://en.wikipedia.org/wiki/Trusted_Computing#Remote_attestation">remote attestation</a>. This
way, the client can verify that the shuffler does what it's supposed to.</li>
</ul>
<p>Both options are probably a few years away, but I find them pretty exciting.
They could, in theory, bring the best of both worlds: the low noise levels of
the central model, and the trustless aspect of the local model. If you want to
learn more about it, you can check out the <a href="https://arxiv.org/pdf/1710.00901.pdf">paper</a> that introduced this design.</p>
<hr>
<p>Interested in learning more about differential privacy? Head over to the <a href="friendly-intro-to-differential-privacy.html">table
of contents</a> of this series to see its other posts. Or you can directly
go to the <a href="why-not-differential-privacy.html">next article in the series</a>, which is somewhat paradoxical:
it explores what it means for an algorithm to <em>not</em> be differentially private.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Research post: Cardinality Estimators do not Preserve Privacy2019-03-09T00:00:00+01:002019-08-24T00:00:00+02:00Damien Desfontainestag:desfontain.es,2019-03-09:/privacy/cardinality-estimators.html<p>You can't both remember unique individuals and not remember them. Shocking, right? :D</p><p><strong>Next</strong> summer at <a href="https://petsymposium.org/index.php">PETS</a> 2019, I'll present a <a href="https://arxiv.org/abs/1808.05879">paper</a> I wrote with <a href="http://www.andreas-lochbihler.de/">Andreas
Lochbihler</a> and my PhD advisor <a href="https://www.inf.ethz.ch/personal/basin/">David Basin</a>. This post will
attempt to explain what the paper is about, and what its results mean in
practice.</p>
<h1 id="tldr">tl;dr</h1>
<p>You can't both remember unique individuals and not remember them. </p>
<p>It's not incredibly surprising, but still cool to have a formal negative result
about it :D</p>
<h1 id="cardinality-estimators">Cardinality estimators</h1>
<p>The title of the paper is <em>Cardinality Estimators do not Preserve Privacy</em>.
First, what's a cardinality estimator? It's an algorithm, and an associated data
structure called a <em>sketch</em>. It does two things:</p>
<ul>
<li>it can <em>count unique items</em> in a list (without duplicates);</li>
<li>and you can <em>merge</em> several of them to count unique items in multiple lists.</li>
</ul>
<p>Suppose you have two huge log files, each containing one billion unique
identifiers. How many unique identifiers are present in the union of both files?
If you only have counts, there's no way to tell. Both files might have the same
identifiers, so the total count is one billion. Or maybe they only contain
different identifiers, so the total count is two billion. Or it could be
in-between these two extreme options.</p>
<p>A cardinality estimator can answer that question. You can apply it to each log
and get two sketches. Each can estimate the number of unique items in its input
log. Further, you can merge the two sketches to get a new sketch… And the
estimated count of this new sketch is the <em>deduplicated</em> version of the two
counts.</p>
<p>The simplest example is a <a href="https://en.wikipedia.org/wiki/Set_(abstract_data_type)">set</a> that remembers all the items it saw. It will
return an exact count, and merging two sketches is straightforward.
Unfortunately, it doesn't scale very well: you need a lot of memory to remember
every single element, and lots of time to merge large sketches. Instead, we
often trade precision for performance, and use <em>approximate</em> algorithms. They
don't return an exact answer, but are much faster and memory-efficient.</p>
<p>The first cardinality estimator I encountered is <em>HyperLogLog</em>. It's a very
popular choice: very efficient, and precise enough for most use cases. The idea
behind it is quite smart: if you want to learn about it, I recommend this <a href="https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/">blog
post</a> and this <a href="https://ai.google/research/pubs/pub40671">research paper</a>.</p>
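<p>To make the bucketing-and-rank idea concrete, here is a toy HyperLogLog in Python. This is my own illustrative sketch, not code from the resources above: it skips the small- and large-range corrections that real implementations apply, so treat it as a demonstration of the idea only.</p>

```python
import hashlib

class ToyHyperLogLog:
    """Simplified HyperLogLog: no small- or large-range corrections."""

    def __init__(self, p=10):
        self.p = p              # the first p hash bits pick a register
        self.m = 1 << p         # number of registers
        self.registers = [0] * self.m

    def _hash(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        return int.from_bytes(digest[:8], "big")  # 64-bit hash

    def add(self, item):
        h = self._hash(item)
        bucket = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # rank = 1-based position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def merge(self, other):
        # Element-wise maximum of registers deduplicates across sketches.
        merged = ToyHyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # standard bias-correction constant
        z = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / z
```

<p>The merge is just an element-wise maximum of registers — which is also why the sketch necessarily "remembers" something about individual items, foreshadowing the negative result discussed below.</p>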
<p>As part of my privacy reviewer role at Google, a team approached me with a
question. They were storing HyperLogLog sketches, and asked me: "are those
sketches as sensitive as the original data?".</p>
<h1 id="privacy-modeling">Privacy modeling</h1>
<p>To answer that question, we first have to define what we mean by sensitive. We
used the idea behind differential privacy, the attacker's <a href="differential-privacy-in-more-detail.html">information
gain</a>. Intuitively, if an attacker has some suspicion that a user was
added to a sketch, and then looks at the sketch… How much can their suspicion
increase, or decrease?</p>
<p>However, we added two twists to this idea.</p>
<ul>
<li>First, we only care if the level of suspicion <em>increases</em>, not if it
decreases. If the attacker learns that their target is in a given sketch, it's
a problem. If they learn that their target is <em>not</em> in a given sketch, then we
don't consider it that big of a deal.</li>
<li>Second, we don't assume that the attacker knows all users except their target.
Instead, we assume that the attacker knows <em>nothing</em> about the sketch. From
their perspective, the data is 100% random.</li>
</ul>
<p>These choices only made sense for the original use case (the Google privacy
review). For the sketches I was considering, learning that somebody was <em>not</em> in
a sketch wouldn't have been very problematic. There was also no reason to assume
the attacker would have any background knowledge. These assumptions are much
weaker than in most data privacy papers and use cases.</p>
<p>If this research had led to a positive result, it wouldn't have been very
convincing. "Here is a cardinality estimator that satisfies this super weak
notion of privacy!" People would have pushed back, saying that the assumptions
were too weak.</p>
<p>Luckily (for me), the result went the other way. First, I found that HyperLogLog was
not private according to this definition. That was the easy part, and it led to
a natural follow-up question: can we make it private? Or more generally, can we
build a private cardinality estimator? We want it to have the same nice
properties as HyperLogLog, but with better privacy.</p>
<h1 id="main-result">Main result</h1>
<p>It turns out that the answer is negative. Even with our weak privacy notion, the
problem is <em>unsolvable</em>. No cardinality estimator can be both private and
accurate. The accuracy of a private cardinality estimator gets exponentially
worse as its size grows.</p>
<p>Since the result is negative, the privacy definition's weakness makes our result
<em>stronger</em>. Accuracy is incompatible with a weak notion of privacy… So it's also
incompatible with stronger notions. We also considered even weaker variants,
e.g. allowing for a small <a href="almost-differential-privacy.html">probability of error</a>. It didn't change the
negative result. There seems to be a fundamental incompatibility between privacy
and accuracy.</p>
<p>There is one caveat: this is only true if you want to be able to merge an
arbitrary number of sketches. If the accuracy can get worse as you merge
sketches, the result does not hold. In such a context, privacy-preserving
schemes might exist. So, if your use case only requires you to merge a bounded
number of sketches, you might have options. But if you want analysts to be able
to do arbitrary aggregations of sketches and still get reasonably good results…
then privacy is an impossible goal.</p>
<p>With this added caveat, our result becomes more intuitive. To merge two sketches
that count unique users, you have to deduplicate users… So you have to keep the
information about which users are in the sketch. As with HyperLogLog, this
information doesn't have to be exact. But the more you remember, the more an
attacker can use it to break the privacy property. HyperLogLog remembers some
users more than others, and that's what allows it to stay accurate. If you can't
remember any user well, then your cardinality estimator gets very inaccurate.</p>
<p>So there are two contributions: a theoretical one and a practical one. </p>
<p>The theoretical part is a confirmation and formalization of an expected
phenomenon. It's still interesting, because it's quite rare. There aren't many
negative results in the world of differential privacy. A typical privacy paper
takes a problem and solves it in a differentially private way. Here, we're
presenting a problem for which this is impossible. This leads to an open
question, which we ask at the end of the paper: what's a minimal set of
constraints that make differential privacy impossible?</p>
<p>There is also one practical consequence: cardinality estimators in use today are
not private. Their sketches should be considered roughly as sensitive as raw
data. We proved it manually for HyperLogLog… But our result is generic, so it
holds for all cardinality estimators.</p>
<h1 id="behind-the-scenes">Behind the scenes</h1>
<p>The story we tell in the paper isn't exactly the path we actually followed to
get our results. For example, the attacker's lack of background knowledge came
from a practical constraint. In the original problem, cardinality estimators
were built using large-scale tools like <a href="https://en.wikipedia.org/wiki/MapReduce">MapReduce</a>. Such tools assume that the
aggregation primitives are <em>deterministic</em>: for example, MapReduce double-checks
the computation results for fault tolerance. HyperLogLog is deterministic: a
sketch formed with a given input is always the same. All other cardinality
estimators we found were also deterministic. So at first, we required that any
solution to our question should also be deterministic.</p>
<p>But it's impossible to get differential privacy without adding noise to the
data. We assumed that the attacker lacked knowledge about the data to get around
this problem. If the data itself is random, it can play the same role as noise
from the attacker's perspective.</p>
<p>The first negative result we got had the assumption that cardinality estimators
were deterministic. I was unhappy about it, and wanted the paper to have a more
generic result. For a good chunk of 2017, I tried to extend the result to
arbitrary cardinality estimators. We gave up and tried to submit the paper in
autumn, but got rejected for exactly this reason. This gave me a motivation
boost: reviewers, too, thought this was important. The next submission had the
generalized result. A good example of the peer review process working well! ^^</p>
<p><em>August 2019 edit: this paper and the presentation I gave to PETS won a
<a href="https://petsymposium.org/student-paper-award.php">Best Student Paper award</a>!
After two rejections from other conferences, and the fact that it only got
accepted to PETS thanks to a
<a href="https://petsymposium.org/experiment.php">consistency experiment</a> (one group of
reviewers accepted it, not the other), it was unexpected and nice. My talk was
recorded, you can watch it <a href="https://youtu.be/wjlAbh0qQQQ?t=54">here</a>.</em></p>
<p>Also, the "a-ha!" moment to get the generic result didn't happen when I
expected. It didn't click when I was spending hours working on it in the lab.
Instead, it struck in the shower, after spending a week of vacation without an
Internet connection. I strongly suspect this isn't a coincidence… Logging off
must have helped my brain be more relaxed and creative or something.</p>
<h1 id="what-comes-next">What comes next</h1>
<p>I'll try to write a similar blog post for each research paper I publish<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>. If
you're a researcher, I'd encourage you to do the same. The time it takes to
publish a write-up like this is negligible compared to doing the research and
writing the paper… And many more people will be able to read it!</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>Emphasis on "try". When the paper is purely the result of research done at
Google, like <a href="https://ai.google/research/pubs/pub47664">this one</a>, there might
be complications. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>Almost differential privacy2019-02-20T00:00:00+01:002020-03-05T00:00:00+01:00Damien Desfontainestag:desfontain.es,2019-02-20:/privacy/almost-differential-privacy.html<p>Publishing histograms without knowing the categories in advance: introducing (ε,δ)-differential privacy.</p><p><small>
<span class='notlettrine'>T</span>his post is part of a <a href="friendly-intro-to-differential-privacy.html">series on differential
privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see the other
articles!</p>
<p></small></p>
<hr>
<p><span class='lettrine'>L</span><strong>et's</strong> continue where we left off. In the <a href="differential-privacy-in-practice.html">last
article</a>, we saw how to publish histograms in a privacy-preserving way. Adding
noise to each count was enough to get ε-differential privacy. But we finished
with a puzzling statement: I mentioned that if you don't know the categories in
advance, the technique no longer works. In fact, the problem gets much trickier.
We'll even need to introduce a variant of the original definition! Let's dive
in.</p>
<h1 id="open-ended-survey-question">Open-ended survey question</h1>
<p>Let's say you're running a survey that asks people what their favorite color
is. Instead of giving them a list of fixed options, you let them write
whatever text they want. Lots of answers are going to be common colors: <em>blue</em>,
<em>green</em>, <em>pink</em>… But real-world data is noisy, and you're surely going to get
unpredictable answers. Some might be junk answers: people misunderstanding the
question, or trolling the survey. Others might simply be rare colors. You want to
publish a histogram of answers.</p>
<p>Let's use the same technique as before. What happens if we add <a href="https://en.wikipedia.org/wiki/Laplace_distribution">Laplace noise</a>
of scale <span class="math">\(1/\varepsilon\)</span> to each category? We need to compare the output of this
process for two databases that differ on a single element. There are two
possibilities.</p>
<h2 id="the-two-databases-have-the-same-categories">The two databases have the same categories</h2>
<p>If you're lucky, the two databases have the same categories. For example:</p>
<ul>
<li>In one, you got 10 <em>green</em> answers, 5 <em>red</em>, and 2 <em>yellow</em>.</li>
<li>In the other, you got 10 <em>green</em> answers, 5 <em>red</em>, and <strong>3</strong> <em>yellow</em>.</li>
</ul>
<p>Then, adding noise to each category works fine. The only difference is in the
<em>yellow</em> category. By adding noise, we hide the difference between the two
values, exactly like <a href="differential-privacy-in-practice.html">before</a>.</p>
<h2 id="the-two-databases-dont-have-the-same-categories">The two databases don't have the same categories</h2>
<p>This is where it gets trickier. For example:</p>
<ul>
<li>In one, you got 10 <em>green</em> answers, 5 <em>red</em>, 2 <em>yellow</em>.</li>
<li>In the other, you got 10 <em>green</em> answers, 5 <em>red</em>, 2 <em>yellow</em>, <strong>and one
<em>ultramarine</em>.</strong></li>
</ul>
<p>Let's see what happens if you add noise to both. Each column will end up with a
slightly different number than the real one. But there's something glaringly
obvious: <em>the categories are different</em>!</p>
<p><center>
<img alt="Histogram with three columns corresponding to colors, with noise" src="https://desfontain.es/privacy/images/3-color-noised-histogram.svg">
<img alt="Histogram with four columns corresponding to colors, with noise" src="https://desfontain.es/privacy/images/4-color-noised-histogram.svg">
</center></p>
<p>No need to squint at the numbers to notice the difference between these two
histograms! It's easy for an attacker to tell apart outputs with different
categories. We call this a <strong>distinguishing event</strong>: the attacker can learn with
100% certainty which database is the right one. Thus, the process is not
differentially private. How to fix this?</p>
<p>Maybe we could list all <em>possible</em> categories, and add noise to each of them,
including zero counts. Unfortunately, it's not as simple as that. In our
example, people can enter <em>anything</em>: there's an infinite number of
possibilities. The good news is that at the cost of a slight relaxation in our
privacy guarantee, we can overcome that problem.</p>
<h1 id="a-solution-thresholding">A solution: thresholding</h1>
<p>It's fairly difficult to make sure distinguishing events never happen<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>. Instead, we can settle for the next best thing: we prevent them from happening
<strong>most of the time</strong>. One way to do that is <strong>thresholding</strong>. Not only do we add
noise to each category, but we also remove all categories with low counts. Let's
say that our threshold is <strong>5</strong>. In the example above, we would end up releasing
only two categories:</p>
<p><center>
<img alt="Histogram with two columns corresponding to colors, with noise" src="https://desfontain.es/privacy/images/2-color-noised-histogram.svg">
</center></p>
<p>There's a price to that strategy: we're losing rare categories. In this example,
we didn't only drop the <em>ultramarine</em> category, but <em>yellow</em> as well. Any
category whose count is close to 5 (or less) has a significant chance of being
lost. Often, that's not a big problem: rare answers have a larger chance of
being meaningless.</p>
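<p>The noise-then-threshold strategy can be sketched in a few lines of Python. This is an illustration of the idea, not a vetted implementation: the function names are mine, and a real deployment would need a cryptographically secure noise source rather than Python's default RNG.</p>

```python
import math
import random

def sample_laplace(scale, rng):
    """Draw one Laplace(0, scale) sample via the inverse CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def noisy_thresholded_histogram(counts, epsilon, threshold, rng=None):
    """Add Laplace(1/epsilon) noise to each count, then drop what falls below the threshold."""
    rng = rng or random.Random()
    released = {}
    for category, count in counts.items():
        noisy = count + sample_laplace(1 / epsilon, rng)
        if noisy >= threshold:
            released[category] = noisy
    return released

# The survey example, with noise of scale 1/ln(3) and a threshold of 5:
votes = {"green": 10, "red": 5, "yellow": 2, "ultramarine": 1}
print(noisy_thresholded_histogram(votes, epsilon=math.log(3), threshold=5))
```

<p>Note that the threshold is applied <em>after</em> adding noise: categories with small true counts usually disappear, but occasionally survive — which is exactly the failure mode discussed next.</p>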
<p>That solution isn't perfect from a privacy perspective. For example, what if the
noise added to the ultramarine category is larger than 4? Then the total count
is 5 or more, we end up publishing this category, and it breaks differential
privacy. Fortunately, this doesn't happen too often: only 0.6% of the time with
Laplace noise of parameter <span class="math">\(1/\ln(3)\)</span>.</p>
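<p>That 0.6% figure follows directly from the Laplace tail: for noise of scale <span class="math">\(b\)</span>, the probability of exceeding <span class="math">\(t\)</span> is <span class="math">\(\frac{1}{2}e^{-t/b}\)</span>. A quick sanity check:</p>

```python
import math

def laplace_tail(t, scale):
    """P[Laplace(0, scale) > t], for t >= 0."""
    return 0.5 * math.exp(-t / scale)

# Noise of scale 1/ln(3) exceeding 4 pushes a count of 1 past a threshold of 5:
delta = laplace_tail(4, 1 / math.log(3))
print(round(delta, 4))  # -> 0.0062, i.e. about 0.6%
```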
<h1 id="tying-it-all-together-varepsilondelta-differential-privacy">Tying it all together: <span class="math">\((\varepsilon,\delta)\)</span>-differential privacy</h1>
<p>Our strategy is a little more complicated than before. We now have two
parameters.</p>
<ul>
<li>The <strong>amount of noise</strong> we're adding. Just like before, if we're aiming for
<span class="math">\(\varepsilon\)</span>-differential privacy most of the time, we need to add Laplace
noise of scale <span class="math">\(1/\varepsilon\)</span>.</li>
<li>The <strong>threshold</strong> we're using to drop rare categories, after adding noise. It
induces a natural trade-off. The bigger the threshold, the more data you lose…
But the bigger the threshold, the more you reduce the odds of having a
distinguishing event.</li>
</ul>
<p>Let's visualize this. For each threshold, what are the odds that by adding noise
to a category with count 1, you end up above the threshold? The following graph
assumes Laplace noise of parameter <span class="math">\(1/\ln(3)\)</span>. </p>
<p><center> <img alt="Graph showing the probability of a distinguishing event depending on
the threshold" src="https://desfontain.es/privacy/images/laplace-tail-depending-on-threshold.svg">
</center></p>
<p>Using a logarithmic scale, the graph is a straight line. That makes sense:
Laplace noise is a double exponential distribution.</p>
<p>Now, the choice of threshold is specific to the algorithm. For a different
algorithm, or a different noise function, the same threshold might have a
different effect. So it's not a good idea to use it directly to quantify
privacy. Instead, we use the <strong>odds of a distinguishing event</strong> as an additional
parameter to our modified definition.</p>
<h4 id="formal-definition">Formal definition</h4>
<p>From <span class="math">\(\varepsilon\)</span>-differential privacy, we get
<strong><span class="math">\((\varepsilon,\delta)\)</span>-differential privacy</strong>. This new definition is strictly
weaker than the original definition, and has a similar formulation. For all
databases <span class="math">\(D_1\)</span> and <span class="math">\(D_2\)</span> which differ in only one individual, and all sets <span class="math">\(S\)</span>
of outputs:</p>
<div class="math">$$
\mathbb{P}[A(D_1)\in S] \le e^\varepsilon\cdot\mathbb{P}[A(D_2)\in S]+\delta.
$$</div>
<p>The meaning of <span class="math">\(\varepsilon\)</span> is the same as <a href="differential-privacy-in-more-detail.html#quantifying">before</a>. The only new
element is the <span class="math">\(\delta\)</span>. It captures the odds that something goes wrong<sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup>. In
our example above, <span class="math">\(\delta\approx0.006=0.6\%\)</span>. By using
<span class="math">\((\varepsilon,\delta)\)</span>-differential privacy, we're saying that the algorithm is
<em>almost</em> <span class="math">\(\varepsilon\)</span>-differentially private. And here, <em>almost</em> means <em>with
probability <span class="math">\(1-\delta\)</span></em>: the closer <span class="math">\(\delta\)</span> is to 0, the better.</p>
<h4 id="criticisms-of-the-definition">Criticisms of the definition</h4>
<p>As I said, you can see <span class="math">\(\delta\)</span> as the probability that something goes terribly
wrong. For a privacy definition, this seems like a bad thing to have. Consider
the following algorithm, which takes a database as input. With probability
<span class="math">\(1-\delta\)</span>, it returns 42. With probability <span class="math">\(\delta\)</span>, it returns the entire
database. Talk about a data leak! Still, this algorithm is
<span class="math">\((0,\delta)\)</span>-differentially private.</p>
<p>This example illustrates that this <span class="math">\(\delta\)</span> parameter allows for <em>catastrophic
failures</em><sup id="fnref:3"><a class="footnote-ref" href="#fn:3">3</a></sup>. Knowing this, you have two options.</p>
<ol>
<li>Either you work harder to predefine categories, or use more advanced
techniques, and only use "true" differential privacy.</li>
<li>Or you accept that bad things can happen, and try to limit the risk by
mandating a tiny <span class="math">\(\delta\)</span>.</li>
</ol>
<p>I'd argue that the second solution is not a bad choice. The probability of
getting hit by lightning in your lifetime is on the order of
<a href="https://www.weather.gov/safety/lightning-odds"><span class="math">\(10^{-4}\)</span></a>. The probability of a given bit in your RAM being
randomly flipped by a cosmic ray in one year is about <a href="https://stackoverflow.com/a/23587649"><span class="math">\(10^{-6}\)</span></a>. In
many situations, it's reasonable to consider these a negligible risk.</p>
<p>My perspective is that everything in data protection is about risk mitigation.
You'll never reduce the risk to 0. Even if you use "true" differential privacy,
your implementation might have critical bugs. Or you might get hacked, and your
entire anonymization strategy might become irrelevant. Or someone might drug you
and hit you with a <a href="https://xkcd.com/538">$5 wrench</a> until you give them your database.
What are the odds of this happening? If your <span class="math">\(\delta\)</span> is even smaller, it might
be an acceptable price to pay for more convenience.</p>
<h4 id="how-to-choose-delta">How to choose <span class="math">\(\delta\)</span>?</h4>
<p>Considering the catastrophic scenarios above, maybe our <span class="math">\(\delta\)</span> of 0.6% is a
bit too large to use everywhere. But what's a good number? A common option is to
pick a <span class="math">\(\delta\)</span> that is significantly smaller than <span class="math">\(1/n\)</span>, where <span class="math">\(n\)</span> is the total
number of people in the database. The reasoning goes as follows. Each person
has, in the worst case, a <span class="math">\(\delta\)</span> chance that their data leaks. So the total
odds that someone's data leaks is <span class="math">\(\approx n\delta\)</span>: we need to make sure that
this number is small enough<sup id="fnref:4"><a class="footnote-ref" href="#fn:4">4</a></sup>.</p>
<p>Luckily, in the problem above, you don't need huge thresholds to get tiny values
of <span class="math">\(\delta\)</span>. If you have a million users, and you want <span class="math">\(n\delta<0.1\)</span>, a
threshold of 15 is enough.</p>
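<p>We can sanity-check that claim with the one-sided Laplace tail from the 0.6% example earlier. The post doesn't spell out the exact accounting, so take this as a back-of-the-envelope approximation (scale <span class="math">\(1/\ln(3)\)</span>, worst-case counts of 1):</p>

```python
import math

def delta_for_threshold(threshold, scale):
    """Chance that a count of 1 plus Laplace noise lands at or above the threshold."""
    return 0.5 * math.exp(-(threshold - 1) / scale)

n = 1_000_000
scale = 1 / math.log(3)
print(n * delta_for_threshold(15, scale))  # roughly 0.1
```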
<h4 id="cool-properties">Cool properties</h4>
<p><span class="math">\((\varepsilon,\delta)\)</span>-differential privacy has the same convenient properties
as differential privacy.</p>
<ul>
<li><a href="differential-privacy-awesomeness.html#composition">Composition</a>: suppose you have two <span class="math">\((\varepsilon,\delta)\)</span>-differentially
private mechanisms. Then, publishing the result of both satisfies
<span class="math">\((2\varepsilon,2\delta)\)</span>-differential privacy.</li>
<li><a href="differential-privacy-in-practice.html#post-processing">Post-processing</a>: suppose you have a <span class="math">\((\varepsilon,\delta)\)</span>-differentially
private mechanism. Then if you make its output go through a fixed
transformation, you still get <span class="math">\((\varepsilon,\delta)\)</span>-differential privacy.</li>
</ul>
<p>That means that most of what we learned in the <a href="differential-privacy-in-practice.html">simpler case</a> of predefined
categories still applies. You can round noisy values to integers without risk.
If the same person can be in multiple buckets, you can adapt the values of
<span class="math">\(\varepsilon\)</span> and <span class="math">\(\delta\)</span>. You can also compute sums, although you should be
careful in how you adapt the threshold when doing so.</p>
<h1 id="future-steps">Future steps</h1>
<p>With that, we covered the most frequent and easy use cases for differential
privacy. Next, we'll take a closer look at this new definition,
<span class="math">\((\varepsilon,\delta)\)</span>-differential privacy. We'll see that meaning of <span class="math">\(\delta\)</span>
is actually a little more subtle than explained in this post. And this is the
perfect excuse to introduce an important concept: the <a href="privacy-loss-random-variable.html">privacy loss random
variable</a>. Or you can also head over to the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> of
this blog post series to decide what to read next!</p>
<hr>
<p><small>Thanks to <a href="http://www.frankmcsherry.org/about/">Frank McSherry</a> and <a href="http://a3nm.net/">Antoine Amarilli</a> for their helpful
comments.</small></p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>But, as it turns out, not completely impossible. The "Improved Adaptive
Histogram" method described in
<a href="https://github.com/frankmcsherry/blog/blob/master/assets/Synth-SIGMOD.pdf">this paper</a>
does exactly that, even if the space of possible categories is infinite. I
don't know how it compares to the approach described in this post in terms of
data loss & truthfulness. It'd be interesting to figure out! <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>This intuition is technically incorrect, but it's a good first
approximation. Most people can understand the idea of "a small chance that
something goes wrong". The real interpretation is more complex, as explained
in the <a href="privacy-loss-random-variable.html">next article in the series</a>. <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>For more fun examples, check out this
<a href="https://github.com/frankmcsherry/blog/blob/master/posts/2017-02-08.md">blog post</a>.
Its author is one of the original creators of differential privacy. I
recommend checking his other posts! <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p>Note that this assumes independence between all the possible data leakage
events. This is wrong in general, but it's a good enough approximation in
practice. <a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Personal open access policy2018-12-28T00:00:00+01:002018-12-28T00:00:00+01:00Damien Desfontainestag:desfontain.es,2018-12-28:/privacy/personal-open-access-policy.html<p>How I contribute to a healthier model of scientific publishing.</p><p><strong>I</strong> do not provide free work for closed-access scientific venues.</p>
<p>By "free work", I mean peer review or organizational work, like participating in
program committees. By "closed-access", I mean "which retains exclusive rights
on papers, and in particular prevents authors from publishing them under a
license like <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC BY-NC-SA</a>".</p>
<p>Now, a few answers to frequently asked questions. </p>
<h1 id="why">Why?</h1>
<p>My time, like yours, is precious. If I work on something without being paid for
it, it's because I believe that it is somehow beneficial to the world.
Scientific venues that do not allow authors to publish their work for free, and
forbid others to reuse and republish it, do not meet this requirement.
Scientific knowledge is meant to be universally accessible, well-organized and
useful. Barriers, especially financial ones, impede scholarly work and
discriminate against underfunded researchers and institutions.</p>
<p>Some closed-access venues definitely do some valuable things, e.g., bringing
scientific minds together and providing organizational work. But scientific
papers are the results of work done entirely by researchers. There is absolutely
no reason why they should be owned by a third party. If closed-access venues
shut down all at once tomorrow, the scientific community would quickly adapt,
and the change would overall be extremely positive.</p>
<p>Scientific venues are gradually moving towards open access, but this is taking
place very slowly. I wish to accelerate the process, and I think it is past the
point where it makes any sense to encourage the old model.</p>
<h1 id="what-about-submitting-papers">What about submitting papers?</h1>
<p>I try not to publish in closed-access venues. However, I do not make this a
matter of absolute personal policy, as I do for reviewing, for two reasons.</p>
<p>First, I cannot take that decision entirely on my own. Scientific papers are
typically collaborations. My coauthors should have a say in where we publish.
This is particularly important since the stakes can be much higher for them than
for me. I do not want to pursue a purely academic career, so publishing in a
less prestigious venue does not cost me much. My coauthors might feel very
differently, and I respect that.</p>
<p>Second, publishing a paper in a prestigious venue is not exactly free work. I'm
personally getting a lot out of the deal: recognition, personal gratification,
visibility of my work… This is very much in contrast to peer reviewing or
organizational work, which researchers are doing because they feel a moral
obligation to do so.</p>
<h1 id="what-about-the-reciprocity-rule">What about the reciprocity rule?</h1>
<p>Researchers do free work because they know how important it is for the
scientific community. The peer review process is crucial to publishing good
science, and the organizing work behind conferences and journals is extremely
valuable to researchers. So, if you publish scientific papers that are reviewed
by your peers, you should also review papers.</p>
<p>I understand the importance of this rule, but I choose to do my share of peer
reviewing in open access venues. This way, I not only contribute to my field of
research, but I also encourage a better, more equitable and accessible
publishing system.</p>
<p>Many people interpret the reciprocity rule <a href="https://twitter.com/TedOnPrivacy/status/1078071360256122880">differently</a>, and think that
researchers should review where they publish. I respectfully disagree, but I
acknowledge that my position might seem ethically problematic if you feel this
way. I think that the ethical problems of closed-access scientific publishing
far outweigh these concerns.</p>
<h1 id="what-about-gray-areas-like">What about gray areas, like…</h1>
<h4 id="postprints">… postprints?</h4>
<p>The only difference between the <a href="https://en.wikipedia.org/wiki/Postprint">postprint</a> and the final published version is
typically minor editing work, if any. I consider this distinction negligible.
Thus, it's fine if the postprint can be published in open access, even if the
publisher's version cannot.</p>
<h4 id="preprints">… preprints?</h4>
<p>The <a href="https://en.wikipedia.org/wiki/Preprint">preprint</a> version of a paper is essentially a draft. Once a paper has been
improved after incorporating the comments of peer reviewers, all preprints
published online should be replaced by the improved version.</p>
<p>Allowing preprints but not postprints to be published in open access encourages
people to read the preprint, when a better version exists. In the worst case,
the draft version contains factual errors that were fixed during peer review.
Not replacing the preprint with the postprint is counterproductive, and completely
nonsensical. The people providing reviewing work did so for free, and the result
of their work should not be owned by a third party. If only preprints can be
published in open access, then I do not consider the venue as open access.</p>
<h4 id="papers-accessible-online-but-not-under-a-permissive-license">… papers accessible online, but not under a permissive license?</h4>
<p>Some publishers will tell you that they will make your paper freely accessible
online through some portal, but will still forbid you to publish it elsewhere
under a permissive license. These promises are worthless — if publishers lie or
impose ridiculous conditions before giving people access to your papers, what
are you going to do? Sue them? Anyway, the fundamental problem is that
publishers should not own rights over scientific papers. They should not be able
to hinder the dissemination of knowledge they did not create. I do not consider
venues of these publishers to be open access.</p>
<h4 id="author-fees-for-open-access">… author fees for open access?</h4>
<p>Some conferences allow authors to pay an <a href="https://en.wikipedia.org/wiki/Article_processing_charge">additional fee</a> for their work to
be published in open access. Some fees are reasonable, on the order of 100€ or
less, but when big publishers use this model, these fees are generally much
higher: <a href="https://www.acm.org/publications/openaccess">≈$1000 for ACM</a>, <a href="https://open.ieee.org/index.php/for-authors/article-processing-charges/">≈$2000 for IEEE</a>, and up to
an eye-popping <a href="https://www.elsevier.com/about/policies/pricing">≈$6000 for Elsevier</a>. Needless to say, such
high fees are absolutely ridiculous compared to the actual costs to the
publisher.</p>
<p>Small fees might be acceptable, especially if they are mandatory and are only
there to recoup publisher costs. Larger fees, however, create a system that is
unfair to researchers who cannot pay, and actively encourage them <em>not</em> to
make their research available in open access. They also take financial resources
away from valuable research projects and give them to publishers instead, who
provide comparatively negligible value. If this is the only way that a venue
allows authors to publish their own papers, I do not consider it open access.</p>
<h4 id="embargo-periods">… embargo periods?</h4>
<p>This refers to a model where authors have to wait <a href="https://en.wikipedia.org/wiki/Embargo_(academic_publishing)">a certain time</a> before
publishing their own papers on open repositories. Before this delay, the paper
is only available behind paywalls or subscriptions. Again, it is unfair to
students and researchers who are not well-funded, and is an artificial and
arbitrary barrier to the dissemination of scientific knowledge. To me, venues
which operate under this model are not open access.</p>
<h4 id="nonprofit-organizations">… nonprofit organizations?</h4>
<p>Some organizations earn money via copyrighted academic papers, and are doing
good scientific things with this money. I understand how it came to be, but I
still think that blocking access to scientific knowledge is not an acceptable
way to fund such programs. We, as a society, should absolutely fund scholarly
institutions and education programs. But we should do so via other means. </p>
<p>I agree that there are significant ethical differences between for-profit
scientific publishers and nonprofit scholarly institutions. But a closed-access
venue will not get my support simply because it is organized by a nonprofit.</p>
<h4 id="books-or-other-works-that-need-heavy-editing">… books, or other works that need heavy editing?</h4>
<p>Paid editing work is either nonexistent or useless for scientific conferences
and journals. But the situation is very different for books or monographs, most
of which require professional editing before publication. This work should be
compensated, and copyright is one way of achieving this. I do not know this
space well. By default, my personal policy does not apply there.</p>
<h4 id="other-types-of-free-scientific-work">… other types of free scientific work?</h4>
<p>Some other types of free work are not explicitly covered by this policy.</p>
<ul>
<li>As I already mentioned, publishing my own papers is not free work. Similarly,
giving a talk at a conference is not necessarily free work either. It depends
on the message, the audience, whether expenses are reimbursed…</li>
<li>Course materials, and other physical artifacts from education and mentoring
efforts, should in principle also be freely accessible. But I could imagine
that in some cases, benefits that come from these efforts might outweigh the
ethical issues that I have with closed-access publishing.</li>
<li>There are certain types of work that I simply do not know enough to have an
opinion about: reviewing grant proposals, participating in <a href="https://en.wikipedia.org/wiki/Standards_organization">technical
standards organizations</a>, etc.</li>
</ul>
<p>For all those examples, I decide what I do on a case-by-case basis rather than
applying a blanket policy.</p>
<h1 id="why-are-you-publishing-this">Why are you publishing this?</h1>
<p>First, I want to draw a line in the sand for myself, and taking a public stance
is an excellent way to do so. Second, I hope to start a discussion, and maybe
get other people to consider adopting a similar policy.</p>
<p>If everyone stopped reviewing for closed-access venues, the entire scientific
publishing model would change for the better overnight. Refusing to do free work
will likely not hurt your career significantly, especially if you do not take a
public position like I do.</p>
<h1 id="why-arent-you-going-further">Why aren't you going further?</h1>
<p>My standards are relatively low: allowing authors to retain rights to their own
papers should really be the absolute minimum. I think that journals and
conferences should do more. For example, they should automatically publish all
papers under a permissive license like <a href="https://creativecommons.org/licenses/by/4.0/">CC BY</a>, rather than simply
allowing authors to do so.</p>
<p>I do not feel as strongly about this. For now, I'm okay with providing free work
to venues which simply say "sure, feel free to put your papers under a
permissive license yourself". It's good enough, and in some cases, it's the best
compromise that program committees for major conferences could negotiate with
publishers. I might reconsider this in the future.</p>
<h1 id="why-are-you-so-judgmental">Why are you so judgmental?</h1>
<p>I hope I did not come across this way, but in case I did, I want to make the
following explicit. I absolutely understand that you might not make the same
choices as I do. I do not think badly of anyone choosing to do otherwise. Part
of why I can afford to have this stance is because it is not costing me much. It
is not my place to judge what other people should be doing, especially people
who would be more personally impacted by such choices.</p>
<p>I do hope that more people try to accelerate the switch to open access. Tenured
professors have more freedom to refuse to encourage closed-access conferences,
even if these conferences are the most prestigious. People in positions of power
within closed-access venues might lobby to change publishing policies. I think
that they should, but I'm not in their shoes. To them, other things might seem
more important or urgent to work on. It is about values, and it is not my place
to tell you what your values should be. </p>
<h1 id="is-this-type-of-personal-stance-really-the-best-way-towards-open-access">Is this type of personal stance really the best way towards open access?</h1>
<p>No. The best way is through funding agencies. They should mandate that all
papers published using their grants must be published in open access. Laws
should automatically give researchers the right to do so without fear of
retaliation from publishers, no matter what the copyright agreement says. Some
countries are already doing that. It is great, and should be generalized
globally. These top-down mandates are the best path towards a world where 100%
of scientific papers are published under an open access license. They have much
more impact than personal policies like mine.</p>
<p>Also, if I already had some power within publishing institutions, I might try to
accomplish change from the inside instead. This might be the most efficient path
for someone already there. I am not in that position, though, and probably will
not be in the near future. So, I am doing what I can, at my level.</p>
<h1 id="can-i-try-to-change-your-mind">Can I try to change your mind?</h1>
<p>Sure. My contact info is at the bottom of the page. Note that I have already
heard the following points, and taken them into consideration.</p>
<ul>
<li>"You're going to make some people angry and possibly hurt your career."</li>
<li>"For-profit publishers are useful, here's a long list of valuable things they
do."</li>
<li>"It is unethical to publish in a venue but refuse to peer review for it."</li>
<li>"Publishers generally don't bother you when you upload your papers on your
personal website, so there's no pragmatic reason to care about these things."</li>
</ul>
<p>If you want to try and convince me that I'm doing something wrong, I encourage
you to bring new ideas to the discussion.</p>
<hr>
<p><small>Note: none of these ideas is originally from me. I wrote this after being
inspired by friends and colleagues who also chose not to support closed-access
venues with free work. I'm not reproducing names for privacy reasons (not
everyone is comfortable with taking a public stance), but most of the credit
goes to them.</small></p>Differential privacy in practice (easy version)2018-11-22T00:00:00+01:002019-02-20T00:00:00+01:00Damien Desfontainestag:desfontain.es,2018-11-22:/privacy/differential-privacy-in-practice.html<p>How to add differentially private magic to your statistics, in the easy cases: counts, sums, averages, histograms…</p><p><small>
<span class='notlettrine'>T</span>his post is part of a <a href="friendly-intro-to-differential-privacy.html">series on differential
privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see the other
articles!</p>
<p></small></p>
<hr>
<p><span class='lettrine'>P</span><strong>reviously</strong>, we saw that differential privacy
was <a href="differential-privacy-awesomeness.html">pretty awesome</a>, and we looked at the <a href="differential-privacy-in-more-detail.html">formal definition</a>. Now, how do we
obtain it in practice? Let's start with the basics.</p>
<h1 id="counting-unique-users">Counting unique users <a name="counting-unique-users"></a></h1>
<p>Suppose you have a database, and you want to publish how many people in there
satisfy a given condition. Say, how many have green eyes? Even if you have many
people in your database, you can't just publish the true answer. Let's take a
moment to understand why. </p>
<p>With differential privacy, we assume that the attacker knows <em>almost all
elements</em>. They only have uncertainty about their target. Say they want to know
whether their target has green eyes. If you output the real number <span class="math">\(k\)</span>, they can
compare it with the number of people with green eyes among the people they know.
If it's <span class="math">\(k-1\)</span>, then the target has green eyes. If it's <span class="math">\(k\)</span>, then the target does
not.</p>
<p>So, what do we do? We compute the exact answer, and we add <em>noise</em>. This noise
will come from a probability distribution called the <a href="https://en.wikipedia.org/wiki/Laplace_distribution"><em>Laplace
distribution</em></a>. This distribution has a parameter, its <em>scale</em>, which
determines how "flat" it is. It looks like this:</p>
<p><center>
<img alt="Graph showing a Laplace distribution with scale 1/ln(3), centered on 0" src="https://desfontain.es/privacy/images/laplace-ln-3.svg">
</center></p>
<p>So, to get <span class="math">\(\varepsilon\)</span>-differential privacy, we pick a random value according
to <span class="math">\(\text{Laplace}(1/\varepsilon)\)</span>, and we add this noise to the real value. Why
does it work? Let's look at the distribution of the number we return, depending
on whether the true count is <span class="math">\(k=1000\)</span> (blue line, the target doesn't have green
eyes) or <span class="math">\(k=1001\)</span> (yellow line, the target has green eyes).</p>
<p><center>
<img alt="Graph showing two Laplace distributions with scale 1/ln(3), centered on 1000 and 1001" src="https://desfontain.es/privacy/images/two-laplace-ln-3.svg">
</center> </p>
<p>Let's say the real number is <span class="math">\(k=1001\)</span>, and after adding noise, we published
<span class="math">\(1003\)</span>. Let's put ourselves in the attacker's shoes. What's the likelihood that
the original number was <span class="math">\(1001\)</span> vs. <span class="math">\(1000\)</span>? The hypothesis "<span class="math">\(k=1001\)</span>" is a bit
more likely: generating a noise of <span class="math">\(2\)</span> is more likely than a noise of <span class="math">\(3\)</span>. How
much more likely? It turns out that the <em>ratio</em> between these likelihoods is…
<span class="math">\(e^\varepsilon\)</span>! So the <a href="differential-privacy-in-more-detail.html#definition">ratio of probabilities</a> of differential privacy
is satisfied.</p>
<p>This works no matter what the output is: the ratio will always be between
<span class="math">\(e^\varepsilon\)</span> and <span class="math">\(e^{-\varepsilon}\)</span>. If you want to double-check, you can
either verify it on the graph, or do the <a href="https://en.wikipedia.org/wiki/Laplace_distribution">math</a>.</p>
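<p>As a concrete sketch (the function name is mine, not from a real library), the
Laplace mechanism for a unique-user count fits in a few lines of Python:</p>
```python
import numpy as np

def private_count(true_count, epsilon, rng=None):
    """Release a count of unique users with epsilon-differential privacy.

    Each person changes the true count by at most 1, so Laplace noise of
    scale 1/epsilon is enough to hide any single person's presence.
    """
    if rng is None:
        rng = np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Neighboring databases (1000 vs. 1001 green-eyed people) produce
# overlapping output distributions, as in the graphs above.
print(private_count(1001, epsilon=np.log(3)))
```
<p>Note that a production implementation would use a vetted library rather than raw
floating-point sampling, which has known subtle issues; this sketch only illustrates
the mechanism.</p>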
<h1 id="counting-things">Counting things <a name="counting-things"></a></h1>
<p>OK, so counting unique users was pretty easy. Counting things must also be
straightforward, right? Let's say you have a database of suggestions that people
sent to your company using a feedback form. You want to publish the number of
suggestions you received on a given day. Meanwhile, the attacker wants to get an
idea of how many complaints their target published.</p>
<p>What's different from the previous scenario? Can't we just add noise picked
from <span class="math">\(\text{Laplace}(1/\varepsilon)\)</span> and get <span class="math">\(\varepsilon\)</span>-differential privacy?
There's a catch: what if someone sent more than one complaint during one day?
Let's say someone was super unhappy and sent <em>five</em> complaints. The other 1000
customers sent one complaint each. The influence of this one disgruntled
customer will be larger than before. The two distributions now look like this:</p>
<p><center>
<img alt="Graph showing two Laplace distribution with scale 1/ln(3), centered on 1000 and 1005" src="https://desfontain.es/privacy/images/two-laplace-ln-3-interval-5.svg">
</center></p>
<p>The difference between the curves is much larger than before. Their ratio is at
most <span class="math">\(e^{5\varepsilon}\)</span>, so using a parameter of <span class="math">\(1/\varepsilon\)</span> only gives
<span class="math">\(5\varepsilon\)</span>-differential privacy. To fix this, we need to add more noise. How
much more? It depends on the <em>maximum contribution</em> of one individual user. If
the maximum amount of complaints in one day is 5, you must add 5 times the
amount of noise. In this example, using <span class="math">\(\text{Laplace}(5/\varepsilon)\)</span> would
give you <span class="math">\(\varepsilon\)</span>-differential privacy.</p>
<p><a name="laplace"></a>
<center>
<img alt="Graph showing two Laplace distribution with scale 5/ln(3), centered on 1000 and 1005" src="https://desfontain.es/privacy/images/two-laplace-5-ln-3.svg">
</center></p>
<p>Note that you can't fully automate this process: you need to know what the
largest contribution can be. A human, with some knowledge of the process, must
make a judgment call. In our case, this could be "users won't post more than
five complaints per day".</p>
<p>What happens if that judgment call is wrong, and a user later decides to post 10
complaints in one day? To preserve the desired level of privacy, you need to
<em>clamp</em> all values to the estimated maximum. In other words, for this outlier
user, you would only count 5 complaints in the non-noisy sum. </p>
<p>This process can introduce unexpected bias in the data. So, be careful when
estimating the largest contribution! If clamping only happens very rarely, you
should be fine.</p>
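<p>The clamping step described above can be sketched as follows (a toy illustration;
the function name and the flat list of per-user counts are my own simplification):</p>
```python
import numpy as np

def private_complaint_count(per_user_counts, max_contribution, epsilon, rng=None):
    """Count complaints with epsilon-DP when one user may send several.

    Each user's contribution is clamped to max_contribution, so one person
    can shift the total by at most that much; the noise scale matches.
    """
    if rng is None:
        rng = np.random.default_rng()
    clamped_total = sum(min(c, max_contribution) for c in per_user_counts)
    return clamped_total + rng.laplace(scale=max_contribution / epsilon)

# 1000 users sent one complaint each; one outlier sent 10, counted as 5.
counts = [1] * 1000 + [10]
print(private_complaint_count(counts, max_contribution=5, epsilon=1.0))
```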
<h1 id="summing-or-averaging-numbers">Summing or averaging numbers</h1>
<p>Let's say each of your users gives your customer service a rating, between -10
and 10. You want to release the average rating. Computing an average is pretty
much the same as computing a sum — add all ratings, then divide by the number of
users<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>. So, what do we do to the sum to achieve differential privacy?</p>
<p>Among all rating options, we only have to consider the worst possible case. How
far can the two noise curves be from each other? If the values are all between
-10 and 10, the greatest possible difference is <span class="math">\(10-(-10)=20\)</span>. It happens when
the attacker tries to determine whether a user voted -10 or 10.</p>
<p>Like in the previous example, you have to add noise of <span class="math">\(Laplace(20/\varepsilon)\)</span>
to get <span class="math">\(\varepsilon\)</span>-differential privacy. And just as before, you need to check
that each value is between your theoretical minimum and maximum. If you find an
anomalous value, e.g. lower than the minimum, you need to <em>clamp</em> it to the
minimum before adding it to the sum.</p>
<p>In some cases, estimating these minimum or maximum values is difficult. For
example, if you're computing the average salary in a large group of people, how
should you estimate the upper salary limit? I don't see that problem as a
usability flaw of differential privacy. Rather, it suggests that averages are
not a meaningful metric in the presence of outliers. Removing these outliers is
a good idea for both accuracy and privacy :-)</p>
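<p>Under the assumption (per footnote 1) that the number of users is public, a
bounded-average release might look like this sketch:</p>
```python
import numpy as np

def private_average(ratings, low, high, epsilon, rng=None):
    """Release an average rating with epsilon-DP.

    Each rating is clamped to [low, high]; one person can then move the sum
    by at most (high - low), which sets the Laplace scale. The number of
    users is treated as public here.
    """
    if rng is None:
        rng = np.random.default_rng()
    clamped = [min(max(r, low), high) for r in ratings]
    noisy_sum = sum(clamped) + rng.laplace(scale=(high - low) / epsilon)
    return noisy_sum / len(ratings)

# An out-of-range value of 42 is clamped to 10 before summing.
print(private_average([3, -7, 10, 42, 5], low=-10, high=10, epsilon=1.0))
```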
<h1 id="releasing-many-things-at-once">Releasing many things at once <a name="many-things"></a></h1>
<p>OK. What if you don't want to release only one statistic, but many of them? Can
you add noise to each of them and be fine? Well… it depends. The main question
you have to ask yourself is: <em>what is the maximum influence of one individual</em>?
There are two distinct possibilities.</p>
<h2 id="statistics-are-about-different-people">Statistics are about different people</h2>
<p>Suppose you want to release the number of users you have depending on their age
ranges: 20-29, 30-39, 40-49, etc. <a name="histogram"></a></p>
<p>Each user will have an influence on at most <em>one</em> of these categories:
someone is either in a given age range, or in another one. This situation
often appears when you're trying to release a <em>histogram</em>:</p>
<p><center>
<img alt="Histogram of fake data showing a number of people by age range" src="https://desfontain.es/privacy/images/histogram.svg">
</center></p>
<p>When you're in this case, you can safely add Laplace noise of scale
<span class="math">\(1/\varepsilon\)</span> to each bucket count. There is no problematic interaction
between buckets. Releasing the entire histogram is still
<span class="math">\(\varepsilon\)</span>-differentially private.</p>
<p><center>
<img alt="Noised version of the previous histogram, some bars have changed slightly" src="https://desfontain.es/privacy/images/noised-histogram.svg">
</center></p>
<p>Pretty easy, right? Note that this histogram looks a bit weird: the counts are
not integers, and one count ended up being negative! We can make it a bit less
suspicious, by rounding all counts and replacing all negative numbers by 0.</p>
<p><center>
<img alt=""Cleaned" version of the previous histogram, all values are now positive integers" src="https://desfontain.es/privacy/images/cleaned-noised-histogram.svg">
</center></p>
<p><a name="post-processing"></a>This type of <em>post-processing</em> is allowed, thanks
to a nifty property of differential privacy. If you take differentially private
data, and make it go through a fixed transformation, you still get differential
privacy<sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup>. Convenient!</p>
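<p>A sketch of the histogram release, including the post-processing step (rounding
and clipping negatives), under the assumption that the buckets are fixed in advance:</p>
```python
import numpy as np

def private_histogram(bucket_counts, epsilon, rng=None):
    """Release a histogram with epsilon-DP when each person is in one bucket.

    One person affects at most one bucket, by at most 1, so Laplace noise of
    scale 1/epsilon per bucket covers the entire histogram.
    """
    if rng is None:
        rng = np.random.default_rng()
    noisy = [c + rng.laplace(scale=1.0 / epsilon) for c in bucket_counts]
    # Post-processing is "free": rounding and clipping negative counts to 0
    # does not weaken the differential privacy guarantee.
    return [max(0, round(x)) for x in noisy]

print(private_histogram([120, 250, 180, 75, 12], epsilon=np.log(3)))
```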
<h2 id="one-person-is-in-multiple-statistics">One person is in multiple statistics <a name="multiple-statistics"></a></h2>
<p>OK, what if you're releasing multiple statistics, but this time, they might all
be about the same user? Let's say that you want to publish how many of your
users…</p>
<ul>
<li>are younger than 35;</li>
<li>are using an iOS device;</li>
<li>are colorblind;</li>
<li>have started using your app less than a month ago.</li>
</ul>
<p>The same user could be in all those categories! In this scenario, you can't add
Laplace noise of scale <span class="math">\(1/\varepsilon\)</span> to each count and get
<span class="math">\(\varepsilon\)</span>-differential privacy. Instead, you have to consider each count as
a separate data release. Thus, if you have <span class="math">\(C\)</span> different counts, you have to add
Laplace noise of scale <span class="math">\(C/\varepsilon\)</span> to each of them. Each independent release
will be <span class="math">\(\varepsilon/C\)</span>-differentially private. And we can now use the
<a href="differential-privacy-awesomeness.html#composition">composition</a> property of differential privacy! This allows us to conclude that
the entire release is <span class="math">\(\varepsilon\)</span>-differentially private.</p>
<p>This works for any kind of statistics, not just unique counts. Want to release
several pieces of information? Count the maximum influence of one single person,
and "split" your <span class="math">\(\varepsilon\)</span> between each data release. This <span class="math">\(\varepsilon\)</span> is
called your <em>privacy budget</em>: you choose <span class="math">\(\varepsilon_1\)</span>, …, <span class="math">\(\varepsilon_C\)</span>
whose sum is <span class="math">\(\varepsilon\)</span>, and you release the <span class="math">\(i\)</span>-th statistic with
<span class="math">\(\varepsilon_i\)</span>-differential privacy. This solution is more flexible than simply
using <span class="math">\(\varepsilon/C\)</span>-differential privacy on each statistic. If one statistic
is more important, or more sensitive to noise, you can allocate more budget to
it. The larger the budget portion, the lower the noise. Of course, you will have
to add more noise to the other values.</p>
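<p>The budget-splitting strategy can be sketched like this (a toy helper, not a
standard API):</p>
```python
import numpy as np

def private_release(counts, budgets, rng=None):
    """Release several counts that may all involve the same person.

    Each count i gets its own budget epsilon_i; by composition, the whole
    release is (sum of budgets)-differentially private.
    """
    if rng is None:
        rng = np.random.default_rng()
    assert len(counts) == len(budgets)
    return [c + rng.laplace(scale=1.0 / eps) for c, eps in zip(counts, budgets)]

# Total budget 1.0, split unevenly: the first, more important count gets
# more budget, and therefore less noise than the others.
print(private_release([1200, 340, 25, 410], budgets=[0.4, 0.2, 0.2, 0.2]))
```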
<h1 id="traps-to-avoid">Traps to avoid</h1>
<p>The mechanisms we saw today are pretty straightforward. But anonymization is
full of traps: there are still a number of things that can go wrong.</p>
<ul>
<li>When summing or averaging numbers, <em>clamping</em> them to the minimum and maximum
is essential. Otherwise, all guarantees fly out the window.</li>
<li>Pay attention to how you implement this clamping in practice. Special values like
<code>NaN</code> can lead to surprising behavior.</li>
<li>Of course, you're not allowed to cheat. You have to choose your privacy
strategy in advance, then apply it, and release the noisy data. You can't
regenerate the noise if the result isn't accurate enough: retrying skews the
randomness, and you lose the privacy guarantees.</li>
<li>When releasing histograms, it's important to choose each category <em>in
advance</em>. If you have to look at your data to know what your categories are,
you have to use more subtle methods. </li>
</ul>
<p>This last point is often a problem in practice. For example, say that you want
to count how many times each word was used in customer complaints. You can't
make a definite word list: people could use words that you didn't predict, or
make typos. In that case, the technique I described for histograms doesn't work. </p>
<p>Why doesn't it work? How to fix this? All this, and more, in the <a href="almost-differential-privacy.html">next article</a>!
Or you can also head over to the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> of this blog post
series to pick what to read next.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>If you want to be extra pedantic, you might also want to add noise to your
total number of users. That depends on the flavor of definition that you
choose. I'm not going to that level of detail here, and you probably shouldn't
either. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>If you're wondering why this is true, here's a super short proof. If
there was a post-processing function that would break the differential privacy
property… The attacker could run it too, and distinguish between two outcomes.
But it's impossible, because differential privacy forbids it :-) <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>« So, how does your part-time PhD arrangement actually work? »2018-09-01T00:00:00+02:002018-09-05T00:00:00+02:00Damien Desfontainestag:desfontain.es,2018-09-01:/privacy/part-time-phd.html<p>I got this question many times. So I thought I'd answer it, along with other frequent questions about this arrangement.</p><p><strong>Since</strong> December 2016, I have been working as an engineer for Google 50% of
my time, and doing a PhD at ETH Zürich on the remaining 50%. Recently, some people
have asked me a bunch of questions about this arrangement. So here's a blog post
answering them, in no particular order.</p>
<p>Note: everything in this post is only from my own experience, and represents in
no way the opinion of my employers.</p>
<h1 id="so-google-does-part-time-phds-hm">So Google does part-time PhDs, hm?</h1>
<p>Not really.</p>
<p>When I first thought about this project, I tried to look for Googlers who did
this before me. I couldn't find anyone. Some folks studied (for Bachelor's,
Master's degrees, etc.) while working at Google. Some got hired before
completing their PhD, and finished it while working full-time or 80%. But I
couldn't find anybody who did a PhD from start to finish while working there.</p>
<p>So I sort of made up my own project. After a lot of work, I got it approved, and
now I'm in the middle of it. I can't detail the approval process publicly, but
this was definitely neither standard nor easy. One important takeaway: I
couldn't have done this as a new hire. Being a productive, full-time engineer
for a couple of years gave me the credibility I needed.</p>
<p>In short: it is possible to negotiate a part-time PhD arrangement when you work
at Google. But it depends on a lot of factors, many of which you have little to
no influence over. You should not base your career plans on the assumption that
you'll be able to do this at Google.</p>
<h1 id="why-would-you-do-this">Why would you do this?</h1>
<p>Because I was grumpy and unhappy with the other options :D</p>
<p>Many academics are doing research that has no real-world impact. Some of them
don't care, and that's fine. But even for the others, it's difficult to figure
out what problems are worth looking at. When you start doing research, you don't
know what's impactful and what's useless. So, you pick a novel and interesting
topic, and you hope it'll end up being useful. Then, before you realize it, you
get corrupted by the many things that are <a href="https://a3nm.net/work/research/wrong/">wrong with
academia</a>. A few more years, and you
completely forget that usefulness was ever a goal<sup id="fnref:0"><a class="footnote-ref" href="#fn:0">1</a></sup>.</p>
<p>At the same time, I thought that the tech industry is in general not a great
place to do research. Your job is to fix problems that your company has.
Figuring out whether you're the first to solve them, generalizing the solution,
publishing it for more people to use, educating yourself on vaguely related
work… That's not really what they pay you for. If you're an experienced
researcher working for a tech company, you might get this stuff done. You know
the value of such work, and you're able to do it and sell it to higher-ups. When
you don't know how to do research yet, it's not the same story.</p>
<p>My solution was to do both things at once. First, use my industry job to figure
out which problems are worth solving. Then, solve them thoroughly, not only in
the context of one team inside one company. Also, share the results with the
world. And let's be honest: in the process, I also hoped to become an expert in
the field =) </p>
<p>In short, I wanted to get all the good parts of a PhD, while avoiding the
standard traps of academic research. I didn't want to stop having a constant
stream of real-world problems to solve<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">2</a></sup>.</p>
<p><small>plus, what could possibly go wrong, right</small></p>
<h1 id="how-do-you-organize-your-time">How do you organize your time?</h1>
<p>I usually work one week at Google, and the next week at the university.</p>
<p>During the first few months, I was pretty strict with my schedule. It worked
okay. Nowadays, I often have research meetings during my Google time, or
vice-versa. It makes things easier for everyone, and doesn't hurt my
productivity too much, especially for early or late meetings. In addition, I
sometimes switch entire weeks, for example when I travel abroad to do Google
work for several weeks in a row.</p>
<h1 id="what-about-context-switching-costs">What about context switching costs?</h1>
<p>They're not a huge problem for me. Context switches affect different people
differently. Like most people, I can't get any serious work done in a half-hour
break between two meetings. But it's not that painful for me to work on a
different thing than the day/week before.</p>
<p>In my humble opinion, the ability to work on several things in parallel is a
core engineering skill. I suspect it's a crucial skill for many other jobs. You
should aim to be good at it no matter what you do. Having two jobs makes it
more obvious, but not fundamentally different.</p>
<p>[To be clear: I'm encouraging you to try and get better at this if you can. But
like any other skill, this is workable for some people and not so much for
others. Neurodiverse folks might have a harder time. If you're in a position of
leadership, build teams that accommodate specific needs.]</p>
<h1 id="what-are-the-challenges">What are the challenges?</h1>
<p>Some things don't get divided by two when you start working part-time.</p>
<p>The obvious one is overhead. The number of Google emails didn't get halved when
I started working half-time. Expense reports, performance reviews,
administrative nonsense… None of these got halved either, so mandatory and
pointless tasks now eat twice as large a share of my productive time.</p>
<p>Second, anxiety and impostor syndrome. I was prone to these when I started
working at Google, but I sort of learned to live with them over time. They came
back full-force when I started my PhD, and not only because of the unique
challenges of research work<sup id="fnref:3"><a class="footnote-ref" href="#fn:3">3</a></sup>. I instinctively compared myself to peers a
lot — I always did. But of course, working half-time, I couldn't get the same
amount of work done, so I felt shitty about it.</p>
<p>Finally, I got (at least) twice as many interesting work-related opportunities.
And since I had a tendency to work too much and accept too many responsibilities
and tasks… This definitely got worse.</p>
<h1 id="how-do-you-deal-with-those">How do you deal with those?</h1>
<p>For overhead, aggressive email filtering and management is part of the solution.
I also learned to notice when my brain can't handle intellectual, creative, or
stressful work anymore. When I catch myself procrastinating, I don't force
myself to go back to the thing I was supposed to do. Instead, I do something
that doesn't require as much brain, but is still productive. Answering emails,
solving administrative stuff, writing blog posts… This way, I don't waste too
much time, and the overhead doesn't cannibalize my real work.</p>
<p>To avoid overworking, I have some practical rules. I don't work on evenings,
weekends or vacation days. I use a different laptop for my Google work (where I
get most interruptions). I lock myself out of my work email outside office
hours. I force myself to do sports. Those "practical" tricks help. But of
course, the psychological aspect is the difficult part…</p>
<p>I try to remind myself that I'm not responsible for the stuff that I'm not in
charge of. Most importantly, I started saying no to things. This is <strong>so
difficult</strong> for me. It's not that my colleagues (or manager, advisor…) demand
too much of me — they're awesome and respect boundaries. But there are always
things to be done, ideas to research, stuff to try out, people to help and
collaborate with. Be smarter than me: learn how to set limits for yourself, and
say no to people, before starting a project like this.</p>
<p>On anxiety &amp; impostor syndrome, here are some things I found helpful.</p>
<ul>
<li>Getting feedback from peers as often as possible. That helped me a lot. I tend
to have unrealistic expectations for myself. Having an external perspective
prevents me from getting stuck into weird self-critical loops.</li>
<li>When considering my own recent work, I try to look back a couple of months
rather than a couple of days/weeks. Being unproductive for a few days feels
much worse when you're only doing this 50% of your time. Even though you know
it's normal and happens to everyone, <em>especially</em> in research! Averaging over
larger time periods always feels better.</li>
<li>I started to talk to a mental health professional, and it was one of the best
decisions I took recently. I very much recommend doing so, if that's an option
for you. Even if you feel okay, it's never a bad idea to monitor how well your
brain is doing (and it's difficult to do on your own).</li>
</ul>
<p></p>
<h1 id="that-sounds-bad-is-it-that-bad">That sounds bad. Is it that bad?</h1>
<p>It's challenging, so I would not go around recommending random people do the
same. But no, it's not that bad =) I definitely don't regret starting this. Here
are some of my favorite things about this arrangement.</p>
<ul>
<li>I'm doing cool research that I know is impactful. When you're in academia, the
main indicators you have for that are publications and grants. My research ends
up changing policies and practices of a massive company. I'm improving the
level of data protection for billions of users! It definitely makes it easier
not to care too much about paper rejections ^^</li>
<li>I'm working my way towards becoming an expert in my field. Don't get me wrong:
I'm definitely not there yet. But I'm learning every day. I'm getting a deep
knowledge of some things, and a reasonably solid overview of many other
things.</li>
<li>I became the go-to person in my Google team to answer research-related
questions. What's the state of the art for a given problem in my field, what
are known impossibility results, how to publish papers and engage with the
community… I can provide a unique perspective, which is rewarding and useful.</li>
<li>In my research lab, I also sort of became the go-to person to answer
industry-related questions. I help colleagues understand the applicability of
their work, and I can suggest potentially impactful research directions.</li>
<li>Talking about the same problems in both places leads to very different
discussions. I get ideas from Google, bounce them off university colleagues,
or vice-versa. The diversity of viewpoints I get this way is always fruitful
and fascinating. </li>
</ul>
<p>On a day-to-day basis, the main benefit is getting progress from both sides at
once. When I use my research at Google, my productivity basically doubles. It's
the same way when I can use my Google work to write papers or advance my
research. And it happens quite often! More than often enough to compensate for
the overhead problem I mentioned earlier.</p>
<h1 id="whats-the-best-thing-about-the-arrangement">What's the best thing about the arrangement?</h1>
<p>I was anticipating all the good bits described above. It was very nice to see
that I wasn't too optimistic, but this wasn't a big surprise. Those were why I
wanted to do a PhD in the first place =)</p>
<p>What I didn't expect, though, was the <em>perspective</em> it gave me. Let me explain.
The mindset in the tech industry is <strong>so</strong> different from academia. I know, it's
obvious, but… this rift was much bigger than what I imagined. The two worlds
have completely different values, social structures, incentives… And that
translates to incredibly distinct cultures and practices.</p>
<p>When you get into one of these worlds, you tend to adopt its implicit culture
and assumptions. The assimilation is real, even if your team is diverse and
inclusive. Want to get an intuitive idea of how strong this phenomenon is? Go
talk to people who worked at the same big tech company for a while, or to
tenured professors.</p>
<p>Being in two places at once forces you to question the assumptions and cultural
baggage. I think it's pretty unique. I don't approach problems the same way as
when I was working full-time (at Google, or in research internships). The way I
judge my own work also changed a lot. I don't know how to describe it well, but
I feel much more "free" as a result.</p>
<h1 id="so-i-should-do-the-same-thing-you-did-right">So I should do the same thing you did, right?</h1>
<p>Before I answer that question with an enthusiastic "Yes! :D", I want to mention
a couple more things. You should definitely be aware of them before making
this kind of decision.</p>
<h4 id="alternative-options">Alternative options</h4>
<p>Why are you interested in doing a part-time PhD? Think long and hard about this
question. In particular, consider all <em>other</em> options that achieve the same
goal.</p>
<p>Want to become an expert in a given field? There are plenty of ways of doing
that. Read books or scientific papers<sup id="fnref:4"><a class="footnote-ref" href="#fn:4">4</a></sup>. Attend courses, in-person or online.
Get a personal project to practice what you're learning. This will likely get
you to the level you want to reach, faster and with fewer headaches.</p>
<p>Want the qualification because it's a shiny thing to put on your resume? In
computer science? Honest career advice: don't bother. Three years working in the
tech industry will look better on your resume. (And you'll also make more
money.) Relevant experience with personal or open-source projects can also have
a big impact.</p>
<p>Want a career in academia? Or maybe you're not sure, and you want to figure it
out? Doing a PhD full-time is the obvious option. You should also do that if you
need a PhD to get to your career goals in your field. It's less risky than my
part-time arrangement, and easier to organize<sup id="fnref:5"><a class="footnote-ref" href="#fn:5">5</a></sup>.</p>
<p>Want to learn how to do research? If you're working in a big tech company, there
might be more research-y teams to switch to. Maybe you can get mentorship from a
colleague who used to work in academia? Or collaborate with an external
professor on specific projects? If you're not sure that you want to spend a
large part of your time doing research, but want to try it out, this can be a
solid option.</p>
<h4 id="public-awareness-message">Public awareness message</h4>
<p>Finally, I want to insist on one last thing: my experience was very much
influenced by a lot of luck, privilege, and other external factors. If you
attempt the same thing, you might run into hurdles that I didn't have. To
illustrate that point, I made a list of things that I feel lucky about.</p>
<ul>
<li>I'm a cis white dude, so when I was advocating for my weird new kind of PhD
arrangement, people took me seriously by default. Ha! Grave mistake.
Nonetheless and in all seriousness, I'm sure it helped me get what I wanted.</li>
<li>I had a previous experience of research (via internships during my Master's
degree). Because of that, I already knew that I liked this type of work, which
made the project less risky. It also gave me credibility when I was trying to
find an advisor.</li>
<li>I don't have kids, strong family obligations, or serious health issues. The
time management aspect of things would surely be much harder otherwise.</li>
<li>At Google, my manager was awesome and supportive of my project from day one. I
couldn't have gone through the approval process without her. Many other
colleagues were also kind enough to spend time and resources to help me at the
time.</li>
<li>My current Google manager and my PhD advisor both give me a ton of freedom. To
a large degree, I can choose what to work on, which allows me to make sure
both my jobs stay aligned. I've met people who weren't so lucky, and who
had awful part-time PhD experiences as a result.</li>
<li>Both of my jobs are a 15-minute bike ride away from where I live. Things would
be more difficult if I had a terrible commute.</li>
</ul>
<p>This list is most likely incomplete.</p>
<h1 id="i-have-other-questions">I have other questions!</h1>
<p>Fire away! Disclaimers: I can't answer much about Google processes if you're not
working there. And I don't guarantee a quick response time. But don't hesitate
to ask! My contact info is below =)</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:0">
<p>This does not accurately represent my current opinion of academia. At the
time though, it was a real fear of mine. Nowadays, I have a more moderate view
on the matter. Some academics manage to produce consistently impactful work. <a class="footnote-backref" href="#fnref:0" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:1">
<p>We have a <em>lot</em> of those at Google, and in particular in the privacy team. We're recruiting, by the way. Drop me a line if you're interested or would like more information =) <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>I might write about these later, when I have a bit more experience, and a
clearer picture of what they are exactly. <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p>Scientific papers are awfully opaque when you start reading them, but it
quickly gets much easier. You'd be surprised! Also, if you don't have access
to scientific papers because of paywalls, you should check out
<a href="https://en.wikipedia.org/wiki/Sci-Hub">Sci-Hub</a>. <a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:5">
<p>Also, don't fall for the sunk cost fallacy if it turns out not to be your
thing. Quitting a PhD is okay. If you don't like it, go do things that make
you happy instead. You won't be the first nor the last, and the experience
will still be valuable. <a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
</ol>
</div>Differential privacy in (a bit) more detail2018-08-16T00:00:00+02:002019-02-20T00:00:00+01:00Damien Desfontainestag:desfontain.es,2018-08-16:/privacy/differential-privacy-in-more-detail.html<p>Why does differential privacy work so well? Let's look at it more closely.</p><p><small>
<span class='notlettrine'>T</span>his post is part of a <a href="friendly-intro-to-differential-privacy.html">series on differential
privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see the other
articles!</p>
<p></small></p>
<hr>
<p><span class='lettrine'>A</span><strong>s</strong> I mentioned in the <a href="differential-privacy-awesomeness.html">previous article</a>,
differential privacy is pretty awesome. If I did a good job, you're now
wondering what the <em>real</em> definition looks like. So in this post, I will go into
a bit more detail about what differential privacy actually means, and why it
works so well. There will be some math! But I promise I will explain all the
concepts I use, and give lots of intuition.</p>
<h1 id="the-definition">The definition</h1>
<p>We saw that a process satisfies differential privacy if its output stays
basically the same when you change the data of one individual. And by "basically the same",
we meant "the probabilities are close".</p>
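<p>To make "the probabilities are close" concrete, here is a small Python sketch (not part of the article; the function and variable names are purely illustrative). It checks that, for a counting query with Laplace noise of scale 1/ε, the probability density of any given output changes by at most a factor of e<sup>ε</sup> when one person's data is added or removed:</p>

```python
import math

# Illustrative sketch (not from the article): the Laplace mechanism adds
# noise drawn from a Laplace distribution of scale 1/epsilon to a count.
# Two databases differing in one person have counts differing by at most 1,
# so the density of any given output differs by at most a factor of
# exp(epsilon) -- this is what "the probabilities are close" means here.

def laplace_density(x, mu, scale):
    """Density of the Laplace(mu, scale) distribution at x."""
    return math.exp(-abs(x - mu) / scale) / (2 * scale)

epsilon = 1.0
scale = 1.0 / epsilon        # a counting query has sensitivity 1

count_with_me = 50           # hypothetical count including one person
count_without_me = 49        # the same count with that person removed

for output in [47.0, 49.5, 50.0, 53.0]:
    p_with = laplace_density(output, count_with_me, scale)
    p_without = laplace_density(output, count_without_me, scale)
    ratio = max(p_with / p_without, p_without / p_with)
    assert ratio <= math.exp(epsilon) + 1e-9
```

<p>This bounded ratio between the two output distributions is exactly the guarantee that the formal definition makes precise.</p>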
<p><center>
<svg
role="img"
aria-label="Diagram showing two fake &quot;Guess Who&quot; boards, the second one having one of the people missing. Each board, representing a database, has an arrow going to a silly &quot;magic&quot; gif, and this has an arrow going to a cloud labeled &quot;output&quot;. A double arrow labeled &quot;basically the same&quot; points to the two outputs."
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
version="1.1"
viewBox="0 0 734.41882 358.09845"
stroke-miterlimit="10"
id="svg4123"
sodipodi:docname="full-diagram.svg"
inkscape:version="0.92.2pre0 (973e216, 2017-07-25)"
width="734.41882"
height="358.09845"
style="fill:none;stroke:none;stroke-linecap:square;stroke-miterlimit:10">
<title>Same diagram as before, duplicated, with the bottom line missing one person in the database. A double arrow labeled "basically the same" points to the two outputs.</title>
<desc>Same diagram as before, duplicated, with the bottom line missing one person in the database. A double arrow labeled "basically the same" points to the two outputs.</desc>
<metadata
id="metadata4129">
<rdf:RDF>
<cc:Work
rdf:about="">
<dc:format>image/svg+xml</dc:format>
<dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
<dc:title></dc:title>
</cc:Work>
</rdf:RDF>
</metadata>
<defs
id="defs4127">
<clipPath
id="p.3-4">
<path
d="M 0,0 H 275 V 252 H 0 Z"
id="path4077-0"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
</defs>
<sodipodi:namedview
pagecolor="#ffffff"
bordercolor="#666666"
borderopacity="1"
objecttolerance="10"
gridtolerance="10"
guidetolerance="10"
inkscape:pageopacity="0"
inkscape:pageshadow="2"
inkscape:window-width="1918"
inkscape:window-height="1078"
id="namedview4125"
showgrid="false"
inkscape:zoom="1.3111111"
inkscape:cx="532.33439"
inkscape:cy="127.30279"
inkscape:window-x="0"
inkscape:window-y="0"
inkscape:window-maximized="0"
inkscape:current-layer="g4121"
fit-margin-top="0"
fit-margin-left="0"
fit-margin-right="0"
fit-margin-bottom="0"
viewbox-x="82"
viewbox-width="650" />
<clipPath
id="p.0">
<path
d="M 0,0 H 960 V 720 H 0 Z"
id="path4036"
inkscape:connector-curvature="0"
style="clip-rule:nonzero" />
</clipPath>
<g
clip-path="url(#p.0)"
id="g4121"
transform="translate(-76.086586,-124.34907)">
<path
d="M 0,0 H 960 V 720 H 0 Z"
id="path4039"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="m 76.08921,124.34908 h 223.2992 V 267.11813 H 76.08921 Z"
id="path4041"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<g
transform="matrix(0.22307612,0,0,0.22307664,76.089214,124.34908)"
id="g4048">
<clipPath
id="p.1">
<path
d="M 0,0 H 1001 V 640 H 0 Z"
id="path4043"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
<image
clip-path="url(#p.1)"
width="1001"
height="640"
x="0"
y="0"
preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/quiestce.jpeg"
id="image4046"
style="fill:#000000" />
</g>
<path
d="M 76.086586,339.67847 H 299.38579 V 482.44751 H 76.086586 Z"
id="path4050"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<g
transform="matrix(0.22307612,0,0,0.22307664,76.086589,339.67847)"
id="g4057">
<clipPath
id="p.2">
<path
d="M 0,0 H 1001 V 640 H 0 Z"
id="path4052"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
<image
clip-path="url(#p.2)"
width="1001"
height="640"
x="0"
y="0"
preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/quiestce.jpeg"
id="image4055"
style="fill:#000000" />
</g>
<path
d="m 604.81974,178.70273 v 0 c -1.51233,-10.41394 3.45307,-20.72311 12.78925,-26.55296 9.33618,-5.82984 21.40576,-6.15791 31.08722,-0.84502 v 0 c 3.42944,-6.0551 9.70636,-10.23573 16.93207,-11.27733 7.2257,-1.04159 14.55151,1.17818 19.76159,5.98789 v 0 c 2.92145,-5.48995 8.65778,-9.17848 15.1734,-9.75675 6.51569,-0.57824 12.88855,2.03563 16.85718,6.9141 v 0 c 5.27802,-5.81934 13.67554,-8.2695 21.55884,-6.29033 7.8833,1.97917 13.83649,8.03222 15.28363,15.53996 v 0 c 6.46649,1.65271 11.8529,5.85413 14.76764,11.51872 2.91473,5.66462 3.07178,12.23673 0.43066,18.01831 v 0 c 6.36756,7.76532 7.85712,18.11251 3.91272,27.18018 -3.9444,9.06767 -12.7301,15.49351 -23.07849,16.87953 -0.0729,8.51033 -5.05414,16.31931 -13.02356,20.41701 -7.96942,4.09772 -17.68268,3.84427 -25.39587,-0.66263 -3.2854,10.1926 -12.53272,17.6922 -23.74671,19.25873 -11.21405,1.56653 -22.38446,-3.08087 -28.68524,-11.93434 -7.72339,4.3639 -16.99084,5.621 -25.71185,3.48773 -8.72095,-2.13329 -16.16077,-7.47725 -20.64124,-14.8264 v 0 c -7.89233,0.86539 -15.52319,-2.96604 -19.10534,-9.59276 -3.58216,-6.62671 -2.35303,-14.63803 3.07733,-20.05798 v 0 c -7.04022,-3.8826 -10.63251,-11.58696 -8.90375,-19.09555 1.72882,-7.50861 8.38696,-13.11998 16.50256,-13.90799 z"
id="path4059"
inkscape:connector-curvature="0"
style="fill:#f6cd4c;fill-rule:evenodd" />
<path
d="m 597.06497,212.10764 v 0 c 3.32227,1.83222 7.16028,2.66335 10.99872,2.38181 m 5.02759,27.2696 c 1.65063,-0.18098 3.26861,-0.56427 4.81219,-1.13993 m 41.53858,12.47812 c -1.16095,-1.63134 -2.133,-3.37456 -2.89954,-5.19996 m 55.3335,-2.12474 v 0 c 0.59893,-1.85826 0.98706,-3.77081 1.15777,-5.70575 m 37.26019,-14.04791 c 0.0776,-9.06056 -5.41461,-17.3565 -14.11767,-21.32448 m 33.2832,-22.73425 c -1.40942,3.08528 -3.56109,5.8222 -6.28625,7.99617 m -8.91071,-37.53403 v 0 c 0.24011,1.24591 0.35126,2.51055 0.33185,3.77639 m -37.17371,-13.02565 v 0 c -1.31659,1.45164 -2.4013,3.07382 -3.22033,4.81604 m -28.81055,-1.97402 v 0 c -0.70166,1.31851 -1.22559,2.71372 -1.55975,4.15347 m -35.13446,1.13608 v 0 c 2.04858,1.12421 3.94384,2.47731 5.64404,4.02959 m -49.5199,23.36855 v 0 c 0.20844,1.4353 0.53784,2.85292 0.98499,4.23957"
id="path4061"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="m 636.30257,211.8428 h -0.1875 q -0.78125,0 -1.25,-0.48437 -0.46875,-0.48438 -0.46875,-1.17188 0,-0.46875 0.5,-2.92187 l 1.375,-7.10938 q 0.46875,-2.42187 2.0625,-13.4375 l 0.39063,-2.70312 q 0.14062,-1.01563 1.10937,-1.84375 0.96875,-0.84375 1.78125,-0.84375 0.4375,0 2.9375,1.3125 2.70313,1.45312 3.20313,1.59375 5.09375,1.875 8.375,5.6875 3.29687,3.8125 3.29687,8.23437 0,4.04688 -2.40625,7.92188 -2.39062,3.875 -5.78125,5.71875 -3.39062,1.84375 -8.67187,1.84375 -1.53125,0 -3.46875,-0.53125 -1.9375,-0.51563 -2.79688,-1.26563 z m 6,-25.40625 -1.64062,11.20313 -2,10.67187 q 0.0625,0.0469 0.14062,0.0781 1.84375,1.32812 4.375,1.32812 5.01563,0 7.375,-1.79687 2.35938,-1.79688 3.60938,-4.21875 1.25,-2.42188 1.25,-4.92188 0,-2.51562 -1.40625,-4.64062 -1.39063,-2.125 -3.76563,-3.57813 -2.35937,-1.46875 -7.9375,-4.125 z m 32.16406,24.28125 q -4.21875,2.03125 -6.01562,2.03125 -7.29688,0 -7.29688,-6.95312 0,-6.46875 4.07813,-10.46875 4.09375,-4 8.98437,-4 1.89063,0 3.90625,0.96875 2.03125,0.96875 2.03125,2.25 0,0.79687 -0.53125,1.25 -0.39062,0.76562 -1.10937,4.40625 -0.70313,3.64062 -0.70313,5.25 0,1.96875 1.01563,5.42187 l 0.0781,0.3125 q -0.4375,1.82813 -2.17187,1.82813 -0.29688,0 -1.09375,-0.79688 -0.79688,-0.79687 -1.17188,-1.5 z m 1.42188,-15.1875 q -1.21875,-0.73437 -1.95313,-0.73437 -3.25,0 -6.07812,2.95312 -2.82813,2.9375 -2.82813,7.3125 0,4.29688 3.71875,4.29688 2.90625,0 5.46875,-2.15625 0.3125,-8.1875 1.67188,-11.67188 z m 16.45312,-0.34375 q -0.23437,0.9375 -0.60937,2.1875 -1.64063,5.42188 -1.64063,8.26563 0,3.35937 1.5625,3.35937 1.54688,0 2.28125,-0.95312 1.01563,-1.32813 1.875,-1.32813 0.65625,0 1.25,0.48438 0.59375,0.48437 0.59375,1.26562 0,1.71875 -2.15625,2.95313 -2.15625,1.21875 -4.29687,1.21875 -5.03125,0 -5.03125,-6.28125 0,-3.4375 1.54687,-8.39063 0.40625,-1.3125 0.78125,-2.625 -4.25,-0.20312 -4.85937,-0.70312 -0.60938,-0.5 -0.60938,-1.5625 0,-0.85938 0.57813,-1.42188 0.59375,-0.57812 1.60937,-0.57812 l 4.03125,0.32812 q 0.125,-0.51562 0.1875,-1.17187 
0.0312,-0.67188 0.125,-1.48438 l 0.34375,-2.42187 q 0.10938,-0.79688 0.71875,-1.34375 0.625,-0.54688 1.46875,-0.54688 1.64063,0 1.64063,2.625 0,0.76563 -0.23438,1.98438 l -0.39062,2.35937 q 2.60937,-0.32812 2.78125,-0.32812 2.03125,0 2.79687,0.40625 0.76563,0.39062 0.76563,1.46875 0,0.95312 -0.65625,1.54687 -0.65625,0.57813 -1.57813,0.57813 l -1.92187,-0.0781 q -1.45313,0 -2.95313,0.1875 z m 21.44141,15.53125 q -4.21875,2.03125 -6.01563,2.03125 -7.29687,0 -7.29687,-6.95312 0,-6.46875 4.07812,-10.46875 4.09375,-4 8.98438,-4 1.89062,0 3.90625,0.96875 2.03125,0.96875 2.03125,2.25 0,0.79687 -0.53125,1.25 -0.39063,0.76562 -1.10938,4.40625 -0.70312,3.64062 -0.70312,5.25 0,1.96875 1.01562,5.42187 l 0.0781,0.3125 q -0.4375,1.82813 -2.17188,1.82813 -0.29687,0 -1.09375,-0.79688 -0.79687,-0.79687 -1.17187,-1.5 z m 1.42187,-15.1875 q -1.21875,-0.73437 -1.95312,-0.73437 -3.25,0 -6.07813,2.95312 -2.82812,2.9375 -2.82812,7.3125 0,4.29688 3.71875,4.29688 2.90625,0 5.46875,-2.15625 0.3125,-8.1875 1.67187,-11.67188 z"
id="path4063"
inkscape:connector-curvature="0"
style="fill:#000000;fill-rule:nonzero" />
<path
d="m 604.81711,394.03214 v 0 c -1.51233,-10.41397 3.45307,-20.72311 12.78925,-26.55298 9.33618,-5.82983 21.40576,-6.1579 31.08722,-0.845 v 0 c 3.42944,-6.05511 9.70636,-10.23575 16.93207,-11.27734 7.2257,-1.0416 14.55151,1.17819 19.76159,5.98788 v 0 c 2.92145,-5.48993 8.65778,-9.17847 15.1734,-9.75674 6.51569,-0.57825 12.88855,2.03564 16.85718,6.91409 v 0 c 5.27802,-5.81934 13.67554,-8.2695 21.55884,-6.29031 7.8833,1.97915 13.83649,8.03222 15.28363,15.53994 v 0 c 6.46649,1.65271 11.8529,5.85413 14.76764,11.51874 2.91473,5.66461 3.07178,12.23673 0.43066,18.01831 v 0 c 6.36756,7.76532 7.85712,18.11249 3.91272,27.18018 -3.9444,9.06766 -12.7301,15.4935 -23.07849,16.87952 -0.0729,8.51034 -5.05414,16.3193 -13.02356,20.41702 -7.96942,4.09769 -17.68268,3.84427 -25.39587,-0.66266 -3.2854,10.19263 -12.53272,17.69223 -23.74671,19.25876 -11.21405,1.56653 -22.38446,-3.08087 -28.68524,-11.93436 -7.72339,4.36392 -16.99084,5.621 -25.71185,3.48773 -8.72095,-2.13327 -16.16077,-7.47723 -20.64124,-14.82638 v 0 c -7.89233,0.86539 -15.52319,-2.96607 -19.10534,-9.59277 -3.58216,-6.62671 -2.35303,-14.63804 3.07733,-20.05799 v 0 c -7.04022,-3.8826 -10.63257,-11.58694 -8.90375,-19.09555 1.72882,-7.5086 8.38696,-13.11996 16.50256,-13.90799 z"
id="path4065"
inkscape:connector-curvature="0"
style="fill:#f6cd4c;fill-rule:evenodd" />
<path
d="m 597.06237,427.43705 v 0 c 3.32227,1.83218 7.16028,2.66333 10.99872,2.38181 m 5.02759,27.26959 c 1.65063,-0.181 3.26861,-0.56427 4.81219,-1.13992 m 41.53858,12.47811 c -1.16095,-1.63134 -2.133,-3.37457 -2.89954,-5.19998 m 55.3335,-2.12472 v 0 c 0.59893,-1.85825 0.98706,-3.77082 1.15777,-5.70575 m 37.26019,-14.04792 c 0.0776,-9.06054 -5.41461,-17.3565 -14.11767,-21.32449 m 33.2832,-22.73425 c -1.40942,3.0853 -3.56109,5.82223 -6.28625,7.99618 m -8.91071,-37.53402 v 0 c 0.24011,1.24591 0.35126,2.51056 0.33185,3.77639 m -37.17371,-13.02566 v 0 c -1.31659,1.45163 -2.4013,3.07382 -3.22033,4.81604 m -28.81055,-1.974 v 0 c -0.70166,1.31848 -1.22559,2.71372 -1.55975,4.15347 m -35.13446,1.13608 v 0 c 2.04858,1.12421 3.94384,2.4773 5.64404,4.02957 m -49.5199,23.36856 v 0 c 0.20844,1.43531 0.53784,2.85291 0.98499,4.23957"
id="path4067"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="m 636.29997,427.17219 h -0.1875 q -0.78125,0 -1.25,-0.48437 -0.46875,-0.48438 -0.46875,-1.17188 0,-0.46875 0.5,-2.92187 l 1.375,-7.10938 q 0.46875,-2.42187 2.0625,-13.4375 l 0.39063,-2.70312 q 0.14062,-1.01563 1.10937,-1.84375 0.96875,-0.84375 1.78125,-0.84375 0.4375,0 2.9375,1.3125 2.70313,1.45312 3.20313,1.59375 5.09375,1.875 8.375,5.6875 3.29687,3.8125 3.29687,8.23437 0,4.04688 -2.40625,7.92188 -2.39062,3.875 -5.78125,5.71875 -3.39062,1.84375 -8.67187,1.84375 -1.53125,0 -3.46875,-0.53125 -1.9375,-0.51563 -2.79688,-1.26563 z m 6,-25.40625 -1.64062,11.20313 -2,10.67187 q 0.0625,0.0469 0.14062,0.0781 1.84375,1.32812 4.375,1.32812 5.01563,0 7.375,-1.79687 2.35938,-1.79688 3.60938,-4.21875 1.25,-2.42188 1.25,-4.92188 0,-2.51562 -1.40625,-4.64062 -1.39063,-2.125 -3.76563,-3.57813 -2.35937,-1.46875 -7.9375,-4.125 z m 32.16406,24.28125 q -4.21875,2.03125 -6.01562,2.03125 -7.29688,0 -7.29688,-6.95312 0,-6.46875 4.07813,-10.46875 4.09375,-4 8.98437,-4 1.89063,0 3.90625,0.96875 2.03125,0.96875 2.03125,2.25 0,0.79687 -0.53125,1.25 -0.39062,0.76562 -1.10937,4.40625 -0.70313,3.64062 -0.70313,5.25 0,1.96875 1.01563,5.42187 l 0.0781,0.3125 q -0.4375,1.82813 -2.17187,1.82813 -0.29688,0 -1.09375,-0.79688 -0.79688,-0.79687 -1.17188,-1.5 z m 1.42188,-15.1875 q -1.21875,-0.73437 -1.95313,-0.73437 -3.25,0 -6.07812,2.95312 -2.82813,2.9375 -2.82813,7.3125 0,4.29688 3.71875,4.29688 2.90625,0 5.46875,-2.15625 0.3125,-8.1875 1.67188,-11.67188 z m 16.45312,-0.34375 q -0.23437,0.9375 -0.60937,2.1875 -1.64063,5.42188 -1.64063,8.26563 0,3.35937 1.5625,3.35937 1.54688,0 2.28125,-0.95312 1.01563,-1.32813 1.875,-1.32813 0.65625,0 1.25,0.48438 0.59375,0.48437 0.59375,1.26562 0,1.71875 -2.15625,2.95313 -2.15625,1.21875 -4.29687,1.21875 -5.03125,0 -5.03125,-6.28125 0,-3.4375 1.54687,-8.39063 0.40625,-1.3125 0.78125,-2.625 -4.25,-0.20312 -4.85937,-0.70312 -0.60938,-0.5 -0.60938,-1.5625 0,-0.85938 0.57813,-1.42188 0.59375,-0.57812 1.60937,-0.57812 l 4.03125,0.32812 q 0.125,-0.51562 0.1875,-1.17187 
0.0312,-0.67188 0.125,-1.48438 l 0.34375,-2.42187 q 0.10938,-0.79688 0.71875,-1.34375 0.625,-0.54688 1.46875,-0.54688 1.64063,0 1.64063,2.625 0,0.76563 -0.23438,1.98438 l -0.39062,2.35937 q 2.60937,-0.32812 2.78125,-0.32812 2.03125,0 2.79687,0.40625 0.76563,0.39062 0.76563,1.46875 0,0.95312 -0.65625,1.54687 -0.65625,0.57813 -1.57813,0.57813 l -1.92187,-0.0781 q -1.45313,0 -2.95313,0.1875 z m 21.44141,15.53125 q -4.21875,2.03125 -6.01563,2.03125 -7.29687,0 -7.29687,-6.95312 0,-6.46875 4.07812,-10.46875 4.09375,-4 8.98438,-4 1.89062,0 3.90625,0.96875 2.03125,0.96875 2.03125,2.25 0,0.79687 -0.53125,1.25 -0.39063,0.76562 -1.10938,4.40625 -0.70312,3.64062 -0.70312,5.25 0,1.96875 1.01562,5.42187 l 0.0781,0.3125 q -0.4375,1.82813 -2.17188,1.82813 -0.29687,0 -1.09375,-0.79688 -0.79687,-0.79687 -1.17187,-1.5 z m 1.42187,-15.1875 q -1.21875,-0.73437 -1.95312,-0.73437 -3.25,0 -6.07813,2.95312 -2.82812,2.9375 -2.82812,7.3125 0,4.29688 3.71875,4.29688 2.90625,0 5.46875,-2.15625 0.3125,-8.1875 1.67187,-11.67188 z"
id="path4069"
inkscape:connector-curvature="0"
style="fill:#000000;fill-rule:nonzero" />
<path
d="m 299.38846,189.04987 h 70.36065 v -6.68504 l 8.40525,13.37008 -8.40525,13.37007 v -6.68502 h -70.36065 z"
id="path4071"
inkscape:connector-curvature="0"
style="fill:#7c7ce0;fill-rule:evenodd;stroke-width:0.79288208" />
<path
d="M 427.88773,353.66272 H 553.16596 V 468.46329 H 427.88773 Z"
id="path4075"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<g
transform="matrix(0.45555722,0,0,0.45555774,379.83689,353.66273)"
id="g4082">
<clipPath
id="p.3">
<path
d="M 0,0 H 275 V 252 H 0 Z"
id="path4077"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
<image
clip-path="url(#p.3)"
width="275"
height="252"
x="0"
y="0"
preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/magic.gif"
id="image4080"
style="fill:#000000" />
</g>
<path
d="m 660.00147,264.20093 21.66931,-21.66928 21.66932,21.66928 h -10.83466 v 85.7638 h 10.83466 l -21.66932,21.66928 -21.66931,-21.66928 h 10.83466 v -85.7638 z"
id="path4088"
inkscape:connector-curvature="0"
style="fill:#db4437;fill-rule:evenodd" />
<path
d="m 660.00147,264.20093 21.66931,-21.66928 21.66932,21.66928 h -10.83466 v 85.7638 h 10.83466 l -21.66932,21.66928 -21.66931,-21.66928 h 10.83466 v -85.7638 z"
id="path4090"
inkscape:connector-curvature="0"
style="fill-rule:evenodd;stroke:#666666;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round" />
<path
d="m 554.53672,285.04017 c 42.66144,-12.09317 85.32288,12.0932 127.98438,0 42.66143,-12.09317 85.32287,12.0932 127.98431,0 l -1.70062,50.79135 c -42.66144,12.09317 -85.32288,-12.09317 -127.98432,0 -42.66143,12.09317 -85.32294,-12.09317 -127.98437,0 z"
id="path4092"
inkscape:connector-curvature="0"
style="fill:#db4437;fill-rule:evenodd" />
<path
d="m 587.72404,321.45273 q -0.84375,-0.5 -1.26562,-1.0625 -0.40625,-0.5625 -0.40625,-1.07812 0,-0.67188 0.67187,-1.07813 0.125,-0.0625 0.35938,-0.0625 0.3125,0 0.67187,0.10938 0.375,0.10937 0.57813,0.29687 0.82812,0.76563 1.98437,0.76563 0.5,0 1.46875,-0.21875 1.25,-0.39063 1.96875,-0.95313 0.71875,-0.57812 1.70313,-1.79687 0.60937,-0.79688 0.90625,-1.60938 0.3125,-0.82812 0.3125,-1.5 0,-0.73437 -0.34375,-1.15625 -0.64063,-0.85937 -1.4375,-1.28125 -0.78125,-0.4375 -1.82813,-0.4375 -0.9375,0 -1.82812,0.25 l -1.85938,0.39063 q -0.125,0.0312 -0.34375,0.0469 -0.20312,0.0156 -0.59375,0.95313 -0.375,0.92187 -1.29687,3.51562 -0.67188,1.875 -1.125,2.70313 -0.45313,0.8125 -0.79688,0.8125 -0.39062,0 -0.82812,-0.70313 -0.0937,-0.15625 -0.40625,-0.64062 -0.29688,-0.48438 -0.29688,-0.9375 0,-0.21875 0.15625,-0.53125 0.15625,-0.29688 0.26563,-0.57813 0.10937,-0.29687 0.25,-0.79687 0.40625,-1.25 2.40625,-6.20313 2,-4.96875 3.125,-7.5 0.79687,-1.82812 0.48437,-1.92187 0,-0.0937 -0.10937,-0.28125 -0.0937,-0.20313 -0.0625,-0.29688 0,-0.1875 1.14062,-0.5625 1.14063,-0.39062 1.54688,-0.32812 h 0.20312 q 0.40625,0 0.40625,-0.15625 0.0625,0 0.59375,-0.0625 0.53125,-0.0625 1.14063,0 2.26562,0.1875 3.57812,1.29687 1.3125,1.09375 1.28125,3.23438 -0.0625,0.84375 -0.92187,2.28125 -0.85938,1.4375 -1.89063,2.35937 -0.51562,0.45313 -1.17187,0.95313 -0.65625,0.5 -0.84375,0.625 l 0.79687,0.28125 q 1.5,0.57812 2.32813,1.42187 0.84375,0.82813 1.1875,2.26563 0.0781,0.3125 0.0781,0.57812 0,0.70313 -0.25,1.59375 -0.23437,0.89063 -0.53125,1.5625 -0.1875,0.20313 -0.3125,0.42188 -0.125,0.21875 -0.15625,0.28125 -0.0312,0.14062 -0.57812,0.98437 -0.54688,0.84375 -0.96875,1.35938 -2.10938,1.79687 -3.48438,2.29687 -1.34375,0.60938 -3.20312,0.60938 -1.65625,0 -2.45313,-0.51563 z m 2.29688,-13.26562 q 0.32812,-0.0781 1.28125,-0.28125 0.95312,-0.20313 1.375,-0.35938 0.0625,-0.0312 0.39062,-0.20312 0.34375,-0.1875 0.70313,-0.54688 2.01562,-1.875 2.8125,-3.39062 0.79687,-1.3125 0.79687,-1.84375 0,-0.39063 
-0.34375,-0.75 -0.60937,-0.70313 -2.01562,-0.70313 -0.60938,0 -1.03125,0.14063 -0.45313,0.125 -0.65625,0.29687 -0.20313,0.17188 -0.20313,0.53125 0,0.25 -0.17187,0.5625 -0.15625,0.29688 -0.34375,0.48438 -0.0312,0.20312 -0.17188,0.39062 -0.14062,0.1875 -0.21875,0.25 0,0.0781 -0.4375,0.875 -0.42187,0.79688 -0.79687,1.78125 -0.96875,2.28125 -0.96875,2.76563 z m 11.90643,10.29687 q -0.0937,-0.1875 -0.54688,-0.5 -0.45312,-0.32812 -0.45312,-0.39062 0,-0.14063 -0.14063,-0.375 -0.14062,-0.23438 -0.26562,-0.4375 -0.26563,-0.25 0.34375,-1.65625 0.60937,-1.40625 1.64062,-3.32813 0.1875,-0.1875 0.48438,-0.67187 0.3125,-0.48438 0.40625,-0.54688 0,-0.125 0.0937,-0.23437 0.0937,-0.10938 0.23437,-0.10938 l 0.25,-0.51562 q 0.125,-0.29688 1.25,-1.375 1.125,-1.09375 1.76563,-1.46875 0.59375,-0.42188 1.3125,-0.42188 0.76562,0 1.29687,0.45313 l 0.875,0.73437 q 0.95313,0.79688 1.07813,0.92188 l 1.1875,1.0625 -0.54688,1.59375 q -0.28125,0.85937 -0.28125,1.64062 0,1.20313 0.67188,2.26563 0.15625,0.34375 0.42187,0.54687 0.28125,0.1875 0.73438,0.25 0.70312,0.125 0.70312,0.70313 0,0.70312 -0.26562,1.0625 -0.25,0.34375 -0.60938,0.4375 -0.5,0.0781 -0.70312,0.0781 -0.92188,0 -1.6875,-0.48437 -0.76563,-0.48438 -1.34375,-1.64063 -0.125,-0.1875 -0.34375,-0.6875 -0.20313,-0.51562 -0.23438,-0.71875 -0.21875,0.32813 -0.73437,0.84375 -3.29688,3.32813 -5.34375,3.32813 -0.70313,0 -1.25,-0.35938 z m 1.76562,-2.65625 q 0.5,0.0625 1.98438,-1.15625 0.15625,-0.125 0.34375,-0.29687 0.1875,-0.17188 0.42187,-0.375 0.89063,-0.73438 1.39063,-1.21875 0.5,-0.5 1,-1.20313 l 0.70312,-1.15625 q -0.3125,-0.70312 -0.57812,-0.9375 -0.25,-0.25 -0.5,-0.25 -0.64063,0 -1.60938,1.14063 -0.95312,1.14062 -2.29687,3.4375 -0.35938,0.57812 -0.57813,1.20312 -0.21875,0.625 -0.28125,0.8125 z m 14.94391,3.8125 q -0.78125,0 -1.59375,-0.32812 -0.8125,-0.3125 -1.375,-0.78125 -0.5625,-0.46875 -0.5625,-0.84375 0,-0.39063 0.35938,-0.48438 l 0.15625,-0.125 q 0.0625,-0.0312 0.25,-0.0312 0.14062,-0.0312 0.32812,0.0156 0.1875,0.0469 
0.3125,0.0469 0.23438,0.0625 0.67188,0.125 l 0.84375,0.0937 q 0.64062,0 1.1875,-0.125 0.5625,-0.125 0.5625,-0.28125 0,-0.10937 -0.25,-0.32812 -0.25,-0.21875 -0.54688,-0.21875 0,-0.125 -0.98437,-0.85938 -0.54688,-0.42187 -1.04688,-0.82812 -0.5,-0.42188 -0.5,-0.625 l 0.0312,-0.0625 v -0.15625 q -0.21875,0 -0.42187,-0.60938 -0.1875,-0.60937 -0.1875,-1.1875 0,-1.07812 0.95312,-2.39062 0.96875,-1.3125 2.39063,-2.35938 1.4375,-1.0625 2.71875,-1.45312 0.46875,-0.21875 1.40625,-0.21875 0.60937,0 0.82812,0.125 l 0.25,-0.0625 q 0.39063,0 1,0.64062 0.60938,0.64063 0.76563,1.01563 0,1.0625 -0.0625,1.70312 -0.0625,0.64063 -0.32813,0.64063 -0.0312,0 -0.0937,0.0312 -0.0625,0.0312 -0.0625,0.15625 0,0.45312 -0.78125,1 -0.78125,0.53125 -1.42187,0.53125 l -0.125,-0.0312 q -0.20313,-0.0312 -0.57813,-0.375 -0.35937,-0.35938 -0.35937,-0.70313 0,-0.0937 0.375,-0.51562 0.39062,-0.42188 0.65625,-0.8125 0.28125,-0.40625 0.28125,-0.85938 0,-0.15625 -0.10938,-0.20312 -0.10937,-0.0469 -0.4375,-0.0469 -1.75,0.79688 -2.51562,1.5 -0.10938,0.20313 -0.1875,0.29688 -0.0781,0.0937 -0.20313,0.0937 0,0 0,-0.0156 0,-0.0156 -0.0312,-0.0156 -0.0625,0 -0.42187,0.375 -0.34375,0.35937 -0.65625,0.84375 -0.29688,0.46875 -0.35938,0.89062 l -0.0625,0.0937 v 0.15625 q -0.0312,0.0625 -0.0312,0.26563 0,0.28125 0.25,0.59375 0.25,0.29687 1.03125,0.84375 1.4375,1.09375 1.95313,1.625 0.53125,0.51562 0.67187,0.9375 0.28125,0.57812 0.28125,1.01562 0,1.09375 -1.14062,1.6875 -1.125,0.59375 -3.07813,0.59375 z m 13.21564,-13.79687 q -0.10937,0.0937 -0.20312,0.0937 -0.125,0 -0.40625,-0.1875 -0.32813,-0.0312 -0.85938,-0.54688 -0.51562,-0.51562 -0.51562,-0.67187 -0.0312,-0.39063 0.0625,-1.10938 0.0937,-0.71875 0.35937,-1.28125 0.28125,-0.57812 0.71875,-0.60937 0.23438,0.0312 0.60938,0.14062 0.39062,0.10938 0.875,0.29688 0.59375,1.21875 0.59375,2.07812 0,1.28125 -1.23438,1.79688 z m -2.95312,13.28125 q -0.76563,0 -1.32813,-0.79688 -0.5625,-0.79687 -0.6875,-1.98437 0.125,-1.1875 0.1875,-1.73438 0.35938,-1.5625 0.89063,-3.6875 
0.53125,-2.14062 0.625,-2.35937 0.15625,-0.28125 0.5,-0.28125 0.35937,0 0.78125,0.26562 0.4375,0.26563 0.73437,0.75 0.25,0.35938 0.25,0.79688 0,0.67187 -0.64062,2.82812 -0.4375,1.59375 -0.57813,2.26563 -0.125,0.67187 -0.0937,1.4375 -0.125,1.4375 -0.25,1.98437 -0.125,0.54688 -0.39062,0.51563 z m 6.8189,-0.21875 q -0.95312,0 -1.25,-0.10938 -0.28125,-0.125 -0.70312,-0.5 -0.0937,-0.23437 -0.40625,-0.60937 -0.29688,-0.39063 -0.48438,-0.57813 -0.0937,-0.35937 -0.0937,-0.82812 0,-0.84375 0.23438,-1.95313 0.23437,-1.125 0.59375,-2.04687 0.125,-0.26563 0.20312,-0.48438 0.0781,-0.21875 0.10938,-0.28125 0,-0.23437 0.8125,-1.39062 0.82812,-1.17188 1.3125,-1.625 0.21875,-0.25 0.67187,-0.53125 0.46875,-0.29688 0.82813,-0.42188 0.0937,-0.0312 0.78125,-0.28125 0.6875,-0.26562 1.26562,-0.26562 0.51563,0 1,0.20312 0.5,0.20313 0.82813,0.65625 0.82812,0.82813 0.82812,2.17188 0,0.26562 -0.0625,0.57812 -0.17187,0.64063 -0.67187,1.04688 -0.48438,0.39062 -1.03125,0.39062 -0.54688,0 -0.92188,-0.40625 -0.125,-0.32812 -0.125,-0.42187 0,-0.0937 0.125,-0.54688 0.125,-0.57812 0.125,-0.73437 0,-0.40625 -0.28125,-0.40625 -0.39062,0 -0.625,0.1875 -0.21875,0.1875 -0.82812,0.82812 -0.40625,0.45313 -0.96875,1.29688 -0.5625,0.84375 -0.875,1.60937 -0.29688,0.875 -0.40625,1.25 -0.10938,0.35938 -0.10938,0.42188 0,0.125 -0.0312,0.23437 -0.0312,0.10938 -0.0312,0.25 0,0.21875 0.0937,0.34375 0.0937,0.125 0.34375,0.26563 0.0781,0.0312 0.23438,0.125 0.15625,0.0937 0.3125,0.0937 0.20312,0 0.54687,-0.15625 0.35938,-0.17188 0.45313,-0.20313 0.40625,-0.1875 1.17187,-0.82812 l 0.78125,-0.57813 q 0.21875,-0.15625 0.39063,-0.26562 0.1875,-0.10938 0.28125,-0.15625 0.0937,-0.0469 0.0937,0.0156 0,0.125 0.0937,0.125 0.0937,0 0.21875,-0.125 0.125,-0.20312 0.45312,-0.20312 0.25,0 0.51563,0.15625 0.28125,0.14062 0.28125,0.23437 l -0.0312,0.15625 q 0,0.15625 0.15625,0.32813 0,0.125 -0.0781,0.32812 -0.0781,0.20313 -0.20312,0.34375 -0.32813,0.46875 -0.73438,0.98438 -0.39062,0.51562 -0.54687,0.70312 h -0.0312 q -0.15625,0 
-0.45313,0.29688 -0.46875,0.4375 -0.82812,0.4375 -0.125,0.0625 -0.29688,0.15625 -0.15625,0.0781 -0.15625,0.17187 -0.53125,0.54688 -2.875,0.54688 z m 10.59815,-0.32813 q -0.54688,0.125 -1.3125,-0.57812 -0.76563,-0.70313 -0.90625,-1.3125 -0.0625,-0.3125 -0.0625,-0.54688 0,-1.23437 0.79687,-3.01562 0.8125,-1.78125 2.17188,-3.34375 1.35937,-1.5625 2.98437,-2.26563 0.67188,-0.35937 0.85938,-0.35937 0.20312,0 0.875,0.35937 0.64062,0.15625 1.01562,0.34375 0.39063,0.1875 0.70313,0.60938 0.42187,0.82812 0.60937,2.04687 0.20313,0.67188 0.20313,0.95313 0,0.26562 -0.20313,0.45312 -0.0312,0.0312 -0.10937,0.125 -0.0781,0.0937 -0.0781,0.15625 0,0.0469 0.0625,0.0469 0,0.0312 -0.0312,0.14062 -0.0312,0.10938 -0.125,0.29688 -0.39062,1.21875 -0.39062,1.89062 0,0.21875 0.0625,0.54688 0.32812,0.82812 0.8125,0.82812 0.4375,0 1.01562,-0.57812 0.48438,-0.3125 0.96875,-0.3125 0.28125,0 0.28125,0.15625 0.0312,0.0312 0.125,0.0937 0.0937,0.0625 0.1875,0.0625 0.14063,0 0.14063,0.29687 0,0.25 -0.125,0.67188 -0.10938,0.40625 -0.26563,0.67187 -0.21875,0.21875 -0.35937,0.39063 -0.125,0.15625 -0.125,0.25 l -0.46875,0.4375 q -0.60938,0.35937 -1.45313,0.35937 -0.85937,0 -1.375,-0.35937 -0.5,-0.34375 -0.95312,-0.875 -0.45313,-0.53125 -0.57813,-0.9375 l -0.15625,-0.32813 -0.9375,0.60938 -0.79687,0.82812 q -0.25,0.32813 -0.4375,0.32813 h -0.32813 q 0,0.21875 -0.90625,0.54687 -0.89062,0.3125 -1.39062,0.3125 z m 0.67187,-2.6875 q 0.34375,0 1.4375,-0.8125 1.10938,-0.8125 1.5,-1.29687 0.125,-0.15625 0.32813,-0.45313 0.21875,-0.3125 0.375,-0.46875 0.51562,-0.60937 0.75,-0.96875 0.25,-0.375 0.25,-0.73437 0,-0.25 -0.23438,-0.70313 -0.125,-0.39062 -0.375,-0.39062 -0.14062,0 -0.73437,0.23437 -0.60938,0.34375 -1.4375,1.4375 -0.8125,1.07813 -1.42188,2.20313 -0.60937,1.125 -0.60937,1.64062 0,0.0937 0.0625,0.1875 0.0781,0.0937 0.10937,0.125 z m 11.06427,4.21875 q -0.48437,-0.15625 -0.71875,-0.57812 -0.23437,-0.40625 -0.23437,-1.20313 0,-1.3125 0.25,-2.5625 l 0.0312,-0.25 q 0.0312,-0.20312 0.0937,-0.625 0.0625,-0.4375 
0.0937,-0.25 l 0.20312,-0.85937 q 0.21875,-0.89063 0.42188,-1.65625 0.21875,-0.78125 0.34375,-1.125 l 0.34375,-1.125 q 0.96875,-3.10938 1.79687,-4.1875 0.0312,-0.125 0.0937,-0.23438 0.0625,-0.125 0.0937,-0.25 0,-0.51562 0.20313,-0.79687 0.28125,-0.48438 0.46875,-0.85938 l 0.14062,-0.35937 0.89063,-1.82813 q 0.0937,-0.3125 0.39062,-0.6875 0.3125,-0.39062 0.625,-0.53125 l -0.0312,-0.0312 q 0.17188,-0.25 0.59375,-0.4375 0.4375,-0.20312 0.75,-0.20312 0.42188,0 0.67188,0.26562 0.26562,0.25 0.26562,0.70313 v 0.3125 q 0,0.29687 -0.0469,0.45312 -0.0469,0.15625 -0.21875,0.39063 h 0.0312 q 0.0469,0.0312 -0.0469,0.20312 -0.0781,0.17188 -0.14062,0.32813 -0.21875,0.57812 -0.3125,0.64062 -0.23438,0.29688 -1.48438,2.71875 -1.23437,2.40625 -1.59375,3.32813 l -0.125,0.48437 -0.0625,0.20313 q -0.35937,0.76562 -0.39062,0.82812 l -0.51563,1.98438 -0.125,0.67187 -0.28125,0.73438 -0.1875,0.60937 q -0.0781,0.15625 -0.125,0.42188 -0.0469,0.25 -0.10937,0.3125 l -0.0312,0.15625 q -0.0312,0.125 -0.21875,0.73437 -0.29688,0.90625 -0.42188,1.5 -0.125,0.57813 -0.15625,1.25 l -0.0312,0.14063 q -0.0937,0.70312 -0.40625,1.04687 -0.29687,0.35938 -0.78125,0.21875 z m 6.82099,-0.59375 q -0.0937,0.0312 -0.26563,0.0312 -0.67187,0 -1.125,-0.76562 -0.4375,-0.78125 -0.4375,-1.96875 0,-0.625 0.125,-1.23438 0.35938,-0.9375 0.35938,-1.34375 0.0937,-0.60937 0.65625,-2.1875 0.5625,-1.59375 0.71875,-1.92187 0.125,0 0.21875,-0.20313 0.0937,-0.20312 0.0937,-0.46875 0,-0.28125 0.10937,-0.57812 0.125,-0.3125 0.25,-0.4375 0,-0.0937 0.0781,-0.34375 0.0781,-0.26563 0.0781,-0.35938 0,-0.0625 0.0781,-0.23437 0.0781,-0.1875 0.0781,-0.28125 0.51563,-0.98438 1.79688,-4.45313 1.28125,-3.48437 1.28125,-3.85937 0,-0.26563 0.1875,-0.48438 0.20312,-0.21875 0.45312,-0.21875 0.0937,-0.10937 0.40625,-0.15625 0.32813,-0.0469 0.51563,-0.0469 l 0.51562,0.51562 q 0.35938,0.32813 0.54688,0.57813 0.1875,0.25 0.1875,0.60937 0,0.57813 -0.20313,1.14063 -0.20312,0.54687 -1.20312,2.625 -0.0625,0.17187 -0.51563,1.15625 -0.3125,0.64062 
-0.3125,0.73437 0,0.0937 -0.14062,0.40625 -0.125,0.29688 -0.21875,0.625 -0.25,0.51563 -0.8125,1.8125 -0.5625,1.29688 -0.5625,1.42188 0,0.34375 -0.15625,0.5 -0.0937,0 -0.125,0.15625 -0.0312,0.14062 -0.0312,0.20312 0,0.3125 -0.51563,1.70313 l -0.28125,0.92187 q -0.60937,2.07813 -0.85937,3.04688 -0.23438,0.95312 -0.23438,1.625 0,0.70312 -0.23437,1.15625 -0.21875,0.4375 -0.5,0.57812 z m 13.97729,-12.23437 q 0.29688,0.57812 0.29688,1.3125 0,0.9375 -0.42188,1.53125 -0.3125,0.45312 -0.5625,1.35937 -0.23437,0.89063 -0.45312,2.39063 -0.29688,1.21875 -0.71875,3.75 -0.5,1.34375 -0.625,2.01562 -0.0469,0.1875 -0.3125,0.79688 -0.26563,0.60937 -0.4375,0.95312 -1.4375,2.92188 -3.40625,4.39063 -1.96875,1.46875 -4.01563,1.59375 -0.51562,0.0625 -0.60937,0.0625 -0.28125,0 -0.53125,-0.0937 -0.23438,-0.0937 -0.51563,-0.25 -0.64062,-0.45313 -0.73437,-0.64063 -0.17188,-0.32812 -0.17188,-0.64062 0,-0.48438 0.45313,-0.65625 0.45312,-0.17188 1.03125,-0.17188 0.21875,0.125 0.76562,0.125 0.54688,0 0.64063,-0.125 0.0625,-0.0625 0.375,-0.21875 0.32812,-0.14062 0.73437,-0.23437 1.0625,-0.51563 1.625,-1.03125 0.5625,-0.5 1.23438,-1.625 0.79687,-1.21875 1.04687,-2.20313 0.42188,-1.0625 1,-3.14062 l -0.0312,0.0312 q -0.17187,0.1875 -0.8125,0.5625 -0.625,0.375 -0.76562,0.375 -0.0312,0 -0.0937,0.0312 -0.0625,0.0312 -0.0625,0.125 0,0.125 -0.82813,0.40625 -0.82812,0.26562 -1.0625,0.26562 -0.3125,0 -0.6875,-0.34375 -0.35937,-0.35937 -0.53125,-0.65625 -0.125,-0.28125 -0.46875,-0.57812 -0.39062,-0.375 -0.57812,-0.67188 -0.1875,-0.3125 -0.1875,-0.73437 0,-0.21875 0.0312,-0.34375 0.375,-1.4375 1.40625,-4.42188 1.04687,-3 1.53125,-3.73437 0.125,-0.20313 0.46875,-0.32813 0.35937,-0.125 0.71875,-0.125 0.64062,0 0.85937,0.375 0.39063,0.54688 0.39063,0.9375 0,0.4375 -0.45313,1.3125 -0.125,0.34375 -0.32812,0.76563 -0.1875,0.42187 -0.1875,0.51562 -0.0312,0.125 -0.1875,0.78125 -0.15625,0.65625 -0.28125,0.65625 -0.20313,0.1875 -0.4375,1.32813 -0.23438,1.14062 -0.23438,1.8125 0,0.375 0.0937,0.375 0.85938,0 
1.65625,-0.45313 0.79688,-0.46875 1.67188,-1.5 0.125,-0.79687 0.35937,-1.71875 0.25,-0.9375 0.34375,-1.25 0.57813,-1.98437 0.76563,-3.20312 0.1875,-0.45313 0.28125,-0.53125 0.0937,-0.0781 0.42187,-0.0781 0.39063,0 0.84375,0.23438 0.46875,0.21875 0.6875,0.53125 z m 11.99976,13.79687 q -1.25,-0.0312 -1.78125,-0.59375 -0.51563,-0.5625 -0.51563,-1.57812 0,-0.26563 0.0625,-0.84375 0.125,-0.4375 0.1875,-0.89063 0.15625,-0.70312 0.32813,-0.82812 0.0312,-0.0312 0.0625,-0.125 0.0312,-0.10938 -0.0625,-0.17188 0,-0.40625 0.28125,-1.14062 0.0625,-0.23438 0.15625,-0.45313 0.0937,-0.21875 0.14062,-0.28125 -0.0781,-0.0781 -0.0781,-0.20312 0,-0.15625 0.17187,-0.3125 0.1875,-0.15625 0.26563,-0.28125 0.0781,-0.14063 -0.0156,-0.26563 -0.0625,-0.15625 0.60938,-1.4375 0.42187,-0.82812 0.45312,-0.9375 -0.0312,-0.0937 0.45313,-1.17187 0.5,-1.09375 0.79687,-1.64063 0.28125,-0.15625 -0.23437,-0.21875 -0.51563,-0.0625 -1.25,-0.0312 -1.04688,-0.0312 -1.625,-0.10937 -0.57813,-0.0937 -0.9375,-0.28125 -0.21875,-0.21875 -0.21875,-0.57813 0,-0.40625 0.29687,-0.75 0.3125,-0.35937 0.76563,-0.39062 0.82812,-0.0312 2.5625,-0.23438 l 1.625,-0.125 0.0937,-0.54687 q 0.15625,-0.28125 0.29687,-0.48438 0.15625,-0.21875 0.20313,-0.375 0.0469,-0.15625 0.0781,-0.48437 0.0937,-0.125 0.32813,-0.625 0.25,-0.51563 0.3125,-1 0.35937,-0.60938 0.65625,-1.17188 0.29687,-0.5625 0.375,-0.65625 0.5625,-0.85937 1.26562,-0.85937 0.39063,0 0.8125,0.25 0.40625,0.29687 0.625,0.51562 0.23438,0.21875 0.23438,0.60938 0,0.28125 -0.29688,0.82812 -0.125,0.25 -0.375,1.03125 -0.21875,0.28125 -0.48437,0.78125 -0.25,0.5 -0.34375,0.9375 -0.0625,0.15625 -0.20313,0.4375 -0.125,0.26563 -0.17187,0.3125 -0.0469,0.0469 -0.14063,0.0469 0.0312,0.20312 0.125,0.26562 0.0937,0.0625 0.32813,0.0625 0.15625,0 0.39062,-0.0312 0.25,-0.0312 0.59375,-0.0312 0.32813,-0.0312 0.5625,-0.0937 0.23438,-0.0625 0.46875,-0.0625 0.125,0 0.375,0.0937 0.15625,0.1875 0.15625,0.73438 0,0.40625 -0.125,0.78125 -0.125,0.375 -0.3125,0.46875 -0.15625,0.0312 
-0.70312,0.1875 -0.54688,0.15625 -1.1875,0.1875 l -1.70313,0.15625 -0.53125,1.09375 q -0.23437,0.39062 -0.39062,0.75 -0.15625,0.34375 -0.21875,0.40625 0,0.15625 -0.26563,0.64062 -0.25,0.48438 -0.375,0.76563 -0.17187,0.45312 -0.39062,0.89062 -0.21875,0.42188 -0.28125,0.54688 -0.0937,0 -0.125,0.0469 -0.0156,0.0469 -0.0156,0.14063 l 0.0781,0.26562 q 0,0.15625 -0.17188,0.3125 -0.125,0.0625 -0.23437,0.23438 -0.10938,0.15625 -0.14063,0.21875 0.0625,0.0625 0.0625,0.15625 0,0.23437 -0.21875,0.57812 0,0.54688 -0.45312,1.625 0.0312,0.57813 0.0625,0.75 0.0312,0.15625 0.15625,0.125 h 0.15625 l 0.70312,-0.0312 q 0.23438,-0.0937 0.10938,0.48438 -0.10938,0.5625 -0.54688,1.75 -0.48437,0.73437 -1.01562,1.29687 -0.51563,0.5625 -0.6875,0.5625 z m 8.84204,-3.14062 q -0.57813,-0.0937 -1.0625,-0.64063 -0.48438,-0.54687 -0.48438,-1.25 0,-0.76562 0.96875,-3.9375 0.15625,-0.82812 0.70313,-2.14062 0.1875,-0.54688 0.21875,-0.64063 0.57812,-1.92187 1.57812,-4.70312 l 0.28125,-0.76563 q 1.125,-3.04687 1.34375,-3.71875 l 0.51563,-1.40625 q 0,-0.25 0.35937,-0.48437 0.375,-0.25 0.625,-0.25 0.51563,0 0.98438,0.67187 0.48437,0.67188 0.51562,1.40625 0,0.39063 -0.15625,0.70313 -0.73437,1.60937 -1.82812,4.96875 -0.46875,1.40625 -0.95313,2.67187 -0.48437,1.26563 -0.76562,1.8125 l -0.35938,0.70313 -0.4375,1.6875 q 0,0.0937 -0.0312,0.17187 -0.0312,0.0625 -0.0312,0.0937 0,0.0937 0.0312,0.0937 l 0.53125,-0.54687 q 0.9375,-0.92188 2.21875,-1.89063 1.04688,-0.73437 1.07813,-0.82812 0.0312,-0.0937 0.26562,-0.29688 l 0.3125,-0.25 q 1.03125,-0.73437 2.21875,-0.73437 0.34375,0 0.82813,0.35937 0.48437,0.34375 0.8125,0.96875 0.34375,0.625 0.34375,1.32813 v 0.25 0.23437 q 0,0.79688 -0.3125,2.09375 -0.29688,1.29688 -0.32813,1.48438 0.15625,0 0.46875,-0.0625 0.48438,0 0.6875,0.0937 0.21875,0.0937 0.375,0.26562 0.15625,0.15625 0.28125,0.1875 0.0625,0.0625 0.0625,0.1875 0,0.73438 -0.79687,1.28125 -0.79688,0.54688 -1.53125,0.60938 -0.89063,0 -1.4375,-0.67188 -0.54688,-0.67187 -0.54688,-1.64062 0,-0.375 0.0312,-0.75 
0.0312,-0.39063 0.0625,-0.6875 0.0937,-0.64063 0.0937,-1.01563 0,-0.42187 -0.0781,-0.5625 -0.0781,-0.14062 -0.26563,-0.14062 -0.64062,0 -2.23437,1.17187 -1.57813,1.15625 -3.20313,3.01563 -1.03125,1.5 -1.95312,1.5 z m 17.8548,1.21875 q -1.82813,0 -2.95313,-1.01563 -1.10937,-1.03125 -1.10937,-2.92187 0,-0.73438 0.0937,-1.625 0.0625,-0.78125 0.625,-2.125 0.5625,-1.34375 1.53125,-2.60938 0.96875,-1.28125 2.28125,-1.89062 0.60937,-0.29688 1.03125,-0.40625 0.42187,-0.10938 0.95312,-0.10938 1.3125,0 2.125,0.48438 0.82813,0.46875 1.15625,1.07812 0.34375,0.60938 0.34375,1.0625 v 0.15625 0.0937 q 0,0.39063 -0.45312,1.42188 -0.45313,1.01562 -1.42188,1.92187 -0.96875,0.89063 -2.51562,0.98438 -0.76563,0 -1.48438,-0.14063 -0.71875,-0.15625 -0.90625,-0.15625 -0.48437,0 -0.60937,0.35938 -0.125,0.34375 -0.125,1.1875 0.0937,0.89062 0.48437,1.25 0.40625,0.34375 1.17188,0.34375 0.39062,0 0.89062,-0.21875 0.51563,-0.26563 1.21875,-0.8125 0.39063,-0.3125 0.60938,-0.4375 0.21875,-0.125 0.54687,-0.125 0.28125,0 0.67188,0.39062 0.39062,0.375 0.42187,0.71875 -0.0781,0.39063 -0.71875,1.15625 -0.625,0.76563 -1.65625,1.375 -1.03125,0.60938 -2.20312,0.60938 z m 1.10937,-7.71875 q 0.60938,0 0.84375,-0.125 0.3125,-0.15625 0.96875,-0.8125 0.65625,-0.65625 0.6875,-0.90625 0,-0.32813 -0.28125,-0.53125 -0.26562,-0.20313 -0.73437,-0.28125 h -0.35938 q -0.42187,0 -1.20312,0.70312 -0.78125,0.6875 -1.35938,1.57813 0.0312,0.15625 0.48438,0.26562 0.45312,0.10938 0.95312,0.10938 z m 16.38513,7.90625 q -0.98437,0 -1.85937,-0.4375 -0.85938,-0.45313 -1.375,-1.14063 -0.5,-0.6875 -0.5,-1.35937 0,-0.54688 0.4375,-0.54688 0.29687,0.0625 0.67187,0.35938 0.48438,0.28125 0.8125,0.4375 0.34375,0.15625 0.70313,0.15625 0.53125,0 0.9375,-0.17188 0.40625,-0.17187 0.46875,-0.53125 0,-0.125 -0.14063,-0.54687 -0.125,-0.4375 -0.3125,-0.79688 -0.70312,-1.04687 -0.95312,-2.04687 -0.23438,-1.01563 -0.29688,-2.35938 0,-0.73437 1.0625,-1.75 1.07813,-1.03125 2.5,-1.82812 1.4375,-0.8125 2.32813,-0.90625 1.76562,0 2.10937,0.29687 
0.45313,0.34375 0.82813,1.14063 0.39062,0.79687 0.39062,1.34375 0,0.23437 -0.21875,0.57812 -0.20312,0.32813 -0.26562,0.42188 l -0.28125,0.4375 -0.21875,0.48437 q -0.0781,0.15625 -0.21875,0.48438 -0.14063,0.3125 -0.25,0.39062 -0.10938,0.0625 -0.34375,0.0156 -0.53125,0 -0.98438,-0.46875 -0.45312,-0.48438 -0.45312,-1.0625 0,-0.40625 0.28125,-0.82813 0.29687,-0.42187 0.29687,-0.48437 0,-0.125 -0.28125,-0.125 -0.42187,0 -1.29687,0.42187 -0.85938,0.40625 -1.53125,1.09375 -0.67188,0.6875 -0.67188,1.53125 0,0.4375 0.17188,1.03125 0.1875,0.59375 0.40625,0.98438 0.26562,0.25 0.75,1.51562 0.5,1.26563 0.5,2 0,1.125 -0.84375,1.70313 -0.84375,0.5625 -2.35938,0.5625 z m 11.04377,-0.64063 q -0.3125,0.0625 -0.57813,0.0625 -0.40625,0 -0.76562,-0.0781 -0.34375,-0.0781 -0.34375,-0.20313 0,-0.0937 -0.0469,-0.125 -0.0312,-0.0312 -0.125,-0.0312 -0.0937,-0.0937 -0.4375,-0.32812 -0.32812,-0.25 -0.54687,-0.34375 -0.39063,-0.54688 -0.39063,-1.1875 0,-0.60938 0.39063,-1.95313 0.0312,-1.25 0.90625,-2.79687 0.875,-1.54688 2.125,-2.875 1.25,-1.32813 2.25,-1.90625 0.46875,-0.48438 1.34375,-0.48438 0.67187,0 1.1875,0.28125 0.40625,0.26563 1.03125,1.09375 0.625,0.82813 0.9375,1.5 0.0781,0.23438 0.1875,0.625 0.10937,0.375 0.10937,0.92188 l -0.0625,0.89062 q -0.39062,1.3125 -0.39062,2.35938 -0.0625,0.45312 0.0625,0.73437 0.14062,0.26563 0.35937,0.26563 0.15625,0 0.34375,-0.0937 0.20313,-0.0937 0.45313,-0.0937 0.42187,0 0.70312,0.25 0.29688,0.25 0.29688,0.64062 0,0.3125 -0.17188,0.60938 -0.28125,0.64062 -1.0625,1.15625 -0.78125,0.5 -1.45312,0.5 -0.54688,0 -1.04688,-0.5625 -0.5,-0.57813 -0.78125,-1.51563 l -0.1875,-0.73437 -1.15625,1.0625 -0.73437,0.73437 q -0.51563,0.51563 -0.57813,0.60938 -0.15625,0.25 -0.5625,0.48437 -0.40625,0.21875 -1.26562,0.53125 z m 0.85937,-3.0625 q 0.57813,-0.32812 0.96875,-0.79687 0.40625,-0.48438 1.01563,-1.34375 0.14062,-0.20313 0.26562,-0.40625 0.125,-0.20313 0.3125,-0.4375 0.54688,-0.73438 0.78125,-1.17188 0.25,-0.45312 0.375,-0.96875 -0.125,-0.42187 -0.375,-0.78125 
-0.23437,-0.375 -0.42187,-0.40625 -0.90625,0 -2.4375,2.10938 -0.15625,0.23437 -0.5,0.8125 -0.32813,0.5625 -0.42188,0.76562 -0.17187,0.21875 -0.4375,1.07813 -0.26562,0.84375 -0.26562,1.09375 0,0.70312 0.40625,0.70312 0.32812,0 0.73437,-0.25 z m 25.32318,0.92188 q 0.21875,0.32812 0.21875,0.60937 0,0.32813 -0.23437,0.57813 -0.21875,0.25 -0.375,0.25 -0.76563,0 -1.71875,-0.35938 -0.9375,-0.375 -0.9375,-1.07812 v -1.98438 q 0,-1.28125 -0.21875,-1.46875 -0.48438,-0.25 -1.28125,0.71875 -0.79688,0.96875 -1.95313,2.89063 -0.45312,0.70312 -0.78125,1.01562 -0.3125,0.29688 -0.5625,0.29688 -0.21875,0 -0.39062,-0.125 -0.98438,-0.28125 -0.98438,-1.125 0,-0.0937 0.0625,-0.40625 0.28125,-1 0.60938,-2.20313 0.34375,-1.21875 0.54687,-2.09375 -0.0937,-0.1875 -0.32812,-0.1875 -0.3125,0 -0.73438,0.32813 -0.125,0.0937 -0.46875,0.29687 -0.32812,0.20313 -0.67187,0.54688 -0.32813,0.32812 -0.6875,0.90625 -0.125,0.23437 -0.4375,0.67187 -1.60938,2.4375 -1.95313,2.75 -0.17187,0.45313 -0.3125,0.64063 -0.14062,0.17187 -0.32812,0.17187 -0.125,0 -0.32813,-0.10937 -0.60937,-0.28125 -0.98437,-0.8125 -0.35938,-0.53125 -0.35938,-1.26563 0,-0.46875 0.28125,-1.90625 0.29688,-1.45312 0.65625,-2.70312 0.25,-0.5 0.39063,-1.0625 0.14062,-0.5625 0.14062,-0.78125 0.10938,-0.60938 0.39063,-1.64063 0.28125,-1.04687 0.42187,-1.15625 0.1875,-0.1875 0.70313,-0.1875 0.34375,0 0.65625,0.14063 0.32812,0.125 0.45312,0.375 0.42188,0.48437 0.42188,1.3125 0,0.76562 -0.3125,1.625 -0.32813,0.73437 -0.8125,2.15625 0,0 0.0625,-0.0781 0.78125,-1.07812 1.98437,-2.42187 0.57813,-0.64063 1.29688,-0.98438 0.71875,-0.35937 1.17187,-0.35937 0.76563,0.125 1.29688,0.75 0.53125,0.625 0.65625,1.70312 0,0.39063 -0.0625,0.95313 -0.0625,0.5625 -0.0937,0.71875 0.125,-0.15625 0.65625,-0.78125 0.53125,-0.625 1.29687,-1.29688 0.70313,-0.54687 1.3125,-0.54687 0.5,0 0.89063,0.42187 0.70312,0.40625 1.09375,1.34375 0.40625,0.92188 0.40625,2.10938 0,0.40625 -0.0312,0.64062 -0.0312,0.21875 -0.0312,0.64063 0,0.4375 0.0937,0.84375 0.10937,0.40625 
0.20312,0.71875 z m 4.44238,2.3125 q -1.0625,-0.42188 -1.5625,-1.29688 -0.48437,-0.89062 -0.48437,-2.29687 0,-0.3125 0.0625,-1.15625 0.0312,-0.25 0.0312,-0.59375 0,-0.54688 -0.125,-0.8125 -0.125,-0.28125 -0.45313,-0.57813 -0.0937,-0.0937 -0.45312,-0.40625 -0.34375,-0.32812 -0.3125,-0.48437 v -0.0937 q 0,-0.21875 0.10937,-0.48438 0.10938,-0.28125 0.21875,-0.34375 0.0937,-0.0312 0.375,-0.1875 0.29688,-0.17187 0.45313,-0.10937 0.125,0 0.39062,0.17187 0.28125,0.15625 0.28125,0.28125 0.1875,0.39063 0.98438,-1.25 0.45312,-0.67187 1.21875,-1.32812 0.76562,-0.65625 1.40625,-1 0.23437,-0.10938 1.01562,-0.28125 0.78125,-0.17188 1.23438,-0.17188 0.5,0 1.1875,0.42188 0.6875,0.40625 1.14062,0.89062 0.67188,0.48438 0.67188,1.82813 0,0.79687 -0.25,1.375 -0.125,0.46875 -1.26563,1.45312 -1.14062,0.96875 -1.84375,1.26563 -0.70312,0.15625 -1.5625,0.15625 -0.875,0 -1.1875,-0.15625 -0.21875,-0.125 -0.375,0.10937 -0.14062,0.23438 -0.23437,1.07813 l -0.0312,0.3125 q 0,0.35937 0.15625,0.67187 0.23437,0.39063 0.42187,0.54688 0.1875,0.15625 0.57813,0.15625 l 0.28125,-0.0312 q 0.35937,-0.0312 0.75,-0.34375 0.40625,-0.32813 0.95312,-0.875 0.57813,-0.54688 0.67188,-0.64063 0.15625,-0.3125 0.85937,-0.3125 0.28125,0 0.73438,0.0937 0.32812,0.25 0.32812,0.64063 0,0.54687 -0.51562,1.46875 -0.15625,0.32812 -0.98438,1.09375 -0.8125,0.76562 -1.29687,1.04687 -0.85938,0.48438 -1.90625,0.48438 -0.8125,0 -1.67188,-0.3125 z m 2.375,-7.20313 q 0.76563,-0.20312 1.1875,-0.46875 0.4375,-0.26562 1.01563,-1.07812 0.25,-0.76563 0.0937,-0.92188 -0.1875,-0.1875 -0.45313,-0.1875 -0.3125,0 -0.76562,0.34375 -0.3125,0.0937 -0.875,0.59375 -0.5625,0.5 -0.95313,0.96875 -0.375,0.45313 -0.21875,0.54688 -0.0625,0.10937 0.3125,0.15625 0.39063,0.0469 0.65625,0.0469 z"
id="path4094"
inkscape:connector-curvature="0"
style="fill:#000000;fill-rule:nonzero" />
<path
d="M 427.809,138.3307 H 553.08723 V 253.13125 H 427.809 Z"
id="path4100"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="M 379.751,138.43701 H 505.04235 V 253.24016 H 379.751 Z"
id="path4109"
inkscape:connector-curvature="0"
style="fill:#00fdc8;fill-rule:evenodd" />
<path
d="M 379.751,138.43701 H 505.04235 V 253.24016 H 379.751 Z"
id="path4111"
inkscape:connector-curvature="0"
style="fill-rule:evenodd;stroke:#666666;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round" />
<path
d="m 299.38583,404.37924 h 70.36065 v -6.68504 l 8.40525,13.37008 -8.40525,13.37007 v -6.68502 h -70.36065 z"
id="path4071-3"
inkscape:connector-curvature="0"
style="fill:#7c7ce0;fill-rule:evenodd;stroke-width:0.79288208" />
<path
d="m 506.71874,190.09462 h 70.36065 v -6.68504 l 8.40525,13.37008 -8.40525,13.37007 v -6.68502 h -70.36065 z"
id="path4071-6"
inkscape:connector-curvature="0"
style="fill:#7c7ce0;fill-rule:evenodd;stroke-width:0.79288208" />
<path
d="m 506.71875,404.41666 h 70.36065 v -6.68504 l 8.40525,13.37008 -8.40525,13.37007 v -6.68502 h -70.36065 z"
id="path4071-6-7"
inkscape:connector-curvature="0"
style="fill:#7c7ce0;fill-rule:evenodd;stroke-width:0.79288208" />
<flowRoot
xml:space="preserve"
id="flowRoot4272"
style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:26.66666603px;line-height:1.25;font-family:'Comic Sans MS';-inkscape-font-specification:'Comic Sans MS';text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none"><flowRegion
id="flowRegion4274"
style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:26.66666603px;font-family:'Comic Sans MS';-inkscape-font-specification:'Comic Sans MS';text-align:center;text-anchor:middle"><rect
id="rect4276"
width="125.97456"
height="90.782555"
x="379.06781"
y="162.45763"
style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:26.66666603px;font-family:'Comic Sans MS';-inkscape-font-specification:'Comic Sans MS';text-align:center;text-anchor:middle" /></flowRegion><flowPara
id="flowPara4278">Some process</flowPara></flowRoot> <g
transform="matrix(0.45555722,0,0,0.45555774,379.751,138.43701)"
id="g4082-1"
inkscape:transform-center-x="-34.322034"
inkscape:transform-center-y="102.9661">
<clipPath
id="clipPath4292">
<path
d="M 0,0 H 275 V 252 H 0 Z"
id="path4290"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
<image
         clip-path="url(#clipPath4292)"
width="275"
height="252"
x="0"
y="0"
preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/magic.gif"
id="image4080-3"
style="fill:#000000" />
</g>
<rect
style="fill:#ffffff"
id="rect148"
width="28.983051"
height="45.762711"
x="212.03391"
y="388.22034" />
</g>
</svg>
</center></p>
<p>Let's now translate that into a formal definition. <a name="definition"></a></p>
<p>A process <span class="math">\(A\)</span> is <span class="math">\(\varepsilon\)</span>-differentially private if for all databases <span class="math">\(D_1\)</span>
and <span class="math">\(D_2\)</span> which differ in only one individual:</p>
<div class="math">$$
\mathbb{P}\left[A(D_1)=O\right] \le e^\varepsilon\cdot\mathbb{P}\left[A(D_2)=O\right]
$$</div>
<p style="text-indent: 0em">… and this must be true for all
possible outputs <span class="math">\(O\)</span>. Let's unpack this.</p>
<p><span class="math">\(\mathbb{P}\left[A(D_1)=O\right]\)</span> is the probability that when you run the process <span class="math">\(A\)</span> on
the database <span class="math">\(D_1\)</span>, the output is <span class="math">\(O\)</span>. This process is probabilistic: if you run
it several times, it might give you different answers. A typical process might
be: "count the people with blue eyes, add some random number to this count, and
return this sum". Since the random number changes every time you run the
process, the results will vary.</p>
<p><span class="math">\(e^\varepsilon\)</span> is the <a href="https://en.wikipedia.org/wiki/Exponential_function">exponential function</a> applied to the parameter
<span class="math">\(\varepsilon>0\)</span>. If <span class="math">\(\varepsilon\)</span> is very close to 0, then <span class="math">\(e^\varepsilon\)</span> is very close
to 1, so the probabilities are very similar. The bigger <span class="math">\(\varepsilon\)</span> is, the more
the probabilities can differ.</p>
<p>Of course, the definition is symmetrical: you can replace <span class="math">\(D_1\)</span> by <span class="math">\(D_2\)</span> and
vice-versa, and the two databases will still differ in only one individual. So
we could replace it by:</p>
<div class="math">$$
e^{-\varepsilon}\cdot\mathbb{P}\left[A(D_2)=O\right]
\le \mathbb{P}\left[A(D_1)=O\right] \le
e^\varepsilon\cdot\mathbb{P}\left[A(D_2)=O\right]
$$</div>
<p>Thus, this formula means that the output of the process is similar if you change
or remove the data of one person. The degree of similarity depends on
<span class="math">\(\varepsilon\)</span>: the smaller it is, the more similar the outputs are.</p>
<p>What does this similarity have to do with privacy? First, I'll explain this with
an intuitive example. Then, I'll formalize this idea with a more generic
interpretation.</p>
<h1 id="a-simple-example-randomized-response">A simple example: randomized response <a name="rr"></a></h1>
<p>Suppose you want to run a survey to find out how many people use illegal drugs.
If you naively go out and ask people whether they're using illegal drugs, many
will lie to you. So you devise the following mechanism. The participants no
longer directly answer the question "have you consumed illegal drugs in the past
week?". Instead, each of them will flip a coin, without showing it to you.</p>
<ul>
<li>On heads, the participant tells the truth (<em>Yes</em> or <em>No</em>).</li>
<li>On tails, they flip a second coin. If the second coin lands on heads, they
answer <em>Yes</em>. Otherwise, they answer <em>No</em>.</li>
</ul>
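<p>The two coin flips above are easy to simulate. Here is a minimal Python sketch
of the mechanism (my own illustration, not code from this article; the function
name is made up):</p>

```python
import random

def randomized_response(truth: bool) -> bool:
    """Answer a sensitive yes/no question with plausible deniability."""
    if random.random() < 0.5:      # first coin: heads -> tell the truth
        return truth
    return random.random() < 0.5   # tails: answer with a second coin flip
```

<p>Simulating many respondents recovers the answer probabilities worked out
below: a drug user answers <em>Yes</em> about 75% of the time.</p>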
<p>How is this better for survey respondents? They can now answer <em>Yes</em> without
revealing that they're doing something illegal. When someone answers <em>Yes</em>, you
can't know their true answer for sure. They could be actually doing drugs, but
they might also have answered at random. </p>
<p>Let's compute the probabilities of each answer for a drug user.</p>
<ul>
<li>With probability 50%, they will say the truth and answer <em>Yes</em>.</li>
<li>With probability 50%, they will answer at random.<ul>
<li>They then have another 50% chance to answer <em>Yes</em>, so 25% chance in total.</li>
<li>Similarly, in total, they have a 25% chance to answer <em>No</em>.</li>
</ul>
</li>
</ul>
<p>All in all, we get a 75% chance to answer <em>Yes</em> and a 25% chance to answer <em>No</em>.
For someone who is not doing drugs, the probabilities are reversed: 25% chance
to answer <em>Yes</em> and 75% to answer <em>No</em>. Using the notations from earlier:</p>
<ul>
<li><span class="math">\(\mathbb{P}\left[A(Yes)=Yes\right] = 0.75\)</span>, <span class="math">\(\mathbb{P}\left[A(Yes)=No\right] = 0.25\)</span></li>
<li><span class="math">\(\mathbb{P}\left[A(No)=Yes\right] = 0.25\)</span>, <span class="math">\(\mathbb{P}\left[A(No)=No\right] = 0.75\)</span></li>
</ul>
<p>Now, <span class="math">\(0.75\)</span> is three times larger than <span class="math">\(0.25\)</span>. So if we choose <span class="math">\(\varepsilon\)</span>
such that <span class="math">\(e^\varepsilon=3\)</span> (that's <span class="math">\(\varepsilon\simeq1.1\)</span>), this process is
<span class="math">\(\varepsilon\)</span>-differentially private. So this plausible deniability translates
nicely into the language of differential privacy.</p>
<p>Of course, a differentially private process like this one introduces some
noise into your data. But if you have enough answers, with high probability,
the noise will mostly cancel out. Suppose you have 1000 answers in
total: 400 of them are <em>Yes</em> and 600 are <em>No</em>. About 50% of all 1000 answers are
random, so you can remove 250 answers from each count. In total, you get 150
<em>Yes</em> answers out of 500 non-random answers, so about 30% of <em>Yes</em> overall.</p>
<p>What if you want more privacy? Instead of having the participants say the truth
with probability 50%, you can have them tell the truth 25% of the time. What if
you want less noise instead, at the cost of less protection? Have them tell the
truth 75% of the time. Finding out <span class="math">\(\varepsilon\)</span> and quantifying the noise for
each option is left as an exercise for the reader =)</p>
<h1 id="a-generalization-quantifying-the-attackers-knowledge">A generalization: quantifying the attacker's knowledge</h1>
<p><a name="quantifying"></a></p>
<p>Let's forget about the previous example and consider a more generic scenario. In
line with the <a href="differential-privacy-awesomeness.html">previous article</a>, we will describe this scenario from the
<em>attacker</em>'s perspective. We have a mechanism <span class="math">\(A\)</span> which is
<span class="math">\(\varepsilon\)</span>-differentially private. We run it on some database <span class="math">\(D\)</span>, and
release the output <span class="math">\(A(D)\)</span> to an attacker. Then, the attacker tries to figure out
whether someone (their <em>target</em>) is in <span class="math">\(D\)</span>.</p>
<p>Under differential privacy, the attacker can't gain a lot of information about
their target. And this is true even if this attacker has a lot of knowledge
about the dataset. Let's take the strongest attacker we can think of: they know
the <em>entire database, except their target</em>. This attacker has to determine which
database is the real one, between two options: one with their target in it
(let's call it <span class="math">\(D_{in}\)</span>), the other without (<span class="math">\(D_{out}\)</span>)<sup id="fnref:dbs"><a class="footnote-ref" href="#fn:dbs">1</a></sup>.</p>
<p>So, in the attacker's model of the world, the actual database <span class="math">\(D\)</span> can be either
<span class="math">\(D_{in}\)</span> or <span class="math">\(D_{out}\)</span>. They might have an <em>initial suspicion</em> that their target
is in the database. This suspicion is represented by a <em>probability</em>,
<span class="math">\(\mathbb{P}\left[D=D_{in}\right]\)</span>. This probability can be anything between <span class="math">\(0\)</span> and <span class="math">\(1\)</span>.
Say, <span class="math">\(0.9\)</span> if the attacker's suspicion is strong, <span class="math">\(0.01\)</span> if they think it's very
unlikely, <span class="math">\(0.5\)</span> if they have no idea… Similarly, their suspicion that their
target is <em>not</em> in the dataset is also a probability, <span class="math">\(\mathbb{P}\left[D=D_{out}\right]\)</span>.
Since there are only two options,
<span class="math">\(\mathbb{P}\left[D=D_{out}\right]=1-\mathbb{P}\left[D=D_{in}\right]\)</span>.</p>
<p>Now, suppose the attacker sees that the mechanism returns output <span class="math">\(O\)</span>. How much
information did the attacker gain? This is captured by looking at how much their
suspicion changed after seeing this output. In mathematical terms, we have to
compare <span class="math">\(\mathbb{P}\left[D=D_{in}\right]\)</span> with the updated suspicion
<span class="math">\(\mathbb{P}\left[D=D_{in}\mid A(D)=O\right]\)</span>. This updated suspicion is the attacker's
model of the world after seeing <span class="math">\(O\)</span>.</p>
<p>With differential privacy, the updated probability is never too far from the
initial suspicion. And we can quantify this phenomenon exactly. For example,
with <span class="math">\(\varepsilon=1.1\)</span>, here is what the upper and lower bounds look like.</p>
<p><center>
<img alt="Graph showing the bounds on the posterior as a function of the prior" src="https://desfontain.es/privacy/images/dp-bounds-graph.svg">
</center></p>
<p>The black line is what happens if the attacker didn't get their suspicion
updated at all. The blue lines are the lower and upper bounds on the updated
suspicion: it can be anywhere between the two. We can visualize the example
mentioned in the <a href="differential-privacy-awesomeness.html">previous article</a>: for an initial suspicion of 50%, the
updated suspicion is approximately between 25% and 75%.</p>
<p>How do we prove that these bounds hold? We'll need a result from probability
theory, and some basic arithmetic manipulation. I reproduced the proof as simply
as I could, but you still don't have to read it. If you want to, click here:
<button id="toggleProof"></button></p>
<div id="proof" style="display: none; border-left: double; padding-left: 10px">
<p>The proof is based on a theorem called Bayes' rule. Explaining the full
intuition behind this theorem is a bit out of scope for this post. If you want
to understand what it says and why it works, I recommend you read <a href="https://arbital.com/p/bayes_rule/?l=1zq">this
guide</a>. If you don't, just trust me: this theorem allows us to rephrase
the updated suspicion in other terms.</p>
<p>
<div class="math">$$
\mathbb{P}\left[D=D_{in}\mid A(D)=O\right]=\frac{\mathbb{P}\left[D=D_{in}\right]\cdot\mathbb{P}\left[A(D)=O\mid D=D_{in}\right]}{\mathbb{P}\left[A(D)=O\right]}
$$</div>
</p>
<p>Let's interpret each of these terms. We recognize <span class="math">\(\mathbb{P}\left[D=D_{in}\right]\)</span>, that's
the initial suspicion of the attacker. <span class="math">\(\mathbb{P}\left[A(D)=O\mid D=D_{in}\right]\)</span> is the
probability of getting output <span class="math">\(O\)</span> from database <span class="math">\(D_{in}\)</span>, which we can simplify
into <span class="math">\(\mathbb{P}\left[A\left(D_{in}\right)=O\right]\)</span>. Finally, <span class="math">\(\mathbb{P}\left[A(D)=O\right]\)</span> is the
probability that we get the output <span class="math">\(O\)</span>, in the attacker's model of the world.</p>
<p>This last term is icky. We don't know its value. So let's make it disappear by
considering the <em>ratio</em> between the two updated probabilities
<span class="math">\(\mathbb{P}\left[D=D_{in}\mid A(D)=O\right]\)</span> and <span class="math">\(\mathbb{P}\left[D=D_{out}\mid A(D)=O\right]\)</span>. The
icky term will nicely go away:</p>
<p>
<div class="math">$$
\frac{\mathbb{P}\left[D=D_{in}\mid A(D)=O\right]}{\mathbb{P}\left[D=D_{out}\mid A(D)=O\right]} =
\frac{\mathbb{P}\left[D=D_{in}\right]}{\mathbb{P}\left[D=D_{out}\right]}
\cdot\frac{\mathbb{P}\left[A\left(D_{in}\right)=O\right]}{\mathbb{P}\left[A\left(D_{out}\right)=O\right]}
$$</div>
</p>
<p>Note that this isn't a meaningless math trick. This ratio of probabilities
actually has a simple interpretation: it's what gamblers call <em>betting odds</em>.
For example, on betting websites, the odds for the France v. Croatia game in the
2018 World Cup were 2:1. This means that according to bookies, the probability
for France to win was twice as high as Croatia's. This corresponds to
probabilities of about 67% and 33%, respectively.</p>
<p>Anyway, look! The two terms <span class="math">\(\mathbb{P}\left[A\left(D_{in}\right)=O\right]\)</span> and
<span class="math">\(\mathbb{P}\left[A\left(D_{out}\right)=O\right]\)</span> are the ones from the differential privacy
definition. So we know that their ratio is bounded:</p>
<p>
<div class="math">$$
e^{-\varepsilon} \le
\frac{\mathbb{P}\left[A\left(D_{in}\right)=O\right]}{\mathbb{P}\left[A\left(D_{out}\right)=O\right]}
\le e^\varepsilon
$$</div>
</p>
<p>If we plug this into the previous formula, we get a nice relation:</p>
<p>
<div class="math">$$
e^{-\varepsilon}\cdot\frac{\mathbb{P}\left[D=D_{in}\right]}{\mathbb{P}\left[D=D_{out}\right]} \le
\frac{\mathbb{P}\left[D=D_{in}\mid A(D)=O\right]}{\mathbb{P}\left[D=D_{out}\mid A(D)=O\right]} \le
e^\varepsilon\cdot\frac{\mathbb{P}\left[D=D_{in}\right]}{\mathbb{P}\left[D=D_{out}\right]}
$$</div>
</p>
<p>This relation is a reformulation of differential privacy. The original
definition said that the probability distributions of outputs are similar. This
relation says that the <em>odds</em> don't change too much after looking at the output.
And the two formulations are equivalent: you could write the same proof in the
other direction.</p>
<p>But back to our proof. All we need to do now is replace <span class="math">\(\mathbb{P}\left[D=D_{out}\right]\)</span>
with <span class="math">\(1-\mathbb{P}\left[D=D_{in}\right]\)</span>, do the same for
<span class="math">\(\mathbb{P}\left[D=D_{out}\mid A\left(D\right)=O\right]\)</span>, and solve for
<span class="math">\(\mathbb{P}\left[D=D_{in}\mid A\left(D\right)=O\right]\)</span>. You end up with the
following bounds:</p>
<p>
<div class="math">$$
\frac{\mathbb{P}\left[D=D_{in}\right]}{e^{\varepsilon}+\left(1-e^{\varepsilon}\right)\cdot\mathbb{P}\left[D=D_{in}\right]}
\leq \mathbb{P}\left[D=D_{in}\mid A\left(D\right)=O\right] \leq
\frac{e^{\varepsilon}\cdot\mathbb{P}\left[D=D_{in}\right]}{1+\left(e^{\varepsilon}-1\right)\cdot\mathbb{P}\left[D=D_{in}\right]}
$$</div>
</p>
<p style="text-indent: 0em">which you can plot using your favorite software.</p>
</div>
<p>What does this look like for various values of <span class="math">\(\varepsilon\)</span>? We can draw a
generalization of this graph with pretty colors:</p>
<p><center>
<img alt="Graph showing the bounds on the posterior as a function of the prior for many values of ε" src="https://desfontain.es/privacy/images/dp-contour-graph.png">
</center></p>
<p>For larger values of <span class="math">\(\varepsilon\)</span>, this gets scary quite fast. Let's say you're
using <span class="math">\(\varepsilon=5\)</span>. Then, an attacker can go from a small suspicion (say,
10%) to a very high degree of certainty (94%).</p>
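<p>The bounds behind these graphs are easy to evaluate numerically. Here is a
small Python sketch of my own (the function name is made up) that computes
them from the formula in the proof:</p>

```python
import math

def posterior_bounds(prior: float, epsilon: float) -> tuple[float, float]:
    """Lower and upper bounds on the attacker's updated suspicion
    P[D = D_in | A(D) = O], given their initial suspicion P[D = D_in]."""
    e = math.exp(epsilon)
    lower = prior / (e + (1 - e) * prior)
    upper = e * prior / (1 + (e - 1) * prior)
    return lower, upper
```

<p>For example, <code>posterior_bounds(0.5, 1.1)</code> gives roughly (25%, 75%), and
<code>posterior_bounds(0.1, 5)</code> gives an upper bound of about 94%, matching the
figures above.</p>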
<h1 id="what-about-composition">What about composition?</h1>
<p>In the previous section, I formalized two claims I made in my <a href="differential-privacy-awesomeness.html">last article</a>.
First, I explained what it means to <em>quantify</em> information gain. Second, I
picked an attacker with <em>full background knowledge</em>. If the attacker knows less
information in the first place, the bounds we showed still hold.</p>
<p>What about the third claim? I said that differential privacy was <em>composable</em>.
Suppose that two algorithms <span class="math">\(A\)</span> and <span class="math">\(B\)</span> are <span class="math">\(\varepsilon\)</span>-differentially
private. We want to prove that publishing the result of both is
<span class="math">\(2\varepsilon\)</span>-differentially private. Let's call <span class="math">\(C\)</span> the algorithm which
combines <span class="math">\(A\)</span> and <span class="math">\(B\)</span>: <span class="math">\(C(D)=\left(A(D),B(D)\right)\)</span>. The output of this
algorithm will be a pair of outputs: <span class="math">\(O=\left(O_A,O_B\right)\)</span>.</p>
<p>The insight is that the two algorithms are <em>independent</em>. They each have their
own randomness, so the result of one does not impact the result of the other.
This allows us to simply write:</p>
<div class="math">$$
\begin{align*}
\mathbb{P}\left[C\left(D_{1}\right)=O\right] & =\mathbb{P}\left[A\left(D_{1}\right)=O_{A}\right]\cdot\mathbb{P}\left[B\left(D_{1}\right)=O_{B}\right]\\
& \leq e^{2\varepsilon}\cdot\mathbb{P}\left[A\left(D_{2}\right)=O_{A}\right]\cdot\mathbb{P}\left[B\left(D_{2}\right)=O_{B}\right]\\
& \leq e^{2\varepsilon}\cdot\mathbb{P}\left[C\left(D_{2}\right)=O\right]
\end{align*}
$$</div>
<p style="text-indent: 0em">so <span class="math">\(C\)</span> is <span class="math">\(2\varepsilon\)</span>-differentially
private.</p>
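<p>To see this bound in action, here is a small numeric check of my own, reusing
the randomized-response probabilities from earlier. Running that mechanism
twice (with <span class="math">\(e^\varepsilon=3\)</span>) is exactly such a composed mechanism <span class="math">\(C\)</span>, and the
worst-case probability ratio reaches <span class="math">\(e^{2\varepsilon}=9\)</span>:</p>

```python
import math

# Probability that one run of randomized response (eps = ln 3) outputs
# Yes, depending on the respondent's true answer.
def p_yes(true_answer: bool) -> float:
    return 0.75 if true_answer else 0.25

# C runs the mechanism twice, independently, so the probability of the
# output pair (Yes, Yes) is a product of individual probabilities.
p_pair_given_yes = p_yes(True) * p_yes(True)   # 0.5625
p_pair_given_no = p_yes(False) * p_yes(False)  # 0.0625

# The ratio is exactly e^(2*eps) = 9, as the composition proof predicts.
ratio = p_pair_given_yes / p_pair_given_no
```
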
<h1 id="future-steps">Future steps</h1>
<p>I hope that I convinced you that differential privacy can be an excellent way to
protect your data (if your <span class="math">\(\varepsilon\)</span> is low). Now, if everything is going
according to my master plan, you should be like… "This is awesome! I want to use
it everywhere! How do I do that?"</p>
<p>I have good news for you: this blog post has sequels. Head over to the <a href="friendly-intro-to-differential-privacy.html">table of
contents</a> of this series to decide which one you want to read next!</p>
<hr>
<p><small>Thanks to Chao Li for introducing me to the Bayesian interpretation of
differential privacy, and to <a href="http://a3nm.net/">a3nm</a>, <a href="https://virgile.anbuco.fr/">Armavica</a>, <a href="http://www.normalesup.org/~bouya/">immae</a> and <a href="https://pablo.rauzy.name/">p4bl0</a> for their
helpful comments on drafts of this article (as well as previous ones).</small></p>
<script type="text/javascript">
var button = document.getElementById('toggleProof');
var defaultButton = 'Show me the proof';
button.innerHTML = defaultButton;
button.addEventListener('click', function (event) {
button.innerHTML = button.innerHTML == defaultButton ? 'Hide the proof' : defaultButton;
var proof = document.getElementById('proof');
proof.style.display = proof.style.display == 'none' ? 'block' : 'none';
});
</script>
<div class="footnote">
<hr>
<ol>
<li id="fn:dbs">
<p>This can mean that <span class="math">\(D_{out}\)</span> is the same as <span class="math">\(D_{in}\)</span> with one fewer
user. This can also mean that <span class="math">\(D_{out}\)</span> is the same as <span class="math">\(D_{in}\)</span>, except one
user has been changed to some arbitrary other user. This distinction doesn't
change the reasoning, so we can simply ignore it. <a class="footnote-backref" href="#fnref:dbs" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Why differential privacy is awesome2018-07-30T00:00:00+02:002021-09-27T00:00:00+02:00Damien Desfontainestag:desfontain.es,2018-07-30:/privacy/differential-privacy-awesomeness.html<p>A high-level, non-technical explanation of differential privacy and its advantages.</p><p><small>
<span class='notlettrine'>T</span>his is the first blog post in a <a href="friendly-intro-to-differential-privacy.html">series about
differential privacy</a>. Check out the <a href="friendly-intro-to-differential-privacy.html">table of contents</a> to see
the next articles!</p>
<p></small></p>
<hr>
<p><span class='lettrine'>How</span> to publish data about people while protecting
their privacy? This question is far from new. Statistical agencies have grappled
with it for decades. Computer scientists have proposed a <a href="k-anonymity.html">whole</a> <a href="k-map.html">bunch</a> of
<a href="l-diversity.html">creative</a> <a href="delta-presence.html">notions</a> to capture this idea. None of them was very satisfactory,
though: all these notions were shown to be broken in some circumstances. They
were also hard to apply without destroying the utility of the data.</p>
<p>This all changed in 2006, when four researchers introduced <em>differential
privacy</em>. This new notion took a novel approach to defining privacy leakage, one
that would prove much more rigorous and fruitful. So, what makes differential
privacy special? How did it get so successful in academic circles? Why did
governments and tech companies start adopting it for their data publications?</p>
<p>This first article introducing differential privacy will attempt to answer that
question. First, we'll describe the high-level intuition behind this successful
notion. Then, we'll explain <em>why</em> it's so successful: why is it so much more
awesome than all the definitions that came before?</p>
<h1 id="the-core-idea-behind-differential-privacy">The core idea behind differential privacy</h1>
<p>Suppose you have a process that takes some database as input, and returns some
output.</p>
<p><center>
<svg
role="img"
aria-label="Diagram showing an arrow labeled &quot;some process&quot;, going from a fake database to a cloud labeled &quot;output&quot;."
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
version="1.1"
viewBox="0 0 699.45947 142.76904"
stroke-miterlimit="10"
id="svg4123"
sodipodi:docname="no-magic.svg"
inkscape:version="0.92.2pre0 (973e216, 2017-07-25)"
width="699.45947"
height="142.76904"
style="fill:none;stroke:none;stroke-linecap:square;stroke-miterlimit:10">
<title>Diagram showing an arrow labeled "some process", going from a fake database to a cloud labeled "output".</title>
<desc>Diagram showing an arrow labeled "some process", going from a fake database to a cloud labeled "output".</desc>
<metadata
id="metadata4129">
<rdf:RDF>
<cc:Work
rdf:about="">
<dc:format>image/svg+xml</dc:format>
<dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
<dc:title></dc:title>
</cc:Work>
</rdf:RDF>
</metadata>
<defs
id="defs4127">
<clipPath
id="p.3-4">
<path
d="M 0,0 H 275 V 252 H 0 Z"
id="path4077-0"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
</defs>
<sodipodi:namedview
pagecolor="#ffffff"
bordercolor="#666666"
borderopacity="1"
objecttolerance="10"
gridtolerance="10"
guidetolerance="10"
inkscape:pageopacity="0"
inkscape:pageshadow="2"
inkscape:window-width="1266"
inkscape:window-height="1078"
id="namedview4125"
showgrid="false"
inkscape:zoom="1.3111111"
inkscape:cx="364.35056"
inkscape:cy="-64.033451"
inkscape:window-x="0"
inkscape:window-y="0"
inkscape:window-maximized="0"
inkscape:current-layer="g4121"
fit-margin-top="0"
fit-margin-left="0"
fit-margin-right="0"
fit-margin-bottom="0"
viewbox-x="82"
viewbox-width="650" />
<clipPath
id="p.0">
<path
d="M 0,0 H 960 V 720 H 0 Z"
id="path4036"
inkscape:connector-curvature="0"
style="clip-rule:nonzero" />
</clipPath>
<g
clip-path="url(#p.0)"
id="g4121"
transform="translate(-76.08921,-124.34908)">
<path
d="M 0,0 H 960 V 720 H 0 Z"
id="path4039"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="m 76.08921,124.34908 h 223.2992 V 267.11813 H 76.08921 Z"
id="path4041"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<g
transform="matrix(0.22307612,0,0,0.22307664,76.089214,124.34908)"
id="g4048">
<clipPath
id="p.1">
<path
d="M 0,0 H 1001 V 640 H 0 Z"
id="path4043"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
<image
clip-path="url(#p.1)"
width="1001"
height="640"
x="0"
y="0"
preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/quiestce.jpeg"
id="image4046"
style="fill:#000000" />
</g>
<path
d="M 76.086586,339.67847 H 299.38579 V 482.44751 H 76.086586 Z"
id="path4050"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="m 604.81974,178.70273 v 0 c -1.51233,-10.41394 3.45307,-20.72311 12.78925,-26.55296 9.33618,-5.82984 21.40576,-6.15791 31.08722,-0.84502 v 0 c 3.42944,-6.0551 9.70636,-10.23573 16.93207,-11.27733 7.2257,-1.04159 14.55151,1.17818 19.76159,5.98789 v 0 c 2.92145,-5.48995 8.65778,-9.17848 15.1734,-9.75675 6.51569,-0.57824 12.88855,2.03563 16.85718,6.9141 v 0 c 5.27802,-5.81934 13.67554,-8.2695 21.55884,-6.29033 7.8833,1.97917 13.83649,8.03222 15.28363,15.53996 v 0 c 6.46649,1.65271 11.8529,5.85413 14.76764,11.51872 2.91473,5.66462 3.07178,12.23673 0.43066,18.01831 v 0 c 6.36756,7.76532 7.85712,18.11251 3.91272,27.18018 -3.9444,9.06767 -12.7301,15.49351 -23.07849,16.87953 -0.0729,8.51033 -5.05414,16.31931 -13.02356,20.41701 -7.96942,4.09772 -17.68268,3.84427 -25.39587,-0.66263 -3.2854,10.1926 -12.53272,17.6922 -23.74671,19.25873 -11.21405,1.56653 -22.38446,-3.08087 -28.68524,-11.93434 -7.72339,4.3639 -16.99084,5.621 -25.71185,3.48773 -8.72095,-2.13329 -16.16077,-7.47725 -20.64124,-14.8264 v 0 c -7.89233,0.86539 -15.52319,-2.96604 -19.10534,-9.59276 -3.58216,-6.62671 -2.35303,-14.63803 3.07733,-20.05798 v 0 c -7.04022,-3.8826 -10.63251,-11.58696 -8.90375,-19.09555 1.72882,-7.50861 8.38696,-13.11998 16.50256,-13.90799 z"
id="path4059"
inkscape:connector-curvature="0"
style="fill:#f6cd4c;fill-rule:evenodd" />
<path
d="m 597.06497,212.10764 v 0 c 3.32227,1.83222 7.16028,2.66335 10.99872,2.38181 m 5.02759,27.2696 c 1.65063,-0.18098 3.26861,-0.56427 4.81219,-1.13993 m 41.53858,12.47812 c -1.16095,-1.63134 -2.133,-3.37456 -2.89954,-5.19996 m 55.3335,-2.12474 v 0 c 0.59893,-1.85826 0.98706,-3.77081 1.15777,-5.70575 m 37.26019,-14.04791 c 0.0776,-9.06056 -5.41461,-17.3565 -14.11767,-21.32448 m 33.2832,-22.73425 c -1.40942,3.08528 -3.56109,5.8222 -6.28625,7.99617 m -8.91071,-37.53403 v 0 c 0.24011,1.24591 0.35126,2.51055 0.33185,3.77639 m -37.17371,-13.02565 v 0 c -1.31659,1.45164 -2.4013,3.07382 -3.22033,4.81604 m -28.81055,-1.97402 v 0 c -0.70166,1.31851 -1.22559,2.71372 -1.55975,4.15347 m -35.13446,1.13608 v 0 c 2.04858,1.12421 3.94384,2.47731 5.64404,4.02959 m -49.5199,23.36855 v 0 c 0.20844,1.4353 0.53784,2.85292 0.98499,4.23957"
id="path4061"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="m 636.30257,211.8428 h -0.1875 q -0.78125,0 -1.25,-0.48437 -0.46875,-0.48438 -0.46875,-1.17188 0,-0.46875 0.5,-2.92187 l 1.375,-7.10938 q 0.46875,-2.42187 2.0625,-13.4375 l 0.39063,-2.70312 q 0.14062,-1.01563 1.10937,-1.84375 0.96875,-0.84375 1.78125,-0.84375 0.4375,0 2.9375,1.3125 2.70313,1.45312 3.20313,1.59375 5.09375,1.875 8.375,5.6875 3.29687,3.8125 3.29687,8.23437 0,4.04688 -2.40625,7.92188 -2.39062,3.875 -5.78125,5.71875 -3.39062,1.84375 -8.67187,1.84375 -1.53125,0 -3.46875,-0.53125 -1.9375,-0.51563 -2.79688,-1.26563 z m 6,-25.40625 -1.64062,11.20313 -2,10.67187 q 0.0625,0.0469 0.14062,0.0781 1.84375,1.32812 4.375,1.32812 5.01563,0 7.375,-1.79687 2.35938,-1.79688 3.60938,-4.21875 1.25,-2.42188 1.25,-4.92188 0,-2.51562 -1.40625,-4.64062 -1.39063,-2.125 -3.76563,-3.57813 -2.35937,-1.46875 -7.9375,-4.125 z m 32.16406,24.28125 q -4.21875,2.03125 -6.01562,2.03125 -7.29688,0 -7.29688,-6.95312 0,-6.46875 4.07813,-10.46875 4.09375,-4 8.98437,-4 1.89063,0 3.90625,0.96875 2.03125,0.96875 2.03125,2.25 0,0.79687 -0.53125,1.25 -0.39062,0.76562 -1.10937,4.40625 -0.70313,3.64062 -0.70313,5.25 0,1.96875 1.01563,5.42187 l 0.0781,0.3125 q -0.4375,1.82813 -2.17187,1.82813 -0.29688,0 -1.09375,-0.79688 -0.79688,-0.79687 -1.17188,-1.5 z m 1.42188,-15.1875 q -1.21875,-0.73437 -1.95313,-0.73437 -3.25,0 -6.07812,2.95312 -2.82813,2.9375 -2.82813,7.3125 0,4.29688 3.71875,4.29688 2.90625,0 5.46875,-2.15625 0.3125,-8.1875 1.67188,-11.67188 z m 16.45312,-0.34375 q -0.23437,0.9375 -0.60937,2.1875 -1.64063,5.42188 -1.64063,8.26563 0,3.35937 1.5625,3.35937 1.54688,0 2.28125,-0.95312 1.01563,-1.32813 1.875,-1.32813 0.65625,0 1.25,0.48438 0.59375,0.48437 0.59375,1.26562 0,1.71875 -2.15625,2.95313 -2.15625,1.21875 -4.29687,1.21875 -5.03125,0 -5.03125,-6.28125 0,-3.4375 1.54687,-8.39063 0.40625,-1.3125 0.78125,-2.625 -4.25,-0.20312 -4.85937,-0.70312 -0.60938,-0.5 -0.60938,-1.5625 0,-0.85938 0.57813,-1.42188 0.59375,-0.57812 1.60937,-0.57812 l 4.03125,0.32812 q 0.125,-0.51562 0.1875,-1.17187 
0.0312,-0.67188 0.125,-1.48438 l 0.34375,-2.42187 q 0.10938,-0.79688 0.71875,-1.34375 0.625,-0.54688 1.46875,-0.54688 1.64063,0 1.64063,2.625 0,0.76563 -0.23438,1.98438 l -0.39062,2.35937 q 2.60937,-0.32812 2.78125,-0.32812 2.03125,0 2.79687,0.40625 0.76563,0.39062 0.76563,1.46875 0,0.95312 -0.65625,1.54687 -0.65625,0.57813 -1.57813,0.57813 l -1.92187,-0.0781 q -1.45313,0 -2.95313,0.1875 z m 21.44141,15.53125 q -4.21875,2.03125 -6.01563,2.03125 -7.29687,0 -7.29687,-6.95312 0,-6.46875 4.07812,-10.46875 4.09375,-4 8.98438,-4 1.89062,0 3.90625,0.96875 2.03125,0.96875 2.03125,2.25 0,0.79687 -0.53125,1.25 -0.39063,0.76562 -1.10938,4.40625 -0.70312,3.64062 -0.70312,5.25 0,1.96875 1.01562,5.42187 l 0.0781,0.3125 q -0.4375,1.82813 -2.17188,1.82813 -0.29687,0 -1.09375,-0.79688 -0.79687,-0.79687 -1.17187,-1.5 z m 1.42187,-15.1875 q -1.21875,-0.73437 -1.95312,-0.73437 -3.25,0 -6.07813,2.95312 -2.82812,2.9375 -2.82812,7.3125 0,4.29688 3.71875,4.29688 2.90625,0 5.46875,-2.15625 0.3125,-8.1875 1.67187,-11.67188 z"
id="path4063"
inkscape:connector-curvature="0"
style="fill:#000000;fill-rule:nonzero" />
<path
d="m 597.06237,427.43705 v 0 c 3.32227,1.83218 7.16028,2.66333 10.99872,2.38181 m 5.02759,27.26959 c 1.65063,-0.181 3.26861,-0.56427 4.81219,-1.13992 m 41.53858,12.47811 c -1.16095,-1.63134 -2.133,-3.37457 -2.89954,-5.19998 m 55.3335,-2.12472 v 0 c 0.59893,-1.85825 0.98706,-3.77082 1.15777,-5.70575 m 37.26019,-14.04792 c 0.0776,-9.06054 -5.41461,-17.3565 -14.11767,-21.32449 m 33.2832,-22.73425 c -1.40942,3.0853 -3.56109,5.82223 -6.28625,7.99618 m -8.91071,-37.53402 v 0 c 0.24011,1.24591 0.35126,2.51056 0.33185,3.77639 m -37.17371,-13.02566 v 0 c -1.31659,1.45163 -2.4013,3.07382 -3.22033,4.81604 m -28.81055,-1.974 v 0 c -0.70166,1.31848 -1.22559,2.71372 -1.55975,4.15347 m -35.13446,1.13608 v 0 c 2.04858,1.12421 3.94384,2.4773 5.64404,4.02957 m -49.5199,23.36856 v 0 c 0.20844,1.43531 0.53784,2.85291 0.98499,4.23957"
id="path4067"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="m 299.38846,189.04987 h 70.36065 v -6.68504 l 8.40525,13.37008 -8.40525,13.37007 v -6.68502 h -70.36065 z"
id="path4071"
inkscape:connector-curvature="0"
style="fill:#7c7ce0;fill-rule:evenodd;stroke-width:0.79288208" />
<path
d="M 427.88773,353.66272 H 553.16596 V 468.46329 H 427.88773 Z"
id="path4075"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="M 427.809,138.3307 H 553.08723 V 253.13125 H 427.809 Z"
id="path4100"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="M 379.751,138.43701 H 505.04235 V 253.24016 H 379.751 Z"
id="path4109"
inkscape:connector-curvature="0"
style="fill:#00fdc8;fill-rule:evenodd" />
<path
d="M 379.751,138.43701 H 505.04235 V 253.24016 H 379.751 Z"
id="path4111"
inkscape:connector-curvature="0"
style="fill-rule:evenodd;stroke:#666666;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round" />
<path
d="m 506.71874,190.09462 h 70.36065 v -6.68504 l 8.40525,13.37008 -8.40525,13.37007 v -6.68502 h -70.36065 z"
id="path4071-6"
inkscape:connector-curvature="0"
style="fill:#7c7ce0;fill-rule:evenodd;stroke-width:0.79288208" />
<g
aria-label="Some process"
style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:26.66666603px;line-height:1.25;font-family:'Comic Sans MS';-inkscape-font-specification:'Comic Sans MS';text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none"
id="flowRoot4272">
<path
d="m 409.86816,182.0918 q -0.013,-0.54687 0.36459,-0.88542 0.39062,-0.35156 0.95052,-0.35156 0.63802,0 1.11979,0.72917 0.66406,0.98958 1.1849,1.34114 1.10677,0.72917 3.21614,0.72917 2.01823,0 3.75,-0.8724 2.21354,-1.11979 2.21354,-3.11197 0,-1.64063 -1.86198,-2.59115 -1.57552,-0.82031 -3.89322,-0.88542 -2.0573,-0.0781 -3.50261,-0.89843 -1.84896,-1.05469 -1.84896,-3.02084 0,-2.31771 2.63021,-4.23177 2.51302,-1.80989 5.19531,-1.80989 1.26302,0 2.85157,0.42968 2.10937,0.5599 2.10937,1.4323 0,1.14583 -1.17187,1.14583 -0.52084,0 -1.82292,-0.27344 -1.30208,-0.28646 -2.01823,-0.27344 -2.03125,0.026 -3.45052,1.00261 -1.45833,1.0026 -1.45833,2.43489 0,1.08073 1.92708,1.52344 0.71615,0.16927 2.77344,0.3125 2.98177,0.22136 4.66146,1.875 1.53645,1.51042 1.53645,3.6849 0,3.09895 -2.86458,4.94791 -2.48698,1.60157 -5.76823,1.60157 -2.38281,0 -4.36198,-0.91146 -2.42187,-1.11979 -2.46094,-3.07292 z"
style=""
id="path4896" />
<path
d="m 433.30566,186.10222 q -2.1875,0 -3.76302,-1.61459 -1.75781,-1.79687 -1.84896,-4.83072 -0.0781,-2.94271 1.48438,-5.1823 1.82292,-2.64323 5.07812,-2.64323 2.60417,0 3.94532,2.38282 1.0677,1.92708 1.02864,4.63541 -0.0391,2.89063 -1.51041,4.9349 -1.66667,2.31771 -4.41407,2.31771 z m 0.78125,-11.62761 q -1.91406,-0.013 -2.98177,1.82292 -0.84635,1.47135 -0.84635,3.35938 0,1.88802 1.02864,2.9427 0.85938,0.8724 2.01823,0.8724 1.35417,0 2.33073,-0.98958 1.10677,-1.09375 1.17188,-2.95573 0.18229,-5.03907 -2.72136,-5.05209 z"
style=""
id="path4898" />
<path
d="m 458.96973,186.41472 q -1.06771,0 -1.25,-1.13282 -0.23438,-1.43229 -0.46875,-4.03645 l -0.44271,-4.03646 q -0.13021,-0.89844 -0.44271,-1.75781 -0.44271,-1.17188 -0.96354,-1.17188 -0.48177,0 -1.79688,0.82031 -1.25,0.75521 -1.64062,1.21094 -0.013,1.22396 0.0651,2.44792 l 0.39063,3.47656 q 0.24739,2.05729 0.24739,3.39844 0,0.57291 -0.36458,0.9375 -0.35156,0.3776 -0.91146,0.3776 -0.55989,0 -0.92448,-0.3776 -0.35156,-0.36459 -0.35156,-0.9375 0,-1.57552 -0.32552,-4.64844 -0.3125,-3.08594 -0.3125,-4.66146 0,-1.38021 -0.39063,-2.14844 -0.91145,0.41667 -1.84895,1.35417 l -1.53646,1.71875 q -0.20834,0.3125 -0.66406,0.82031 0,0.78125 -0.14323,2.30469 -0.14323,1.52344 -0.14323,2.2526 0,0.39063 0.11718,1.09375 0.13021,0.70313 0.13021,1.01563 0,0.53385 -0.40364,0.89844 -0.39063,0.36458 -0.92448,0.36458 -1.47136,0 -1.47136,-3.34635 0,-0.78125 0.14323,-2.33073 0.14323,-1.54948 0.14323,-2.33073 0,-0.69011 -0.13021,-2.07031 -0.11718,-1.38021 -0.11718,-2.08334 0,-2.95573 1.32812,-2.95573 0.50781,0 0.92448,0.44271 0.42969,0.44271 0.42969,0.96354 0,0.2474 -0.0651,0.76823 -0.0651,0.52084 -0.0651,0.79427 l 0.026,1.36719 q 0.67708,-1.39323 1.99219,-2.51302 1.3151,-1.13281 2.23958,-1.13281 2.04427,0 2.8125,2.03125 0.71615,-0.95052 1.70573,-1.41927 0.98958,-0.48177 2.26563,-0.48177 2.17448,0 2.96875,2.34375 0.15625,0.46875 0.53385,3.65885 0.36458,2.23958 0.9375,7.46094 0.0391,0.52083 -0.35156,0.88541 -0.39063,0.36459 -0.95052,0.36459 z"
style=""
id="path4900" />
<path
d="m 469.24316,185.93295 q -2.94271,0 -4.76562,-1.47136 -1.99219,-1.6276 -1.99219,-4.55729 0,-3.38542 1.64062,-5.67708 1.78386,-2.5 4.76563,-2.5 2.16146,0 3.52865,0.72916 1.67968,0.89844 1.67968,2.76042 0,1.30208 -1.47135,2.36979 -0.65104,0.46875 -2.76042,1.45834 l -4.97396,2.35677 q 0.72917,1.08073 1.8099,1.6276 1.09375,0.54688 2.53906,0.54688 0.89844,0 2.10938,-0.33855 1.53646,-0.42968 1.97916,-1.11979 0.44271,-0.70312 0.95052,-0.70312 0.45573,0 0.8073,0.35156 0.36458,0.33854 0.36458,0.79427 0,1.44531 -2.33073,2.48698 -1.99219,0.88542 -3.88021,0.88542 z m -0.35156,-11.86198 q -1.58854,0 -2.64323,1.23698 -1.05469,1.22396 -1.57552,3.69791 l 3.9974,-1.90104 q 2.35677,-1.14583 3.35937,-1.99219 -1.11979,-1.04166 -3.13802,-1.04166 z"
style=""
id="path4902" />
<path
d="m 401.71712,219.12305 q -1.30208,0 -2.86458,-0.41667 l -0.026,6.22396 q 0,0.57291 -0.36458,0.9375 -0.35157,0.36458 -0.91146,0.36458 -0.5599,0 -0.92448,-0.36458 -0.35156,-0.36459 -0.35156,-0.9375 l 0.0391,-5.20834 v -5.19531 q -0.0391,-2.85156 -0.19531,-5.19531 l -0.19532,-2.64323 q 0,-0.92448 0.33854,-1.5625 0.39063,-0.76823 1.13282,-0.76823 0.50781,0 0.84635,0.36458 0.35156,0.35157 0.35156,0.8724 0,0.11719 -0.0391,0.42969 -0.0651,0.40364 -0.0911,0.74219 1.11979,-0.76823 2.16145,-1.14584 1.04167,-0.39062 1.99219,-0.39062 2.66927,0 3.90625,2.36979 0.97656,1.875 0.97656,4.96094 0,2.8125 -1.51041,4.63541 -1.60157,1.92709 -4.27084,1.92709 z m 0.89844,-11.28907 q -1.19792,0 -2.16146,0.53386 -0.6901,0.3776 -1.74479,1.38021 0.16927,2.95572 0.16927,6.21093 1.41927,0.54688 2.83854,0.54688 1.62761,0 2.48698,-1.21094 0.74219,-1.05469 0.74219,-2.73437 0,-2.3698 -0.58594,-3.54167 -0.57291,-1.1849 -1.74479,-1.1849 z"
style=""
id="path4904" />
<path
d="m 420.58431,209.04492 q -0.0912,1.6276 -1.30208,1.6276 -1.11979,0 -1.11979,-1.28906 0,-0.24739 -0.0391,-0.91146 l -0.0391,-1.09375 q -2.05729,0.2474 -3.29427,1.15886 -1.15885,0.84635 -1.92708,2.55208 l 0.026,7.13542 q 0,1.3151 -1.28906,1.3151 -1.14584,0 -1.14584,-1.3151 v -9.14063 q 0,-0.42968 0.026,-1.28906 0.0391,-0.85937 0.0391,-1.28906 0,-1.31511 1.14583,-1.31511 1.27604,0 1.28906,2.16146 2.5,-2.40885 5.54688,-2.40885 1.05468,0 1.57552,0.79427 0.52083,0.79427 0.52083,2.39583 0,0.67709 -0.013,0.91146 z"
style=""
id="path4906" />
<path
d="m 428.14941,219.43555 q -2.1875,0 -3.76302,-1.61459 -1.75781,-1.79687 -1.84896,-4.83073 -0.0781,-2.94271 1.48438,-5.18229 1.82292,-2.64323 5.07812,-2.64323 2.60417,0 3.94532,2.38282 1.06771,1.92708 1.02864,4.63541 -0.0391,2.89063 -1.51041,4.9349 -1.66667,2.31771 -4.41407,2.31771 z m 0.78125,-11.62761 q -1.91406,-0.013 -2.98177,1.82292 -0.84635,1.47135 -0.84635,3.35937 0,1.88802 1.02864,2.94271 0.85938,0.8724 2.01823,0.8724 1.35417,0 2.33073,-0.98959 1.10677,-1.09375 1.17188,-2.95573 0.18229,-5.03906 -2.72136,-5.05208 z"
style=""
id="path4908" />
<path
d="m 443.16244,219.48763 q -2.66927,0 -4.45313,-1.61458 -1.875,-1.69271 -1.875,-4.46615 0,-2.64323 1.75781,-5.45573 1.95313,-3.125 4.40105,-3.125 1.27604,0 2.85156,0.59896 1.99218,0.76823 1.99218,1.875 0,0.49479 -0.32552,0.88542 -0.32552,0.3776 -0.80729,0.3776 -0.3776,0 -0.66406,-0.22135 -0.27344,-0.23438 -0.53385,-0.45573 -0.76823,-0.61198 -2.51302,-0.61198 -1.32813,0 -2.63021,2.36979 -1.1849,2.17448 -1.1849,3.76302 0,1.71875 1.17188,2.70833 1.09375,0.92448 2.8125,0.92448 0.78125,0 1.66666,-0.41666 l 1.51042,-0.8073 q 0.39062,-0.20833 0.55989,-0.20833 0.48178,0 0.83334,0.39063 0.35156,0.3776 0.35156,0.87239 0,0.95052 -2.00521,1.83594 -1.75781,0.78125 -2.91666,0.78125 z"
style=""
id="path4910" />
<path
d="m 457.05566,219.26627 q -2.94271,0 -4.76562,-1.47135 -1.99219,-1.6276 -1.99219,-4.55729 0,-3.38542 1.64063,-5.67708 1.78385,-2.5 4.76562,-2.5 2.16146,0 3.52865,0.72916 1.67968,0.89844 1.67968,2.76042 0,1.30208 -1.47135,2.36979 -0.65104,0.46875 -2.76042,1.45833 l -4.97396,2.35677 q 0.72917,1.08073 1.8099,1.62761 1.09375,0.54687 2.53906,0.54687 0.89844,0 2.10938,-0.33854 1.53646,-0.42969 1.97916,-1.11979 0.44271,-0.70312 0.95052,-0.70312 0.45573,0 0.8073,0.35156 0.36458,0.33854 0.36458,0.79427 0,1.44531 -2.33073,2.48698 -1.99219,0.88541 -3.88021,0.88541 z M 456.7041,207.4043 q -1.58854,0 -2.64323,1.23698 -1.05469,1.22395 -1.57552,3.69791 l 3.9974,-1.90104 q 2.35677,-1.14583 3.35937,-1.99219 -1.11979,-1.04166 -3.13802,-1.04166 z"
style=""
id="path4912" />
<path
d="m 474.39941,208.64128 q -0.71615,0 -1.0026,-0.48178 -0.19532,-0.33854 -0.33855,-1.40625 -0.97656,0.2474 -3.00781,0.91146 -2.09635,0.83334 -2.10937,2.05729 0.39062,0.14323 0.76823,0.20834 2.99479,0.53385 4.45312,1.47135 2.1224,1.36719 2.1224,3.95833 0,2.1224 -1.83594,3.20313 -1.52344,0.88542 -3.85417,0.88542 -1.70573,0 -3.25521,-0.57292 -2.0052,-0.72917 -2.0052,-2.03125 0,-0.52083 0.40364,-0.91146 0.41667,-0.39062 0.9375,-0.39062 0.40365,0 1.04167,0.625 0.3125,0.29948 1.40625,0.49479 0.87239,0.15625 1.47135,0.15625 1.09375,0 1.92709,-0.26042 1.21093,-0.3776 1.21093,-1.19792 0,-1.77083 -2.47396,-2.39583 l -1.35416,-0.29948 q -1.73177,-0.3776 -2.46094,-0.88541 -1.05469,-0.72917 -1.05469,-2.20053 0,-2.72135 2.56511,-3.8802 0.78125,-0.35157 2.5,-0.83334 1.71875,-0.49479 2.46093,-0.83333 0.53386,-0.23437 1.15886,-0.23437 0.55989,0 0.91146,0.36458 0.36458,0.36458 0.36458,0.95052 0,0.3776 0.15625,1.10677 0.16927,0.72917 0.16927,1.10677 0,0.58594 -0.36458,0.95052 -0.35156,0.36459 -0.91146,0.36459 z"
style=""
id="path4914" />
<path
d="m 487.36816,208.64128 q -0.71615,0 -1.0026,-0.48178 -0.19532,-0.33854 -0.33855,-1.40625 -0.97656,0.2474 -3.00781,0.91146 -2.09635,0.83334 -2.10937,2.05729 0.39062,0.14323 0.76823,0.20834 2.99479,0.53385 4.45312,1.47135 2.1224,1.36719 2.1224,3.95833 0,2.1224 -1.83594,3.20313 -1.52344,0.88542 -3.85417,0.88542 -1.70573,0 -3.25521,-0.57292 -2.0052,-0.72917 -2.0052,-2.03125 0,-0.52083 0.40364,-0.91146 0.41667,-0.39062 0.9375,-0.39062 0.40365,0 1.04167,0.625 0.3125,0.29948 1.40625,0.49479 0.87239,0.15625 1.47135,0.15625 1.09375,0 1.92709,-0.26042 1.21093,-0.3776 1.21093,-1.19792 0,-1.77083 -2.47396,-2.39583 l -1.35416,-0.29948 q -1.73177,-0.3776 -2.46094,-0.88541 -1.05469,-0.72917 -1.05469,-2.20053 0,-2.72135 2.56511,-3.8802 0.78125,-0.35157 2.5,-0.83334 1.71875,-0.49479 2.46093,-0.83333 0.53386,-0.23437 1.15886,-0.23437 0.55989,0 0.91146,0.36458 0.36458,0.36458 0.36458,0.95052 0,0.3776 0.15625,1.10677 0.16927,0.72917 0.16927,1.10677 0,0.58594 -0.36458,0.95052 -0.35156,0.36459 -0.91146,0.36459 z"
style=""
id="path4916" />
</g>
<rect
style="fill:#ffffff"
id="rect148"
width="28.983051"
height="45.762711"
x="212.03391"
y="388.22034" />
</g>
</svg>
</center></p>
<p>This process can be <em>anything</em>. For example, it can be:</p>
<ul>
<li>a process calculating some statistics ("tell me how many users have red hair")</li>
<li>a de-identification strategy ("remove names and last three digits of ZIP
codes")</li>
<li>a machine learning training process ("build a model to predict which users
like cats")</li>
<li>… you get the idea.</li>
</ul>
<p>To make a process differentially private, you usually have to modify it a little
bit. Typically, you add some randomness, or <em>noise</em>, in some places. What
exactly you do, and how much noise you add, depends on which process you're
modifying. I'll abstract that part away and simply say that your process is now
doing some unspecified ✨ magic ✨.</p>
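To make the "add some noise" part slightly less magical, here is a toy sketch in Python — my own illustration, not this post's mechanism — of the most common such modification: adding Laplace noise to a counting query. The `noisy_count` helper and the scale of `1/epsilon` are assumptions that hold for a count, which changes by at most 1 when one person is removed.

```python
import random

def laplace_noise(scale):
    # The difference of two exponential samples follows a
    # Laplace distribution centered at 0 with the given scale.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def noisy_count(records, predicate, epsilon):
    # A counting query moves by at most 1 when one person is added or
    # removed, so Laplace noise of scale 1/epsilon suffices for the
    # standard epsilon-DP guarantee.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

users = [{"hair": "red"}, {"hair": "brown"}, {"hair": "red"}]
result = noisy_count(users, lambda u: u["hair"] == "red", epsilon=1.0)
```

Other processes need different, often subtler modifications — hence the ✨ magic ✨ above.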
<p><center>
<svg
role="img"
aria-label="Same diagram as before, except the arrow is labeled with a silly &quot;magic&quot; gif."
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
version="1.1"
viewBox="0 0 699.45947 142.76904"
stroke-miterlimit="10"
id="svg4123"
sodipodi:docname="magic.svg"
inkscape:version="0.92.3 (2405546, 2018-03-11)"
width="699.45947"
height="142.76904"
style="fill:none;stroke:none;stroke-linecap:square;stroke-miterlimit:10">
<title>Same diagram as before, except the arrow is labeled with a silly "magic" gif.</title>
<desc>Same diagram as before, except the arrow is labeled with a silly "magic" gif.</desc>
<metadata
id="metadata4129">
<rdf:RDF>
<cc:Work
rdf:about="">
<dc:format>image/svg+xml</dc:format>
<dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
<dc:title />
</cc:Work>
</rdf:RDF>
</metadata>
<defs
id="defs4127">
<clipPath
id="p.3-4">
<path
d="M 0,0 H 275 V 252 H 0 Z"
id="path4077-0"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
</defs>
<sodipodi:namedview
pagecolor="#ffffff"
bordercolor="#666666"
borderopacity="1"
objecttolerance="10"
gridtolerance="10"
guidetolerance="10"
inkscape:pageopacity="0"
inkscape:pageshadow="2"
inkscape:window-width="2556"
inkscape:window-height="1436"
id="namedview4125"
showgrid="false"
inkscape:zoom="1.3111111"
inkscape:cx="608.41836"
inkscape:cy="-64.033451"
inkscape:window-x="0"
inkscape:window-y="1200"
inkscape:window-maximized="0"
inkscape:current-layer="g4121"
fit-margin-top="0"
fit-margin-left="0"
fit-margin-right="0"
fit-margin-bottom="0"
viewbox-x="82"
viewbox-width="650" />
<clipPath
id="p.0">
<path
d="M 0,0 H 960 V 720 H 0 Z"
id="path4036"
inkscape:connector-curvature="0"
style="clip-rule:nonzero" />
</clipPath>
<g
clip-path="url(#p.0)"
id="g4121"
transform="translate(-76.08921,-124.34908)">
<path
d="M 0,0 H 960 V 720 H 0 Z"
id="path4039"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="m 76.08921,124.34908 h 223.2992 V 267.11813 H 76.08921 Z"
id="path4041"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<g
transform="matrix(0.22307612,0,0,0.22307664,76.089214,124.34908)"
id="g4048">
<clipPath
id="p.1">
<path
d="M 0,0 H 1001 V 640 H 0 Z"
id="path4043"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
<image
clip-path="url(#p.1)"
width="1001"
height="640"
x="0"
y="0"
preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/quiestce.jpeg"
id="image4046"
style="fill:#000000" />
</g>
<path
d="M 76.086586,339.67847 H 299.38579 V 482.44751 H 76.086586 Z"
id="path4050"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="m 604.81974,178.70273 v 0 c -1.51233,-10.41394 3.45307,-20.72311 12.78925,-26.55296 9.33618,-5.82984 21.40576,-6.15791 31.08722,-0.84502 v 0 c 3.42944,-6.0551 9.70636,-10.23573 16.93207,-11.27733 7.2257,-1.04159 14.55151,1.17818 19.76159,5.98789 v 0 c 2.92145,-5.48995 8.65778,-9.17848 15.1734,-9.75675 6.51569,-0.57824 12.88855,2.03563 16.85718,6.9141 v 0 c 5.27802,-5.81934 13.67554,-8.2695 21.55884,-6.29033 7.8833,1.97917 13.83649,8.03222 15.28363,15.53996 v 0 c 6.46649,1.65271 11.8529,5.85413 14.76764,11.51872 2.91473,5.66462 3.07178,12.23673 0.43066,18.01831 v 0 c 6.36756,7.76532 7.85712,18.11251 3.91272,27.18018 -3.9444,9.06767 -12.7301,15.49351 -23.07849,16.87953 -0.0729,8.51033 -5.05414,16.31931 -13.02356,20.41701 -7.96942,4.09772 -17.68268,3.84427 -25.39587,-0.66263 -3.2854,10.1926 -12.53272,17.6922 -23.74671,19.25873 -11.21405,1.56653 -22.38446,-3.08087 -28.68524,-11.93434 -7.72339,4.3639 -16.99084,5.621 -25.71185,3.48773 -8.72095,-2.13329 -16.16077,-7.47725 -20.64124,-14.8264 v 0 c -7.89233,0.86539 -15.52319,-2.96604 -19.10534,-9.59276 -3.58216,-6.62671 -2.35303,-14.63803 3.07733,-20.05798 v 0 c -7.04022,-3.8826 -10.63251,-11.58696 -8.90375,-19.09555 1.72882,-7.50861 8.38696,-13.11998 16.50256,-13.90799 z"
id="path4059"
inkscape:connector-curvature="0"
style="fill:#f6cd4c;fill-rule:evenodd" />
<path
d="m 597.06497,212.10764 v 0 c 3.32227,1.83222 7.16028,2.66335 10.99872,2.38181 m 5.02759,27.2696 c 1.65063,-0.18098 3.26861,-0.56427 4.81219,-1.13993 m 41.53858,12.47812 c -1.16095,-1.63134 -2.133,-3.37456 -2.89954,-5.19996 m 55.3335,-2.12474 v 0 c 0.59893,-1.85826 0.98706,-3.77081 1.15777,-5.70575 m 37.26019,-14.04791 c 0.0776,-9.06056 -5.41461,-17.3565 -14.11767,-21.32448 m 33.2832,-22.73425 c -1.40942,3.08528 -3.56109,5.8222 -6.28625,7.99617 m -8.91071,-37.53403 v 0 c 0.24011,1.24591 0.35126,2.51055 0.33185,3.77639 m -37.17371,-13.02565 v 0 c -1.31659,1.45164 -2.4013,3.07382 -3.22033,4.81604 m -28.81055,-1.97402 v 0 c -0.70166,1.31851 -1.22559,2.71372 -1.55975,4.15347 m -35.13446,1.13608 v 0 c 2.04858,1.12421 3.94384,2.47731 5.64404,4.02959 m -49.5199,23.36855 v 0 c 0.20844,1.4353 0.53784,2.85292 0.98499,4.23957"
id="path4061"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="m 636.30257,211.8428 h -0.1875 q -0.78125,0 -1.25,-0.48437 -0.46875,-0.48438 -0.46875,-1.17188 0,-0.46875 0.5,-2.92187 l 1.375,-7.10938 q 0.46875,-2.42187 2.0625,-13.4375 l 0.39063,-2.70312 q 0.14062,-1.01563 1.10937,-1.84375 0.96875,-0.84375 1.78125,-0.84375 0.4375,0 2.9375,1.3125 2.70313,1.45312 3.20313,1.59375 5.09375,1.875 8.375,5.6875 3.29687,3.8125 3.29687,8.23437 0,4.04688 -2.40625,7.92188 -2.39062,3.875 -5.78125,5.71875 -3.39062,1.84375 -8.67187,1.84375 -1.53125,0 -3.46875,-0.53125 -1.9375,-0.51563 -2.79688,-1.26563 z m 6,-25.40625 -1.64062,11.20313 -2,10.67187 q 0.0625,0.0469 0.14062,0.0781 1.84375,1.32812 4.375,1.32812 5.01563,0 7.375,-1.79687 2.35938,-1.79688 3.60938,-4.21875 1.25,-2.42188 1.25,-4.92188 0,-2.51562 -1.40625,-4.64062 -1.39063,-2.125 -3.76563,-3.57813 -2.35937,-1.46875 -7.9375,-4.125 z m 32.16406,24.28125 q -4.21875,2.03125 -6.01562,2.03125 -7.29688,0 -7.29688,-6.95312 0,-6.46875 4.07813,-10.46875 4.09375,-4 8.98437,-4 1.89063,0 3.90625,0.96875 2.03125,0.96875 2.03125,2.25 0,0.79687 -0.53125,1.25 -0.39062,0.76562 -1.10937,4.40625 -0.70313,3.64062 -0.70313,5.25 0,1.96875 1.01563,5.42187 l 0.0781,0.3125 q -0.4375,1.82813 -2.17187,1.82813 -0.29688,0 -1.09375,-0.79688 -0.79688,-0.79687 -1.17188,-1.5 z m 1.42188,-15.1875 q -1.21875,-0.73437 -1.95313,-0.73437 -3.25,0 -6.07812,2.95312 -2.82813,2.9375 -2.82813,7.3125 0,4.29688 3.71875,4.29688 2.90625,0 5.46875,-2.15625 0.3125,-8.1875 1.67188,-11.67188 z m 16.45312,-0.34375 q -0.23437,0.9375 -0.60937,2.1875 -1.64063,5.42188 -1.64063,8.26563 0,3.35937 1.5625,3.35937 1.54688,0 2.28125,-0.95312 1.01563,-1.32813 1.875,-1.32813 0.65625,0 1.25,0.48438 0.59375,0.48437 0.59375,1.26562 0,1.71875 -2.15625,2.95313 -2.15625,1.21875 -4.29687,1.21875 -5.03125,0 -5.03125,-6.28125 0,-3.4375 1.54687,-8.39063 0.40625,-1.3125 0.78125,-2.625 -4.25,-0.20312 -4.85937,-0.70312 -0.60938,-0.5 -0.60938,-1.5625 0,-0.85938 0.57813,-1.42188 0.59375,-0.57812 1.60937,-0.57812 l 4.03125,0.32812 q 0.125,-0.51562 0.1875,-1.17187 
0.0312,-0.67188 0.125,-1.48438 l 0.34375,-2.42187 q 0.10938,-0.79688 0.71875,-1.34375 0.625,-0.54688 1.46875,-0.54688 1.64063,0 1.64063,2.625 0,0.76563 -0.23438,1.98438 l -0.39062,2.35937 q 2.60937,-0.32812 2.78125,-0.32812 2.03125,0 2.79687,0.40625 0.76563,0.39062 0.76563,1.46875 0,0.95312 -0.65625,1.54687 -0.65625,0.57813 -1.57813,0.57813 l -1.92187,-0.0781 q -1.45313,0 -2.95313,0.1875 z m 21.44141,15.53125 q -4.21875,2.03125 -6.01563,2.03125 -7.29687,0 -7.29687,-6.95312 0,-6.46875 4.07812,-10.46875 4.09375,-4 8.98438,-4 1.89062,0 3.90625,0.96875 2.03125,0.96875 2.03125,2.25 0,0.79687 -0.53125,1.25 -0.39063,0.76562 -1.10938,4.40625 -0.70312,3.64062 -0.70312,5.25 0,1.96875 1.01562,5.42187 l 0.0781,0.3125 q -0.4375,1.82813 -2.17188,1.82813 -0.29687,0 -1.09375,-0.79688 -0.79687,-0.79687 -1.17187,-1.5 z m 1.42187,-15.1875 q -1.21875,-0.73437 -1.95312,-0.73437 -3.25,0 -6.07813,2.95312 -2.82812,2.9375 -2.82812,7.3125 0,4.29688 3.71875,4.29688 2.90625,0 5.46875,-2.15625 0.3125,-8.1875 1.67187,-11.67188 z"
id="path4063"
inkscape:connector-curvature="0"
style="fill:#000000;fill-rule:nonzero" />
<path
d="m 597.06237,427.43705 v 0 c 3.32227,1.83218 7.16028,2.66333 10.99872,2.38181 m 5.02759,27.26959 c 1.65063,-0.181 3.26861,-0.56427 4.81219,-1.13992 m 41.53858,12.47811 c -1.16095,-1.63134 -2.133,-3.37457 -2.89954,-5.19998 m 55.3335,-2.12472 v 0 c 0.59893,-1.85825 0.98706,-3.77082 1.15777,-5.70575 m 37.26019,-14.04792 c 0.0776,-9.06054 -5.41461,-17.3565 -14.11767,-21.32449 m 33.2832,-22.73425 c -1.40942,3.0853 -3.56109,5.82223 -6.28625,7.99618 m -8.91071,-37.53402 v 0 c 0.24011,1.24591 0.35126,2.51056 0.33185,3.77639 m -37.17371,-13.02566 v 0 c -1.31659,1.45163 -2.4013,3.07382 -3.22033,4.81604 m -28.81055,-1.974 v 0 c -0.70166,1.31848 -1.22559,2.71372 -1.55975,4.15347 m -35.13446,1.13608 v 0 c 2.04858,1.12421 3.94384,2.4773 5.64404,4.02957 m -49.5199,23.36856 v 0 c 0.20844,1.43531 0.53784,2.85291 0.98499,4.23957"
id="path4067"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="m 299.38846,189.04987 h 70.36065 v -6.68504 l 8.40525,13.37008 -8.40525,13.37007 v -6.68502 h -70.36065 z"
id="path4071"
inkscape:connector-curvature="0"
style="fill:#7c7ce0;fill-rule:evenodd;stroke-width:0.79288208" />
<path
d="M 427.88773,353.66272 H 553.16596 V 468.46329 H 427.88773 Z"
id="path4075"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="M 427.809,138.3307 H 553.08723 V 253.13125 H 427.809 Z"
id="path4100"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="M 379.751,138.43701 H 505.04235 V 253.24016 H 379.751 Z"
id="path4109"
inkscape:connector-curvature="0"
style="fill:#00fdc8;fill-rule:evenodd" />
<path
d="M 379.751,138.43701 H 505.04235 V 253.24016 H 379.751 Z"
id="path4111"
inkscape:connector-curvature="0"
style="fill-rule:evenodd;stroke:#666666;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round" />
<path
d="m 506.71874,190.09462 h 70.36065 v -6.68504 l 8.40525,13.37008 -8.40525,13.37007 v -6.68502 h -70.36065 z"
id="path4071-6"
inkscape:connector-curvature="0"
style="fill:#7c7ce0;fill-rule:evenodd;stroke-width:0.79288208" />
<flowRoot
xml:space="preserve"
id="flowRoot4272"
style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:26.66666603px;line-height:1.25;font-family:'Comic Sans MS';-inkscape-font-specification:'Comic Sans MS';text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none"><flowRegion
id="flowRegion4274"
style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:26.66666603px;font-family:'Comic Sans MS';-inkscape-font-specification:'Comic Sans MS';text-align:center;text-anchor:middle"><rect
id="rect4276"
width="125.97456"
height="90.782555"
x="379.06781"
y="162.45763"
style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:26.66666603px;font-family:'Comic Sans MS';-inkscape-font-specification:'Comic Sans MS';text-align:center;text-anchor:middle" /></flowRegion><flowPara
id="flowPara4278">Some process</flowPara></flowRoot> <g
transform="matrix(0.45555722,0,0,0.45555774,379.751,138.43701)"
id="g4082-1"
inkscape:transform-center-x="-34.322034"
inkscape:transform-center-y="102.9661">
<clipPath
id="clipPath4292">
<path
d="M 0,0 H 275 V 252 H 0 Z"
id="path4290"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
<image
clip-path="url(#p.3-4)"
width="275"
height="252"
x="0"
y="0"
preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/magic.gif"
id="image4080-3"
style="fill:#000000" />
</g>
<rect
style="fill:#ffffff"
id="rect148"
width="28.983051"
height="45.762711"
x="212.03391"
y="388.22034" />
</g>
</svg>
</center></p>
<p>Now, remove somebody from your database, and run your new process on it. If the
new process is differentially private, then the two outputs are <em>basically the
same</em>. This must be true no matter who you remove, and what database you had in
the first place.</p>
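<p>To make "basically the same" concrete, here is a hypothetical sketch under the same assumptions as before (a counting query with Laplace noise of scale 1/epsilon): running the noisy process on a database, and on its neighbor with one person removed, yields output distributions that overlap heavily.</p>

```python
import random

def noisy_red_hair_count(records, epsilon):
    # Count plus Laplace noise of scale 1/epsilon (sampled as the
    # difference of two exponentials): an epsilon-DP counting query.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return sum(1 for r in records if r["hair"] == "red") + noise

database = [{"hair": "red"}] * 50 + [{"hair": "brown"}] * 50
neighbor = database[1:]  # the same database, minus one red-haired person

# Both runs produce a noisy number near 50. Any given output is almost
# as likely under either database, so observing it reveals very little
# about whether the removed person was in the data.
out_full = noisy_red_hair_count(database, epsilon=1.0)
out_neighbor = noisy_red_hair_count(neighbor, epsilon=1.0)
```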
<p><center>
<svg
role="img"
aria-label="Same diagram as before, duplicated, with the bottom line missing one person in the database. A double arrow labeled &quot;basically the same&quot; points to the two outputs."
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
version="1.1"
viewBox="0 0 734.41882 358.09845"
stroke-miterlimit="10"
id="svg4123"
sodipodi:docname="full-diagram.svg"
inkscape:version="0.92.2pre0 (973e216, 2017-07-25)"
width="734.41882"
height="358.09845"
style="fill:none;stroke:none;stroke-linecap:square;stroke-miterlimit:10">
<title>Same diagram as before, duplicated, with the bottom line missing one person in the database. A double arrow labeled "basically the same" points to the two outputs.</title>
<desc>Same diagram as before, duplicated, with the bottom line missing one person in the database. A double arrow labeled "basically the same" points to the two outputs.</desc>
<metadata
id="metadata4129">
<rdf:RDF>
<cc:Work
rdf:about="">
<dc:format>image/svg+xml</dc:format>
<dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
<dc:title></dc:title>
</cc:Work>
</rdf:RDF>
</metadata>
<defs
id="defs4127">
<clipPath
id="p.3-4">
<path
d="M 0,0 H 275 V 252 H 0 Z"
id="path4077-0"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
</defs>
<sodipodi:namedview
pagecolor="#ffffff"
bordercolor="#666666"
borderopacity="1"
objecttolerance="10"
gridtolerance="10"
guidetolerance="10"
inkscape:pageopacity="0"
inkscape:pageshadow="2"
inkscape:window-width="1918"
inkscape:window-height="1078"
id="namedview4125"
showgrid="false"
inkscape:zoom="1.3111111"
inkscape:cx="532.33439"
inkscape:cy="127.30279"
inkscape:window-x="0"
inkscape:window-y="0"
inkscape:window-maximized="0"
inkscape:current-layer="g4121"
fit-margin-top="0"
fit-margin-left="0"
fit-margin-right="0"
fit-margin-bottom="0"
viewbox-x="82"
viewbox-width="650" />
<clipPath
id="p.0">
<path
d="M 0,0 H 960 V 720 H 0 Z"
id="path4036"
inkscape:connector-curvature="0"
style="clip-rule:nonzero" />
</clipPath>
<g
clip-path="url(#p.0)"
id="g4121"
transform="translate(-76.086586,-124.34907)">
<path
d="M 0,0 H 960 V 720 H 0 Z"
id="path4039"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="m 76.08921,124.34908 h 223.2992 V 267.11813 H 76.08921 Z"
id="path4041"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<g
transform="matrix(0.22307612,0,0,0.22307664,76.089214,124.34908)"
id="g4048">
<clipPath
id="p.1">
<path
d="M 0,0 H 1001 V 640 H 0 Z"
id="path4043"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
<image
clip-path="url(#p.1)"
width="1001"
height="640"
x="0"
y="0"
preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/quiestce.jpeg"
id="image4046"
style="fill:#000000" />
</g>
<path
d="M 76.086586,339.67847 H 299.38579 V 482.44751 H 76.086586 Z"
id="path4050"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<g
transform="matrix(0.22307612,0,0,0.22307664,76.086589,339.67847)"
id="g4057">
<clipPath
id="p.2">
<path
d="M 0,0 H 1001 V 640 H 0 Z"
id="path4052"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
<image
clip-path="url(#p.2)"
width="1001"
height="640"
x="0"
y="0"
preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/quiestce.jpeg"
id="image4055"
style="fill:#000000" />
</g>
<path
d="m 604.81974,178.70273 v 0 c -1.51233,-10.41394 3.45307,-20.72311 12.78925,-26.55296 9.33618,-5.82984 21.40576,-6.15791 31.08722,-0.84502 v 0 c 3.42944,-6.0551 9.70636,-10.23573 16.93207,-11.27733 7.2257,-1.04159 14.55151,1.17818 19.76159,5.98789 v 0 c 2.92145,-5.48995 8.65778,-9.17848 15.1734,-9.75675 6.51569,-0.57824 12.88855,2.03563 16.85718,6.9141 v 0 c 5.27802,-5.81934 13.67554,-8.2695 21.55884,-6.29033 7.8833,1.97917 13.83649,8.03222 15.28363,15.53996 v 0 c 6.46649,1.65271 11.8529,5.85413 14.76764,11.51872 2.91473,5.66462 3.07178,12.23673 0.43066,18.01831 v 0 c 6.36756,7.76532 7.85712,18.11251 3.91272,27.18018 -3.9444,9.06767 -12.7301,15.49351 -23.07849,16.87953 -0.0729,8.51033 -5.05414,16.31931 -13.02356,20.41701 -7.96942,4.09772 -17.68268,3.84427 -25.39587,-0.66263 -3.2854,10.1926 -12.53272,17.6922 -23.74671,19.25873 -11.21405,1.56653 -22.38446,-3.08087 -28.68524,-11.93434 -7.72339,4.3639 -16.99084,5.621 -25.71185,3.48773 -8.72095,-2.13329 -16.16077,-7.47725 -20.64124,-14.8264 v 0 c -7.89233,0.86539 -15.52319,-2.96604 -19.10534,-9.59276 -3.58216,-6.62671 -2.35303,-14.63803 3.07733,-20.05798 v 0 c -7.04022,-3.8826 -10.63251,-11.58696 -8.90375,-19.09555 1.72882,-7.50861 8.38696,-13.11998 16.50256,-13.90799 z"
id="path4059"
inkscape:connector-curvature="0"
style="fill:#f6cd4c;fill-rule:evenodd" />
<path
d="m 597.06497,212.10764 v 0 c 3.32227,1.83222 7.16028,2.66335 10.99872,2.38181 m 5.02759,27.2696 c 1.65063,-0.18098 3.26861,-0.56427 4.81219,-1.13993 m 41.53858,12.47812 c -1.16095,-1.63134 -2.133,-3.37456 -2.89954,-5.19996 m 55.3335,-2.12474 v 0 c 0.59893,-1.85826 0.98706,-3.77081 1.15777,-5.70575 m 37.26019,-14.04791 c 0.0776,-9.06056 -5.41461,-17.3565 -14.11767,-21.32448 m 33.2832,-22.73425 c -1.40942,3.08528 -3.56109,5.8222 -6.28625,7.99617 m -8.91071,-37.53403 v 0 c 0.24011,1.24591 0.35126,2.51055 0.33185,3.77639 m -37.17371,-13.02565 v 0 c -1.31659,1.45164 -2.4013,3.07382 -3.22033,4.81604 m -28.81055,-1.97402 v 0 c -0.70166,1.31851 -1.22559,2.71372 -1.55975,4.15347 m -35.13446,1.13608 v 0 c 2.04858,1.12421 3.94384,2.47731 5.64404,4.02959 m -49.5199,23.36855 v 0 c 0.20844,1.4353 0.53784,2.85292 0.98499,4.23957"
id="path4061"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="m 636.30257,211.8428 h -0.1875 q -0.78125,0 -1.25,-0.48437 -0.46875,-0.48438 -0.46875,-1.17188 0,-0.46875 0.5,-2.92187 l 1.375,-7.10938 q 0.46875,-2.42187 2.0625,-13.4375 l 0.39063,-2.70312 q 0.14062,-1.01563 1.10937,-1.84375 0.96875,-0.84375 1.78125,-0.84375 0.4375,0 2.9375,1.3125 2.70313,1.45312 3.20313,1.59375 5.09375,1.875 8.375,5.6875 3.29687,3.8125 3.29687,8.23437 0,4.04688 -2.40625,7.92188 -2.39062,3.875 -5.78125,5.71875 -3.39062,1.84375 -8.67187,1.84375 -1.53125,0 -3.46875,-0.53125 -1.9375,-0.51563 -2.79688,-1.26563 z m 6,-25.40625 -1.64062,11.20313 -2,10.67187 q 0.0625,0.0469 0.14062,0.0781 1.84375,1.32812 4.375,1.32812 5.01563,0 7.375,-1.79687 2.35938,-1.79688 3.60938,-4.21875 1.25,-2.42188 1.25,-4.92188 0,-2.51562 -1.40625,-4.64062 -1.39063,-2.125 -3.76563,-3.57813 -2.35937,-1.46875 -7.9375,-4.125 z m 32.16406,24.28125 q -4.21875,2.03125 -6.01562,2.03125 -7.29688,0 -7.29688,-6.95312 0,-6.46875 4.07813,-10.46875 4.09375,-4 8.98437,-4 1.89063,0 3.90625,0.96875 2.03125,0.96875 2.03125,2.25 0,0.79687 -0.53125,1.25 -0.39062,0.76562 -1.10937,4.40625 -0.70313,3.64062 -0.70313,5.25 0,1.96875 1.01563,5.42187 l 0.0781,0.3125 q -0.4375,1.82813 -2.17187,1.82813 -0.29688,0 -1.09375,-0.79688 -0.79688,-0.79687 -1.17188,-1.5 z m 1.42188,-15.1875 q -1.21875,-0.73437 -1.95313,-0.73437 -3.25,0 -6.07812,2.95312 -2.82813,2.9375 -2.82813,7.3125 0,4.29688 3.71875,4.29688 2.90625,0 5.46875,-2.15625 0.3125,-8.1875 1.67188,-11.67188 z m 16.45312,-0.34375 q -0.23437,0.9375 -0.60937,2.1875 -1.64063,5.42188 -1.64063,8.26563 0,3.35937 1.5625,3.35937 1.54688,0 2.28125,-0.95312 1.01563,-1.32813 1.875,-1.32813 0.65625,0 1.25,0.48438 0.59375,0.48437 0.59375,1.26562 0,1.71875 -2.15625,2.95313 -2.15625,1.21875 -4.29687,1.21875 -5.03125,0 -5.03125,-6.28125 0,-3.4375 1.54687,-8.39063 0.40625,-1.3125 0.78125,-2.625 -4.25,-0.20312 -4.85937,-0.70312 -0.60938,-0.5 -0.60938,-1.5625 0,-0.85938 0.57813,-1.42188 0.59375,-0.57812 1.60937,-0.57812 l 4.03125,0.32812 q 0.125,-0.51562 0.1875,-1.17187 
0.0312,-0.67188 0.125,-1.48438 l 0.34375,-2.42187 q 0.10938,-0.79688 0.71875,-1.34375 0.625,-0.54688 1.46875,-0.54688 1.64063,0 1.64063,2.625 0,0.76563 -0.23438,1.98438 l -0.39062,2.35937 q 2.60937,-0.32812 2.78125,-0.32812 2.03125,0 2.79687,0.40625 0.76563,0.39062 0.76563,1.46875 0,0.95312 -0.65625,1.54687 -0.65625,0.57813 -1.57813,0.57813 l -1.92187,-0.0781 q -1.45313,0 -2.95313,0.1875 z m 21.44141,15.53125 q -4.21875,2.03125 -6.01563,2.03125 -7.29687,0 -7.29687,-6.95312 0,-6.46875 4.07812,-10.46875 4.09375,-4 8.98438,-4 1.89062,0 3.90625,0.96875 2.03125,0.96875 2.03125,2.25 0,0.79687 -0.53125,1.25 -0.39063,0.76562 -1.10938,4.40625 -0.70312,3.64062 -0.70312,5.25 0,1.96875 1.01562,5.42187 l 0.0781,0.3125 q -0.4375,1.82813 -2.17188,1.82813 -0.29687,0 -1.09375,-0.79688 -0.79687,-0.79687 -1.17187,-1.5 z m 1.42187,-15.1875 q -1.21875,-0.73437 -1.95312,-0.73437 -3.25,0 -6.07813,2.95312 -2.82812,2.9375 -2.82812,7.3125 0,4.29688 3.71875,4.29688 2.90625,0 5.46875,-2.15625 0.3125,-8.1875 1.67187,-11.67188 z"
id="path4063"
inkscape:connector-curvature="0"
style="fill:#000000;fill-rule:nonzero" />
<path
d="m 604.81711,394.03214 v 0 c -1.51233,-10.41397 3.45307,-20.72311 12.78925,-26.55298 9.33618,-5.82983 21.40576,-6.1579 31.08722,-0.845 v 0 c 3.42944,-6.05511 9.70636,-10.23575 16.93207,-11.27734 7.2257,-1.0416 14.55151,1.17819 19.76159,5.98788 v 0 c 2.92145,-5.48993 8.65778,-9.17847 15.1734,-9.75674 6.51569,-0.57825 12.88855,2.03564 16.85718,6.91409 v 0 c 5.27802,-5.81934 13.67554,-8.2695 21.55884,-6.29031 7.8833,1.97915 13.83649,8.03222 15.28363,15.53994 v 0 c 6.46649,1.65271 11.8529,5.85413 14.76764,11.51874 2.91473,5.66461 3.07178,12.23673 0.43066,18.01831 v 0 c 6.36756,7.76532 7.85712,18.11249 3.91272,27.18018 -3.9444,9.06766 -12.7301,15.4935 -23.07849,16.87952 -0.0729,8.51034 -5.05414,16.3193 -13.02356,20.41702 -7.96942,4.09769 -17.68268,3.84427 -25.39587,-0.66266 -3.2854,10.19263 -12.53272,17.69223 -23.74671,19.25876 -11.21405,1.56653 -22.38446,-3.08087 -28.68524,-11.93436 -7.72339,4.36392 -16.99084,5.621 -25.71185,3.48773 -8.72095,-2.13327 -16.16077,-7.47723 -20.64124,-14.82638 v 0 c -7.89233,0.86539 -15.52319,-2.96607 -19.10534,-9.59277 -3.58216,-6.62671 -2.35303,-14.63804 3.07733,-20.05799 v 0 c -7.04022,-3.8826 -10.63257,-11.58694 -8.90375,-19.09555 1.72882,-7.5086 8.38696,-13.11996 16.50256,-13.90799 z"
id="path4065"
inkscape:connector-curvature="0"
style="fill:#f6cd4c;fill-rule:evenodd" />
<path
d="m 597.06237,427.43705 v 0 c 3.32227,1.83218 7.16028,2.66333 10.99872,2.38181 m 5.02759,27.26959 c 1.65063,-0.181 3.26861,-0.56427 4.81219,-1.13992 m 41.53858,12.47811 c -1.16095,-1.63134 -2.133,-3.37457 -2.89954,-5.19998 m 55.3335,-2.12472 v 0 c 0.59893,-1.85825 0.98706,-3.77082 1.15777,-5.70575 m 37.26019,-14.04792 c 0.0776,-9.06054 -5.41461,-17.3565 -14.11767,-21.32449 m 33.2832,-22.73425 c -1.40942,3.0853 -3.56109,5.82223 -6.28625,7.99618 m -8.91071,-37.53402 v 0 c 0.24011,1.24591 0.35126,2.51056 0.33185,3.77639 m -37.17371,-13.02566 v 0 c -1.31659,1.45163 -2.4013,3.07382 -3.22033,4.81604 m -28.81055,-1.974 v 0 c -0.70166,1.31848 -1.22559,2.71372 -1.55975,4.15347 m -35.13446,1.13608 v 0 c 2.04858,1.12421 3.94384,2.4773 5.64404,4.02957 m -49.5199,23.36856 v 0 c 0.20844,1.43531 0.53784,2.85291 0.98499,4.23957"
id="path4067"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="m 636.29997,427.17219 h -0.1875 q -0.78125,0 -1.25,-0.48437 -0.46875,-0.48438 -0.46875,-1.17188 0,-0.46875 0.5,-2.92187 l 1.375,-7.10938 q 0.46875,-2.42187 2.0625,-13.4375 l 0.39063,-2.70312 q 0.14062,-1.01563 1.10937,-1.84375 0.96875,-0.84375 1.78125,-0.84375 0.4375,0 2.9375,1.3125 2.70313,1.45312 3.20313,1.59375 5.09375,1.875 8.375,5.6875 3.29687,3.8125 3.29687,8.23437 0,4.04688 -2.40625,7.92188 -2.39062,3.875 -5.78125,5.71875 -3.39062,1.84375 -8.67187,1.84375 -1.53125,0 -3.46875,-0.53125 -1.9375,-0.51563 -2.79688,-1.26563 z m 6,-25.40625 -1.64062,11.20313 -2,10.67187 q 0.0625,0.0469 0.14062,0.0781 1.84375,1.32812 4.375,1.32812 5.01563,0 7.375,-1.79687 2.35938,-1.79688 3.60938,-4.21875 1.25,-2.42188 1.25,-4.92188 0,-2.51562 -1.40625,-4.64062 -1.39063,-2.125 -3.76563,-3.57813 -2.35937,-1.46875 -7.9375,-4.125 z m 32.16406,24.28125 q -4.21875,2.03125 -6.01562,2.03125 -7.29688,0 -7.29688,-6.95312 0,-6.46875 4.07813,-10.46875 4.09375,-4 8.98437,-4 1.89063,0 3.90625,0.96875 2.03125,0.96875 2.03125,2.25 0,0.79687 -0.53125,1.25 -0.39062,0.76562 -1.10937,4.40625 -0.70313,3.64062 -0.70313,5.25 0,1.96875 1.01563,5.42187 l 0.0781,0.3125 q -0.4375,1.82813 -2.17187,1.82813 -0.29688,0 -1.09375,-0.79688 -0.79688,-0.79687 -1.17188,-1.5 z m 1.42188,-15.1875 q -1.21875,-0.73437 -1.95313,-0.73437 -3.25,0 -6.07812,2.95312 -2.82813,2.9375 -2.82813,7.3125 0,4.29688 3.71875,4.29688 2.90625,0 5.46875,-2.15625 0.3125,-8.1875 1.67188,-11.67188 z m 16.45312,-0.34375 q -0.23437,0.9375 -0.60937,2.1875 -1.64063,5.42188 -1.64063,8.26563 0,3.35937 1.5625,3.35937 1.54688,0 2.28125,-0.95312 1.01563,-1.32813 1.875,-1.32813 0.65625,0 1.25,0.48438 0.59375,0.48437 0.59375,1.26562 0,1.71875 -2.15625,2.95313 -2.15625,1.21875 -4.29687,1.21875 -5.03125,0 -5.03125,-6.28125 0,-3.4375 1.54687,-8.39063 0.40625,-1.3125 0.78125,-2.625 -4.25,-0.20312 -4.85937,-0.70312 -0.60938,-0.5 -0.60938,-1.5625 0,-0.85938 0.57813,-1.42188 0.59375,-0.57812 1.60937,-0.57812 l 4.03125,0.32812 q 0.125,-0.51562 0.1875,-1.17187 
0.0312,-0.67188 0.125,-1.48438 l 0.34375,-2.42187 q 0.10938,-0.79688 0.71875,-1.34375 0.625,-0.54688 1.46875,-0.54688 1.64063,0 1.64063,2.625 0,0.76563 -0.23438,1.98438 l -0.39062,2.35937 q 2.60937,-0.32812 2.78125,-0.32812 2.03125,0 2.79687,0.40625 0.76563,0.39062 0.76563,1.46875 0,0.95312 -0.65625,1.54687 -0.65625,0.57813 -1.57813,0.57813 l -1.92187,-0.0781 q -1.45313,0 -2.95313,0.1875 z m 21.44141,15.53125 q -4.21875,2.03125 -6.01563,2.03125 -7.29687,0 -7.29687,-6.95312 0,-6.46875 4.07812,-10.46875 4.09375,-4 8.98438,-4 1.89062,0 3.90625,0.96875 2.03125,0.96875 2.03125,2.25 0,0.79687 -0.53125,1.25 -0.39063,0.76562 -1.10938,4.40625 -0.70312,3.64062 -0.70312,5.25 0,1.96875 1.01562,5.42187 l 0.0781,0.3125 q -0.4375,1.82813 -2.17188,1.82813 -0.29687,0 -1.09375,-0.79688 -0.79687,-0.79687 -1.17187,-1.5 z m 1.42187,-15.1875 q -1.21875,-0.73437 -1.95312,-0.73437 -3.25,0 -6.07813,2.95312 -2.82812,2.9375 -2.82812,7.3125 0,4.29688 3.71875,4.29688 2.90625,0 5.46875,-2.15625 0.3125,-8.1875 1.67187,-11.67188 z"
id="path4069"
inkscape:connector-curvature="0"
style="fill:#000000;fill-rule:nonzero" />
<path
d="m 299.38846,189.04987 h 70.36065 v -6.68504 l 8.40525,13.37008 -8.40525,13.37007 v -6.68502 h -70.36065 z"
id="path4071"
inkscape:connector-curvature="0"
style="fill:#7c7ce0;fill-rule:evenodd;stroke-width:0.79288208" />
<path
d="M 427.88773,353.66272 H 553.16596 V 468.46329 H 427.88773 Z"
id="path4075"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<g
transform="matrix(0.45555722,0,0,0.45555774,379.83689,353.66273)"
id="g4082">
<clipPath
id="p.3">
<path
d="M 0,0 H 275 V 252 H 0 Z"
id="path4077"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
<image
clip-path="url(#p.3)"
width="275"
height="252"
x="0"
y="0"
preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/magic.gif"
id="image4080"
style="fill:#000000" />
</g>
<path
d="m 660.00147,264.20093 21.66931,-21.66928 21.66932,21.66928 h -10.83466 v 85.7638 h 10.83466 l -21.66932,21.66928 -21.66931,-21.66928 h 10.83466 v -85.7638 z"
id="path4088"
inkscape:connector-curvature="0"
style="fill:#db4437;fill-rule:evenodd" />
<path
d="m 660.00147,264.20093 21.66931,-21.66928 21.66932,21.66928 h -10.83466 v 85.7638 h 10.83466 l -21.66932,21.66928 -21.66931,-21.66928 h 10.83466 v -85.7638 z"
id="path4090"
inkscape:connector-curvature="0"
style="fill-rule:evenodd;stroke:#666666;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round" />
<path
d="m 554.53672,285.04017 c 42.66144,-12.09317 85.32288,12.0932 127.98438,0 42.66143,-12.09317 85.32287,12.0932 127.98431,0 l -1.70062,50.79135 c -42.66144,12.09317 -85.32288,-12.09317 -127.98432,0 -42.66143,12.09317 -85.32294,-12.09317 -127.98437,0 z"
id="path4092"
inkscape:connector-curvature="0"
style="fill:#db4437;fill-rule:evenodd" />
<path
d="m 587.72404,321.45273 q -0.84375,-0.5 -1.26562,-1.0625 -0.40625,-0.5625 -0.40625,-1.07812 0,-0.67188 0.67187,-1.07813 0.125,-0.0625 0.35938,-0.0625 0.3125,0 0.67187,0.10938 0.375,0.10937 0.57813,0.29687 0.82812,0.76563 1.98437,0.76563 0.5,0 1.46875,-0.21875 1.25,-0.39063 1.96875,-0.95313 0.71875,-0.57812 1.70313,-1.79687 0.60937,-0.79688 0.90625,-1.60938 0.3125,-0.82812 0.3125,-1.5 0,-0.73437 -0.34375,-1.15625 -0.64063,-0.85937 -1.4375,-1.28125 -0.78125,-0.4375 -1.82813,-0.4375 -0.9375,0 -1.82812,0.25 l -1.85938,0.39063 q -0.125,0.0312 -0.34375,0.0469 -0.20312,0.0156 -0.59375,0.95313 -0.375,0.92187 -1.29687,3.51562 -0.67188,1.875 -1.125,2.70313 -0.45313,0.8125 -0.79688,0.8125 -0.39062,0 -0.82812,-0.70313 -0.0937,-0.15625 -0.40625,-0.64062 -0.29688,-0.48438 -0.29688,-0.9375 0,-0.21875 0.15625,-0.53125 0.15625,-0.29688 0.26563,-0.57813 0.10937,-0.29687 0.25,-0.79687 0.40625,-1.25 2.40625,-6.20313 2,-4.96875 3.125,-7.5 0.79687,-1.82812 0.48437,-1.92187 0,-0.0937 -0.10937,-0.28125 -0.0937,-0.20313 -0.0625,-0.29688 0,-0.1875 1.14062,-0.5625 1.14063,-0.39062 1.54688,-0.32812 h 0.20312 q 0.40625,0 0.40625,-0.15625 0.0625,0 0.59375,-0.0625 0.53125,-0.0625 1.14063,0 2.26562,0.1875 3.57812,1.29687 1.3125,1.09375 1.28125,3.23438 -0.0625,0.84375 -0.92187,2.28125 -0.85938,1.4375 -1.89063,2.35937 -0.51562,0.45313 -1.17187,0.95313 -0.65625,0.5 -0.84375,0.625 l 0.79687,0.28125 q 1.5,0.57812 2.32813,1.42187 0.84375,0.82813 1.1875,2.26563 0.0781,0.3125 0.0781,0.57812 0,0.70313 -0.25,1.59375 -0.23437,0.89063 -0.53125,1.5625 -0.1875,0.20313 -0.3125,0.42188 -0.125,0.21875 -0.15625,0.28125 -0.0312,0.14062 -0.57812,0.98437 -0.54688,0.84375 -0.96875,1.35938 -2.10938,1.79687 -3.48438,2.29687 -1.34375,0.60938 -3.20312,0.60938 -1.65625,0 -2.45313,-0.51563 z m 2.29688,-13.26562 q 0.32812,-0.0781 1.28125,-0.28125 0.95312,-0.20313 1.375,-0.35938 0.0625,-0.0312 0.39062,-0.20312 0.34375,-0.1875 0.70313,-0.54688 2.01562,-1.875 2.8125,-3.39062 0.79687,-1.3125 0.79687,-1.84375 0,-0.39063 
-0.34375,-0.75 -0.60937,-0.70313 -2.01562,-0.70313 -0.60938,0 -1.03125,0.14063 -0.45313,0.125 -0.65625,0.29687 -0.20313,0.17188 -0.20313,0.53125 0,0.25 -0.17187,0.5625 -0.15625,0.29688 -0.34375,0.48438 -0.0312,0.20312 -0.17188,0.39062 -0.14062,0.1875 -0.21875,0.25 0,0.0781 -0.4375,0.875 -0.42187,0.79688 -0.79687,1.78125 -0.96875,2.28125 -0.96875,2.76563 z m 11.90643,10.29687 q -0.0937,-0.1875 -0.54688,-0.5 -0.45312,-0.32812 -0.45312,-0.39062 0,-0.14063 -0.14063,-0.375 -0.14062,-0.23438 -0.26562,-0.4375 -0.26563,-0.25 0.34375,-1.65625 0.60937,-1.40625 1.64062,-3.32813 0.1875,-0.1875 0.48438,-0.67187 0.3125,-0.48438 0.40625,-0.54688 0,-0.125 0.0937,-0.23437 0.0937,-0.10938 0.23437,-0.10938 l 0.25,-0.51562 q 0.125,-0.29688 1.25,-1.375 1.125,-1.09375 1.76563,-1.46875 0.59375,-0.42188 1.3125,-0.42188 0.76562,0 1.29687,0.45313 l 0.875,0.73437 q 0.95313,0.79688 1.07813,0.92188 l 1.1875,1.0625 -0.54688,1.59375 q -0.28125,0.85937 -0.28125,1.64062 0,1.20313 0.67188,2.26563 0.15625,0.34375 0.42187,0.54687 0.28125,0.1875 0.73438,0.25 0.70312,0.125 0.70312,0.70313 0,0.70312 -0.26562,1.0625 -0.25,0.34375 -0.60938,0.4375 -0.5,0.0781 -0.70312,0.0781 -0.92188,0 -1.6875,-0.48437 -0.76563,-0.48438 -1.34375,-1.64063 -0.125,-0.1875 -0.34375,-0.6875 -0.20313,-0.51562 -0.23438,-0.71875 -0.21875,0.32813 -0.73437,0.84375 -3.29688,3.32813 -5.34375,3.32813 -0.70313,0 -1.25,-0.35938 z m 1.76562,-2.65625 q 0.5,0.0625 1.98438,-1.15625 0.15625,-0.125 0.34375,-0.29687 0.1875,-0.17188 0.42187,-0.375 0.89063,-0.73438 1.39063,-1.21875 0.5,-0.5 1,-1.20313 l 0.70312,-1.15625 q -0.3125,-0.70312 -0.57812,-0.9375 -0.25,-0.25 -0.5,-0.25 -0.64063,0 -1.60938,1.14063 -0.95312,1.14062 -2.29687,3.4375 -0.35938,0.57812 -0.57813,1.20312 -0.21875,0.625 -0.28125,0.8125 z m 14.94391,3.8125 q -0.78125,0 -1.59375,-0.32812 -0.8125,-0.3125 -1.375,-0.78125 -0.5625,-0.46875 -0.5625,-0.84375 0,-0.39063 0.35938,-0.48438 l 0.15625,-0.125 q 0.0625,-0.0312 0.25,-0.0312 0.14062,-0.0312 0.32812,0.0156 0.1875,0.0469 
0.3125,0.0469 0.23438,0.0625 0.67188,0.125 l 0.84375,0.0937 q 0.64062,0 1.1875,-0.125 0.5625,-0.125 0.5625,-0.28125 0,-0.10937 -0.25,-0.32812 -0.25,-0.21875 -0.54688,-0.21875 0,-0.125 -0.98437,-0.85938 -0.54688,-0.42187 -1.04688,-0.82812 -0.5,-0.42188 -0.5,-0.625 l 0.0312,-0.0625 v -0.15625 q -0.21875,0 -0.42187,-0.60938 -0.1875,-0.60937 -0.1875,-1.1875 0,-1.07812 0.95312,-2.39062 0.96875,-1.3125 2.39063,-2.35938 1.4375,-1.0625 2.71875,-1.45312 0.46875,-0.21875 1.40625,-0.21875 0.60937,0 0.82812,0.125 l 0.25,-0.0625 q 0.39063,0 1,0.64062 0.60938,0.64063 0.76563,1.01563 0,1.0625 -0.0625,1.70312 -0.0625,0.64063 -0.32813,0.64063 -0.0312,0 -0.0937,0.0312 -0.0625,0.0312 -0.0625,0.15625 0,0.45312 -0.78125,1 -0.78125,0.53125 -1.42187,0.53125 l -0.125,-0.0312 q -0.20313,-0.0312 -0.57813,-0.375 -0.35937,-0.35938 -0.35937,-0.70313 0,-0.0937 0.375,-0.51562 0.39062,-0.42188 0.65625,-0.8125 0.28125,-0.40625 0.28125,-0.85938 0,-0.15625 -0.10938,-0.20312 -0.10937,-0.0469 -0.4375,-0.0469 -1.75,0.79688 -2.51562,1.5 -0.10938,0.20313 -0.1875,0.29688 -0.0781,0.0937 -0.20313,0.0937 0,0 0,-0.0156 0,-0.0156 -0.0312,-0.0156 -0.0625,0 -0.42187,0.375 -0.34375,0.35937 -0.65625,0.84375 -0.29688,0.46875 -0.35938,0.89062 l -0.0625,0.0937 v 0.15625 q -0.0312,0.0625 -0.0312,0.26563 0,0.28125 0.25,0.59375 0.25,0.29687 1.03125,0.84375 1.4375,1.09375 1.95313,1.625 0.53125,0.51562 0.67187,0.9375 0.28125,0.57812 0.28125,1.01562 0,1.09375 -1.14062,1.6875 -1.125,0.59375 -3.07813,0.59375 z m 13.21564,-13.79687 q -0.10937,0.0937 -0.20312,0.0937 -0.125,0 -0.40625,-0.1875 -0.32813,-0.0312 -0.85938,-0.54688 -0.51562,-0.51562 -0.51562,-0.67187 -0.0312,-0.39063 0.0625,-1.10938 0.0937,-0.71875 0.35937,-1.28125 0.28125,-0.57812 0.71875,-0.60937 0.23438,0.0312 0.60938,0.14062 0.39062,0.10938 0.875,0.29688 0.59375,1.21875 0.59375,2.07812 0,1.28125 -1.23438,1.79688 z m -2.95312,13.28125 q -0.76563,0 -1.32813,-0.79688 -0.5625,-0.79687 -0.6875,-1.98437 0.125,-1.1875 0.1875,-1.73438 0.35938,-1.5625 0.89063,-3.6875 
0.53125,-2.14062 0.625,-2.35937 0.15625,-0.28125 0.5,-0.28125 0.35937,0 0.78125,0.26562 0.4375,0.26563 0.73437,0.75 0.25,0.35938 0.25,0.79688 0,0.67187 -0.64062,2.82812 -0.4375,1.59375 -0.57813,2.26563 -0.125,0.67187 -0.0937,1.4375 -0.125,1.4375 -0.25,1.98437 -0.125,0.54688 -0.39062,0.51563 z m 6.8189,-0.21875 q -0.95312,0 -1.25,-0.10938 -0.28125,-0.125 -0.70312,-0.5 -0.0937,-0.23437 -0.40625,-0.60937 -0.29688,-0.39063 -0.48438,-0.57813 -0.0937,-0.35937 -0.0937,-0.82812 0,-0.84375 0.23438,-1.95313 0.23437,-1.125 0.59375,-2.04687 0.125,-0.26563 0.20312,-0.48438 0.0781,-0.21875 0.10938,-0.28125 0,-0.23437 0.8125,-1.39062 0.82812,-1.17188 1.3125,-1.625 0.21875,-0.25 0.67187,-0.53125 0.46875,-0.29688 0.82813,-0.42188 0.0937,-0.0312 0.78125,-0.28125 0.6875,-0.26562 1.26562,-0.26562 0.51563,0 1,0.20312 0.5,0.20313 0.82813,0.65625 0.82812,0.82813 0.82812,2.17188 0,0.26562 -0.0625,0.57812 -0.17187,0.64063 -0.67187,1.04688 -0.48438,0.39062 -1.03125,0.39062 -0.54688,0 -0.92188,-0.40625 -0.125,-0.32812 -0.125,-0.42187 0,-0.0937 0.125,-0.54688 0.125,-0.57812 0.125,-0.73437 0,-0.40625 -0.28125,-0.40625 -0.39062,0 -0.625,0.1875 -0.21875,0.1875 -0.82812,0.82812 -0.40625,0.45313 -0.96875,1.29688 -0.5625,0.84375 -0.875,1.60937 -0.29688,0.875 -0.40625,1.25 -0.10938,0.35938 -0.10938,0.42188 0,0.125 -0.0312,0.23437 -0.0312,0.10938 -0.0312,0.25 0,0.21875 0.0937,0.34375 0.0937,0.125 0.34375,0.26563 0.0781,0.0312 0.23438,0.125 0.15625,0.0937 0.3125,0.0937 0.20312,0 0.54687,-0.15625 0.35938,-0.17188 0.45313,-0.20313 0.40625,-0.1875 1.17187,-0.82812 l 0.78125,-0.57813 q 0.21875,-0.15625 0.39063,-0.26562 0.1875,-0.10938 0.28125,-0.15625 0.0937,-0.0469 0.0937,0.0156 0,0.125 0.0937,0.125 0.0937,0 0.21875,-0.125 0.125,-0.20312 0.45312,-0.20312 0.25,0 0.51563,0.15625 0.28125,0.14062 0.28125,0.23437 l -0.0312,0.15625 q 0,0.15625 0.15625,0.32813 0,0.125 -0.0781,0.32812 -0.0781,0.20313 -0.20312,0.34375 -0.32813,0.46875 -0.73438,0.98438 -0.39062,0.51562 -0.54687,0.70312 h -0.0312 q -0.15625,0 
-0.45313,0.29688 -0.46875,0.4375 -0.82812,0.4375 -0.125,0.0625 -0.29688,0.15625 -0.15625,0.0781 -0.15625,0.17187 -0.53125,0.54688 -2.875,0.54688 z m 10.59815,-0.32813 q -0.54688,0.125 -1.3125,-0.57812 -0.76563,-0.70313 -0.90625,-1.3125 -0.0625,-0.3125 -0.0625,-0.54688 0,-1.23437 0.79687,-3.01562 0.8125,-1.78125 2.17188,-3.34375 1.35937,-1.5625 2.98437,-2.26563 0.67188,-0.35937 0.85938,-0.35937 0.20312,0 0.875,0.35937 0.64062,0.15625 1.01562,0.34375 0.39063,0.1875 0.70313,0.60938 0.42187,0.82812 0.60937,2.04687 0.20313,0.67188 0.20313,0.95313 0,0.26562 -0.20313,0.45312 -0.0312,0.0312 -0.10937,0.125 -0.0781,0.0937 -0.0781,0.15625 0,0.0469 0.0625,0.0469 0,0.0312 -0.0312,0.14062 -0.0312,0.10938 -0.125,0.29688 -0.39062,1.21875 -0.39062,1.89062 0,0.21875 0.0625,0.54688 0.32812,0.82812 0.8125,0.82812 0.4375,0 1.01562,-0.57812 0.48438,-0.3125 0.96875,-0.3125 0.28125,0 0.28125,0.15625 0.0312,0.0312 0.125,0.0937 0.0937,0.0625 0.1875,0.0625 0.14063,0 0.14063,0.29687 0,0.25 -0.125,0.67188 -0.10938,0.40625 -0.26563,0.67187 -0.21875,0.21875 -0.35937,0.39063 -0.125,0.15625 -0.125,0.25 l -0.46875,0.4375 q -0.60938,0.35937 -1.45313,0.35937 -0.85937,0 -1.375,-0.35937 -0.5,-0.34375 -0.95312,-0.875 -0.45313,-0.53125 -0.57813,-0.9375 l -0.15625,-0.32813 -0.9375,0.60938 -0.79687,0.82812 q -0.25,0.32813 -0.4375,0.32813 h -0.32813 q 0,0.21875 -0.90625,0.54687 -0.89062,0.3125 -1.39062,0.3125 z m 0.67187,-2.6875 q 0.34375,0 1.4375,-0.8125 1.10938,-0.8125 1.5,-1.29687 0.125,-0.15625 0.32813,-0.45313 0.21875,-0.3125 0.375,-0.46875 0.51562,-0.60937 0.75,-0.96875 0.25,-0.375 0.25,-0.73437 0,-0.25 -0.23438,-0.70313 -0.125,-0.39062 -0.375,-0.39062 -0.14062,0 -0.73437,0.23437 -0.60938,0.34375 -1.4375,1.4375 -0.8125,1.07813 -1.42188,2.20313 -0.60937,1.125 -0.60937,1.64062 0,0.0937 0.0625,0.1875 0.0781,0.0937 0.10937,0.125 z m 11.06427,4.21875 q -0.48437,-0.15625 -0.71875,-0.57812 -0.23437,-0.40625 -0.23437,-1.20313 0,-1.3125 0.25,-2.5625 l 0.0312,-0.25 q 0.0312,-0.20312 0.0937,-0.625 0.0625,-0.4375 
0.0937,-0.25 l 0.20312,-0.85937 q 0.21875,-0.89063 0.42188,-1.65625 0.21875,-0.78125 0.34375,-1.125 l 0.34375,-1.125 q 0.96875,-3.10938 1.79687,-4.1875 0.0312,-0.125 0.0937,-0.23438 0.0625,-0.125 0.0937,-0.25 0,-0.51562 0.20313,-0.79687 0.28125,-0.48438 0.46875,-0.85938 l 0.14062,-0.35937 0.89063,-1.82813 q 0.0937,-0.3125 0.39062,-0.6875 0.3125,-0.39062 0.625,-0.53125 l -0.0312,-0.0312 q 0.17188,-0.25 0.59375,-0.4375 0.4375,-0.20312 0.75,-0.20312 0.42188,0 0.67188,0.26562 0.26562,0.25 0.26562,0.70313 v 0.3125 q 0,0.29687 -0.0469,0.45312 -0.0469,0.15625 -0.21875,0.39063 h 0.0312 q 0.0469,0.0312 -0.0469,0.20312 -0.0781,0.17188 -0.14062,0.32813 -0.21875,0.57812 -0.3125,0.64062 -0.23438,0.29688 -1.48438,2.71875 -1.23437,2.40625 -1.59375,3.32813 l -0.125,0.48437 -0.0625,0.20313 q -0.35937,0.76562 -0.39062,0.82812 l -0.51563,1.98438 -0.125,0.67187 -0.28125,0.73438 -0.1875,0.60937 q -0.0781,0.15625 -0.125,0.42188 -0.0469,0.25 -0.10937,0.3125 l -0.0312,0.15625 q -0.0312,0.125 -0.21875,0.73437 -0.29688,0.90625 -0.42188,1.5 -0.125,0.57813 -0.15625,1.25 l -0.0312,0.14063 q -0.0937,0.70312 -0.40625,1.04687 -0.29687,0.35938 -0.78125,0.21875 z m 6.82099,-0.59375 q -0.0937,0.0312 -0.26563,0.0312 -0.67187,0 -1.125,-0.76562 -0.4375,-0.78125 -0.4375,-1.96875 0,-0.625 0.125,-1.23438 0.35938,-0.9375 0.35938,-1.34375 0.0937,-0.60937 0.65625,-2.1875 0.5625,-1.59375 0.71875,-1.92187 0.125,0 0.21875,-0.20313 0.0937,-0.20312 0.0937,-0.46875 0,-0.28125 0.10937,-0.57812 0.125,-0.3125 0.25,-0.4375 0,-0.0937 0.0781,-0.34375 0.0781,-0.26563 0.0781,-0.35938 0,-0.0625 0.0781,-0.23437 0.0781,-0.1875 0.0781,-0.28125 0.51563,-0.98438 1.79688,-4.45313 1.28125,-3.48437 1.28125,-3.85937 0,-0.26563 0.1875,-0.48438 0.20312,-0.21875 0.45312,-0.21875 0.0937,-0.10937 0.40625,-0.15625 0.32813,-0.0469 0.51563,-0.0469 l 0.51562,0.51562 q 0.35938,0.32813 0.54688,0.57813 0.1875,0.25 0.1875,0.60937 0,0.57813 -0.20313,1.14063 -0.20312,0.54687 -1.20312,2.625 -0.0625,0.17187 -0.51563,1.15625 -0.3125,0.64062 
-0.3125,0.73437 0,0.0937 -0.14062,0.40625 -0.125,0.29688 -0.21875,0.625 -0.25,0.51563 -0.8125,1.8125 -0.5625,1.29688 -0.5625,1.42188 0,0.34375 -0.15625,0.5 -0.0937,0 -0.125,0.15625 -0.0312,0.14062 -0.0312,0.20312 0,0.3125 -0.51563,1.70313 l -0.28125,0.92187 q -0.60937,2.07813 -0.85937,3.04688 -0.23438,0.95312 -0.23438,1.625 0,0.70312 -0.23437,1.15625 -0.21875,0.4375 -0.5,0.57812 z m 13.97729,-12.23437 q 0.29688,0.57812 0.29688,1.3125 0,0.9375 -0.42188,1.53125 -0.3125,0.45312 -0.5625,1.35937 -0.23437,0.89063 -0.45312,2.39063 -0.29688,1.21875 -0.71875,3.75 -0.5,1.34375 -0.625,2.01562 -0.0469,0.1875 -0.3125,0.79688 -0.26563,0.60937 -0.4375,0.95312 -1.4375,2.92188 -3.40625,4.39063 -1.96875,1.46875 -4.01563,1.59375 -0.51562,0.0625 -0.60937,0.0625 -0.28125,0 -0.53125,-0.0937 -0.23438,-0.0937 -0.51563,-0.25 -0.64062,-0.45313 -0.73437,-0.64063 -0.17188,-0.32812 -0.17188,-0.64062 0,-0.48438 0.45313,-0.65625 0.45312,-0.17188 1.03125,-0.17188 0.21875,0.125 0.76562,0.125 0.54688,0 0.64063,-0.125 0.0625,-0.0625 0.375,-0.21875 0.32812,-0.14062 0.73437,-0.23437 1.0625,-0.51563 1.625,-1.03125 0.5625,-0.5 1.23438,-1.625 0.79687,-1.21875 1.04687,-2.20313 0.42188,-1.0625 1,-3.14062 l -0.0312,0.0312 q -0.17187,0.1875 -0.8125,0.5625 -0.625,0.375 -0.76562,0.375 -0.0312,0 -0.0937,0.0312 -0.0625,0.0312 -0.0625,0.125 0,0.125 -0.82813,0.40625 -0.82812,0.26562 -1.0625,0.26562 -0.3125,0 -0.6875,-0.34375 -0.35937,-0.35937 -0.53125,-0.65625 -0.125,-0.28125 -0.46875,-0.57812 -0.39062,-0.375 -0.57812,-0.67188 -0.1875,-0.3125 -0.1875,-0.73437 0,-0.21875 0.0312,-0.34375 0.375,-1.4375 1.40625,-4.42188 1.04687,-3 1.53125,-3.73437 0.125,-0.20313 0.46875,-0.32813 0.35937,-0.125 0.71875,-0.125 0.64062,0 0.85937,0.375 0.39063,0.54688 0.39063,0.9375 0,0.4375 -0.45313,1.3125 -0.125,0.34375 -0.32812,0.76563 -0.1875,0.42187 -0.1875,0.51562 -0.0312,0.125 -0.1875,0.78125 -0.15625,0.65625 -0.28125,0.65625 -0.20313,0.1875 -0.4375,1.32813 -0.23438,1.14062 -0.23438,1.8125 0,0.375 0.0937,0.375 0.85938,0 
1.65625,-0.45313 0.79688,-0.46875 1.67188,-1.5 0.125,-0.79687 0.35937,-1.71875 0.25,-0.9375 0.34375,-1.25 0.57813,-1.98437 0.76563,-3.20312 0.1875,-0.45313 0.28125,-0.53125 0.0937,-0.0781 0.42187,-0.0781 0.39063,0 0.84375,0.23438 0.46875,0.21875 0.6875,0.53125 z m 11.99976,13.79687 q -1.25,-0.0312 -1.78125,-0.59375 -0.51563,-0.5625 -0.51563,-1.57812 0,-0.26563 0.0625,-0.84375 0.125,-0.4375 0.1875,-0.89063 0.15625,-0.70312 0.32813,-0.82812 0.0312,-0.0312 0.0625,-0.125 0.0312,-0.10938 -0.0625,-0.17188 0,-0.40625 0.28125,-1.14062 0.0625,-0.23438 0.15625,-0.45313 0.0937,-0.21875 0.14062,-0.28125 -0.0781,-0.0781 -0.0781,-0.20312 0,-0.15625 0.17187,-0.3125 0.1875,-0.15625 0.26563,-0.28125 0.0781,-0.14063 -0.0156,-0.26563 -0.0625,-0.15625 0.60938,-1.4375 0.42187,-0.82812 0.45312,-0.9375 -0.0312,-0.0937 0.45313,-1.17187 0.5,-1.09375 0.79687,-1.64063 0.28125,-0.15625 -0.23437,-0.21875 -0.51563,-0.0625 -1.25,-0.0312 -1.04688,-0.0312 -1.625,-0.10937 -0.57813,-0.0937 -0.9375,-0.28125 -0.21875,-0.21875 -0.21875,-0.57813 0,-0.40625 0.29687,-0.75 0.3125,-0.35937 0.76563,-0.39062 0.82812,-0.0312 2.5625,-0.23438 l 1.625,-0.125 0.0937,-0.54687 q 0.15625,-0.28125 0.29687,-0.48438 0.15625,-0.21875 0.20313,-0.375 0.0469,-0.15625 0.0781,-0.48437 0.0937,-0.125 0.32813,-0.625 0.25,-0.51563 0.3125,-1 0.35937,-0.60938 0.65625,-1.17188 0.29687,-0.5625 0.375,-0.65625 0.5625,-0.85937 1.26562,-0.85937 0.39063,0 0.8125,0.25 0.40625,0.29687 0.625,0.51562 0.23438,0.21875 0.23438,0.60938 0,0.28125 -0.29688,0.82812 -0.125,0.25 -0.375,1.03125 -0.21875,0.28125 -0.48437,0.78125 -0.25,0.5 -0.34375,0.9375 -0.0625,0.15625 -0.20313,0.4375 -0.125,0.26563 -0.17187,0.3125 -0.0469,0.0469 -0.14063,0.0469 0.0312,0.20312 0.125,0.26562 0.0937,0.0625 0.32813,0.0625 0.15625,0 0.39062,-0.0312 0.25,-0.0312 0.59375,-0.0312 0.32813,-0.0312 0.5625,-0.0937 0.23438,-0.0625 0.46875,-0.0625 0.125,0 0.375,0.0937 0.15625,0.1875 0.15625,0.73438 0,0.40625 -0.125,0.78125 -0.125,0.375 -0.3125,0.46875 -0.15625,0.0312 
-0.70312,0.1875 -0.54688,0.15625 -1.1875,0.1875 l -1.70313,0.15625 -0.53125,1.09375 q -0.23437,0.39062 -0.39062,0.75 -0.15625,0.34375 -0.21875,0.40625 0,0.15625 -0.26563,0.64062 -0.25,0.48438 -0.375,0.76563 -0.17187,0.45312 -0.39062,0.89062 -0.21875,0.42188 -0.28125,0.54688 -0.0937,0 -0.125,0.0469 -0.0156,0.0469 -0.0156,0.14063 l 0.0781,0.26562 q 0,0.15625 -0.17188,0.3125 -0.125,0.0625 -0.23437,0.23438 -0.10938,0.15625 -0.14063,0.21875 0.0625,0.0625 0.0625,0.15625 0,0.23437 -0.21875,0.57812 0,0.54688 -0.45312,1.625 0.0312,0.57813 0.0625,0.75 0.0312,0.15625 0.15625,0.125 h 0.15625 l 0.70312,-0.0312 q 0.23438,-0.0937 0.10938,0.48438 -0.10938,0.5625 -0.54688,1.75 -0.48437,0.73437 -1.01562,1.29687 -0.51563,0.5625 -0.6875,0.5625 z m 8.84204,-3.14062 q -0.57813,-0.0937 -1.0625,-0.64063 -0.48438,-0.54687 -0.48438,-1.25 0,-0.76562 0.96875,-3.9375 0.15625,-0.82812 0.70313,-2.14062 0.1875,-0.54688 0.21875,-0.64063 0.57812,-1.92187 1.57812,-4.70312 l 0.28125,-0.76563 q 1.125,-3.04687 1.34375,-3.71875 l 0.51563,-1.40625 q 0,-0.25 0.35937,-0.48437 0.375,-0.25 0.625,-0.25 0.51563,0 0.98438,0.67187 0.48437,0.67188 0.51562,1.40625 0,0.39063 -0.15625,0.70313 -0.73437,1.60937 -1.82812,4.96875 -0.46875,1.40625 -0.95313,2.67187 -0.48437,1.26563 -0.76562,1.8125 l -0.35938,0.70313 -0.4375,1.6875 q 0,0.0937 -0.0312,0.17187 -0.0312,0.0625 -0.0312,0.0937 0,0.0937 0.0312,0.0937 l 0.53125,-0.54687 q 0.9375,-0.92188 2.21875,-1.89063 1.04688,-0.73437 1.07813,-0.82812 0.0312,-0.0937 0.26562,-0.29688 l 0.3125,-0.25 q 1.03125,-0.73437 2.21875,-0.73437 0.34375,0 0.82813,0.35937 0.48437,0.34375 0.8125,0.96875 0.34375,0.625 0.34375,1.32813 v 0.25 0.23437 q 0,0.79688 -0.3125,2.09375 -0.29688,1.29688 -0.32813,1.48438 0.15625,0 0.46875,-0.0625 0.48438,0 0.6875,0.0937 0.21875,0.0937 0.375,0.26562 0.15625,0.15625 0.28125,0.1875 0.0625,0.0625 0.0625,0.1875 0,0.73438 -0.79687,1.28125 -0.79688,0.54688 -1.53125,0.60938 -0.89063,0 -1.4375,-0.67188 -0.54688,-0.67187 -0.54688,-1.64062 0,-0.375 0.0312,-0.75 
0.0312,-0.39063 0.0625,-0.6875 0.0937,-0.64063 0.0937,-1.01563 0,-0.42187 -0.0781,-0.5625 -0.0781,-0.14062 -0.26563,-0.14062 -0.64062,0 -2.23437,1.17187 -1.57813,1.15625 -3.20313,3.01563 -1.03125,1.5 -1.95312,1.5 z m 17.8548,1.21875 q -1.82813,0 -2.95313,-1.01563 -1.10937,-1.03125 -1.10937,-2.92187 0,-0.73438 0.0937,-1.625 0.0625,-0.78125 0.625,-2.125 0.5625,-1.34375 1.53125,-2.60938 0.96875,-1.28125 2.28125,-1.89062 0.60937,-0.29688 1.03125,-0.40625 0.42187,-0.10938 0.95312,-0.10938 1.3125,0 2.125,0.48438 0.82813,0.46875 1.15625,1.07812 0.34375,0.60938 0.34375,1.0625 v 0.15625 0.0937 q 0,0.39063 -0.45312,1.42188 -0.45313,1.01562 -1.42188,1.92187 -0.96875,0.89063 -2.51562,0.98438 -0.76563,0 -1.48438,-0.14063 -0.71875,-0.15625 -0.90625,-0.15625 -0.48437,0 -0.60937,0.35938 -0.125,0.34375 -0.125,1.1875 0.0937,0.89062 0.48437,1.25 0.40625,0.34375 1.17188,0.34375 0.39062,0 0.89062,-0.21875 0.51563,-0.26563 1.21875,-0.8125 0.39063,-0.3125 0.60938,-0.4375 0.21875,-0.125 0.54687,-0.125 0.28125,0 0.67188,0.39062 0.39062,0.375 0.42187,0.71875 -0.0781,0.39063 -0.71875,1.15625 -0.625,0.76563 -1.65625,1.375 -1.03125,0.60938 -2.20312,0.60938 z m 1.10937,-7.71875 q 0.60938,0 0.84375,-0.125 0.3125,-0.15625 0.96875,-0.8125 0.65625,-0.65625 0.6875,-0.90625 0,-0.32813 -0.28125,-0.53125 -0.26562,-0.20313 -0.73437,-0.28125 h -0.35938 q -0.42187,0 -1.20312,0.70312 -0.78125,0.6875 -1.35938,1.57813 0.0312,0.15625 0.48438,0.26562 0.45312,0.10938 0.95312,0.10938 z m 16.38513,7.90625 q -0.98437,0 -1.85937,-0.4375 -0.85938,-0.45313 -1.375,-1.14063 -0.5,-0.6875 -0.5,-1.35937 0,-0.54688 0.4375,-0.54688 0.29687,0.0625 0.67187,0.35938 0.48438,0.28125 0.8125,0.4375 0.34375,0.15625 0.70313,0.15625 0.53125,0 0.9375,-0.17188 0.40625,-0.17187 0.46875,-0.53125 0,-0.125 -0.14063,-0.54687 -0.125,-0.4375 -0.3125,-0.79688 -0.70312,-1.04687 -0.95312,-2.04687 -0.23438,-1.01563 -0.29688,-2.35938 0,-0.73437 1.0625,-1.75 1.07813,-1.03125 2.5,-1.82812 1.4375,-0.8125 2.32813,-0.90625 1.76562,0 2.10937,0.29687 
0.45313,0.34375 0.82813,1.14063 0.39062,0.79687 0.39062,1.34375 0,0.23437 -0.21875,0.57812 -0.20312,0.32813 -0.26562,0.42188 l -0.28125,0.4375 -0.21875,0.48437 q -0.0781,0.15625 -0.21875,0.48438 -0.14063,0.3125 -0.25,0.39062 -0.10938,0.0625 -0.34375,0.0156 -0.53125,0 -0.98438,-0.46875 -0.45312,-0.48438 -0.45312,-1.0625 0,-0.40625 0.28125,-0.82813 0.29687,-0.42187 0.29687,-0.48437 0,-0.125 -0.28125,-0.125 -0.42187,0 -1.29687,0.42187 -0.85938,0.40625 -1.53125,1.09375 -0.67188,0.6875 -0.67188,1.53125 0,0.4375 0.17188,1.03125 0.1875,0.59375 0.40625,0.98438 0.26562,0.25 0.75,1.51562 0.5,1.26563 0.5,2 0,1.125 -0.84375,1.70313 -0.84375,0.5625 -2.35938,0.5625 z m 11.04377,-0.64063 q -0.3125,0.0625 -0.57813,0.0625 -0.40625,0 -0.76562,-0.0781 -0.34375,-0.0781 -0.34375,-0.20313 0,-0.0937 -0.0469,-0.125 -0.0312,-0.0312 -0.125,-0.0312 -0.0937,-0.0937 -0.4375,-0.32812 -0.32812,-0.25 -0.54687,-0.34375 -0.39063,-0.54688 -0.39063,-1.1875 0,-0.60938 0.39063,-1.95313 0.0312,-1.25 0.90625,-2.79687 0.875,-1.54688 2.125,-2.875 1.25,-1.32813 2.25,-1.90625 0.46875,-0.48438 1.34375,-0.48438 0.67187,0 1.1875,0.28125 0.40625,0.26563 1.03125,1.09375 0.625,0.82813 0.9375,1.5 0.0781,0.23438 0.1875,0.625 0.10937,0.375 0.10937,0.92188 l -0.0625,0.89062 q -0.39062,1.3125 -0.39062,2.35938 -0.0625,0.45312 0.0625,0.73437 0.14062,0.26563 0.35937,0.26563 0.15625,0 0.34375,-0.0937 0.20313,-0.0937 0.45313,-0.0937 0.42187,0 0.70312,0.25 0.29688,0.25 0.29688,0.64062 0,0.3125 -0.17188,0.60938 -0.28125,0.64062 -1.0625,1.15625 -0.78125,0.5 -1.45312,0.5 -0.54688,0 -1.04688,-0.5625 -0.5,-0.57813 -0.78125,-1.51563 l -0.1875,-0.73437 -1.15625,1.0625 -0.73437,0.73437 q -0.51563,0.51563 -0.57813,0.60938 -0.15625,0.25 -0.5625,0.48437 -0.40625,0.21875 -1.26562,0.53125 z m 0.85937,-3.0625 q 0.57813,-0.32812 0.96875,-0.79687 0.40625,-0.48438 1.01563,-1.34375 0.14062,-0.20313 0.26562,-0.40625 0.125,-0.20313 0.3125,-0.4375 0.54688,-0.73438 0.78125,-1.17188 0.25,-0.45312 0.375,-0.96875 -0.125,-0.42187 -0.375,-0.78125 
-0.23437,-0.375 -0.42187,-0.40625 -0.90625,0 -2.4375,2.10938 -0.15625,0.23437 -0.5,0.8125 -0.32813,0.5625 -0.42188,0.76562 -0.17187,0.21875 -0.4375,1.07813 -0.26562,0.84375 -0.26562,1.09375 0,0.70312 0.40625,0.70312 0.32812,0 0.73437,-0.25 z m 25.32318,0.92188 q 0.21875,0.32812 0.21875,0.60937 0,0.32813 -0.23437,0.57813 -0.21875,0.25 -0.375,0.25 -0.76563,0 -1.71875,-0.35938 -0.9375,-0.375 -0.9375,-1.07812 v -1.98438 q 0,-1.28125 -0.21875,-1.46875 -0.48438,-0.25 -1.28125,0.71875 -0.79688,0.96875 -1.95313,2.89063 -0.45312,0.70312 -0.78125,1.01562 -0.3125,0.29688 -0.5625,0.29688 -0.21875,0 -0.39062,-0.125 -0.98438,-0.28125 -0.98438,-1.125 0,-0.0937 0.0625,-0.40625 0.28125,-1 0.60938,-2.20313 0.34375,-1.21875 0.54687,-2.09375 -0.0937,-0.1875 -0.32812,-0.1875 -0.3125,0 -0.73438,0.32813 -0.125,0.0937 -0.46875,0.29687 -0.32812,0.20313 -0.67187,0.54688 -0.32813,0.32812 -0.6875,0.90625 -0.125,0.23437 -0.4375,0.67187 -1.60938,2.4375 -1.95313,2.75 -0.17187,0.45313 -0.3125,0.64063 -0.14062,0.17187 -0.32812,0.17187 -0.125,0 -0.32813,-0.10937 -0.60937,-0.28125 -0.98437,-0.8125 -0.35938,-0.53125 -0.35938,-1.26563 0,-0.46875 0.28125,-1.90625 0.29688,-1.45312 0.65625,-2.70312 0.25,-0.5 0.39063,-1.0625 0.14062,-0.5625 0.14062,-0.78125 0.10938,-0.60938 0.39063,-1.64063 0.28125,-1.04687 0.42187,-1.15625 0.1875,-0.1875 0.70313,-0.1875 0.34375,0 0.65625,0.14063 0.32812,0.125 0.45312,0.375 0.42188,0.48437 0.42188,1.3125 0,0.76562 -0.3125,1.625 -0.32813,0.73437 -0.8125,2.15625 0,0 0.0625,-0.0781 0.78125,-1.07812 1.98437,-2.42187 0.57813,-0.64063 1.29688,-0.98438 0.71875,-0.35937 1.17187,-0.35937 0.76563,0.125 1.29688,0.75 0.53125,0.625 0.65625,1.70312 0,0.39063 -0.0625,0.95313 -0.0625,0.5625 -0.0937,0.71875 0.125,-0.15625 0.65625,-0.78125 0.53125,-0.625 1.29687,-1.29688 0.70313,-0.54687 1.3125,-0.54687 0.5,0 0.89063,0.42187 0.70312,0.40625 1.09375,1.34375 0.40625,0.92188 0.40625,2.10938 0,0.40625 -0.0312,0.64062 -0.0312,0.21875 -0.0312,0.64063 0,0.4375 0.0937,0.84375 0.10937,0.40625 
0.20312,0.71875 z m 4.44238,2.3125 q -1.0625,-0.42188 -1.5625,-1.29688 -0.48437,-0.89062 -0.48437,-2.29687 0,-0.3125 0.0625,-1.15625 0.0312,-0.25 0.0312,-0.59375 0,-0.54688 -0.125,-0.8125 -0.125,-0.28125 -0.45313,-0.57813 -0.0937,-0.0937 -0.45312,-0.40625 -0.34375,-0.32812 -0.3125,-0.48437 v -0.0937 q 0,-0.21875 0.10937,-0.48438 0.10938,-0.28125 0.21875,-0.34375 0.0937,-0.0312 0.375,-0.1875 0.29688,-0.17187 0.45313,-0.10937 0.125,0 0.39062,0.17187 0.28125,0.15625 0.28125,0.28125 0.1875,0.39063 0.98438,-1.25 0.45312,-0.67187 1.21875,-1.32812 0.76562,-0.65625 1.40625,-1 0.23437,-0.10938 1.01562,-0.28125 0.78125,-0.17188 1.23438,-0.17188 0.5,0 1.1875,0.42188 0.6875,0.40625 1.14062,0.89062 0.67188,0.48438 0.67188,1.82813 0,0.79687 -0.25,1.375 -0.125,0.46875 -1.26563,1.45312 -1.14062,0.96875 -1.84375,1.26563 -0.70312,0.15625 -1.5625,0.15625 -0.875,0 -1.1875,-0.15625 -0.21875,-0.125 -0.375,0.10937 -0.14062,0.23438 -0.23437,1.07813 l -0.0312,0.3125 q 0,0.35937 0.15625,0.67187 0.23437,0.39063 0.42187,0.54688 0.1875,0.15625 0.57813,0.15625 l 0.28125,-0.0312 q 0.35937,-0.0312 0.75,-0.34375 0.40625,-0.32813 0.95312,-0.875 0.57813,-0.54688 0.67188,-0.64063 0.15625,-0.3125 0.85937,-0.3125 0.28125,0 0.73438,0.0937 0.32812,0.25 0.32812,0.64063 0,0.54687 -0.51562,1.46875 -0.15625,0.32812 -0.98438,1.09375 -0.8125,0.76562 -1.29687,1.04687 -0.85938,0.48438 -1.90625,0.48438 -0.8125,0 -1.67188,-0.3125 z m 2.375,-7.20313 q 0.76563,-0.20312 1.1875,-0.46875 0.4375,-0.26562 1.01563,-1.07812 0.25,-0.76563 0.0937,-0.92188 -0.1875,-0.1875 -0.45313,-0.1875 -0.3125,0 -0.76562,0.34375 -0.3125,0.0937 -0.875,0.59375 -0.5625,0.5 -0.95313,0.96875 -0.375,0.45313 -0.21875,0.54688 -0.0625,0.10937 0.3125,0.15625 0.39063,0.0469 0.65625,0.0469 z"
id="path4094"
inkscape:connector-curvature="0"
style="fill:#000000;fill-rule:nonzero" />
<path
d="M 427.809,138.3307 H 553.08723 V 253.13125 H 427.809 Z"
id="path4100"
inkscape:connector-curvature="0"
style="fill:#000000;fill-opacity:0;fill-rule:evenodd" />
<path
d="M 379.751,138.43701 H 505.04235 V 253.24016 H 379.751 Z"
id="path4109"
inkscape:connector-curvature="0"
style="fill:#00fdc8;fill-rule:evenodd" />
<path
d="M 379.751,138.43701 H 505.04235 V 253.24016 H 379.751 Z"
id="path4111"
inkscape:connector-curvature="0"
style="fill-rule:evenodd;stroke:#666666;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round" />
<path
d="m 299.38583,404.37924 h 70.36065 v -6.68504 l 8.40525,13.37008 -8.40525,13.37007 v -6.68502 h -70.36065 z"
id="path4071-3"
inkscape:connector-curvature="0"
style="fill:#7c7ce0;fill-rule:evenodd;stroke-width:0.79288208" />
<path
d="m 506.71874,190.09462 h 70.36065 v -6.68504 l 8.40525,13.37008 -8.40525,13.37007 v -6.68502 h -70.36065 z"
id="path4071-6"
inkscape:connector-curvature="0"
style="fill:#7c7ce0;fill-rule:evenodd;stroke-width:0.79288208" />
<path
d="m 506.71875,404.41666 h 70.36065 v -6.68504 l 8.40525,13.37008 -8.40525,13.37007 v -6.68502 h -70.36065 z"
id="path4071-6-7"
inkscape:connector-curvature="0"
style="fill:#7c7ce0;fill-rule:evenodd;stroke-width:0.79288208" />
<flowRoot
xml:space="preserve"
id="flowRoot4272"
style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:26.66666603px;line-height:1.25;font-family:'Comic Sans MS';-inkscape-font-specification:'Comic Sans MS';text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none"><flowRegion
id="flowRegion4274"
style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:26.66666603px;font-family:'Comic Sans MS';-inkscape-font-specification:'Comic Sans MS';text-align:center;text-anchor:middle"><rect
id="rect4276"
width="125.97456"
height="90.782555"
x="379.06781"
y="162.45763"
style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:26.66666603px;font-family:'Comic Sans MS';-inkscape-font-specification:'Comic Sans MS';text-align:center;text-anchor:middle" /></flowRegion><flowPara
id="flowPara4278">Some process</flowPara></flowRoot> <g
transform="matrix(0.45555722,0,0,0.45555774,379.751,138.43701)"
id="g4082-1"
inkscape:transform-center-x="-34.322034"
inkscape:transform-center-y="102.9661">
<clipPath
id="clipPath4292">
<path
d="M 0,0 H 275 V 252 H 0 Z"
id="path4290"
inkscape:connector-curvature="0"
style="clip-rule:evenodd" />
</clipPath>
<image
clip-path="url(#p.3-4)"
width="275"
height="252"
x="0"
y="0"
preserveAspectRatio="none"
xlink:href="https://desfontain.es/privacy/images/magic.gif"
id="image4080-3"
style="fill:#000000" />
</g>
<rect
style="fill:#ffffff"
id="rect148"
width="28.983051"
height="45.762711"
x="212.03391"
y="388.22034" />
</g>
</svg>
</center></p>
<p>By "basically the same", I don't mean "it looks a bit similar". Instead,
remember that the magic you added to the process was <em>randomized</em>. You don't
always get the same output if you run the new process several times. So what
does "basically the same" mean in this context? It means that you can get the
<em>exact same</em> output from both databases with similar likelihood.</p>
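<p>To make "similar likelihood" concrete, here is a small Python sketch (my own illustration, not part of the original post): a counting query answered with Laplace noise. Whatever concrete output you observe, its probability density under two databases that differ in one person stays within a factor of <span class="math">\(e^\varepsilon\)</span>.</p>

```python
import math
import random

def noisy_count(true_count, epsilon):
    """Laplace mechanism for a counting query (sensitivity 1): the
    difference of two Exp(1) draws is a standard Laplace sample."""
    noise = (random.expovariate(1.0) - random.expovariate(1.0)) / epsilon
    return true_count + noise

def likelihood_ratio(output, count_a, count_b, epsilon):
    """How much more likely `output` is under a database whose true
    count is `count_a` than under a neighboring one with `count_b`."""
    density = lambda c: (epsilon / 2) * math.exp(-epsilon * abs(output - c))
    return density(count_a) / density(count_b)

# Two databases differing in one person: true counts 1000 and 1001.
# Any output is at most e^epsilon times more likely under one of them.
print(likelihood_ratio(1000.3, 1000, 1001, epsilon=1.1) <= math.exp(1.1))  # → True
```

<p>The ratio here is <span class="math">\(e^{0.44} \approx 1.55\)</span>, comfortably below the bound <span class="math">\(e^{1.1} \approx 3\)</span>; no single output lets the attacker tell the two databases apart with certainty.</p>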
<p>What does this have to do with privacy? Well, suppose you're a creepy person
trying to figure out whether your target is in the original data. By looking at
the output, you can't be 100% certain of anything. Sure, it could have come from
a database with your target in it. But it could also have come from the exact
same database, without your target. Both options have a similar probability, so
there's not much you can say.</p>
<p>You might have noticed that this definition doesn't say anything about what the
output data looks like. Differential privacy is not a property of the output
data. It's very different from, say, <a href="k-anonymity.html"><span class="math">\(k\)</span>-anonymity</a>, one of the first
data privacy definitions. You can't look at the output data and determine
whether it satisfies differential privacy. Instead, differential privacy is a
property of the <em>process</em>: you have to know how the data was generated to
determine whether it's differentially private.</p>
<p>That's about it for the high-level intuition. It's a little abstract, but not
very complicated. So, why all the hype? What makes it so awesome compared to
older, more straightforward definitions?</p>
<h1 id="what-makes-differential-privacy-special">What makes differential privacy special</h1>
<p>Privacy experts, especially in academia, are enthusiastic about differential
privacy. It was first proposed by Cynthia Dwork, Frank McSherry, Kobbi Nissim
and Adam Smith in 2006<sup id="fnref:papers"><a class="footnote-ref" href="#fn:papers">1</a></sup>. Very soon, almost all researchers working on
anonymization started building differentially private algorithms. Tech companies
and governments are adopting it fast. So, why all the hype? I can count three
main reasons.</p>
<h2 id="you-no-longer-need-attack-modeling">You no longer need attack modeling</h2>
<p>All definitions that came before needed some assumptions about the attacker. To
choose the right notion, you needed to figure out the attacker's capabilities
and goals. How much prior knowledge do they have? What auxiliary data are they
allowed to use? What kind of information do they want to learn?</p>
<p>Doing in practice was difficult and very error-prone. Answering these questions
is very tricky: in particular, you might not know exactly what the attacker
wants or is capable of. Worse, there might be <em>unknown unknowns</em>: attack vectors
that you didn't anticipate at all. For that reason, you couldn't make very broad
statements with these old-school definitions. You had to make some assumptions,
which you couldn't be 100% sure of.</p>
<p>By contrast, when you use differential privacy, you get two awesome guarantees.</p>
<ol>
<li>You protect <em>any kind of information about an individual</em>. It doesn't matter
what the attacker wants to do. Reidentify their target, know if they're in
the dataset, deduce some sensitive attribute… All those things are protected.
Thus, you don't have to think about the goals of your attacker.</li>
<li>It works <em>no matter what the attacker knows about your data</em>. They might
already know some people in the database. They might even add some fake users
to your system. With differential privacy, it doesn't matter. The users that
the attacker doesn't know are still protected.</li>
</ol>
<h2 id="you-can-quantify-the-privacy-loss">You can quantify the privacy loss <a name="quantify"></a></h2>
<p>Differential privacy, like older notions, comes with a numeric parameter that
you can tweak. There is a big difference, though, in how meaningful that
parameter is. Take <a href="k-anonymity.html"><span class="math">\(k\)</span>-anonymity</a>, for example. It tells you that each
record in the output dataset "looks like" at least <span class="math">\(k-1\)</span> other records. But what does
the value of <span class="math">\(k\)</span> tell us about the level of protection?</p>
<p>The answer is… not much. There is no clear link between the value of <span class="math">\(k\)</span> and how
private the dataset is. So <a href="k-anonymity.html#how-to-choose-k">choosing <span class="math">\(k\)</span></a> is very handwavy, and can't be
justified in a formal way. The problem is <a href="l-diversity.html#the-bad-news-policy">even worse</a> with other old-school
definitions.</p>
<p>Differential privacy is much better. When you use it, you can quantify the
<em>greatest possible information gain</em> by the attacker. The corresponding
parameter, named <span class="math">\(\varepsilon\)</span>, allows you to make formal statements. Suppose
<span class="math">\(\varepsilon=1.1\)</span>. Then, you can say: "an attacker who thinks their target is in
the dataset with probability 50% can increase their level of certainty to at
most 75%." Choosing the exact value of <span class="math">\(\varepsilon\)</span> isn't easy, but at least,
it can be interpreted in a formal way.</p>
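<p>Where does the "50% to at most 75%" figure come from? It is Bayes' rule with the likelihood ratio between the two hypotheses capped at <span class="math">\(e^\varepsilon\)</span>. A quick sketch (the function name is mine):</p>

```python
import math

def max_posterior(prior, epsilon):
    """Upper bound on the attacker's belief after seeing the output of
    an epsilon-DP mechanism: update the prior odds by the largest
    likelihood ratio the definition allows, exp(epsilon)."""
    odds = math.exp(epsilon) * prior / (1 - prior)
    return odds / (1 + odds)

print(round(max_posterior(0.5, 1.1), 2))  # → 0.75
```

<p>With <span class="math">\(\varepsilon=1.1\)</span> and a 50% prior, the posterior is capped at <span class="math">\(e^{1.1}/(1+e^{1.1}) \approx 0.75\)</span> — the 75% from the statement above.</p>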
<p>And do you remember the previous point about attack modeling? It means you can
change this statement in many ways. You can replace "their target is in the
dataset" by anything about one individual. And you can add "no matter what the
attacker knows" if you want to be extra-precise. Altogether, that makes
differential privacy much stronger than all definitions that came before.</p>
<h2 id="you-can-compose-multiple-mechanisms">You can compose multiple mechanisms <a name="composition"></a></h2>
<p>Suppose you have some data. You want to share it with Alex and with Brinn, in
some anonymized fashion. You trust Alex and Brinn equally, so you use the same
definition of privacy for both of them. They are not interested in the same
aspects of the data, so you give them two different versions of your data. Both
versions are "anonymous", for the definition you've chosen.</p>
<p>What happens if Alex and Brinn decide to conspire, and compare the data you gave
them? Will the union of the two anonymized versions still be anonymous? It turns
out that for most definitions of privacy, this is not the case. If you put two
<span class="math">\(k\)</span>-anonymous versions of the same data together, the result won't be
<span class="math">\(k\)</span>-anonymous. So if Alex and Brinn collaborate, they might be able to
reidentify users on their own… or even reconstruct all the original data! That's
not good news.</p>
<p>With differential privacy, you can avoid this failure mode. Suppose that you
gave differentially private data to Alex and Brinn. Each time, you used a
parameter of <span class="math">\(\varepsilon\)</span>. Then if they conspire, the resulting data is still
protected by differential privacy. The level of privacy is now weaker: the
parameter becomes <span class="math">\(2\varepsilon\)</span>. So they still gain some information, but you
can now quantify how much. This property is called <em>composition</em>.</p>
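<p>In code, this simplest form of composition (sometimes called basic or sequential composition) is as straightforward as it sounds — the budgets just add up:</p>

```python
def total_epsilon(epsilons):
    """Basic sequential composition: releasing several differentially
    private results about the same data is itself differentially
    private, with a budget equal to the sum of the epsilons."""
    return sum(epsilons)

# Alex's release and Brinn's release, each with epsilon = 1.1:
# the combined information is still covered, with epsilon = 2.2.
print(total_epsilon([1.1, 1.1]))  # → 2.2
```

<p>More refined composition theorems exist that give tighter totals, but the sum is always a valid upper bound on the combined privacy loss.</p>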
<p>This scenario sounds a bit far-fetched, but composition is super useful in
practice. Organizations often want to do many things with data. Publish
statistics, release an anonymized version, train machine learning algorithms…
Composition is a way to stay in control of the level of risk as new use cases
appear and processes evolve.</p>
<h1 id="conclusion">Conclusion</h1>
<p>I hope the basic intuition behind differential privacy is now clear. If you
remember a single thing, let it be this one-line summary: uncertainty in the
process means uncertainty for the attacker, which means better privacy.</p>
<p>I also hope that you're now wondering <em>how it actually works</em>! What hides behind
this magic that makes everything safe and private? Why does differential privacy
have all the awesome properties I've mentioned? This is the exact topic of the
<a href="differential-privacy-in-more-detail.html">next article</a> in this <a href="friendly-intro-to-differential-privacy.html">series</a>, which explains this in more detail while still
staying clear of heavy math.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:papers">
<p>The idea was first proposed in a
<a href="https://desfontain.es/PDFs/PhD/CalibratingNoiseToSensitivityInPrivateDataAnalysis.pdf">scientific paper<sup> (pdf)</sup></a> presented at <a href="https://www.iacr.org/workshops/tcc/">TCC</a> 2006, and can
also be found in a <a href="https://desfontain.es/PDFs/PhD/SelectivePrivacyGuarantees.pdf">patent<sup> (pdf)</sup></a> filed by Dwork and
McSherry in 2005. The name <em>differential privacy</em> seems to have appeared first
in an <a href="https://desfontain.es/PDFs/PhD/DifferentialPrivacy.pdf">invited paper<sup> (pdf)</sup></a> presented at <a href="http://eatcs.org/index.php/international-colloquium">ICALP</a> 2006 by
Dwork. <a class="footnote-backref" href="#fnref:papers" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>δ-presence, for when being in the dataset is sensitive2018-04-26T00:00:00+02:002018-04-26T00:00:00+02:00Damien Desfontainestag:desfontain.es,2018-04-26:/privacy/delta-presence.html<p><em>δ</em>-presence is a privacy notion which captures a different attack model than what we've previously seen. Let's understand why yet another definition is necessary, and what the solution looks like!</p><p><strong>Remember</strong> <a href="k-map.html"><span class="math">\(k\)</span>-map</a>? We used this definition when the attacker
didn't know who was in the dataset. Let's go back to this setting, with a
slightly different scenario. You're no longer a doctor studying human sexual
behavior. You're still a doctor, but this time, you're specialized in treating a
particular chronic disease. Instead of running a survey, you're running a
clinical trial for a new drug to treat this disease. Similarly, you want to
share the data with other people.</p>
<p>At first glance, these two settings look similar — but there is a crucial
difference. Which information is sensitive, exactly? For the survey, the
<em>answers</em> of each participant are sensitive, as they reveal intimate details.
But for the clinical study, <em>being</em> in the dataset is the sensitive information.
If someone figures out that you've taken part in the study, they learn that you
suffer from this disease.</p>
<p>So, what does it change in practice? Suppose that your dataset contains the
following records:</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">10</td>
</tr>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">12</td>
</tr>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">13</td>
</tr>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">13</td>
</tr>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">16</td>
</tr>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">43</td>
</tr>
</tbody>
</table>
<p>You do a little research on who lives in ZIP code 85535. You learn that in this
ZIP code: </p>
<ul>
<li><strong>5</strong> people have ages between 10 and 19;</li>
<li><strong>5</strong> people have ages between 20 and 29;</li>
<li><strong>10</strong> people have ages between 30 and 39;</li>
<li><strong>10</strong> people have ages between 40 and 49;</li>
<li>and <strong>20</strong> people are 50 or older.</li>
</ul>
<p>Transforming this part of your dataset to have it satisfy <span class="math">\(5\)</span>-map is easy:</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">10-19</td>
</tr>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">10-19</td>
</tr>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">10-19</td>
</tr>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">10-19</td>
</tr>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">10-19</td>
</tr>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">40-49</td>
</tr>
</tbody>
</table>
<p>… But what went wrong here?</p>
<p>An attacker, using only public data, knows that there are 5 people aged between
10 and 19 in ZIP code 85535. Then, by looking at your de-identified dataset, the
attacker can figure out that <em>all of them are part of your data</em>. Thus, they all
have this specific disease. The attacker learned something sensitive about
individuals, without re-identifying any record. Just like in the example of
<a href="l-diversity.html"><span class="math">\(l\)</span>-diversity</a>!</p>
<p>We need yet another definition. Introducing… <span class="math">\(\delta\)</span>-presence!</p>
<h1 id="definition">Definition</h1>
<p>Remember what we counted for our previous privacy definitions? For each
combination of quasi-identifier attributes:</p>
<ul>
<li>for <a href="k-anonymity.html"><span class="math">\(k\)</span>-anonymity</a>, we counted the number of records <em>in
the dataset</em>;</li>
<li>and for <a href="k-map.html"><span class="math">\(k\)</span>-map</a>, we counted the number of records <em>in the larger
population</em>.</li>
</ul>
<p>What went wrong in our leading example? For certain attributes, these numbers
were <em>equal</em>. To detect this, we now compute the <em>ratio</em> between those two
numbers. Then, the <span class="math">\(\delta\)</span> in <span class="math">\(\delta\)</span>-presence is the <em>largest</em> ratio across
the dataset.</p>
<p>Consider the dataset above. The ratio for the records (85535, 10-19) is <span class="math">\(5/5=1\)</span>,
and the ratio for the records (85535, 40-49) is <span class="math">\(1/10=0.1\)</span>. Thus, since we
defined <span class="math">\(\delta\)</span> as the greatest ratio, we have <span class="math">\(\delta=1\)</span>. Since the <span class="math">\(k\)</span> of
<span class="math">\(k\)</span>-map is always at least as large as the <span class="math">\(k\)</span> of <span class="math">\(k\)</span>-anonymity, this is the maximum
possible value of <span class="math">\(\delta\)</span>. Saying that a dataset satisfies <span class="math">\(1\)</span>-presence gives
zero guarantees.</p>
<p>Avoiding <span class="math">\(\delta=1\)</span> is not the only thing that matters. We also want this value to
be <em>small</em>. The lower, the better. Consider what it means if <span class="math">\(\delta=0.95\)</span>. The
attacker might learn that their target has a 95% chance of being in the dataset.
It's not quite a 100% certainty, but it still can be problematic. For example,
it might be more than enough for an insurance company to deny you coverage…</p>
<p>How do we get to a lower <span class="math">\(\delta\)</span> in our previous example? One solution would be
to generalize the age further:</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">10-39</td>
</tr>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">10-39</td>
</tr>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">10-39</td>
</tr>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">10-39</td>
</tr>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">10-39</td>
</tr>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">40-49</td>
</tr>
</tbody>
</table>
<p>Then, the ratio for the records (85535, 10-39) becomes <span class="math">\(5/(5+5+10)=0.25\)</span>. The
ratio for the record (85535, 40-49) is still <span class="math">\(0.1\)</span>, so <span class="math">\(\delta=0.25\)</span>. (Assuming that
no other record in the dataset has ZIP code 85535, and all other records have a
smaller ratio).</p>
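<p>Both computations can be sketched in a few lines of Python (a hypothetical helper, not from the paper):</p>

```python
def delta_presence(dataset_counts, population_counts):
    """delta is the largest ratio, over all quasi-identifier values,
    between the number of matching records in the dataset and in the
    wider population."""
    return max(dataset_counts[q] / population_counts[q]
               for q in dataset_counts)

# Ages in 10-year buckets: 5 dataset records match exactly the 5
# people aged 10-19 in ZIP code 85535, so delta = 1: no protection.
print(delta_presence({"10-19": 5, "40-49": 1},
                     {"10-19": 5, "40-49": 10}))  # → 1.0

# Coarser bucket 10-39 matches 5 + 5 + 10 = 20 people: delta = 0.25.
print(delta_presence({"10-39": 5, "40-49": 1},
                     {"10-39": 20, "40-49": 10}))  # → 0.25
```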
<p><span class="math">\(\delta\)</span>-presence was first proposed by Nergiz et al. in a <a href="https://desfontain.es/PDFs/PhD/HidingThePresenceOfIndividualsFromSharedDatabases.pdf">2007 paper<sup>
(pdf)</sup></a>. In this paper, the definition is a bit different. The
authors compute not only the largest ratio, but also the <em>smallest</em> one. The
<span class="math">\(\delta\)</span> parameter hides two parameters
<span class="math">\(\left(\delta_{\text{min}},\delta_{\text{max}}\right)\)</span>. This was done to protect
against the symmetric attack: hiding that someone is <em>not</em> in the dataset. I
never encountered a situation where this is a real concern, so I simplified it a
bit for this post.</p>
<h1 id="delta-presence-in-practice"><span class="math">\(\delta\)</span>-presence in practice</h1>
<p><span class="math">\(\delta\)</span>-presence is computed from the ratios between quantities used in
<span class="math">\(k\)</span>-anonymity and <span class="math">\(k\)</span>-map. While <span class="math">\(k\)</span>-anonymity is very easy to compute, <span class="math">\(k\)</span>-map
is <a href="k-map.html#and-now-some-practice">much harder</a>. As such, <span class="math">\(\delta\)</span>-presence has very similar practical
characteristics to <span class="math">\(k\)</span>-map. Since you don't typically have access to the full
larger dataset, you can't compute <span class="math">\(\delta\)</span> exactly. You can use a pessimistic
approximation if your data is a <a href="k-map.html#exception-1-secret-sample">sample</a> of a larger dataset that you
own. You can also do the work of estimating <span class="math">\(\delta\)</span>-presence <a href="k-map.html#exception-3-using-humans">by hand</a>.</p>
<p>What about statistical approximations? Nergiz et al. proposed an interesting
method in a <a href="https://desfontain.es/PDFs/PhD/DeltaPresenceWithoutCompleteWorldKnowledge.pdf">followup paper<sup> (pdf)</sup></a>. Unfortunately, two of
its requirements make it hardly usable in practical scenarios.</p>
<ul>
<li>First, to run the algorithm, you need to "describe your beliefs about the
world" (in a statistical sense). Unless you're a statistician, this is not
something you can really do.</li>
<li>Second, computing the algorithm exactly is very expensive. The authors propose
a lot of approximations to make it tractable… But then, using them makes the
results even more uncertain.</li>
</ul>
<p>Finally, if you still want to use this algorithm, you would also likely have to
implement it yourself. I don't know of any available software that does it for
you.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Like <span class="math">\(k\)</span>-map, <span class="math">\(\delta\)</span>-presence often makes sense in theory. It's a
pity that both definitions are so difficult to use in practice! Having simpler
(and more usable) approximation algorithms would be great… Which is why I have
done some research work in that direction. And the results of this work will be
the topic of a future post! =)</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>l-diversity, because reidentification doesn't tell the whole story2018-02-19T00:00:00+01:002018-02-19T00:00:00+01:00Damien Desfontainestag:desfontain.es,2018-02-19:/privacy/l-diversity.html<p><em>l</em>-diversity is the first famous attempt at considering stronger attack models than simply reidentification attacks. Let's see how it works, and which flaws of <em>k</em>-anonymity it fixes!</p><p><strong>Where</strong> does privacy risk come from, when releasing anonymized data? What
exactly can go wrong? Sweeney provided the first obvious answer: privacy risk
appears when you can <em>reidentify</em> a record. That makes sense, and led to the
definition of <a href="k-anonymity.html"><span class="math">\(k\)</span>-anonymity</a>. If you think your data is
anonymous, but somebody pinpoints a record and figures out who it is, clearly,
there's a problem.</p>
<p>But as researchers discovered shortly after, it's sometimes not enough. An
attacker might figure out private information about someone, <em>without
reidentifying their record</em>. Even if the dataset is <span class="math">\(k\)</span>-anonymous. How does this
magic work? First, we'll show how it works with an example, then we'll describe
the natural solution: <span class="math">\(l\)</span>-diversity.</p>
<div class="toc">
<ul>
<li><a href="#an-example">An example</a></li>
<li><a href="#l-diversity-the-obvious-fix">\(l\)-diversity: the obvious fix</a></li>
<li><a href="#wait-that-seems-too-easy">Wait, that seems too easy</a><ul>
<li><a href="#uncertain-information-can-still-be-sensitive">Uncertain information can still be sensitive</a></li>
<li><a href="#probabilistic-information-gain">Probabilistic information gain</a></li>
</ul>
</li>
<li><a href="#l-diversity-in-practice">\(l\)-diversity in practice</a><ul>
<li><a href="#the-good-news-implementation">The good news: implementation</a></li>
<li><a href="#the-bad-news-policy">The bad news: policy</a></li>
<li><a href="#the-other-bad-news-utility-loss">The other bad news: utility loss</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
</div>
<p></p>
<h1 id="an-example">An example</h1>
<p>Suppose you have the following database, which contains everyone in the country.</p>
<table>
<thead>
<tr>
<th style="text-align: center;">name</th>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
<th style="text-align: center;">diagnostic</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">Alice</td>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">34</td>
<td style="text-align: center;">Common flu</td>
</tr>
<tr>
<td style="text-align: center;">Bob</td>
<td style="text-align: center;">4212</td>
<td style="text-align: center;">39</td>
<td style="text-align: center;">Healthy</td>
</tr>
<tr>
<td style="text-align: center;">Camille</td>
<td style="text-align: center;">4732</td>
<td style="text-align: center;">39</td>
<td style="text-align: center;">Otitis</td>
</tr>
<tr>
<td style="text-align: center;">Dan</td>
<td style="text-align: center;">4743</td>
<td style="text-align: center;">23</td>
<td style="text-align: center;">Otitis</td>
</tr>
</tbody>
</table>
<p>(It's a rather small country.)</p>
<p>Now, you want to release an anonymized version of this database, for research
purposes. Following the <a href="k-anonymity.html"><span class="math">\(k\)</span>-anonymity</a> method, you start
by wondering which columns are identifying. Let's see.</p>
<ul>
<li><strong>name</strong> is obviously identifying: we have to remove it completely.</li>
<li><strong>ZIP code</strong> and <strong>age</strong> are <em>quasi-identifiers</em>. They can help you identify
someone, but reducing their precision might prevent this.</li>
<li><strong>diagnostic</strong> is <em>sensitive</em>, but since it's typically secret, we can
consider it non-identifying<sup id="fnref:identifying"><a class="footnote-ref" href="#fn:identifying">1</a></sup>.</li>
</ul>
<p>So, let's make this data <span class="math">\(k\)</span>-anonymous. Here, <span class="math">\(k=2\)</span>, because it's a small country.</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
<th style="text-align: center;">diagnostic</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">421*</td>
<td style="text-align: center;">30-39</td>
<td style="text-align: center;">Common flu</td>
</tr>
<tr>
<td style="text-align: center;">421*</td>
<td style="text-align: center;">30-39</td>
<td style="text-align: center;">Healthy</td>
</tr>
<tr>
<td style="text-align: center;">47**</td>
<td style="text-align: center;">20-39</td>
<td style="text-align: center;">Otitis</td>
</tr>
<tr>
<td style="text-align: center;">47**</td>
<td style="text-align: center;">20-39</td>
<td style="text-align: center;">Otitis</td>
</tr>
</tbody>
</table>
<p>Since all combinations of ZIP code & age appear twice, this data is
<span class="math">\(2\)</span>-anonymous. But now, suppose an attacker wants to find Camille's diagnostic.
The attacker knows that Camille has ZIP code 4732 and age 23. They can easily
figure out that Camille's record is the third or fourth one, but cannot know
which.</p>
<p>And there's the obvious problem: <em>both records have the same diagnostic</em>. So the
attacker can deduce that Camille's diagnostic is "Otitis". Even without knowing
which record is Camille's! <span class="math">\(k\)</span>-anonymity wasn't enough to protect Camille's
private information.</p>
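<p>This <em>homogeneity attack</em> is easy to sketch in code. Here's a minimal illustration (the record layout and function name are made up for this post): if every released record matching the target's generalized quasi-identifiers shares a single sensitive value, the attacker learns it.</p>

```python
def infer_sensitive(released_rows, target_quasi_ids):
    """Homogeneity attack on a k-anonymous release: if all records matching
    the target's generalized quasi-identifiers share one sensitive value,
    the attacker learns it without reidentifying any record."""
    matches = {diag for quasi_ids, diag in released_rows
               if quasi_ids == target_quasi_ids}
    return matches.pop() if len(matches) == 1 else None

# The 2-anonymous table above, as (quasi-identifiers, diagnostic) pairs.
released = [
    (("421*", "30-39"), "Common flu"),
    (("421*", "30-39"), "Healthy"),
    (("47**", "20-39"), "Otitis"),
    (("47**", "20-39"), "Otitis"),
]

# Camille's bucket is homogeneous: her diagnostic leaks.
print(infer_sensitive(released, ("47**", "20-39")))  # Otitis
# Alice's bucket is not: the attacker learns nothing definite.
print(infer_sensitive(released, ("421*", "30-39")))  # None
```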
<h1 id="l-diversity-the-obvious-fix"><span class="math">\(l\)</span>-diversity: the obvious fix</h1>
<p>So. Let's say that all users with the same quasi-identifier tuple are in the
same <em>bucket</em>. If all sensitive values are the same within a bucket, we might
leak private information. The obvious solution? Imposing some <em>diversity</em> in the
sensitive values associated to the same (generalized) tuple. </p>
<p>This is <span class="math">\(l\)</span>-diversity, as <a href="https://desfontain.es/PDFs/PhD/LDiversityPrivacyBeyondKAnonymity.pdf">introduced<sup> (pdf)</sup></a> in 2006 by
Machanavajjhala et al. It builds on the definition of <span class="math">\(k\)</span>-anonymity.
<span class="math">\(l\)</span>-diversity states that each bucket must have at least <span class="math">\(l\)</span> <em>distinct</em>
sensitive values. Of course, each bucket should contain at least <span class="math">\(l\)</span> users:
<span class="math">\(l\)</span>-diversity implies <span class="math">\(l\)</span>-anonymity.</p>
<p>Let's try to make the data above <span class="math">\(2\)</span>-diverse.</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
<th style="text-align: center;">diagnostic</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">4***</td>
<td style="text-align: center;">20-39</td>
<td style="text-align: center;">Common flu</td>
</tr>
<tr>
<td style="text-align: center;">4***</td>
<td style="text-align: center;">39</td>
<td style="text-align: center;">Healthy</td>
</tr>
<tr>
<td style="text-align: center;">4***</td>
<td style="text-align: center;">39</td>
<td style="text-align: center;">Otitis</td>
</tr>
<tr>
<td style="text-align: center;">4***</td>
<td style="text-align: center;">20-39</td>
<td style="text-align: center;">Otitis</td>
</tr>
</tbody>
</table>
<p>Now, consider our attacker from earlier, targeting Camille (third row). Like
before, the attacker cannot know which record corresponds to Camille. But this
time, they also can't tell whether Camille was healthy or suffered from otitis.
The sensitive value stays private.</p>
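<p>Checking <span class="math">\(l\)</span>-diversity is mechanical once records are grouped into buckets. Here's a minimal sketch (the record layout and function name are illustrative, not from any library):</p>

```python
from collections import defaultdict

def is_l_diverse(records, l):
    """True if every bucket (records sharing the same generalized
    quasi-identifier tuple) has at least l distinct sensitive values."""
    buckets = defaultdict(set)
    for quasi_ids, sensitive in records:
        buckets[quasi_ids].add(sensitive)
    return all(len(values) >= l for values in buckets.values())

# The table above, as (quasi-identifiers, diagnostic) pairs.
records = [
    (("4***", "20-39"), "Common flu"),
    (("4***", "39"), "Healthy"),
    (("4***", "39"), "Otitis"),
    (("4***", "20-39"), "Otitis"),
]
print(is_l_diverse(records, 2))  # True: each bucket has 2 distinct diagnostics
```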
<h1 id="wait-that-seems-too-easy">Wait, that seems too easy</h1>
<p>You might have noticed it immediately: the definition of <span class="math">\(l\)</span>-diversity has some
flaws. Let's list two of them.</p>
<h4 id="uncertain-information-can-still-be-sensitive">Uncertain information can still be sensitive</h4>
<p>What's the key idea behind <span class="math">\(l\)</span>-diversity? If the attacker has uncertainty over
the sensitive value, then we avoid leaking private info. But consider the
following database, which satisfies <span class="math">\(2\)</span>-diversity:</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
<th style="text-align: center;">diagnostic</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">42**</td>
<td style="text-align: center;">20-29</td>
<td style="text-align: center;">AIDS</td>
</tr>
<tr>
<td style="text-align: center;">42**</td>
<td style="text-align: center;">20-29</td>
<td style="text-align: center;">Hepatitis B</td>
</tr>
<tr>
<td style="text-align: center;">17**</td>
<td style="text-align: center;">30-39</td>
<td style="text-align: center;">Otitis</td>
</tr>
<tr>
<td style="text-align: center;">17**</td>
<td style="text-align: center;">30-39</td>
<td style="text-align: center;">Healthy</td>
</tr>
</tbody>
</table>
<p>Suppose the attacker knows that their target has ZIP code 4235 and age 25. The
target's record is one of the first two rows. The attacker can learn that their
target either has AIDS, or hepatitis B. They can't be sure <em>which one</em> is the
correct one… But they can infer that their target has a sexually transmitted
infection. This information, of course, might be embarrassing for the target!</p>
<p>How to fix this? One solution could be to group diagnostics into <em>categories</em>,
like <a href="https://www.findacode.com/code-set.php?set=ICD10CM">diagnostic code families</a>. Then, we can require that each bucket has
<span class="math">\(l\)</span> different categories of diagnostics. This way, the attacker can't
distinguish between STDs, external injuries, respiratory problems, etc.</p>
<p>Unfortunately, choosing these categories is a complicated policy question. There
are many possible combinations of sensitive values. Making sure that none of
them is sensitive sounds like a laborious task…</p>
<h4 id="probabilistic-information-gain">Probabilistic information gain</h4>
<p>Consider the following database, again satisfying <span class="math">\(2\)</span>-diversity:</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
<th style="text-align: center;">diagnostic</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">42**</td>
<td style="text-align: center;">20-29</td>
<td style="text-align: center;">Lupus</td>
</tr>
<tr>
<td style="text-align: center;">42**</td>
<td style="text-align: center;">20-29</td>
<td style="text-align: center;">Lupus</td>
</tr>
<tr>
<td style="text-align: center;">42**</td>
<td style="text-align: center;">20-29</td>
<td style="text-align: center;">Lupus</td>
</tr>
<tr>
<td style="text-align: center;">42**</td>
<td style="text-align: center;">20-29</td>
<td style="text-align: center;">Lupus</td>
</tr>
<tr>
<td style="text-align: center;">42**</td>
<td style="text-align: center;">20-29</td>
<td style="text-align: center;">Lupus</td>
</tr>
<tr>
<td style="text-align: center;">42**</td>
<td style="text-align: center;">20-29</td>
<td style="text-align: center;">Lupus</td>
</tr>
<tr>
<td style="text-align: center;">42**</td>
<td style="text-align: center;">20-29</td>
<td style="text-align: center;">Healthy</td>
</tr>
<tr>
<td style="text-align: center;">42**</td>
<td style="text-align: center;">20-29</td>
<td style="text-align: center;">Lupus</td>
</tr>
<tr>
<td style="text-align: center;">42**</td>
<td style="text-align: center;">20-29</td>
<td style="text-align: center;">Lupus</td>
</tr>
<tr>
<td style="text-align: center;">42**</td>
<td style="text-align: center;">20-29</td>
<td style="text-align: center;">Lupus</td>
</tr>
</tbody>
</table>
<p>Consider the same attacker as before: targeting someone with ZIP code 4235 and
age 25. They can't know their target's diagnostic for <em>certain</em>. But they can
get a strong <em>suspicion</em> that the target has lupus: 9 out of 10 records share
this diagnostic! An insurance company might increase someone's premium because
of a suspected pre-existing condition. Isn't that also a privacy issue?</p>
<p>How do we protect against this type of probabilistic information gain? Requiring
that sensitive attributes are <em>diverse</em> is not enough. We need to also require
that their <em>distribution</em> is roughly the same as in the rest of the data. If 40%
of the records are "healthy" in the overall data, then each bucket must also
have roughly 40% of "healthy" records. This way, the attacker's knowledge can't
change <em>too much</em> from the baseline. This is the core idea behind another
definition named <span class="math">\(t\)</span>-closeness. I won't go into details here, but you can read
about it on <a href="https://en.wikipedia.org/wiki/T-closeness">Wikipedia</a> or in the <a href="http://desfontain.es/PDFs/PhD/TClosenessPrivacyBeyondKAnonymityAndLDiversity.pdf">original paper<sup>
(pdf)</sup></a> that introduced this idea.</p>
<p><small>Note: this idea is also relevant if the sensitive attribute is numeric,
like salary values. A yearly salary of €20,000 is very similar to €20,100:
applying <span class="math">\(l\)</span>-diversity doesn't make sense. By contrast, <span class="math">\(t\)</span>-closeness can
compare distributions in a more meaningful way.</small></p>
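<p>To make the "roughly the same distribution" idea concrete, here's a simplified sketch. The original <span class="math">\(t\)</span>-closeness paper measures distance with the Earth Mover's Distance; for simplicity, this toy example uses total variation distance instead, and the data is made up to mirror the lupus table above.</p>

```python
from collections import Counter

def distribution(values):
    """Empirical distribution of a list of sensitive values."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two distributions,
    given as value -> probability dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

overall = distribution(["Healthy"] * 4 + ["Lupus"] * 6)    # full dataset
skewed = distribution(["Lupus"] * 9 + ["Healthy"])         # bucket from the table
balanced = distribution(["Healthy"] * 2 + ["Lupus"] * 3)   # well-mixed bucket

print(total_variation(overall, skewed))    # ≈ 0.3: far from the baseline
print(total_variation(overall, balanced))  # 0.0: matches the baseline
```

<p>A <span class="math">\(t\)</span>-closeness-style requirement would reject the skewed bucket whenever the threshold is below 0.3.</p>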
<h1 id="l-diversity-in-practice"><span class="math">\(l\)</span>-diversity in practice</h1>
<p>OK, so even with these flaws, how easy is it to use <span class="math">\(l\)</span>-diversity in practice?</p>
<h4 id="the-good-news-implementation">The good news: implementation</h4>
<p>From an algorithmic perspective, <span class="math">\(l\)</span>-diversity is very similar to <span class="math">\(k\)</span>-anonymity.
The basic blocks are the same: <a href="k-anonymity.html#building-block-1-generalization">generalization</a> and <a href="k-anonymity.html#building-block-2-suppression">suppression</a>. Finding the
best strategy is also done using trial-and-error heuristics. The approach used
for <span class="math">\(k\)</span>-anonymity is straightforward to adapt to <span class="math">\(l\)</span>-diversity.</p>
<p>Unsurprisingly, some software is available to implement it in practice. I won't
list them all here, but most <a href="k-anonymity.html#in-practice">options</a> introduced in my article about
<span class="math">\(k\)</span>-anonymity can also be used for <span class="math">\(l\)</span>-diversity.</p>
<h4 id="the-bad-news-policy">The bad news: policy</h4>
<p>Choosing the right value of <span class="math">\(k\)</span> for <span class="math">\(k\)</span>-anonymity is <a href="k-anonymity.html#how-to-choose-k">difficult</a>, but
<span class="math">\(l\)</span>-diversity is certainly not better. No official guideline or regulation will
help you choose the value of <span class="math">\(l\)</span>. And it's at least as hard to quantify the
"amount of privacy" obtained with a given choice of parameter.</p>
<p>Worse, the flaws described before mean that the question is even subtler than
that. Should we classify the sensitive values into categories? Impose that
sensitive values don't appear too often? If so, there are even more parameters
that one has to choose, and no good way to choose them.</p>
<h4 id="the-other-bad-news-utility-loss">The other bad news: utility loss</h4>
<p><span class="math">\(l\)</span>-diversity, despite its flaws, is strictly stronger than <span class="math">\(k\)</span>-anonymity. And
it should be relatively easy to use in practice, once we've chosen a policy…
Despite this, it is hardly ever used. A health data de-identification specialist
once told me that they only saw it in the wild a handful of times. By contrast,
using <span class="math">\(k\)</span>-anonymity is very common.</p>
<p>Why is that? I see two possible reasons.</p>
<p>First, the utility loss of <span class="math">\(l\)</span>-diversity is too significant compared to
<span class="math">\(k\)</span>-anonymity. A <a href="http://desfontain.es/PDFs/PhD/TheCostOfPrivacyDestructionOfDataMiningUtilityInAnonymizedDataPublishing.pdf">study<sup> (pdf)</sup></a> compared the utility loss of
different anonymization strategies. It found that applying <span class="math">\(3\)</span>-diversity to a dataset
was worse than using <span class="math">\(100\)</span>-anonymity! This particular result was for a
classification accuracy task, but you get the idea<sup id="fnref:examples"><a class="footnote-ref" href="#fn:examples">2</a></sup>.</p>
<p>Second, the privacy gains are not clear. Especially considering the flaws we
described above… And fixing those flaws hurts utility even more. With
<span class="math">\(t\)</span>-closeness, we hinder the ability to link demographic features with
diagnostics. But this type of analysis is exactly what healthcare researchers
want to do! If the privacy definition goes completely against this idea, it
won't get much love.</p>
<h1 id="conclusion">Conclusion</h1>
<p><span class="math">\(l\)</span>-diversity isn't a definition that directly addresses a particular threat
model. Rather, it's a "fix" for one of <span class="math">\(k\)</span>-anonymity's most obvious flaws. But
in security, simply patching bugs one after the other isn't a great defense
mechanism... For privacy definitions too, fixing only one attack doesn't get you
very far.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:identifying">
<p>Which might be over-optimistic: some people make their health
issues public on social media, or the press can find out and publicize the
medical history of personalities. If you're actually doing this to anonymize a
real dataset, you should be more careful when classifying your columns. Here,
we're going with a simple assumption for the sake of simplicity. <a class="footnote-backref" href="#fnref:identifying" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:examples">
<p>It's hard to find many examples: negative results hardly ever get published… <a class="footnote-backref" href="#fnref:examples" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Book review: Crash Override2018-01-10T00:00:00+01:002018-01-10T00:00:00+01:00Damien Desfontainestag:desfontain.es,2018-01-10:/privacy/crash-override.html<p>A short review of <em>Crash Override</em>, by Zoë Quinn. tl;dr: you should read it, especially if you're building tech products or working in tech policy.</p><p><strong>Over the holidays</strong>, I read <a href="https://en.wikipedia.org/wiki/Zo%C3%AB_Quinn">Zoë Quinn</a>'s book, <em>Crash Override</em>. Zoë
Quinn is an independent games developer. After a gigantic harassment campaign
known as GamerGate targeted her, she became an activist against online hate. She
then founded an association to help other victims, the <a href="http://www.crashoverridenetwork.com/">Crash Override
Network</a>.
In this book, she tells the story of this huge campaign, and gives her thoughts
on how to prevent similar horrors from happening.</p>
<p><center>
<a href="https://www.amazon.com/dp/1610398084/"><img alt="Crash Override cover" src="https://desfontain.es/privacy/images/crash-override.jpg"></a>
</center></p>
<p>This book is eye-opening, well-written, and inspiring. Because of the difficult
topic, it's not an easy read. She tells her story in a very personal way,
describing what she went through when thousands of trolls were harassing her and
her close ones. The book is worth a read just for this testimony. It slaps you
in the face, reminding you that what happens online is as real as what happens
in the physical world. It gives you a healthy dose of empathy towards people you
interact with on the Internet. Even if you're not a harasser, it's so easy to
forget the actual person behind the nickname or the avatar…</p>
<p>This intense dose of empathy is critically valuable if you build tech products,
or work in tech policy. Quinn describes everything that went wrong not only with
the humans, but also with the technology. It's horrifying and infuriating.
Popular tech products and social networks simply suck at dealing with abuse and
harassment on their platforms. Quinn is doing an excellent job at analyzing
their flaws, and detailing what can be explained by cluelessness and what is the
sign of truly rotten ideologies. If you're working in tech, you really really
<em>really</em> should learn about these issues. Go read this book.</p>
<p>It's chilling to read Quinn's descriptions of the mass movements behind
GamerGate. It serves as a good reminder of how powerful and devastating group
dynamics can be. Some of the things harassers did sound completely unreal. Quinn
has to explain her situation to many people, and they often don't understand or
accept it: it sounds so <em>ridiculous</em>. It's difficult to believe actual humans
would harass people this way. Especially people they've never met, in such a
violent and persistent way.</p>
<p>But online trolls don't see their target as human, either. Instead, victims are
seen as an abstract concept of "evil". Abusers think that everything they do to
their victims is well-deserved. And the more they fight evil, the more they feel like
"heroes", fighters on the good side in the grand scheme of things. This is
nothing new: anti-immigration rhetoric is a classical example of this
phenomenon. Nonetheless, GamerGate really was the symptom of something
profoundly rotten in parts of society. Victims realized that, of course, but
nobody listened to them…</p>
<p>After talking about GamerGate, Quinn explains how to efficiently fight online
hate. Given her personal experience and her work as an activist, she's uniquely
qualified. She developed a set of principles, and offers concrete advice for
victims, bystanders, and technologists alike.</p>
<p>The technical advice for victims (or for people potentially at risk, which
really means everyone) is sound, but doesn't get into much detail. If that's
what you're looking for, consider reading <a href="https://www.amazon.com/dp/B013HA1V4S/">The Smart Girl's Guide to
Privacy</a> (by <a href="https://en.wikipedia.org/wiki/Violet_Blue">Violet Blue</a>). You can also check out the
<a href="http://www.crashoverridenetwork.com/resources.html">resources</a> on the Crash Override Network's website.</p>
<p>The "human" advice, on the other hand, is quite complete and excellent.
Recommendations are concrete and sound (typically not "just go offline,
stupid"). Many common reactions, both for victims and bystanders, are
counter-productive. Quinn does an excellent job at detailing these and
explaining what to do instead.</p>
<p>The end of the book surprised me. Quinn writes a whole section about empathy
towards harassers and abusers. She explains that there was a time where <em>she</em>
was an angry troll, posting nasty comments to people she didn't know online.
Like her own abusers, she was doing this for very normal and human reasons.
Status, positive feedback from your peers, technical prowess, acceptance within
a social group… Even if the result is incredibly evil, this is very <em>normal</em>
from a psychological perspective. Especially now, with all the gamification (and
financial rewards!) of social networks.</p>
<p>This last part of her testimony really resonated with me. I was the same when I
was younger. I browsed 4chan regularly. I was only a <a href="https://en.wikipedia.org/wiki/Lurker">lurker</a>: my
English was bad, so I was afraid of people mocking me. Nonetheless, I found
comfort in the nihilism, the irreverence, and the sense that it was okay to be
different and weird. I almost never talked to anyone, but I still felt like
these were people who could understand me. All the racism, sexism, general
awfulness… Even that felt okay. It was <em>obviously</em> ironic<sup id="fnref:irony"><a class="footnote-ref" href="#fn:irony">1</a></sup>, and there was
also a lot of self-deprecating humor. But mostly, laughing at everything just
came with the nihilism. Putting groups of people I felt like I didn't understand
(especially women) in "enemy" boxes was very, very easy.</p>
<p>Then, my life got better and I progressively stopped going there. But reading
Quinn's own account of being an online troll made me wonder: would I have taken
part in GamerGate, had it been earlier? If I had had just a little bit more
bottled-up anger and awkwardness? If my English had been better? I certainly
wouldn't have done the right thing — even as a shy lurker, I didn't do anything
right back then. This was a harsh (and unexpected) realization. For me, it was a
valuable take-away. If you partly recognize yourself in what I clumsily tried to
describe there, I would also advise reading this book.</p>
<p>Anyway, she ends the book saying that more abuse or violence towards harassers
isn't going to make the hate go away. Instead, Quinn encourages speaking up in
a non-confrontational way, listening and empathizing… After everything she went
through, this is nothing short of inspiring!</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:irony">
<p>It was not. I was extremely dumb, and extremely privileged, so I don't
think I realized that. Or, more probably, I chose not to. Not seeing how wrong
it was, not speaking up… This enabled real and horrific violence. I regret
this time a lot, and feel ashamed when thinking of it, but there is no excuse
for any of it. <a class="footnote-backref" href="#fnref:irony" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>k-map, the weird cousin of k-anonymity2017-10-10T00:00:00+02:002018-04-26T00:00:00+02:00Damien Desfontainestag:desfontain.es,2017-10-10:/privacy/k-map.html<p>Weakening <em>k</em>-anonymity, really? This sounds weird, but this can actually be quite reasonable. Let's learn why!</p><p><strong>Suppose</strong> that you're a doctor who studies human sexual behavior. You want to
run a study with as many patients as you can, but you don't find a lot of
volunteers. You only end up with about 40 subjects.</p>
<p>After you've run your study and collected data, you want to share this data with
other researchers. You look at the attributes, and deduce that ZIP code and age
are likely to be used in reidentification attacks. To share it in a safe way,
you're thinking of <a href="k-anonymity.html"><span class="math">\(k\)</span>-anonymity</a>.</p>
<p>When trying to find a strategy to obtain <span class="math">\(k\)</span>-anonymity, you find out that you
would have to lose a lot of information. For <span class="math">\(k=10\)</span>, a rather small value, you
end up with buckets like <span class="math">\(20\le age\lt 50\)</span>. That makes sense: you have only a few
people in your database, so you have to bundle together very different age
values.</p>
<p>But when you think about it, you start questioning whether you really need
<span class="math">\(k\)</span>-anonymity. Who are the attackers, in your scenario? The researchers with
whom you share the data, and possibly unknown parties if the data ever leaks.
None of these people have background information about who is in the dataset.
Thus, the attacker doesn't just have to distinguish between different records,
but to actually find the <em>real identity</em> of a record based on its information.
This attacker has significantly weaker capabilities than for <span class="math">\(k\)</span>-anonymity!</p>
<p>Let's look at two different rows in this database.</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">79</td>
</tr>
<tr>
<td style="text-align: center;">60629</td>
<td style="text-align: center;">42</td>
</tr>
</tbody>
</table>
<p>At first glance, the amount of information for these two individuals seems to be
the same. But let's take a look at the values…</p>
<ul>
<li><a href="https://www.unitedstateszipcodes.org/85535/">85535</a> corresponds to a place in Arizona named Eden. Approximately 20 people
live in this ZIP code. How many people do you think are exactly 79 years old
in this particular ZIP code? Probably only one.</li>
<li><a href="https://www.unitedstateszipcodes.org/60629/">60629</a> corresponds to a part of the Chicago metropolitan area. More than
100,000 people live there. How many of them are 42 years old? A thousand, at
least, and probably more!</li>
</ul>
<p>It seems that it would be very easy to reidentify the first row, but that we
don't have enough information to reidentify the second row. But according to
<span class="math">\(k\)</span>-anonymity, both rows might be completely unique in the dataset.</p>
<p>Obviously, <span class="math">\(k\)</span>-anonymity doesn't fit this use case. We need a different
definition: that's where <span class="math">\(k\)</span>-map comes in.</p>
<h1 id="definition">Definition</h1>
<p>Just like <a href="k-anonymity.html"><span class="math">\(k\)</span>-anonymity</a>, <span class="math">\(k\)</span>-map requires you to determine
which columns of your database are <em>quasi-identifiers</em>. This answers the
question: what can your attacker use to reidentify their target?</p>
<p>But this information alone is not enough to compute <span class="math">\(k\)</span>-map. In the example
above, we assumed that the attacker doesn't know whether their target is in the
dataset. So what are they comparing a given row with? With all other individuals
sharing the same values in a larger, sometimes implicit, dataset. For the
previous example, this could be "everybody living in the US", if you assume the
attacker has no idea who could be in your dataset. Let's call this larger
table the <em>reidentification dataset</em>.</p>
<p>Once you've picked the quasi-identifiers and the reidentification dataset, the
definition is straightforward. Your data satisfies <span class="math">\(k\)</span>-map if every combination
of values for the quasi-identifiers appears at least <span class="math">\(k\)</span> times <em>in the
reidentification dataset</em>.</p>
<p>In our example, this corresponds to counting the number of people in the US who
share the quasi-identifier values of each row in your dataset. Consider our tiny
dataset above:</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">85535</td>
<td style="text-align: center;">79</td>
</tr>
<tr>
<td style="text-align: center;">60629</td>
<td style="text-align: center;">42</td>
</tr>
</tbody>
</table>
<p>We said earlier that the values of the first row matched only one person in the
US. Thus, this dataset does not satisfy <span class="math">\(k\)</span>-map for any value of <span class="math">\(k\ge 2\)</span>.</p>
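<p>The computation itself is simple, assuming you somehow have the reidentification dataset as a list of quasi-identifier tuples. A sketch (the counts below are made up for illustration, not real census data):</p>

```python
from collections import Counter

def k_map_value(dataset, reidentification_dataset):
    """Largest k such that `dataset` satisfies k-map: each of its
    quasi-identifier tuples must appear at least k times in the
    reidentification dataset."""
    population = Counter(reidentification_dataset)
    return min(population[row] for row in dataset)

# Made-up reidentification dataset: one person matches the first row,
# three people match the second.
reidentification = [("85535", 79)] + [("60629", 42)] * 3
dataset = [("85535", 79), ("60629", 42)]
print(k_map_value(dataset, reidentification))  # 1: the first row is unique
```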
<p>How do we get a larger <span class="math">\(k\)</span>? We could generalize the first value like this:</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">85***</td>
<td style="text-align: center;">79</td>
</tr>
<tr>
<td style="text-align: center;">60629</td>
<td style="text-align: center;">42</td>
</tr>
</tbody>
</table>
<p>ZIP codes between 85000 and 85999 include the entire city of <a href="https://en.wikipedia.org/wiki/Phoenix,_Arizona">Phoenix</a>. There
are 36,000+ people between 75 and 84 years old in Phoenix, according to some
<a href="http://phoenix.areaconnect.com/statistics.htm">old stats</a>. It's probably safe to assume that there are more than 1,000 people
who match the quasi-identifiers values of the first row. We saw earlier that the
second row also matched 1,000+ people. So this generalized dataset satisfies
1000-map.</p>
<h1 id="attack-model-considerations">Attack model considerations</h1>
<p>Wait a second, why does this feel like cheating? How did we get such a large
number so easily? It comes from the generous assumptions we made in our attack
model. We assumed that the attacker had <em>zero</em> information on
their target, except that they live in the US (which is implied by the presence
of ZIP codes). And with only the information (ZIP code, age), you don't need a
lot of generalization to make each row of your dataset blend into a large crowd.</p>
<p>To make this attack model stronger, you could assume that the attacker will use
a <em>smaller</em> reidentification database. For example, suppose that the genetic
disease you're studying requires regular hospital check-ups. The attacker could
restrict their search only to people who have visited a hospital in the last
year. The number of possible "suspects" for each value tuple gets smaller, so
the <span class="math">\(k\)</span> of <span class="math">\(k\)</span>-map decreases too<sup id="fnref:generic"><a class="footnote-ref" href="#fn:generic">1</a></sup>.</p>
<p><span class="math">\(k\)</span>-map is inherently a <em>weak</em> model. So when choosing the quasi-identifiers and
reidentification dataset, you have to think hard about what an attacker could do.
If your attacker doesn't have lots of resources, it can be reasonable to assume
that they won't get more data than, say, the voter files from your state. But if
they can figure out more about your users, and you don't really know which
reidentification dataset they could use, maybe <span class="math">\(k\)</span>-anonymity is a safer
bet<sup id="fnref:safer"><a class="footnote-ref" href="#fn:safer">2</a></sup>.</p>
<h1 id="and-now-some-practice">And now, some practice</h1>
<p>OK, enough theory. Let's learn how to compute <span class="math">\(k\)</span>-map in practice, and anonymize
your datasets to make them satisfy the definition!</p>
<p>… There's one slight problem, though.</p>
<p>It's usually impossible.</p>
<p>Choosing the reidentification dataset is already a difficult exercise. Maybe you
can afford to make generous assumptions, and assume the attacker doesn't know
much. At best, you think, they'll buy voter files, or a commercial database,
which contains everyone in your state, or in the US. But… then what?</p>
<p>To compute the maximum <span class="math">\(k\)</span> such that your dataset satisfies <span class="math">\(k\)</span>-map, you would
first need to get the reidentification dataset yourself. But commercial
databases are expensive. Voter files might not be legal for you to obtain (even
though an evil attacker could break the law to get them).</p>
<p>So, most of the time, you can't actually check whether your data satisfies
<span class="math">\(k\)</span>-map. If it's impossible to check, it's also impossible to know exactly which
strategy to adopt to make your dataset satisfy the definition.</p>
<h4 id="exception-1-secret-sample">Exception 1: secret sample</h4>
<p>Suppose you're not releasing all your data, but only a <em>subset</em> (or <em>sample</em>) of
a bigger dataset that you own. Then, you can compute the <span class="math">\(k\)</span>-map value of the
sample with regard to the original, bigger dataset. In this case, choosing
<span class="math">\(k\)</span>-map over <span class="math">\(k\)</span>-anonymity is relatively safe.</p>
<p>Indeed, your original dataset is certainly <em>smaller</em> than the reidentification
dataset used by the attacker. Using the same argument as above, this means that
you will obtain a <em>lower bound</em> on the value of <span class="math">\(k\)</span>. Essentially, you're being
pessimistic, which means that you're on the safe side.</p>
<p>Even if the attacker has access to the original dataset, they won't know which
records are in the sample. So if the original dataset is secret, or if you've
chosen the sample in a secret way, <span class="math">\(k\)</span>-map is a reasonable definition to use,
and you can compute a pessimistic approximation of it.</p>
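<p>This pessimistic computation could look like the following sketch; the records and field names are invented for illustration.</p>

```python
from collections import Counter

def pessimistic_k(sample, full_dataset, quasi_identifiers):
    """Compute the k-map value of a sample against the full dataset it was
    drawn from. Any real reidentification database is at least as large as
    the full dataset, so this is a lower bound on the true k."""
    def qi(row):
        return tuple(row[col] for col in quasi_identifiers)
    counts = Counter(qi(r) for r in full_dataset)
    return min(counts[qi(r)] for r in sample)

# Invented toy data: the full dataset you own, and the subset you release.
full = [
    {"zip": "85535", "age": 79, "diagnosis": "A"},
    {"zip": "60629", "age": 42, "diagnosis": "B"},
    {"zip": "60629", "age": 42, "diagnosis": "C"},
    {"zip": "60629", "age": 42, "diagnosis": "A"},
]
sample = full[1:3]

print(pessimistic_k(sample, full, ["zip", "age"]))  # -> 3
```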
<h4 id="exception-2-representative-distribution">Exception 2: representative distribution</h4>
<p>This case is slightly different. Suppose that you can make the assumption that
your data is a <a href="http://arx.deidentifier.org/anonymization-tool/configuration/#a27"><em>representative</em></a> (or <em>unbiased</em>) sample of a larger
dataset. This might be a good approximation if you selected people (uniformly)
at random to build your dataset, or if it was gathered by a polling
organization.</p>
<p>In this case, you can compute an estimate of the <span class="math">\(k\)</span>-map value for your data,
even without the reidentification dataset. The statistical properties which
enable this, and the methods you can use, are pretty complicated: I won't
explain them in detail here. They are mentioned and compared in <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2528029/">this
paper</a>, which has references to the original versions of each of them.</p>
<h4 id="exception-3-using-humans">Exception 3: using humans</h4>
<p>For the case of our doctor earlier, if the dataset is small enough, a motivated
data owner could actually do the job of an attacker "by hand". Go through each
record, and try to map it to a real person, or estimate the chances of it being
possible. We pretty much did that in this article!</p>
<p>This is very approximate, and obviously not scalable. But for our imaginary
doctor, it might be a reasonable solution!</p>
<h4 id="implementations">Implementations</h4>
<p><a href="http://arx.deidentifier.org/">ARX</a> implements the methods from exceptions 1 and 2. Documentation for the
first one can be found <a href="http://arx.deidentifier.org/anonymization-tool/configuration/#a27">here</a>. Instructions to estimate the number of
<em>unique</em> values assuming uniformity can be found <a href="http://arx.deidentifier.org/anonymization-tool/risk-analysis/#a56">here</a>.
<a href="http://neon.vb.cbs.nl/casc/..%5Ccasc%5Cmu.htm">μ-ARGUS</a> was the first software with this feature, but I couldn't run
it on my machine, so I can't say much about it.</p>
<h1 id="conclusion">Conclusion</h1>
<p>You might wonder why I wrote an entire article on a definition that is hardly
used because of how impractical it is. In addition to the unique problems that
we talked about in this article, the limitations of <span class="math">\(k\)</span>-anonymity also apply.
It's difficult to choose <span class="math">\(k\)</span>, non-trivial to pick the quasi-identifiers, and
even trickier to model the reidentification database.</p>
<p>The definition also didn't get a lot of attention from academics. Historically,
<span class="math">\(k\)</span>-anonymity came first<sup id="fnref:history"><a class="footnote-ref" href="#fn:history">4</a></sup>. Then, people showed that <span class="math">\(k\)</span>-anonymity was
sometimes not sufficient to protect sensitive data, and tried to find <em>stronger</em>
definitions to fix it. Weaker definitions were, of course, less interesting.</p>
<p>Nonetheless, I find that it's an interesting relaxation of <span class="math">\(k\)</span>-anonymity. It
shows one of its implicit assumptions: the attacker knows that their target
belongs to the dataset. This assumption is sometimes too pessimistic: it might
be worth considering alternate definitions.</p>
<p>Choosing a privacy model is all about modeling the attacker correctly. Learning
to question implicit assumptions can only help!</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:generic">
<p>There is a generic version of this argument. Let's call your
database <span class="math">\(D\)</span>, and suppose <span class="math">\(R\)</span> and <span class="math">\(R^\prime\)</span> are two possible reidentification
databases. Suppose that <span class="math">\(R^\prime\)</span> is "larger" than <span class="math">\(R\)</span> (each element of <span class="math">\(R\)</span>
appears in <span class="math">\(R^\prime\)</span>). Then if <span class="math">\(D\)</span> satisfies <span class="math">\(k\)</span>-map with regard to <span class="math">\(R\)</span>, it
also satisfies <span class="math">\(k\)</span>-map with regard to <span class="math">\(R^\prime\)</span>. The reverse is not true. <a class="footnote-backref" href="#fnref:generic" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:safer">
<p>One simple consequence of the previous footnote is that if a dataset
<span class="math">\(D\)</span> verifies <span class="math">\(k\)</span>-anonymity, then it automatically verifies <span class="math">\(k\)</span>-map for any
reidentification dataset<sup id="fnref:assumption"><a class="footnote-ref" href="#fn:assumption">3</a></sup>. <a class="footnote-backref" href="#fnref:safer" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:assumption">
<p>I didn't say this explicitly, but the reidentification dataset is
always assumed to contain all rows from your dataset. It's usually not the
case in practice because data is messy, but it's a safe assumption. Hoping
that your attacker will just ignore some records in your data would be a bit
overly optimistic. <a class="footnote-backref" href="#fnref:assumption" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:history">
<p>Latanya Sweeney first mentioned the idea behind <span class="math">\(k\)</span>-map in
<a href="https://desfontain.es/PDFs/PhD/AchievingKAnonymityPrivacyProtectionUsingGeneralizationAndSuppression.pdf">this 2002 paper<sup> (pdf)</sup></a>, several years after the
introduction of <span class="math">\(k\)</span>-anonymity. <a class="footnote-backref" href="#fnref:history" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Book review: Twitter and Tear Gas2017-10-09T00:00:00+02:002017-10-09T00:00:00+02:00Damien Desfontainestag:desfontain.es,2017-10-09:/privacy/twitter-tear-gas.html<p>A short review of <em>Twitter and Tear Gas: The Power and Fragility of Networked Protest</em>, by Zeynep Tufekci. tl;dr: you should read it, especially if you participate in activist movements.</p><p><strong>I recently finished</strong> reading <em>Twitter and Tear Gas: The Power and Fragility
of Networked Protest</em>, by <a href="https://en.wikipedia.org/wiki/Zeynep_Tufekci">Zeynep Tufekci</a>. It's a long yet dense essay
on how modern protests work, and why they sometimes don't. Tufekci has a long
experience as an activist in many different protests around the world. She also
has a strong education in technology and in social sciences, and her work
focuses on the intersection between the two. In short, she is the perfect person
to write a book conceptualizing modern protests and their use of technology.
Unsurprisingly, the essay makes for a fascinating and enlightening read.</p>
<p><center>
<a href="https://www.twitterandteargas.org/"><img alt="Twitter and Tear Gas cover" src="https://desfontain.es/privacy/images/twitter-tear-gas.png"></a>
</center></p>
<p>Here's an example. What does it mean when many people march in the streets? It
displays power: the power to send the word out, to convince people to join, to
organize logistics. But the <em>actual march</em> isn't scary to people in power: the
<em>implications</em> are. If an organization is able to gather many people for a
march, then this movement is capable of other things. Boycotts, strikes,
fundraisers for your political opponents, influence in the media…</p>
<p>All those things <em>actually</em> cause headaches to politicians, and make change more
likely. The protest itself merely serves as a signal. A few decades ago, it was
a <em>strong</em> signal: only very powerful movements could put a large number of
people on the streets. So if you could pull off a large protest, it meant that
your movement could do all those other annoying things. Social media and
technological tools change this. With them, it's much easier to plan an event,
get the word out, and have many people rally around a cause for an afternoon.
This should be good news for protesters… Except it also means that large
protests are no longer such a show of strength. "Easier" also means "less
impressive". And the people in power have understood this.</p>
<p>Consider movements such as the anti-war demonstrations of the Bush era, Occupy,
or the more recent Women's March. Politicians were able to pretty much ignore
protesters: once everyone gets home, nothing happens. Only the most motivated of
political opponents might cause <em>actual</em> issues later on. Worse, their number is
not directly related to the size of the protest itself. So the protest can be
very impressive (especially when comparing it with historical protests), and
still not scare anyone in power.</p>
<p>I picked this particular insight to try and convince you to read the book… But
that's obviously only a tiny part of what is there. Tufekci provides simple
concepts to understand how tech interacts with social movements. It's rigorous,
detailed, and illustrated with plenty of historical examples. The author doesn't
assume you know these examples already (even for "famous" events, like civil
rights movements in the US). This is great for people like me with a limited
knowledge of history ^^</p>
<p><em>Twitter and Tear Gas</em> doesn't only evoke protests. At the intersection between
tech and social movements, there are also misinformation campaigns, online
harassment, social network policies and their consequences… Each of those is
discussed in the book, always with the same academic rigor, lively examples, and
clear writing.</p>
<p>The book is an excellent read from an intellectual perspective: it made many
ideas <em>clearer</em> and <em>simpler to understand</em> for me. This feeling is the best
indicator I know of good science! But you can also read this book as an
instruction manual. How to build "muscle" for a movement, how to orient it
towards the most efficient means of action, how to deal with misinformation and
censorship… Using the technological tools that were developed in the last few
decades.</p>
<p>Everyone working in tech could probably benefit from reading <em>Twitter and Tear
Gas</em>. If you're an activist, I'd say it's pretty much required reading. Go buy
it <a href="https://www.amazon.com/Twitter-Tear-Gas-Fragility-Networked/dp/0300215126/">there</a> or <a href="http://yalebooks.com/book/9780300215120/twitter-and-tear-gas">there</a>, or if you can't afford it, <a href="https://www.twitterandteargas.org/downloads/twitter-and-tear-gas-by-zeynep-tufekci.pdf">download it for
free</a>: it's licensed under <a href="http://technosociology.org/?p=1751">Creative Commons</a>! (This excellent model of
publishing alone is a good reason to buy the book.)</p>Biometrics: authentication or identification?2017-09-27T00:00:00+02:002017-09-27T00:00:00+02:00Damien Desfontainestag:desfontain.es,2017-09-27:/privacy/authentication-vs-identification.html<p>Know the difference. It probably won't save your life, but it can certainly avoid you saying nonsensical things on the Internet.</p><p><strong>Earlier this month</strong>, there was lots of chatter online about the new iPhone's
FaceID feature: it allows you to unlock your device just by looking at it.
Behind the scenes are some hardware and algorithms which create a 3D map of your
face, and determine whether you're the phone's rightful owner.</p>
<p>Many people seemed to not understand the difference between <em>authentication</em> and
<em>identification</em>. Both authentication and identification can use biometric data,
like facial recognition. Nonetheless, these use cases are fundamentally
different. I'll try to explain why — I hope this can enlighten the debate around
features like this a little bit.</p>
<h1 id="authentication">Authentication</h1>
<p>Authentication is what you do when you log in to some Internet service, or when
you unlock your phone. First, you <em>announce your identity</em> to the authentication
system (e.g. a log-in page or lock screen). Then, you try to <em>prove</em> to the
system that you're indeed who you pretend to be. For an Internet service,
identity can mean your e-mail address. For a phone, it's more implicit: you're
trying to prove you're the owner of the phone.</p>
<p>The attack model is the following: some evil person <em>pretends to be you</em>, and
tries to prove it to the authentication system to get access to your data. This
attacker can be of various types:</p>
<ul>
<li>an abusive partner who wants to look into your phone,</li>
<li>a scammer who wants to steal your identity,</li>
<li>a spy who wants to penetrate a company's network…</li>
</ul>
<p>Fundamentally, authentication protects against <em>unauthorized access to data</em>.</p>
<h1 id="identification">Identification</h1>
<p>Identification is trying to figure out <em>who someone is</em> based on some
characteristics they have, or data they produced. It's what the police do when
they run a fingerprint against a database of suspects. It's what privacy
researchers do when they try to show that a data release has not been <a href="k-anonymity.html">properly
anonymized</a>.</p>
<p>The attack model here is that somebody tries to <em>find your identity</em>. To
succeed, an attacker needs to have a <em>list of suspects</em>, and enough information
to <em>distinguish</em> who you are among all possibilities.</p>
<h1 id="good-authentication-vector-vs-good-identification-vector">Good authentication vector vs. good identification vector</h1>
<p>From these different attack models, a first distinction emerges.</p>
<ul>
<li>If a piece of data is <em>secret</em>, it will work well as an authentication vector.
Passwords, codes embedded in security keys, or one-time SMS codes, are classic
examples.</li>
<li>If a piece of data is <em>public</em>, or at least known to the attacker, it can work
as an identification vector. Names, dates of birth or phone numbers are good
candidates.</li>
</ul>
<p>A second distinction is on the <em>amount</em> of information present in the data.</p>
<ul>
<li>To authenticate someone, you don't always need lots of info. For example, a
4-digit PIN code is enough to get a decent security on a phone, provided only
a few retries are allowed.</li>
<li>To identify someone, you need more than this. Even if you somehow get your
hands on a database which contains everyone's PIN code, each one would
correspond to many people. A PIN code alone wouldn't be enough: you need some
context or more data to reliably identify someone. <br><br></li>
</ul>
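<p>The arithmetic behind this asymmetry is easy to check. Here is a back-of-the-envelope sketch; the figures are rough assumptions for illustration.</p>

```python
# A 4-digit PIN: small secret, yet fine for authentication.
pin_space = 10 ** 4           # 10,000 possible PINs
retries = 5                   # attempts before the device locks

# Authentication: an attacker guessing blindly rarely gets in.
print(retries / pin_space)    # 0.0005, i.e. a 0.05% chance

# Identification: even a full database of PINs barely narrows anyone down.
population = 300_000_000      # order of magnitude for the US (assumption)
print(population // pin_space)  # ~30,000 people share each PIN on average
```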
<h1 id="biometrics-for-authentication">Biometrics for authentication</h1>
<p>Biometrics seem to be blurring the line. Fingerprints are not exactly secret,
right? Your face is probably also all over social media. So how come they are
more and more used as authentication methods?</p>
<p>It turns out that the <em>secrecy</em> of authentication data is not a required
property. All we need is <em>unforgeability</em>: an attacker must not be able to
impersonate you. If a secret is well-protected, it's difficult to falsify: the
attacker can't imitate what they don't know. But biometric info can be quite
unforgeable, even if it's not technically secret. It's easy to find what
someone's face or fingerprint looks like, but it's hard to create a fake version
of it.</p>
<p>Some folks have written excellent articles on the difficulty of bypassing
biometric authentication. So, instead of diving into the details, I'll simply
recommend <a href="https://www.troyhunt.com/face-id-touch-id-pins-no-id-and-pragmatic-security/">this excellent
post</a>
from Troy Hunt's blog.</p>
<h1 id="biometrics-for-identification">Biometrics for identification</h1>
<p>Information being <em>public</em> doesn't mean that there exists a central database
containing everyone's data. This is especially true for biometric info. Most
attackers don't have access to global fingerprint or facial recognition
databases (yet)… But when they do, it definitely raises serious privacy
concerns. </p>
<p>Classic identification attacks focus on finding the person behind a pseudonym or
identifier. Identifiers can be phone numbers, e-mail addresses… Over time, you
can change pseudonyms and identifiers<sup id="fnref:changed"><a class="footnote-ref" href="#fn:changed">1</a></sup>. You can also maintain separate
identities, for example when you use a different email address for services you
don't trust.</p>
<p>Biometric identification doesn't have these nice properties. You can't change
your face or your fingerprints! And you can't use a different right thumb with
border agents of different countries, either.</p>
<p>Furthermore, you also have less <em>control</em> over your biometric information. You
can decide not to interact with a given online service if you don't trust it.
But if you're living a "normal" life in a Western city, your face will most
certainly be caught and recorded by many surveillance cameras.</p>
<p>Creating a facial recognition database is becoming simpler and cheaper. In
Russia, pro-Putin activists identified anti-government protestors using pictures
gathered from social media<sup id="fnref:findface"><a class="footnote-ref" href="#fn:findface">2</a></sup>. "Researchers" are creating algorithms to
detect sexual orientation<sup id="fnref:orientation"><a class="footnote-ref" href="#fn:orientation">3</a></sup> or gender identity<sup id="fnref:gender"><a class="footnote-ref" href="#fn:gender">4</a></sup>. They used
data from dating apps or video sharing services, and didn't ask anyone for
consent.</p>
<p>Using biometric data for identification is not inherently problematic. For
example, it helps catching violent criminals. Yet, the privacy concerns are most
definitely justified.</p>
<h1 id="are-those-really-distinct-problems">Are those really distinct problems?</h1>
<p>So, biometric identification is creepy, but biometric authentication isn't
always problematic. But wait. If people build biometric authentication systems…
How do they recognize someone's face or fingerprint if they don't store it
somewhere? Didn't the engineers behind FaceID have to build a biometric database?
Couldn't evil people use this for identification? </p>
<p>Not necessarily. For many of those tools, it is a specific design goal to <em>not</em>
make biometric identification easier. This is achieved through a series of risk
mitigation mechanisms<sup id="fnref:faceid"><a class="footnote-ref" href="#fn:faceid">5</a></sup>:</p>
<ul>
<li>The biometric data exists only on the user's phone, not in a central place.
The phone vendor doesn't need to unlock your phone! So it doesn't need this
information. The database doesn't exist in the first place.</li>
<li>The data lives in a specific piece of hardware called "Secure Enclave". This
chip encrypts and stores secrets <em>independently</em> of other parts of the phone.
Even if a hacker takes control of your phone, or a thief steals it, they can't
read the biometric data stored on it. Building a biometric database from
hacked iPhones is near-impossible.</li>
<li>Pictures taken during authentication are immediately discarded. Only the
pictures used for <em>enrollment</em> (when you set up FaceID) are stored. This way,
you know exactly what is stored on your phone, and the chip never keeps
pictures that you wouldn't want stored there.</li>
</ul>
<p>In addition, some fingerprint systems store only <em>partial</em> information on their
users. Remember how a 4-digit PIN was enough for certain authentication systems?
Similarly, partial biometric data can be enough to be a good authentication
vector. So even if data leaks, the exact info might not be enough to uniquely
identify someone.</p>
<p>Authentication is a <em>different problem</em> than identification. Thus, a system
designed for the former can also mitigate risk against the latter.</p>
<p>Does this mean we shouldn't worry about biometric authentication systems? Ha!
No.</p>
<h4 id="point-of-failure-1-the-tech">Point of failure 1: the tech</h4>
<p>I'm quite confident that Apple's new FaceID system is reasonably secure.
Zero-day vulnerabilities for iOS are worth
<a href="https://www.zerodium.com/program.html">millions</a>. That's a good sign that Apple
has a strong security team who know what they're doing.</p>
<p>But there should be a <em>lot</em> of healthy skepticism when anyone introduces a new
system like this. Data breaches happen all the time. If a biometric
authentication system is badly designed, the potential consequences are
catastrophic.</p>
<h4 id="point-of-failure-2-the-people">Point of failure 2: the people</h4>
<p>Did I convince you that authentication and identification are not the same
thing? Excellent. Will most people understand the distinction? I'm not exactly
optimistic =(</p>
<p>FaceID will probably make people more comfortable with facial recognition
itself. And if the technology gets normalized, this will lead to more
problematic uses being more easily accepted.</p>
<p>This week, I heard about future plans for the London public transportation
system. They are considering <a href="https://www.wired.co.uk/article/train-station-face-recognition-gateless-gate-technology">facial recognition</a> as a replacement for
magnetic cards containing tickets. Have your face recognized when you enter and
leave the subway, get charged later. This is an <em>identification</em> system. The
privacy implications are vastly different, and the consequences of security
incidents could be catastrophic.</p>
<p>Will people understand the difference?</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:changed">
<p>Are you thinking "wait a second, I can't change my social security
number…"? Excellent point! This is one of the many reasons why SSNs make such
terrible identifiers. <a class="footnote-backref" href="#fnref:changed" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:findface">
<p>Here's a scary
<a href="https://www.theguardian.com/technology/2016/may/17/findface-face-recognition-app-end-public-anonymity-vkontakte">article</a>
about this thing. Their success rate was pretty terrible, but this didn't stop
them. And the tech is getting better fast. <a class="footnote-backref" href="#fnref:findface" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:orientation">
<p>Here's an
<a href="http://mashable.com/2017/09/11/artificial-intelligence-ai-lgbtq-gay-straight/">article</a>
that does a good job at explaining why this is terrible science (and ethics). <a class="footnote-backref" href="#fnref:orientation" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:gender">
<p>Example <a href="https://www.theverge.com/2017/8/22/16180080/transgender-youtubers-ai-facial-recognition-dataset">press
coverage</a>.
I think I've seen good criticism of it at the time but I can't find it
anymore =( <a class="footnote-backref" href="#fnref:gender" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:faceid">
<p>From Apple's <a href="https://images.apple.com/business/docs/FaceID_Security_Guide.pdf">FaceID Security
Guide</a>
(PDF). <a class="footnote-backref" href="#fnref:faceid" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
</ol>
</div>k-anonymity, the parent of all privacy definitions2017-08-14T00:00:00+02:002017-10-01T00:00:00+02:00Damien Desfontainestag:desfontain.es,2017-08-14:/privacy/k-anonymity.html<p>How a privacy researcher proved a politician wrong, and how she created the first ever definition of anonymity in the process.</p><p><strong>In 1997</strong>, a PhD student named <a href="https://en.wikipedia.org/wiki/Latanya_Sweeney">Latanya Sweeney</a> heard about an
interesting data release. A <a href="http://www.mass.gov/anf/employee-insurance-and-retirement-benefits/oversight-agencies/gic/">health insurance organization</a> from
Massachusetts had compiled a database of hospital visits by state employees, and
had thought that giving it to researchers could encourage innovation and
scientific discovery. Of course, there were privacy considerations: allowing
researchers to look at other citizens' health records seemed pretty creepy. So
they decided to do the obvious thing, and remove all columns that indicated who
a patient was: name, phone number, full address, social security number, etc.</p>
<p>As you can probably guess, this didn't end so well. In this article, I'll
describe and analyze Sweeney's successful reidentification attack, and I'll
explain the privacy definition that Sweeney invented to prevent this type of
attack in the future: <span class="math">\(k\)</span>-anonymity.</p>
<div class="toc">
<ul>
<li><a href="#what-went-wrong">What went wrong?</a></li>
<li><a href="#how-to-prevent-this-attack">How to prevent this attack?</a><ul>
<li><a href="#definition-of-k-anonymity">Definition of \(k\)-anonymity</a><ul>
<li><a href="#what-types-of-data-are-reidentifying">What types of data are reidentifying?</a></li>
<li><a href="#how-to-choose-k">How to choose \(k\)?</a></li>
</ul>
</li>
<li><a href="#how-to-make-a-dataset-k-anonymous">How to make a dataset \(k\)-anonymous?</a><ul>
<li><a href="#building-block-1-generalization">Building block 1: generalization</a></li>
<li><a href="#two-types-of-generalization">Two types of generalization</a></li>
<li><a href="#building-block-2-suppression">Building block 2: suppression</a></li>
<li><a href="#algorithms">Algorithms</a></li>
<li><a href="#in-practice">In practice</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#how-convincing-really-is-k-anonymity">How convincing really is \(k\)-anonymity?</a></li>
</ul>
</div>
<p></p>
<h1 id="what-went-wrong">What went wrong?</h1>
<p>Some demographic information was left in the database, so researchers could
still compile useful stats: ZIP code, date of birth, and gender were all part
of the data. Sweeney realized that the claims of the Massachusetts governor,
who insisted that the privacy of state employees was respected (all identifiers
were removed!), were perhaps a little bit over-optimistic. Since the governor
himself was a state employee, Sweeney decided to do the obvious thing and
reidentify which records of the "anonymized" database were the governor's.</p>
<p>With just $20, Sweeney bought the public voter records from Massachusetts, which
had both full identifiers (names, addresses) and demographic data (ZIP code and
date of birth), and contained the governor's information. Guess how many records
matched the governor's gender, ZIP code, and date of birth inside the hospital
database? Only one. Thus, Sweeney was able to tell which prescriptions and
visits in the data were the governor's. She posted all of it to his office,
theatrically showing that the state's anonymization process wasn't as solid as it
should have been.</p>
<p>Several factors made this attack possible. Some are obvious, but not all:</p>
<ol>
<li>
<p>The hospital data contained demographic information that could be used to
distinguish between different records.</p>
</li>
<li>
<p>A secondary database was available to figure out the demographic information
about the target.</p>
</li>
<li>
<p>The target was in both datasets.</p>
</li>
<li>
<p>And the demographic information of the target (ZIP code, date of birth, and
gender) was unique within both datasets: only one record had the demographic
values of the governor.</p>
</li>
</ol>
<p>At first glance, these factors appear to be <em>necessary</em>: remove one of them and
suddenly, the attack no longer works. (Try it! It's a good mental exercise.)</p>
<h1 id="how-to-prevent-this-attack">How to prevent this attack?</h1>
<p>As per our previous remark, removing one of the factors should be enough to
prevent attacks like these. Which ones can we afford to remove, while making
sure that the data can be used for data analysis tasks?</p>
<ol>
<li>
<p>We could remove all demographic information from the data, or even all
information that might be linked to a person using auxiliary sources.
Unfortunately, this would also severely hinder the utility of the data:
correlations based on age, gender, and geographic info are very useful to
researchers!</p>
</li>
<li>
<p>Society probably <em>should</em> do something about the existence of public (or
commercially available) data sources that can be used in reidentification
attacks. However, this is a complex political issue, so a little bit out of
scope for a data owner who wants to publish or share an anonymized version of
their data — in practice, there's pretty much nothing we can do about it.</p>
</li>
<li>
<p>Again, there's not much we can do. We have no way to modify the secondary
(public) dataset. We could decrease the probability that a random target is
in our dataset by sub-sampling it, but all people in the sample would still
be at risk, so this is obviously not a satisfying solution.</p>
</li>
<li>
<p>Now, this is the interesting point. Maybe suppressing all demographic values
would render the data useless, but there might be a middle ground to make
sure that the demographic values are no longer unique in the dataset.</p>
</li>
</ol>
<p>This last suggestion is the basic idea of <span class="math">\(k\)</span>-anonymity. </p>
<h2 id="definition-of-k-anonymity">Definition of <span class="math">\(k\)</span>-anonymity</h2>
<p>A dataset is said to <em>be <span class="math">\(k\)</span>-anonymous</em> if every combination of values for
demographic columns in the dataset appears for at least <span class="math">\(k\)</span> different records.</p>
<p>For example, this dataset is <span class="math">\(2\)</span>-anonymous:</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">34</td>
</tr>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">34</td>
</tr>
<tr>
<td style="text-align: center;">1742</td>
<td style="text-align: center;">77</td>
</tr>
<tr>
<td style="text-align: center;">1742</td>
<td style="text-align: center;">77</td>
</tr>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">34</td>
</tr>
</tbody>
</table>
<p>This one isn't:</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">34</td>
</tr>
<tr>
<td style="text-align: center;">1742</td>
<td style="text-align: center;">77</td>
</tr>
<tr>
<td style="text-align: center;">1743</td>
<td style="text-align: center;">77</td>
</tr>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">34</td>
</tr>
</tbody>
</table>
<p>Notice that we need every <em>combination</em> of values to appear at least <span class="math">\(k\)</span> times.
Thus, even if each individual value of each column appears <span class="math">\(2\)</span> times in the
following dataset, it's not <span class="math">\(2\)</span>-anonymous:</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">34</td>
</tr>
<tr>
<td style="text-align: center;">1742</td>
<td style="text-align: center;">34</td>
</tr>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">77</td>
</tr>
<tr>
<td style="text-align: center;">1742</td>
<td style="text-align: center;">77</td>
</tr>
</tbody>
</table>
<p>The intuition is that when a dataset is <span class="math">\(k\)</span>-anonymous for a sufficiently large
<span class="math">\(k\)</span>, the last requirement for a successful reidentification attack is broken. An
attacker might find out the demographic information of their target using a
secondary database, but then this demographic information will be linked to <span class="math">\(k\)</span>
different individuals, so it will be impossible to tell which record belongs to their target.</p>
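<p>This definition is mechanical to check: count how often each combination of quasi-identifier values appears, and verify that no combination is too rare. Here is a minimal sketch in Python (the function and column names are illustrative, not taken from any particular tool):</p>

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k records."""
    combinations = Counter(
        tuple(record[qi] for qi in quasi_identifiers)
        for record in records
    )
    return all(count >= k for count in combinations.values())

# The 2-anonymous example table from above:
table = [
    {"zip": "4217", "age": 34},
    {"zip": "4217", "age": 34},
    {"zip": "1742", "age": 77},
    {"zip": "1742", "age": 77},
    {"zip": "4217", "age": 34},
]
print(is_k_anonymous(table, ["zip", "age"], k=2))  # True
```

<p>Running the same check on the second example table (where ZIP codes 1742 and 1743 each appear once) returns <code>False</code>, since two of the combinations are unique.</p>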
<h4 id="what-types-of-data-are-reidentifying">What types of data are reidentifying?</h4>
<p>Note that we've only talked about "demographic information", which is pretty
vague. ZIP code, age, and gender are all good candidates for reidentification
attacks, because they're public (or easily findable) information that is also
often found in sensitive datasets (especially medical ones). In general, the
data owner should consider which columns might be used by the attacker they're
concerned about.</p>
<p>These columns, not necessarily sensitive themselves but which might be used in a
reidentification attack, are called <em>quasi-identifiers</em> (or <em>QIs</em>). There is no
universal list of quasi-identifiers: it depends on the attack model. While some
data types are almost always QIs (ZIP code, age, gender…), many more depend on
the context (like timestamps, medical conditions, physical characteristics…).
The question to ask is: would the person who's trying to attack our dataset have
access to these values through public or commercially available data?</p>
<p><small> I'll try to write more about attack modeling and data classification
later. This is not as easily explainable as the various mathematical definitions
of privacy: it has lots of human components and as such, is always a bit fuzzy.
Which makes it even more interesting! :D But I digress. </small></p>
<h4 id="how-to-choose-k">How to choose <span class="math">\(k\)</span>?</h4>
<p>Short answer: ¯\_(ツ)_/¯</p>
<p>Longer answer: nobody knows. In the healthcare world, when medical data is
shared with a small number of people (typically for research purposes), <span class="math">\(k\)</span> is
often chosen between <span class="math">\(5\)</span> and <span class="math">\(15\)</span>. This choice is very arbitrary and ad hoc. To
the best of my knowledge, there is no official law or regulation which suggests
a specific value. Some universities, companies or other organizations have
official guidelines, but the vast majority don't.</p>
<p>To pick a parameter for a privacy definition, one needs to understand the link
between the parameter value and the risk of a privacy incident happening.
But this is difficult: while <span class="math">\(k\)</span>-anonymity is relatively easy to understand,
estimating risk quantitatively is extremely tricky. <small> I'm also going to
write a bit about this later on! </small></p>
<ul>
<li>Regulators don't want to include specific parameter values in laws or
guidelines, since there is no convincing argument to be made for a given
choice, and the level of risk depends on many more fuzzy parameters (how
valuable the data is, how bad a privacy incident would be, etc.).</li>
<li>Data owners don't know how to choose the parameter either, so they usually buy
the services of a privacy consultant to make this choice (and take care of the
anonymization process). This consultant doesn't know the "good" choice
either, but they usually have more practical experience with common
values in the industry for similar levels of risk.</li>
</ul>
<p><small> This is my first "real" blog post, about the most basic anonymity
definition there is, and I've already reached my second digression to say
"notice how it's actually super fuzzy and thus, complicated to apply in
practice?". Isn't privacy fun? :D </small></p>
<h2 id="how-to-make-a-dataset-k-anonymous">How to make a dataset <span class="math">\(k\)</span>-anonymous?</h2>
<p>So, suppose we picked our quasi-identifiers and <span class="math">\(k=2\)</span>. Even with such a low
value for <span class="math">\(k\)</span>, our original dataset will likely not be <span class="math">\(k\)</span>-anonymous: there will
be many records with unique combinations of quasi-identifier values.</p>
<p>The two main building blocks used to transform a dataset into a <span class="math">\(k\)</span>-anonymous
table are <em>generalization</em> and <em>suppression</em>.</p>
<h4 id="building-block-1-generalization">Building block 1: generalization</h4>
<p>Generalization is the process of making a quasi-identifier value less precise,
so that records with different values are transformed (or <em>generalized</em>) into
records that share the same values. Consider the records in this table:</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">34</td>
</tr>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">39</td>
</tr>
<tr>
<td style="text-align: center;">1742</td>
<td style="text-align: center;">75</td>
</tr>
<tr>
<td style="text-align: center;">1691</td>
<td style="text-align: center;">77</td>
</tr>
</tbody>
</table>
<p>The numerical values of these records can be transformed into <em>numerical
ranges</em>, so that the resulting table verifies <span class="math">\(2\)</span>-anonymity:</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">30-39</td>
</tr>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">30-39</td>
</tr>
<tr>
<td style="text-align: center;">1***</td>
<td style="text-align: center;">75-79</td>
</tr>
<tr>
<td style="text-align: center;">1***</td>
<td style="text-align: center;">75-79</td>
</tr>
</tbody>
</table>
<p>The idea of generalization is to make demographic information more imprecise to
satisfy our privacy requirements, but still allow useful data analysis to be
done. In our example, changing precise ages into age ranges is probably enough
to analyze whether a disease affects young or old people disproportionately.</p>
<p>Transforming a numerical value into a range is one of the most typical ways of
performing generalization. Other ways include removing a value entirely (e.g.
transforming a gender value into "gender unknown"), or using a <em>generalization
hierarchy</em> (e.g. transforming an <a href="https://en.wikipedia.org/wiki/ICD-10">ICD-10 diagnosis code</a> into a
truncated code, or the corresponding <a href="https://en.wikipedia.org/wiki/ICD-10_Chapter_I:_Certain_infectious_and_parasitic_diseases">block</a>).</p>
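<p>As a sketch, the two transformations used above (exact ages into ranges, ZIP codes into truncated codes) can be written as simple helper functions. The range width and the number of digits kept are illustrative choices, not part of the definition:</p>

```python
def generalize_age(age, width=5):
    """Replace an exact age with a range of the given width,
    e.g. 77 becomes "75-79" with the default width of 5."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code, kept_digits=1):
    """Mask all but the first digits of a ZIP code,
    e.g. "1742" becomes "1***"."""
    return zip_code[:kept_digits] + "*" * (len(zip_code) - kept_digits)

print(generalize_age(77))      # "75-79"
print(generalize_zip("1742"))  # "1***"
```

<p>Choosing the width and the number of masked digits is exactly the generalization strategy question: wider buckets make <span class="math">\(k\)</span>-anonymity easier to reach, at the cost of utility.</p>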
<h4 id="two-types-of-generalization">Two types of generalization</h4>
<p>Generalization strategies can be classified into two categories: <em>global</em> and
<em>local</em>. Consider the following table:</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">34</td>
</tr>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">34</td>
</tr>
<tr>
<td style="text-align: center;">1742</td>
<td style="text-align: center;">34</td>
</tr>
<tr>
<td style="text-align: center;">1742</td>
<td style="text-align: center;">31</td>
</tr>
</tbody>
</table>
<p>Global generalization means that a given value for a given column will <em>always</em>
be generalized in the same way: if you decide to transform age 34 into age range
30-34 for one record, all records that have ages between 30 and 34 will be
transformed into this fixed range of 30-34. Using global generalization, the
example could be transformed into:</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">30-34</td>
</tr>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">30-34</td>
</tr>
<tr>
<td style="text-align: center;">1742</td>
<td style="text-align: center;">30-34</td>
</tr>
<tr>
<td style="text-align: center;">1742</td>
<td style="text-align: center;">30-34</td>
</tr>
</tbody>
</table>
<p>Local generalization doesn't have that constraint: it allows you to pick a
different generalization for each record. A value of 34 in the age column might
stay untouched for one record, and be generalized for another: </p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">34</td>
</tr>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">34</td>
</tr>
<tr>
<td style="text-align: center;">1742</td>
<td style="text-align: center;">30-34</td>
</tr>
<tr>
<td style="text-align: center;">1742</td>
<td style="text-align: center;">30-34</td>
</tr>
</tbody>
</table>
<p>Global generalization usually makes it easier to do data analysis on generalized
data, while local generalization lets you keep more utility at the cost of a
slightly more complex data representation.</p>
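<p>To make the distinction concrete, here is a toy sketch of a local strategy on the table above: records are grouped by ZIP code, and ages are only widened into a range inside groups where the exact values aren't already shared by at least <span class="math">\(k\)</span> records. (The grouping rule and the 5-year bucket are illustrative simplifications; real local algorithms like Mondrian are considerably smarter.)</p>

```python
from collections import defaultdict

def local_generalize(records, k=2):
    """Toy local generalization: group records by ZIP code, and only
    widen ages into a range in groups where the exact ages are not
    already shared by at least k records."""
    groups = defaultdict(list)
    for zip_code, age in records:
        groups[zip_code].append(age)
    result = []
    for zip_code, ages in groups.items():
        if len(set(ages)) == 1 and len(ages) >= k:
            # All ages identical: no generalization needed here.
            result += [(zip_code, str(age)) for age in ages]
        else:
            # Assumes the group's ages fit in one 5-year bucket.
            low = (min(ages) // 5) * 5
            result += [(zip_code, f"{low}-{low + 4}")] * len(ages)
    return result

table = [("4217", 34), ("4217", 34), ("1742", 34), ("1742", 31)]
print(local_generalize(table))
# [('4217', '34'), ('4217', '34'), ('1742', '30-34'), ('1742', '30-34')]
```

<p>Note how the value 34 stays exact in the first group but is generalized in the second, which is precisely what a global strategy cannot do.</p>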
<h4 id="building-block-2-suppression">Building block 2: suppression</h4>
<p>In our previous example, our records had relatively "close" demographic values,
which allowed generalization to keep reasonably accurate information while still
ensuring <span class="math">\(2\)</span>-anonymity. What if the table is instead:</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">34</td>
</tr>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">39</td>
</tr>
<tr>
<td style="text-align: center;">1742</td>
<td style="text-align: center;">75</td>
</tr>
<tr>
<td style="text-align: center;">1691</td>
<td style="text-align: center;">77</td>
</tr>
<tr>
<td style="text-align: center;">9755</td>
<td style="text-align: center;">13</td>
</tr>
</tbody>
</table>
<p>The first four records can be grouped in two pairs as above, but the last record
is an outlier. Grouping it with one of the pairs above would mean having very
large ranges of values (age between 10 and 39, or ZIP code being completely
removed), which would significantly reduce the utility of the resulting data. A
simple solution to deal with such outlier values is to remove them from
the data. Using both generalization and suppression on this example could lead
to the same <span class="math">\(2\)</span>-anonymous table as before:</p>
<table>
<thead>
<tr>
<th style="text-align: center;">ZIP code</th>
<th style="text-align: center;">age</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">30-39</td>
</tr>
<tr>
<td style="text-align: center;">4217</td>
<td style="text-align: center;">30-39</td>
</tr>
<tr>
<td style="text-align: center;">1000-1999</td>
<td style="text-align: center;">75-79</td>
</tr>
<tr>
<td style="text-align: center;">1000-1999</td>
<td style="text-align: center;">75-79</td>
</tr>
</tbody>
</table>
<p>Using this method, there are usually strictly fewer records in the transformed
table than in the original. On large datasets, allowing a small percentage of
suppressed records typically allows the result to be <span class="math">\(k\)</span>-anonymous without
requiring too much generalization.</p>
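<p>Putting the two building blocks together, a naive global strategy is: generalize every record first, then suppress those whose generalized combination is still rarer than <span class="math">\(k\)</span>. A minimal sketch on the table above (the 10-year age buckets and single-digit ZIP prefixes are illustrative choices, not the exact ranges used in the tables):</p>

```python
from collections import Counter

def generalize(record):
    """Illustrative global generalization: 10-year age ranges,
    ZIP codes masked down to their first digit."""
    low = (record["age"] // 10) * 10
    return (record["zip"][0] + "***", f"{low}-{low + 9}")

def anonymize(records, k):
    """Generalize all records, then suppress those whose generalized
    combination still appears fewer than k times."""
    generalized = [generalize(r) for r in records]
    counts = Counter(generalized)
    return [g for g in generalized if counts[g] >= k]

table = [
    {"zip": "4217", "age": 34},
    {"zip": "4217", "age": 39},
    {"zip": "1742", "age": 75},
    {"zip": "1691", "age": 77},
    {"zip": "9755", "age": 13},  # the outlier ends up suppressed
]
for row in anonymize(table, k=2):
    print(row)
```

<p>On this toy table, the outlier record is the only one suppressed, and the remaining four records form a <span class="math">\(2\)</span>-anonymous table.</p>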
<h4 id="algorithms">Algorithms</h4>
<p><span class="math">\(k\)</span>-anonymity is the oldest privacy definition, and it's relatively simple to
understand, so it was quickly adopted by the healthcare community for their
data anonymization needs. As a result, there has been a <em>lot</em> of research on how
to transform a dataset into a <span class="math">\(k\)</span>-anonymous table.</p>
<p>The problem of finding an <em>optimal</em> strategy for <span class="math">\(k\)</span>-anonymity is <a href="https://en.wikipedia.org/wiki/NP-hardness">NP-hard</a>, for
basically any reasonable definition of optimality. <a href="https://desfontain.es/PDFs/PhD/OnTheComplexityOfOptimalKAnonymity.pdf">This paper<sup>
(pdf)</sup></a> presents a few such results, if you're interested in this
kind of thing ^^</p>
<p>A list of approximation algorithms for the optimal <span class="math">\(k\)</span>-anonymization problem can
be found in <a href="https://desfontain.es/PDFs/PhD/PublishingDataFromElectronicHealthRecordsWhilePreservingPrivacyASurveyOfAlgorithms.pdf">this paper<sup> (pdf)</sup></a> (Table 4, page 11). 18
different algorithms are listed, and I don't even think the list is exhaustive!
The paper contains many links to the original papers, and to some comparisons
between methods. Sadly, there is no unified benchmark to know how all these
algorithms perform on various data analysis tasks.</p>
<h4 id="in-practice">In practice</h4>
<p>Unless you're a PhD student working on your literature review, you're probably
not looking for a bunch of links to research papers about complicated
<span class="math">\(k\)</span>-anonymization algorithms. If you're a data owner trying to transform a
dataset to get a <span class="math">\(k\)</span>-anonymous table, you may be looking for software instead.</p>
<p>As of 2017, the main open-source tool for data anonymization is <a href="http://arx.deidentifier.org/">ARX</a>. Its
interface is a bit difficult to understand at first, but it works fairly well on
small to moderately large datasets, and implements a lot more than just
<span class="math">\(k\)</span>-anonymity algorithms. It used to feature only global generalization
techniques<sup id="fnref:edit"><a class="footnote-ref" href="#fn:edit">1</a></sup>, but this apparently <a href="http://arx.deidentifier.org/anonymization-tool/analysis/#a50">changed recently</a>.</p>
<p>There are other tools available online, but none of them is anywhere near as usable
as ARX. Many of them are listed in the <a href="http://arx.deidentifier.org/overview/related-software/">Related software</a> page of ARX's
website. I've tried most of them, only to get convinced that none of them really
reached the point of being a usable product. <a href="http://cs.utdallas.edu/dspl/cgi-bin/toolbox/index.php">UTD Anonymization Toolbox</a> is
probably the only one worth a look: it requires using the command line and
impractical configuration files to work, but it implements a local
generalization algorithm (the first of its kind, named <a href="https://desfontain.es/PDFs/PhD/MondrianMultidimensionalKANonymity.pdf">Mondrian<sup>
(pdf)</sup></a>, a very cool technique with better utility preservation
than global generalization algorithms).</p>
<p>On the commercial side, I've only heard of a toolkit developed by the consulting
company <a href="https://privacy-analytics.com/">Privacy Analytics</a>. The intended audience seems to be people who know
little about privacy: it looks very shiny, but I didn't manage to understand
which anonymity property or algorithms they were using ^^ You can get a free
trial by filling out a form on their website, but I can only assume the real
version is very expensive, since there is no mention of price anywhere.</p>
<h1 id="how-convincing-really-is-k-anonymity">How convincing really is <span class="math">\(k\)</span>-anonymity?</h1>
<p><span class="math">\(k\)</span>-anonymity is simple to understand, and it seems intuitively obvious that
reidentification attacks are well mitigated when a dataset is transformed to
become <span class="math">\(k\)</span>-anonymous. However, it only mitigates this particular kind of attack.
We assumed that all that the attacker wanted was to select a target, point at a
record, and say "this record corresponds to my target" with a high certainty.
This matches Sweeney's original attack, but how realistic is this?</p>
<p>When an attacker successfully reidentifies someone in a dataset, it's not
necessarily a privacy issue. Consider the voter files from earlier. By law, this
data is public, and contains full names. It's very easy for an attacker to
point at a random record and shout "hey, I reidentified this person!": the
identification is <em>right there</em> in the dataset. This "attack" <em>always succeeds</em>,
but it's not really interesting, nor particularly creepy… Why is that?</p>
<p>In Sweeney's example, the creepy thing isn't just finding the data subject
associated with a given record. The <em>sensitive</em> information linked with the
record (in our leading example, diagnoses and drug prescriptions) is where the
creepiness comes from! The leak of <em>sensitive</em> information associated with one
given individual is the problem, not the reidentification itself.</p>
<p><span class="math">\(k\)</span>-anonymity doesn't really capture this idea. The definition just prevents you
from knowing the real identity of an anonymized record. But maybe there are
other attacks that allow you to find out sensitive information about someone,
without finding with absolute certainty which record is theirs?</p>
<p>As I'll explain in future articles, other types of attacks do exist, and many
other definitions have been proposed in order to mitigate them too. Nonetheless,
<span class="math">\(k\)</span>-anonymity is still used in the healthcare world, in large part because of
its simplicity and utility preservation compared to other definitions.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:edit">
<p>A previous version of this post claimed that only global generalization
was available in ARX. Sorry for the factual mistake! I should have read the
docs more closely =) <a class="footnote-backref" href="#fnref:edit" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Beginnings2017-07-17T00:00:00+02:002017-07-17T00:00:00+02:00Damien Desfontainestag:desfontain.es,2017-07-17:/privacy/beginnings.html<p>Blog intro. What's going to be there?</p><p><strong>Hi there!</strong> I'm <a href="../serious.html">Damien</a>. I have no idea how people usually
start blogs, so bear with me while I figure this out.</p>
<p>I see this place as a way to publish things that are too long for Twitter, too
opinionated for Wikipedia, and not pretentious enough for Medium. I'm going to
try and keep it to three themes: <em>privacy</em>, <em>research</em>, and <em>privacy research</em>.
I'm not (yet) a specialist in any of these. Hopefully, thanks to my jobs and
personal interest in those topics, I can add something valuable to what's
written online about them.</p>
<p>The following is the vision I have of these three themes. This should give an
idea of what I intend to talk about in this blog =)</p>
<h2 id="privacy">Privacy</h2>
<p>It's difficult to define what privacy encompasses. It's easier to realize when
you don't have enough privacy — through bad surprises, uneasy feelings of
creepiness, or real risks to your safety.</p>
<p>When a parent or a partner installs stealthy software on your phone to spy on
your texts and calls, that's an invasion of your privacy. When a company sells
your name, address and purchase history to some sketchy third-party that sends
you targeted ads, the uneasy feeling you get comes from a lack of privacy.
Full-body scanners in certain airports are an attack on one's bodily privacy.
Data leaks are a risk to users' privacy.</p>
<p>Privacy issues usually come from a lack of <em>transparency</em>, of <em>control</em>, or
both. In an ideal world, everybody would know exactly who has access to which
data about them and why. Personal data collection would not happen without
informed consent, and people would have a right to access, modify and delete
data that other people or organizations hold about them.</p>
<p>The fuzziness, and the complexity of the issues in this space, are part of what
I find interesting about them. I have done many privacy reviews for Google
products, and there is always something interesting and new with each of them.
Would users expect this behavior? Is this deletion action clear enough? Could
someone re-identify this aggregated data?</p>
<p>Like security, privacy is of particular importance for marginalized communities.
Having your phone number leaked online is much more problematic if you're a
high-profile political activist, or a closeted LGBTQ+ blogger. Harassment of
folks that belong to minorities is a major problem, and badly-designed sharing
interfaces or insufficient anti-abuse tools can lead to dramatic consequences.
Designing tools that deal with potentially sensitive data, and failing to
consider these specific risks, is highly irresponsible. And you can easily guess
what I think of compliance-based privacy programs…</p>
<p>I also try to avoid absolutist viewpoints. They are hardly ever constructive,
and they are often dangerous. I know people who refuse to use Signal because
it's not available without Google Play Services, while continuing to communicate
via cleartext SMS messages. For most practical problems, there is no perfect
solution. Focusing on defending against a hypothetical all-powerful targeted
attacker is usually pointless. Instead, I try to focus on realistic threat
models, usable tools, and risk mitigation.</p>
<h2 id="research">Research</h2>
<p>I started a part-time PhD after two years of software engineering at Google.</p>
<p>To solve an engineering problem, the path is quite straightforward. Grasp the
scope of the problem, design a solution, validate the design with coworkers and
stakeholders, write code, verify that the solution is "good enough", then
productionize it. Once the problem disappears, there's no time to think about
it any further: there are other problems to solve, other fires to put out.</p>
<p>The whole process is fun and rewarding, but I'm frustrated by the ending. What
if we could design a simpler or more efficient solution? Prove that it works in
a wider range of situations? Share the idea behind it with more people, and see
whether they get inspired and solve other problems? Doing all of this is not
immediately rewarding, but I think it can have a deeper, and longer-lasting
impact, than core engineering work.</p>
<p>I optimistically think that academia is the place to do that. Compare the
solution to what's out there already, make more experiments, write proofs,
figure out what additional impact it could have. Share the results with as many
people as possible. It might not be worth the time, but I think it's worthwhile
to give it a try. There are certainly interesting things to learn along the way.</p>
<p>The one thing that I'm afraid of is spending time solving the wrong problems.
Finding a "good problem" is not easy: a good problem must be difficult enough to
not have been already solved, but simple enough to have a chance at tackling it.
Identifying practical problems and their precise constraints is also hard, when
the main source of inspiration is other academics' work.</p>
<p>I'm frustrated about the lack of incentives to do research work as a software
engineer, but the incentives of academia are even more broken. Publication
metrics are a bad way to estimate one's impact, especially in the short term.
The peer review process is terribly implemented in practice. The whole system
makes it painfully slow to gather feedback, and the little feedback you get is
imprecise. The idea of having my work praised only to realize much later that it
didn't make a difference in practice… It's even scarier to me than the idea of
not finding joy and impact in my research, and deciding to quit.</p>
<p>But I'm not exactly pessimistic :D I feel lucky and enthusiastic about this
part-time project. Continuing to do engineering work for Google gives me an
endless input of complicated real-world problems to tackle, many of which seem
to be good candidates for research projects. I am surrounded by impressively
smart and passionate coworkers on both sides, whose feedback is invaluable. And
I don't feel extremely attached to the idea of having an academic career or even
getting the title at the end of my PhD, so I don't really feel the pressure to
publish everything and anything just to increment some counters.</p>
<p>All in all, this sounds like a fun and challenging adventure. I'm excited to see
what I'll learn along the way!</p>
<h2 id="privacy-research">Privacy research</h2>
<p>My research, like my engineering job at Google, will focus on privacy. This is a
field whose boundaries are not well defined, and which encompasses very distinct
sub-fields. Some researchers focus on user research to understand the
perceptions of real people with regard to their personal data (there are a bunch
of them at Google). Very little math is involved. Some are designing algorithms
that have provable privacy-related properties, like private set intersection or
differentially private surveys. Lots of math there! ^^ Some study the problem of
<em>anonymizing</em> (or <em>de-identifying</em>) a dataset, so it can be used by more people
or shared with third parties. Some focus on onion routing, on online tracking,
on cryptocurrency, on privacy policies, on genetic privacy, on social networks,
and the list is far from exhaustive. So… what am I doing exactly?</p>
<p>My PhD project is about <em>making it easier for data owners to understand and
protect the personal information contained in their databases</em>. I see this goal
as having two main subcomponents.</p>
<ol>
<li>
<p><em>Risk analysis</em>. Many organizations, companies, and governments
sit on large databases of personal information, and it's difficult
for them to realize how sensitive that information is. Leaking your users' country of
origin is intuitively less of a problem than leaking their e-mail addresses,
which in turn is not as big a deal as leaking their credit card information.
Sadly, doing this type of inventory and risk analysis is currently pretty
difficult: it requires time, investment, and specific expertise. It
shouldn't have to be this way, so I'm working towards building tools that
make this easier.</p>
</li>
<li>
<p><em>Anonymization</em>. Once you've realized how sensitive your data is,
you'll hopefully want to take steps to protect it. There are many ways to lower the risk
of bad people having access to your database: encryption, access controls,
or many other security techniques. Another option is to modify the database
itself, so that somebody with access to it can't deduce creepy things about the
individuals whose data it contains. I'm working towards making this process
easier and more understandable for data owners.</p>
</li>
</ol>
<p>I could (and hopefully, I will!) talk at length about these two things. They
have already been studied by many people over the past ~15 years (especially
anonymization), but I think that there is a lot of room for more popularization
on the topic, and significant improvements to make on the research side. On the
anonymization topic in particular, I feel it is urgent to work towards
bridging the gap between research advances and concrete use cases.</p>
<p>Maybe I'll realize along the way that I'm looking at the wrong problems, or that
it proves more difficult than I thought to improve the state of the art. But as
I've been told, that's part of what makes it challenging and fun ^^</p>