
Five hard lessons learned by privacy engineers

In a previous blog post about privacy in AI, I listed five hard truths that privacy experts know about such systems. Writing this was a bit frustrating: I spent the entire time wanting to yell « this isn't actually specific to AI! » at my screen. Yelling at a screen isn't a productive use of time, so here's a follow-up blog post instead. It generalizes the five facts about privacy in AI into five hard truths that privacy experts know… about any real-world software system, really.

Just like the hard truths about privacy in AI, these lessons are well known to experts, and junior folks learn them quickly through experience, or from stories told by their senior peers. Yet they can be unexpected and surprising to people who don't focus on privacy.

1. By default, processes leak information

A process taking in personal data will often leak some of this data. This happens in two main ways.

  • The output of the process will contain more personal data than you expect. Pictures will have identifying metadata (see the sketch after this list), statistics will be more revealing than you think, synthetic data generation actually leaks private data, etc.
  • The execution of the process will produce information via side channels (for example, logs, transfers to third-parties, execution traces), and this information might contain personal data.
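
As a quick illustration of the first point, here is a minimal sketch, assuming the Pillow imaging library and a hypothetical user-uploaded "photo.jpg": a picture that looks perfectly innocuous can carry the exact GPS coordinates of where it was taken, and any pipeline that stores or republishes the file unchanged passes that location along.

```python
# Minimal sketch: read location metadata that silently travels with a photo.
# Assumes the Pillow library is installed and "photo.jpg" is a hypothetical upload.
from PIL import Image, ExifTags

img = Image.open("photo.jpg")
exif = img.getexif()
gps_ifd = exif.get_ifd(0x8825)  # 0x8825 is the standard GPSInfo EXIF tag

if gps_ifd:
    # Translate numeric GPS tags (latitude, longitude, timestamp, ...) into names.
    gps_data = {ExifTags.GPSTAGS.get(tag, tag): value for tag, value in gps_ifd.items()}
    print("This 'innocuous' picture also reveals:", gps_data)
else:
    print("No GPS metadata in this file.")
```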

Both kinds of data leakage can go unnoticed for a long time. To catch them and limit their frequency and impact, you need privacy engineers: people who can understand what privacy goals your organization needs to reach, and help you reach those goals by reviewing and improving technical designs.

2. This actually matters in practice

Inadvertent data leakage is not just a theoretical problem. It causes real harm to real people, and can have serious business impact.

A particularly salient example of inadvertent data leakage is Grindr's closeness feature. Grindr is a dating app mostly used by gay men. When you open it, it shows you a list of profiles, and how far each is from your current location. On its own, this distance is not enough to geo-locate you… but a few distance readings, taken from different positions, are all an attacker needs to precisely triangulate someone. Which actually happens in practice.
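
To see why, here is a toy sketch of the attack, with made-up coordinates and numpy as the only dependency (this illustrates the general technique, not Grindr's actual data or API): an attacker who reads the displayed distance from three known positions can solve a small linear system and recover the target's exact location.

```python
# Toy trilateration: three distance readings from known positions pin down a target.
import numpy as np

observers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])  # attacker's known positions
target = np.array([3.0, 4.0])                                  # the location the app hides
distances = np.linalg.norm(observers - target, axis=1)         # the distances the app shows

# Each reading defines a circle around an observer; subtracting the first circle
# equation from the other two turns the problem into a linear system.
(x1, y1), (x2, y2), (x3, y3) = observers
d1, d2, d3 = distances
A = np.array([[2 * (x2 - x1), 2 * (y2 - y1)],
              [2 * (x3 - x1), 2 * (y3 - y1)]])
b = np.array([d1**2 - d2**2 - x1**2 + x2**2 - y1**2 + y2**2,
              d1**2 - d3**2 - x1**2 + x3**2 - y1**2 + y3**2])
print(np.linalg.solve(A, b))  # → [3. 4.], the target's exact position
```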

When a badly-designed product harms the privacy of its users, this can have serious consequences for the business. Consider Google Buzz, an early and short-lived attempt at building a social network. Weak privacy settings drew overwhelmingly negative press coverage at launch. Follow-up lawsuits saddled Google with a consent decree whose total compliance cost over the years likely runs into the billions of dollars.

Perhaps the most tangible yet overlooked harm of consumer technology is its role in domestic abuse. For example, any product that includes some kind of automatic sharing of location or activity is sure to be noticed by abusers who want to track their partner without their knowledge or consent. It takes careful system design and thoughtful UX choices to mitigate this kind of threat.

3. Adversaries are smarter than you

Security and privacy both suffer from the same plague: the temptation to consider a system safe because one can't think of a way to attack it. But nefarious people have a lot of advantages: they are more numerous, more diverse, and more motivated than the defenders inside your organization. They also have a lot more time: privacy review typically happens once per launch, but attackers have the entire lifetime of your product to attack it.

This translates to a common situation where privacy failures come as a surprise to the people who built the software. They felt very confident that they had thought of every possible scenario! But once they are proven wrong, hindsight bias kicks in: the vulnerability seems obvious in retrospect. So the wrong lesson is learned, as people think that next time, they will be able to anticipate the "obvious" issue and make the system safer.

Instead, experienced folks know that attackers are always going to find unexpected ways to exploit systems, and plan accordingly. They will design systems with defense in depth so that individual mitigations can fail without leading to a critical failure. They will bring in diverse teams of privacy experts to help them think creatively about what can go wrong. And they will use provably robust privacy technology when appropriate. Which neatly brings us to our next point!

4. Robust protections exist, but there is no silver bullet

In the data protection community, we love privacy-enhancing technology, and not just because it combines two of our favorite interests (math and protecting people's data). It's because our jobs are usually full of hard-to-quantify uncertainty and risk, which forces us to make a lot of judgment calls. It's very hard to anticipate what attackers might do next, so we have to rely on experience and intuition to evaluate what is good enough… which typically doesn't leave us with a great deal of confidence. And it's very difficult to give executives the hard numbers or the confident statements that they would like from us.

By contrast, robust privacy-enhancing technologies give us precise, solid statements. They say things like: anyone who gets access to the data at this stage of the process cannot learn more than this amount of information. This is mathematically proven and quantified. A breath of fresh air in a field so full of blurry trade-offs! When a potential privacy risk can be addressed by robust privacy-enhancing tech, it's often a very promising solution. Typical use cases are things like: sharing insights about sensitive data while controlling re-identification risk, jointly computing statistics with an untrusted partner, training a machine learning model in an anonymous way…
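
To make this concrete, here is a minimal sketch of one such technology, differential privacy, applied to a toy counting query (the dataset, the predicate, and the epsilon value are all made up for the example): adding noise calibrated to the query's sensitivity yields a mathematically quantified bound on what anyone seeing the output can learn about any single person in the data.

```python
# Minimal differential privacy sketch: a noisy count with the Laplace mechanism.
import numpy as np

def dp_count(records, predicate, epsilon):
    """Return an epsilon-DP count of the records matching the predicate.

    A counting query changes by at most 1 when any one person is added or
    removed (sensitivity 1), so Laplace noise of scale 1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [34, 29, 41, 52, 38, 27, 45]  # toy sensitive dataset
print(dp_count(ages, lambda age: age >= 40, epsilon=1.0))
```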

But there is no silver bullet: at best, you remove specific kinds of privacy risk, under precise assumptions. Sometimes, this can make a big difference! But at a minimum, you still need to do a holistic privacy review of your system, and make sure that all your assumptions hold. And there are many other privacy risks that cannot be solved with technology: harassment and other abuse vectors, confusing UX design, problematic retention practices, insider risk, and so on. So you should use robust privacy-enhancing technology as a solution to specific problems, not treat it as a magic silver bullet.

Side-note: if you're wondering whether privacy tech is right for you, or are looking for help in deploying it, I can help! My independent consultancy, Hiding Nemo, focuses on helping organizations do more with data with respect and compliance built-in, using privacy-enhancing technology. Don't hesitate to reach out!

5. Everything is harder at scale

Privacy law (and also, common courtesy) mandates that if a user asks you to delete their data, you should actually do that. If you're running a small service that relies on a single database, this is pretty easy. But if you're a sprawling multinational corporation, split across dozens of business units in different countries, each with their own IT systems… this can be close to impossible. Scale can turn a conceptually simple requirement into a fractally complex problem.

Most privacy risks have the same characteristic: they get a lot harder to mitigate as complexity and scale grow. Growth is often the main goal of a business, so it's hard to push back against scaling up a system. Instead, privacy engineers often try to reduce complexity: building a centralized data catalog, consolidating infrastructure, designing simpler systems with clear properties, and so on. This also brings many other benefits, in security, reliability, and engineering velocity: well-run organizations continuously invest in such efforts.

The cost of scale is not limited to engineering problems. Say that you have a feature that only gets used once a year on average, and unclear UX design leads 0.01% of your users to misunderstand how it works. With a popular enough service — say, with 100 million users — the failure mode will happen to dozens of people, every day.
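
The back-of-the-envelope arithmetic behind that claim, using the hypothetical numbers above:

```python
# Rough arithmetic for the scenario above (all numbers hypothetical).
users = 100_000_000      # a popular service
uses_per_year = 1        # the feature is used about once a year per user
confusion_rate = 0.0001  # 0.01% of those uses hit the confusing UX path

affected_per_day = users * uses_per_year * confusion_rate / 365
print(affected_per_day)  # ≈ 27 people hitting the failure mode, every single day
```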

Bonus hard lesson: Honesty is… not the norm

I ended the privacy in AI blog post by pointing out that AI vendors are not being particularly honest about the privacy properties of the models they train. Sadly, this is not specific to AI either.

Privacy looks simple from a distance — just be respectful with my data! — but the details can get very complicated. Does "letting advertising partners access a real-time bidding API" count as "selling people's data"? Does "removing emails and phone numbers" constitute anonymization? What fits under the "legitimate interest" umbrella in GDPR? When is consent "freely given"?

There are a lot of gray areas, and sometimes the principled answer to these questions isn't very convenient for the business. So PR departments routinely use that ambiguity to put out statements that sound good, but don't actually mean anything concrete and hide less-than-ideal data practices. It's infuriating to privacy experts: it's dishonest, of course, but it also makes it harder to do the right thing! People have learned not to believe anything companies say about privacy. So when companies actually try to do the right thing, it's difficult to communicate about it and build trust in the process.

This forces privacy professionals to find other arguments to push for changes inside their organizations. Getting a robust compliance story, mitigating reputational risk, unlocking business opportunities, improving velocity with good data hygiene… A major part of the job is to find ways of achieving good privacy outcomes without relying on the good will of the business.


Thanks to Curtis Mitchell for the helpful feedback on an earlier version of this blog post.

Feedback on these posts is welcome! Reach out via e-mail (se.niatnofsed@neimad) for comments and suggestions.
Interested in using privacy-enhancing technology to do more with your data, with respect and compliance built-in? I can help! Check out the website of my independent consultancy, Hiding Nemo, to learn more.