2  Defining anonymization

Embrace diversity.
Unite—
Or be divided,
robbed,
ruled,
killed
By those who see you as prey.
Embrace diversity
Or be destroyed.
(Octavia Butler, Parable of the Sower)

The first step towards solving any problem is defining precisely what we are trying to achieve. In the case of anonymization, this is a surprisingly difficult endeavor. Informally, the objective is simple: anonymized data should not reveal sensitive information about individuals. This goal is simple and intuitive; the difficulty lies in converting it into a formal statement.

First, what kind of information counts as sensitive? Different people have different expectations and opinions about what they would prefer to keep private about themselves. In some cases, there might be an obvious answer: for example, in a healthcare dataset, diagnostic information is sensitive. But in many other cases, it is not so obvious. Information that seems benign to most people can reveal private facts about individuals: for example, revealing one's gender in an old dataset might out a transgender person to their colleagues.

One solution to this difficulty is to focus on the “about individuals” part of the informal definition above. The reasoning goes as follows. If it is impossible to associate any information with a specific individual, then that person is protected. Their anonymity is preserved.

Organizations, when communicating about how they use anonymization to protect people’s privacy, typically focus on this aspect. The US Census Bureau states that it is “not permitted to publicly release [one’s] responses in any way that could identify you” [199]; the relevant statute specifies that it may not “make any publication whereby data furnished by any particular […] individual […] can be identified” [313]. Similarly, some companies state that anonymized data “cannot be associated with any one individual” [198].

Note the use of "could" or "can" in these definitions: they limit what can potentially be inferred from the data, not merely what is obvious at first glance. This is in stark contrast with the misleading claim, all too common among companies and research organizations, that data they publish is anonymized simply because it does not immediately reveal people's identities [93, 94, 331]: there is a large difference between making reidentification merely non-trivial and making it impossible. This misuse of terminology feeds the frequent misconception that anonymization is simply impossible [59, 287].

So, how can we formalize the impossibility of reidentifying individuals? Over the past few decades, a large number of definitions have been proposed. These data privacy definitions fall into two broad categories. The first approach is to define a criterion that applies to the data itself: a dataset that satisfies this criterion is considered "anonymized enough", and the problem is then to explore the different ways to transform data so the end result satisfies it. In this chapter, we will introduce four such definitions, and explain the shortcomings of each, as well as of this general approach.

A second possibility is to consider anonymization as a property of the process, rather than of its outputs. This is one of the core insights of differential privacy, a definition that has met with remarkable success since its introduction: a quick online search shows that in 2020 alone, more than 3000 scientific papers mentioning differential privacy were published. We will introduce this notion, explain what advantages it provides, and detail how its guarantees can be interpreted.
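
Although the formal treatment only comes later in this chapter, it is worth previewing the shape of the standard formalization, using notation that will be made precise then: $\mathcal{M}$ denotes a randomized mechanism, $D$ and $D'$ any two datasets differing in a single individual's record, and $\varepsilon$ the privacy parameter. Differential privacy requires that for every set $S$ of possible outputs,
\[
\Pr\left[\mathcal{M}(D) \in S\right] \le e^{\varepsilon} \cdot \Pr\left[\mathcal{M}(D') \in S\right].
\]
The constraint bears on the process $\mathcal{M}$, not on any particular output: informally, whatever is released is almost as likely whether or not any single individual's data was included.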

This fundamental idea of a process-based definition of anonymization has proved very fruitful. In the decade following the introduction of differential privacy, a large number of variants and extensions of the original definition were proposed. In the third and main part of this chapter, we propose a taxonomy of these variants and extensions. We list all the definitions based on differential privacy that we could find in a systematic literature survey, and partition them into seven categories, depending on which aspect of the original definition is modified. We also establish a partial ordering of relative strength and expressibility between all these notions. Furthermore, we list which of these definitions satisfy some of the same desirable properties that differential privacy provides.
