Lowering the cost of anonymization

a PhD thesis

2.1  From syntactic to semantic privacy

Before differential privacy was introduced, a number of privacy researchers tried to find a good definition for anonymized data, and to develop algorithms that modify a dataset so that it satisfies a given definition. In this section, we list four of those definitions. We start with $k$-anonymity, an influential notion that was the starting point for an active area of research. We then present $\ell$-diversity, $k$-map, and $\delta$-presence.

Although these definitions are no longer widely used in anonymization research, a short tour of them is valuable. Their differences help us understand the various types of privacy risks that arise from data releases, as well as natural approaches to mitigating these risks. Their history illustrates the difficulty of coming up with a satisfying definition of anonymized data, which makes it easier to appreciate the insights behind differential privacy. Finally, we will revisit some of these notions in later chapters of this thesis, where we show that some of them can still be relevant and useful.

Throughout this thesis, we denote by $\mathcal{D}$ the space of possible datasets. Each dataset $D \in \mathcal{D}$ is a list of records with values in an arbitrary set $\mathcal{T}$. We denote by $n$ the number of records in $D$. Unless specified otherwise, each individual in a dataset is associated with a single record. We number each individual $i$; the corresponding record is denoted by $D(i)$. Typically, $\mathcal{T}$ can be the set of integers $\mathbb{Z}$, the set of real numbers $\mathbb{R}$, some finite set of possible categories, the set of strings, or a combination of the above. These notations, and others used in this section, are summarized on the notations page.
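To make the notation concrete, here is a minimal Python sketch of a dataset as a list of records, with one record per individual. Everything in it is an illustrative assumption rather than part of the thesis: the names `Record`, `Dataset`, `num_records` and `record_of` are hypothetical, the record type (a string, an integer and a category) is just one possible "combination of the above", the sample values are made up, and numbering individuals from 1 is a choice made for this example.

```python
from typing import List, Tuple

# Hypothetical record type: a combination of a string, an integer and a
# category, e.g. (ZIP code, age, diagnosis). Any other set of values would do.
Record = Tuple[str, int, str]

# A dataset is a list of records; each individual has exactly one record.
Dataset = List[Record]

# Made-up example data, for illustration only.
dataset: Dataset = [
    ("4217", 34, "flu"),
    ("4217", 39, "asthma"),
    ("1742", 57, "flu"),
]


def num_records(d: Dataset) -> int:
    """Return the number of records in the dataset."""
    return len(d)


def record_of(d: Dataset, i: int) -> Record:
    """Return the record of individual i (assumed numbered from 1)."""
    if not 1 <= i <= len(d):
        raise IndexError(f"no individual numbered {i}")
    return d[i - 1]
```

For instance, `num_records(dataset)` counts the records, and `record_of(dataset, 2)` looks up the record associated with the second individual.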
