Lowering the cost of anonymization

a PhD thesis

Personal contributions

Almost all this thesis is the result of work done in collaboration with other researchers, engineers, and other colleagues. It was an utmost privilege to work with so many exceptional people. Much of the credit goes to them, although I am solely responsible for all the mistakes that might still be present. Large portions of this work have also been published in some other form in other venues. This section gives more context on existing publications and gives credit where credit is due.

The historical overview of syntactic definitions of privacy, and the introduction of differential privacy present in Section 2.1 merely presents existing work. They are based on a series of blog posts published on my personal website [98].

The systematization of knowledge on variants and extensions of differential privacy exposed in Section 2.2 was done in collaboration with Balazs Pejo. We contributed equally to this work, whose short version was published as [102]. The long version is available on arXiv [101], and is currently under review.

The exploration on differential privacy with partial knowledge presented in Sections 3.1 and 3.2 is the result of a collaboration with Elisabeth Krahmer and Esfandiar Mohammadi, supervised by David Basin. I developed and wrote most of the original paper, but the core insights behind the concept acceptable normalizations emerged from productive discussions with Esfandiar, who also pushed for and came up with most of the composition results. Elisabeth proved the results on -reducibility, and implemented the experiments in these sections; she and David also found critical flaws in initial versions of many results. The corresponding paper is available on arXiv [100] and is currently under review.

The work on cardinality estimators presented in Section 3.3 and published in [99] was done in collaboration with David Basin and Andreas Lochbihler. I defined the original problem, which originated from a Google privacy review, came up with the core results, and wrote most of the original paper. However, this could not have happened without Andreas and David, and I am grateful for their patience and mentorship: the clear definition of the attacker model is theirs, they found multiple critical flaws in earlier versions of the proofs, and I learned how to write scientific papers thanks to the many rounds of reviews they performed on my initial drafts.

The original design and implementation of KHyperLogLog is largely due to Pern Hui Chia. The work written up in Section 4.1 is the result of a collaboration with Wei-Yen Day, Miguel Guevara, Chao Li, Milinda Perera, Daniel Simmons-Marengo, Qiushi Wang, and myself; it was published as [76]. I contributed to the production implementation of the system, helped with thorny design issues, re-implemented it in SQL, ran the experimental validation, and published the experimental code as open-source software. I also helped write the paper and convince various internal stakeholders that this was worth publishing.

The differentially private SQL engine from Section 4.2 was designed and built by Bryant Gipson, William Lam, Daniel Simmons-Marengo, Royce J Wilson, and Celia Zhang; since the publication of the corresponding paper as [388], many more people have contributed to it. My personal contribution consists mostly in reviewing the design of various elements, and turning the core engineering contribution into a scientific paper by framing the contributions, organizing them into a coherent story, defining the experimental setup, and performing many rounds of reviews.

The work on partition selection in Section 4.3.2 was done in collaboration with Bryant Gipson and James Voss. I came up with the original insight, write-up and experiments, Bryant did the initial math for the closed-form formula, and James implemented it and properly wrote it up, fixing a number of subtle problems in the process.

I wrote the initial proof of concept of Privacy on Beam, mentioned in Section 4.3.3; Miraç Vuslat Başaran, Christoph Dibak, and Maria Telyatnikova turned it into a production-grade, open-source library, with contributions from Skye Berghel, Manushree Vijayvergiya, and James Voss. Privacy on Beam is the Go version of a tool previously designed and implemented in C++ by Kareem Amin, Jenny Gillenwater, Alex Kulesza, Andrés Muñoz Medina, and Sergei Vassilvitski.

All opinions here are my own, not my employers.
I'm always glad to get feedback! If you'd like to contact me, please do so via e-mail (se.niatnofsed@neimad) or Twitter (@TedOnPrivacy).