4.1 Analyzing privacy risks at scale ✿
We saw in Section 2.1.6 that differential privacy is a property of the mechanism, not of its output. This core insight makes the definition ill-suited for an important class of problems encountered in practice: risk analysis. Differential privacy can give strong guarantees about the reidentifiability (or lack thereof) of some data, assuming the algorithm that generated this data was specifically built to be differentially private. What happens when that is not the case, and the data we are interested in has already been generated? Is there a way to quantify, or at least estimate, the reidentifiability risk?
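To make the question concrete, one simple proxy for reidentifiability is the uniqueness distribution of a field: how many of its distinct values are associated with only a handful of users? The sketch below is purely illustrative; the function name, the data layout (an iterable of (field value, user ID) pairs), and the threshold of 5 are assumptions made for the example, not part of any particular system.

```python
from collections import defaultdict

def fraction_of_rare_values(rows, threshold=5):
    """Return the fraction of distinct field values associated with
    fewer than `threshold` distinct user IDs.

    `rows` is an iterable of (field_value, user_id) pairs.
    """
    users_per_value = defaultdict(set)
    for value, user_id in rows:
        users_per_value[value].add(user_id)
    rare = sum(1 for ids in users_per_value.values() if len(ids) < threshold)
    return rare / len(users_per_value)

# A ZIP code shared by thousands of users reveals little; one that maps
# to a single user is close to a direct identifier.
rows = [("94105", "alice"), ("94105", "bob"), ("10001", "carol")]
print(fraction_of_rare_values(rows))  # 1.0: both values map to <5 users
```

Computing this exactly requires a pass over every dataset, with enough memory to deduplicate the user IDs seen with each value; this quickly becomes impractical at scale.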
This problem is of significant practical relevance. Large companies often have many different datasets, whose number and size grow organically with the company and over time. This raises a number of policy concerns: who should have access to which data, how long each dataset should be retained, and so on. Answering these questions is one of the roles of privacy engineers: they must take inventory of existing data and make such policy decisions. Unfortunately, the need for comprehensive and consistent data governance practices is often not understood until a certain amount of organic growth has already taken place, and manual efforts alone are costly and unlikely to succeed: reviews requiring expert judgement do not scale. Privacy engineers need a way of quantifying the risk associated with datasets, both to prioritize which datasets to look at first and to assist them in policy efforts.
Reidentifiability, which we mentioned above, is one such risk. Another is joinability: can two datasets that are not supposed to be joined be linked in an obvious way, connecting the information of the same individual across their different identities? In this section, we present a possible approach to tackling these problems: analyzing datasets to detect high reidentifiability or joinability risks. Ironically, this approach is based on a novel sketching algorithm, KHyperLogLog (KHLL), which relies on the same cardinality estimators that we showed in Section 3.3 not to be private.
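To give an intuition for how such a sketch can work, here is a deliberately simplified two-level structure in Python: it keeps the K smallest hashes of field values (a K Minimum Values sketch) and attaches to each retained value the set of user IDs seen with it. Everything here is a simplification for illustration, not the actual KHLL implementation: the class name is invented, plain sets stand in for the HyperLogLog counters that give KHLL its name, and the hash function is an arbitrary stand-in.

```python
import hashlib

def _hash(s: str) -> int:
    # 64-bit hash of a string; a stand-in for the hash used in practice.
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

class SimplifiedKHLL:
    """Illustrative two-level sketch: the K smallest value hashes, each
    mapped to the set of user IDs seen with that value. A real KHLL
    stores a HyperLogLog sketch of the IDs instead of an exact set."""

    def __init__(self, k: int = 2048):
        self.k = k
        self.table = {}  # value hash -> set of user IDs

    def add(self, value: str, user_id: str) -> None:
        h = _hash(value)
        if h in self.table:
            self.table[h].add(user_id)
        elif len(self.table) < self.k:
            self.table[h] = {user_id}
        elif h < max(self.table):
            # Evict the largest hash so we keep only the K smallest.
            del self.table[max(self.table)]
            self.table[h] = {user_id}

    def fraction_of_rare_values(self, threshold: int = 5) -> float:
        # The retained values are a uniform sample of all distinct
        # values, so this ratio estimates the fraction of highly
        # reidentifying values in the full dataset.
        if not self.table:
            return 0.0
        rare = sum(1 for ids in self.table.values() if len(ids) < threshold)
        return rare / len(self.table)

sketch = SimplifiedKHLL(k=4)
for value, uid in [("94105", "alice"), ("94105", "bob"), ("10001", "carol")]:
    sketch.add(value, uid)
print(sketch.fraction_of_rare_values())  # 1.0 on this tiny example
```

Because the K retained hashes form a uniform random sample of the distinct values, statistics computed over the table extrapolate to the full dataset in constant memory; and since two sketches built with the same hash function retain comparable samples, the overlap between their value hashes can similarly serve to estimate joinability between datasets.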
Precisely quantifying these risks based only on the analysis of existing datasets (without knowledge of how they have been generated) is impossible in general: our methods are therefore heuristics. They are meant to serve as prioritization and detection tools, to assist engineers in the difficult task of building and operationalizing a data governance program.