4.1.1 Introduction
Understanding and monitoring the privacy state of production systems is a crucial element of privacy engineering. Data-flow analysis enables one to know which data processing workflows are reading which data and generating which outputs. Static and dynamic code analyses help to understand what binaries do at a more granular level, e.g., whether they use pre-approved APIs, whether they read or write sensitive data types, or whether they use safe defaults. Potential violations of privacy policies can be surfaced during code review sessions, before an engineer is allowed to run workflows in production systems.
However, while data-flow analysis and code analysis are powerful tools, characterizing the data itself to assess its sensitivity (as we propose in this chapter) is often useful or necessary. While human reviewers often have good intuitions and context about the sensitivity of data, there are obvious limitations: humans make mistakes, and there is a limit to how much they can review. An automated analysis system, on the other hand, can be both accurate and scalable. Where humans would have to settle for evaluating a system or dataset once, automated analyses can be re-run as systems and data evolve. This provides regression testing for privacy characteristics and enables data custodians to be confident about the properties of their systems.
Automatic evaluation becomes challenging as dataset size increases and as data becomes increasingly heterogeneous. While brute-force approaches may work for smaller datasets, their runtime and memory requirements become unworkable on petabytes of data.
Reidentifiability and joinability
We identify two characteristics of datasets that are often useful during privacy impact assessments, reidentifiability and joinability, and we develop a new, scalable, automated approach to measuring them. While these terms carry various connotations elsewhere, we define them below and use them consistently throughout this work.
Reidentifiability is the risk that the identity of some users can be recovered from a pseudonymous dataset. A good practice is to keep such reidentifiable datasets guarded with strict access controls and access audit trails. As companies collect more information to build useful services, it can be difficult to manually determine when a dataset becomes reidentifiable and requires more careful handling. The ability to estimate the reidentifiability of datasets automatically and efficiently reduces the manual labeling work required of data custodians.
Joinability measures whether datasets are linkable by unexpected join keys. Sometimes it is necessary to retain multiple datasets with different ID spaces. In those cases, data custodians should avoid linking the two datasets, to respect the choices of users who maintain separate identities. As an example, consider a website that can be used either signed in or signed out. A user may choose to use the website while signed out to separate activities from their signed-in identity. If the website operator maintains datasets about the activities of both signed-in and signed-out users, it might accidentally include granular information (e.g., the web browser user agent) in both datasets that could allow the signed-in and signed-out identities to be linked. In that case, we would say that the identities in the two datasets are joinable.
Some syntactic privacy definitions we listed in Chapter 2, like k-anonymity or k-map, suggest a possible approach to quantifying reidentifiability: for each value (or combination of values) appearing in the dataset, we could count the number of users associated with it. Similarly, taking the simplest approach to joinability, we could say that two datasets are joinable if there exists a pair of data fields, similar in content, that can serve as a join key. To measure the similarity between two fields, or to detect when the values of one field are a subset of the values of another (a notion we call containment later), the Jaccard index is a popular option [Jac02, Bro97].
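As a concrete illustration, both measures are easy to compute exactly on small data. The following Python sketch (with toy column values of our own) shows the two definitions side by side; note that containment, unlike the Jaccard index, is asymmetric:

    def jaccard(a: set, b: set) -> float:
        """Jaccard index: |A n B| / |A u B| (symmetric)."""
        return len(a & b) / len(a | b)

    def containment(a: set, b: set) -> float:
        """Containment of A in B: |A n B| / |A| (asymmetric)."""
        return len(a & b) / len(a)

    # Toy example: every value of column A also appears in column B, so the
    # containment of A in B is 1.0 even though the Jaccard index is low.
    col_a = {"id1", "id2"}
    col_b = {"id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"}
    print(jaccard(col_a, col_b))      # 0.25
    print(containment(col_a, col_b))  # 1.0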
However, managing reidentifiability and joinability risks at scale is more challenging than it appears. The naive approach requires memory proportional to the size of the dataset, which becomes extremely difficult as dataset sizes climb into the petabytes. Our experiments reveal how costly this naive approach is even for datasets on the order of gigabytes (see Section 4.1.4). Linear runtime and sublinear memory are necessary for large-scale data analysis.
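To see where the memory cost comes from, here is what the naive exact computation looks like in Python (a sketch of our own, assuming the dataset streams as (value, user_id) pairs): the map from values to user sets grows with the number of distinct pairs, i.e., roughly with the dataset itself.

    from collections import Counter, defaultdict

    def exact_uniqueness(records):
        """Naive uniqueness distribution: for each distinct field value,
        collect the set of users associated with it. Memory grows with the
        number of distinct (value, user) pairs, which is what becomes
        unworkable at petabyte scale."""
        users_per_value = defaultdict(set)
        for value, user_id in records:
            users_per_value[value].add(user_id)
        # Histogram: how many values are associated with exactly n users.
        return Counter(len(users) for users in users_per_value.values())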
Contributions
We present the KHyperLogLog (KHLL) algorithm and demonstrate how it can be used to efficiently characterize both the reidentifiability and joinability of very large datasets. Adapted from the field of cardinality estimation, KHLL produces quantitative measures for reidentifiability and joinability using only a single pass over the dataset and minimal memory. Both reidentifiability and joinability of datasets can be estimated using the compact data structures (colloquially known as “sketches”) of KHLL rather than raw data. In addition, the approach is format-agnostic, allowing it to analyze any dataset without modification. We have validated that KHLL is fast, parallelizable and accurate on both proprietary and public datasets.
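To make the data structure concrete, here is a deliberately simplified rendering in Python. Following the idea described above, it keeps the K smallest distinct hashes of a field's values (a K-minimum-values sketch) and pairs each retained hash with a record of the user IDs seen with that value. For brevity we store user-ID hashes in plain sets where the real algorithm uses HyperLogLog counters, and all names and parameters are illustrative:

    import hashlib

    def h64(value: str) -> int:
        """64-bit hash of a value (any well-mixed hash function would do)."""
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    class KHLLSketch:
        """Simplified KHyperLogLog: the K smallest distinct value hashes,
        each paired with a set of user-ID hashes (a HyperLogLog counter in
        the real algorithm)."""

        def __init__(self, k: int = 2048):
            self.k = k
            self.table = {}  # value hash -> set of user-ID hashes

        def add(self, value: str, user_id: str) -> None:
            """Single-pass update with one (value, user) observation."""
            vh = h64(value)
            if vh in self.table:
                self.table[vh].add(h64(user_id))
            elif len(self.table) < self.k:
                self.table[vh] = {h64(user_id)}
            else:
                largest = max(self.table)
                if vh < largest:  # keep only the K smallest value hashes
                    del self.table[largest]
                    self.table[vh] = {h64(user_id)}

        def cardinality(self) -> float:
            """K-minimum-values estimate of the number of distinct values."""
            if len(self.table) < self.k:
                return float(len(self.table))  # exact while not yet full
            return (self.k - 1) * 2.0**64 / max(self.table)

Because the retained hashes are a uniform random sample of the column's distinct values, statistics computed over the K stored entries estimate the same statistics over the full column; this is what the applications below rely on.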
KHLL can be used in a wide range of applications.
KHLL can be used to quantitatively measure the reidentifiability risk of a dataset, or of a subset of its columns. More precisely, it can be used to estimate the uniqueness distribution of an arbitrary combination of columns with respect to a user identifier. This can inform data custodians about the sensitivity of data and its risks so they can plan an appropriate data strategy (e.g., access controls, audit trails, or stronger anonymization) at different points in the life cycle of the data, including data collection, usage, sharing, retention, and deletion.
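Continuing the simplified sketch above, the uniqueness distribution can be read directly off the stored entries, since the retained value hashes are a uniform sample of the column's distinct values (again, illustrative code):

    from collections import Counter

    def uniqueness_distribution(sketch: KHLLSketch) -> dict:
        """Estimated fraction of the column's distinct values that are
        associated with exactly n distinct users; values associated with
        very few users are the reidentifiable ones."""
        counts = Counter(len(users) for users in sketch.table.values())
        total = sum(counts.values())
        return {n: c / total for n, c in sorted(counts.items())}

    # One pass over the data, then read off the estimate:
    # sketch = KHLLSketch()
    # for value, user_id in records:  # any iterable of (value, user_id) pairs
    #     sketch.add(value, user_id)
    # uniqueness_distribution(sketch)  # e.g. {1: 0.82, 2: 0.11, ...}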
The efficiency of KHLL provides data custodians with a powerful analysis tool for exploring different data sanitization strategies. Many data analysis tasks (e.g., experimentation to improve service availability, or anti-spam and fraud detection) can use a projection (view) of a high-dimensional dataset that is protected with k-anonymity. KHLL can be run on different possible combinations of columns in the dataset at once, to estimate how much data would be lost under each candidate strategy. Together, the data custodian and data analysts can decide how to trade off the utility of the data projection against the reidentifiability risk: which columns should be included, suppressed, generalized, or made disjoint in separate datasets, and whether the data needs stronger guarantees like differential privacy.
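For instance, with the simplified sketch above, one can estimate the fraction of a column combination's distinct values that would fall below a given k-anonymity threshold, and would therefore need to be suppressed or generalized (a rough proxy of our own for the data loss of a strategy):

    def fraction_suppressed(sketch: KHLLSketch, k_threshold: int) -> float:
        """Estimated fraction of distinct values associated with fewer than
        k_threshold distinct users, i.e. the values a k-anonymity policy
        would suppress or generalize."""
        counts = [len(users) for users in sketch.table.values()]
        return sum(1 for c in counts if c < k_threshold) / len(counts)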
Consider the Netflix prize dataset, which contains the movie IDs and ratings given by different users on different dates (year, month, and day). Analyzing the dataset using KHLL, we obtain results that mirror those of Narayanan and Shmatikov [NS08]. While no single column has high uniqueness (e.g., we observe that every movie in the dataset is rated by at least 50 users), combinations of movie ratings and dates are highly unique. An efficient analysis using KHLL might have helped the Netflix team measure the reidentifiability risks, explore alternatives for treating the data, or conclude that the risk of sharing the data externally was too high.
In cases where data custodians regularly produce k-anonymous (or similarly protected) datasets, KHLL can further be used as a regression test. KHLL analysis can be run on the output as part of the anonymization pipeline, to expose implementation bugs or to alert on unexpected changes to the characteristics of the input data.
KHLL can also enable efficient joinability assessment to protect user privacy. If an organization collects data about users under multiple ID spaces in different contexts (e.g., signed in vs. signed out), KHLL can help keep the IDs separate, respecting the choice of users to conduct certain activities in certain contexts. For example, KHLL analysis can be run on two datasets with different IDs to detect data columns that are similar across the two datasets (high containment in either direction) and highly unique. Such columns are potential join keys that could be used to trivially link the two ID spaces. To mitigate joinability risks, engineers can choose to suppress or generalize one of the columns, or use access controls to prevent anyone from using the columns to join the two identifiers. The analysis can be run periodically and attached to an alerting system that notifies engineers if joinability exceeds pre-specified limits (e.g., to quickly detect whether any newly added columns increase joinability risks). Joinability assessment quickly becomes intractable when done with pairwise comparisons of raw data, but KHLL enables joinability approximation based on its compact data structures (sketches).
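Continuing with the simplified sketch, containment between two columns can be estimated from the sketches alone, provided both were built with the same hash function: the overlap among the smallest hashes of the union estimates the Jaccard index, and inclusion-exclusion converts it into a containment score (illustrative code):

    def sketch_containment(a: KHLLSketch, b: KHLLSketch) -> float:
        """Estimate |A n B| / |A|: how much of column A's value set also
        appears in column B. A score near 1.0 for a highly unique column
        flags a potential join key between the two datasets."""
        k = min(a.k, b.k)
        smallest = sorted(set(a.table) | set(b.table))[:k]
        overlap = sum(1 for vh in smallest
                      if vh in a.table and vh in b.table)
        jaccard = overlap / len(smallest)
        # Inclusion-exclusion: |A n B| = J * (|A| + |B|) / (1 + J).
        intersection = jaccard * (a.cardinality() + b.cardinality()) / (1 + jaccard)
        return intersection / a.cardinality()

Because sketches for the two datasets can be built independently and compared later, periodic pairwise joinability scans remain tractable even across many datasets.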
Periodic KHLL-based joinability analyses have enabled us to uncover multiple logging mistakes that we were able to quickly resolve. One instance was the exact position of the volume slider on a media player, which was mistakenly stored as a 64-bit floating-point number. Such a high entropy value would potentially increase the joinability risk between signed-in and signed-out identifiers. We were able to mitigate the risk by greatly reducing the precision of the value we logged. In other cases, we have mitigated joinability risks by dropping certain columns entirely, or by ensuring that the access control lists of both datasets are disjoint.
If data custodians label their datasets with information about the semantics of certain columns, KHLL can be used to propagate labels through the system and to find inconsistencies. If two columns have a high containment score (in either direction), they are likely to share the same semantics. If one of the columns is labelled but the other is not, the label can be copied to the second column; if the two columns have different labels, engineers can be alerted that one of the labels is likely incorrect. The scalability of KHLL means that labels can be propagated across large datasets, and that label correctness can be repeatedly checked by re-running the analysis periodically.
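A minimal sketch of this propagation pass, reusing the sketch_containment estimate above; the data model and the 0.9 threshold are illustrative choices of ours, not part of KHLL:

    def propagate_labels(columns: list, threshold: float = 0.9) -> None:
        """columns: list of dicts with keys 'name', 'sketch' (a KHLLSketch),
        and 'label' (a semantic label, or None). Copies labels between
        columns whose value sets largely overlap, and flags conflicts."""
        for a in columns:
            for b in columns:
                if a is b or not a["label"]:
                    continue
                if sketch_containment(a["sketch"], b["sketch"]) < threshold:
                    continue
                if b["label"] is None:
                    b["label"] = a["label"]  # propagate the known label
                elif b["label"] != a["label"]:
                    print(f"label conflict: {a['name']} vs {b['name']}")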
Although not a primary purpose, a side effect of making a powerful analysis tool available to different roles in an organization is increased awareness of anonymization and user privacy. Data custodians, engineers, and analysts can discuss the analysis results with each other, gain a better understanding of reidentifiability risks when working with user data, and understand why further anonymization may be necessary.
For all of these use cases, one needs to keep in mind the estimation errors of KHLL (see Section 4.1.3): KHLL may underestimate reidentifiability or joinability risks, e.g., by missing values that are unique to a single user. In general, data custodians can use KHLL to estimate risks and impacts on data utility when exploring an appropriate data protection and anonymization strategy, but then use exact counting to execute the strategy. And while the joinability analysis using KHLL can be sensitive to data formats and transformations, the efficiency of KHLL makes it the best regression test for data joinability that we are aware of.
The rest of this section is organized as follows. We start by describing the design goals and challenges in Section 4.1.2, where we also provide background on cardinality estimation and present the KHLL algorithm. Next, we describe the use of KHLL for reidentifiability and joinability analysis in Section 4.1.3. We evaluate the performance and accuracy of KHLL empirically in Section 4.1.4, and we summarize our work and discuss related and future work in Section 4.1.5.