4.1.4 Experimental validation ✿
In this section, we validate the efficiency and accuracy of KHLL experimentally. Efficiency is validated in a production environment, so the experimental results provided are as realistic as possible. To provide reproducible accuracy results, we implemented a version of the KHLL algorithm in BigQuery Standard SQL and simulated the computation of uniqueness distribution and joinability using publicly available datasets. The code for the experiments can be adapted to other SQL engines, and is shared on GitHub [143].
Efficiency
The KHLL algorithm and the metrics we estimate with it have been implemented in a proprietary production environment in Go using the MapReduce programming model [95]. For each column (or specified combinations of columns) in the dataset, the MapReduce outputs a KHLL sketch in one pass.
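To make this concrete, the following is a minimal, illustrative Go sketch of a KHLL-like per-column data structure: it keeps the K smallest hashes of column values and attaches to each a small HLL-style register array for counting distinct IDs. The parameter values, names, and the simplistic pruning logic are ours for illustration; this is not the production Go/MapReduce implementation, and HLL cardinality estimation and bias corrections are omitted.

```go
// Illustrative KHLL-like sketch: K smallest value hashes, each paired with an
// HLL-style register array counting distinct IDs. Not the production code.
package khll

import (
	"hash/fnv"
	"math/bits"
	"sort"
)

const (
	K = 2048   // number of value hashes kept (assumed parameter)
	P = 9      // HLL precision; M = 2^P = 512 registers per value hash (assumed)
	M = 1 << P // number of HLL registers
)

type hllRegisters [M]uint8

// add records one ID hash in the HLL register array.
func (h *hllRegisters) add(idHash uint64) {
	bucket := idHash >> (64 - P)                        // top P bits select a register
	rank := uint8(bits.LeadingZeros64(idHash<<P|1)) + 1 // position of first 1 in the remaining bits
	if rank > h[bucket] {
		h[bucket] = rank
	}
}

// Sketch holds, for one column, the K smallest value hashes seen so far,
// each with an HLL counting the distinct IDs associated with that value.
type Sketch struct {
	entries map[uint64]*hllRegisters // value hash -> ID counter
}

func NewSketch() *Sketch { return &Sketch{entries: make(map[uint64]*hllRegisters)} }

func hash64(s string) uint64 { h := fnv.New64a(); h.Write([]byte(s)); return h.Sum64() }

// Add processes one (column value, ID) pair; in the MapReduce described
// above, this would be called once per row for each analyzed column.
func (s *Sketch) Add(value, id string) {
	vh := hash64(value)
	reg, ok := s.entries[vh]
	if !ok {
		if len(s.entries) >= K && vh >= s.threshold() {
			return // hash too large to be among the K smallest
		}
		reg = &hllRegisters{}
		s.entries[vh] = reg
		s.prune()
	}
	reg.add(hash64(id))
}

// threshold returns the largest value hash currently kept (linear scan, kept simple).
func (s *Sketch) threshold() uint64 {
	var max uint64
	for vh := range s.entries {
		if vh > max {
			max = vh
		}
	}
	return max
}

// prune drops the largest value hashes until only K remain.
func (s *Sketch) prune() {
	if len(s.entries) <= K {
		return
	}
	hashes := make([]uint64, 0, len(s.entries))
	for vh := range s.entries {
		hashes = append(hashes, vh)
	}
	sort.Slice(hashes, func(i, j int) bool { return hashes[i] < hashes[j] })
	for _, vh := range hashes[K:] {
		delete(s.entries, vh)
	}
}
```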
To provide a baseline for efficiency comparisons, we implemented two naive MapReduce algorithms. The first one, Count Exact (CE), computes the exact variants of the metrics that KHLL estimates. The second one, Count Exact Single (CES), computes the same set of metrics exactly, but analyzes only a single column per MapReduce run.
We designed the CE MapReduce to output, for each column, a dictionary of column values to ID sets (allowing the computation of the various statistics that a KHLL sketch approximates). One would expect that emitting the entire column-value-to-ID-set dictionary results in substantial memory overhead. The CES MapReduce is a more realistic simplification of CE, which outputs the value-to-ID-set tuples of only one specific column.
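The following minimal Go sketch shows what such an exact baseline maintains per column: a dictionary from column values to ID sets, from which the exact uniqueness distribution follows. The names are illustrative, not taken from the production code; the point is that memory grows with the number of distinct (value, ID) pairs.

```go
// Illustrative sketch of the exact per-column baseline: a dictionary of
// column values to ID sets, and the exact uniqueness histogram it yields.
package exactbaseline

// ExactColumnStats maps each column value to the set of IDs associated with it.
// Unlike a KHLL sketch, its size grows with the number of distinct
// (value, ID) pairs, which is what makes CE memory-intensive.
type ExactColumnStats map[string]map[string]struct{}

// Add records one (column value, ID) pair.
func (s ExactColumnStats) Add(value, id string) {
	ids, ok := s[value]
	if !ok {
		ids = make(map[string]struct{})
		s[value] = ids
	}
	ids[id] = struct{}{}
}

// UniquenessHistogram returns, for each ID count n, the number of column
// values associated with exactly n distinct IDs.
func (s ExactColumnStats) UniquenessHistogram() map[int]int {
	hist := make(map[int]int)
	for _, ids := range s {
		hist[len(ids)]++
	}
	return hist
}
```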
The test dataset on which we ran our performance analyses represents the data-flow graph of various production operations at Google. Each row in this dataset represents a node in the data-flow graph and has about 50 data columns describing the properties of the operations as well as the input and output data. One of the data columns specifies the globally unique name of the machine where the operation is run or controlled; we used this machine name as the ID column in our analyses. Note that this dataset does not contain any user data and is not privacy-sensitive: neither is necessary for performance measurements.
| MapReduce type | Input size | Algorithm | CPU usage (vCPU·s) | RAM usage (GB·s) | Peak RAM (GB) | Output size (GB) | Runtime (s) |
|----------------|------------|-----------|--------------------|------------------|---------------|------------------|-------------|
| All columns    | 1 GB       | CE        | 4.01e+3            | 1.14e+4          | 9.34e+0       | 1.04e+0          | 5.83e+2     |
| All columns    | 1 GB       | KHLL      | 9.78e+2            | 2.08e+3          | 9.45e-1       | 1.60e-3          | 1.43e+2     |
| All columns    | 100 GB     | CE        | 3.53e+5            | 3.40e+6          | 1.10e+2       | 2.64e+0          | 1.93e+4     |
| All columns    | 100 GB     | KHLL      | 6.25e+4            | 3.11e+4          | 2.00e+0       | 3.46e-3          | 2.63e+2     |
| All columns    | 1 TB       | CE        | (n.a.)             | (n.a.)           | (n.a.)        | (n.a.)           | (n.a.)      |
| All columns    | 1 TB       | KHLL      | 9.92e+5            | 2.37e+6          | 2.52e+0       | 3.50e-3          | 1.13e+4     |
| Single column  | 1 GB       | CES       | 7.23e+2            | 4.76e+3          | 8.07e-1       | 1.57e-2          | 5.76e+2     |
| Single column  | 100 GB     | CES       | 4.47e+4            | 1.30e+5          | 1.79e+0       | 2.35e-1          | 1.10e+3     |

Table 4.1: Performance metrics of the KHLL, CE, and CES MapReduce runs on subsets of the test dataset.
We ran KHLL, CE and CES on several subsets of the test dataset in a shared computational cluster at Google. These analyses were given computational resources at a priority class typically used for batch jobs. Measuring the performance metrics of jobs in a shared computational cluster is not straightforward, since any given machine can host multiple unrelated jobs with varying priority classes that consume machine resources in unpredictable ways. We therefore focused on the performance metrics which a typical customer of a commercial computational cluster (e.g., Amazon EC2, Google GCE) would pay for.
Table 4.1 shows the performance metrics of the MapReduce runs. As one can see, KHLL is consistently more efficient than CE across the various metrics. Performance differs by one to two orders of magnitude even for the relatively small datasets in our experiment. In fact, the CE MapReduce jobs for analyzing all data columns in a 1 TB dataset became too expensive to be completed in the shared computational cluster. Interestingly, on our test dataset, it is even more memory-efficient (though slightly slower) to compute the KHLL sketches of all data columns in a single MapReduce run than to analyze a single data column using CES. This performance disparity is critical in practice, as it is the difference between an analysis that is feasible to run and one that is too slow to be worthwhile.
In various production setups at Google, KHLL scales to analyze hundreds of datasets, each containing potentially trillions of rows and tens of thousands of data columns and measuring petabytes in size, producing millions of sketches in each run.
Accuracy of uniqueness distribution estimation
We measured the accuracy of the estimated uniqueness distribution using three publicly available datasets. The first two are taken from the Netflix Prize dataset, released in 2006, for which a significant fraction of users was shown to be reidentifiable [294]. We estimate the uniqueness distribution of (a) movies, and (b) tuples of movie and date. Note that we do not consider the entire list of movies (or, respectively, of all movie and date pairs) associated with individual pseudo-identifiers. We could easily analyze this, but it would be less interesting for our validation purposes, as most of these values would be unique. The third dataset is the 2010 US Census [96], from which we count the number of distinct individuals associated with a given (ZIP code, age) tuple. The corresponding uniqueness distribution gives an indication of the reidentifiability of these quasi-identifiers within the US population. As we will see, the three datasets present different uniqueness profiles, allowing us to test the accuracy of KHLL estimation in different situations.
We simulate the KHLL algorithm with parameters K = 2048 and M = 1024. We learned from our production settings that K = 2048 gives a good trade-off between precision and memory usage. M = 1024 was chosen as the smallest possible parameter of the HLL++ library available in BigQuery Standard SQL; we use M = 512 in our production pipelines, which gives a comparable degree of precision in ID cardinality estimation (4% vs. 3%). As HLL++ counts elements exactly at lower cardinalities, this difference has a negligible influence on our estimates. For each dataset, we compare the KHLL-based estimate to the true distribution, which is computed exactly (without approximation) using standard SQL operators.
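As a back-of-the-envelope check of these precision figures, assuming M denotes the number of HLL registers and using the standard HLL relative-error approximation of about 1.04/√M:

```latex
% Assumed: M is the number of HLL registers; standard HLL relative error ~ 1.04 / sqrt(M).
\begin{align*}
  M = 512:  \quad & \frac{1.04}{\sqrt{512}}  \approx 4.6\% \quad \text{(production setting)}, \\
  M = 1024: \quad & \frac{1.04}{\sqrt{1024}} \approx 3.3\% \quad \text{(BigQuery simulation)}.
\end{align*}
```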
Figure 4.5a plots the cumulative uniqueness distribution of movies in the Netflix Prize dataset. It allows an analyst to quickly answer the question: how much data do we lose if we only release the movies that have been rated by more than k users, for all possible values of k? The uniqueness of movies is low: the median number of associated users per movie is larger than 500. Figure 4.5b plots the cumulative uniqueness distribution of the tuples of movie and date. The typical uniqueness in this setting is high: over 80% of the (movie, date) tuples are associated with 10 or fewer unique IDs.
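One possible concrete reading of that question, sketched in Go: given a uniqueness histogram mapping an ID count to the number of column values with that count, compute the fraction of column values that would be suppressed for a threshold k. The function name and the choice to count column values (rather than rows) are ours.

```go
// Sketch of the cumulative-uniqueness question: what fraction of column
// values would be suppressed if we only release values associated with
// more than k distinct IDs?
package uniqueness

// FractionDropped returns, for a threshold k, the fraction of column values
// associated with at most k distinct IDs (i.e. the values we would suppress).
func FractionDropped(hist map[int]int, k int) float64 {
	total, dropped := 0, 0
	for idCount, numValues := range hist {
		total += numValues
		if idCount <= k {
			dropped += numValues
		}
	}
	if total == 0 {
		return 0
	}
	return float64(dropped) / float64(total)
}
```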
Figure 4.5c shows the cumulative uniqueness distribution of (ZIP code, age) tuples in the US census. Each individual in the census dataset appears in only a single row, unlike in the Netflix dataset, where ratings from the same user appear in several separate records. The uniqueness of (ZIP code, age) tuples is variable: a significant portion of the possible values is associated with only a few individuals, but many of the (ZIP code, age) tuples are associated with more than 100 individuals.
Across all three experiments, we observe that the KHLL-based estimate of the uniqueness distribution is accurate.
Accuracy of joinability and containment estimation
As described in Section 4.1.3, joinability is most interesting from a privacy perspective when a pseudonymous ID space becomes joinable with PII: specifically, when two columns C1 and C2 are joinable, and C1 uniquely identifies a pseudonymous ID space while C2 uniquely identifies PII. Three quantities are important for estimating this risk: the ratio of C1 values uniquely identifying an ID, the ratio of C2 values uniquely identifying an ID, and the containment of C1 in C2 (or of C2 in C1).
The estimate of the ratio of column values uniquely identifying an ID can be seen as an estimator of the parameter p, based on an observation of a binomial distribution of parameters K and p. It is well known that a binomial distribution of parameters K and p has a variance of Kp(1−p), so the estimator which divides this observation by K has a variance of p(1−p)/K, or a standard deviation of √(p(1−p)/K). This error is small and well understood, so we focus our experiments on estimating the containment metric, as defined in Definition 77.
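Spelling out the bound implied by this argument (with K = 2048 as assumed above):

```latex
% If X ~ Binomial(K, p) is the number of sampled column values uniquely
% identifying an ID, the ratio estimator X/K satisfies (worst case at p = 1/2):
\[
  \operatorname{Var}\!\left(\frac{X}{K}\right) = \frac{p(1-p)}{K},
  \qquad
  \sigma\!\left(\frac{X}{K}\right) = \sqrt{\frac{p(1-p)}{K}}
  \;\le\; \frac{1}{2\sqrt{K}} \;\approx\; 1.1\% \quad \text{for } K = 2048.
\]
```

At K = 2048 the worst-case standard deviation is thus around 1.1%, which is why the experiments below concentrate on containment.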
Using K = 2048 hashes, and assuming that C1 and C2 have the same cardinality, the estimate of containment falls within 5% of the true value over 90% of the time, and always stays within 10% of the true value. This holds regardless of the cardinality of the intersection of C1 and C2. Figure 4.7 shows the median as well as the 5th and 95th percentiles of the containment estimate, for several cardinalities of C1 and C2.
When C1 and C2 have different cardinalities, however, precision can suffer. Figure 4.8 shows the median as well as the 5th and 95th percentiles of the estimate of the containment of C1 in C2, where the true value is fixed at 50%, the cardinality of C1 is fixed at 100,000, and the cardinality of C2 varies between 50,000 and 2,000,000 (so the cardinality ratio |C2|/|C1| ranges from 0.5 to 20).
We can observe that the larger the cardinality ratio gets, the worse the precision becomes. This is expected: we compute |C1 ∩ C2| using the inclusion-exclusion principle, and the error of each cardinality estimate is proportional to the cardinality being estimated, so the estimation error of |C1 ∩ C2| is roughly proportional to the larger cardinality |C2|. Since the true value of |C1 ∩ C2| is roughly proportional to |C1|, the relative error of the containment estimate grows linearly with the cardinality ratio |C2|/|C1|. This is what we observe in practice.
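For reference, here is a minimal Go sketch of this inclusion-exclusion computation, assuming cardinality estimates for C1, C2 and their union are already available (for example from merged sketches). The clamping logic and names are ours.

```go
// Sketch of containment estimation via inclusion-exclusion, given cardinality
// estimates for C1, C2 and their union. The error of the intersection estimate
// inherits the error of the largest estimated cardinality, which is why
// precision degrades as |C2| / |C1| grows.
package containment

// EstimateContainment returns an estimate of |C1 ∩ C2| / |C1| using
// |C1 ∩ C2| = |C1| + |C2| - |C1 ∪ C2|.
func EstimateContainment(card1, card2, cardUnion float64) float64 {
	intersection := card1 + card2 - cardUnion
	if intersection < 0 {
		intersection = 0 // estimation noise can make the raw value negative
	}
	if card1 == 0 {
		return 0
	}
	c := intersection / card1
	if c > 1 {
		c = 1 // clamp: containment is a ratio in [0, 1]
	}
	return c
}
```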