From a KHLL sketch of , one can estimate both the cardinality of the column and the number of unique IDs associated with individual column values i.e., the uniqueness distribution. The latter allows us to efficiently estimate the proportion of column values that are reidentifying as well as statistics such as the min, max, and median uniqueness.
One can plot the uniqueness distribution to visualize the percentage of column values or IDs meeting some -anonymity thresholds. Figure 4.2 is an example histogram of how many column values are associated with each number of unique IDs. Tweaked slightly, Figure 4.3 plots the cumulative percentage of values not meeting varying -anonymity thresholds. A “more anonymous” dataset exhibits the curve on the left as most of the column values are expected to be associated with a large number of IDs. Conversely, a highly reidentifying dataset will exhibit the curve on the right.
The cumulative distribution is particularly useful as it estimates how much data will be lost when applying a -anonymity threshold. This allows one to determine a threshold that preserves some data utility while ensuring a reasonable privacy protection, especially when other risk mitigation measures are in place such as access control, audit trails, limited data lifetime or noising of released statistics. We can see that for the left curve, one can choose a high threshold with low data loss, while on the right even a moderate threshold will result in dropping a large portion of the dataset.
Given the efficiency of KHLL analysis, one could set up periodic analyses to assure that the uniqueness distribution does not change over time, to monitor, for example, that no more than X% of values should have less than -anonymity.
In addition to analyzing the uniqueness distribution of individual columns, KHLL can be used to analyze any combinations of columns, including a complete row. This can be done, for example, by simply concatenating the values of multiple columns, and treating the combination as a new column. The reidentifiability risk will grow as the number of dimensions increases. For example with movie recommendations, the combination of movie name, rating and date of recommendation can be highly unique .
Using a KHLL sketching algorithm with and 512 HLL buckets would give us an estimated error rate of 2% for the value cardinality and about 4% error rate for ID cardinalities. A higher error rate of ID cardinality estimates is tolerable given that we are more concerned about column values that associate with low number of IDs. In those cases, the algorithm will use HLL++ in sparse representation mode, which gives good estimates with minimal errors. Note that the trade-off between accuracy and efficiency is configurable.
Meanwhile, as a KHLL sketch effectively maintains uniform random samples of column values, the estimated distribution does come with sampling bias. Specifically, it is possible that the estimated distribution may miss some outlier column values that associate with a large or small number of IDs. One mitigation is to run multiple analyses using different hash functions (with random seeds) to reduce the chance of outliers being consistently left out of the analysis.
Estimating the joinability of two datasets through a pair of columns say and from their KHLL sketches is also straightforward. Recall that containment of in is given by . To compute this, we need only the cardinality of and , plus the cardinality of their intersection.
Using the inclusion-exclusion principle, we estimate . The union of and can be easily estimated by merging the KHLL sketch of and that of . An alternative approach for computing the set intersection is by identifying the hash values that exist in both the KHLL sketches of and and computing the minmax of the smallest hashes on both sides. This would allow us to quantify the error rate of individual set intersections directly, but the error rate will vary as the minmax of the hashes will vary for different pairs of columns. We prefer the simpler inclusion-exclusion-based approach.
In addition to determining the joinability of columns, KHLL sketches can provide insights into the potential joinability of identities in two datasets. For example, pseudonymous IDs in one dataset could be reidentified if the dataset is joinable with another dataset containing Personally Identifiable Information (PII), and if the join keys are associated with one pseudonymous ID and PII respectively (see Figure 4.4).
Specifically, given a pair of sketches of two columns and in datasets and respectively, we could estimate:
- whether is highly contained in or vice-vers;
- whether uniquely identifies in ;
- whether uniquely identifies in .
KHLL allows us to estimate all the above conditions. Specifically, the first level of KHLL which is essentially a KMV sketch can be used to estimate the cardinality of and , the cardinality of the intersection, and thus containment. Meanwhile, the second level of KHLL, consisting of HLL sketches, gives the uniqueness distribution and thus the ratio of uniquely identifying values easily.
Estimating the joinability of large datasets is a hard problem. Naively estimating the pairwise joinability of datasets involves a quadratic number of full-table scans. The number of scans needed can increase quickly, especially for large datasets with many data columns.
However, as shown in Figure 4.5, KHLL allows us to estimate joinability from the sketches alone. This is a huge saving: it allows us to scan each dataset only once, and then pairwise compare the sketches rather than original datasets.
The sketching process can be agnostic to the underlying data when the schema of the datasets are well defined. For example when using protocol buffer messages  we can analyze new datasets without any need to configure information about columns, especially when the semantic types of data columns are properly annotated .
The sketching process can also be distributed. The respective data owners do not need to grant a central service access to the raw data, but simply to agree on the sketching method and to upload the sketches to a central repository. The sketches containing the hash values of potentially sensitive data including user IDs should still be treated with care such as by limiting the access and ensuring a short lifetime.
Using sketches for joinability analysis comes with some risks, both of reporting columns are potential join keys even though they do not present an actual risk of joinability, and of missing potential join keys.
First, note that the containment metric is agnostic to the semantics of the underlying data. Specifically, containment (or Jaccard) does not distinguish between columns that use a similar range of values but are semantically different. As an example, we could falsely determine port_number and seconds_till_midnight columns to be joinable since they both have an extensive overlap in the integer range of : this can induce false positives. The rate of false positives can be mitigated by requiring a large cardinality threshold on the potential join keys.
Second, false negatives are also possible, and can arise for multiple reasons. For example, the containment metric will fail to detect similar columns that have been encoded differently (e.g., base64 versus raw string) or have undergone some slight transformations (e.g., a microsecond timestamp versus the coarsened millisecond version). This is a hard problem in practice. The system could potentially support some common transformations or encodings when the semantic type of a data column is known, but there is no way to handle all possibilities.
The containment metric can also be unreliable when set sizes are highly skewed. When the expected error rate of a set is larger than the cardinality of a much smaller set, the estimate for the set intersection computed using the inclusion- exclusion principle will be unreliable. One could potentially complement the containment metric with some other similarity scores like the similarity between the frequency distribution of the potential join keys.
Importantly, while KHLL can evaluate the pairwise joinability of datasets based on individual columns, estimating the joinability of datasets through arbitrary combinations of columns remains practically infeasible given that intractable number of potential column combinations. One could however systematically approach this by testing the joinability between combinations of high-risk columns, for example, involving only those that have high uniqueness.
Finally, pairwise joinability analysis does not readily detect multi-hop joinability. For example, when a dataset is joinable with dataset , and is joinable with dataset through two different pairs of join keys, we will not detect that is joinable to . Such multi-hop joinability analysis could be similarly estimated using clustering and graph traversal algorithms such as label propagation .