The scale of data and systems in large organizations demands an efficient approach for reidentifiability and joinability analysis. KHyperLogLog (KHLL) innovates on approximate counting techniques to estimate the uniqueness distribution and pairwise containment of very large databases at scale.
The efficiency and scalability of risk analysis using KHLL represent a practical and useful tool for large organizations in protecting user privacy. It provides an objective, quantifiable and replicable measure of reidentifability of datasets. The KHLL algorithm also presents a novel and practical approach for tackling the joinability risks of datasets and ID spaces. The efficiency of KHLL further enables periodic analyses of complex production systems that evolve over time.
Using the frequency of (combinations of) column values as a proxy to measure reidentifiability is not new: this is the core intuition behind -anonymity, which we introduced in Section 2.1.1. Rather than just estimating -anonymity, the KHLL algorithm estimates the entire uniqueness distribution, which is useful for evaluating the impact to data loss with -anonymization. The reidentifiability and joinability risks as estimated using KHLL can serve to help determine a suitable anonymization strategy, particularly when considering the different contexts and use cases of anonymization.
The problem of gathering metadata and organizing datasets in large organizations has been described on several occasions. In , the authors detail a search engine for datasets, which gathers metadata at dataset level; while in , the authors explain how to detect and propagate column-level semantic annotations using assisted labeling and data-flow analysis. The tools we develop to automatically detect joinability of datasets could be used for a similar purpose. Inconsistent semantic annotations between data columns with high similarity scores (containment or Jaccard) can be automatically detected with the correct annotations propagated accordingly.
Cardinality estimators have been the subject of a significant body of research [45, 152, 195, 394]. The techniques we propose are largely independent of the particular algorithm chosen to approximate ID cardinality. Beyond privacy, estimating value distribution is also essential to database query optimization and data stream processing. Most research in this domain focuses on sketching values of high frequency (i.e. statistical heavy hitters or top-k queries). The closest analogue of KHLL was presented by Cormode et al. ; it combines the Count-Min sketch  and the LogLog sketch  for value distribution estimation. However, the Count-Min algorithm biases towards values of high frequency, which is not helpful for evaluating the impact of -anonymity given the typical choices of are much smaller than the frequency of heavy hitters.
MinHash  and SimHash  are two popular algorithms for estimating the Jaccard similarity of datasets. The KHLL algorithm leverages the Minimum Values in the first level of the two-level sketch for estimating Jaccard and containment scores, using a similar log -space memory footprint. A possible improvement might be to adapt from HyperMinHash , capable of estimating Jaccard using only -space. Yet, given that the bulk of memory usage by KHLL actually comes from the second level of the sketch for estimating the uniqueness distribution, we have not explored the feasibility of adapting KHLL to have a HyperMinHash-like data structure in the first level.
Finally, despite the extensive research for detecting data similarity, we have not seen any prior work tackling the problem of automatically detecting possible joinability between different ID spaces across datasets.
While KHyperLogLog is memory efficient, it still requires a linear pass over the data, which in practice is the main performance bottleneck. Techniques to produce sketches suitable for joinability analysis without scanning the entire dataset would be helpful, for example by sampling input records. It would also be interesting to pursue further innovative uses of cardinality estimation techniques in privacy enhancing technologies, for example for automatically adding semantic labels to datasets, detecting violations of retention policies, or for suggesting effective anonymization strategies.