Lowering the cost of anonymization

a PhD thesis

3.3.1  Preliminaries

Many data analysis applications must count the number of distinct records in a large stream with repetitions. These applications include network monitoring [142], online analytical processing [309, 346], query processing, and database management [387]. A variety of algorithms, which we call cardinality estimators, have been developed to do this efficiently, with a small memory footprint. These algorithms include PCSA [154], LogLog [120], and HyperLogLog [152]. They can all be parallelized and implemented using frameworks like MapReduce [95]. Indeed, their internal memory state, called a sketch, can be saved, and sketches from different data shards can be aggregated without information loss.

Using cardinality estimators, data owners can compute and store sketches over fine-grained data ranges, for example, daily. Data analysts or tools can then merge (or aggregate) existing sketches, which enables the subsequent estimation of the total number of distinct records over arbitrary time ranges. This can be done without re-computing the entire sketch, or even accessing the original data.

Among cardinality estimators, HyperLogLog [152] and its variant HyperLogLog++ [195] are widely used in practical data processing and analysis tasks. Implementations exist for many widely used frameworks, including Apache Spark [18], Google BigQuery [48], Microsoft SQL Server [203], and PostgreSQL [318]. The data these programs process is often sensitive. For example, they might estimate the number of distinct IP addresses that connect to a server [369], the number of distinct users that perform a particular action [21], or the number of distinct devices that appeared in a physical location [286]. To illustrate this point, we present an example of a location-based service, which we will use throughout this section.

Example 10. A location-based service gathers data about the places visited by the service’s users. For each place and day, the service stores a sketch counting the identifiers of the users who visited the place that day. This allows the service’s owners to compute useful statistics. For example, a data analyst can merge the sketches corresponding to the restaurants in a neighborhood over a month, and estimate how many distinct individuals visited a given restaurant during that month. The cost of such an analysis is proportional to the number of aggregated sketches. If the queries used raw data instead, then each query would require a pass over the entire dataset, which would be much more costly.

Note that the type of analysis to be carried out by the data analyst may not be known in advance. The data analysts should be able to aggregate arbitrarily many sketches, across arbitrary dimensions such as time or space. The relative precision of cardinality estimation should not degrade as sketches are aggregated.

Fine-grained location data is inherently sensitive: it is extremely reidentifiable [171], and knowing an individual’s location can reveal private information. For example, it can reveal medical conditions (from visits to specialized clinics), financial information (from frequent visits to short-term loan shops), relationships (from regular co-presence), sexual orientation (from visits to LGBT community spaces), etc. Thus, knowing where a person has been reveals sensitive information about them.

As the above example suggests, an organization storing and processing location data should implement risk mitigation techniques, such as encryption, access controls, and access audits. The question arises: how should the sketches be protected? Could they be considered sufficiently aggregated to warrant weaker security requirements than the raw data?

To answer this question, we model a setting where a data owner stores sketches for cardinality estimation in a database. The attacker can access some of the stored sketches and any user statistics published, but not the raw data itself. This attacker model captures the insider risk associated with personal data collections where insiders of the service provider could gain direct access to the sketch database. In this paper, we use this insider risk scenario as the default attacker model when we refer to the attacker. In the discussion, we also consider a weaker attacker model, modeling an external attacker that accesses sketches via an analytics service provided by the data owner. In both cases, we assume that the attacker knows the cardinality estimator’s internals.

The attacker’s goal is to infer whether some user is in the raw data used to compute one of the accessible sketches. That is, they pick a target user, They choose a sketch built from a stream of users, and must guess whether their target is in this stream. The attacker has some prior knowledge of whether their target is in the stream, and examining the sketch gives them a posterior knowledge. The increase from prior to posterior knowledge determines their knowledge gain.

Consider Example 10. The attacker could be an employee trying to determine whether their partner visited a certain restaurant on a given day, or saw a medical practitioner. The attacker might initially have some suspicion about this (the prior knowledge), and looking at the sketches might increase this suspicion. A small, bounded difference of this suspicion might be deemed to be acceptable, but we do not want the attacker to be able to increase their knowledge too much.

Thus, our goal is to understand whether a realistic attacker could gain significant knowledge about a target by looking at a HyperLogLog sketch; or any other output from a different cardinality estimator.

Related work

Prior work on privacy-preserving cardinality estimators has been primarily focused on distributed user counting, for example to compute user statistics for anonymity networks like Tor. Each party is usually assumed to hold a set of data, or a sketch built from , and the goal is to compute the cardinality of , without allowing the party to get too much information on the sets with . The attacker model presumes honest-but-curious adversaries.

In [369], the authors propose a noise-adding mechanism for use in such a distributed context. In [279], each party encrypts their sketch, and sends it encrypted to a tally process, which aggregates them using homomorphic encryption. In [21], the authors propose a multiparty computation protocol based on Bloom filters to estimate cardinality without the need for homomorphic encryption, and in [136], this approach is shown to be vulnerable to attacks, and a more secure variant of the protocol is presented.

Some cardinality estimation protocols who achieve differential privacy have been proposed, based either on Bloom filters [9], FM-sketches [374]. However, these sketches cannot be aggregated without significant accuracy loss, or using secure aggregation methods in a multiparty setting; this last option is explored in [77] using LogLog sketches. Another line of research focuses on counting distinct elements in a scalable way, and making only the output differentially private, not the sketches itself [72].

Our attacker model, based on insider risk, is fundamentally different to these models: the same party is assumed to have access to a large number of sketches; they must be able to aggregate them and get good estimates. We claim that cardinality estimators must necessarily make a hard choice: the sketches can be made private, or they can be aggregated without catastrophic accuracy degradation, but both properties are mutually exclusive.

Our privacy definition for cardinality estimators is a variant of differential privacy, presented in Definition 6 (Section 2.1.6). It uses a data-generating probability distribution to capture the attacker’s uncertainty, in a similar way of variants introduced in Section 2.2.5. We explain in Section 3.3.1 why we need a custom privacy definition for our setup and how it relates to differential privacy and Pufferfish privacy (Definition 35, in Section 2.2.5). Some deterministic algorithms, which do not add noise, have been shown to preserve privacy under this assumption [36, 46, 176].

Our setting has some superficial similarities to the local differential privacy model, where the data aggregator is assumed to be the attacker. This model is often used to develop privacy-preserving systems to gather statistics [38, 40, 141, 146]. However, the setting and constraints of our work differ fundamentally from these protocols. In local differential privacy, each individual sends data to the server only once, and there is no need for deduplication or intermediate data storage. In our work, the need for intermediate sketches and unique counting leads to the impossibility result.

Finally, some work has been done on the security properties of HyperLogLog sketches, in particular their resistance to adversarial manipulation [330], this direction of research is largely orthogonal to the one studied in this work.

Cardinality estimators

In this section, we formally define cardinality estimators, and prove some of their basic properties. The notations introduced below, and used throughout this section, are summarized on page 35.

Cardinality estimators estimate the number of distinct records in a stream. The internal state of a cardinality estimator is called a sketch. Given a sketch, one can estimate the number of distinct records that have been added to it (the cardinality).

Cardinality estimator sketches can also be aggregated: two sketches can be merged to produce another sketch, from which we can estimate the total number of distinct records in the given sketches. This aggregation property makes sketch computation and aggregation embarrassingly parallel. The order and the number of aggregation steps do not change the final result, so cardinality estimation can be parallelized using frameworks like MapReduce.

We now formalize the concept of a cardinality estimator. The records of the multiset are assumed to belong to a large but finite set .

Definition 64. A deterministic cardinality estimator is a tuple , where:

  • is the empty sketch;
  • is the deterministic operation that adds the record to the sketch and returns an updated sketch;
  • estimates the number of distinct records that have been added to the sketch.

Furthermore, the operation must satisfy the following axioms for all sketches and records .

Idempotence: .
Commutativity: .

These axioms state that ignores duplicates and that the order in which records are added is immaterial. Ignoring duplicates is a natural requirement for cardinality estimators. Ignoring order is required for this operation to be used in frameworks like MapReduce, or open-source equivalents like Hadoop or Apache Beam. Since handling large-scale datasets typically requires using such frameworks, we consider commutativity to be a hard requirement for cardinality estimators.

We denote by the sketch obtained by adding successively to , and we denote by the set of all sketches that can be obtained inductively from by adding records from any subset of in any order. Note that is finite and of cardinality at most . Order and multiplicity do not influence sketches: we denote by the sketch obtained by adding all records of a set (in any order) to .

Later, we give several examples of deterministic cardinality estimators, all of which satisfy Definition 64.

Lemma 2. for all .

Proof. This follows directly from the idempotence and commutativity properties. □

In practice, cardinality estimators also have a operation. We do not explicitly require the existence of this operation, since ’s idempotence and commutativity ensure its existence as follows.

Definition 65. To merge two sketches and , choose some such that . We define to be the sketch obtained after adding all records of successively to .

Note that this construction is not computationally tractable in general, even though in practical scenarios, the operation must be fast. This efficiency requirement is not necessary for any of our results, so we do not explicitly require it either.

Lemma 3. The operation in Definition 65 is well-defined, i.e., it does not depend on the choice of . Furthermore, the merge operation is a commutative, idempotent monoid on with as neutral record.

Proof. Let and be two sketches. If , we denote by the result of adding records of successively to :

Let and be two sets such that , and let be a set such as . Then . Using the two properties of the add function, we get . The same reasoning leads to : the merge function does not depend on the choice of .

In addition, note that . Thus, commutativity, associativity, idempotence, and neutrality follow directly from the same properties of set union. □

Note that these properties of the merge operation are important for cardinality estimators: when aggregating different sketches, we must ensure that the result is the same no matter in which order the sketches are aggregated.

Existing cardinality estimators also satisfy efficiency requirements: they have a low memory footprint, and and run in constant time. These additional properties are not needed for our results, so we omit them in our definition.

We now define precise cardinality estimators.

Definition 66. Let be a set of cardinality taken uniformly at random in . The quality of a cardinality estimation algorithm is given by two metrics:

Its bias
Its variance .

Typically, cardinality estimators used for practical applications are asymptotically unbiased: . In the rest of this work, we assume that all cardinality estimators we consider are perfectly unbiased, so . Cardinality estimators are often compared by their relative standard error (RSE), which is given by:

A cardinality estimator is said to be precise if it is asymptotically unbiased and its relative standard error is bounded by a constant. In practice, we want the relative standard error to be less than a reasonably small constant, for example less than 10%.

Let us give a few examples of cardinality estimators, with their memory usage in bits () and their relative standard error. As a first step, they all apply a hash function to the user identifiers. Conceptually, this step assigns to all users probabilistically a random bitstring of length 32 or 64 bits. Since hashing is deterministic, all occurrences of a user are mapped to the same bitstring.

  • K-Minimum Values [29], with parameter , maintains a list of the smallest hashes that have been added to the sketch. With -bit hashes, it has a memory usage of , and its unbiased estimator has a RSE of approximately  [45].
  • Probabilistic Counting with Stochastic Averaging [154] (PCSA, also known as FM-sketches) maintains a list of bit arrays. When an record is added to an FM-sketch, its hash is split into two parts. The first part determines which bit array is modified. The second part determines which bit is flipped to , depending on the number of consecutive zeroes at the beginning. With registers and -bit hashes, its memory usage is , and its RSE is approximately .
  • LogLog [120] maintains a tuple of registers. Like PCSA, it uses the first part of the hash to pick which register to modify. Then, each registers stores the maximum number of consecutive zeroes observed so far. Using registers and a -bit hash, its memory usage is bits, and its standard error is approximately .
  • HyperLogLog [195] has the same structure and add operation as LogLog, only its estimate operation is different. Using registers and a -bit hash, it has a memory usage of bits, and its standard error is approximately .
  • Bloom filters can also be used for cardinality estimation [310]. However, we could not find an expression of the standard error for a given memory usage in the literature.

All these cardinality estimators have null or negligible () bias. Thus, their variance is equal to their mean squared standard error. So the first four are precise, whereas we do not know if Bloom filters are.

All examples above are deterministic cardinality estimators. For them and other deterministic cardinality estimators, bias and variance only come from the randomness in the algorithm’s inputs. We now define probabilistic cardinality estimators. Intuitively, these are algorithms that retain all the useful properties of deterministic cardinality estimators, but may flip coins during computation. We denote by the set of distributions over .

Definition 67. A probabilistic cardinality estimator is a tuple , where

  • is the empty sketch;
  • is the probabilistic operation that adds the record to the sketch and returns an updated sketch;
  • is the probabilistic operation that merges two sketches and ; and
  • estimates the number of unique records that have been added to the sketch.

Both the and operations can be extended to distributions of sketches. For a distribution of sketches and an record , denotes the distribution such that:

For two distributions of sketches and , denotes the distribution such that:

We want probabilistic cardinality estimators to have the same high-level properties as deterministic cardinality estimators: idempotence, commutativity, and the existence of a well-behaved operation. In the deterministic case, the idempotence and commutativity of the operation was sufficient to show the existence of a operation with the desired properties. In the probabilistic case, this no longer holds. Instead, we require the following two properties.

  • For a set , let denote the sketch distribution obtained when adding records of successively into . The mapping from to must be well-defined: it must be independent of the order in which we add records, and possible repetitions. This requirement encompasses both idempotence and commutativity.
  • For two subsets and of , we require that:

These requirements encompass the results of Lemma 3.

These properties, like in the deterministic case, are very strong. They impose that an arbitrary number of sketches can be aggregated without losing accuracy during the aggregation process. This requirement is however realistic in many practical contexts, where the same sketches are used for fine-grained analysis and for large-scale cardinality estimation. If this requirement is relaxed, and the cardinality estimator is allowed to return imprecise results when merging sketches, our negative results do not hold.

For example, Tschorsch and Scheuermann proposed a cardinality estimation scheme [369] which adds noise to sketches to make them satisfy privacy guarantees in a distributed context. However, their algorithm is not a probabilistic cardinality estimator according to our definition: noisy sketches can no longer be aggregated. Indeed, [369] explains that “combining many perturbed sketches quickly drives [noise] to exceedingly high values.” In our setting, aggregation is crucial, so we do not further consider their algorithm.

Privacy for cardinality estimators

In this section, we describe our system and attacker model, and present the privacy definition we use in our setting.


Figure 3.7: System and attacker model

Figure 3.7 shows our system and attacker model. The service provider collects sensitive data from many users over a long time span. The raw data is stored in a database. Over shorter time periods (e.g. an hour, a day, or a week), a cardinality estimator aggregates all data into a sketch. Sketches of all time periods are stored in a sketch database. Sketches from different times are aggregated into sketches of longer time spans, which are also stored in the database. Estimators compute user statistics from the sketches in the database, which are published. The service provider may also publish the sketches via an analytics service for other parties.

The attacker knows all algorithms used (those for sketching, aggregation, and estimation, including their configuration parameters such as the hash function and the number of buckets) and has access to the published statistics and the analytics service. They control a small fraction of the users that produce user data. However, they can neither observe nor change the data of the other users. They also do not have access to the database containing the raw data.

In this work, we mainly consider an internal attacker who has access to the sketch database. For this internal attacker, the goal is to discover whether their target belongs to a given sketch. We then discuss how our results extend to weaker external attackers, which can only use the analytics service. We will see that for our main results, the attacker only requires access to one sketch. The possibility to use multiple sketches will only come up when discussing mitigations strategies in Section 3.3.4.

Privacy definition

We now present the privacy definition used in our main result. Given the system and attacker just described, our definition captures the impossibility for the attacker to gain significant positive knowledge about a given target. This privacy requirement is very weak: reasonable definitions used for practical algorithms would likely be stronger. Working with a weak definition strengthens our negative result: if a cardinality estimator satisfying our weak privacy definition cannot be precise, then this is also the case for cardinality estimators satisfying a stronger definition.

Definition 68. A cardinality estimator satisfies -sketch privacy above cardinality if for every , , and , the following inequality holds:

Here, the probability is taken over:

  • a uniformly chosen set , where is the set of all possible subsets of cardinality ; and
  • the coin flips of the algorithm, for probabilistic cardinality estimators.

This notion is a good example of the conclusion of Section 2.2: multiple variants and extensions of differential privacy are combined to find a definition that fits exactly to a specific use case. It has three important characteristics.

No background knowledge: this definition assumes an attacker that has total uncertainty over the data: no record is known, and the input distribution is assumed to be completely uniform. The absence of any partial knowledge makes it a special case of definitions from Section 2.2.5; in Section 3.1.3, we saw that in this specific case, APK-DP and PPK-DP are equivalent. In practice, a realistic attacker might have more information about the data, so a stronger privacy definition would model this prior knowledge by a larger family of probability distributions. More precisely, since the records of are uniformly distributed in , the prior knowledge from the attacker is exactly . A realistic attacker would likely have a larger prior knowledge about their target. However, any reasonable definition of privacy would also include the case where the attacker does not have more information on their target than on other users and, as such, would be stronger than -sketch privacy.
Asymmetry: we only consider the positive information gain by the attacker. Following a reasoning to the prior-posterior bound property of DP shown in Section 2.1.6, -sketch privacy imposes an upper bound on the probability that notion of knowledge gain is similar to the one in given the observation , but no lower bound. In other words, the attacker is allowed to deduce with absolute certainty that . In practice, both positive and negative information gains may present a privacy risk. In our running example (see Example 10), deducing that a user did not spend the night at their place of residence could be problematic.
Minimum cardinality: we only require a bound on the information gain for cardinalities larger than a parameter . In practice, could represent a threshold over which it is considered safe to publish sketches or to relax data protection requirements. Choosing a small (like ) strengthens the privacy definition, while choosing a large (like ) limits the utility of the data, as many smaller sketches cannot be published.

The first characteristic is a variant of DP alongside our dimension B (Section 2.2.5), while the former two are reducing the scope of the definition in a similar way than variants from dimension N (Section 2.2.3). Later, in Section 3.3.3, we consider even weaker privation notions, by combining our definition with variants from dimensions Q (Section 2.2.2) and V (Section 2.2.4).

-sketch privacy is trivially strictly weaker than differential privacy, and can almost be expressed as an instance of pufferfish privacy (Definition 35), except its asymmetry. It is immediate to show that -sketch privacy satisfies both privacy axioms (Definition 9).

We emphasize again that these characteristics, which result in a very weak definition, make our notion of privacy well-suited to proving negative results. If satisfying our definition is impossible for an accurate cardinality estimator, then a stronger definition would similarly be impossible to satisfy. For example, any reasonable choice of distributions used to represent the prior knowledge of the attacker would include the uniform distribution.

All opinions here are my own, not my employers.
I'm always glad to get feedback! If you'd like to contact me, please do so via e-mail (se.niatnofsed@neimad) or Twitter (@TedOnPrivacy).