4.2.4 Accuracy ✿
In this section, we explore the accuracy of our system: we run numerical experiments and provide analytical reasoning about the relationship between accuracy and various parameters.
Experimental accuracy
We assess accuracy experimentally using TPC-H [87], an industry-standard SQL benchmark. The TPC-H benchmark contains a dataset schema and queries that are similar to those used by analysts of personal data at real-world organizations. In addition, the queries contain interesting features such as joins and a variety of aggregations. We generate a TPC-H dataset with the default scale factor. We treat suppliers or customers as “users”, as appropriate. Our metric for accuracy is median relative error, the same one used in [209]; a smaller median relative error corresponds to higher utility.
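Median relative error over repeated noisy runs can be computed as in the following minimal sketch (the function name and inputs are illustrative, not part of our implementation):

import numpy as np

def median_relative_error(noisy_results, true_value):
    # Median over runs of |noisy - true| / |true|; smaller means higher utility.
    noisy = np.asarray(noisy_results, dtype=float)
    return float(np.median(np.abs(noisy - true_value) / abs(true_value)))

# Example: median_relative_error(results_of_many_runs, exact_query_answer)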
We compute the median relative error of 1,000,000 runs of our model over TPC-H Query 1 using three different aggregation functions; the results are listed in Table 4.4. We compare our results to 1,000,000 runs of Flex over the same query, and to 10,000 runs (due to performance considerations) of PINQ over the same query. To present a fair comparison, we disabled τ-thresholding and compared only one result record to remove the need for stability bounding. In addition, each run of the experiment was performed using a function of fixed sensitivity, controlled by supplying the function with a fixed lower bound and an upper bound equal to the value in the Q1 column of Table 4.4. The bounds were fixed to minimize accuracy loss from contribution clamping.
Function | Q1 | Our model | PINQ | Flex
COUNT(*) | | | |
AVG(l_extendedprice) (1) | | | |
MEDIAN(l_extendedprice) (1) | | | |
COUNT(*) (2) | | | |
(1) Unsupported functions
(2) Intentionally incorrect sensitivity to demonstrate contribution bounding
For our experiments with PINQ and Flex, we also set the sensitivity to the previously determined Q1 value listed in Table 4.4. The results are close to our model’s results, but because neither PINQ nor Flex can enforce contribution bounds for datasets with multiple contributions per user, an incorrectly set sensitivity can result in query results that are not differentially private. Such incorrectly set bounds can be seen in the experiments of Johnson et al. [209] and in McSherry’s analysis [276], as well as in the last row of Table 4.4, where PINQ and Flex report errors far below what is required to satisfy the ε-DP predicate.
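To illustrate the issue, consider a back-of-the-envelope calculation (the per-user row count $k$ is a hypothetical parameter, not a value from the experiment). For a COUNT query in which a single user may contribute up to $k$ rows, the row-level and user-level sensitivities differ:
$$\Delta_{\mathrm{row}} = 1, \qquad \Delta_{\mathrm{user}} = k,$$
so user-level ε-DP requires Laplace noise of scale $k/\varepsilon$ rather than $1/\varepsilon$. Calibrating the noise to the row-level sensitivity underestimates the required noise by a factor of $k$, which is consistent with the unusually low errors reported by PINQ and Flex in the last row of Table 4.4.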
With correctly set sensitivity bounds, our model’s results are comparable to PINQ’s results for count and average. Implementation differences in our median function mean that our error is lower by a factor of 2. Both PINQ and our model outperform Flex’s result for count by around an order of magnitude. We do not report errors for average and median for Flex, because Flex does not support those functions.
We also ran our system over a selection of TPC-H queries containing joins; the experimental results are presented in Table 4.5. As before, we report the median relative error of 1,000,000 runs for each query. We also report the impact of τ-thresholding (the ratio of suppressed records); recall that our model satisfies (ε, δ)-DP. δ was set as a function of n, the number of distinct users in the underlying dataset: either customers or suppliers, depending on the query.
Query | Q | Experimental error | τ-thresholding rate
Q4 | | |
Q13 | | |
Q16 (1) | | |
Q21 (1) | | |
(1) Results uninterpretable due to high levels of τ-thresholding
Q4 represents how our system behaves when very little τ-thresholding occurs. Q16 and Q21 demonstrate the opposite: both queries exhibit a very large error that differs from the theoretical error, because most partitions are removed by the threshold due to their small user counts. Indeed, this is by design: as partition user counts approach 1, the τ-thresholding ratio approaches 1. Finally, Q13 represents a more typical result: a query containing mostly medium-user-count partitions with a long tail of lower-count partitions. A moderate amount of τ-thresholding occurs, which increases error compared to Q4, but the results are still quite accurate.
Impact of parameters on utility
In this section we explore the relationship between utility and various parameters, which must be adjusted to balance privacy and utility [13].
First, let us discuss the effect of ε on the accuracy of the query results. Note that ε is inversely proportional to the scale parameter of the Laplace noise added by our differentially private aggregators. Hence, a decrease in ε leads to an increase in noise, and thus a decrease in utility.
Proposition 30. The median error from Laplace noise of scale $\lambda$ is $\lambda \ln 2$.
Proof. Let $X \sim \mathrm{Lap}(\lambda)$, and let $m$ be the median noise magnitude, i.e. the median of $|X|$. By symmetry, we have:
$$\Pr\left[|X| > m\right] = 2\,\Pr\left[X > m\right] = e^{-m/\lambda}$$
and thus:
$$e^{-m/\lambda} = \frac{1}{2} \iff m = \lambda \ln 2.$$
□
In particular, an aggregation of sensitivity $\Delta$ released with budget ε uses noise of scale $\lambda = \Delta/\varepsilon$, so its median error is $\Delta \ln(2)/\varepsilon$, inversely proportional to ε.
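As a quick sanity check of Proposition 30, the following minimal simulation sketch (using NumPy; the scale value is an arbitrary illustration) compares the empirical median noise magnitude to $\lambda \ln 2$:

import numpy as np

# Check that the median magnitude of Laplace(scale) noise is scale * ln(2).
rng = np.random.default_rng(0)
scale = 2.0  # illustrative value of lambda (e.g. sensitivity / epsilon)
noise = rng.laplace(loc=0.0, scale=scale, size=1_000_000)
print(np.median(np.abs(noise)))  # empirical median error, approximately 1.386
print(scale * np.log(2))         # theoretical value, lambda * ln(2) = 1.386...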
Similarly, the effect of the number of aggregations is straightforward. When a single query contains $m$ aggregations, the privacy budget is split equally among them: each aggregation satisfies $(\varepsilon/m)$-differential privacy, so the median error of each aggregation grows linearly with the number of aggregations.
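Concretely, combining this with Proposition 30 (and assuming each aggregation has sensitivity $\Delta$ and uses the Laplace mechanism), splitting the budget across $m$ aggregations multiplies each aggregation's noise scale, and hence its median error, by $m$:
$$\lambda = \frac{\Delta}{\varepsilon/m} = \frac{m\Delta}{\varepsilon}, \qquad \text{median error} = \frac{m\,\Delta \ln 2}{\varepsilon}.$$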
Changing the value of ε also impacts partition selection: recall the threshold τ defined in Theorem 12. In particular, if all other parameters are fixed, τ scales roughly inversely with ε: multiplying ε by two means that we will be able to keep partitions with approximately half as many users.
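This inverse scaling can be illustrated with a simplified, single-partition version of the threshold (a hedged approximation for illustration only, not the exact formula of Theorem 12): pick τ so that a partition containing a single user passes the noisy check with probability at most δ.

import math

# Simplified illustration (not the exact Theorem 12 threshold): choose tau so that
# a single-user partition passes "noisy user count >= tau" with probability <= delta,
# where the noise is Laplace of scale 1/epsilon:
#   P[1 + Lap(1/epsilon) >= tau] = 0.5 * exp(-epsilon * (tau - 1)) <= delta.
def simplified_threshold(epsilon: float, delta: float) -> float:
    return 1.0 + math.log(1.0 / (2.0 * delta)) / epsilon

for eps in (0.5, 1.0, 2.0):
    print(eps, simplified_threshold(eps, delta=1e-9))
# Doubling epsilon roughly halves the threshold (up to the additive 1).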
What is the effect of varying δ for a fixed ε? This causes the threshold τ to change, which changes the number of records dropped due to thresholding. Similarly, changing the contribution bound modifies the number of records dropped due to contribution bounding. We first perform experiments on TPC-H Query 13 with a fixed ε and varying δ to quantify the impact on the partitions returned; Figure 4.10 displays the results. The figure shows that as δ increases exponentially, the proportion of partitions thresholded decreases somewhat linearly.
Next, we analyze the effect of the contribution bound, denoted $C$ below, for a specific artificial query. Consider the following query after rewriting.
SELECT ANON_COUNT(*)
FROM (SELECT uid, ROW_NUMBER() AS rn FROM Table GROUP BY uid, rn)
TABLESAMPLE RESERVOIR (C ROWS PARTITION BY uid)
Suppose that there are $N$ users in the database. Suppose each user appears in a number of partitions distributed according to a fixed probability distribution $X$, and call $n_i$ the (random) number of partitions in which user $i$ appears. Then the proportion of partitions dropped due to reservoir sampling is (in expectation, for large $N$):
$$\frac{\mathbb{E}\left[\sum_{i=1}^{N} \max(n_i - C, 0)\right]}{\mathbb{E}\left[\sum_{i=1}^{N} n_i\right]} = \frac{\mathbb{E}\left[\max(X - C, 0)\right]}{\mathbb{E}[X]}.$$
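This ratio can be estimated with a small simulation sketch (the Poisson and geometric choices of $X$ below are illustrative assumptions, not the distributions used in Figure 4.11):

import numpy as np

rng = np.random.default_rng(0)

def dropped_ratio(partition_counts: np.ndarray, C: int) -> float:
    # Proportion of per-user partitions removed by reservoir sampling with bound C.
    return float(np.maximum(partition_counts - C, 0).sum() / partition_counts.sum())

n_users = 100_000
for name, counts in [("Poisson(5)", rng.poisson(5, n_users)),
                     ("Geometric(0.2)", rng.geometric(0.2, n_users))]:
    print(name, [round(dropped_ratio(counts, C), 3) for C in (1, 2, 4, 8, 16)])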
Figure 4.11 shows the effect of $C$ on this proportion for various distributions $X$. All distributions exhibit similar behavior as $C$ increases, but the median percent error declines at different speeds depending on the shape of the distribution.
Finally, we analyze the effect of clamping on accuracy using model input distributions. Since clamping occurs at the ε-DP aggregation level, we focus on input sets that have at most one row per user.
Consider finding ANON_AVG(col, L, U), where there are $n$ values of col uniformly distributed on an interval $[0, m]$. If the input distribution is symmetric, symmetric clamping will not create bias, so we clamp only one end: we consider clamping bounds $L$ and $U$ such that $L = 0$ and only $U$ varies. We analyze expected error rather than median error, since the median error is noisier when running experiments and the behavior of both metrics is similar.
We plot the impact of the upper clamping bound $U$ on the total expected error for uniformly distributed col in Figure 4.12, for several parameter settings. The optimal point on each curve is marked with a circle. To maximize accuracy, overestimating the spread of the input set must be balanced against restricting the sensitivity. Analysis with the other aggregation functions yields similar results.
Note that overestimating the spread of the input set creates less error than underestimating it; the curves in Figure 4.12 do not grow very fast after the optimal point. This can be explained analytically: the unclamped expected mean of col is $m/2$; for col clamped between $L = 0$ and $U \le m$, it is:
$$\frac{1}{m}\left(\int_{0}^{U} x\,dx + \int_{U}^{m} U\,dx\right) = U - \frac{U^{2}}{2m},$$
so the expected error of the clamped mean is
$$\frac{m}{2} - U + \frac{U^{2}}{2m} = \frac{(m - U)^{2}}{2m}.$$
Compare this expected error to the median noise added by ANON_AVG(col, L, U), which is approximately proportional to $(U - L)/(\varepsilon n)$. The clamping error grows quadratically with the gap $m - U$ between the true maximum and the upper clamping bound, while the noise error grows only linearly with $U$.
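A minimal numerical sketch of this trade-off for the uniform case follows (the parameter values are arbitrary illustrations, and the noise term uses the $\ln(2)\,(U - L)/(\varepsilon n)$ approximation above rather than the exact ANON_AVG implementation):

import numpy as np

# Expected clamping bias plus approximate median Laplace noise for the mean of n
# values uniform on [0, m], clamped to [0, U].
def total_expected_error(U, m=1.0, n=10_000, epsilon=1.0):
    clamp_bias = (m - U) ** 2 / (2 * m) if U < m else 0.0
    noise = np.log(2) * U / (epsilon * n)
    return clamp_bias + noise

for U in (0.6, 0.8, 1.0, 1.5, 3.0):
    print(U, round(total_expected_error(U), 6))
# Underestimating the bound (U < m) hurts quadratically, while overestimating it
# only increases the noise term linearly.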
The behavior for other distributions is similar.