Article

How Can We Analyze Differentially-Private Synthetic Datasets?

Abstract

Synthetic datasets generated within the multiple imputation framework are now commonly used by statistical agencies to protect the confidentiality of their respondents. More recently, researchers have also proposed techniques to generate synthetic datasets which offer the formal guarantee of differential privacy. While combining rules were derived for the first type of synthetic datasets, little has been said about the analysis of differentially-private synthetic datasets generated with multiple imputations. In this paper, we show that we cannot use the usual combining rules to analyze synthetic datasets which have been generated to achieve differential privacy. We consider specifically the case of generating synthetic count data with the beta-binomial synthesizer, and illustrate our discussion with simulation results. We also propose, as a simple alternative, a Bayesian model which explicitly models the mechanism for synthetic data generation.
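As a rough illustration of the beta-binomial synthesizer discussed in the abstract, the following Python sketch draws synthetic counts for a single cell from the beta-binomial posterior predictive. The function and parameter names are ours, and the exact setup in the paper may differ.

import numpy as np

rng = np.random.default_rng(0)

def beta_binomial_synthesize(x, n, a, b, m=5):
    # Draw m synthetic counts for one cell from the beta-binomial posterior
    # predictive: p ~ Beta(a + x, b + n - x), z ~ Binomial(n, p).
    # Larger prior parameters (a, b) smooth the release more strongly.
    synthetic = []
    for _ in range(m):
        p = rng.beta(a + x, b + n - x)        # posterior draw of the cell proportion
        synthetic.append(rng.binomial(n, p))  # synthetic count released in place of x
    return synthetic

# Example: observed count x = 37 out of n = 100, with a weak Beta(1, 1) prior.
print(beta_binomial_synthesize(37, 100, a=1.0, b=1.0))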
... For z_B to satisfy differential privacy for a given privacy budget ε > 0, the hyperparameters α_{i,B} must be sufficiently large. When ε is small, however, the requirements for the α_{i,B} become prohibitively high, resulting in a prior distribution which would dominate the data and thereby hinder the utility of the synthetic data (Charest, 2011). The approach of Machanavajjhala et al. (2008) can be viewed as a special case of the approach of Quick (2019), and thus a more detailed derivation of its properties will be discussed in the following subsection. ...
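The constraint described in this excerpt can be explored numerically. The sketch below computes, for a single synthetic draw from a beta-binomial release, the worst-case log-probability ratio over neighboring counts, which must not exceed ε for ε-differential privacy. This is a brute-force check rather than the closed-form condition in the cited work, but it shows how the prior parameters must grow as ε shrinks.

import numpy as np
from scipy.stats import betabinom

def worst_case_log_ratio(n, a, b):
    # Largest |log P(z | x) - log P(z | x')| over all synthetic counts z and all
    # neighboring datasets (here: counts differing by one). Releasing a single
    # synthetic draw satisfies epsilon-DP for this neighbor notion iff this
    # value is at most epsilon.
    z = np.arange(n + 1)
    worst = 0.0
    for x in range(n):
        lp_x  = betabinom.logpmf(z, n, a + x,     b + n - x)
        lp_x1 = betabinom.logpmf(z, n, a + x + 1, b + n - x - 1)
        worst = max(worst, float(np.max(np.abs(lp_x - lp_x1))))
    return worst

# A small prior gives a large epsilon; inflating the prior drives epsilon down,
# at the cost of a prior that dominates the data.
for a in (1.0, 10.0, 100.0):
    print(a, worst_case_log_ratio(n=100, a=a, b=a))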
... As a result, the restrictions imposed on z and the values of the hyperparameters, a and b, can be released without leaking sensitive information about the true data. Previous work (e.g., Charest, 2011) has considered treating synthetic data as noisy versions of the truth and using measurement error models (informed by the true hyperparameters) in an attempt to remove the differentially private noise and recover the true data; while unexplored here, disclosing hyperparameters like a and b should help facilitate analyses of this nature. ...
Preprint
CDC WONDER is a web-based tool for the dissemination of epidemiologic data collected by the National Vital Statistics System. While CDC WONDER has built-in privacy protections, they do not satisfy formal privacy protections such as differential privacy and thus are susceptible to targeted attacks. Given the importance of making high-quality public health data publicly available while preserving the privacy of the underlying data subjects, we aim to improve the utility of a recently developed approach for generating Poisson-distributed, differentially private synthetic data by using publicly available information to truncate the range of the synthetic data. Specifically, we utilize county-level population information from the U.S. Census Bureau and national death reports produced by the CDC to inform prior distributions on county-level death rates and infer reasonable ranges for Poisson-distributed, county-level death counts. In doing so, the requirements for satisfying differential privacy for a given privacy budget can be reduced by several orders of magnitude, thereby leading to substantial improvements in utility. To illustrate our proposed approach, we consider a dataset comprised of over 26,000 cancer-related deaths from the Commonwealth of Pennsylvania belonging to over 47,000 combinations of cause-of-death and demographic variables such as age, race, sex, and county-of-residence and demonstrate the proposed framework's ability to preserve features such as geographic, urban/rural, and racial disparities present in the true data.
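A hedged sketch of how public information might inform the prior and the truncation range described above: center a gamma prior for a county's death rate on a national rate with a chosen prior strength, then take a high quantile of the implied Poisson count as z_max. The names, the quantile rule, and the numbers are illustrative assumptions, not the paper's exact construction.

from scipy.stats import gamma, poisson

def public_prior_and_range(national_rate, prior_strength, population, coverage=0.999):
    # Center a gamma prior for a county's death rate on a public national rate,
    # with prior_strength playing the role of a prior "person-years" count, then
    # take a high quantile of the implied Poisson count as a truncation bound z_max.
    shape = national_rate * prior_strength
    rate = prior_strength
    upper_rate = gamma.ppf(coverage, shape, scale=1.0 / rate)
    z_max = int(poisson.ppf(coverage, upper_rate * population))
    return shape, rate, z_max

# Hypothetical numbers: national death rate of 2.5 per 10,000, county of 50,000 people.
print(public_prior_and_range(national_rate=2.5e-4, prior_strength=40_000.0, population=50_000))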
... Rubin's rules have not been widely used with DP synthetic data generation, and we are only aware of three existing works studying the combination. Charest [9] studied Rubin's rules with a very simple early synthetic data generation algorithm, and concluded that Rubin's rules are not appropriate for that algorithm. Zheng [55] found that some simple one-dimensional methods developed by the multiple imputation community are in fact DP, but not with practically useful privacy bounds. ...
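For reference, the combining rules mentioned here are straightforward to state in code. The sketch below implements Rubin's classical multiple-imputation rule and the fully synthetic variant of Raghunathan, Reiter and Rubin (2003); as the excerpt notes, neither is generally appropriate for differentially private synthesizers.

import numpy as np

def combine_estimates(q, u, rule="fully_synthetic"):
    # Combine point estimates q[l] and variance estimates u[l] from m datasets.
    # "missing_data" is Rubin's classical multiple-imputation rule;
    # "fully_synthetic" is the rule of Raghunathan, Reiter and Rubin (2003).
    q, u = np.asarray(q, float), np.asarray(u, float)
    m = len(q)
    q_bar = q.mean()
    b = q.var(ddof=1)            # between-dataset variance
    u_bar = u.mean()             # average within-dataset variance
    if rule == "missing_data":
        T = u_bar + (1 + 1 / m) * b
    else:
        T = (1 + 1 / m) * b - u_bar   # can be negative in small samples
    return q_bar, T

print(combine_estimates([10.2, 9.8, 10.5, 10.1, 9.9], [0.4] * 5))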
Preprint
Full-text available
While generation of synthetic data under differential privacy (DP) has received a lot of attention in the data privacy community, analysis of synthetic data has received much less. Existing work has shown that simply analysing DP synthetic data as if it were real does not produce valid inferences of population-level quantities. For example, confidence intervals become too narrow, which we demonstrate with a simple experiment. We tackle this problem by combining synthetic data analysis techniques from the field of multiple imputation, and synthetic data generation using noise-aware Bayesian modeling into a pipeline NA+MI that allows computing accurate uncertainty estimates for population-level quantities from DP synthetic data. To implement NA+MI for discrete data generation from marginal queries, we develop a novel noise-aware synthetic data generation algorithm NAPSU-MQ using the principle of maximum entropy. Our experiments demonstrate that the pipeline is able to produce accurate confidence intervals from DP synthetic data. The intervals become wider with tighter privacy to accurately capture the additional uncertainty stemming from DP noise.
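The "confidence intervals become too narrow" phenomenon can be reproduced with a toy experiment of our own (not the authors' code): release a noisy count with the Laplace mechanism, resample synthetic data from it, and measure the coverage of the naive Wald interval.

import numpy as np

rng = np.random.default_rng(2)

def coverage_of_naive_ci(p=0.3, n=200, eps=0.5, reps=2000):
    # Release a DP count via the Laplace mechanism (sensitivity of a count is 1),
    # resample a synthetic dataset from it, and check how often the naive 95%
    # Wald interval computed on the synthetic data covers the true p.
    hits = 0
    for _ in range(reps):
        x = rng.binomial(n, p)
        x_dp = x + rng.laplace(scale=1.0 / eps)
        p_dp = min(max(x_dp / n, 0.0), 1.0)
        z = rng.binomial(n, p_dp)                        # synthetic data
        p_hat = z / n
        half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)   # ignores the DP noise
        hits += (p_hat - half <= p <= p_hat + half)
    return hits / reps

print(coverage_of_naive_ci())   # typically well below the nominal 0.95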
... Yet one of the most appealing uses of differential privacy is the generation of synthetic data, which is a collection of records matching the input schema, intended to be broadly representative of the source data. Differentially private synthetic data is an active area of research [1, 2, 5, 11, 12, 19, 25, 27, 29, 30, 43, 45, 46, 48-50, 52-55] and has also been the basis for two competitions, hosted by the U.S. National Institute of Standards and Technology [40]. ...
Preprint
Full-text available
We propose AIM, a novel algorithm for differentially private synthetic data generation. AIM is a workload-adaptive algorithm, within the paradigm of algorithms that first select a set of queries, then privately measure those queries, and finally generate synthetic data from the noisy measurements. It uses a set of innovative features to iteratively select the most useful measurements, reflecting both their relevance to the workload and their value in approximating the input data. We also provide analytic expressions to bound per-query error with high probability, which can be used to construct confidence intervals and inform users about the accuracy of generated data. We show empirically that AIM consistently outperforms a wide variety of existing mechanisms across a variety of experimental settings.
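As a hedged illustration of how a known per-query noise scale yields a confidence interval, assuming additive Gaussian noise with standard deviation sigma (AIM's actual bounds may be derived differently):

from scipy.stats import norm

def per_query_ci(noisy_answer, sigma, level=0.95):
    # Interval for one noisy query answer, assuming additive Gaussian noise with
    # known standard deviation sigma.
    z = norm.ppf(0.5 + level / 2)
    return noisy_answer - z * sigma, noisy_answer + z * sigma

print(per_query_ci(1042.0, sigma=25.0))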
... • Achieving ε-differential privacy via Laplace noise addition to unaggregated attribute data for ε = 0.01, 0.1, 1, 10, 25, 50, 100, which covers the usual range of differential privacy levels observed in the literature [17], [3], [4], [30] plus some very large values. For each ε value, five differentially private data sets were generated, and utility and confidentiality metrics were averaged over the five data sets. ...
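The mechanism referenced in this excerpt is the standard Laplace mechanism applied attribute-wise. A minimal sketch, assuming each attribute has a known bounded domain whose width is used as the sensitivity; the attribute values are made up.

import numpy as np

rng = np.random.default_rng(3)

def laplace_mechanism(values, sensitivity, eps):
    # Add Laplace(sensitivity / eps) noise to each value; smaller eps means
    # larger noise. The attribute's bounded range, used as the sensitivity,
    # is assumed to be known.
    return values + rng.laplace(scale=sensitivity / eps, size=np.shape(values))

incomes = np.array([21_000.0, 54_500.0, 87_250.0])   # hypothetical attribute values
for eps in (0.01, 0.1, 1, 10, 100):
    print(eps, laplace_mechanism(incomes, sensitivity=100_000.0, eps=eps))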
Preprint
Anonymization for privacy-preserving data publishing, also known as statistical disclosure control (SDC), can be viewed through the lens of the permutation model. According to this model, any SDC method for individual data records is functionally equivalent to a permutation step plus a noise addition step, where the noise added is marginal, in the sense that it does not alter ranks. Here, we propose metrics to quantify the data confidentiality and utility achieved by SDC methods based on the permutation model. We distinguish two privacy notions: in our work, anonymity refers to subjects and hence mainly to protection against record re-identification, whereas confidentiality refers to the protection afforded to attribute values against attribute disclosure. Thus, our confidentiality metrics are useful even if using a privacy model ensuring an anonymity level ex ante. The utility metric is a general-purpose metric that can be conveniently traded off against the confidentiality metrics, because all of them are bounded between 0 and 1. As an application, we compare the utility-confidentiality trade-offs achieved by several anonymization approaches, including privacy models (k-anonymity and ε-differential privacy) as well as SDC methods (additive noise, multiplicative noise and synthetic data) used without privacy models.
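The permutation-model decomposition described above can be computed directly: reverse-map each masked value to the original value with the same rank, and treat the remainder as rank-preserving noise. A minimal sketch with made-up numbers:

import numpy as np

def permutation_decomposition(original, masked):
    # Reverse-map each masked value to the original value with the same rank;
    # the remainder is noise that, by construction, does not alter ranks.
    original = np.asarray(original, float)
    masked = np.asarray(masked, float)
    ranks = masked.argsort().argsort()           # rank of each masked value
    reverse_mapped = np.sort(original)[ranks]    # original value with the same rank
    noise = masked - reverse_mapped
    return reverse_mapped, noise

orig = [10, 25, 40, 55]
mask = [14, 22, 61, 43]   # e.g., after additive noise
print(permutation_decomposition(orig, mask))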
... There is a key conflict between our dual responsibilities as researchers: making high-quality inferences from data that are meaningful for policy and social welfare, while simultaneously ensuring that the data are protected to preserve the confidentiality of the respondents who have consented to our use of their (potentially sensitive) information. These two objectives are intricately linked, given that the preservation of respondent confidentiality by the researcher, in addition to being a legal requirement, is also essential for preserving trust between the researcher and respondent, which in turn is linked to data accuracy (Charest, 2010; Reiter, 2012). As the amount of microdata continues to increase, there is a commensurate increase in the need for privacy-protecting methods of data analysis that simultaneously minimize loss of information (Armstrong et al., 1999; Brand, 2002). ...
Article
Full-text available
In public use data sets, it is desirable not to report a respondent's location precisely to protect subject confidentiality. However, the direct use of perturbed location data to construct explanatory exposure variables for regression models will generally make naive estimates of all parameters biased and inconsistent. We propose an approach where a perturbation vector, consisting of a random distance at a random angle, is added to a respondent's reported geographic co‐ordinates. We show that, as long as the distribution of the perturbation is public and there is an underlying prior population density map, external researchers can construct unbiased and consistent estimates of location‐dependent exposure effects by using numerical integration techniques over all possible actual locations, although coefficient confidence intervals are wider than if the true location data were known. We examine our method by using a Monte Carlo simulation exercise and apply it to a real world example using data on perceived and actual distance to a health facility in Tanzania.
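A minimal sketch of the perturbation step described above, adding a random distance at a random angle to reported coordinates. The uniform-on-a-disc choice is only an example, since the method requires merely that the perturbation distribution be published; the coordinates are hypothetical.

import numpy as np

rng = np.random.default_rng(4)

def perturb_location(x, y, max_dist):
    # Add a perturbation vector with a random angle and a random distance.
    # The sqrt makes the perturbed point uniform over a disc of radius max_dist;
    # any published distribution would do.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    d = max_dist * np.sqrt(rng.uniform())
    return x + d * np.cos(theta), y + d * np.sin(theta)

print(perturb_location(36.85, -1.29, max_dist=0.05))   # hypothetical coordinates, in degrees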
... Perturbation of microdata has also been shown to achieve differential privacy if the perturbation mechanism can be represented as a misclassification matrix that contains no zeros [52]. Differential privacy mechanisms can also be used to produce fully synthetic data for release, including frequency tables using beta-binomial [7] or Poisson [46] synthesizers, as well as contingency tables and OLAP cubes [4]. The latter approach adds Laplace noise to the Fourier projection of the source table before projecting back to create a synthetic table in the integer domain. ...
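For the misclassification-matrix result mentioned here, the privacy parameter of a per-record category perturbation is the largest log-ratio of entries within a column of the transition matrix, which is finite only if the matrix contains no zeros. A short sketch:

import numpy as np

def epsilon_of_misclassification_matrix(P):
    # P[j, k] = probability of releasing category k when the true category is j.
    # For per-record category perturbation, the DP parameter is the largest
    # log-ratio of entries within a column, finite only if no entry is zero.
    P = np.asarray(P, float)
    assert np.all(P > 0), "a zero entry makes epsilon infinite"
    return float(np.max(np.log(P.max(axis=0) / P.min(axis=0))))

# Randomized response keeping the true category with probability 0.8.
P = np.full((3, 3), 0.1) + 0.7 * np.eye(3)
print(epsilon_of_misclassification_matrix(P))   # log(0.8 / 0.1), about 2.08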
Preprint
Full-text available
Case records on identified victims of human trafficking are highly sensitive, yet the ability to share such data is critical to evidence-based practice and policy development across government, business, and civil society. We propose new methods to anonymize, publish, and explore data on identified victims of trafficking, implemented as a single pipeline producing three data artifacts: (1) synthetic microdata modelled on sensitive case records and released in their place, mitigating the privacy risk that traffickers might link distinctive combinations of attributes in published records to known victims; (2) aggregate data summarizing the precomputed frequencies of all short attribute combinations, mitigating the utility risk that synthetic data might misrepresent statistics needed for official reporting; and (3) visual analytics interfaces for parallel exploration and evaluation of synthetic data representations and sensitive data aggregates, mitigating the accessibility risk that privacy mechanisms or analysis tools might not be understandable or usable by all stakeholders. Central to our mitigation of these risks is the notion of k-synthetic data, which we generate through a distributed machine learning pipeline. k-synthetic data preserves privacy by ensuring that longer combinations of attributes are not rare in the sensitive dataset and thus potentially identifying; it preserves utility by ensuring that shorter combinations of attributes are both present and frequent in the sensitive dataset; and it improves accessibility by being easy to explain and apply. We present our work as a design study motivated by the goal of creating a new privacy-preserving data platform for the Counter-Trafficking Data Collaborative (CTDC), transforming how the world's largest database of identified victims is made available for global collaboration against human trafficking.
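The frequency requirements behind k-synthetic data can be audited with a simple count of attribute combinations. The sketch below is only an illustrative check of the stated property, with made-up records, not the CTDC pipeline:

from itertools import combinations
from collections import Counter

def rare_combinations(records, attrs, max_len, k):
    # Count every attribute combination of length <= max_len and return those
    # occurring fewer than k times; these are the rare, potentially identifying
    # combinations that a k-synthetic release should not reproduce.
    counts = Counter()
    for rec in records:
        for r in range(1, max_len + 1):
            for combo in combinations(attrs, r):
                counts[tuple((a, rec[a]) for a in combo)] += 1
    return {c: n for c, n in counts.items() if n < k}

records = [{"age": "20-29", "sex": "F", "country": "X"},
           {"age": "20-29", "sex": "F", "country": "X"},
           {"age": "30-39", "sex": "M", "country": "Y"}]
print(rare_combinations(records, attrs=["age", "sex", "country"], max_len=2, k=2))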
... In addition to membership and attribute attacks, the framework of differential privacy has garnered a lot of interest [40-42]. The key idea is to protect the information of every individual in the database against an adversary with complete knowledge of the rest of the dataset. ...
Article
Full-text available
Background: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. A major reason for this has been the lack of availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic, synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high dimensional and often categorical. These characteristics pose multiple modeling challenges. Methods: In this paper, we evaluate three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed. Results: While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance Epidemiology and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015, which includes over 360,000 individual cases. Conclusions: We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.
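One example of the kind of utility metric discussed here (our choice, not necessarily one used in the paper) is the total variation distance between real and synthetic marginals:

import numpy as np
import pandas as pd

def marginal_tvd(real, synth, column):
    # Total variation distance between the marginal distributions of one
    # categorical column in the real and synthetic data (0 = identical marginals).
    p = real[column].value_counts(normalize=True)
    q = synth[column].value_counts(normalize=True)
    cats = p.index.union(q.index)
    return 0.5 * float(np.abs(p.reindex(cats, fill_value=0.0)
                              - q.reindex(cats, fill_value=0.0)).sum())

real = pd.DataFrame({"site": ["breast", "breast", "respiratory", "non-solid"]})
synth = pd.DataFrame({"site": ["breast", "respiratory", "respiratory", "non-solid"]})
print(marginal_tvd(real, synth, "site"))   # 0.25 for these toy frames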
Article
We introduce the DP-auto-GAN framework for synthetic data generation, which combines the low dimensional representation of autoencoders with the flexibility of Generative Adversarial Networks (GANs). This framework can be used to take in raw sensitive data and privately train a model for generating synthetic data that will satisfy similar statistical properties as the original data. This learned model can generate an arbitrary amount of synthetic data, which can then be freely shared due to the post-processing guarantee of differential privacy. Our framework is applicable to unlabeled mixed-type data, that may include binary, categorical, and real-valued data. We implement this framework on both binary data (MIMIC-III) and mixed-type data (ADULT), and compare its performance with existing private algorithms on metrics in unsupervised settings. We also introduce a new quantitative metric able to detect diversity, or lack thereof, of synthetic data.
Preprint
Full-text available
We propose a general approach for differentially private synthetic data generation, that consists of three steps: (1) select a collection of low-dimensional marginals, (2) measure those marginals with a noise addition mechanism, and (3) generate synthetic data that preserves the measured marginals well. Central to this approach is Private-PGM, a post-processing method that is used to estimate a high-dimensional data distribution from noisy measurements of its marginals. We present two mechanisms, NIST-MST and MST, that are instances of this general approach. NIST-MST was the winning mechanism in the 2018 NIST differential privacy synthetic data competition, and MST is a new mechanism that can work in more general settings, while still performing comparably to NIST-MST. We believe our general approach should be of broad interest, and can be adopted in future mechanisms for synthetic data generation.
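A skeleton of the three-step select/measure/generate approach, with deliberately simplified stand-ins: the marginals are given rather than selected, each is measured with Laplace noise splitting the budget evenly, and synthetic records are sampled independently per column. Private-PGM, NIST-MST and MST replace the last step with a proper joint estimation step; all names and numbers here are illustrative.

import numpy as np

rng = np.random.default_rng(5)

def select_measure_generate(data, marginals, eps, n_synth):
    # (1) the marginals to measure are simply given here, (2) each is measured
    # with Laplace noise splitting the budget evenly, and (3) synthetic records
    # are sampled independently from the noisy one-way marginals.
    eps_each = eps / len(marginals)
    synth = {}
    for col in marginals:
        counts = np.bincount(data[col])                                   # true marginal
        noisy = counts + rng.laplace(scale=1.0 / eps_each, size=counts.size)
        probs = np.clip(noisy, 0.0, None)
        probs = probs / probs.sum()
        synth[col] = rng.choice(counts.size, size=n_synth, p=probs)
    return synth

data = {"sex": np.array([0, 1, 1, 0, 1]), "age_band": np.array([2, 0, 1, 2, 2])}
print(select_measure_generate(data, marginals=["sex", "age_band"], eps=1.0, n_synth=4))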
Article
The dissemination of synthetic data can be an effective means of making information from sensitive data publicly available with a reduced risk of disclosure. While mechanisms exist for synthesizing data that satisfy formal privacy guarantees, these mechanisms do not typically resemble the models an end‐user might use to analyse the data. More recently, the use of methods from the disease mapping literature has been proposed to generate spatially referenced synthetic data with high utility but without formal privacy guarantees. The objective of this paper is to help bridge the gap between the disease mapping and the differential privacy literatures. In particular, we generalize an approach for generating differentially private synthetic data currently used by the US Census Bureau to the case of Poisson‐distributed count data in a way that accommodates heterogeneity in population sizes and allows for the infusion of prior information regarding the underlying event rates. Following a pair of small simulation studies, we illustrate the utility of the synthetic data produced by this approach using publicly available, county‐level heart disease‐related death counts. This study demonstrates the benefits of the proposed approach’s flexibility with respect to heterogeneity in population sizes and event rates while motivating further research to improve its utility.
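A hedged sketch of a Poisson-gamma synthesizer of the kind described, handling heterogeneous population sizes through a person-years offset. Parameter names and values are assumptions, and the paper's mechanism additionally manages the privacy budget.

import numpy as np

rng = np.random.default_rng(6)

def poisson_gamma_synthesize(deaths, populations, prior_shape, prior_rate, m=5):
    # Each county's death rate gets a gamma posterior (prior shape/rate plus
    # observed deaths/person-years), and synthetic counts are drawn from the
    # posterior predictive, so counties with small and large populations are
    # handled on the same footing.
    deaths = np.asarray(deaths, float)
    populations = np.asarray(populations, float)
    shape = prior_shape + deaths
    rate = prior_rate + populations
    synth = []
    for _ in range(m):
        lam = rng.gamma(shape, 1.0 / rate)            # per-county rate draws
        synth.append(rng.poisson(lam * populations))  # synthetic counts
    return synth

print(poisson_gamma_synthesize([3, 120], [12_000, 900_000],
                               prior_shape=10.0, prior_rate=40_000.0))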
Article
To avoid disclosures, Rubin proposed creating multiple synthetic data sets for public release so that (i) no unit in the released data has sensitive data from an actual unit in the population, and (ii) statistical procedures that are valid for the original data are valid for the released data. In this article, I show through simulation studies that valid inferences can be obtained from synthetic data in a variety of settings, including simple random sampling, probability proportional to size sampling, two-stage cluster sampling, and stratified sampling. I also provide guidance on specifying the number and size of synthetic data sets and demonstrate the benefit of including design variables in the released data sets.
Article
Multiple imputation was first conceived as a tool that statistical agencies could use to handle nonresponse in large sample, public use surveys. In the last two decades, the multiple imputation framework has been adapted for other statistical contexts. As examples, individual researchers use multiple imputation to handle missing data in small samples; statistical agencies disseminate multiply-imputed datasets for purposes of protecting data confidentiality; and survey methodologists and epidemiologists use multiple imputation to correct for measurement errors. In some of these settings, Rubin's original rules for combining the point and variance estimates from the multiply-imputed datasets are not appropriate, because what is known, and therefore what enters the conditional expectations and variances used to derive inferential methods, differs from the missing data context. These applications require new combining rules and methods of inference. In fact, more than ten combining rules exist in the ...