Article (PDF available)

How Can We Analyze Differentially-Private Synthetic Datasets?


Abstract

Synthetic datasets generated within the multiple imputation framework are now commonly used by statistical agencies to protect the confidentiality of their respondents. More recently, researchers have also proposed techniques to generate synthetic datasets which offer the formal guarantee of differential privacy. While combining rules were derived for the first type of synthetic datasets, little has been said on the analysis of differentially-private synthetic datasets generated with multiple imputation. In this paper, we show that we cannot use the usual combining rules to analyze synthetic datasets which have been generated to achieve differential privacy. We consider specifically the case of generating synthetic count data with the beta-binomial synthesizer, and illustrate our discussion with simulation results. We also propose, as a simple alternative, a Bayesian model which explicitly models the mechanism for synthetic data generation.
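To make the setting concrete, the beta-binomial synthesizer can be sketched in a few lines of Python. This is a minimal sketch under assumed notation (an observed count y among n records and a Beta(a, b) prior); the paper's exact parameterization may differ, and the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def beta_binomial_synthesizer(y, n, a, b, n_syn, m=5):
    """Draw m synthetic counts from the beta-binomial posterior predictive.

    y: observed count of "successes" among n records (sensitive)
    a, b: Beta prior hyperparameters; larger values mean a stronger,
          more privacy-protective prior
    n_syn: size of each synthetic dataset
    m: number of synthetic datasets (imputations)
    """
    # Posterior of the success probability given the observed count
    theta = rng.beta(a + y, b + n - y, size=m)
    # One synthetic count per imputation, drawn given its sampled theta
    return rng.binomial(n_syn, theta)

# Example: 40 successes among 100 records, flat prior, 5 synthetic counts
print(beta_binomial_synthesizer(y=40, n=100, a=1, b=1, n_syn=100))
```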
... For $p(z_B \mid y_B; a_B)$ to satisfy differential privacy for a given privacy budget, $\epsilon > 0$, the hyperparameters $a_{i,B}$ must be sufficiently large. When $\epsilon$ is small, however, the requirements for the $a_{i,B}$ become prohibitively high, resulting in a prior distribution which would dominate the data and thereby hinder the utility of the synthetic data (Charest, 2011). The approach of Machanavajjhala et al. (2008) can be viewed as a special case of the approach of Quick (2021), and thus a more detailed derivation of its properties will be discussed in the following subsection. ...
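The requirement on the hyperparameters can be made concrete with a brute-force check: an $\epsilon$-DP release demands that the posterior-predictive pmfs under any two neighboring observed counts differ by a factor of at most $e^\epsilon$. The sketch below (our own naming, not code from the cited papers) verifies this directly for the beta-binomial release.

```python
import numpy as np
from scipy.stats import betabinom

def satisfies_eps_dp(n, n_syn, a, b, eps):
    """Brute-force check that the beta-binomial release z | y is eps-DP.

    Neighboring datasets change the observed count y by at most 1, so we
    verify p(z | y') <= exp(eps) * p(z | y) for all z and neighbors y, y'.
    """
    z = np.arange(n_syn + 1)
    # Posterior-predictive pmf of z for every possible observed count y
    pmfs = np.array([betabinom.pmf(z, n_syn, a + y, b + n - y)
                     for y in range(n + 1)])
    ratios = pmfs[1:] / pmfs[:-1]          # neighbors y and y + 1
    worst = max(ratios.max(), (1 / ratios).max())
    return worst <= np.exp(eps)

print(satisfies_eps_dp(n=50, n_syn=50, a=1, b=1, eps=0.5))      # False
print(satisfies_eps_dp(n=50, n_syn=50, a=500, b=500, eps=0.5))  # True
```

With a flat Beta(1, 1) prior the worst-case ratio here works out to 1 + n_syn, so small budgets force very large a and b; this is exactly the prior-dominates-the-data problem described above.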
... As a result, the restrictions imposed on z and the values of the hyperparameters, a and b, can be released without leaking sensitive information about the true data. Previous work (e.g., Charest, 2011) has considered treating synthetic data as noisy versions of the truth and using measurement error models (informed by the true hyperparameters) in an attempt to remove the differentially private noise and recover the true data; while unexplored here, disclosing hyperparameters like a and b should help facilitate analyses of this nature. ...
Article
CDC WONDER is a web-based tool for the dissemination of epidemiologic data collected by the National Vital Statistics System. While CDC WONDER has built-in privacy protections, they do not satisfy formal privacy protections such as differential privacy and thus are susceptible to targeted attacks. Given the importance of making high-quality public health data publicly available while preserving the privacy of the underlying data subjects, we aim to improve the utility of a recently developed approach for generating Poisson-distributed, differentially private synthetic data by using publicly available information to truncate the range of the synthetic data. Specifically, we utilize county-level population information from the US Census Bureau and national death reports produced by the CDC to inform prior distributions on county-level death rates and infer reasonable ranges for Poisson-distributed, county-level death counts. In doing so, the requirements for satisfying differential privacy for a given privacy budget can be reduced by several orders of magnitude, thereby leading to substantial improvements in utility. To illustrate our proposed approach, we consider a dataset comprised of over 26,000 cancer-related deaths from the Commonwealth of Pennsylvania belonging to over 47,000 combinations of cause-of-death and demographic variables such as age, race, sex, and county-of-residence and demonstrate the proposed framework’s ability to preserve features such as geographic, urban/rural, and racial disparities present in the true data.
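At a high level, the truncation idea can be sketched as sampling from a gamma-Poisson posterior predictive restricted to a publicly justifiable range. The following is only a schematic under assumptions of ours (single observation, Gamma rate parameterization); the article's actual prior construction and privacy accounting are more involved.

```python
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(1)

def truncated_gamma_poisson_synth(y, a, b, z_max, size=1):
    """Synthetic counts from a gamma-Poisson posterior predictive,
    truncated to a publicly justifiable range [0, z_max].

    y: true (sensitive) count
    a, b: Gamma(shape=a, rate=b) prior on the rate, informed by public data
    z_max: upper bound derived from public info (e.g., county population)
    """
    # Posterior rate is Gamma(a + y, b + 1); its Poisson predictive is
    # negative binomial with r = a + y and p = (b + 1) / (b + 2)
    r, p = a + y, (b + 1) / (b + 2)
    z = np.arange(z_max + 1)
    pmf = nbinom.pmf(z, r, p)
    pmf /= pmf.sum()                  # renormalize over truncated support
    return rng.choice(z, size=size, p=pmf)

# A county with 12 deaths, a public-data-informed prior, and a cap of 200
print(truncated_gamma_poisson_synth(y=12, a=5, b=0.5, z_max=200, size=5))
```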
... Ideally, we wish to leverage the full value of the sensitive data whilst still maintaining individual privacy. This goal is driving new lines of research in statistics, information theory, and machine learning [4,6,2]. ...
Preprint
Full-text available
Machine Learning (ML) is accelerating progress across fields and industries, but relies on accessible and high-quality training data. Some of the most important datasets are found in biomedical and financial domains in the form of spreadsheets and relational databases. But this tabular data is often sensitive in nature. Synthetic data generation offers the potential to unlock sensitive data, but generative models tend to memorise and regurgitate training data, which undermines the privacy goal. To remedy this, researchers have incorporated the mathematical framework of Differential Privacy (DP) into the training process of deep neural networks. But this creates a trade-off between the quality and privacy of the resulting data. Generative Adversarial Networks (GANs) are the dominant paradigm for synthesising tabular data under DP, but suffer from unstable adversarial training and mode collapse, which are exacerbated by the privacy constraints and challenging tabular data modality. This work optimises the quality-privacy trade-off of generative models, producing higher quality tabular datasets with the same privacy guarantees. We implement novel end-to-end models that leverage attention mechanisms to learn reversible tabular representations. We also introduce TableDiffusion, the first differentially-private diffusion model for tabular data synthesis. Our experiments show that TableDiffusion produces higher-fidelity synthetic datasets, avoids the mode collapse problem, and achieves state-of-the-art performance on privatised tabular data synthesis. By implementing TableDiffusion to predict the added noise, we enabled it to bypass the challenges of reconstructing mixed-type tabular data. Overall, the diffusion paradigm proves vastly more data and privacy efficient than the adversarial paradigm, due to augmented re-use of each data batch and a smoother iterative training process.
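The two ingredients combined here, a diffusion-style noise-prediction loss and differentially private gradient updates, can be illustrated with a deliberately simplified sketch. The toy model, dimensions, and names are ours; TableDiffusion's actual forward process, architecture, and privacy accounting are more sophisticated.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy denoiser: predicts the Gaussian noise that was added to a tabular row
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))

def dp_noise_prediction_step(batch, lr=0.1, clip_norm=1.0, noise_mult=1.0):
    """One DP-SGD-style step on a simplified noise-prediction objective."""
    grads = [torch.zeros_like(p) for p in model.parameters()]
    for row in batch:                        # per-example gradients
        noise = torch.randn_like(row)
        noisy_row = row + noise              # simplified forward process
        loss = ((model(noisy_row) - noise) ** 2).mean()
        model.zero_grad()
        loss.backward()
        # Clip each per-example gradient to bound its sensitivity
        total = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
        scale = (clip_norm / (total + 1e-12)).clamp(max=1.0)
        for g, p in zip(grads, model.parameters()):
            g += p.grad * scale
    with torch.no_grad():
        for g, p in zip(grads, model.parameters()):
            # Calibrated Gaussian noise makes the averaged update private
            g += noise_mult * clip_norm * torch.randn_like(g)
            p -= lr * g / len(batch)

dp_noise_prediction_step(torch.randn(16, 8))
```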
... Other early applications include Eno and Thompson (2008); Cano et al. (2010); Blum et al. (2011); Xiao et al. (2011). Several papers also explicitly adapted the ideas from the statistical community to the DP context (Machanavajjhala et al., 2008; Charest, 2011; McClure and Reiter, 2012a). The approach of Machanavajjhala et al. (2008) was later extended in Quick (2021) and Quick (2022). ...
Preprint
Full-text available
The idea of generating synthetic data as a tool for broadening access to sensitive microdata was first proposed three decades ago. While the first applications of the idea emerged around the turn of the century, the approach really gained momentum over the last ten years, stimulated at least in part by some recent developments in computer science. We consider the upcoming 30th jubilee of Rubin's seminal paper on synthetic data (Rubin, 1993) as an opportunity to look back at the historical developments, but also to offer a review of the diverse approaches and methodological underpinnings proposed over the years. We will also discuss the various strategies that have been suggested to measure the utility and remaining risk of disclosure of the generated data.
... The final two articles of this special issue address various aspects of the difficult problem of inference under the constraint of differential privacy. Unlike with SDC methods, the parameters of the differential privacy mechanism are not secret and can be used to adjust statistical analyses through a measurement error modeling approach, as shown in Charest (2011). Another example is Goldstein and Shlomo (2020), who use a Bayesian framework to account for the noise addition when conducting statistical analyses based on generalized linear modeling. ...
Article
Full-text available
This article is an introduction to the 13 articles in the JSSAM special issue on Privacy, Confidentiality, and Disclosure Protection. We also provide background information to place the articles into context.
... Rubin's rules have not been widely used with DP synthetic data generation, and we are only aware of three existing works studying the combination. Charest [9] studied Rubin's rules with a very simple early synthetic data generation algorithm, and concluded that Rubin's rules are not appropriate for that algorithm. Zheng [55] found that some simple one-dimensional methods developed by the multiple imputation community are in fact DP, but not with practically useful privacy bounds. ...
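For reference, the combining rules at issue take only a few lines to state. The three rules below are standard results from the multiple imputation literature; none of their variance estimates accounts for unmodeled DP noise injected during generation, which is the failure mode documented in [9].

```python
import numpy as np

def combine(q, u, rule="rubin"):
    """Combine point estimates q and within-imputation variances u
    from m multiply-imputed or synthetic datasets.

    rule="rubin":   Rubin (1987), missing data:   T = u_bar + (1 + 1/m) * b
    rule="full":    Raghunathan et al. (2003),
                    fully synthetic data:         T = (1 + 1/m) * b - u_bar
    rule="partial": Reiter (2003),
                    partially synthetic data:     T = u_bar + b / m
    """
    q, u = np.asarray(q, float), np.asarray(u, float)
    m = len(q)
    q_bar, u_bar, b = q.mean(), u.mean(), q.var(ddof=1)
    T = {"rubin":   u_bar + (1 + 1 / m) * b,
         "full":    (1 + 1 / m) * b - u_bar,   # can be negative in practice
         "partial": u_bar + b / m}[rule]
    return q_bar, T
```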
Preprint
Full-text available
While generation of synthetic data under differential privacy (DP) has received a lot of attention in the data privacy community, analysis of synthetic data has received much less. Existing work has shown that simply analysing DP synthetic data as if it were real does not produce valid inferences of population-level quantities. For example, confidence intervals become too narrow, which we demonstrate with a simple experiment. We tackle this problem by combining synthetic data analysis techniques from the field of multiple imputation with synthetic data generation based on noise-aware Bayesian modeling into a pipeline, NA+MI, that allows computing accurate uncertainty estimates for population-level quantities from DP synthetic data. To implement NA+MI for discrete data generation from marginal queries, we develop a novel noise-aware synthetic data generation algorithm, NAPSU-MQ, using the principle of maximum entropy. Our experiments demonstrate that the pipeline is able to produce accurate confidence intervals from DP synthetic data. The intervals become wider with tighter privacy to accurately capture the additional uncertainty stemming from DP noise.
... Yet one of the most appealing uses of differential privacy is the generation of synthetic data, which is a collection of records matching the input schema, intended to be broadly representative of the source data. Differentially private synthetic data is an active area of research [1, 2, 5, 11, 12, 19, 25, 27, 29, 30, 43, 45, 46, 48-50, 52-55] and has also been the basis for two competitions, hosted by the U.S. National Institute of Standards and Technology [40]. ...
Preprint
Full-text available
We propose AIM, a novel algorithm for differentially private synthetic data generation. AIM is a workload-adaptive algorithm, within the paradigm of algorithms that first select a set of queries, then privately measure those queries, and finally generate synthetic data from the noisy measurements. It uses a set of innovative features to iteratively select the most useful measurements, reflecting both their relevance to the workload and their value in approximating the input data. We also provide analytic expressions to bound per-query error with high probability, which can be used to construct confidence intervals and inform users about the accuracy of generated data. We show empirically that AIM consistently outperforms a wide variety of existing mechanisms across a variety of experimental settings.
Article
We propose AIM, a new algorithm for differentially private synthetic data generation. AIM is a workload-adaptive algorithm within the paradigm of algorithms that first select a set of queries, then privately measure those queries, and finally generate synthetic data from the noisy measurements. It uses a set of innovative features to iteratively select the most useful measurements, reflecting both their relevance to the workload and their value in approximating the input data. We also provide analytic expressions to bound per-query error with high probability, which can be used to construct confidence intervals and inform users about the accuracy of generated data. We show empirically that AIM consistently outperforms a wide variety of existing mechanisms across a variety of experimental settings.
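The select-measure-generate paradigm that AIM instantiates can be written schematically as follows. The selection score, privacy accounting, and generation step below are stand-ins of ours, not AIM's actual mechanisms.

```python
import numpy as np

rng = np.random.default_rng(2)

def select_measure_generate(data, candidates, rounds, sigma):
    """Schematic select-measure-generate loop (a stand-in, not AIM itself).

    data: 2D integer array of categorical codes
    candidates: list of column-index lists (candidate marginal queries)
    sigma: Gaussian mechanism noise scale (privacy accounting omitted)
    """
    measurements = []
    for _ in range(rounds):
        # SELECT: AIM scores candidates by workload relevance and by how
        # poorly the current model approximates them; we pick at random.
        q = candidates[rng.integers(len(candidates))]
        # MEASURE: noisy counts of the chosen marginal (Gaussian mechanism)
        _, counts = np.unique(data[:, q], axis=0, return_counts=True)
        measurements.append((q, counts + rng.normal(0, sigma, len(counts))))
    # GENERATE: AIM fits a graphical model (Private-PGM) to the noisy
    # measurements and samples synthetic records from it; omitted here.
    return measurements

toy = rng.integers(0, 3, size=(100, 4))
print(select_measure_generate(toy, [[0, 1], [1, 2], [2, 3]], rounds=3, sigma=2.0))
```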
Article
Recently, several organizations have considered using differentially private algorithms for disclosure limitation when releasing count data. The typical approach is to add random noise to the counts sampled from, for example, a Laplace distribution or symmetric geometric distribution. One advantage of this approach, at least for some differentially private algorithms, is that analysts know the noise distribution and hence have the opportunity to account for it when making inferences about the true counts. In this article, we present Bayesian inference procedures to estimate the posterior distribution of a subset proportion, that is, a ratio of two counts, given the released values. We illustrate the methods under several scenarios, including when the released counts come from surveys or censuses. Using simulations, we show that the Bayesian procedures can result in accurate inferences with close to nominal coverage rates.
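The noise-aware inference described here can be illustrated with a simple grid posterior over a single true count. This is a generic sketch under a flat prior, not the article's exact procedure; a subset proportion would be handled analogously with a joint grid over the numerator and denominator counts.

```python
import numpy as np

def posterior_true_count(released, eps, n_max):
    """Grid posterior over a true count given a Laplace(1/eps)-noised release.

    Flat prior over 0..n_max; p(released | true) is the Laplace density,
    which the analyst knows exactly because the DP mechanism is public.
    """
    true = np.arange(n_max + 1)
    log_lik = -eps * np.abs(released - true)   # Laplace(scale=1/eps) kernel
    post = np.exp(log_lik - log_lik.max())
    return true, post / post.sum()

# A released value of 17.3 under eps = 0.5: mass concentrates near 17
true, post = posterior_true_count(17.3, eps=0.5, n_max=100)
print(true[np.argmax(post)])
```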
Article
Full-text available
We propose a general approach for differentially private synthetic data generation, that consists of three steps: (1) select a collection of low-dimensional marginals, (2) measure those marginals with a noise addition mechanism, and (3) generate synthetic data that preserves the measured marginals well. Central to this approach is Private-PGM, a post-processing method that is used to estimate a high-dimensional data distribution from noisy measurements of its marginals. We present two mechanisms, NIST-MST and MST, that are instances of this general approach. NIST-MST was the winning mechanism in the 2018 NIST differential privacy synthetic data competition, and MST is a new mechanism that can work in more general settings, while still performing comparably to NIST-MST. We believe our general approach should be of broad interest, and can be adopted in future mechanisms for synthetic data generation.
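Step (2) is standard noise addition; the hard part is step (3). As a toy stand-in for the post-processing idea, and deliberately much simpler than Private-PGM, one can at least project each noisy marginal back to a valid set of counts.

```python
import numpy as np

def sanitize_marginal(noisy_counts, n_total):
    """Toy post-processing of one noisy marginal (NOT Private-PGM):
    clip negative counts, then rescale to sum to n_total. Private-PGM
    instead estimates a single joint distribution consistent with all
    noisy marginals at once, which is what makes step (3) effective."""
    clipped = np.clip(np.asarray(noisy_counts, float), 0, None)
    return clipped * n_total / clipped.sum()

noisy = [48.2, -3.1, 31.7, 25.0]   # Gaussian-noised counts, n_total = 100
print(sanitize_marginal(noisy, 100).round(1))
```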
Article
To avoid disclosures, Rubin proposed creating multiple, synthetic data sets for public release so that (i) no unit in the released data has sensitive data from an actual unit in the population, and (ii) statistical procedures that are valid for the original data are valid for the released data. In this article, I show through simulation studies that valid inferences can be obtained from synthetic data in a variety of settings, including simple random sampling, probability proportional to size sampling, two-stage cluster sampling, and stratified sampling. I also provide guidance on specifying the number and size of synthetic data sets and demonstrate the benefit of including design variables in the released data sets.
Article
Multiple imputation was first conceived as a tool that statistical agencies could use to handle nonresponse in large sample, public use surveys. In the last two decades, the multiple imputation framework has been adapted for other statistical contexts. As examples, individual researchers use multiple imputation to handle missing data in small samples; statistical agencies disseminate multiply-imputed datasets for purposes of protecting data confidentiality; and, survey methodologists and epidemiologists use multiple imputation to correct for measurement errors. In some of these settings, Rubin's original rules for combining the point and variance estimates from the multiply-imputed datasets are not appropriate, because what is known, and therefore what enters the conditional expectations and variances used to derive inferential methods, differs from the missing data context. These applications require new combining rules and methods of inference. In fact, more than ten combining rules exist in the