Article
PDF available

How Can We Analyze Differentially-Private Synthetic Datasets?


Abstract

Synthetic datasets generated within the multiple imputation framework are now commonly used by statistical agencies to protect the confidentiality of their respondents. More recently, researchers have also proposed techniques to generate synthetic datasets that offer the formal guarantee of differential privacy. While combining rules were derived for the first type of synthetic datasets, little has been said on the analysis of differentially-private synthetic datasets generated with multiple imputations. In this paper, we show that the usual combining rules cannot be used to analyze synthetic datasets that were generated to achieve differential privacy. We consider specifically the case of generating synthetic count data with the beta-binomial synthesizer, and illustrate our discussion with simulation results. As a simple alternative, we also propose a Bayesian model that explicitly models the mechanism for synthetic data generation.
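To make the mechanism concrete, here is a minimal sketch of beta-binomial synthesis for a single confidential count, assuming the synthetic count is drawn from the posterior predictive under a Beta(α, β) prior. The parameter values and function name are illustrative only; in the differentially-private version studied in the paper, the prior must be strong enough to bound the ratio of release probabilities between neighboring datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

def beta_binomial_synthesize(x, n, alpha, beta, rng=rng):
    """Draw a synthetic count from the beta-binomial posterior predictive.

    x: confidential count of 'successes' out of n records.
    alpha, beta: prior pseudo-counts (illustrative values; the DP variant
    requires them to be large enough to smooth the posterior).
    """
    p = rng.beta(x + alpha, n - x + beta)  # posterior draw of the proportion
    return rng.binomial(n, p)              # synthetic count

# Example: five synthetic replicates of one confidential count.
x_orig, n = 37, 100
print([beta_binomial_synthesize(x_orig, n, alpha=5.0, beta=5.0) for _ in range(5)])
```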
... One goal to work toward is to require all validations and all synthetic data generators to satisfy ε-DP, and rely on composition properties to provide bounds on the risk. To our knowledge, this is currently not possible (at acceptably low levels of error) with state-of-the-art techniques for generating ε-DP synthetic data (e.g., Barak et al., 2007; Abowd and Vilhuber, 2008; Blum et al., 2008; Machanavajjhala et al., 2008; Charest, 2010; Hardt et al., 2012; Mir et al., 2013; Karwa and Slavkovic, 2015) for data with the dimensionality and complexity of the OPM data. However, work is ongoing. ...
Preprint
Data stewards seeking to provide access to large-scale social science data face a difficult challenge. They have to share data in ways that protect privacy and confidentiality, are informative for many analyses and purposes, and are relatively straightforward to use by data analysts. One approach suggested in the literature is that data stewards generate and release synthetic data, i.e., data simulated from statistical models, while also providing users access to a verification server that allows them to assess the quality of inferences from the synthetic data. We present an application of the synthetic data plus verification server approach to longitudinal data on employees of the U.S. federal government. As part of the application, we present a novel model for generating synthetic career trajectories, as well as strategies for generating high dimensional, longitudinal synthetic datasets. We also present novel verification algorithms for regression coefficients that satisfy differential privacy. We illustrate the integrated use of synthetic data plus verification via analysis of differentials in pay by race. The integrated system performs as intended, allowing users to explore the synthetic data for potential pay differentials and learn through verifications which findings in the synthetic data hold up and which do not. The analysis on the confidential data reveals pay differentials across races not documented in published studies.
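The paper's verification algorithms are not reproduced here, but the general shape of a differentially private verification measure can be sketched: compute a bounded agreement statistic between the confidential and synthetic regression estimates, then release it through a standard mechanism. A minimal illustration under the assumption that a 0/1 sign-match indicator has sensitivity 1 (the function names and parameters are hypothetical, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(1)

def laplace_release(value, sensitivity, epsilon, rng=rng):
    """Standard Laplace mechanism: value plus noise of scale sensitivity/epsilon."""
    return value + rng.laplace(scale=sensitivity / epsilon)

def verify_sign_match(beta_confidential, beta_synthetic, epsilon):
    """Noisily report whether the confidential and synthetic coefficients agree
    in sign. The indicator lies in {0, 1}, so its sensitivity is at most 1 and
    Laplace noise of scale 1/epsilon yields an epsilon-DP release."""
    match = float(np.sign(beta_confidential) == np.sign(beta_synthetic))
    return laplace_release(match, sensitivity=1.0, epsilon=epsilon)
```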
... There is a growing collection of mechanisms and case studies for differentially private release of data [1,8,36,33,49,10,22], although some of these are based on a broad view of data release, such as the release of histograms or contingency tables. Our use of plausible deniability to achieve differentially private data adds to this body of work. ...
Preprint
Releasing full data records is one of the most challenging problems in data privacy. On the one hand, many of the popular techniques such as data de-identification are problematic because of their dependence on the background knowledge of adversaries. On the other hand, rigorous methods such as the exponential mechanism for differential privacy are often computationally impractical to use for releasing high dimensional data or cannot preserve high utility of original data due to their extensive data perturbation. This paper presents a criterion called plausible deniability that provides a formal privacy guarantee, notably for releasing sensitive datasets: an output record can be released only if a certain amount of input records are indistinguishable, up to a privacy parameter. This notion does not depend on the background knowledge of an adversary. Also, it can efficiently be checked by privacy tests. We present mechanisms to generate synthetic datasets with similar statistical properties to the input data and the same format. We study this technique both theoretically and experimentally. A key theoretical result shows that, with proper randomization, the plausible deniability mechanism generates differentially private synthetic data. We demonstrate the efficiency of this generative technique on a large dataset; it is shown to preserve the utility of original data with respect to various statistical analysis and machine learning measures.
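A minimal sketch of the plausible deniability test just described: a synthetic record generated from one seed record is releasable only if at least k input records could have generated it with comparable probability, up to a factor gamma. The callable gen_prob is an assumed stand-in for the generative model's seeded output probability:

```python
def passes_plausible_deniability(y, seed_idx, records, gen_prob, k, gamma):
    """Privacy test for a synthetic record y generated from records[seed_idx].

    Release y only if at least k input records are 'plausible seeds', i.e.,
    their probability of generating y is within a factor gamma of the true
    seed's probability. gen_prob(record, y) returns Pr[y | seed = record].
    """
    p_seed = gen_prob(records[seed_idx], y)
    plausible = sum(
        1 for d in records
        if p_seed / gamma <= gen_prob(d, y) <= p_seed * gamma
    )
    return plausible >= k
```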
... Originally developed as a way to protect the privacy of summary statistics (queries), it soon expanded as a way to protect entire data sets. Differentially private data synthesis (DIPS) has since become a popular area of research; see, for example, Abowd and Vilhuber (2008); Machanavajjhala et al. (2008); Charest (2011); McClure and Reiter (2012); Bowen and Liu (2020); Quick (2021); Drechsler (2023). ...
Preprint
Full-text available
We show that differential-privacy-type guarantees can be obtained when using a Poisson synthesis mechanism to protect counts in contingency tables. Specifically, we show how to obtain (ε, δ)-probabilistic differential privacy guarantees via the Poisson distribution's cumulative distribution function. We demonstrate this empirically with the synthesis of an administrative-type confidential database.
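The synthesis step itself is easy to sketch: every confidential cell count is replaced by a Poisson draw centered at it. The (ε, δ) accounting via the Poisson cumulative distribution function is the paper's contribution and is not reproduced in this illustrative snippet:

```python
import numpy as np

rng = np.random.default_rng(2)

def poisson_synthesize(table, rng=rng):
    """Replace each confidential cell count with a Poisson draw whose mean
    is the original count; the privacy guarantee comes from this noise."""
    return rng.poisson(lam=np.asarray(table, dtype=float))

# Example 2x2 contingency table of confidential counts.
print(poisson_synthesize(np.array([[12, 30], [7, 51]])))
```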
... However, it still remains that the synthetic data do not follow the same distribution as the original dataset, and the combining rules are often designed for only specific statistics. Furthermore, in the case of differentially private synthetic data, it has been shown that traditional combining rules do not give valid inference, making the problem more complicated (Charest, 2011). ...
... Ideally, we wish to leverage the full value of the sensitive data whilst still maintaining individual privacy. This goal is driving new lines of research in statistics, information theory, and machine learning [4,6,2]. ...
Preprint
Full-text available
Machine Learning (ML) is accelerating progress across fields and industries, but relies on accessible and high-quality training data. Some of the most important datasets are found in biomedical and financial domains in the form of spreadsheets and relational databases. But this tabular data is often sensitive in nature. Synthetic data generation offers the potential to unlock sensitive data, but generative models tend to memorise and regurgitate training data, which undermines the privacy goal. To remedy this, researchers have incorporated the mathematical framework of Differential Privacy (DP) into the training process of deep neural networks. But this creates a trade-off between the quality and privacy of the resulting data. Generative Adversarial Networks (GANs) are the dominant paradigm for synthesising tabular data under DP, but suffer from unstable adversarial training and mode collapse, which are exacerbated by the privacy constraints and challenging tabular data modality. This work optimises the quality-privacy trade-off of generative models, producing higher quality tabular datasets with the same privacy guarantees. We implement novel end-to-end models that leverage attention mechanisms to learn reversible tabular representations. We also introduce TableDiffusion, the first differentially-private diffusion model for tabular data synthesis. Our experiments show that TableDiffusion produces higher-fidelity synthetic datasets, avoids the mode collapse problem, and achieves state-of-the-art performance on privatised tabular data synthesis. By implementing TableDiffusion to predict the added noise, we enabled it to bypass the challenges of reconstructing mixed-type tabular data. Overall, the diffusion paradigm proves vastly more data and privacy efficient than the adversarial paradigm, due to augmented re-use of each data batch and a smoother iterative training process.
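The "DP in the training process" referred to above usually means DP-SGD: clip each example's gradient to a fixed norm, average, and add Gaussian noise calibrated to that norm. A minimal NumPy sketch of the sanitization step (not the TableDiffusion implementation; in practice the noise multiplier is set by a privacy accountant):

```python
import numpy as np

rng = np.random.default_rng(3)

def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng=rng):
    """Sanitize a batch of per-example gradients, shape (batch, n_params):
    clip each row to clip_norm, average, and add Gaussian noise with
    standard deviation noise_multiplier * clip_norm / batch_size."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    noise = rng.normal(scale=noise_multiplier * clip_norm,
                       size=per_example_grads.shape[1])
    return clipped.mean(axis=0) + noise / len(per_example_grads)
```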
... Other early applications include Eno and Thompson (2008); Cano et al. (2010); Blum et al. (2011); Xiao et al. (2011). Several papers also explicitly adapted the ideas from the statistical community to the DP context (Machanavajjhala et al., 2008; Charest, 2011; McClure and Reiter, 2012a). The approach of Machanavajjhala et al. (2008) was later extended in Quick (2021) and Quick (2022). ...
Preprint
Full-text available
The idea of generating synthetic data as a tool for broadening access to sensitive microdata was first proposed three decades ago. While the first applications of the idea emerged around the turn of the century, the approach really gained momentum over the last ten years, stimulated at least in part by recent developments in computer science. We consider the upcoming 30th jubilee of Rubin's seminal paper on synthetic data (Rubin, 1993) as an opportunity to look back at the historical developments, but also to offer a review of the diverse approaches and methodological underpinnings proposed over the years. We also discuss the various strategies that have been suggested to measure the utility and remaining disclosure risk of the generated data.
Article
In private data publishing, a promising solution is generating synthetic data that enables arbitrary queries on the private dataset while satisfying differential privacy. Over the past decade, researchers have mainly focused on improving the query accuracy of synthetic data. However, the limitations of existing works restrict them from achieving a better trade-off between accuracy and privacy. In this paper, we propose ABSyn, a novel scheme for differentially private data synthesis. Under the Select-Measure-Generate paradigm, ABSyn uses an adaptive mechanism to precisely select marginals and measures them in batches. Our adaptive-batch scheme provides a well-selected marginal set and an optimal allocation of the privacy budget, which lets the synthetic data achieve high accuracy without compromising privacy. We implement an efficient prototype of ABSyn and compare it with existing works on public datasets. Experimental results show that ABSyn improves query accuracy on synthetic datasets by a factor of 1.26× and efficiency by a factor of 18.60× over the state-of-the-art scheme on average.
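Generically, the Select-Measure-Generate paradigm named here proceeds as: select a collection of marginals, measure each with noise calibrated to its share of the privacy budget, and generate records from a model fit to the noisy measurements. The sketch below uses a fixed pairwise selection, a uniform budget split, and a deliberately trivial generator; ABSyn's adaptive-batch selection and budget allocation are its contribution and are not reproduced here:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)

def select_measure_generate(data, columns, epsilon, n_synth, rng=rng):
    """Schematic Select-Measure-Generate over pairwise marginals.
    data: dict mapping column name -> integer-coded numpy array.
    Assumes add/remove neighbors, so each marginal has sensitivity 1."""
    marginals = list(combinations(columns, 2))       # Select (fixed here)
    eps_each = epsilon / len(marginals)              # uniform budget split
    measured = {}
    for a, b in marginals:                           # Measure
        counts = np.zeros((data[a].max() + 1, data[b].max() + 1))
        np.add.at(counts, (data[a], data[b]), 1)
        measured[(a, b)] = counts + rng.laplace(scale=1.0 / eps_each,
                                                size=counts.shape)
    # Generate: a real system fits a graphical model to all noisy marginals;
    # here we just resample one clipped, normalized noisy marginal.
    noisy = np.clip(measured[marginals[0]], 0, None)
    probs = (noisy / noisy.sum()).ravel()
    idx = rng.choice(probs.size, size=n_synth, p=probs)
    return np.unravel_index(idx, noisy.shape)
```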
Article
Full-text available
Background: Synthetic data have been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data, while protecting the privacy of the individual subjects. Differential Privacy (DP) is currently considered the gold standard approach for balancing this trade-off.
Objectives: The aim of this study is to investigate how trustworthy the group differences discovered by independent sample tests on DP-synthetic data are. The evaluation is carried out in terms of the tests' Type I and Type II errors. The former quantifies the tests' validity, i.e., whether the probability of false discoveries is indeed below the significance level, and the latter indicates the tests' power in making real discoveries.
Methods: We evaluate the Mann–Whitney U test, Student's t-test, chi-squared test, and median test on DP-synthetic data. The private synthetic datasets are generated from real-world data, including a prostate cancer dataset (n = 500) and a cardiovascular dataset (n = 70,000), as well as from bivariate and multivariate simulated data. Five different DP-synthetic data generation methods are evaluated, including two basic DP histogram release methods and the MWEM, Private-PGM, and DP GAN algorithms.
Conclusion: A large portion of the evaluation results exhibited dramatically inflated Type I errors, especially at levels of ε ≤ 1. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy. A DP smoothed-histogram-based synthetic data generation method was shown to produce valid Type I errors at all privacy levels tested, but required a large original dataset and a modest privacy budget (ε ≥ 5) in order to achieve reasonable Type II error levels.
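The study's Type I error evaluation can be reproduced in miniature: synthesize two groups that truly come from the same distribution through a noisy-histogram mechanism, test for a difference, and count false rejections. A toy sketch with illustrative parameters (the Laplace scale assumes histogram sensitivity 2 under record replacement):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def dp_histogram_synthesize(x, bins, epsilon, rng=rng):
    """Synthesize a sample: add Laplace noise to a histogram of x, clip
    negatives, renormalize, and resample from the bin midpoints."""
    counts, edges = np.histogram(x, bins=bins)
    noisy = np.clip(counts + rng.laplace(scale=2.0 / epsilon, size=counts.size),
                    0, None)
    mids = (edges[:-1] + edges[1:]) / 2
    return rng.choice(mids, size=len(x), p=noisy / noisy.sum())

def type_one_error_rate(n_reps=1000, n=200, epsilon=1.0):
    """Fraction of replications where a t-test on DP-synthetic data rejects
    at the 0.05 level even though both groups are drawn from N(0, 1)."""
    rejections = 0
    for _ in range(n_reps):
        a = dp_histogram_synthesize(rng.normal(size=n), 20, epsilon)
        b = dp_histogram_synthesize(rng.normal(size=n), 20, epsilon)
        rejections += stats.ttest_ind(a, b).pvalue < 0.05
    return rejections / n_reps
```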
Article
To avoid disclosures, Rubin proposed creating multiple, synthetic data sets for public release so that (i) no unit in the released data has sensitive data from an actual unit in the population, and (ii) statistical procedures that are valid for the original data are valid for the released data. In this article, I show through simulation studies that valid inferences can be obtained from synthetic data in a variety of settings, including simple random sampling, probability proportional to size sampling, two-stage cluster sampling, and stratified sampling. I also provide guidance on specifying the number and size of synthetic data sets and demonstrate the benefit of including design variables in the released data sets.
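For reference, the combining rules for fully synthetic data used in these simulations (Raghunathan, Reiter and Rubin, 2003) are, writing q^{(i)} and u^{(i)} for the point estimate and its estimated variance from synthetic dataset i = 1, ..., m:

```latex
\bar{q}_m = \frac{1}{m}\sum_{i=1}^{m} q^{(i)}, \qquad
b_m = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(q^{(i)} - \bar{q}_m\bigr)^2, \qquad
\bar{u}_m = \frac{1}{m}\sum_{i=1}^{m} u^{(i)},
\qquad
T_f = \Bigl(1 + \frac{1}{m}\Bigr) b_m - \bar{u}_m .
```

Note that T_f can be negative in small samples, which is one reason guidance on the number and size of the synthetic datasets matters.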
Article
Multiple imputation was first conceived as a tool that statistical agencies could use to handle nonresponse in large-sample, public use surveys. In the last two decades, the multiple imputation framework has been adapted for other statistical contexts. As examples, individual researchers use multiple imputation to handle missing data in small samples; statistical agencies disseminate multiply-imputed datasets for purposes of protecting data confidentiality; and survey methodologists and epidemiologists use multiple imputation to correct for measurement errors. In some of these settings, Rubin's original rules for combining the point and variance estimates from the multiply-imputed datasets are not appropriate, because what is known (and therefore what enters the conditional expectations and variances used to derive inferential methods) differs from the missing data context. These applications require new combining rules and methods of inference. In fact, more than ten combining rules exist in the ...
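For contrast, Rubin's original rule for nonresponse, which as explained above cannot simply be reused in these new settings, estimates the variance of the combined point estimate q̄_m as

```latex
T_m = \bar{u}_m + \Bigl(1 + \frac{1}{m}\Bigr) b_m .
```

The different role of ū_m here versus in the fully synthetic rule above illustrates why each new use of the framework requires rederiving the combining rules from what is actually observed.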