Figure 2 - uploaded by Li Xiong
Synthetic data generation 

Source publication
Article
Differential privacy has recently emerged in private statistical data release as one of the strongest privacy guarantees. Most of the existing techniques that generate differentially private histograms or synthetic data only work well for single dimensional or low-dimensional histograms. They become problematic for high dimensional and large domain...

Contexts in source publication

Context 1
... mechanisms (e.g. [14,18,29]) have been proposed for achieving differential privacy for a single computation or a given analytical task and programming platforms have been implemented for supporting interactive differentially private queries or data analysis [28]. Due to the composability of differential privacy [28], given an overall privacy budget constraint, it has to be allocated to subroutines in the computation or each query in a query sequence to ensure the overall privacy. After the budget is exhausted, the database cannot be used for further queries or computations. This is especially challenging in the scenario where multiple users need to pose a large number of queries for exploratory analysis. Several works started addressing effective query answering in the interactive setting with differential privacy given a query workload or batch queries by considering the correlations between queries or query history [38,8,43,23,42]. A growing number of works started addressing non-interactive data release with differential privacy (e.g. [5,27,39,19,12,41,9,10]). Given an original dataset, the goal is to publish a DP statistical summary such as marginal or multi-dimensional histograms that can be used to answer predicate queries or to generate DP synthetic data that mimic the original data. For example, Figure 1 shows an example dataset and a one-dimensional marginal histogram for the attribute age. The main approaches of existing work can be illustrated by Figure 2(a) and classified into two categories: 1) parametric methods that fit the original data to a multivariate distribution and make inferences about the parameters of the distribution (e.g. [27]). 2) non-parametric methods that learn empirical distributions from the data through histograms (e.g. [19,41,9,10]). Most of these work well for single dimensional or low-order data, but become problematic for data with high dimensions and large attribute domains.
This is due to the facts that: 1) The underlying distribution of the data may be unknown in many cases or different from the assumed distribution, especially for data with arbitrary margins and high dimensions, leading the synthetic data generated by the parametric methods not ...
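The budget allocation and histogram release described in this context can be illustrated with a minimal sketch. It is not the source publication's method: it assumes the Laplace mechanism, add/remove-one-record neighbouring datasets (so each histogram has L1 sensitivity 1), and an even split of the budget under sequential composition. The data, attribute names, bin edges, and function names are all hypothetical.

```python
import numpy as np

def split_budget(total_epsilon, n_queries):
    """Evenly split a total budget; by sequential composition the parts add up."""
    return [total_epsilon / n_queries] * n_queries

def dp_histogram(values, bin_edges, epsilon, rng):
    """Histogram counts plus Laplace noise of scale 1/epsilon.

    Assumes add/remove-one-record neighbours, so one record changes a single
    bin count by 1 (L1 sensitivity 1).
    """
    counts, _ = np.histogram(values, bins=bin_edges)
    return counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)

rng = np.random.default_rng(0)

# Hypothetical toy dataset with two attributes.
ages = rng.integers(18, 90, size=1000)
hours = rng.integers(0, 60, size=1000)
age_edges = np.arange(18, 91, 8)
hour_edges = np.arange(0, 61, 10)

# Two releases share a total budget of 1.0.
budgets = split_budget(1.0, 2)
noisy_age = dp_histogram(ages, age_edges, budgets[0], rng)
noisy_hours = dp_histogram(hours, hour_edges, budgets[1], rng)
```

Noisy counts can be negative or non-integer, which is why histogram-based release typically needs a post-processing step before the counts are used for sampling.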
Context 2
... We propose a DPCopula framework to generate high dimensional and large domain DP synthetic data. It computes a DP copula function and samples synthetic data from the function that effectively captures the dependence implicit in the high-dimensional datasets. With the copula functions, we can separately consider the margins and the joint dependence structure of the original data instead of modeling the joint distribution of all dimensions as shown in Figure 2(b). The DPCopula framework allows direct sampling for the synthetic data from the margins and the copula function. Although existing histogram techniques can be used to generate DP synthetic data, post-processing is required to enforce non-negative histogram counts or consistencies between counts which results in either degraded accuracy or high computation ...
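The separation of margins from dependence structure can be sketched with a plain Gaussian copula. This is an illustration of the general idea only and is NOT differentially private: DPCopula computes its margins and copula parameters under DP, whereas the sketch below uses the raw data directly. The function names, the rank-based pseudo-observations, and the toy data are assumptions made for the example.

```python
import numpy as np
from scipy.stats import norm, rankdata

def fit_gaussian_copula(data):
    """Estimate the copula correlation matrix from normal scores of the ranks."""
    n, d = data.shape
    # Pseudo-observations in (0, 1); dividing by n + 1 avoids exact 0 and 1.
    u = np.column_stack([rankdata(data[:, j]) / (n + 1) for j in range(d)])
    z = norm.ppf(u)
    return np.corrcoef(z, rowvar=False)

def sample_gaussian_copula(data, corr, n_samples, rng):
    """Sample dependence from the copula, then map through empirical margins."""
    d = data.shape[1]
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u = norm.cdf(z)
    # Empirical quantile function of each margin (non-private in this sketch).
    return np.column_stack(
        [np.quantile(data[:, j], u[:, j]) for j in range(d)]
    )

rng = np.random.default_rng(1)

# Hypothetical toy data: two positively dependent, non-normal attributes.
x = rng.normal(size=2000)
y = 0.8 * x + 0.6 * rng.normal(size=2000)
original = np.column_stack([np.exp(x), np.exp(y)])

corr = fit_gaussian_copula(original)
synthetic = sample_gaussian_copula(original, corr, 500, rng)
```

Because the margins are handled separately, the number of dependence parameters grows with the number of attribute pairs rather than with the size of the full joint domain.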

Citations

... Both MWEM and Dual query need to work on the workload of queries in advance, which may cause them to perform unsatisfactorily in tasks other than query tasks. DP-Copula [28] is applied to continuous datasets, modeling the multivariate distribution using the Copula function. However, categorical data cannot be modeled with Gaussian Copula. ...
Article
Government agencies respond to policies on open government data and develop the innovation of services by releasing structured microdata. Before release, privacy protection is necessary to mitigate privacy disclosure risks. Synthetic Data Generation technique has attracted more and more attention in the field of privacy protection. The limitations of synthesizing comprehensive data are imposed owing to the nature of isolated government microdata. To release synthetic microdata from multiple government departments while keeping each department’s control over its local data, a framework for releasing privacy-preserving synthetic data is established, and a Bayesian network-based approach GovSynBayes is proposed. Firstly, the count histograms of multidimensional attributes from multi-department data sources are generated through federated queries. Secondly, the histograms are utilized to construct differentially private Bayesian networks. Finally, the synthesized microdata is sampled and released via the generated private Bayesian networks. It is validated that the exponential mechanism is superior for private network learning compared to the Laplace mechanism. A privacy budget allocation algorithm named consistent signal-to-noise for private distribution learning is proposed, which efficiently reduces the average variation distance between synthetic data and original data. GovSynBayes has been experimentally evaluated on four real government micro-datasets. Compared with the currently popular private synthesizers, GovSynBayes has better computational efficiency and generates synthetic dataset with better attribute correlation and machine learning utility.
... Sklar introduced copula functions [77], which can model multidimensional joint distributions through marginal distribution and correlation frameworks [78]. Multiple copula functions can be used to build multidimensional joint distributions of climate factors and drought. ...
Article
Global climate change increasingly impacts agroecosystems, particularly through high-temperature–drought and low-temperature–drought compound events. This study uses ground meteorological and remote sensing data and employs geostatistics, random forest models, and copula methods to analyze the spatial and temporal distribution of these events and their impact on winter wheat in the Huang-Huai-Hai Plain from 1982 to 2020. High-temperature–drought events increased in frequency and expanded from north to south, with about 40% of observation stations recording such events from 2001 to 2020. In contrast, low-temperature–drought events decreased in frequency, affecting up to 80% of stations, but with lower frequency than high-temperature–drought events. Sensitivity analyses show winter wheat is most responsive to maximum and minimum temperature changes, with significant correlations to drought and temperature extremes. Copula analysis indicates temperature extremes and drought severity are crucial in determining compound event probability and return periods. High-temperature–drought events are likely under high temperatures and mild drought, while low-temperature–drought events are more common under low temperatures and mild drought. These findings highlight the need for effective agricultural adaptation strategies to mitigate future climate change impacts.
... There is a vast literature of DP techniques for tabular synthetic data generation, namely, copulas [10,30,47], graphical models [16,52,55,82,83], workload/query based [11,50,78], Variational Autoencoders (VAEs) [2,5,72], Generative Adversarial Networks (GANs) [6,29,44,51,73,80,84], and other approaches [18,33,85]. ...
Preprint
Generative models trained with Differential Privacy (DP) are increasingly used to produce synthetic data while reducing privacy risks. Navigating their specific privacy-utility tradeoffs makes it challenging to determine which models would work best for specific settings/tasks. In this paper, we fill this gap in the context of tabular data by analyzing how DP generative models distribute privacy budgets across rows and columns, arguably the main source of utility degradation. We examine the main factors contributing to how privacy budgets are spent, including underlying modeling techniques, DP mechanisms, and data dimensionality. Our extensive evaluation of both graphical and deep generative models sheds light on the distinctive features that render them suitable for different settings and tasks. We show that graphical models distribute the privacy budget horizontally and thus cannot handle relatively wide datasets while the performance on the task they were optimized for monotonically increases with more data. Deep generative models spend their budget per iteration, so their behavior is less predictable with varying dataset dimensions but could perform better if trained on more features. Also, low levels of privacy ($\epsilon \geq 100$) could help some models generalize, achieving better results than without applying DP.
... Zhang et al. (2014, 2017) proposed an approach that uses Bayesian networks to synthesize high-dimensional datasets, called PrivBayes. In parallel, Li et al. (2014) employed Copula functions to take into account the dependency structure of the data (DPCopula). DP guarantees were also integrated in Generative Adversarial Networks (GANs) later (Xie et al., 2018; Yoon et al., 2019). ...
Preprint
The idea to generate synthetic data as a tool for broadening access to sensitive microdata has been proposed for the first time three decades ago. While first applications of the idea emerged around the turn of the century, the approach really gained momentum over the last ten years, stimulated at least in parts by some recent developments in computer science. We consider the upcoming 30th jubilee of Rubin's seminal paper on synthetic data (Rubin, 1993) as an opportunity to look back at the historical developments, but also to offer a review of the diverse approaches and methodological underpinnings proposed over the years. We will also discuss the various strategies that have been suggested to measure the utility and remaining risk of disclosure of the generated data.
... DP data synthesis has been extensively studied over recent years as one of the solutions for privacy-preserving data publishing. Previous statistics-based works [38,68] computed joint distributions of original structured data under DP guarantees and used them to generate synthetic datasets. However, these methods can only be applied to structured data and may suffer from a significant utility loss with the increase in data dimensionality. ...
Article
In the era of big data, user data are often vertically partitioned and stored at different local parties. Exploring the data from all the local parties would enable data analysts to gain a better understanding of the user population from different perspectives. However, the publication of vertically-partitioned data faces a dilemma: on the one hand, the original data cannot be directly shared by local parties due to privacy concerns; on the other hand, independently privatizing the local datasets before publishing may break the potential correlation between the cross-party attributes and lead to a significant utility loss. Prior solutions compute the privatized multivariate distributions of different attribute sets for constructing a synthetic integrated dataset. However, these algorithms are only applicable for low-dimensional structured data and may suffer from large utility loss with the increase in data dimensionality. Following the idea of synthetic data generation, we propose VertiGAN, the first framework based on a generative adversarial network (GAN) for publishing vertically-partitioned data with privacy protection. The framework adopts a GAN model comprised of one multi-output global generator and multiple local discriminators. The generator is collaboratively trained by the server and local parties to learn the distribution of all parties' local data and is used to generate a high-utility synthetic integrated dataset on the server side. Additionally, we apply differential privacy (DP) during the training process to ensure strict privacy guarantees for the local data. We evaluate the framework's performance on a number of real-world datasets containing 68--1501 classification attributes and show that our framework is more capable of capturing joint distributions and cross-attribute correlations compared to statistics-based baseline algorithms. 
Moreover, with a privacy guarantee of epsilon=8, our framework achieves around a 2%~15% improvement in classification accuracy compared to the baseline algorithms. Extensive experimental results demonstrate the capability and efficiency of our framework in synthesizing vertically-partitioned data while striking a satisfactory utility-privacy balance.
... where D(P) denotes the grid density distribution in a given set P, and (·) represents the Jensen-Shannon divergence between two distributions. • Query error is a popular measure for evaluating data synthesis algorithms ranging from tabular data to graph and location data [8,10,29]. We consider range queries of trajectories in a random spatial region, i.e., (P) returns the number of points in any trajectory of a specified set P that are within the spatial region. ...
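The two measures named in this snippet can be sketched as follows. The exact definitions are elided in the excerpt, so this is an assumption-laden illustration: the Jensen-Shannon divergence is the standard base-2 definition, the grid is a uniform grid over the unit square, and the range query simply counts points inside an axis-aligned rectangle. All function names and data are hypothetical.

```python
import numpy as np

def grid_density(points, x_edges, y_edges):
    """Fraction of points falling in each grid cell, flattened to a vector."""
    hist, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=[x_edges, y_edges])
    return (hist / hist.sum()).ravel()

def _kl(p, q):
    """Kullback-Leibler divergence in bits, with 0 * log(0) treated as 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2), which lies in [0, 1]."""
    m = 0.5 * (p + q)
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def range_count(points, x_min, x_max, y_min, y_max):
    """Number of points inside the axis-aligned region (bounds inclusive)."""
    inside = (
        (points[:, 0] >= x_min) & (points[:, 0] <= x_max)
        & (points[:, 1] >= y_min) & (points[:, 1] <= y_max)
    )
    return int(inside.sum())

rng = np.random.default_rng(2)

# Hypothetical point sets standing in for real and synthetic trajectory points.
real_points = rng.random((1000, 2))
synth_points = rng.random((1000, 2))
edges = np.linspace(0.0, 1.0, 5)

density_real = grid_density(real_points, edges, edges)
density_synth = grid_density(synth_points, edges, edges)
divergence = js_divergence(density_real, density_synth)
count_real = range_count(real_points, 0.2, 0.6, 0.2, 0.6)
```

A lower divergence and closer range counts between the real and synthetic sets indicate better preserved spatial density.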
Preprint
Trajectory data has the potential to greatly benefit a wide range of real-world applications, such as tracking the spread of disease through people's movement patterns and providing personalized location-based services based on travel preference. However, privacy concerns and data protection regulations have limited the extent to which this data is shared and utilized. To overcome this challenge, local differential privacy provides a solution by allowing people to share a perturbed version of their data, ensuring privacy as only the data owners have access to the original information. Despite its potential, existing point-based perturbation mechanisms are not suitable for real-world scenarios due to poor utility, dependence on external knowledge, high computational overhead, and vulnerability to attacks. To address these limitations, we introduce LDPTrace, a novel locally differentially private trajectory synthesis framework. Our framework takes into account three crucial patterns inferred from users' trajectories in the local setting, allowing us to synthesize trajectories that closely resemble real ones with minimal computational cost. Additionally, we present a new method for selecting a proper grid granularity without compromising privacy. Our extensive experiments using real-world data, various utility metrics and attacks, demonstrate the efficacy and efficiency of LDPTrace.
... While there are known algorithms that satisfy (i)-(iii) with proofs and empirically satisfy (iv) in simulations (see e.g., [15,19,27,30]), the challenge is to develop an algorithm that provably satisfies all four conditions. Ullman and Vadhan [18] showed that, assuming the existence of one-way functions, one cannot achieve (i)-(iv) even for d = 2, if we require in (iv) that all of the d-dimensional marginals be preserved accurately. ...
Article
The protection of private information is of vital importance in data-driven research, business and government. The conflict between privacy and utility has triggered intensive research in the computer science and statistics communities, who have developed a variety of methods for privacy-preserving data release. Among the main concepts that have emerged are anonymity and differential privacy. Today, another solution is gaining traction, synthetic data. However, the road to privacy is paved with NP-hard problems. In this paper, we focus on the NP-hard challenge to develop a synthetic data generation method that is computationally efficient, comes with provable privacy guarantees and rigorously quantifies data utility. We solve a relaxed version of this problem by studying a fundamental, but at first glance completely unrelated, problem in probability concerning the concept of covariance loss. Namely, we find a nearly optimal and constructive answer to the question how much information is lost when we take conditional expectation. Surprisingly, this excursion into theoretical probability produces mathematical techniques that allow us to derive constructive, approximately optimal solutions to difficult applied problems concerning microaggregation, privacy and synthetic data.
... As an alternative to these fully parametric approaches, [32] and [33] make use of classification and regression trees (CART), while more recently, [34,35,23,36,37,38] and others have used Bayesian networks, Generative Adversarial Networks or copulas to capture the underlying linear and non-linear relationships between the attributes. ...
... with $u \in [0, 1]^d$, $I \in \mathbb{R}^{d \times d}$ being the identity matrix, and $\phi^{-1}$ being the inverse cumulative distribution function of a standard normal distribution. $\Sigma$ is a positive semi-definite covariance matrix that we estimate based on Pearson's correlation coefficient $\rho$ [34]. ...
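The expression these symbols belong to is cut off in this excerpt and cannot be recovered from it. For orientation only, the standard Gaussian copula density uses the same ingredients; the form below is the textbook one, assumes that the matrix is an invertible correlation matrix, and may differ from the citing paper's exact notation.

```latex
c_{\Sigma}(u) = \frac{1}{\sqrt{\det \Sigma}}
  \exp\!\left( -\tfrac{1}{2}\, x^{\top} \left( \Sigma^{-1} - I \right) x \right),
\qquad
x = \left( \phi^{-1}(u_1), \ldots, \phi^{-1}(u_d) \right)^{\top},
\quad u \in [0, 1]^d .
```

When the matrix equals the identity, the exponent vanishes and the density is 1 everywhere, which corresponds to independent attributes.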
Preprint
Household survey programs around the world publish fine-granular georeferenced microdata to support research on the interdependence of human livelihoods and their surrounding environment. To safeguard the respondents' privacy, micro-level survey data is usually (pseudo)-anonymized through deletion or perturbation procedures such as obfuscating the true location of data collection. This, however, poses a challenge to emerging approaches that augment survey data with auxiliary information on a local level. Here, we propose an alternative microdata dissemination strategy that leverages the utility of the original microdata with additional privacy safeguards through synthetically generated data using generative models. We back our proposal with experiments using data from the 2011 Costa Rican census and satellite-derived auxiliary information. Our strategy reduces the respondents' re-identification risk for any number of disclosed attributes by 60-80% even under re-identification attempts.
... Yet one of the most appealing uses of differential privacy is the generation of synthetic data, which is a collection of records matching the input schema, intended to be broadly representative of the source data. Differentially private synthetic data is an active area of research [1,2,5,11,12,19,25,27,29,30,43,45,46,48-50,52-55] and has also been the basis for two competitions, hosted by the U.S. National Institute of Standards and Technology [40]. ...
Preprint
We propose AIM, a novel algorithm for differentially private synthetic data generation. AIM is a workload-adaptive algorithm within the paradigm of algorithms that first select a set of queries, then privately measure those queries, and finally generate synthetic data from the noisy measurements. It uses a set of innovative features to iteratively select the most useful measurements, reflecting both their relevance to the workload and their value in approximating the input data. We also provide analytic expressions to bound per-query error with high probability, which can be used to construct confidence intervals and inform users about the accuracy of generated data. We show empirically that AIM consistently outperforms a wide variety of existing mechanisms across a variety of experimental settings.
... Differentially private synthetic data generation has been extensively studied over recent years as an alternative solution to privacy-preserving data publishing. Previous works ([36,57]) analyzed statistical distributions of original data under differential privacy and used them to generate synthetic data. Later works have proposed using differentially private generative models ([41,49]) to directly generate high-utility synthetic data. ...
Article
Business intelligence and AI services often involve the collection of copious amounts of multidimensional personal data. Since these data usually contain sensitive information of individuals, the direct collection can lead to privacy violations. Local differential privacy (LDP) is currently considered a state-of-the-art solution for privacy-preserving data collection. However, existing LDP algorithms are not applicable to high-dimensional data; not only because of the increase in computation and communication cost, but also poor data utility. In this paper, we aim at addressing the curse-of-dimensionality problem in LDP-based high-dimensional data collection. Based on the idea of machine learning and data synthesis, we propose DP-Fed-Wae, an efficient privacy-preserving framework for collecting high-dimensional categorical data. With the combination of a generative autoencoder, federated learning, and differential privacy, our framework is capable of privately learning the statistical distributions of local data and generating high utility synthetic data on the server side without revealing users’ private information. We have evaluated the framework in terms of data utility and privacy protection on a number of real-world datasets containing 68–124 classification attributes. We show that our framework outperforms the LDP-based baseline algorithms in capturing joint distributions and correlations of attributes and generating high-utility synthetic data. With a local privacy guarantee ε = 8, the machine learning models trained with the synthetic data generated by the baseline algorithm cause an accuracy loss of 10% ~ 30%, whereas the accuracy loss is significantly reduced to less than 3% and at best even less than 1% with our framework. Extensive experimental results demonstrate the capability and efficiency of our framework in synthesizing high-dimensional data while striking a satisfactory utility-privacy balance.