Towards Improving Privacy of Synthetic
DataSets
Aditya Kuppa 1,2, Lamine Aouad 2, and Nhien-An Le-Khac 1
1 UCD School of Computing, Dublin, Ireland
2 Tenable Network Security
aditya.kuppa@ucdconnect.ie,
laouad@tenable.com,
an.lekhac@ucd.ie
Abstract. Recent growth in domain-specific applications of machine learning can be attributed to the availability of realistic public datasets. Real-world datasets may contain sensitive information about users, which makes them hard to share freely with other stakeholders and researchers due to regulatory and compliance requirements. Synthesising datasets from real data by leveraging generative techniques is gaining popularity. However, the privacy analysis of these datasets is still an open research problem. In this work, we fill this gap by investigating the privacy issues of the generated datasets from the attacker and auditor points of view. We propose an instance-level Privacy Score (PS) for each synthetic sample by measuring the memorisation coefficient α_m per sample. Leveraging PS, we empirically show that the accuracy of membership inference attacks on synthetic data drops significantly. PS is a model-agnostic, post-training measure, which not only gives the data sharer guidance about the privacy properties of a given sample but also helps third-party data auditors run privacy checks without sharing model internals. We tested our method on two real-world datasets and show that attack accuracy is reduced by PS-based filtering.
Keywords: Privacy Preserving Synthetic Data · Generative Adversarial Networks · Privacy Audit.
1 Introduction
Machine learning (ML) has been widely adopted in a variety of applications across multiple domains. Due to the maturity of large-scale training infrastructure [3] and development tools, and the availability of large amounts of training data, companies are building huge models with billions of parameters for commercial applications. The training data used to train these models often contains personal or sensitive data [1], aggregated from datasets that are used for purposes unknown to the data owner (i.e., the person themselves). Unfortunately, end-users who play an active role as data donors lack visibility into how this data is being used or misused by the applications built on top of the data collected. A
growing number of AI incidents (see https://incidentdatabase.ai/) show an upward trend of data and model misuses, which include discriminatory hiring decisions, behavioral manipulations, and unethical systems that lack transparency. To gain the trust of users and avoid the privacy implications of data sharing for end users, institutions [4–6] and companies [9–16] are adopting synthetic data [17] as a solution to avoid privacy issues.
Synthetic data as an idea was first adopted in designing various imputation methods in the early 1990s [17, 18]. Raghunathan et al. [32] formally defined a framework describing the properties of synthetic data, which was further enhanced by multiple other researchers [19–22, 33, 34]. Many recent works leverage traditional statistical techniques and machine learning models to generate synthetic data [23, 25, 26, 28–31].
Ideal synthetic data has properties such as handling large datasets with diverse data types (categorical, ordinal, continuous, skewed, etc.) and preserving semantic relationships between variables [26], and the underlying algorithms are fairly generalizable and need little tuning. ML models trained on synthetic data may have performance similar to models trained on the original data. Theoretical privacy guarantees can be achieved via "ideal" synthetic data, as a synthetic data record is not an actual data record of an individual but reflects the exact data distributions of the original data.
In the context of data sharing, the majority of recent methods employ generative models [39, 40, 47, 50, 51] to obtain the synthetic data from real data. Popular generative modeling frameworks such as Generative Adversarial Networks (GANs [40]) and Variational Auto-encoders (VAEs) use real data during training to train a model capable of generating (or "synthesising") "fake" data points. Quality metrics such as fidelity, which measures how "realistic" the synthetic data is compared to the real data, and diversity, which measures how well the generated samples cover the variability of the real data, are used to assess the quality of the synthetic data.
Measuring the generalization property of synthetic data is very important and often ignored by synthetic data generators. Most of the proposed methods optimize for fidelity and diversity while neglecting to measure or report how well the synthetic samples capture the overall distribution of the real data, including outliers, rather than being mere copies of the (real) training samples. Evaluating generalization in generative models is not easy [54, 55] compared to discriminative models, which are tested on held-out data.
From a compliance and regulation viewpoint, the General Data Protection Regulation (GDPR [42]) gives new rights to citizens regarding transparency around personal data. These include the right of access (knowing which data about the person has been stored) and the right to data erasure, also known as the right to be forgotten (having personal data deleted under certain circumstances). Privacy risks include the risk of re-identification and the risk of inference [35–37, 41]. When an attacker is able to identify and link individuals within a dataset or across multiple datasets, then these datasets pose the risk
of re-identification. These types of attacks can reveal the presence of a user [43] or that a user's data was used to train an ML model [44]. Deducing the private information of individuals with high confidence from the linked datasets or an ML model falls under the risk of inference.
Motivated by compliance needs, end-user privacy, and the growing popularity of synthetic data as a sharing mechanism both in industry and academia, in this work we aim to investigate the privacy properties of synthetic data generated by generative models. First, we propose a novel Membership Inference Attack (MIA), which uses the popular bagging technique to achieve the attacker's goal of identifying the membership of a user in the model trained on synthetic data. Next, an instance-level privacy score is calculated to filter out synthetic samples that could compromise the privacy of sensitive attributes in the synthetic data, protecting against MIAs.
2 Background
We present a brief background on membership inference attacks and generative
machine learning models.
Membership inference attack (MIA). MIA is one of the most popular privacy attacks against ML models [73–79]. The attacker's goal in membership inference (MI) is to determine whether a data sample x is part of the training dataset of a target model T. We define a membership inference attack model A_MemInf as a binary classifier:

$$\mathcal{A}_{\mathrm{MemInf}} : (x, T) \mapsto \{\text{member}, \text{non-member}\} \quad (1)$$
MIA poses severe privacy risks: given a model trained on data samples of people with sensitive information, an attacker can execute MIA to infer a person's membership, thereby revealing that person's sensitive information. To achieve MIA, Shokri et al. [73] train a number of shadow classifiers on confidence scores produced by the target model, with labels indicating whether samples came from the training or testing set. MIAs have been shown to work in white-box [57] as well as black-box scenarios against various target models, including ML-as-a-service [73] and generative models [56]. Yeom et al. [59] explore overfitting as the root cause of MI vulnerability.
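To make the shadow-model recipe above concrete, the following is a minimal sketch (not the implementation in [73]): shadow classifiers are trained on attacker-controlled data, their confidence vectors on in-training and out-of-training samples are labelled accordingly, and a binary attack model is fitted on those vectors. The names shadow_datasets and make_model are hypothetical placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_shadow_attack(shadow_datasets, make_model):
    """Train a membership inference attack classifier from shadow models.

    shadow_datasets: list of (X, y) arrays the attacker controls; all are
    assumed to share the same label set so confidence vectors align.
    make_model: callable returning a fresh classifier with predict_proba,
    mimicking the target model family.
    """
    attack_X, attack_y = [], []
    for X, y in shadow_datasets:
        # Half the data is "in" the shadow model's training set, half is "out".
        X_in, X_out, y_in, _ = train_test_split(X, y, test_size=0.5, stratify=y)
        shadow = make_model().fit(X_in, y_in)
        attack_X.append(shadow.predict_proba(X_in))   # members -> label 1
        attack_y.append(np.ones(len(X_in)))
        attack_X.append(shadow.predict_proba(X_out))  # non-members -> label 0
        attack_y.append(np.zeros(len(X_out)))
    # Binary attack model: confidence score vector -> member / non-member.
    return RandomForestClassifier(n_estimators=100).fit(
        np.vstack(attack_X), np.concatenate(attack_y))
```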
Similarly, a variant of MIA is attribute inference, in which a sensitive attribute can be inferred from the target model. This sensitive attribute is not related to the target ML model's original classification task. For instance, a target model may be designed to classify an individual's age from their social network posts, while attribute inference aims to infer their educational background [71, 72]. Some studies [58, 59, 64] attribute membership inference attacks to the generalization gap, the over-fitting of the model, and the data memorization capabilities of neural networks. Deep neural networks have been shown to memorize the training data [63, 69, 70] rather than learning the latent properties of the data, which means they often tend to over-fit to the training data. Weaknesses like this, in some cases, can be exploited by an adversary to infer data samples' sensitive attributes.
Given a data sample x and its representation from a target model, denoted by h = f(x), to conduct the attribute inference attack the adversary trains an attack model A_AttInf, formally defined as follows:

$$\mathcal{A}_{\mathrm{AttInf}} : h \mapsto s \quad (2)$$

where s represents the sensitive attribute.
Generative Adversarial Network (GAN). A GAN [40] is a type of adversarial training system in which two competing models are trained against each other. It comprises two components: a discriminator (D), which is trained to differentiate between real and generated data, and a generator (G), which creates fake data to fool the discriminator. The generator G takes a random noise vector Z and a parameter vector θ_g and learns to generate synthetic samples very similar to the real data. The resulting G(Z; θ_g) network is used to generate fake data with data distribution F_g.
Training GANs on real-world datasets is notoriously hard. One has to carefully tune both generator and discriminator network parameters at training time to avoid conditions like mode collapse and vanishing gradients; both of these conditions affect the quality of the generated synthetic data. The choice of loss functions and GAN architectures also plays a vital role in avoiding these problems [65].
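For illustration, a single alternating generator/discriminator update can be sketched as below. This is a generic PyTorch toy example under assumed layer sizes and dimensions, not the training code of any of the generators evaluated later.

```python
import torch
import torch.nn as nn

# Placeholder networks for a tabular feature vector of width d (assumed values).
d, z_dim = 25, 64
G = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, d), nn.Sigmoid())
D = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(real_batch):
    """One alternating update of the discriminator and the generator."""
    n = real_batch.size(0)
    # Discriminator: push real samples towards label 1, fakes towards 0.
    z = torch.randn(n, z_dim)
    fake = G(z).detach()
    loss_d = bce(D(real_batch), torch.ones(n, 1)) + bce(D(fake), torch.zeros(n, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: try to make the discriminator output 1 on fresh fakes.
    z = torch.randn(n, z_dim)
    loss_g = bce(D(G(z)), torch.ones(n, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```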
Tabular Generators. In this work, we deal in the feature space, which is tabular in nature. Feature values are continuous, ordinal, categorical, textual, and/or multi-categorical. Textual values can be of variable length and high-dimensional due to the long-tail distributions of observational data. Textual data is converted into a set of sequences and projected into a trained embedding that preserves semantic meaning, similar to methods proposed in the Natural Language Processing (NLP) literature. Traditional Generative Adversarial Network (GAN) [40] architectures proposed in the literature perform poorly on tabular data [66] because: (a) real data often consists of mixed data types, and to generate such data simultaneously a GAN must be able to apply both softmax and tanh on the output; (b) image pixel values generally follow a Gaussian-like distribution, which can be normalized to [-1, 1], whereas with tabular data, continuous columns often have non-Gaussian distributions with long tails that lead to gradient saturation; continuous columns and imbalanced categorical columns need preprocessing steps before being fed into generators.
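As an example of the preprocessing step mentioned above, the sketch below (our illustrative choice, not a prescribed pipeline) one-hot encodes imbalanced categorical columns and pushes long-tailed continuous columns towards a Gaussian shape with scikit-learn:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, QuantileTransformer

def build_tabular_preprocessor(categorical_cols, continuous_cols):
    """Map mixed-type tabular columns into a numeric matrix a generator can consume."""
    return ColumnTransformer([
        # Imbalanced categorical columns -> one-hot vectors (softmax-style outputs).
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        # Long-tailed continuous columns -> approximately Gaussian values.
        ("num", QuantileTransformer(output_distribution="normal"), continuous_cols),
    ])

# Hypothetical usage on a dataframe df with a city and a price column:
# prep = build_tabular_preprocessor(["customer_city"], ["price"])
# X = prep.fit_transform(df)
```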
The recently proposed Conditional Tabular GAN (CTGAN) [66] addresses some of the key problems of tabular data synthesis. The training process involves simulating records one by one: the algorithm first randomly selects one of the variables, then randomly selects a value for that variable. Given that value, it finds a matching row in the training data and generates the rest of the variables conditioned on the selected variable. The generated and true rows are sent to a critic that assigns a score. Similarly, a modified WGAN-GP [68] changes the generator of WGAN-GP to handle continuous variables. Figure 1 summarizes the CTGAN and WGAN-GP training procedures. The recently proposed Robust Variational AutoEncoder (R-VAE) [67], which inherits the properties of typical Variational Autoencoders, handles tabular data more efficiently.
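For context, generating a synthetic table with CTGAN can look roughly like the sketch below, assuming the open-source ctgan Python package (class and argument names may differ across versions, and the file and column names are hypothetical):

```python
import pandas as pd
from ctgan import CTGAN  # pip install ctgan; API may vary by version

# Hypothetical tabular dataset with mixed column types.
real_df = pd.read_csv("ecommerce_features.csv")
discrete_columns = ["customer_city", "customer_zip_code"]

model = CTGAN(epochs=100)              # conditional tabular GAN
model.fit(real_df, discrete_columns)   # learns column-wise conditional generation
synthetic_df = model.sample(10_000)    # draw synthetic records
```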
[Fig. 1: CTGAN and WGAN-GP training dynamics. (a) CTGAN: a column and a value v are randomly selected; the generator G(Z; θ_g | v), fed with noise Z ∼ F_z = N(0, I_k), produces a fake record containing v, a real record containing v is randomly selected from the data X, and the critic D(·) scores both. (b) WGAN-GP: the generator G(Z; θ_g), fed with noise Z ∼ F_z = N(0, I_k), produces fake data F_g; the critic D, drawn from the set of 1-Lipschitz functions, scores the realism of fake data against real data F_X, with critic objective max over 1-Lipschitz D of D(X) − D(G(Z; θ_g)) and generator objective max over θ_g of D(G(Z; θ_g)).]
DPGAN [52] and PATEGAN [61] are popular differentially private GANs. DPGAN adds noise to the gradient of the discriminator during training to provide differential privacy guarantees. PATEGAN first trains multiple teacher models on disjoint partitions of the data. At inference time, to classify a new sample, it noisily aggregates the outputs of the previously trained teacher models to produce the final output.
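The core idea behind DPGAN-style training can be sketched as follows. This is a conceptual PyTorch illustration of clipping and noising discriminator gradients, not the DPGAN implementation; a faithful DP-SGD variant would clip per-example gradients and track the budget ε with a privacy accountant.

```python
import torch

def noisy_discriminator_step(discriminator, loss, optimizer,
                             clip_norm=1.0, noise_multiplier=1.0):
    """Illustrative DP-style update: clip gradients, then add Gaussian noise."""
    optimizer.zero_grad()
    loss.backward()
    # Clip the overall gradient norm to bound the sensitivity of each update.
    torch.nn.utils.clip_grad_norm_(discriminator.parameters(), clip_norm)
    # Add Gaussian noise calibrated to the clipping bound to every gradient tensor.
    for p in discriminator.parameters():
        if p.grad is not None:
            p.grad.add_(torch.randn_like(p.grad) * noise_multiplier * clip_norm)
    optimizer.step()
```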
3 Instance level Privacy Scoring
To avoid MIA-type attacks and the unintentional leakage of sensitive attributes, we devise an instance-level privacy score for generated synthetic data. As shown in previous studies, privacy compromises in generative models are mainly caused by over-fitting to the training data, which can lead to data memorization. We empirically measure a memorization coefficient α_m for each generated sample and use this measure as the privacy score. When distributing synthetic data to external third parties, users can filter out samples with a high privacy score and thereby reduce privacy risks.
Given X_r and X_s, the real dataset and a synthetic dataset generated by some generative algorithm Φ, respectively, we first convert them into latent vectors in an embedding space, E(X_r) and E(X_s). Next, we measure the spherelet distance between the latent vectors of each real and generated sample. The spherelet distance between s(X_i) and s(X_j) is defined as:

$$d_S = d_R(X_i, X_j) + d_E(X_i, X_j) \quad (3)$$

where d_E is the Euclidean distance between the two sets of points, and d_R is the Riemannian (geometric) divergence between the datasets X_i and X_j. The intuition behind this is that
we are measuring the distance between points by projecting the samples onto a sphere centered at c with radius r. The spherelet of X is denoted by s(X) = S(V, c, r), where V determines the affine subspace the sphere lies in. The spherical error [60] of X is defined by

$$\epsilon(X) = \frac{1}{n}\sum_{i=1}^{n}\left(\lVert x_i - c\rVert - r\right)^2 \quad (4)$$
If the x_i lie on the sphere, then ε(X) = 0. The distance metric also captures (a) which points are found in low-probability regions of the data manifold and (b) how close the synthetic and real samples are in terms of nearest-neighbour proximity, which gives us a proxy measure for over-fitting and data memorization. The memorization coefficient α_m for a generated sample X_gi is defined as g_ni / t_ni, where t_ni and g_ni are the counts of samples from the training and generated datasets, respectively, whose proximity distance to X_gi is less than a threshold λ. In short, we are measuring the support set for the sample X_gi in the training and synthetic sets. Samples with α_m < 1 are strongly influenced by the training set and are prone to MI attacks, whereas samples with α_m > 1 capture the underlying distribution of the training data without leaking private training-set attributes.
The advantage of this method is that it is agnostic to the model Φ and works at the data level. Users can discard samples whose privacy score crosses a certain threshold to protect the privacy of users, and data audits by compliance bodies can be performed without needing to know the model internals and training dynamics.
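A minimal sketch of the α_m computation and PS-based filtering, under the definition above, is given below. The embedding function and the spherelet distance are abstracted away (plain Euclidean distance in the embedding space stands in for them), so this is an illustrative approximation rather than our exact implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def memorization_coefficients(emb_real, emb_syn, lam):
    """alpha_m for each synthetic sample: neighbour count in the synthetic set
    divided by neighbour count in the training set, within radius lam in the
    embedding space (Euclidean distance stands in for the spherelet distance)."""
    nn_real = NearestNeighbors().fit(emb_real)
    nn_syn = NearestNeighbors().fit(emb_syn)
    t_n = np.array([len(i) for i in nn_real.radius_neighbors(
        emb_syn, radius=lam, return_distance=False)])
    g_n = np.array([len(i) - 1 for i in nn_syn.radius_neighbors(
        emb_syn, radius=lam, return_distance=False)])  # -1: exclude the sample itself
    return g_n / np.maximum(t_n, 1)  # guard against division by zero

def filter_by_privacy_score(X_syn, alpha_m, threshold=1.0):
    """Keep only samples whose support is not dominated by the training set."""
    return X_syn[alpha_m >= threshold]
```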
4 MIA Method
Membership Inference. The goal of a membership inference attack is to create a function that accurately predicts whether a data point belongs to the training set of the classifier. To achieve this, an attacker trains a set of surrogate models [73] on confidence scores produced by the target model, with membership as labels.
Typically, the attacker exploits model confidence, which reflects overfitting, to infer membership. Shokri et al. [73] first demonstrated MIA with only black-box access to the confidence scores for each input data sample. Their approach involves first training many 'shadow' ML models similar to the target model and then using these shadow models to train a membership inference model that identifies whether a sample was in the training set of any of the shadow models. MIA is evaluated as a binary classification task, where metrics like false-negative and false-positive rates and F1 scores reflect the success rate of predicting that a given sample is part of the training set.
We adapt the 'shadow' training approach in our MI attack on synthetic data. Given an auxiliary dataset D_aux similar to the training data distribution, we first randomly sample a single data point L_t and remove it from the dataset. Then the well-known ensemble learning method bootstrap aggregating (bagging) is used to divide the remaining dataset into subsamples with replacement. A generative model is trained on each subsample, and using the trained models, synthetic datasets syn_0 of size k are sampled. These samples are labeled l_j = 0, indicating that the synthetic data does not contain the target sample L_t. The same procedure is repeated after adding L_t back into the dataset D_aux, and synthetic datasets syn_1 of size k are sampled. These samples are labeled l_j = 1, indicating that the synthetic data contains the target sample L_t. We now have a labeled dataset that captures the membership relationship, and we train a random forest classifier MI_c that predicts the probability that a given sample is a member of the original dataset. Given a synthetic dataset S_Aux, MI_c predicts the membership of the target record t in the training set of the generative model G that produced S_Aux.
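The attack loop can be summarised by the following sketch (illustrative only: train_generator and featurize are hypothetical helpers standing in for the generative-model training step and for whatever fixed-length summary of a synthetic dataset the attack classifier consumes, and D_aux is assumed to already exclude the target record):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def bagging_mia(D_aux, target_record, train_generator, featurize,
                n_models=20, k=1000, seed=0):
    """Shadow-style MIA on synthetic data via bagging.

    train_generator(data) -> fitted generative model exposing .sample(k)
    featurize(synthetic_dataset) -> fixed-length feature vector
    """
    rng = np.random.default_rng(seed)
    X_attack, y_attack = [], []
    for label, pool in ((0, D_aux),                                   # target absent
                        (1, np.vstack([D_aux, target_record[None, :]]))):  # target present
        for _ in range(n_models):
            idx = rng.integers(0, len(pool), size=len(pool))  # bootstrap subsample
            gen = train_generator(pool[idx])
            syn = gen.sample(k)                 # synthetic dataset of size k
            X_attack.append(featurize(syn))     # summarize it as one feature vector
            y_attack.append(label)
    clf = RandomForestClassifier(n_estimators=200)
    clf.fit(np.vstack(X_attack), np.array(y_attack))
    return clf  # predicts target membership from a synthetic dataset's features
```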
5 Experiments
The membership privacy leakage is evaluated through a membership privacy experiment [48, 49]. Given a training set D_t that follows a data distribution D, a learning algorithm φ is trained on D_t. We assume the adversary only has access to φ via a query interface, which exposes the inference result and class probability for a given sample. In synthetic data sharing use cases, the adversary has access to synthetic datasets generated by some generative model. Here the goal of the adversary is to discover the membership of a given sample, i.e., whether it belongs to the training data or not, with access only to the synthetic data. We assume the adversary has domain-dependent data access. In our experiments, to measure the strength of the proposed instance-level privacy filtering of synthetic data, we measure the drop in MI attack accuracy with and without the filtering mechanism.
For this work, we use two datasets. The first is the SIVEP-Gripe database [2] released by the Brazilian government, with records of 99,557 ICU COVID-19 patients in Brazil, which includes personal information (ethnicity, age, and location). A logistic regression (binary classification) model is fit to predict patient-level COVID-19 mortality using 6,882 data samples and 25 features. The second is the Brazilian E-Commerce Public Dataset by Olist (https://www.kaggle.com/olistbr/brazilian-ecommerce/). It consists of 112,000 customer orders at multiple marketplaces in Brazil. The dataset contains order status, price, payment and freight performance to customer location, product attributes, and finally reviews written by customers. Customer and seller IDs are anonymized, and references to the companies and partners in the review text have been replaced with the names of Game of Thrones great houses. For our experiment we use the columns price, customer identifier, customer zip code, and customer city as feature columns. A logistic regression model is fit to predict customer spend by aggregating the price column.
We create five synthetic datasets using CTGAN, WGAN-GP, PATEGAN, DPGAN, and RVAE as synthetic data generators for the SIVEP-Gripe and E-Commerce data. For training the generators, the input dimensions of each architecture are adjusted to accommodate the dataset shape. All hidden layers use the ReLU activation function, and the final layer uses the sigmoid function. The discriminator network also uses three hidden layers, which are symmetric to the generator; the hidden layers again use ReLU. We train each network for 100 epochs.
[Fig. 2: Synthetic data size dependency of the MIA attack for (a) the E-Commerce dataset and (b) the EHR dataset.]
[Fig. 3: Attack accuracy drop before and after filtering data points with high privacy scores for (a) the EHR dataset and (b) the E-Commerce dataset.]
For training the differentially private GANs, we used privacy budgets of ε = [0.01, 0.1, 0.5, 1.0, 3.0]. λ is set to 0.5 to reflect the adversary's advantage in the membership attack. 1,000 samples are randomly drawn from the training data and marked as D_aux. To reflect maximum compromise by the adversary, as in a white-box setting, we use the same generator to train the shadow model S.
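Under these settings, the evaluation reduces to comparing attack accuracy on the raw synthetic data with accuracy on the PS-filtered data, roughly as in the sketch below (mia_attack_accuracy, memorization_coefficients, and embed are hypothetical helpers corresponding to the components described in Sections 3 and 4):

```python
def evaluate_filtering(X_train, X_syn, embed, mia_attack_accuracy,
                       memorization_coefficients, lam=0.5):
    """Return MI attack accuracy before and after alpha_m-based filtering."""
    acc_before = mia_attack_accuracy(X_train, X_syn)
    alpha_m = memorization_coefficients(embed(X_train), embed(X_syn), lam)
    X_filtered = X_syn[alpha_m >= 1.0]   # drop samples dominated by the training set
    acc_after = mia_attack_accuracy(X_train, X_filtered)
    return acc_before, acc_after
```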
5.1 Experimental Results and Discussion
We first present the experimental results of the membership inference attack and the drop in attack accuracy before and after we filter the data with the α_m method. We measure attack accuracy with respect to the number of samples in the auxiliary dataset used in the attack mechanism. Figures 2 and 3 summarize the privacy risk of each synthesising method in terms of MI attack accuracy and auxiliary dataset size. We observe that increasing the size of the synthetic dataset increases the risk of privacy leakage: as more information about the data is learnt by the adversary, the privacy of the data generator weakens. Similarly, for differential privacy methods that add noise to the output under some privacy budget ε, the attack accuracy decreases, showing that DP-based generators reduce privacy risks. The increase in the privacy strength of the dataset is consistent across datasets and different models, highlighting the strength of our proposed method.
5.2 Discussion
To develop good synthetic data generation models we still face some technical hurdles; for example, it is not only hard to predict exactly which data characteristics will be preserved in a model's stochastic output, but appropriate metrics to empirically compare two distributions are also lacking. The perturbations required to achieve differential privacy under a given privacy budget make it even harder to predict which records will remain vulnerable, and might even increase the exposure of some data records. Our work aims to give some direction towards improving synthetic data privacy by assigning instance-level privacy scores. A data auditor or service provider can leverage the scores to remove the data points prone to MI attacks. The advantage of our method is that it is model agnostic and can be used in a post-hoc fashion. We explored the accuracy drop from the attacker's end, but measuring the accuracy-privacy tradeoff that dropping data points from the synthetic data imposes on downstream models needs further investigation.
We want to highlight some of the limitations of our work. We only tested our method on tabular data; further investigation is needed to understand the privacy risks of other types of data such as text, images, and audio. Additionally, we use the whole dataset to measure α_m, which may be misleading: as pointed out by [46], membership leakage for some subgroups may be more severe than for others in discriminative models. We plan to address this drawback in future work. It is also important to point out that the privacy risk measures provided in this analysis are dataset-dependent and can be generator-specific. While the proposed method gives some empirical privacy guarantees for generative models and guidance for data sharing use cases, we need a more detailed analysis of the dependency of α_m on model parameters and varied dataset distributions. In terms of threat models, our analysis does not take into consideration an attacker who has knowledge of the filtering scheme in advance; a malicious attacker could use only filtered data points to train surrogate models, invalidating the proposed scheme.
6 Conclusion
One of the major impediments to advancing ML in security-sensitive domains is the lack of open, realistic, publicly available datasets. To gain trust, protect end-user data, and comply with regulations, stakeholders are leveraging synthetic data as a tool to share data between third-party aggregators and researchers. However, the privacy analysis of these datasets is still open research. In this work, we fill this gap by investigating the privacy issues of the generated datasets from the attacker and auditor points of view. We propose an instance-level Privacy Score (PS) for each synthetic sample by measuring the memorization coefficient α_m per sample. Leveraging PS, we empirically show that the accuracy of membership inference attacks on synthetic data drops significantly. PS is a model-agnostic, post-training measure, which not only gives the data sharer guidance about the privacy properties of a given sample but also helps third-party data auditors run privacy checks without sharing model internals. Finally, there is a lot more work to be done purely within the realm of privacy, and our work only addresses a small issue empirically. Understanding the social and legal implications of synthetic data sharing needs inputs and partnerships from policy experts and compliance officers.
References
1. Carlini, Nicholas and Tramer, Florian and Wallace, Eric and Jagielski, Matthew
and Herbert-Voss, Ariel and Lee, Katherine and Roberts, Adam and Brown, Tom
and Song, Dawn and Erlingsson, Ulfar and others
Extracting Training Data from Large Language Models , arXiv preprint
arXiv:2012.07805
2. SIVEP-Gripe. http://plataforma.saude.gov.br/coronavirus/dados-abertos/. In
Ministry of Health. SIVEP-Gripe public dataset, (accessed May 10, 2020; in Por-
tuguese), 2020.
3. N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates,
S. Bhatia, N. Boden, A. Borchers, et al. In-datacenter performance analysis of a
tensor processing unit. In Proceedings of the 44th Annual International Symposium
on Computer Architecture, pages 1–12, 2017.
4. Department of Commerce, National Institute of Standards and Technology, "Differential privacy synthetic data challenge." https://www.challenge.gov/challenge/differential-privacy-synthetic-data-challenge/, 2019. Accessed 2021-02-19.
5. Theraux, Olivier, "Anonymisation and synthetic data: towards trustworthy data." https://theodi.org/article/anonymisation-and-synthetic-data-towards-trustworthy-data/, 2019. Accessed 2021-02-19.
6. The Open Data Institute, "Diagnosing the NHS: Synae." https://www.odileeds.org/events/synae/. Accessed 2021-02-19.
7. Rubin, D. B. Discussion: Statistical Disclosure Limitation, 1993.
8. Rubin, D. B. Multiple imputation after 18+ years. American Statistical Association,
91(434):473–489, 1996.
9. “Hazy.” https://hazy.com/.
10. “AIreverie.” https://aireverie.com/.
11. “Statice.” https://statice.ai/.
12. “one-view.” https://one-view.ai/.
13. “datagen.” https://www.datagen.tech/.
14. “Synthesize.” https://synthezise.io/.
15. “cognata.” https://www.cognata.com/.
16. “Mostly-ai.” https://mostly.ai/.
17. D. B. Rubin, “Statistical disclosure limitation,” Journal of official Statistics, vol. 9,
no. 2, pp. 461–468, 1993.
18. Little, R. J. Statistical Analysis of Masked Data. Journal of Official Statistics, 9
(2):407–426, 1993.
19. Abowd, J. M. and Lane, J. New Approaches to Confidentiality Protection: Syn-
thetic Data, Remote Access and Research Data Centers. In Domingo-Ferrer, J. and
Torra, V. (eds.), Privacy in Statistical Databases, volume 3050, pp. 282–289. 2004.
ISBN 978-3-540-22118-0.
20. Abowd, J. M. and Woodcock, S. D. Multiply-Imputing Confidential Characteristics
and File Links in Longitudinal Linked Data. In Privacy in Statistical Databases,
volume 3050, pp. 290–297. 2004. ISBN 3-540-22118-2.
21. Reiter, J. P. and Raghunathan, T. E. The multiple adaptations of multiple impu-
tation. Journal of the American Statistical Association, 102(480):1462–1471, 2007.
22. Drechsler, J. and Reiter, J. P. Sampling with synthesis: A new approach for releas-
ing public use census microdata. Journal of the American Statistical Association,
105(492):1347–1357, 2010.
23. Drechsler, J. and Reiter, J. P. An Empirical Evaluation of Easily Implemented,
Nonparametric Methods for Generating Synthetic Datasets. Computational Statis-
tics and Data Analysis, 55(12):3232–3243, dec 2011.
24. Kinney, S. K., Reiter, J. P., Reznek, A. P., Miranda, J., Jarmin, R. S., and Abowd,
J. M. Towards Unrestricted Public Use Business Microdata: The Synthetic Longi-
tudinal Business Database. International Statistical Review, 79(3):362–384, 2011.
25. Reiter, J. P. Using cart to generate partially synthetic, public use microdata.
Journal of Official Statistics, 21(3):441–462, 2005a.
26. Caiola, G. and Reiter, J. P. Random Forests for Generating Partially Synthetic,
Categorical Data. Transactions on Data Privacy, 3(1):27–42, 2010.
27. Drechsler, J. Synthetic Datasets for Statistical Disclosure Control, volume 53.
Springer, 2011. ISBN 9788578110796.
28. Bowen, C. M. and Liu, F. Comparative Study of Differentially Private Data Syn-
thesis Methods. Statistical Science, forthcoming.
29. Drechsler, J. Synthetic Datasets for Statistical Disclosure Control, volume 53.
Springer, 2011. ISBN 9788578110796.
30. Manrique-Vallier, D. and Hu, J. Bayesian Non-Parametric Generation of Fully
Synthetic Multivariate Categorical Data in the Presence of Structural Zeros. Journal
of the Royal Statistical Society. Series A: Statistics in Society, 181(3):635–647, 2018.
31. Snoke, J., Raab, G., Nowok, B., Dibben, C., and Slavkovic, A. General and specific
utility measures for synthetic data. 2016.
32. Raghunathan, T. E., Reiter, J. P., and Rubin, D. B. Multiple imputation for
statistical disclosure limitation. Journal of official statistics, 19(1):1, 2003.
33. Kinney, S. K., Reiter, J. P., Reznek, A. P., Miranda, J., Jarmin, R. S., and Abowd,
J. M. Towards Unrestricted Public Use Business Microdata: The Synthetic Longi-
tudinal Business Database. International Statistical Review, 79(3):362–384, 2011.
34. Kinney, S. K., Reiter, J. P., and Berger, J. O. Model Selection When Multiple
Imputation is Used to Protect Confidentiality in Public Use Data. Journal of Privacy
and Confidentiality, 2(2):3–19, 2010.
35. Article 29 Data Protection Working Party - European Commission, "Opinion 05/2014 on anonymisation techniques." https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf, 2014.
36. M. Elliot, E. Mackey, K. O’Hara, and C. Tudor, The anonymisation decision-
making framework. UKAN Manchester, 2016.
37. I. S. Rubinstein and W. Hartzog, “Anonymization and risk,” Wash. L. Rev., vol. 91,
p. 703, 2016.
38. M. Elliot, K. O’hara, C. Raab, C. M. O’Keefe, E. Mackey, C. Dibben, H. Gowans,
K. Purdam, and K. McCullagh, “Functional anonymisation: Personal data and the
data environment,” Computer Law & Security Review, vol. 34, no. 2, pp. 204–221,
2018.
39. N. Patki, R. Wedge, and K. Veeramachaneni, “The synthetic data vault,” in 2016
IEEE International Conference on Data Science and Advanced Analytics (DSAA),
pp. 399–410, IEEE, 2016.
40. I. Goodfellow, “Nips 2016 tutorial: Generative adversarial networks,” arXiv
preprint arXiv:1701.00160, 2016.
41. M. Elliot, K. O’hara, C. Raab, C. M. O’Keefe, E. Mackey, C. Dibben, H. Gowans,
K. Purdam, and K. McCullagh, “Functional anonymisation: Personal data and the
data environment,” Computer Law & Security Review, vol. 34, no. 2, pp. 204–221,
2018.
42. European Commission (2016). Regulation (EU) 2016/679: General Data Protection
Regulation (GDPR).
43. L. Sweeney, “k-anonymity: A model for protecting privacy,” International Journal
of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 05, pp. 557–
570, 2002.
44. R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership Inference Attacks
Against Machine Learning Models,” in IEEE Symposium on Security and Privacy
(S&P), 2017.
45. A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, “L-
diversity: privacy beyond k-anonymity,” in 22nd International Conference on Data
Engineering (ICDE’06), 2006.
46. M. Yaghini, B. Kulynych, and C. Troncoso, “Disparate vulnerability: On the unfair-
ness of privacy attacks against machine learning,” arXiv preprint arXiv:1906.00389,
2019.
47. H. Ping, J. Stoyanovich, and B. Howe, “Datasynthesizer: Privacy-preserving syn-
thetic datasets,” in Proceedings of the 29th International Conference on Scientific
and Statistical Database Management, 2017.
48. B. Jayaraman, L. Wang, D. Evans, and Q. Gu, “Revisiting membership inference
under realistic assumptions,” arXiv preprint arXiv:2005.10881, 2020.
49. S. Yeom, I. Giacomelli, M. Fredrikson, and S. Jha, “Privacy risk in machine learn-
ing: Analyzing the connection to overfitting,” in 2018 IEEE 31st Computer Security
Foundations Symposium (CSF), pp. 268–282, IEEE, 2018.
50. J. Zhang, G. Cormode, C. M. Procopiuc, D. Srivastava, and X. Xiao, “Privbayes:
Private data release via bayesian networks,” ACM Trans. Database Syst., 2017.
51. L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni, “Modeling
tabular data using conditional gan,” in Advances in Neural Information Processing
Systems, 2019.
52. Yoon, J., Drumright, L. N., and Van Der Schaar, M. Anonymization through
data synthesis using generative adversarial networks (ads-gan). IEEE Journal of
Biomedical and Health Informatics, 2020.
53. Brock, A., Donahue, J., and Simonyan, K. Large scale gan training for high fidelity
natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
54. Adlam, B., Weill, C., and Kapoor, A. Investigating under and overfitting in wasser-
stein generative adversarial networks. arXiv preprint arXiv:1910.14137, 2019.
55. Meehan, C., Chaudhuri, K., and Dasgupta, S. A non-parametric test to detect
data-copying in generative models. arXiv preprint arXiv:2004.05675, 2020.
56. Jamie Hayes, Luca Melis, George Danezis, and Emiliano De Cristofaro. Logan:
Membership inference attacks against generative models. Proceedings on Privacy
Enhancing Technologies, 2019(1):133–152, 2019.
57. Sablayrolles, A.; Douze, M.; Ollivier, Y.; Schmid, C.; and Jégou, H. 2019. White-box vs Black-box: Bayes Optimal Strategies for Membership Inference. In Proceedings of the 36th International Conference on Machine Learning (ICML), 5558–5567.
58. A. Sablayrolles, M. Douze, Y. Ollivier, C. Schmid, and H. Jégou, "White-box vs black-box: Bayes optimal strategies for membership inference," 2019.
59. S. Truex, L. Liu, M. E. Gursoy, L. Yu, and W. Wei, “Towards demystifying mem-
bership inference attacks,” ArXiv, vol. abs/1807.09173, 2018.
60. Kuppa, Aditya and Grzonkowski, Slawomir and Asghar, Muhammad Rizwan and
Le-Khac, Nhien-An Black Box Attacks on Deep Anomaly Detectors, Proceedings
of the 14th International Conference on Availability, Reliability and Security
61. J. Yoon, J. Jordon, and M. van der Schaar. PATE-GAN: Generating synthetic
data with differential privacy guarantees. In International Conference on Learning
Representations, 2019. URL https://openreview.net/forum?id=S1zk9iRqF7.
62. Torkzadehmahani, R., Kairouz, P., and Paten, B. DP-CGAN: differentially private
synthetic data and label generation. CoRR, abs/2001.09700, 2020.
63. D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal,
T. Maharaj, A. Fischer, A. Courville, Y. Bengio, and et al., “A closer look at
memorization in deep networks,” in Proceedings of the 34th International Conference
on Machine Learning - Volume 70, ICML’17, p. 233–242, JMLR.org, 2017.
64. S. Yeom, I. Giacomelli, M. Fredrikson, and S. Jha, “Privacy risk in machine learn-
ing: Analyzing the connection to overfitting,” in 2018 IEEE 31st Computer Security
Foundations Symposium (CSF), pp. 268–282, July 2018.
65. Salimans, T., Goodfellow, I.J., Zaremba, W., Cheung, V., Radford, A., Chen, X.:
Improved techniques for training gans. CoRR abs/1606.03498 (2016)
66. Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular
data using conditional gan. In: NeurIPS (2019)
67. Eduardo, S., Nazábal, A., Williams, C.K.I., Sutton, C.: Robust Variational Au-
toencoders for Outlier Detection and Repair of Mixed-Type Data. In: Proceedings
of AISTATS (2020)
68. Camino, R., Hammerschmidt, C., State, R., 2018. Generating multi-categorical
samples with generative adversarial networks. arXiv preprint arXiv:1807.01202.
69. C. R. Meehan, K. Chaudhuri, and S. Dasgupta, “A non-parametric test to detect
data-copying in generative models,” ArXiv, vol. abs/2004.05675, 2020.
70. Z. Izzo, M. A. Smart, K. Chaudhuri, and J. Zou, “Approximate data deletion from
machine learning models: Algorithms and evaluations,” ArXiv, vol. abs/2002.10077,
2020.
71. Congzheng Song and Vitaly Shmatikov. Overlearning Reveals Sensitive Attributes.
In International Conference on Learning Representations (ICLR), 2020.
72. Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. Ex-
ploiting Unintended Feature Leakage in Collaborative Learning. In IEEE Sympo-
sium on Security and Privacy (S&P), pages 497–512. IEEE, 2019.
73. Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Member-
ship Inference Attacks Against Machine Learning Models. In IEEE Symposium on
Security and Privacy (S&P), pages 3–18. IEEE, 2017.
74. Min Chen, Zhikun Zhang, Tianhao Wang, Michael Backes, Mathias Humbert,
and Yang Zhang. When Machine Unlearning Jeopardizes Privacy. CoRR
abs/2005.02205, 2020.
75. Zheng Li and Yang Zhang. Label-Leaks: Membership Inference Attack with Label.
CoRR abs/2007.15528, 2020.
76. Klas Leino and Matt Fredrikson. Stolen Memories: Leveraging Model Memoriza-
tion for Calibrated White-Box Membership Inference. In USENIX Security Sympo-
sium (USENIX Security), pages 1605–1622. USENIX, 2020.
77. Dingfan Chen, Ning Yu, Yang Zhang, and Mario Fritz. GAN-Leaks: A Taxonomy of
Membership Inference Attacks against Generative Models. In ACM SIGSAC Con-
ference on Computer and Communications Security (CCS), pages 343–362. ACM,
2020.
78. Ahmed Salem, Yang Zhang, Mathias Humbert, Pascal Berrang, Mario Fritz, and
Michael Backes. ML-Leaks: Model and Data Independent Membership Inference
Attacks and Defenses on Machine Learning Models. In Network and Distributed
System Security Symposium (NDSS). Internet Society, 2019.
79. Jinyuan Jia, Ahmed Salem, Michael Backes, Yang Zhang, and Neil Zhenqiang
Gong. MemGuard: Defending against Black-Box Membership Inference Attacks via
Adversarial Examples. In ACM SIGSAC Conference on Computer and Communi-
cations Security (CCS), pages 259–274. ACM, 2019.