# Towards Improving Privacy of Synthetic DataSet

## Abstract

Recent growth in domain-specific applications of machine learning can be attributed to the availability of realistic public datasets. Real-world datasets often contain sensitive information about users, which makes them hard to share freely with other stakeholders and researchers due to regulatory and compliance requirements. Synthesising datasets from real data by leveraging generative techniques is gaining popularity. However, the privacy analysis of these datasets is still an open research problem. In this work, we fill this gap by investigating the privacy issues of generated datasets from the attacker and auditor points of view. We propose an instance-level Privacy Score (PS) for each synthetic sample by measuring the memorisation coefficient $\alpha_{m}$ per sample. Leveraging PS, we empirically show that the accuracy of membership inference attacks on synthetic data drops significantly. PS is a model-agnostic, post-training measure, which not only gives data sharers guidance about the privacy properties of a given sample but also helps third-party data auditors run privacy checks without sharing model internals. We tested our method on two real-world datasets and show that attack accuracy is reduced by PS-based filtering.
1 Introduction
Machine learning (ML) has been widely adopted in a variety of applications
across multiple domains. Due to the maturity of large-scale training
infrastructure [3], development tools, and the availability of large amounts of
training data, companies are building huge models with billions of parameters
for commercial applications. The training data used to train these models often
contains personal or sensitive data [1], aggregated from datasets that are used
for purposes unknown to the data owner (i.e. the person the data describes).
Unfortunately, end users who play an active role as data donors lack visibility
into how this data is used or misused by the applications built on top of it. A
growing number of AI incidents³ show an upward trend in data and model misuse,
including discriminatory hiring decisions, behavioral manipulation, and
unethical systems that lack transparency. To gain the trust of users and avoid
the privacy implications of data sharing for end users, institutions [4–6] and
companies [9–16] are adopting synthetic data [17].
Synthetic data as an idea was first adopted in designing various imputation
methods in the early 1990s [17, 18]. Raghunathan et al. [32] formally defined a
framework for the properties of synthetic data, which was further enhanced by
multiple other researchers [19–22, 33, 34]. Many recent works leverage
traditional statistical techniques and machine learning models to generate
synthetic data [23, 25, 26, 28–31].
Ideal synthetic data handles large datasets with diverse data types
(categorical, ordinal, continuous, skewed, etc.) and preserves semantic
relationships between variables [26], while the underlying algorithms are
fairly generalizable and need little tuning. ML models trained on synthetic
data may achieve performance similar to models trained on the original data.
Theoretical privacy guarantees can be achieved via "ideal" synthetic data, as a
synthetic data record is not an actual data record of an individual but
reflects the exact data distribution of the original data.
In the context of data sharing, the majority of recent methods employ
generative models [39, 40, 47, 50, 51] to obtain synthetic data from real data.
Popular generative modeling frameworks such as Generative Adversarial Networks
(GANs) [40] and Variational Auto-encoders (VAEs) use real data during training
to train a model capable of generating (or "synthesising") "fake" data points.
Quality metrics such as fidelity, which measures how "realistic" the synthetic
data is compared to real data, and diversity, which measures how well the
generated samples cover the variability of real data, are used to assess the
quality of the synthetic data.
Measuring the generalization property of synthetic data is very important and
often ignored by synthetic data generators. Most of the proposed methods
optimize for fidelity and diversity without measuring or reporting how well the
synthetic samples capture the overall distribution of the real data, including
outliers, rather than merely copying (real) training samples. Evaluating
generalization in generative models is not easy [54, 55] compared to
discriminative models, which are tested on held-out data.
From a compliance and regulation viewpoint, the General Data Protection
Regulation (GDPR) [42] gives new rights to citizens regarding transparency
around personal data. These include the right of access (knowing which data
about the person has been stored) and the right to data erasure, also known as
the right to be forgotten (having personal data deleted under certain
circumstances). Privacy risks include the risk of re-identification and the
risk of inference [35–37, 41]. When an attacker is able to identify and link
individuals within a dataset or across multiple datasets, these datasets pose
the risk of re-identification. These types of attacks can reveal the presence
of a user [43] or whether user data was used to train an ML model [44].
Deducing the private information of individuals with high confidence from the
linked datasets or the ML model falls under the risk of inference.
Motivated by compliance needs, the privacy of end users, and the growing
popularity of synthetic data as a sharing mechanism in both industry and
academia, in this work we investigate the privacy properties of synthetic data
generated by generative models. First, we propose a novel Membership Inference
Attack (MIA), which uses the popular bagging technique to achieve the
attacker's goal of identifying the membership of a user in the model trained on
synthetic data. Next, an instance-level privacy score is calculated to filter
out the synthetic samples that could compromise the privacy of sensitive
attributes in the synthetic data, protecting it from MIAs.
2 Background
We present a brief background on membership inference attacks and generative
machine learning models.
Membership inference attack (MIA). MIA is one of the most popular privacy
attacks against ML models [73–79]. The attacker's goal in membership inference
(MI) is to determine whether a data sample $x$ is part of the training dataset
of a target model $T$. We define a membership inference attack model
$A_{\mathrm{MemInf}}$ as a binary classifier:
$$A_{\mathrm{MemInf}} : x, T \mapsto \{\text{member}, \text{non-member}\} \tag{1}$$
MIA poses severe privacy risks: using a model trained on data samples of people
with sensitive information, an attacker can execute MIA to infer the membership
of a person, revealing that person's sensitive information. To achieve MIA,
Shokri et al. [73] train a number of shadow classifiers on confidence scores
produced by the target model, with labels indicating whether samples came from
the training or testing set. MIAs are shown to work in white-box [57] as well
as black-box scenarios against various target models, including
ML-as-a-service [73] and generative models [56]. Yeom et al. [59] explore
overfitting as the root cause of MI vulnerability.
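To make the binary-classifier view of Equation (1) concrete, the following minimal sketch implements the simplest possible attack: it declares a sample a member when the target model's confidence on it is high, which is the behaviour overfitting tends to produce. This is an illustrative stand-in, not the shadow-model attack of [73]; the function names, the threshold `tau`, and the toy confidence values are all assumptions.

```python
def infer_membership(confidence, tau=0.9):
    """Map a target-model confidence score to {'member', 'non-member'}.

    Assumption: an overfitted model is more confident on training samples,
    so confidence >= tau is read as evidence of membership.
    """
    return "member" if confidence >= tau else "non-member"


def attack_accuracy(confidences, is_member, tau=0.9):
    """Fraction of samples whose membership the threshold attack gets right."""
    correct = 0
    for conf, truth in zip(confidences, is_member):
        predicted_member = infer_membership(conf, tau) == "member"
        correct += predicted_member == truth
    return correct / len(confidences)


# Toy scores (made up): the first two samples were in the training set.
confidences = [0.99, 0.95, 0.60, 0.55]
is_member = [True, True, False, False]
accuracy = attack_accuracy(confidences, is_member)
```

On real models the two confidence distributions overlap, so the choice of `tau` trades false positives against false negatives.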
A related attack is attribute inference, in which a sensitive attribute is
inferred from the target model. This sensitive attribute is not related to the
target ML model's original classification task. For instance, a target model is
designed to classify an individual's age from their social network posts, while
attribute inference aims to infer their educational background [71, 72]. Some
studies [58, 59, 64] attribute membership inference attacks to the
generalization gap, the over-fitting of the model, and the data memorization
capabilities of neural networks. Deep neural networks have been shown to
memorize the training data [63, 69, 70] rather than learning the latent
properties of the data, which means they often tend to over-fit to the training
data. Weaknesses like this can, in some cases, be exploited by an adversary to
infer data samples' sensitive attributes.
Given a data sample $x$ and its representation from a target model, denoted by
$h = f(x)$, to conduct the attribute inference attack the adversary trains an
attack model $A_{\mathrm{AttInf}}$, formally defined as follows:
$$A_{\mathrm{AttInf}} : h \mapsto s \tag{2}$$
where $s$ represents the sensitive attribute.
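As a rough illustration of Equation (2), the sketch below trains a very small attack model that maps a representation $h$ to a sensitive attribute $s$. A nearest-centroid classifier is used purely for brevity; the text does not specify the attack model's architecture, so the classifier choice, the function names, and the toy vectors are assumptions.

```python
import math
from collections import defaultdict


def fit_centroids(representations, attributes):
    """Average the representations h observed for each sensitive attribute s."""
    sums = {}
    counts = defaultdict(int)
    for h, s in zip(representations, attributes):
        if s not in sums:
            sums[s] = [0.0] * len(h)
        sums[s] = [a + b for a, b in zip(sums[s], h)]
        counts[s] += 1
    return {s: [v / counts[s] for v in vec] for s, vec in sums.items()}


def infer_attribute(h, centroids):
    """A_AttInf: h -> s, here the attribute whose centroid is closest to h."""
    return min(centroids, key=lambda s: math.dist(h, centroids[s]))


# Toy representations h = f(x) paired with a known sensitive attribute.
reps = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
attrs = ["a", "a", "b", "b"]
centroids = fit_centroids(reps, attrs)
guess = infer_attribute([0.1, 0.0], centroids)
```

The attack only succeeds to the extent that the representation encodes the attribute, which is why memorization and over-fitting matter here.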
Generative Adversarial Networks (GANs). GANs [40] are a learning system in
which two competing models are trained against each other. They comprise two
components: a discriminator (D), which is trained to differentiate between real
and generated data, and a generator (G), which creates fake data to fool the
discriminator. The generator $G$ takes in a random noise vector $Z$ and a
parameter vector $\theta_g$ and learns to generate synthetic samples very
similar to the real data. The resulting network $G(Z; \theta_g)$ is used to
generate fake data with data distribution $F_g$.
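For reference, the two-player game described above is commonly written as the minimax objective of the original GAN formulation [40]. The notation $F_X$ for the real data distribution and $F_z$ for the noise distribution is assumed here, and variants such as WGAN-GP replace this loss with a different one:

$$\min_{\theta_g} \max_{D} \; \mathbb{E}_{x \sim F_X}\left[\log D(x)\right] + \mathbb{E}_{Z \sim F_z}\left[\log\left(1 - D(G(Z; \theta_g))\right)\right]$$

The discriminator maximises the objective by scoring real samples high and generated samples low, while the generator minimises it by producing samples the discriminator cannot tell apart from real data.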
Training GANs on real-world datasets is notoriously hard. One has to carefully
tune both generator and discriminator network parameters at training time to
avoid conditions like mode collapse and vanishing gradients. Both of these
conditions affect the quality of the synthetic data generated. The choice of
loss functions and GAN architectures also plays a vital role in avoiding these
problems [65].
Tabular Generators. In this work, we deal with the feature space, which is
tabular in nature. Feature values are continuous, ordinal, categorical,
textual, and/or multi-categorical. Textual values can be of variable length and
high-dimensional due to the long-tail distributions of observational data.
Textual data is converted into a set of sequences and projected into a trained
embedding that preserves semantic meaning, similar to methods proposed in
Natural Language Processing. Generative Adversarial Network (GAN) [40]
architectures proposed in the literature perform poorly on tabular data [66]
because: (a) real data often consists of mixed data types, so to generate this
data simultaneously a GAN must be able to apply both softmax and tanh on the
output; (b) image pixel values generally follow a Gaussian-like distribution,
which can be normalized to [-1, 1], whereas continuous tabular data often has
non-Gaussian distributions with long tails that lead to gradient saturation;
(c) continuous data types and imbalanced categorical columns need preprocessing
steps before being fed into generators.
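The mixed-type requirement in point (a) can be sketched as an output layer that applies tanh to continuous columns and softmax to each block of categorical logits. The column layout format and function names below are hypothetical and only illustrate the idea; real tabular generators implement this inside the network.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one categorical column's logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_output(raw, layout):
    """Apply tanh to continuous slots and softmax to categorical blocks.

    layout is a list of (kind, width) pairs, e.g. ("continuous", 1) or
    ("categorical", 3); this format is an assumption for illustration.
    """
    out, start = [], 0
    for kind, width in layout:
        chunk = raw[start:start + width]
        if kind == "continuous":
            out.extend(math.tanh(v) for v in chunk)
        else:
            out.extend(softmax(chunk))
        start += width
    return out


# One continuous column followed by one 3-way categorical column.
row = mixed_output([0.0, 1.0, 1.0, 1.0], [("continuous", 1), ("categorical", 3)])
```

The continuous output lands in [-1, 1], matching the normalization mentioned in point (b), while each categorical block sums to one.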
The recently proposed Conditional Tabular GAN (CTGAN) [66] addresses some of
the key problems of tabular data synthesis. The training process involves
simulating records one by one. It first randomly selects one of the variables,
then randomly selects a value for that variable. Given that value, the
algorithm finds a matching row in the training data. It also generates the rest
of the variables conditioned on the selected variable. The generated and true
rows are sent to the critic, which gives a score. Similarly, the modified
WGAN-GP [68] changes the generator of WGAN-GP to handle continuous variables.
Figure 1 summarizes the CTGAN and WGAN-GP training procedures. Another recently
proposed model, the Robust Variational AutoEncoder (R-VAE) [67], inherits the
properties of typical Variational AutoEncoders and handles tabular data more
efficiently.
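The conditioning step of the CTGAN training loop described above can be sketched as follows: pick a column, pick one of its values, and fetch a real row that matches. Uniform random choices follow the description in the text; the published CTGAN method may weight these choices differently, and the function name and toy rows are assumptions.

```python
import random


def sample_condition(rows, columns, rng):
    """Pick a column and value at random, then a real row matching that value."""
    column = rng.choice(columns)
    # Sort the distinct values so the draw is reproducible for a given seed.
    value = rng.choice(sorted({row[column] for row in rows}))
    matches = [row for row in rows if row[column] == value]
    real_row = rng.choice(matches)
    return column, value, real_row


rows = [
    {"city": "recife", "status": "delivered"},
    {"city": "recife", "status": "shipped"},
    {"city": "manaus", "status": "delivered"},
]
rng = random.Random(0)
column, value, real_row = sample_condition(rows, ["city", "status"], rng)
```

The generator would then be asked to produce a fake record conditioned on `value`, and the critic scores the fake record against `real_row`.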
Fig. 1: CTGAN and WGAN-GP training dynamics: (a) CTGAN, (b) WGAN-GP
DPGAN [52] and PATEGAN [61] are popular differentially private GANs. DPGAN adds
noise to the gradient of the discriminator during training to provide
differential privacy guarantees. PATEGAN first trains multiple teacher models
on disjoint partitions of the data. At inference time, to classify a new
sample, it noisily aggregates the outputs of the previously trained teacher
models to generate the final output.
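Both mechanisms can be sketched in a few lines. The first function perturbs a gradient with Gaussian noise after norm clipping, which is the common recipe for differentially private training and is assumed here rather than taken from [52]; the second aggregates teacher votes with Laplace noise in the spirit of PATE. Names, noise scales, and the clipping step are illustrative assumptions, and no privacy accounting is performed.

```python
import math
import random


def noisy_gradient(grad, clip_norm, sigma, rng):
    """Clip the gradient to clip_norm, then add Gaussian noise to each entry."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale + rng.gauss(0.0, sigma * clip_norm) for g in grad]


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two unit exponentials, scaled."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_teacher_vote(teacher_labels, num_classes, scale, rng):
    """Count teacher votes per class, add Laplace noise, return the top class."""
    counts = [0] * num_classes
    for label in teacher_labels:
        counts[label] += 1
    noisy = [c + laplace_noise(scale, rng) for c in counts]
    return max(range(num_classes), key=noisy.__getitem__)


rng = random.Random(0)
clipped = noisy_gradient([3.0, 4.0], clip_norm=1.0, sigma=0.0, rng=rng)
winner = noisy_teacher_vote([1, 1, 0, 1], num_classes=2, scale=0.0, rng=rng)
```

With the noise scales set to zero, as in the usage lines, the functions reduce to plain clipping and majority voting; larger scales correspond to stronger privacy and noisier outputs.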
3 Instance level Privacy Scoring
To avoid MIA-type attacks and unintentional leakage of sensitive attributes, we
devise an instance-level privacy score for generated synthetic data. As shown
in previous studies, the main reason for privacy compromises in generative
models is over-fitting to the training data, which can lead to data
memorization. We empirically measure the memorization coefficient $\alpha_m$
for each generated sample and use this measure as the privacy score. When
distributing synthetic data to external third parties, users can filter samples
with high privacy score and reduce privacy risks.
Let $X_r$ and $X_s$ be the real dataset and the synthetic dataset generated by
some generative algorithm $\Phi$, respectively. We first convert them into
latent vectors in an embedding space, $E(X_r)$ and $E(X_s)$. Next, we measure
the spherelet distance between each pair of real and generated latent vectors.
The spherelet distance between $s(X_i)$ and $s(X_j)$ is defined as:
$$d_S = d_R(X_i, X_j) + d_E(X_i, X_j) \tag{3}$$
where $d_E$ is the Euclidean distance between two sets of points, and $d_R$ is
the Riemannian (geometric) divergence of the datasets $X_i$ and $X_j$. The
intuition behind this is that we measure the distance between points by
projecting the samples onto a sphere centered at $c$ with radius $r$. The
spherelet of $X$ is denoted by $s(X) = S(V, c, r)$, where $V$ determines the
affine subspace the sphere lies in. The spherical error [60] of $X$ is defined
by
$$\epsilon(X) = \frac{1}{n} \sum_{i=1}^{n} \left( \lVert x_i - c \rVert - r \right)^2 \tag{4}$$
If all $x_i$ lie on the sphere, then $\epsilon(X) = 0$. The distance metric
also captures (a) which points are found in low-probability regions of the data
manifold and (b) how close the synthetic and real samples are in
nearest-neighbour proximity, which gives us a proxy measure for over-fitting
and data memorization. The memorization coefficient $\alpha_m$ for a generated
sample $X_{g_i}$ is defined as $g_{n_i} / t_{n_i}$, where $t_{n_i}$ and
$g_{n_i}$ are the counts of samples from the training and generated datasets,
respectively, whose proximity distance to $X_{g_i}$ is less than a threshold
$\lambda$. In short, we measure the support set for the sample $X_{g_i}$ in the
training and synthetic sets. Samples with $\alpha_m < 1$ are strongly
influenced by the training set and are prone to MI attacks, whereas samples
with $\alpha_m > 1$ capture the underlying distribution of the training data
without leaking private training set attributes.
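The two quantities defined above can be computed directly from their definitions. The sketch below evaluates the spherical error for a given centre and radius, and the memorization coefficient as a ratio of neighbour counts. As a loudly labelled simplification, plain Euclidean distance stands in for the spherelet distance $d_S$, the handling of a zero training count is an assumption, and all names and toy points are hypothetical.

```python
import math


def spherical_error(points, center, radius):
    """Mean squared gap between each point's distance to center and the radius."""
    return sum((math.dist(x, center) - radius) ** 2 for x in points) / len(points)


def memorization_coefficient(sample, train, generated, lam):
    """alpha_m = g_n / t_n for one generated sample.

    g_n and t_n count generated and training points closer than lam.
    Assumption: Euclidean distance replaces the spherelet distance, and a
    sample with no training neighbours gets alpha_m = infinity.
    """
    t_n = sum(1 for x in train if math.dist(sample, x) < lam)
    g_n = sum(1 for x in generated if math.dist(sample, x) < lam)
    if t_n == 0:
        return math.inf
    return g_n / t_n


on_circle = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
err = spherical_error(on_circle, center=(0.0, 0.0), radius=1.0)

train = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
generated = [(0.0, 0.05), (3.0, 3.0)]
alpha_close = memorization_coefficient(generated[0], train, generated, lam=0.5)
alpha_far = memorization_coefficient(generated[1], train, generated, lam=0.5)
```

In this toy setting the first generated point sits next to two training points and gets $\alpha_m = 0.5$, flagging it as likely memorized, while the second has no training neighbours within $\lambda$. Note that each generated sample counts itself in $g_n$ when it belongs to `generated`.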
The advantage of this method is that it is agnostic to the model $\Phi$ and
works at the data level. Users can discard the samples that cross a certain
privacy score threshold to protect the privacy of users, and data audits by
compliance bodies can be performed without knowledge of the model internals and
training dynamics.
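Because the score is computed on data alone, the filtering step needs nothing from the generator. A minimal sketch, assuming per-sample scores are already available and treating $\alpha_m < 1$ as the risky region following the definition of the memorization coefficient; the function name and the default rule are illustrative choices, not a prescribed threshold.

```python
def filter_by_privacy_score(samples, scores, is_risky=lambda alpha: alpha < 1.0):
    """Split synthetic samples into (kept, dropped) using a per-sample rule.

    Assumption: scores[i] is the memorization coefficient of samples[i], and
    the default rule drops samples dominated by training-set neighbours.
    """
    kept, dropped = [], []
    for sample, alpha in zip(samples, scores):
        (dropped if is_risky(alpha) else kept).append(sample)
    return kept, dropped


samples = ["row-a", "row-b", "row-c"]
scores = [0.5, 1.0, 2.5]
kept, dropped = filter_by_privacy_score(samples, scores)
```

An auditor can run the same function with a stricter rule, for example `is_risky=lambda alpha: alpha < 1.5`, without access to the model that produced the samples.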
4 MIA Method
Membership Inference. The goal of a membership inference attack is to create a
function that accurately predicts whether a data point belongs to the training
set of the classifier. To achieve this, an attacker trains a set of surrogate
models [73] on confidence scores produced by the target model, with membership
as labels.
Typically, the attacker exploits the way overfitting is reflected in the
model's confidence scores for each input data sample. The approach of [73]
involves first training many 'surrogate' (shadow) ML models similar to the
target model and then using these shadow models to train a membership inference
model that identifies whether a sample was in the training set of any of the
shadow models. MIA is evaluated as a binary classification task, where metrics
like false-negative rate, false-positive rate, and F1 score reflect the
attack's success at predicting that a given sample is part of the training set.
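The evaluation metrics named above follow from the confusion matrix of the attack's predictions. A short sketch, with label 1 meaning "member"; the function name and toy labels are assumptions.

```python
def mia_metrics(y_true, y_pred):
    """False-positive rate, false-negative rate, and F1 for a membership attack."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"fpr": fpr, "fnr": fnr, "f1": f1}


# Three true members and three non-members; the attack makes two mistakes.
metrics = mia_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```

Reporting the false-positive rate alongside accuracy matters because an attack that labels everything "member" can look accurate on a member-heavy test set.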
We adapt the 'shadow' model training in our MI attack on synthetic data. Given
an auxiliary dataset $D_{aux}$ similar to the training data distribution, we
first randomly sample a single data point $L_t$ and remove it from the dataset.
Next, the well-known ensemble learning method bootstrap aggregating (bagging)
is used to divide the remaining dataset (i.e. $D_{aux}$ without $L_t$) into
subsamples drawn with replacement. Generative models are trained on each
subsample, and from the trained models synthetic datasets $syn_0$ of size $k$
are sampled. These samples are labeled $l_j = 0$, indicating that the synthetic
data does not contain the target sample $L_t$. The same procedure is repeated
after adding $L_t$ back into the dataset $D_{aux}$, and synthetic datasets
$syn_1$ of size $k$ are sampled. These samples are labeled $l_j = 1$,
indicating that the synthetic data contains the target sample $L_t$. We now
have a labeled dataset that captures the membership relationship, and we train
a random forest classifier $MI_c$, which predicts the probability that a given
sample is a member of the original dataset. Given a synthetic dataset
$S_{Aux}$, $MI_c$ predicts the membership of the target record $t$ in the
training set of the generative model $G$ that produced $S_{Aux}$.
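The data-construction half of this attack can be sketched as below. A per-feature Gaussian fit stands in for the generative model purely so the example runs without dependencies; it is a hypothetical placeholder for CTGAN or any other generator. Appending the target to each bootstrap bag, rather than bootstrapping a pool that contains it, is also an assumption made so the target is guaranteed to be present in the label-1 case. Training the random forest $MI_c$ on the resulting labelled datasets is left out.

```python
import random
import statistics


def fit_toy_generator(rows):
    """Placeholder generator: per-feature mean and standard deviation."""
    columns = list(zip(*rows))
    return [(statistics.fmean(c), statistics.pstdev(c)) for c in columns]


def sample_synthetic(model, k, rng):
    """Draw k synthetic rows from the placeholder generator."""
    return [[rng.gauss(mu, sd) for mu, sd in model] for _ in range(k)]


def build_attack_dataset(d_aux, target_index, n_bags, k, rng):
    """Return (synthetic_dataset, label) pairs; label 1 means target was used."""
    target = d_aux[target_index]
    rest = d_aux[:target_index] + d_aux[target_index + 1:]
    labeled = []
    for label in (0, 1):
        for _ in range(n_bags):
            # Bagging: bootstrap the remaining data with replacement.
            bag = rng.choices(rest, k=len(rest))
            if label == 1:
                bag.append(target)
            model = fit_toy_generator(bag)
            labeled.append((sample_synthetic(model, k, rng), label))
    return labeled


rng = random.Random(0)
d_aux = [[float(i), float(i % 3)] for i in range(20)]
attack_data = build_attack_dataset(d_aux, target_index=5, n_bags=3, k=4, rng=rng)
```

Each entry of `attack_data` is one synthetic dataset with its membership label; a classifier would be trained on features summarising each synthetic dataset, and the choice of those features is not specified here.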
5 Experiments
The membership privacy leakage is evaluated through a membership privacy
experiment [48, 49]. Consider a training set $D_t$ that follows a data
distribution $D$, and a learning algorithm $\phi$ trained on $D_t$. We assume
the adversary has access to $\phi$ only via a query interface, which exposes
the inference result and class probabilities for a given sample. In synthetic
data sharing use cases, the adversary has access to synthetic datasets
generated by some generative model. Here the goal of the adversary is to
discover the membership of a given sample, i.e. whether or not it belongs to
the training data, with access to only synthetic data. We assume domain-
strength of the proposed instance-level privacy filtering of synthetic data, we
measure the drop in MI attack accuracy with and without the filtering mechanism.
For this work, we use two datasets. The first is the SIVEP-Gripe database [2]
released by the Brazilian government, with records of 99,557 ICU COVID-19
patients in Brazil, which includes personal information (ethnicity, age, and
location). A logistic regression (binary classification) model is fit to
predict patient-level COVID-19 mortality with 6882 data samples and 25
features. The second is the Brazilian E-Commerce Public Dataset by Olist⁴. It
consists of 112,000 customer orders at multiple marketplaces in Brazil. The
dataset contains order status, price, payment and freight performance, customer
location, product attributes, and reviews written by customers. Customer and
seller IDs are anonymized, and references to the companies and partners in the
review text have been replaced with the names of Game of Thrones great houses.
For our experiment we use the columns price, customer identifier, customer zip
code, and customer city as feature columns. A logistic regression model is fit
to predict customer spend by aggregating the price column.
⁴ https://www.kaggle.com/olistbr/brazilian-ecommerce/
Fig. 2: Synthetic data size dependency on MIA attack for (a) E-Commerce
Dataset (b) EHR dataset
Fig. 3: Attack accuracy drop before and after filtering data points with high
privacy scores - (a) EHR dataset (b) E-Commerce Dataset
We create 5 synthetic datasets using CTGAN, WGAN-GP, PATEGAN, DPGAN, and RVAE
as synthetic data generators for the SIVEP-Gripe and E-Commerce data. For
training each generator, the input dimensions of the architecture are adjusted
to accommodate the dataset shape. All hidden layers use the ReLU activation
function, and the final layer uses the sigmoid function. The discriminator
network also uses three hidden layers, which are symmetric to the generator.
The hidden layers again use ReLU. We train each network for 100 epochs. For
training the differentially private GANs, we used privacy budgets of
$\epsilon = [0.01, 0.1, 0.5, 1.0, 3.0]$. $\lambda$ is set to 0.5 to reflect the
adversary's advantage in the membership attack. 1000 samples are randomly drawn
from the training data and marked as $D_{aux}$. To reflect maximum compromise
by the adversary, as in a white-box setting, we use the same generator to train
the shadow model $S$.
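The experimental settings listed above can be collected into a single configuration object. The sketch below records only values stated in the text; the class name, field names, and the assumption that every differentially private generator is run once per privacy budget are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    generators: tuple = ("CTGAN", "WGAN-GP", "PATEGAN", "DPGAN", "RVAE")
    dp_generators: tuple = ("PATEGAN", "DPGAN")
    epsilons: tuple = (0.01, 0.1, 0.5, 1.0, 3.0)  # privacy budgets
    epochs: int = 100
    lam: float = 0.5       # proximity threshold lambda
    aux_size: int = 1000   # samples drawn from training data as D_aux
    hidden_activation: str = "relu"
    output_activation: str = "sigmoid"


def enumerate_runs(cfg):
    """List (generator, epsilon) pairs; epsilon is None for non-DP generators.

    Assumption: each DP generator is trained once per privacy budget.
    """
    runs = []
    for name in cfg.generators:
        if name in cfg.dp_generators:
            runs.extend((name, eps) for eps in cfg.epsilons)
        else:
            runs.append((name, None))
    return runs


config = ExperimentConfig()
runs = enumerate_runs(config)
```

Under these assumptions the grid contains three non-private runs plus five runs for each of the two private generators, per dataset.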
5.1 Experimental Results and Discussion
We first present experimental results of the membership inference attack and
the attack accuracy drop before and after we filter the data with the
$\alpha_m$ method. We measure the accuracy of the attack with respect to the
number of samples in the auxiliary dataset used in the attack mechanism.
Figures 2 and 3 summarize the privacy risk of each synthesising method in terms
of MI attack accuracy and auxiliary dataset. We observe that as the size of the
synthetic dataset increases, the privacy of the data generator weakens.
Similarly, for differential privacy methods that add noise to the output with
some privacy budget $\epsilon$, the attack accuracy decreases, showing that
DP-based generators reduce privacy risks. The increase in privacy strength of
the dataset is consistent across datasets and different models, highlighting
the strength of our proposed method.
5.2 Discussion
Developing good synthetic data generation models still faces some technical
hurdles. For example, it is hard to predict exactly which data characteristics
will be preserved in a model's stochastic output, and appropriate metrics to
empirically compare two distributions are lacking. The perturbations introduced
under a privacy budget to achieve differential privacy make it even harder to
predict which records will remain vulnerable, and might even increase the
exposure of some data records. Our work aims to give some direction towards
improving synthetic data privacy by assigning instance-level privacy scores. A
data auditor or service provider can leverage the scores to remove the data
points prone to MI attacks. The advantage of our method is that it is
model-agnostic and can be used in a post-hoc fashion. We explored the accuracy
drop from the attacker's end, but measuring the accuracy-privacy tradeoff on
downstream models caused by dropping data points from the synthetic data needs
further investigation.
We want to highlight some of the limitations of our work. We only tested our
method on tabular data; further investigation is needed to understand the
privacy risks of other types of data such as text, images, and audio.
Additionally, we use the whole dataset to measure $\alpha_m$, which may be
misleading: as pointed out by [46], for discriminative models some subgroups
may be more vulnerable to membership leakage than others. We plan to address
this drawback in our future work. It is also important to point out that the
privacy risk measures provided in this analysis are dataset-dependent and can
be generator-specific. While the proposed method gives some empirical privacy
guarantees for generative models and guidance for the data sharing use case, we
need a more detailed analysis of the dependency of $\alpha_m$ on model
parameters and varied dataset distributions. In terms of threat models, our
analysis does not take into consideration an attacker who has knowledge of the
filtering scheme in advance. A malicious attacker can use only filtered data
points to train surrogate models, making the proposed scheme invalid.
6 Conclusion
One of the major impediments to advancing ML in security-sensitive domains is
the lack of open, realistic, publicly available datasets. To gain trust,
protect end-user data, and comply with regulations, stakeholders are leveraging
synthetic data as a tool to share data between third-party aggregators and
researchers. However, the privacy analysis of these datasets is still an open
research problem. In this work, we fill this gap by investigating the privacy
issues of generated datasets from the attacker and auditor points of view. We
propose an Instance-level Privacy Score (PS) for each synthetic sample by
measuring the memorization coefficient $\alpha_m$ per sample. Leveraging PS, we
empirically show that the accuracy of membership inference attacks on synthetic
data drops significantly. PS is a model-agnostic, post-training measure, which
not only gives data sharers guidance about the privacy properties of a given
sample but also helps third-party data auditors run privacy checks without
sharing model internals. Finally, there is a lot more work to be done purely
within the realm of privacy, and our work only addresses a very small issue
empirically. Understanding the social and legal implications of synthetic data
sharing needs input and partnerships from policy experts and compliance
officers.
Kuppa et al.
