
Synthetic Observational Health Data with GANs: from slow adoption to a boom in medical research and ultimately digital twins?

Preprints and early-stage research may not have been peer reviewed yet.

Georges-Filteau, Jeremy
Radboud University, The Hyve
jeremy@thehyve.nl
Cirillo, Elisa
The Hyve
elisa@thehyve.nl
November 14, 2020
Abstract
After being collected for patient care, Observational Health Data (OHD) can further benefit patient well-being by sustaining
the development of health informatics and medical research. Vast potential is unexploited because of the fiercely private nature
of patient-related data and the regulations governing its distribution. Generative Adversarial Networks (GANs) have recently
emerged as a groundbreaking approach to efficiently learn generative models that produce realistic Synthetic Data (SD). They have
revolutionized practices in multiple domains such as self-driving cars, fraud detection, simulations in the industrial and
marketing sectors known as digital twins, and medical imaging. The digital twin concept could readily apply to modelling and
quantifying disease progression. In addition, GANs possess a multitude of capabilities relevant to common problems in healthcare:
augmenting small datasets, correcting class imbalance, domain translation for rare diseases, and preserving privacy. Unlocking
open access to privacy-preserving OHD could be transformative for scientific research. In the midst of COVID-19, the healthcare
system is facing unprecedented challenges, many of which are data related and could be alleviated by the capabilities of GANs.
Considering these facts, publications concerning the development of GANs applied to OHD seemed to be severely lacking. To uncover
the reasons for the slow adoption of GANs for OHD, we broadly reviewed the published literature on the subject. Our findings show
that the properties of OHD were initially challenging for the existing GAN algorithms (unlike medical imaging, for which
state-of-the-art models were directly transferable), and that evaluating the SD was hampered by an ambiguous choice of metrics.
We find many publications on the subject, starting slowly in 2017 and since then being published at an increasing rate. The
difficulties of OHD remain, and we discuss issues relating to evaluation, consistency, benchmarking, data modelling, and
reproducibility.
1 Introduction
1.1 Background
Medical professionals collect Observational Health Data (OHD) in Electronic Health Records (EHRs)
at various points of care in a patient’s trajectory, to support and enable their work (Cowie et al., 2016). The
patient profiles found in EHRs are diverse and longitudinal, composed of demographic variables, recordings
of diagnoses, conditions, procedures, prescriptions, measurements and lab test results, administrative
information, and increasingly omics (Abedtash et al., 2020).
Having served its primary purpose, this wealth of detailed information can further benefit patient
well-being by sustaining medical research and development. That is to say, it can improve the development
life-cycle of Health Informatics (HI) and the predictive accuracy of Machine Learning (ML) algorithms, or
enable discoveries in research on clinical decisions, triage decisions, inter-institution collaboration, and
HI automation (Rudin et al., 2020; Rankin et al., 2020). Big health data is the underpinning of two prime
objectives of precision medicine: individualization of patient interventions and inferring the workings of
biological systems from high-level analysis (Capobianco, 2020). However, the private nature of
patient-related data and the growing widespread concern over its disclosure dramatically hamper the potential
for secondary usage of OHD for legitimate purposes.
Anonymization techniques are used to hinder the misuse of sensitive data. This implies a costly and
data-specific cleansing process, and the unavoidable trade-off of enhancing privacy to the detriment of data
utility (Dankar and El Emam, 2012; Cheu et al., 2019; De Cristofaro, 2020). These techniques are fallible
and do not prevent reidentification. In fact, no polynomial-time Differential Privacy (DP) algorithm can
produce Synthetic Data (SD) preserving all relations of the real data, even for simple relations such as
2-way marginals (Ullman and Vadhan, 2011). To address these drawbacks, alternative modes for sharing
sensitive data are an active research area, including privacy-preserving analytics and distributed learning.
Although promising, these approaches come with limitations, and their feasibility and scalability must
still be explored (Raisaro, 2018). Moreover, distributed models are vulnerable to a variety of attacks, for
which no single protection measure is sufficient, as research on defense is far behind attack (Enthoven and
Al-Ars, 2020; Gao et al., 2020; Luo et al., 2020; Lyu et al., 2020).
These conditions restrict access to OHD to professionals with academic credentials and financial
resources. Use of OHD by all other health data-related occupations is blocked, along with the downstream
benefits. For example, software developers rarely have access to the data at the core of the HI solutions
they are developing, and educators lack examples (Laderas et al., 2018).
1.2 Synthetic data
An alternative to traditional privacy-preserving methods is to produce full SD. We categorize methods
to produce SD as either theory-driven (theoretical, mechanistic or iconic) or data-driven (empirical
or interpolatory) modelling (Kim et al., 2017; Hand, 2019). Theory-driven modelling involves a complex
knowledge-based attempt to define a simulation process or a statistical model representing the causal
relationships of a system (Yousefi et al., 2018; Kansal et al., 2018). The Synthea (Walonoski et al., 2017)
synthetic patient generator is one such model, in which state transition models¹ produce patient
trajectories. It takes the model parameters from aggregate population-level statistics of disease progression
and medical knowledge. Such a knowledge-based model depends on prior knowledge of the system, and on how
much we can understand about it (Kim et al., 2017; Bonnéry et al., 2019). On one hand, theory-driven modelling
aims at understanding and offers interpretability; on the other, when modelling complex systems,
simplifications and assumptions are inevitable, leading to inaccuracies or reduced utility (Hand, 2019; Rankin
et al., 2020). In fact, relying on population-level statistics does not produce models capable of reproducing
heterogeneous health outcomes (Chen et al., 2019a).
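To make the idea of a state transition model more concrete, the following is a minimal Python sketch. The states, transition probabilities and the function name `simulate_trajectory` are hypothetical placeholders chosen for illustration; they are not taken from Synthea or from real disease statistics.

```python
import random

# Hypothetical states and transition probabilities (illustrative only,
# not derived from any real population-level statistics).
TRANSITIONS = {
    "healthy": [("healthy", 0.90), ("diagnosed", 0.10)],
    "diagnosed": [("treated", 0.70), ("diagnosed", 0.30)],
    "treated": [("healthy", 0.60), ("diagnosed", 0.40)],
}


def simulate_trajectory(start, n_steps, rng):
    """Sample one synthetic patient trajectory from the transition table."""
    state = start
    trajectory = [state]
    for _ in range(n_steps):
        next_states, weights = zip(*TRANSITIONS[state])
        # Draw the next state according to the pre-defined probabilities.
        state = rng.choices(next_states, weights=weights, k=1)[0]
        trajectory.append(state)
    return trajectory


rng = random.Random(0)  # fixed seed so the sketch is repeatable
trajectory = simulate_trajectory("healthy", 12, rng)
print(trajectory)
```

In such a model every probability has to be specified in advance, which illustrates why the approach depends so heavily on prior knowledge of the system.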
¹ Probabilistic model composed of pre-defined states, transitions, and conditional logic.
Data-driven modelling techniques infer a representation of the data from a sample distribution, to
summarize or describe it (Hand, 2019). There are many statistical modelling approaches to produce SD, but
they rest on intrinsic assumptions about the data. These assumptions bound their representational power to
correlations intelligible to the modeler, making them prone to obscure inaccuracies. SD generated by these
models hits a ceiling of utility (Rankin et al., 2020). In the ML field, generative models learn an
approximation of the multi-modal distribution, from which we can draw synthetic samples (Goodfellow et al.,
2014). Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have recently emerged as a
groundbreaking approach to learn generative models that produce realistic SD using Neural Networks (NNs).
GAN algorithms have rapidly found a wide range of applications, such as data augmentation in medical
imaging (Yi et al., 2019a; Wang et al., 2020a; Zhou et al., 2020).
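For reference, the original GAN formulation of Goodfellow et al. (2014) trains a generator G against a discriminator D through the minimax game below. The symbols follow that paper and are not otherwise used in this review: p_data is the real data distribution and p_z is the noise prior.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

Here D(x) is the discriminator's estimate of the probability that x is a real sample, and G(z) is a synthetic sample generated from noise z.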
The potential effects of GANs on healthcare and science are considerable (Rankin et al., 2020), some
of which have been realized in fields such as medical imaging. However, the application of GANs to OHD
seems to have been lagging (Xiao et al., 2018a). Well-known characteristics of OHD could explain the
relatively slow progress. Primarily, algorithms developed for images and text in other fields were easily
repurposed for the medical equivalents of these data types, whereas OHD presents a unique complexity in terms
of multi-modality, heterogeneity, and fragmentation (Xiao et al., 2018a). In addition, evaluating the
realism of synthetic OHD is intuitively complex, a problem that still burdens GANs. In 2017, the
first attempts at GANs for OHD were published (Esteban et al., 2017; Che et al., 2017; Choi et al., 2017a;
Yahi et al., 2017). We aimed to investigate whether these examples inspired more research and, if so, to gain
a comprehensive understanding of approaches to the problem and the techniques involved.
2 Methods
Table 1: Search query terms
Health data terms (OR)        AND    Generative adversarial model terms (OR)
clinical                             generative adversarial
health                               GAN
EHR                                  adversarial training
electronic health record             synthetic
patient
Publications concerning GANs for Observational Health Data (OHD-GAN) were identified with
Google Scholar (Google), Web of Science (Clarivate) and Prophy (Prophy). The terms and operators found
in Table 1 form the search query. We included studies reporting the development, application, performance
evaluation and privacy evaluation of GAN algorithms to produce OHD. We define OHD as categorical,
real-valued, ordinal or binary event data recorded for patient care. We list a more detailed summary of the
included and excluded data types in Table 3. The excluded data types are already the subject of one or
more reviews, or would merit a review of their own (Yi et al., 2019b; Nakata, 2019; Anwar et al., 2018; Wang
et al., 2020a; Zhou et al., 2020). In each of the included publications, we considered the aspects listed in
Table 2.
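As an illustration of how the terms and operators in the search query table (Table 1: Search query terms) combine, the short Python sketch below assembles a boolean query string. The quoting and parenthesis syntax is an assumption; each search engine has its own query syntax, and the helper names `or_group` and `build_query` are hypothetical.

```python
# Terms taken from Table 1 (Search query terms).
HEALTH_TERMS = ["clinical", "health", "EHR", "electronic health record", "patient"]
GAN_TERMS = ["generative adversarial", "GAN", "adversarial training", "synthetic"]


def or_group(terms):
    """Join terms with OR, quoting each so multi-word terms stay phrases."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(left_terms, right_terms):
    """Combine the two OR groups with a single AND."""
    return f"{or_group(left_terms)} AND {or_group(right_terms)}"


query = build_query(HEALTH_TERMS, GAN_TERMS)
print(query)
```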
Table 2: Aspects analysed in each of the publications included in the review
A) Types of healthcare data
B) GAN algorithm, learning procedures, losses
C) Intended use of the SD
D) Evaluation metrics
E) Privacy considerations
F) Interpretability of the model
Table 3: Types of OHD data included or excluded from the review.
Type | Examples
Included
Observations | Demographic information, medical classification, family history
Time-stamped observations | Diagnosis, treatment and procedure codes, prescription and dosage, laboratory test results, physiologic measurements and intake events
Encounters | Visit dates, care provider, care site
Derived | Aggregated counts, calculated indicators
Excluded
Omics | Genome, transcriptome, proteome, immunome, metabolome, microbiome
Imaging | X-rays, computed tomography (CT), magnetic resonance imaging (MRI)
Signal | Electrocardiogram (ECG), electroencephalogram (EEG)
Unstructured | Narrative reports, textual
3 Results
3.1 Summary
We found 43 publications describing the development or adaptation of OHD-GAN, presented in Table
4. We can generalize the data addressed in each of these publications into one of two categories:
time-dependent observations, such as time-series, or static representations in the form of feature vectors,
such as tabular rows. We briefly bring attention to the lack of multi-relational tabular representations, the
primary form of EHR, and further discuss the subject in later sections.
Most efforts propose adaptations of current algorithms to the characteristics and complexities of
OHD. These include multi-modality of marginal distributions or non-Gaussian real-valued features,
heterogeneity, a combination of discrete and real-valued features, longitudinal irregularity, complex
conditional distributions, missingness or sparsity, class imbalance of categorical features, and noise.
While these properties may make training a useful model difficult, the variety of applications that are
highly relevant and needed in the healthcare domain provides sufficient incentive. The most cited motives
are, as one would expect, to cope with the often limited number of samples in medical datasets and to
overcome the highly restricted access to OHD. The potential of releasing privacy-preserving SD freely is
a common subject. Publications considering privacy evaluate the effect on utility of applying DP to their
algorithm, propose alternative privacy concepts and metrics, or concentrate only on the subject of privacy.
3.2 Motives for developing OHD-GAN
Some claim that the ability to generate synthetic data is becoming an essential skill in data science
(Sarkar, 2018), but what purpose can it serve in the medical domain? The authors mention a wide range of
potential applications. We briefly describe the four prevailing themes in the following sections: data
augmentation (Sec. 3.2.1), privacy and accessibility (Sec. 3.2.2), precision medicine (Sec. 3.2.3) and
modelling simulations (Sec. 3.2.4).
3.2.1 Data augmentation
Data augmentation is mentioned in nearly all publications. Although counter-intuitive, GANs can
generate SD that conveys more information about the real data distribution. Effectively, the real-valued
space distribution of the generator produces a more comprehensive set of data points, valid but not present
among the discrete real data points. A combination of real and synthetic training data habitually leads to
increased predictor performance (Wang et al., 2019a; Che et al., 2017; Yoon et al., 2018a,b; Yang et al.,
2019a; Chen et al., 2019a; Cui et al., 2019). A more intuitive way to grasp the concept is from the point
of view of image classification, where training images are augmented with perturbations such as rotation,
shift, shear and scale, known as invariances (Antoniou et al., 2017).
Similarly, domain translation and Semi-supervised Learning (SSL) training approaches with GANs
could support predictive tasks that lack data with accurate labels, lack paired samples or suffer from class
imbalance (Che et al., 2017; McDermott et al., 2018; Yoon et al., 2018a). Another example is correcting
discrepancies between datasets collected in different locations or under different conditions, which induce
bias (Yoon et al., 2018c). GANs are also well adapted for data imputation, where entries are Missing at
Random (MaR) (Yoon et al., 2018b).
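The imputation setting can be sketched as follows. A binary mask marks which entries were observed, a generator proposes values for every entry, and the imputed record keeps the observed values while filling only the missing ones. This is a minimal sketch under stated assumptions: `toy_generator` is a hypothetical stand-in that returns column means, not a trained GAN, and the data values are made up.

```python
import numpy as np


def toy_generator(x, mask):
    """Hypothetical stand-in for a trained generator: proposes the mean of
    the observed values in each column for every entry."""
    observed_sum = (x * mask).sum(axis=0)
    observed_count = np.maximum(mask.sum(axis=0), 1)  # avoid division by zero
    col_means = observed_sum / observed_count
    return np.broadcast_to(col_means, x.shape)


def impute(x, mask, generator):
    """Keep observed entries (mask == 1), fill missing ones with generator output."""
    x_filled = np.where(mask == 1, x, 0.0)  # replace NaN placeholders with zeros
    proposal = generator(x_filled, mask)
    return mask * x_filled + (1 - mask) * proposal


# Made-up records with two missing entries.
x = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0]])
mask = (~np.isnan(x)).astype(float)
x_imputed = impute(x, mask, toy_generator)
print(x_imputed)
```

In an adversarial setup the stand-in would be replaced by a generator trained against a discriminator that tries to tell observed entries from imputed ones.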
3.2.2 Enhancing privacy and increasing data accessibility
Most authors see SD as the key to unlocking the unexploited value of OHD, whose inaccessibility hinders
machine learning and scientific progress (Beaulieu-Jones et al., 2019; Baowaly et al., 2019; Baowaly et al.,
2018; Che et al., 2017; Esteban et al., 2017; Fisher et al., 2019; Severo et al., 2019) or education (Laderas
et al., 2018). We can broadly describe preserving privacy as reducing the risk of a re-identification attack
to an acceptable level. This level of risk is quantified when releasing data anonymized with DP.
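For readers unfamiliar with how DP expresses this level of risk, the standard (ε, δ) definition is reproduced below. The notation is conventional and is not taken from the reviewed publications: a randomized mechanism M is (ε, δ)-differentially private if, for all datasets D and D' differing in a single record and for every set S of possible outputs,

```latex
\Pr\left[\mathcal{M}(D) \in S\right] \le e^{\varepsilon} \, \Pr\left[\mathcal{M}(D') \in S\right] + \delta
```

Smaller values of ε and δ mean that the presence or absence of any single patient record changes the distribution of released outputs less.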
Due to its artificial nature, SD is put forward as a way to forgo the tight restrictions on data sharing,
while potentially providing greater privacy guarantees (Beaulieu-Jones et al., 2019; Baowaly et al., 2019;
Baowaly et al., 2018; Esteban et al., 2017; Fisher et al., 2019; Walsh et al., 2020; Chin-Cheong et al., 2019).
Enabling access to a greater variety, quality and quantity of OHD could have positive effects in a wide range
of fields, such as software development, education, and training of medical professionals. The fact remains
that GANs do not eliminate the risk of reidentification. Considering that none of the synthetic data points
represent actual people, the significance of such an occurrence is unclear. It is possible to combine both
methods, and training GANs under DP shows evidence of reducing the loss of utility compared to DP alone.
3.2.3 Enabling precision medicine
The application to precision medicine involves predicting outcomes conditioned on a patient’s
current state and history. Simulated trajectories could help inform clinical decision making by quantifying
disease progression and outcomes, and have a transformative effect on healthcare (Walsh et al., 2020; Fisher
et al., 2019). Ensembles of stochastic simulations of individual patient profiles, such as those produced by
a Conditional Restricted Boltzmann Machine (CRMB), could help quantify risk at an unprecedented level of
granularity (Fisher et al., 2019).
Predicting patient-specific responses to drugs, a problem known as Individualized Treatment
Effects (ITE), is still a new field of research. Estimating ITEs is persistently hampered by the lack of
paired counterfactual samples (Yoon et al., 2018a; Chu et al., 2019). To solve similar problems in medical
imaging, various GAN algorithms were developed for domain translation, mapping a sample from its original
class to the paired equivalent. This includes bidirectional transformations, allowing GANs to learn mappings
from very few paired samples, or none at all (Wolterink et al., 2017; Zhu et al., 2017a; McDermott et al., 2018).
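The lack of paired counterfactual samples can be made explicit with the potential outcomes notation common in the causal inference literature. This notation is an assumption made here for illustration; the reviewed publications may use different symbols. For a patient with features x, a binary treatment, and potential outcomes Y(1) (treated) and Y(0) (untreated), the ITE can be written as:

```latex
\tau(x) = \mathbb{E}\left[\, Y(1) - Y(0) \mid X = x \,\right]
```

Only one of Y(1) and Y(0) is ever observed for a given patient, which is why the counterfactual outcome has to be estimated or generated.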
3.2.4 From patient and disease models to digital twins
A well-trained model approximates the process that generated the real data points. The relations
learned by the model, its parameters, contain meaningful information if we can learn to harness it.
Data-driven algorithms evolve as our understanding of their behavior improves. We incorporate new concepts in
the algorithms, leading to further understanding and iteratively blurring the line with theory-driven
approaches (Hand, 2019). Interpretability is a growing field of research concerned with understanding how the
learned parameters of a model relate; in other words, analysing the representation the algorithm has converged
to and deriving meaning from obscure logic. Incorporating new understanding in the architecture of
algorithms shifts the view from a data-driven to a theory-driven perspective (Hand, 2019). As we purposefully
build structure in our algorithms from new understanding, we may get the chance to explore meaningful
representations that would otherwise be beyond our reasoning.
Approaching these ideas from above, the concept of “digital twins” represents in a way the ultimate
realization of Personalized Medicine. A common practice in industrial sectors is to build high-fidelity
virtual representations of physical assets: long-term simulations that provide an overview and comprehensive
understanding of the workings, behavior and life-cycle of their real counterparts. The state of the models is
continuously updated from theoretical data, real data, and streaming Internet of Things (IoT) indicators.
Purposefully conditioned input data allows the exploration of specific events or conditions. In a position
paper on the subject, Angulo et al. draw parallels between this technique and the current needs in healthcare,
and point to the emergence of technologies for actionable models of patients (Angulo Bahon et al., 2019;
Angulo et al., 2020). The authors bring up the rapid adoption of wearables that continuously monitor
people’s physiological state.
Wearables are one of many mobile, digitally connected devices that collect patient data over a broad
range of physiological characteristics and behavioral patterns (Coravos et al., 2019). This emerging trend,
known as digital bio-markers, has already led to studies demonstrating predictive models with the potential
for improved patient care (Snyder et al., 2018). Through continuous lifelong learning, integrating multiple
modes of personal data, generative patient models could inform the diagnostics of medical professionals and
also enable testing treatment options. In the proposal of Angulo et al., GANs are an essential component of
the ecosystem, ensuring patient privacy and providing bootstrap data. Fisher et al. already use the term
“digital twin” to describe their process, noting that the twins present no privacy risk and enable simulating
patient cohorts of any size and characteristics (Walsh et al., 2020).
Table 4: Summary of the publications included in the review.
Publication | Algorithm(s) | Focus, algorithms, and techniques | Data type
2017
Choi et al. | medGAN | Incompatibility of back-propagation with discrete features. Autoencoder (AE), Mini-batch Averaging (MB-Avg), batch-normalization (BN), shortcut connections (SC), Attribute Disclosure (AD), Presence Disclosure (PD). | Binary occurrences or counts of medical codes.
Yahi et al. | medGAN adaptation | Drug Laboratory Effects (DLE) on continuous time-series, multi-modality. t-Distributed Stochastic Neighbor Embedding (t-SNE). | Paired pre/post treatment exposure time-series.
Esteban et al. | Recurrent GAN (RGAN), Recurrent Conditional GAN (RC-GAN) | Adversarial training of (conditional) Recurrent NNs (RNNs) on time-series, evaluation, privacy. Long Short-term Memory (LSTM), Conditional GAN (CGAN), Differentially private stochastic gradient descent (DP-SGD). | Regularly observed real-valued time-series (RV-TS).
Xiao et al. | WGAN for Temporal Point-processes (PPWGAN) | Temporal Point Processes. LSTM, Wasserstein GAN (WGAN), Poisson process. | Sporadic occurrences, hospital visits.
Che et al. | Electronic Health Record GAN (ehrGAN), Semi-supervised Learning with a learned ehrGAN (SSL-GAN) | Semi-supervised augmentation, transitional distribution. 1D-CNN, Word2vec, Variational contrastive divergence (VCD). | Discrete time-series (D-TS), sequences of medical codes.
Dash et al. | HealthGAN | Sleep patterns, stratification by covariates. | Binary over multiple visits.
2018
Camino et al. | Multi-categorical ARAE (MC-ARAE), Multi-categorical medGAN (MC-medGAN), Multi-categorical Gumbel-softmax GAN (MC-GumbelGAN), Multi-categorical WGAN with Gradient Penalty (MC-WGAN-GP) | Improving training process. medGAN, WGAN with Gradient Penalty (WGAN-GP), Gumbel-Softmax GAN (Gumbel-GAN), Adversarially regularized autoencoder (ARAE). | Multiple categorical variables.
McDermott et al. | Cycle Wasserstein Regression GAN (CWR-GAN) | Cycle-consistent semi-supervised regression learning, unpaired data, class imbalance. WGAN, Cycle-consistent GAN (Cycle-GAN), ITE. | ICU RV-TS, lack of paired samples, SD.
Yoon et al. | Generative Adversarial Nets for inference of Individualized Treatment Effects (GANITE) | ITE, unobserved counterfactual, multi-label classification, uncertainty. CGAN pair. | Feature, treatment and outcome vectors.
Yoon et al. | RadialGAN | Multi-domain translation, features and distribution mismatch, cycle-consistency, augmentation. CGAN, WGAN. | Tabular, discrete and continuous.
Yoon et al. | Generative Adversarial Imputation Network (GAIN) | Tabular data imputation. Missing Completely at Random (MCaR), CGAN. | Real-valued, tabular with entries MCaR.
2019
Wang et al. | Sequentially Coupled GAN (SC-GAN) | Capturing mutual influence in time-series. Coupled generator pair. Treatment recommendation task. LSTM, CGAN. | RV-TS of patient state and medication dosage data.
Baowaly et al. | Boundary-seeking medGAN (MedBGAN) | Improving training process. medGAN, Boundary-seeking GAN. | Binary occurrences or counts of medical codes.
Baowaly et al. | MedBGAN, Wasserstein medGAN (MedWGAN) | Improving training process. medGAN, BGAN, WGAN. | Binary occurrences or counts of medical codes.
Severo et al. | Conditional WGAN-GP (cWGAN-GP) | Generation and public release of dataset. Protecting commercial sensitive information. Class imbalance. cWGAN-GP, CGAN. | Physiological RV-TS.
Chin-Cheong et al. | WGAN | Heterogeneous mixture of dense and sparse features. Privacy and evaluating the introduction of bias. WGAN, WGAN-GP, Mode-specific normalization (MSN), DP-aware optimizer from the TensorFlow community. | Binary, real-valued and categorical.
Jordon et al. | Private Aggregation of Teacher Ensembles (PATE) framework applied to GANs (PATE-GAN) | Alternative differential privacy, adaptation of the Private Aggregation of Teacher Ensembles (PATE) framework. | Demographic and binary.
Torfi and Beyki | corGAN | Convolutional NN (CNN) architecture, capturing feature correlations, evaluating realism, privacy evaluation using Membership Inference (MI). 1D-Convolutional AE (CAE). | Binary occurrences or counts of medical codes.
Chu et al. | Adversarial Deep Treatment Effect Prediction (ADTEP) | ITE, two independent AE for patient and treatment feature sets, trained adversarially in combination, and outcome predictor from latent representation. | EHR data, not specified.
Jackson and Lussetti | medGAN | Evaluating medGAN with the addition of demographic features. | Demographic features and binary occurrences or counts of medical codes.
Yu et al. | SSL-GAN | Rare disease detection, Semi-supervised learning (SSL), leveraging unlabeled EHR data, medical code embedding network. LSTM. | Diagnosis and prescription codes.
Yang et al. | CGAN | Class imbalance, low count of minority class. Semi-supervised learning combining Self-training (ST) and CT with a CGAN for an IoT application. | Twenty medical datasets from the UCI repository, types unspecified.
Yang et al. | CorrNN and T-wGAN (GcGAN) | Capturing the correlations between different categories of medical codes and the outcome. Correlation NN, Turing GAN, Wasserstein T-GAN (T-wGAN). | Binary occurrences or counts of medical codes.
Yang et al. | Categorical GAIN (CGAIN) | Improve on GAIN for categorical variables using fuzzy encoding of the features. | Categorical (multi-class and multi-label) real-valued.
Camino et al. | GAIN, GAIN+Variable Splitting (VS), Variational AE (VAE), VAE+Iterative Imputation (IT), VAE+Backpropagation IT (BP), VAE+VS, VAE+VS+IT, VAE+VS+BP | Benchmark and improve on generative imputation with GAIN and VAE. | Categorical and real-valued. Mostly not OHD.
Beaulieu-Jones et al. | Auxiliary Classifier GAN (AC-GAN) | Evaluating whether differentially private GANs allow valid reanalysis while ensuring privacy. DP, CGAN. | Physiological RV-TS.
Xu et al. | Conditional Tabular GAN (CTGAN) | Non-Gaussian multi-modal distribution of continuous columns and imbalanced discrete columns in tabular data. Evaluation benchmark. CGAN, Training by sampling (TbS), MSN, WGAN-GP, Gumbel-GAN. | Tabular real-valued and categorical.
Yale et al. | HealthGAN | Privacy metrics and over-fitting. MI, Nearest-neighbor Adversarial Accuracy (NN-AA), Privacy loss (PL), Discriminator testing (DT). | Categorical demographics, real-valued and binary medical codes.
Fisher et al. | Adversarially trained CRMB | Simulation of patient trajectories from their baseline state, disease prediction and risk quantification, missingness. CRMB. | Binary, ordinal, categorical, and continuous, 3-month intervals.
2020
Walsh et al. | Adversarially trained CRMB | Digital twins, disease prediction and risk quantification, missingness. CRMB. | Binary, ordinal, categorical, and continuous, 3-month intervals.
Yale et al. | HealthGAN | Metrics to capture a synthetic dataset’s resemblance, privacy, utility and footprint. Evaluating applications. Application case studies, reproducibility of studies with SD. NN-AA, PL, Data obfuscation (DO), medGAN, WGAN-GP, Synthetic Data Vault. | Real-valued and categorical. Demographics, vital signs, diagnoses, and procedures.
Tantipongpipat et al. | DP-auto-GAN | Privacy, medGAN adaptation, evaluation metrics. DP-SGD, AE, medGAN, Renyi Differential Privacy (RDP). | Medical data: binary. Non-health data: categorical and real-valued.
Bae et al. | GANs for anonymizing private medical data (AnomiGAN) | Probabilistic scheme that ensures indistinguishability of the SD, which can be viewed as encrypted. DP, CNN. | Binary occurrences of medical codes.
Cui et al. | Complementary pattern Augmentation (CONAN) | Complementary GAN in a rare disease predictor model that generates positive samples from negatives to alleviate class imbalance. | Embedding vectors representing multiple patient visits and conditions.
Zhu et al. | Blood Glucose GAN (GluGAN) | Adversarially trained RNN to predict the upcoming time-step in physiological time-series conditioned on the past observations. RNN, CNN, Gated Recurrent Unit (GRU). | RV-TS of blood glucose measurements, discrete patient-submitted features.
Chen et al. | medGAN, WGAN-GP, DC-GAN | Privacy analysis of generative models. MI, Full Black-box Attack, Partial Black-box Attack, White-box Attack, DP-SGD. | Binary vector of medical codes.
Chin-Cheong et al. | WGAN with DP (WGAN-DP) | Heterogeneous data, effect of differential privacy on utility. WGAN, DP. | Categorical, continuous, ordinal, and binary. Dense or sparse.
Camino et al. | - | Initially a comparison of GANs and VAEs, but they choose instead to bring attention to the problem of benchmarking. Analysis of the problem, requirements and suggestions. GAIN, Six component GAN (HexaGAN) (Hwang et al., 2019), Missing data IWAE (MIWAE) (Mattei and Frellsen, 2019), Heterogeneous-Incomplete VAE (HI-VAE) (Nazabal et al., 2020), Multiple Imputation Denoising Autoencoders (MIDA) (Gondara and Wang, 2017). | Real-valued and categorical.
Zhang et al. | EMR Wasserstein GAN (EMR-WGAN) | Improving training, evaluation metrics, sparsity. WGAN, BN, Layer normalisation (LN), CGAN. | Binary occurrences of medical codes. Low prevalence of codes.
Yan et al. | Heterogeneous GAN (HGAN) | Improvements on EMR-WGAN incorporating record-level constraints in the loss function. WGAN, BN, LN, CGAN, MI, PD. | Binary, categorical and real-valued.
Ozyigit et al. | Realistic Synthetic Dataset Generation Method (RSDGM) | Exploring the feasibility of various methods to generate synthetic datasets. WGAN. | Real-valued and categorical.
Yoon et al. | Anonymization through data synthesis using GAN (ADS-GAN) | Identifiability view of privacy. Generator conditioned on real sample inputs with an identifiability loss to satisfy the identifiability constraint. WGAN, WGAN-GP, DP alternative. | Real-valued and binary.
Goncalves et al. | MC-medGAN | Comparison of GANs with statistical models to generate synthetic data, evaluation metrics. MI, AD. | Categorical and real-valued.
3.3 Data Types and Feature Engineering
No publications made use of OHD in its initial form, that is, patient records in EHRs composed of many
related tables (normalized form). The complexity of a model would explode when maintaining referential
integrity and statistics between multiple tables. The hierarchy by which these would interact with each other
conditionally is no less complicated (Buda et al., 2015; Patki et al., 2016; Zhang and Tay, 2015; Tay et al.,
2013). There are no published GAN algorithms made to consume normalized databases in their original form. In
all publications we considered, feature engineering was used to adapt the data to task requirements, or to
promising algorithms that fit the data characteristics. They transform the data into one of four modalities:
time series, point-processes, ordered sequences or aggregates, described in Table 5.
Table 5: Types of observational health data and features engineering
Type Values and structure Challenges Features engineering
Time-series
Continuous
Regular
Sporadic
- Time-stamped observations
- Continuous, ordinal, categorical
and/or multi-categorical
- Recorded continuously by medical de-
vices, following a schedule by medical
professional, or when necessary
- Observations are often MaR across
time and dimensions, erroneous, or
completely absent for certain patients.
- Time-series of different concepts are
often highly correlated and their in-
fluence on one another must be ac-
counted for.
Imputation coupled
with training
Regular
Data imputation
Binning into fixed-size intervals
Combination of bin-
ning and imputation
Point-processes - Series of timestamped observations
of one variable or medical concept per
patient
- Series of events re-
duced to the time
interval between each
consecutive occur-
rence.
Ordered
sequences
- Ordered vectors representing one or
more patients visits
- Medical codes associated with the di-
agnoses, procedures, measurements
and interventions
Variable length
High-dimensional
Long-tail distribution of codes
Sequences are pro-
jected into a trained
embedding that
preserves semantic
meaning according
to methods borrowed
from NLP
Tabular
Denormalized
Relational
- Medical and demographic variables
aggregated in tabular format
- Continuous, ordinal, categorical
and/or multi-categorical features
Medical history is aggregated into a
fixed-size vector of binary or aggregated
counts of occurrences and combined
with demographic features.
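The feature engineering steps listed in Table 5 for point-processes and sporadic time-series can be illustrated with a minimal sketch. The function names, the example values and the choice of averaging inside each bin are assumptions made for illustration; the reviewed publications differ in how they bin and impute.

```python
from statistics import mean

def inter_event_intervals(timestamps):
    """Reduce a series of event times to the gaps between consecutive events."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def bin_series(observations, bin_width, n_bins):
    """Average (time, value) observations inside fixed-size intervals.

    Bins without any observation are returned as None so that a later
    imputation step can fill them.
    """
    buckets = [[] for _ in range(n_bins)]
    for t, value in observations:
        index = int(t // bin_width)
        if 0 <= index < n_bins:
            buckets[index].append(value)
    return [mean(b) if b else None for b in buckets]

# Hypothetical admission days for one patient (point process).
admission_days = [3, 10, 12, 30]
print(inter_event_intervals(admission_days))  # [7, 2, 18]

# Hypothetical sporadic heart-rate readings as (hour, beats per minute).
heart_rate = [(0.5, 80), (0.9, 84), (2.2, 90), (5.1, 76)]
print(bin_series(heart_rate, bin_width=2, n_bins=4))
```

The last bin contains no reading and comes back as None, which is the situation where binning has to be combined with imputation.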
3.4 Data oriented GAN development
3.4.1 Auto-encoders and categorical features
In what is, to the best of our knowledge, the first attempt at developing a GAN for OHD, Choi et al. focus
on the problem posed by the incompatibility of categorical and ordinal features with back-propagation.
Their solution is to pretrain an AE to project the samples to and from a continuous latent space
representation. They keep the decoder portion along with its trained weights to form a component of medGAN
(Choi et al., 2017a). It is incorporated into the generator and maps the randomly sampled input vectors from
the real-valued latent space representation back to discrete features. This first exemplar of synthetic OHD
generated by a GAN inspired a series of enhancements.
Early efforts were to improve the performance of medGAN. Among the first, Camino et al. developed
MC-medGAN, changing the AE component by splitting its output into a Gumbel-Softmax (Jang et al., 2016)
activation layer for each categorical variable and concatenating the results (Camino et al., 2018). The
authors also developed an adaptation based on recent training techniques: WGAN (Arjovsky et al., 2017) and
a WGAN (briefed in Panel 1) with Gradient Penalty (Gulrajani et al., 2017). MC-WGAN-GP is the equivalent
of MC-medGAN but with Softmax layers. The authors report that the choice of a model will depend on data
characteristics, particularly sparsity.
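As an illustration of the per-variable Gumbel-Softmax output described above, here is a minimal sketch in plain Python. It is not the MC-medGAN implementation: the variable names, the logits and the temperature are hypothetical, and a real model would compute the logits with a neural network and back-propagate through the relaxed sample.

```python
import math
import random

def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed one-hot sample from unnormalised log-probabilities."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U uniform on (0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def sample_record(logits_per_variable, temperature, rng):
    """Apply one Gumbel-Softmax per categorical variable and concatenate."""
    record = []
    for logits in logits_per_variable:
        record.extend(gumbel_softmax(logits, temperature, rng))
    return record

rng = random.Random(0)
# Hypothetical decoder outputs: a 2-category and a 4-category variable.
logits_per_variable = [[0.2, -0.1], [1.0, 0.0, -1.0, 0.5]]
record = sample_record(logits_per_variable, temperature=0.5, rng=rng)
print(record)
```

Each block of the concatenated vector sums to one, and lowering the temperature pushes each block closer to a one-hot vector.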
Owing to the propensity of OHD to induce mode collapse, subsequent authors widely adopted Wasserstein's
distance. Baowaly et al. developed MedWGAN, also based on WGAN, and MedBGAN, borrowing from
Boundary-seeking GAN (BGAN) (Hjelm et al., 2017), which pushes the generator to produce samples that
lie on the decision boundary of the discriminator, expanding the search space. Both led to improved data
Panel 1. Wasserstein’s distance
In brief, the Wasserstein distance is a measure between two Probability Distributions (PDs) that has the property of always
providing a smooth gradient. As the loss function of the discriminator, this property improves training stability and mitigates
mode collapse. To make the equation tractable, a 1-Lipschitz constraint must be introduced, creating another problem. In the
words of the authors:
"Weight clipping is a clearly terrible way to enforce a Lipschitz constraint. If the clipping parameter is large,
then it can take a long time for any weights to reach their limit, [...] If the clipping is small, this can easily lead
to vanishing gradients [...] However, we do leave the topic of enforcing Lipschitz constraints in a neural network
setting for further investigation, and we actively encourage interested researchers to improve on this method."
(Arjovsky et al., 2017)
Clipping sometimes prevented the network from modelling the optimal function, and it was later replaced by gradient
penalty, a less restrictive regularization (Petzka et al., 2018).
quality, in particular MedBGAN (Baowaly et al., 2019; Baowaly et al., 2018). Jackson and Lussetti tested
medGAN on an extended dataset containing demographic and health system usage information, obtaining
results similar to the original (Jackson and Lussetti, 2019). HealthGAN, based on WGAN-GP, includes a
data transformation method adapted from the Synthetic Data Vault (Patki et al., 2016) to map categorical
features to and from the unit numerical range (Yale et al., 2020).
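To make the smooth-gradient property described in Panel 1 more concrete, the following sketch computes the empirical Wasserstein-1 distance between two one-dimensional samples of equal size. This closed form (mean absolute difference of the sorted values) only holds in one dimension; WGAN-type models estimate the distance with a critic network instead. The sample values are made up for illustration.

```python
def wasserstein_1d(sample_a, sample_b):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)

real = [0.0, 1.0, 2.0, 3.0]
for shift in (0.0, 5.0, 10.0):
    synthetic = [value + shift for value in real]
    print(shift, wasserstein_1d(real, synthetic))
```

The distance grows linearly with the shift even once the two samples no longer overlap, so a generator still receives a usable training signal when its output is far from the real data.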
3.4.2 Forgoing the autoencoder and introducing conditional training
Claiming that the use of an AE introduces noise, Zhang et al. dispose of the AE component of previous
algorithms in EMR-WGAN and introduce a conditional training method, along with conditioned
BN and LN techniques to stabilise training (Zhang et al., 2020). The algorithm was further adapted by
Yan et al. as HGAN to better account for the conditional distributions between multiple data types and
enforce record-wise consistency. A recognized problem with medGAN was that it produced common-sense
inconsistencies, such as gender mismatches in medical codes (Yan et al., 2020; Choi et al., 2017a). HGAN
enforces constraints by adding specific penalties to the loss function, such as limit ranges for
numerical-categorical pairs and mutual exclusivity for pairs of binary features (Yan et al., 2020). The
algorithm also performs well on regular time-series of sleep patterns (Dash et al., 2019).
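The idea of record-level penalties can be sketched as follows. This is an illustrative assumption, not the HGAN loss: the feature names and the two rules are hypothetical, and the hard threshold on the categorical flag would need a differentiable form in a real model.

```python
def range_penalty(value, lower, upper):
    """Zero inside [lower, upper], growing linearly with the violation outside."""
    return max(0.0, lower - value) + max(0.0, value - upper)

def exclusivity_penalty(prob_a, prob_b):
    """Penalise two binary features that are both switched on."""
    return prob_a * prob_b

def constraint_penalty(record, weight=1.0):
    """Sum of hypothetical record-level penalties for one generated record."""
    penalty = 0.0
    # Hypothetical rule: records flagged as paediatric must have age in [0, 18].
    if record["is_paediatric"] > 0.5:
        penalty += range_penalty(record["age"], 0.0, 18.0)
    # Hypothetical rule: two sex-specific diagnosis codes exclude each other.
    penalty += exclusivity_penalty(record["code_female_only"], record["code_male_only"])
    return weight * penalty

consistent = {"is_paediatric": 1.0, "age": 7.0,
              "code_female_only": 1.0, "code_male_only": 0.0}
inconsistent = {"is_paediatric": 1.0, "age": 42.0,
                "code_female_only": 0.9, "code_male_only": 0.8}
print(constraint_penalty(consistent), constraint_penalty(inconsistent))
```

In training, such a term would be added to the generator loss so that inconsistent records cost more than consistent ones.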
To develop CTGAN, Xu et al. presume that tabular data poses a challenge to GANs owing to the
non-Gaussian multi-modal distribution of real-valued columns and imbalanced discrete columns (Xu et al.,
2019). The fully connected layers have adaptations to deal with both real-valued and categorical features.
For real-valued features, they use mode-specific normalization to capture the multiplicity of modes. For
discrete features, they introduce conditional training-by-sampling to re-sample discrete attributes evenly
during training, while recovering the real distribution when generating data.
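A minimal sketch of mode-specific normalization is given below, under stated assumptions: the mixture parameters are hypothetical and taken as already fitted, the most likely mode is picked deterministically (a simplification, where sampling a mode is the alternative), and the value is scaled by four standard deviations of the chosen mode.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mode_specific_normalize(x, modes):
    """Encode x as (one-hot mode indicator, value scaled within that mode).

    `modes` is a list of (weight, mean, std) tuples, assumed to come from a
    mixture model fitted to the column beforehand.
    """
    densities = [w * gaussian_pdf(x, mu, sigma) for w, mu, sigma in modes]
    k = densities.index(max(densities))  # most likely mode (a simplification)
    _, mu, sigma = modes[k]
    one_hot = [1 if i == k else 0 for i in range(len(modes))]
    return one_hot, (x - mu) / (4 * sigma)

def denormalize(one_hot, scaled, modes):
    """Invert the encoding using the mode selected by the one-hot vector."""
    _, mu, sigma = modes[one_hot.index(1)]
    return scaled * 4 * sigma + mu

# Hypothetical bimodal column, e.g. a lab value with two patient subgroups.
modes = [(0.6, 5.0, 1.0), (0.4, 50.0, 5.0)]
one_hot, scaled = mode_specific_normalize(52.0, modes)
print(one_hot, scaled, denormalize(one_hot, scaled, modes))
```

The pair (one-hot mode, scaled value) keeps both modes on a comparable scale, which a single min-max or z-score normalization of the whole column would not.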
In other efforts, Torfi and Beyki develop corGAN, with a 1-dimensional Convolutional AE (1D-CAE)
to capture neighboring feature correlations of the input vectors (Torfi and Beyki,2019). Chin-Cheong et al.
use a Feed-forward Network (FFN) based on Wasserstein’s distance to evaluate the capacity of GANs to
model heterogeneous data of dense and sparse medical features (Chin-Cheong et al.,2020). Ozyigit et al.
use the same approach, focusing on reproducing statistical properties (Ozyigit et al.,2020).
3.4.3 Time-series
Esteban et al. devise the LSTM-based RGAN and RC-GAN to generate regular time-series of physiological
measurements from bedside monitors (Esteban et al., 2017). Curiously, the authors dismiss Wasserstein's
distance explicitly, and generate each dimension of their time-series independently, where one
Panel 2. Transitional distribution
The ehrGAN generator is trained to decode a random vector $z$ mixed with the latent space representation $h$ of a
real patient to produce a synthetic sample $\tilde{x}$ (Che et al., 2017). A standard autoencoder (left) is trained
to encode a real patient $x$ to and from a latent representation $h$, minimizing the reconstruction error with
$\bar{x}$. The decoder portion (left) is then trained to produce realistic synthetic samples $\tilde{x}$ from a
combination of the random latent vector $z$ and the latent space encoding of a real patient $x$. The generator thus
learns a transition distribution $p(\tilde{x}|x)$ with $x \sim p_{data}(x)$. The amount of contribution of the real
sample is controlled by a random mask $m$ according to $\tilde{h} = m \cdot z + (1 - m) \cdot h$. This method,
inspired by Variational Contrastive Divergence, prevents mode collapse by design and learns an information-rich
transition distribution $p(\tilde{x}|x)$ around real samples $x$.
would assume they are correlated. They observe a considerable loss of accuracy on their utility metric.
3.5 Task oriented GAN development
3.5.1 Semi-supervised learning
ehrGAN was developed by Che et al. for sequences of medical codes. It learns a transitional distribution
by combining an Encoder-Decoder CNN (Rankin et al., 2020) with VCD (Che et al., 2017). The ehrGAN generator
is trained to decode a random vector mixed with the latent space representation of a real patient (See Panel
2). The trained ehrGAN model is then incorporated into the loss function of a predictor where it can help
generalization by producing neighbors for each input sample.
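The mixing of a random vector with the latent code of a real patient can be sketched as below. The Bernoulli mask per latent dimension and the parameter name keep_real_prob are assumptions for illustration; only the combination rule, mask times noise plus (one minus mask) times latent code, follows the description in Panel 2.

```python
import random

def mix_latent(z, h, keep_real_prob, rng):
    """Return m * z + (1 - m) * h for a random binary mask m.

    With probability keep_real_prob a dimension keeps the real patient's
    latent value (m = 0); otherwise it takes the random value (m = 1).
    """
    mixed = []
    for z_i, h_i in zip(z, h):
        m_i = 0 if rng.random() < keep_real_prob else 1
        mixed.append(m_i * z_i + (1 - m_i) * h_i)
    return mixed

rng = random.Random(0)
h = [0.3, -1.2, 0.8, 0.1]                 # hypothetical latent code of a real patient
z = [rng.gauss(0.0, 1.0) for _ in h]      # random latent vector
h_tilde = mix_latent(z, h, keep_real_prob=0.7, rng=rng)
print(h_tilde)
```

A higher keep_real_prob yields synthetic neighbors that stay closer to the real patient, which is how the amount of contribution of the real sample is controlled.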
SSL is commonly used to augment the minority class in imbalanced datasets, with techniques such
as ST and CT. Yang et al. improve on both by incorporating a GAN in the procedure (Yang et al., 2018).
The GAN is first trained on the labelled set and used to re-balance it. A prediction task with a classifier
ensemble is then executed and the data points with highest prediction confidence are labelled. The process
is iterated until labelling expansion ceases. As a final step, the GAN is trained on the expanded labelled
set to generate an equal amount of augmentation data. The authors obtained improved performance in a
number of classification tasks and multiple tabular datasets.
3.5.2 Domain translation
To address the heterogeneity of healthcare data originating from different sources, Yoon et al. combine
the concepts of cycle-consistent domain translation from Cycle-GAN (Zhu et al., 2017b) and multi-domain
translation from Star-GAN (Choi et al., 2017b) to build RadialGAN, which translates heterogeneous patient
information from different hospitals, correcting feature and distribution mismatches (Yoon et al.,
2018c). One encoder-decoder pair per data endpoint is trained to map records to and from a shared
latent representation for its respective endpoint.
3.5.3 Individualized treatment effects
The task of estimating ITEs is an ongoing problem. ITEs refer to the response of a patient to a certain
treatment given a set of characterizing features. The problem is that counterfactual outcomes are never
observed or treatment selection is highly biased (Yoon et al.,2018a;McDermott et al.,2018;Walsh et al.,
2020). In GANITE Yoon et al. propose a solution by using a pair of GANs: one for counterfactual imputation
and another for ITE estimation (Yoon et al.,2018a). The former captures the uncertainty in unobserved out-
comes by generating a variety of counterfactuals. The output is fed to the latter, which estimates treatment
effects and provides confidence intervals.
McDermott et al. developed CWR-GAN to leverage large amounts of unpaired pre/post-treatment
time-series in Intensive Care Unit (ICU) data for the estimation of ITEs on physiological time-series (Mc-
Dermott et al.,2018). CWR-GAN is a joint regression-adversarial SSL approach inspired by Cycle-GAN.
The algorithm has the ability to learn from unpaired samples, with very few paired samples, to reversibly
translate the pre/post-treatment physiological series.
Chu et al. address the problem of data scarcity by designing ADTEP. The algorithm can harness the
large volume of EHR data formed by triples of non-task specific patient features, treatment interventions
and treatment outcomes (Chu et al.,2019). ADTEP learns representation and discriminatory features of the
patient and treatment data by training an AE for each pair of features. In addition to AE reconstruction
loss, a second discriminator is tasked with identifying fake treatment feature reconstructions. Finally, a
fourth loss metric is calculated by feeding the concatenated latent representations of both AEs to a
Logistic-regression (LR) model aimed at predicting the treatment outcome (Chu et al., 2019).
Like Esteban et al., Wang et al. demonstrated an algorithm to generate time series of patient state
and medication dosage pairs using LSTM. In contrast to RGAN and RC-GAN, in SC-GAN the patient's state at
the current time-step informs the concurrent medication dosage, which in turn affects the patient state in
the upcoming time-step (Wang et al., 2019a). SC-GAN outperformed a number of baselines on both statistical
and utility metrics.
3.5.4 Data imputation and augmentation
GANs are naturally suited for data imputation, and can mitigate missingness. Statistical models developed
for the multiple imputation problem increase quadratically in complexity with the number of features,
while the expressiveness of deep neural networks can efficiently model all features with missing values
simultaneously.
In that regard, Yoon et al. adapted the standard GAN to perform imputation on real-valued features
MaR in tabular datasets (Yoon et al.,2018b). In GAIN, the discriminator must classify individual variables
as real or fake (imputed), as opposed to the whole ensemble. Additional input, or hint, containing the
probability of each component being real or imputed is fed to the discriminator to resolve the multiplicity
of optimal distributions that the generator could reproduce. The model performs considerably better than
five state-of-the-art benchmarks. GAIN was later adapted by Yang et al. to also handle categorical features
using fuzzy binary encoding, the same technique employed in HealthGAN. In parallel, Camino et al. apply
the same VS technique they used for medGAN to adapt GAIN and run a benchmark against different types
of VAE.
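The two ingredients of the GAIN setup described above, the component-wise mix of observed and imputed values and the hint passed to the discriminator, can be sketched as follows. The hint construction used here (reveal a mask entry with some probability, otherwise pass 0.5) is an assumption about one common formulation, and the values are made up; the generator and discriminator networks are omitted.

```python
import random

def combine(observed, mask, generated):
    """Keep observed entries (mask = 1) and fill the rest from the generator."""
    return [m * x + (1 - m) * g for x, m, g in zip(observed, mask, generated)]

def make_hint(mask, hint_rate, rng):
    """Reveal each mask entry with probability hint_rate, else give 0.5."""
    return [float(m) if rng.random() < hint_rate else 0.5 for m in mask]

rng = random.Random(0)
observed = [1.2, 0.0, 3.4]     # the second entry is missing (placeholder 0.0)
mask = [1, 0, 1]               # 1 = observed, 0 = missing
generated = [1.0, 2.5, 3.0]    # hypothetical generator output
imputed = combine(observed, mask, generated)
hint = make_hint(mask, hint_rate=0.9, rng=rng)
print(imputed, hint)
```

The discriminator would receive the imputed record together with the hint and predict, for each component, whether it was observed or imputed.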
The distribution estimated by a generator can compensate for lack of diversity in a real sample, es-
sentially filling in the blanks in a manner comparable to data imputation. In such cases, data sampled from
this distribution has the potential to help improve generalization in training predictive models. As an
example, we mentioned generating unobserved counterfactual outcomes (Yoon et al., 2018a), and generating
neighboring samples to help generalization in predictors (Che et al., 2017).
The adversarially trained Restricted Boltzmann Machine (RBM) developed by Fisher et al. enabled
them to simulate individualized patient trajectories based on their base state characteristics. Due to the
stochastic nature of the algorithm, generating a large number of trajectories for a single patient can provide
new insights on the influence of starting conditions on disease progression or quantify risk (Fisher et al.,
2019).
3.6 Model validation and data evaluation
To assess the solution to a generative modelling problem, it is necessary to validate the model and to
verify its output. GANs aim to approximate a data distribution P, using a parameterized model distribution
Q (Borji, 2019). Thus, in evaluating the model, the goal is to validate that the learning process has led to a
sufficiently close approximation. What this means in practice is hard to define. The concept of "realism"
finds more natural application to images or text, but becomes ambiguous when faced with the complexity
of health data.
Walsh et al. employ the term "statistical indistinguishability" and define it as the inability of a
classification algorithm to differentiate real from synthetic samples (Walsh et al., 2020). The term covers almost
all evaluation methods employed in the publications, which can be divided into two broad categories: those
aimed at evaluating the statistical properties of the data directly, and those aimed at doing so indirectly by
quantifying the work that can be done with the data. There are, nonetheless, a few attempts of a qualitative
nature, more in line with the concept of realism.
3.6.1 Qualitative evaluation
Visual inspection of projections of the SD is a common theme, serving mostly as a basic sanity check,
but occasionally presented as evidence. The formal qualitative evaluation approaches found in the liter-
ature are mainly Preference Judgement, Discrimination Tasks or Clinician Evaluation and are generally
carried out by medical professionals (Borji 2018).
-Preference judgment The task is choosing the most realistic of two data points in pairs of one real
and one synthetic (Choi et al.,2017a).
-Discrimination Tasks Data points are shown one by one and must be classified as real or synthetic
(Beaulieu-Jones et al.,2019).
-Clinician Evaluation Rather than classifying the data points, they must be rated for realism accord-
ing to a predefined numerical scale. (Beaulieu-Jones et al.,2019). Significance is determined with a
statistical test such as Mann-Whitney.
-Visualized embedding The real and synthetic data samples are plotted on a graph or projected into
an embedding such as t-SNE or PCA and compared visually (Cui et al., 2019; Yu et al., 2019; Zhu et al.,
2020a; Yale et al., 2019a; Yang et al., 2019c; Beaulieu-Jones et al., 2019; Tantipongpipat et al., 2019;
Dash et al., 2019).
-Feature analysis In certain fields, the data can be projected to representations that highlight patterns
or properties that can be easily visually assessed. While this does not provide conclusive evidence
of data realism, it can help get a better understanding of model behaviour during training. As an
example, EEG and ECG bio-signals show typical and easily distinguishable patterns (Harada et al., 2019).
In general, qualitative evaluation methods based on visual inspection are weak indicators of data
quality. At the dataset or sample level, quantitative metrics provide more convincing evidence of data
quality (Borji 2018).
3.6.2 Quantitative evaluation
Quantitative evaluation metrics can be categorized into three loosely defined groups: comparing the
distributions of real and synthetic data as a whole, assessing the marginal and conditional distributions of
features, and evaluating the quality of the data indirectly by quantifying the amount of work that can be
done with the data, referred to as utility.
-Dataset distributions A summary of metrics is presented in Tab. 6.
-Feature Distributions If the model has learned a realistic representation of the real data it should
produce SD that possesses the same quantity and type of information content. Authors attempt by
various metrics to determine if the statistical properties of the SD agree with those of the real data.
These metrics are presented in Table 7. Although statistical similarity provides strong support for the
behavior of the learning process, it is not necessarily informative about their validity. They are often
ambiguous and can be found to be misleading upon further investigation. Given the complexity of
health data, low level relations are unlikely to paint a full picture. Authors often state that no single
metric taken on its own was sufficient, and that a combination of them allowed deeper understanding
of the data.
-Data utility Utility-based metrics, presented in Table 8, often provide a more convincing indicator of
data realism. On the other hand, they mostly lack the interpretability of statistical metrics. We took
the liberty of placing these into one of two categories: tasks mostly defined only for evaluation (Ad
hoc utility metrics) or tasks based on real-world applications (Application utility metrics). Note that
this distinction is not based on a rigorous definition, but serves to facilitate comparison.
-Analytical The analytical methods were mainly employed for evaluation, but can also provide a better
understanding of the model and its behavior.
Feature Importance The important features (Random Forest (RF)) and model coefficients (LR,
Support Vector Machine (SVM)) of predictors (Esteban et al., 2017; Xu et al., 2019; Yoon et al.,
2020; Chin-Cheong et al., 2019; Beaulieu-Jones et al., 2019).
Ablation study The performance of the model is compared against an impaired version. This helps
determine if the novel component of the algorithm contributes significantly to performance
(Cui et al., 2019; Che et al., 2017; McDermott et al., 2018; Yoon et al., 2018c; Chin-Cheong et al.,
2020).
Table 6: Metrics employed to validate trained models based on the comparison of distributions.
Metric | Description
Kullback-Leibler divergence (KLD) | Non-symmetric measure of difference between two PDs, related to relative entropy. Given a feature X, with p(x) and q(x) the PDs of the real and synthetic data respectively, the KLD of q(x) from p(x) is the amount of information lost when q(x) is trained to estimate p(x) (Jiawei, 2018; Goncalves et al., 2020).
RDP | Alternative measure of divergence, which includes KLD as a special case. The RDP includes a parameter α that gives it an extra degree of freedom, becoming equivalent to the Shannon-Jensen divergence when α → 1. It showed a number of advantages when compared to the original GAN loss function, and removed the need for gradient penalty (Van Balveren et al., 2018; Tantipongpipat et al., 2019).
Jaccard similarity | Measure of similarity and diversity defined on sets as the size of the intersection over the size of the union (Ozyigit et al., 2020; Yang et al., 2019c; contributors).
2-sample test (2-ST) | Statistical test of the null hypothesis that the real and synthetic samples originate from the same distribution, using a test such as Kolmogorov-Smirnov (KS) or Maximum Mean Discrepancy (MMD) (Fisher et al., 2019; Baowaly et al., 2019; Baowaly et al., 2018; Esteban et al., 2017).
Distribution of Reconstruction Error | Compares the distributions of reconstruction error for the SD and the training set versus the SD and a held-out testing set. Calculated according to the Nearest-neighbor metric or other measures of distance. A significant difference would indicate over-fitting and can be evaluated with a statistical test, such as KS (Esteban et al., 2017).
Latent space projections | Real and synthetic samples are projected back into the latent space, or encoded with a β-VAE, comparing the dimensional mean of the variance or the distance between mode peaks (Zhang et al., 2020). See Section 5.5 for examples of how the latent space encoding can be interpreted.
Domain Specific Measures (DSMs) | Comparison of the PD with DSMs. For instance the Quantile-Quantile (Q-Q) plot for point-processes (Xiao et al., 2017). See Section 6.2 for a notion of how DSMs could apply to EHR data.
Classifier accuracy | Accuracy of a classifier trained to discriminate real from synthetic units. Predictor accuracy around 0.5 would indicate indistinguishability (Fisher et al., 2019; Walsh et al., 2020).
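Two of the distribution-level metrics in Table 6 are simple enough to sketch directly. The category frequencies and code sets below are made up for illustration, and the small constant guarding against zero probabilities in the synthetic distribution is an assumption of this sketch.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def jaccard(a, b):
    """Size of the intersection over the size of the union of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical category frequencies of one feature in real and synthetic data.
real_freq = [0.5, 0.3, 0.2]
synth_freq = [0.4, 0.4, 0.2]
print(kl_divergence(real_freq, synth_freq), kl_divergence(synth_freq, real_freq))

# Hypothetical sets of medical codes appearing in real and synthetic records.
real_codes = {"I10", "E11", "J45"}
synth_codes = {"I10", "E11", "N18"}
print(jaccard(real_codes, synth_codes))  # 0.5
```

The two KLD values differ, which illustrates the non-symmetry noted in the table: the direction of the comparison has to be reported.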
Table 7: Metrics based on evaluating the statistical properties of the synthetic data distribution.
Metric Description
Dimensions-wise distribution The real and synthetic data are compared feature-wise according to a variety of methods.
For example, the Bernoulli success probability for binary features, or the Student T-test
for continuous variables, and Pearson Chi-square test for binary variables is used to deter-
mine statistical significance (Beaulieu-Jones et al.,2019;Choi et al.,2017a;Chin-Cheong
et al.,2019;Yan et al.,2020;Baowaly et al.,2019;Baowaly et al.,2018;Ozyigit et al.,2020;
Tantipongpipat et al.,2019;Yoon et al.,2020;Tantipongpipat et al.,2019;Fisher et al.,
2019;Che et al.,2017;Wang et al.,2019a;Yale et al.,2019a;Chin-Cheong et al.,2020;
Ozyigit et al.,2020).
Inter-dimensional correlation Dimension-wise Pearson coefficient correlation matrices for both real and synthetic data
(Beaulieu-Jones et al.,2019;Goncalves et al.,2020;Torfi and Beyki,2019;Frid-Adar et al.,
2018;Ozyigit et al.,2020;Yang et al.,2019c;Yoon et al.,2020;Zhu et al.,2020a;Yoon et al.,
2020;Walsh et al.,2020;Yale et al.,2019a;Ozyigit et al.,2020;Dash et al.,2019;Bae et al.,
2020b).
Cross-type Conditional Distribution Correlations between categorical and continuous features, comparing the mean and stan-
dard deviation of each conditional distribution (Yan et al.,2020).
Time-lagged correlations Measures the correlation between features over time intervals. (Fisher et al.,2019;Walsh
et al.,2020).
Pairwise mutual information Checks for the presence of multivariate relationships pair-wise for each feature, as a
measure of mutual dependence (Rankin et al., 2020). Quantifies the amount of information
obtained about a feature from observing another.
First-order proximity metric Defined over graphs, captures the direct neighbor relationships of vertices. Zhang et al.
applied it to graphs built from the co-occurrence of medical codes and compared the results
between real and synthetic data (Zhang et al., 2020).
Log-cluster metric Clustering is applied to the real and synthetic data combined. The metric is calculated
from the number of real and synthetic samples that fall in the same clusters (Goncalves
et al.,2020).
Support coverage metric Measures how much of the variables' support in the real data is covered in the synthetic
data. Support is defined as the percentage of values found in the synthetic data, while
coverage is the reverse operation. The metric is calculated as the average of the ratios
over all features. Penalizes less frequent categories that are underrepresented (Goncalves
et al.,2020).
Proportion of valid samples Defined by Yang et al. as a requirement for records to contain both disease and medication
instances. (Yang et al.,2019c).
PCA Distributional Wasserstein distance The Wasserstein distance is calculated over k-dimensional PCA projections of the real and
synthetic data (Tantipongpipat et al., 2019).
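The first two rows of Table 7 can be illustrated with a short sketch that compares per-feature Bernoulli success probabilities and a Pearson correlation between two features. The tiny binary datasets are made up for illustration, and returning 0.0 for a constant column is a convention chosen here, not one taken from the reviewed publications.

```python
import math

def dimension_wise_probability(records):
    """Bernoulli success probability of each binary feature."""
    n = len(records)
    return [sum(column) / n for column in zip(*records)]

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

# Hypothetical binary records (rows) over three medical codes (columns).
real = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
synthetic = [[1, 0, 1], [1, 0, 1], [0, 0, 1], [1, 1, 1]]

p_real = dimension_wise_probability(real)
p_synth = dimension_wise_probability(synthetic)
print(p_real, p_synth)
print([abs(a - b) for a, b in zip(p_real, p_synth)])

# Correlation between the first two features in the real data.
col0, col1 = [r[0] for r in real], [r[1] for r in real]
print(pearson(col0, col1))
```

Here the first two features match exactly while the third differs, which is the kind of per-dimension discrepancy these metrics are meant to expose.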
Table 8: Metrics based on evaluating the utility of the synthetic data on practical tasks.
Metric Description
Data utility metrics
DWP Each variable is in turn chosen as the prediction target label and the remaining as features. Two predic-
tors are trained to predict the label, one from the synthetic data and another from a portion of the real
data. Their performance is compared on the left out real data (Choi et al.,2017a;Camino et al.,2018;
Goncalves et al.,2020;Yan et al.,2020;Tantipongpipat et al.,2019;Baowaly et al.,2019).
ARM ARM aims to discover relationships among a large set of variables, such as commonly occurring
variable-value pairs (Agrawal et al., 1993). The rules obtained from the real and synthetic data are compared
(Baowaly et al., 2019; Baowaly et al., 2018; Bae et al., 2020a; Yan et al., 2020).
Training utility Performance of predictors trained on the synthetic data,often in comparison with the real data or data
generated with DP (Bae et al.,2020a).
TRTS Accuracy on real data of some form of predictor trained on synthetic data (Beaulieu-Jones et al.,2019;
Rankin et al.,2020;Yoon et al.,2020).
TSTR Accuracy on synthetic data of some form of predictor trained on real data (Bae et al.,2020a;Yoon et al.,
2020;Jordon et al.,2019).
Discriminator A predictor is trained to discriminate synthetic from real samples. An accuracy value of 0.5 would
indicate that they are indistinguishable (Fisher et al., 2019; Walsh et al., 2020; Yale et al., 2019b).
Siamese discriminator A pair of identical FFN each receive either a real sample or a synthetic sample. Their output is passed
to a third network which outputs a measure of similarity (Torfi and Beyki,2019).
Applied utility metrics
Data augmentation A predictor is trained on a combination dataset of real and synthetic data or real data with missing
values imputed and performance is compared with the same predictor trained on real data alone (Yoon
et al.,2020;Yang et al.,2019b,c).
Model augmentation The trained generative model is incorporated into a predictor's loss function by generating an
ensemble of proximate data points for each instance, thereby improving generalization (Che et al., 2017).
Accuracy The prediction performance of the model is compared against benchmarks of the same type on real data
(Cui et al.,2019;Yoon et al.,2018a;Che et al.,2017;Yu et al.,2019;Zhu et al.,2020a;Baowaly et al.,
2019; Wang et al., 2019a; Walsh et al., 2020; Yoon et al., 2018b; McDermott et al., 2018; Yang et al., 2019c;
Yoon et al.,2018c;Xu et al.,2019;Beaulieu-Jones et al.,2019;Bae et al.,2020a). Models trained to make
forward predictions from past observations or from real data transformed with a known function can
simply be evaluated for accuracy. For example, the RMSE on time-series (Xiao et al.,2018b;McDermott
et al.,2018;Yoon et al.,2018b;Yang et al.,2019b;Zhu et al.,2020a).
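The train-on-one-source, test-on-another pattern behind several utility metrics in Table 8 can be sketched with a toy nearest-centroid classifier. The classifier, the two-feature datasets and all values are assumptions chosen to keep the example self-contained; the reviewed publications use a variety of predictors.

```python
def fit_centroids(rows, labels):
    """Nearest-centroid classifier: store the mean feature vector per class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, row):
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)

# Hypothetical data: two features, binary outcome.
synthetic_x = [[0.1, 0.2], [0.0, 0.3], [0.9, 1.0], [1.1, 0.8]]
synthetic_y = [0, 0, 1, 1]
real_train_x = [[0.0, 0.1], [0.2, 0.2], [1.0, 1.0], [0.8, 0.9]]
real_train_y = [0, 0, 1, 1]
real_test_x = [[0.2, 0.1], [1.0, 0.9], [0.05, 0.25], [0.95, 1.05]]
real_test_y = [0, 1, 0, 1]

# Predictor trained on synthetic data, evaluated on held-out real data.
acc_from_synthetic = accuracy(fit_centroids(synthetic_x, synthetic_y), real_test_x, real_test_y)
# Baseline: the same predictor trained on real data.
acc_from_real = accuracy(fit_centroids(real_train_x, real_train_y), real_test_x, real_test_y)
print(acc_from_synthetic, acc_from_real)
```

The gap between the two accuracies is the quantity of interest: the smaller it is, the more of the real data's predictive signal the synthetic data has retained.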
3.7 Alternative evaluation
In their publications, Yale et al. propose refreshing approaches to evaluating the utility of SD. For
example, they organized a hack-a-thon type challenge involving the data. During the event, students were
tasked with creating classifiers, while provided only with SD (Yale et al.,2020). They were then scored on
the accuracy of their model on real data.
In more rigorous initiatives, they attempted (successfully) to recreate the experiments published in
medical papers based on the MIMIC dataset using only data generated from their model HealthGAN. In a
subsequent version of their article, the authors evaluate the performance of their model against traditional
privacy preservation methods by using the trained discriminator component of HealthGAN to discriminate
real from synthetic samples.
3.8 Privacy
Some authors conducted a privacy risk assessment to evaluate the risk of reidentification. The empir-
ical analyses were based on the definitions of MI,AD (Choi et al.,2017a;Goncalves et al.,2020;Yan et al.,
2020;Chen et al.,2019b;Chin-Cheong et al.,2020) and the Reproduction rate (RR) (Zhang et al.,2020).
Cosine similarities between pairs of samples were also used (Torfi and Beyki, 2019). Most studies report low
success rates for these types of attacks, and little effect from the sample size, although Chen et al. note
that sample sizes under 10k lead to higher risk. Goncalves et al. evaluated MC-medGAN against multiple
non-adversarial generative models in a variety of privacy-compromising attacks, including AD, obtaining
inconsistent results for MC-medGAN (Goncalves et al., 2020). While this is not mentioned by the authors,
multiple results reported in the publication suggest that the GAN was not properly trained or
suffered mode collapse. In black-box and white-box type attacks, including the LOGAN (Hayes et al., 2017)
method, medGAN performed considerably better than WGAN-GP (Chen et al., 2019b), the algorithm which
served as the basis for improvements to medGAN in publications discussed in Section 3.4.1. Overall, the
authors note that releasing the full model poses a high risk of privacy breaches and that smaller training sets
(under 10k) also lead to a higher risk.
3.8.1 The status of fully synthetic data in regards to current privacy regulations
It seems intuitively possible that the artificial nature of SD essentially prevents associations with
real patients; however, the question is never directly addressed in the publications. An extensive legal
analysis of SD published in the Stanford Technology Law Review concluded that laws and regulations should
not treat SD indiscriminately from traditional privacy preservation methods (Bellovin et al., 2019). The
authors state that current privacy statutes either overweigh or downplay the potential for SD to leak
secrets by implicitly including it as the equivalent of anonymization.
3.8.2 Traditional privacy
Numerous attempts at applying traditional privacy guarantees, such as differentially private stochastic
gradient descent, can be found in other fields as well as in healthcare (Beaulieu-Jones et al., 2019;
Esteban et al., 2017; Chin-Cheong et al., 2020; Bae et al., 2020a). By limiting the gradient amplitude at each
step and adding random noise, AC-GAN could produce useful data with $\epsilon = 3.5$ and $\delta < 10^{-5}$
according to the definition of differential privacy.
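The two operations mentioned above, limiting the gradient amplitude and adding random noise, can be sketched as a single update step. This is a minimal illustration under assumptions: the parameter names are hypothetical, the gradients are made up, and the privacy accounting that turns a noise level and a number of steps into an (epsilon, delta) guarantee is not shown.

```python
import math
import random

def clip(gradient, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in gradient]

def private_gradient(per_example_gradients, max_norm, noise_multiplier, rng):
    """Clip each gradient, sum, add Gaussian noise, then average."""
    n = len(per_example_gradients)
    dim = len(per_example_gradients[0])
    total = [0.0] * dim
    for gradient in per_example_gradients:
        for i, g in enumerate(clip(gradient, max_norm)):
            total[i] += g
    sigma = noise_multiplier * max_norm  # noise scaled to the clipping bound
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]

rng = random.Random(0)
gradients = [[3.0, 4.0], [0.3, 0.4]]   # hypothetical per-example gradients
print(private_gradient(gradients, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single patient record can move the model, and the noise masks what remains of that contribution; a larger noise multiplier gives stronger privacy at the cost of utility.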
3.8.3 Moving forward safely
Some have put forward the notion that preventing over-fitting and preserving privacy may not be
conflicting goals (Wu et al.,2019;Mukherjee et al.,2019;Zhu et al.,2020b). Letting go of the negative con-
notation, we can explore the benefits such as improving generalization, stabilizing learning and building
fairer models (Zhu et al.,2020b) and the use of GANs to optimize the trade-off (Chen et al.,2019c).
-Bae et al. ensure privacy with a probabilistic scheme that ensures indistinguishability while also
maximizing utility. Specifically, they use a multiplicative perturbation by random orthogonal matrices with
input entries of k × m medical records and a second discriminator in the form of a pre-trained
predictor (Bae et al., 2020a).
- In privGAN (Mukherjee et al., 2019), an adversary is introduced, forcing the generator to produce
samples that minimize the risk of MI attacks, in addition to fooling the discriminator. The combination
of both goals has the explicit effect of preventing over-fitting, and their algorithm produces samples
of similar quality to non-private GAN.
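As a rough illustration of a multiplicative perturbation by a random orthogonal matrix, the sketch below builds such a matrix with Gram-Schmidt and applies it to two records. It is not the scheme of Bae et al.: the construction, the record values and the choice to rotate feature vectors are assumptions. It only demonstrates the general property that an orthogonal transformation hides the original values while preserving Euclidean distances between records.

```python
import math
import random

def random_orthogonal(n, rng):
    """Build an n x n orthogonal matrix by Gram-Schmidt on Gaussian vectors."""
    basis = []
    while len(basis) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for b in basis:
            dot = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - dot * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-8:  # retry in the unlikely case of a degenerate draw
            basis.append([vi / norm for vi in v])
    return basis

def perturb(record, matrix):
    """Multiply a record (as a column vector) by the matrix."""
    return [sum(m * x for m, x in zip(row, record)) for row in matrix]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(7)
matrix = random_orthogonal(4, rng)
record_a = [1.0, 0.0, 2.5, 3.0]   # hypothetical patient feature vectors
record_b = [0.5, 1.0, 2.0, 1.0]
print(distance(record_a, record_b))
print(distance(perturb(record_a, matrix), perturb(record_b, matrix)))
```

Because distances are preserved, models that rely on the geometry of the data can still be trained on the perturbed records, which is one way utility and privacy can be balanced.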