
Synthetic Observational Health Data with GANs: from slow adoption to a boom in medical research and ultimately digital twins?

Georges-Filteau, Jeremy

Radboud University, The Hyve

jeremy@thehyve.nl

Cirillo, Elisa

The Hyve

elisa@thehyve.nl

November 14, 2020

Abstract

After being collected for patient care, Observational Health Data (OHD) can further benefit patient well-being by sustaining the development of health informatics and medical research. Vast potential remains unexploited because of the fiercely private nature of patient-related data and regulations governing its distribution. Generative Adversarial Networks (GANs) have recently emerged as a groundbreaking approach to efficiently learn generative models that produce realistic Synthetic Data (SD). They have revolutionized practices in multiple domains such as self-driving cars, fraud detection, simulations in the marketing and industrial sectors known as digital twins, and medical imaging. The digital twin concept could readily apply to modelling and quantifying disease progression. In addition, GANs possess a multitude of capabilities relevant to common problems in healthcare: augmenting small datasets, correcting class imbalance, domain translation for rare diseases, and, not least, preserving privacy. Unlocking open access to privacy-preserving OHD could be transformative for scientific research. In the midst of COVID-19, the healthcare system is facing unprecedented challenges, many of which are data-related and could be alleviated by the capabilities of GANs. Considering these facts, publications concerning the development of GANs applied to OHD seemed severely lacking. To uncover the reasons for the slow adoption of GANs for OHD, we broadly reviewed the published literature on the subject. Our findings show that the properties of OHD and the evaluation of the SD were initially challenging for existing GAN algorithms (unlike medical imaging, for which state-of-the-art models were directly transferable) and that the choice of metrics was ambiguous. We found many publications on the subject, appearing slowly in 2017 and at an increasing rate since then. The difficulties of OHD remain, and we discuss issues relating to evaluation, consistency, benchmarking, data modeling, and reproducibility.

1 Introduction

1.1 Background

Medical professionals collect Observational Health Data (OHD) in Electronic Health Records (EHRs)

at various points of care in a patient’s trajectory, to support and enable their work (Cowie et al.,2016). The

patient proﬁles found in EHRs are diverse and longitudinal, composed of demographic variables, record-

ings of diagnoses, conditions, procedures, prescriptions, measurements and lab test results, administrative

information, and increasingly omics (Abedtash et al.,2020).


Having served its primary purpose, this wealth of detailed information can further beneﬁt patient

well-being by sustaining medical research and development. That is to say, improving the development

life-cycle of Health Informatics (HI), the predictive accuracy of Machine Learning (ML) algorithms, or en-

abling discoveries in research on clinical decisions, triage decisions, inter-institution collaboration, and

HI automation (Rudin et al.,2020;Rankin et al.,2020). Big health data is the underpinning of two prime

objectives of precision medicine: individualization of patient interventions and inferring the workings of

biological systems from high-level analysis (Capobianco,2020). However, the private nature of patient-

related data and the growing widespread concern over its disclosure dramatically hamper the potential

for secondary usage of OHD for legitimate purposes.

Anonymization techniques are used to hinder the misuse of sensitive data. This implies a costly and

data-speciﬁc cleansing process, and the unavoidable trade-off of enhancing privacy to the detriment of data

utility (Dankar and El Emam,2012;Cheu et al.,2019;De Cristofaro,2020). These techniques are fallible

and do not prevent reidentification. In fact, no polynomial-time Differential Privacy (DP) algorithm can produce Synthetic Data (SD) preserving all relations of the real data, even relations as simple as 2-way marginals (Ullman and Vadhan, 2011). To address these drawbacks, alternative modes for sharing sensitive data are an active research area, including privacy-preserving analytics and distributed learning.

Although promising, these approaches come with limitations, and we must still explore their feasibility

and scalability (Raisaro, 2018). Regardless, distributed models are vulnerable to a variety of attacks, for which no single protection measure is sufficient, as research on defenses lags far behind research on attacks (Enthoven and Al-Ars, 2020; Gao et al., 2020; Luo et al., 2020; Lyu et al., 2020).

These conditions restrict access to OHD to professionals with academic credentials and ﬁnancial re-

sources. Use of OHD by all other health data-related occupations is blocked, along with the downstream

beneﬁts. For example, software developers rarely have access to the data at the core of the HI solutions

they are developing, and educators lack examples (Laderas et al., 2018).

1.2 Synthetic data

An alternative to traditional privacy-preserving methods is to produce full SD. We categorize meth-

ods to produce SD as either theory-driven (theoretical, mechanistic or iconic) or data-driven (empirical

or interpolatory) modelling (Kim et al.,2017;Hand,2019). Theory-driven modelling involves a complex

knowledge-based attempt to deﬁne a simulation process or a statistical model representing the causal re-

lationships of a system (Youseﬁ et al.,2018;Kansal et al.,2018). The Synthea (Walonoski et al.,2017)

synthetic patient generator is one such model, in which state transition models [1] produce patient trajectories. It takes the model parameters from aggregate population-level statistics of disease progression and

medical knowledge. Such a knowledge-based model depends on prior knowledge of the system, and on how much of it we can understand (Kim et al., 2017; Bonnéry et al., 2019). On one hand, theory-based modelling

aims at understanding and offers interpretability; on the other hand, when modelling complex systems, simplifications and assumptions are inevitable, leading to inaccuracies or reduced utility (Hand, 2019; Rankin

et al.,2020). In fact, relying on population-level statistics does not produce models capable of reproducing

heterogeneous health outcomes (Chen et al.,2019a).

Data-driven modelling techniques infer a representation of the data from a sample distribution, to

summarize or describe it (Hand, 2019). There are many statistical modelling approaches to produce SD, but they rest on intrinsic assumptions about the data. These assumptions bound their representational power to correlations intelligible to the modeler and make the models prone to obscure inaccuracies. SD generated by these models hits

[1] Probabilistic model composed of pre-defined states, transitions, and conditional logic.


a ceiling of utility (Rankin et al.,2020). In the ML ﬁeld, generative models learn an approximation of the

multi-modal distribution, from which we can draw synthetic samples (Goodfellow et al.,2014). Generative

Adversarial Networks (GANs) (Goodfellow et al., 2014) have recently emerged as a groundbreaking approach to learning generative models that produce realistic SD using Neural Networks (NNs). GAN algorithms have

rapidly found a wide range of applications, such as data augmentation in medical imaging (Yi et al.,2019a;

Wang et al.,2020a;Zhou et al.,2020).
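As a concrete illustration of the adversarial game described above, the following minimal sketch (not taken from any of the reviewed papers; all hyperparameters and names are invented for illustration) pits an affine generator against a logistic-regression discriminator on one-dimensional toy data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" data: a one-dimensional Gaussian the generator must imitate.
real_mean, real_std = 4.0, 1.5
def sample_real(n):
    return rng.normal(real_mean, real_std, size=(n, 1))

g = {"a": np.array(1.0), "b": np.array(0.0)}   # generator: z -> a*z + b
d = {"w": np.array(0.1), "c": np.array(0.0)}   # discriminator: sigmoid(w*x + c)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr, batch = 0.05, 64
for step in range(2000):
    z = rng.normal(size=(batch, 1))
    fake = g["a"] * z + g["b"]
    real = sample_real(batch)

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    for x, label in ((real, 1.0), (fake, 0.0)):
        p = sigmoid(d["w"] * x + d["c"])
        grad = p - label                 # gradient of BCE wrt the logit
        d["w"] -= lr * np.mean(grad * x)
        d["c"] -= lr * np.mean(grad)

    # Generator step: non-saturating loss, pretend the fakes are real.
    p = sigmoid(d["w"] * fake + d["c"])
    grad_fake = (p - 1.0) * d["w"]       # chain rule through D
    g["a"] -= lr * np.mean(grad_fake * z)
    g["b"] -= lr * np.mean(grad_fake)

# Draw synthetic samples from the trained generator.
synthetic = g["a"] * rng.normal(size=(1000, 1)) + g["b"]
```

In practice both players are deep NNs trained by automatic differentiation; the hand-derived gradients here only serve to make the alternating update scheme explicit.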

The potential effects of GANs on healthcare and science are considerable (Rankin et al., 2020), some

of which have been realized in ﬁelds such as medical imaging. However, the application of GAN to OHD

seems to have been lagging (Xiao et al.,2018a). Well-known characteristics of OHD could explain the rel-

atively slow progress. Primarily, algorithms developed for images and text in other ﬁelds were easily re-

purposed for medical equivalents of the data types. However, OHD presents a unique complexity in terms

of multi-modality, heterogeneity, and fragmentation (Xiao et al., 2018a). In addition, evaluating the realism of synthetic OHD is intuitively complex, a problem that still burdens GANs. In 2017, a few authors published the first attempts at GANs for OHD (Esteban et al., 2017; Che et al., 2017; Choi et al., 2017a;

Yahi et al.,2017). We aimed to investigate if these examples inspired more research, and if so, to gain a

comprehensive understanding of approaches to the problem and the techniques involved.

2 Methods

Table 1: Search query terms

Health data (terms joined by OR): clinical; health; EHR; electronic health record; patient
AND
Generative adversarial models (terms joined by OR): generative adversarial; GAN; adversarial training; synthetic

Publications concerning GANs for Observational Health Data (OHD-GAN) were identified with Google Scholar (Google), Web of Science (Clarivate) and Prophy (Prophy). The terms and operators found in Table 1 form the search query. We included studies reporting the development, application, performance evaluation and privacy evaluation of GAN algorithms to produce OHD. We define OHD as categorical, real-valued, ordinal or binary event data recorded for patient care. We list a more detailed summary of the included and excluded data types in Table 3. The excluded data types are already the subject of one or more reviews, or would merit a review of their own (Yi et al., 2019b; Nakata, 2019; Anwar et al., 2018; Wang

et al.,2020a;Zhou et al.,2020). In each of the included publications, we considered the aspects listed in

Table 2.

Table 2: Aspects analysed in each of the publications included in the review

A) Types of healthcare data
B) GAN algorithm, learning procedures, losses
C) Intended use of the SD
D) Evaluation metrics
E) Privacy considerations
F) Interpretability of the model


Table 3: Types of OHD data included or excluded from the review.

Included:
  Observations: demographic information, medical classification, family history
  Time-stamped observations: diagnosis, treatment and procedure codes, prescription and dosage, laboratory test results, physiologic measurements and intake events
  Encounters: visit dates, care provider, care site
  Derived: aggregated counts, calculated indicators

Excluded:
  Omics: genome, transcriptome, proteome, immunome, metabolome, microbiome
  Imaging: X-rays, computed tomography (CT), magnetic resonance imaging (MRI)
  Signal: electrocardiogram (ECG), electroencephalogram (EEG)
  Unstructured: narrative reports and other textual data

3 Results

3.1 Summary

We found 43 publications describing the development or adaptation of OHD-GAN, presented in Table 4. We can generalize the data addressed in each of these publications into one of two categories: time-dependent observations, such as time-series, or static representations in the form of feature vectors, such as tabular rows. We briefly bring attention to the lack of multi-relational tabular representations, the primary form of EHR, and further discuss the subject in later sections.

Most efforts propose adaptations of current algorithms to the characteristics and complexities of

OHD. These include multi-modality of marginal distributions or non-Gaussian real-valued features, het-

erogeneity, a combination of discrete and real-valued features, longitudinal irregularity, complex condi-

tional distributions, missingness or sparsity, class imbalance of categorical features and noise.

While these properties may make training a useful model difﬁcult, the variety of applications that are

highly relevant and needed in the healthcare domain provides sufﬁcient incentive. The most cited motives

are, as one would expect, to cope with the often limited number of samples in medical datasets and to

overcome the highly restricted access to OHD. The potential of releasing privacy-preserving SD freely is

a common subject. Publications considering privacy evaluate the effect on utility of applying DP to their

algorithm, propose alternative privacy concepts and metrics, or concentrate only on the subject of privacy.

3.2 Motives for developing OHD-GAN

Some claim that the ability to generate synthetic data is becoming an essential skill in data science (Sarkar,

2018), but what purpose can it serve in the medical domain? The authors mention a wide range of potential

applications. We brieﬂy describe the four prevailing themes in the following sections: data augmentation

(Sec.3.2.1), privacy and accessibility (Sec.3.2.2), precision medicine (Sec.3.2.3) and modelling simulations

(Sec.3.2.4).

3.2.1 Data augmentation

Data augmentation is mentioned in nearly all publications. Although counter-intuitive, GANs can generate SD that conveys more information about the real data distribution. Effectively, the continuous distribution learned by the generator produces a more comprehensive set of valid data points that are not present in the discrete set of real samples. A combination of real and synthetic training data frequently leads to increased predictor performance (Wang et al., 2019a; Che et al., 2017; Yoon et al., 2018a,b; Yang et al., 2019a; Chen et al., 2019a; Cui et al., 2019). A more intelligible way to grasp the concept is from the point of view of image classification, where invariances are captured through perturbations such as rotation, shift, shear and scale (Antoniou et al., 2017).
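To make the augmentation idea concrete, the sketch below (hypothetical data and names throughout; a Gaussian fitted to the minority class stands in for a trained generator) rebalances a small minority class with synthetic samples before training a simple classifier:

```python
import numpy as np

rng = np.random.default_rng(1)

# Small labelled "real" dataset: 100 majority vs only 10 minority samples.
X_major = rng.normal(0.0, 1.0, size=(100, 5))
X_minor = rng.normal(2.0, 1.0, size=(10, 5))

# Stand-in for a trained generator: a Gaussian fitted to the minority
# class (a real GAN would learn this distribution adversarially).
mu, sigma = X_minor.mean(axis=0), X_minor.std(axis=0) + 1e-6
def generate(n):
    return rng.normal(mu, sigma, size=(n, 5))

# Augmentation: synthetic minority samples restore class balance.
X_synth = generate(90)
X_train = np.vstack([X_major, X_minor, X_synth])
y_train = np.array([0] * 100 + [1] * (10 + 90))

# Nearest-centroid classifier trained on the augmented set.
centroids = {c: X_train[y_train == c].mean(axis=0) for c in (0, 1)}
def predict(x):
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# Held-out test set drawn from the same two distributions.
X_test = np.vstack([rng.normal(0.0, 1.0, size=(50, 5)),
                    rng.normal(2.0, 1.0, size=(50, 5))])
y_test = np.array([0] * 50 + [1] * 50)
accuracy = np.mean([predict(x) == y for x, y in zip(X_test, y_test)])
```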

Similarly, domain translation and Semi-supervised Learning (SSL) training approaches with GANs could support predictive tasks that lack accurately labelled data, lack paired samples or suffer class imbalance (Che et al., 2017; McDermott et al., 2018; Yoon et al., 2018a). Another example is correcting discrepancies between datasets collected in different locations or under different conditions, which induce bias (Yoon et al., 2018c). GANs are also well adapted for data imputation, where entries are Missing at Random (MaR) (Yoon et al., 2018b).
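The masking machinery behind GAIN-style imputation can be sketched as follows; the mask and hint construction follow the published GAIN recipe, while a simple column-mean fill stands in for the trained generator (all data here is toy and illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data matrix with entries Missing at Random, encoded as NaN.
X = rng.normal(size=(8, 4))
X[rng.random(X.shape) < 0.3] = np.nan

# Mask matrix: 1 where a value is observed, 0 where it is missing.
M = (~np.isnan(X)).astype(float)

# Generator input: observed values kept, missing slots filled with
# small random noise the generator is expected to replace.
Z = rng.uniform(0.0, 0.01, size=X.shape)
X_in = np.where(M == 1, X, Z)

# Hint matrix: reveals part of the true mask to the discriminator so it
# must judge which individual entries are imputed, not whole rows.
hint_rate = 0.9
B = (rng.random(X.shape) < hint_rate).astype(float)
H = B * M + 0.5 * (1 - B)

# A trained generator G would map (X_in, M) to a completed matrix; a
# column-mean fill stands in for G to show the final composition step.
col_means = np.nanmean(X, axis=0)
col_means = np.where(np.isnan(col_means), 0.0, col_means)  # all-NaN guard
X_gen = np.where(M == 1, X_in, col_means)
X_imputed = M * X_in + (1 - M) * X_gen   # observed kept, gaps filled
```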

3.2.2 Enhancing privacy and increasing data accessibility

Most authors see SD as the key to unlocking the unexploited value of OHD for machine learning and scientific progress (Beaulieu-Jones et al., 2019; Baowaly et al., 2019; Baowaly et al., 2018; Che et al., 2017; Esteban et al., 2017; Fisher et al., 2019; Severo et al., 2019) or education (Laderas et al., 2018). We can broadly describe preserving privacy as reducing the risk of a re-identification attack to an acceptable level. DP quantifies this level of risk when releasing anonymized data.

Due to its artificial nature, SD is put forward as a way to forgo the tight restrictions on data sharing, while

potentially providing greater privacy guarantees (Beaulieu-Jones et al.,2019;Baowaly et al.,2019;Baowaly

et al.,2018;Esteban et al.,2017;Fisher et al.,2019;Walsh et al.,2020;Chin-Cheong et al.,2019). Enabling

access to greater variety, quality and quantity of OHD could have positive effects in a wide range of ﬁelds,

such as software development, education, and training of medical professionals. The fact remains that

GANs do not eliminate the risk of reidentiﬁcation. Considering none of the synthetic data points represent

actual people, the signiﬁcance of such an occurrence is unclear. It is possible to combine both methods,

and training GANs under DP shows evidence of reducing the loss of utility compared to applying DP alone.
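The DP training mentioned here typically follows the DP-SGD recipe (per-example gradient clipping plus calibrated Gaussian noise, as in Abadi et al.'s formulation); a minimal sketch, with illustrative parameter values, is:

```python
import numpy as np

rng = np.random.default_rng(3)

clip_norm = 1.0    # C: per-example L2 clipping bound (illustrative)
noise_mult = 1.1   # sigma: noise scale relative to C (illustrative)

def dp_average_gradient(per_example_grads):
    """Clip each example's gradient, sum, add Gaussian noise, average."""
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

# Per-example gradients for a batch of 32 examples and 10 parameters.
grads = rng.normal(0.0, 5.0, size=(32, 10))
g_dp = dp_average_gradient(list(grads))
```

Clipping bounds the influence of any single record on an update, and the noise makes that bound into a formal privacy guarantee; in a DP-GAN only the discriminator, which touches real data, needs this treatment.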

3.2.3 Enabling precision medicine

The application to precision medicine involves predicting outcomes conditioned on a patient’s cur-

rent state and history. Simulated trajectories could help inform clinical decision making by quantifying

disease progression and outcomes and have a transformative effect on healthcare (Walsh et al.,2020;Fisher

et al.,2019). Ensembles of stochastic simulations of individual patient proﬁles such as those produced by

Conditional Restricted Boltzmann Machines (CRBM) could help quantify risk at an unprecedented level of

granularity (Fisher et al.,2019).

Predicting patient-specific responses to drugs is still a new field of research, a problem known as Individualized Treatment Effects (ITE). Estimating ITEs is persistently hampered by the lack of paired counterfactual samples (Yoon et al., 2018a; Chu et al., 2019). To solve similar problems in medical imaging, various GAN algorithms were developed for domain translation, mapping a sample from its original class to the

paired equivalent. This includes bidirectional transformations, allowing GAN to learn mappings from very

few, or a lack of paired samples (Wolterink et al.,2017;Zhu et al.,2017a;McDermott et al.,2018).

3.2.4 From patient and disease models to digital twins

A well-trained model approximates the process that generated the real data points. The relations

learned by the model, its parameters, contain meaningful information if we can learn to harness them. Data-

driven algorithms evolve as our understanding of their behavior improves. We incorporate new concepts in

the algorithms, leading to further understanding and iteratively blurring the line with theory-driven approaches

(Hand,2019). Interpretability is a growing ﬁeld of research concerned with understanding how the learned

parameters of a model relate. In other words, it analyses the representation the algorithm has converged to

and deriving meaning from obscure logic. Incorporating new understanding in the architecture of algorithms shifts the view from a data-driven to a theory-driven perspective (Hand, 2019). As we purposefully

build structure in our algorithms from new understanding, we may get the chance to explore meaningful

representations that would otherwise be beyond our reasoning.

Approaching these ideas from above, the concept of ”digital twins” represents in a way the ultimate

realization of Personalized Medicine. A common practice in industrial sectors is high-fidelity virtual representations of physical assets. Long-term simulations provide an overview and comprehensive understanding of the workings, behavior and life-cycle of their real counterparts. The state of the models is

continuously updated from theoretical data, real data, and streaming Internet of Things (IoT) indicators.

Intentionally conditioned input data allows the exploration of specific events or conditions. In a position

paper on the subject, Angulo et al. draw the parallels of this technique with the current needs in healthcare

and the emergence of the technologies for actionable models of patients (Angulo Bahon et al., 2019; Angulo et al., 2020). The authors bring up the rapid adoption of wearables that are continuously monitoring

people’s physiological state.

Wearables are one of many mobile digitally connected devices that collect patient data over a broad

range of physiological characteristics and behavioral patterns (Coravos et al., 2019). This emerging trend, known as digital biomarkers, has already led to studies demonstrating predictive models with the potential

for improved patient care (Snyder et al.,2018). Through continuous lifelong learning, integrating multiple

modes of personal data, generative patient models could inform diagnostics of medical professionals and

also enable testing treatment options. In their proposal, GANs are an essential component of the ecosystem

to ensure patient privacy and to provide bootstrap data. Fisher et al. already use the term ”digital twin”

to describe their process, noting that digital twins present no privacy risk and enable simulating patient cohorts of

any size and characteristics (Walsh et al.,2020).

Table 4: Summary of the publications included in the review.

Publication Algorithm(s) Focus, algorithms, and techniques Data type

2017

Choi et al. medGAN (medGAN) Incompatibility of back-propagation with discrete features.

Autoencoder (AE),Mini-batch Averaging (MB-Avg),

batch-normalization (BN),shortcut connections (SC),

Attribute Disclosure (AD),Presence Disclosure (PD).

Binary occurrences or counts

of medical codes.

Yahi et al. medGAN adaptation Drug Laboratory Effects (DLE) on continuous time-series,

multi-modality. t-Distributed Stochastic Neighbor

Embedding (t-SNE).

Paired pre/post treatment

exposure time-series

Esteban et al. Recurrent GAN (RGAN),Recurrent

Convolutional GAN (RC-GAN)

Adversarial training of (conditional) Recurrent NNs (RNNs)

on time-series, evaluation, privacy. Long Short-term

Memory (LSTM), Conditional GAN (CGAN), Differentially private stochastic gradient descent (DP-SGD).

Regularly observed

real-valued time-series

(RV-TS)

Xiao et al. WGAN for Temporal Point-processes

(PPWGAN)

Temporal Point Processes. LSTM, Wasserstein GAN (WGAN),

Poisson process.

Sporadic occurrences,

hospital visits.

Che et al. Electronic Health Record GAN (ehrGAN),

Semi-supervised Learning with a learned

ehrGAN (SSL-GAN)

Semi-supervised augmentation, transitional distribution.

1D-CNN, Word2vec, Variational contrastive divergence

(VCD).

Discrete time-series (D-TS),

sequences of medical codes.

Dash et al. HealthGAN (HealthGAN) Sleep patterns, stratiﬁcation by covariates. Binary over multiple visits.

2018


Camino et al. Multi-categorical ARAE (MC-ARAE),

Multi-categorical medGAN (MC-medGAN),

Multi-categorical Gumbel-softmax GAN

(MC-GumbelGAN),Multi-categorical WGAN

with Gradient Penalty (MC-WGAN-GP)

Improving training process. medGAN,WGAN with Gradient

Penalty (WGAN-GP),Gumbel-Softmax GAN (Gumbel-GAN),

Adversarially regularized autoencoder (ARAE).

Multiple categorical variables.

McDermott

et al.

Cycle Wasserstein Regression GAN

(CWR-GAN)

Cycle-consistent semi-supervised regression learning,

unpaired data, class imbalance. WGAN, Cycle-consistent GAN (Cycle-GAN), ITE.

ICU RV-TS, lack of paired

samples, SD.

Yoon et al. Generative Adversarial Nets for inference of

Individualized Treatment Effects (GANITE)

ITE, unobserved counterfactual, multi-label classiﬁcation,

uncertainty. CGAN pair.

Feature, treatment and

outcome vectors.

Yoon et al. RadialGAN (RadialGAN) Multi-domain translation, features and distribution

mismatch, cycle-consistency, augmentation. CGAN,WGAN.

Tabular, discrete and

continuous.

Yoon et al. Generative Adversarial Imputation Network

(GAIN)

Tabular data imputation. Missing Completely at Random

(MCaR),CGAN.

Real-valued, tabular with

entries MCaR.

2019

Wang et al. Sequentially Coupled GAN (SC-GAN) Capturing mutual inﬂuence in time-series. Coupled

generator pair. Treatment recommendation task. LSTM,

CGAN.

RV-TS of patient state and

medication dosage data.

Baowaly et al. Boundary-seeking medGAN (MedBGAN) Improving training process. medGAN,Boundary-seeking

GAN.

Binary occurrences or counts

of medical codes.

Baowaly et al. MedBGAN, Wasserstein medGAN

(MedWGAN)

Improving training process. medGAN, BGAN, WGAN. Binary occurrences or counts

of medical codes.

Severo et al. Conditional WGAN-GP (cWGAN-GP) Generation and public release of dataset. Protecting

commercial sensitive information. Class imbalance.

cWGAN-GP,CGAN.

Physiological RV-TS.

Chin-Cheong

et al.

WGAN Heterogeneous mixture of dense and sparse features.

Privacy and evaluating the introduction of bias. WGAN,

WGAN-GP, Mode-specific normalization (MSN), DP-aware optimizer from the TensorFlow community.

Binary, real-valued and

categorical.

Jordon et al. Private Aggregation of Teacher Ensembles

(PATE) framework applied to GANs

(PATE-GAN)

Alternative differential privacy,adaptation of the Private

Aggregation of Teacher Ensembles (PATE) framework.

Demographic and binary.

Torﬁ and Beyki corGAN (corGAN) Convolutional NN (CNN) architecture, capturing feature

correlations, evaluating realism,privacy evaluation using

Membership Inference (MI). 1D-Convolutional AE (CAE).

Binary occurrences or counts

of medical codes.

Chu et al. Adversarial Deep Treatment Effect

Prediction (ADTEP)

ITE, two independent AE for patient and treatment feature

sets, trained adversarially in combination,and outcome

predictor from latent representation.

EHR data, not speciﬁed.

Jackson and

Lussetti

medGAN Evaluating medgan with the addition of demographics

features.

Demographic features and

binary occurrences or counts

of medical codes.

Yu et al. SSL-GAN Rare disease detection, Semi-supervised learning (SSL),

leveraging unlabeled EHR data, medical code embedding

network. LSTM.

Diagnosis and prescription

codes.

Yang et al. CGAN Class imbalance, low count of minority class.

Semi-supervised learning combining Self-training (ST) and

CT with a CGAN for a IoT application.

Twenty medical datasets from

the UCI repository, types

unspeciﬁed.


Yang et al. CorrNN and T-wGAN (GcGAN) Capturing the correlations between different categories of

medical codes and the outcome. Correlation NN,Turing

GAN, Wasserstein T-GAN (T-wGAN).

Binary occurrences or counts

of medical codes.

Yang et al. Categorical GAIN (CGAIN) Improve on GAIN for categorical variable using fuzzy

encoding of the features.

Categorical (multi-class and

multi-label) real-valued.

Camino et al. GAIN,GAIN+Variable Splitting (VS),

Variational AE (VAE),VAE+Iterative

Imputation (IT),VAE+Backpropagation IT

(BP),VAE+VS,VAE+VS+IT,VAE+VS+BP

Benchmark and improve on generative imputation with

GAIN and VAE.

Categorical and real-valued.

Mostly not OHD

Beaulieu-

Jones et al.

Auxiliary Classifier GAN (AC-GAN) Evaluating whether differentially private GANs allow valid reanalysis while ensuring privacy. DP, CGAN.

Physiological RV-TS.

Xu et al. Conditional Tabular GAN (CTGAN) Non-Gaussian multi-modal distribution of continuous

columns and imbalanced discrete column in tabular data.

Evaluation benchmark. CGAN, Training by sampling (TbS), MSN, WGAN-GP, Gumbel-GAN.

Tabular real-valued and

categorical.

Yale et al. HealthGAN Privacy metrics and over-ﬁtting. MI,Nearest-neighbor

Adversarial Accuracy (NN-AA),Privacy loss (PL),

Discriminator testing (DT)

Categorical demographics,

real-valued and binary

medical codes.

Fisher et al. Adversarially trained CRBM Simulation of patient trajectories from their baseline state,

disease prediction and risk quantiﬁcation,

missingness. CRBM.

Binary, ordinal, categorical,

and continuous, 3 months

intervals.

2020

Walsh et al. Adversarially trained CRBM Digital twins, disease prediction and risk quantification, missingness. CRBM.

Binary, ordinal, categorical,

and continuous, 3 months

intervals.

Yale et al. HealthGAN Metrics to capture a synthetic dataset’s resemblance,

privacy, utility and footprint. Evaluating applications.

Application case studies, Reproducibility of studies with SD.

NN-AA, PL, Data obfuscation (DO), medGAN, WGAN-GP, Synthetic Data Vault.

Real-valued and categorical.

Demographics, vital signs,

diagnoses, and procedures.

Tantipongpipat

et al.

DP-auto-GAN (DP-auto-GAN) Privacy, medGAN adaptation, evaluation metrics. DP-SGD, AE, medGAN, Rényi Differential Privacy (RDP).

Medical data: binary.

Non-health data: categorical

and real-valued.

Bae et al. GANs for anonymizing private medical data

(AnomiGAN)

Probabilistic scheme that ensures indistinguishability of the SD, which can be viewed as encrypted. DP, CNN.

Binary occurrences of medical

codes.

Cui et al. Complementary pattern augmentation

(CONAN)

Complementary GAN in a rare disease predictor model that

generates positive samples from negatives to alleviate class

imbalance.

Embedding vectors

representing multiple patient

visits and conditions.

Zhu et al. Blood Glucose GAN (GluGAN) Adversarially trained RNN to predict the upcoming

time-step in physiological time-series conditioned on the

past observations. RNN,CNN,Gated Recurrent Unit (GRU).

RV-TS of blood glucose

measurements, discrete

patient submitted features.

Chen et al. medGAN,WGAN-GP, DC-GAN Privacy analysis of generative models. MI,Full Black-box

Attack,Partial Black-box Attack,White-box Attack,DP-SGD.

Binary vector of medical

codes.

Chin-Cheong

et al.

WGAN with DP (WGAN-DP) Heterogeneous data, effect of differential privacy on utility.

WGAN, DP.

Categorical, continuous,

ordinal, and binary. Dense or

sparse.


Camino et al. - Initially a comparison of GANs and VAEs; the authors chose instead to bring attention to the problem of benchmarking. Analysis of problems, requirements and suggestions.

GAIN,Six component GAN (HexaGAN) (Hwang et al.,2019),

Missing data IWAE (MIWAE) (Mattei and Frellsen,2019),

Heterogeneous-Incomplete VAE (HI-VAE)(Nazabal et al.,

2020), Multiple Imputation Denoising Autoencoders (MIDA)

(Gondara and Wang,2017)

Real-valued and categorical.

Zhang et al. EMR Wasserstein GAN (EMR-WGAN) Improving training, evaluation metrics, sparsity.

WGAN,BN,Layer normalisation (LN),CGAN.

Binary occurrences of medical

codes. Low-prevalence of

codes.

Yan et al. Heterogeneous GAN (HGAN) Improvements on EMR-WGAN incorporating record-level

constraints in the loss function. WGAN,BN,LN,CGAN,MI,

PD.

Binary, categorical and

real-valued.

Ozyigit et al. Realistic Synthetic Dataset Generation

Method (RSDGM)

Exploring the feasibility of various methods to generate

synthetic datasets. WGAN

Real-valued and categorical.

Yoon et al. Anonymization through data synthesis using

GAN (ADS-GAN)

Identiﬁability view of privacy. Generator conditioned on

real samples inputs with an identiﬁability loss to satisfy the

identifiability constraint. WGAN, WGAN-GP, DP alternative.

Real-valued and binary.

Goncalves

et al.

MC-medGAN Comparison of GANs with statistical models to generate

synthetic data, evaluation metrics. MI,AD.

Categorical and real-valued.

3.3 Data Types and Feature Engineering

No publications made use of OHD in its initial form, patient records in EHR composed of many related

tables (normalized form). The complexity of a model would explode when maintaining referential integrity

and statistics between multiple tables. The hierarchy by which these would interact with each other condi-

tionally is no less complicated (Buda et al.,2015;Patki et al.,2016;Zhang and Tay,2015;Tay et al.,2013).

No published GAN algorithms consume normalized databases in their original form. In all

publications we considered, feature engineering was used to adapt the data to task requirements, or to

promising algorithms that ﬁt the data characteristics. They transform the data into one of four modalities:

time series, point-processes, ordered sequences or aggregates, described in Table 5.


Table 5: Types of observational health data and feature engineering

Time-series (continuous, regular or sporadic)
- Values and structure: Time-stamped observations; continuous, ordinal, categorical and/or multi-categorical; recorded continuously by medical devices, following a schedule set by a medical professional, or when necessary.
- Challenges: Observations are often MaR across time and dimensions, erroneous, or completely absent for certain patients. Time-series of different concepts are often highly correlated and their influence on one another must be accounted for.
- Feature engineering: Imputation coupled with training; data imputation; binning into fixed-size intervals; a combination of binning and imputation.

Point-processes
- Values and structure: Series of timestamped observations of one variable or medical concept per patient.
- Feature engineering: Series of events reduced to the time interval between each consecutive occurrence.

Ordered sequences
- Values and structure: Ordered vectors representing one or more patient visits; medical codes associated with the diagnoses, procedures, measurements and interventions.
- Challenges: Variable length; high dimensionality; long-tail distribution of codes.
- Feature engineering: Sequences are projected into a trained embedding that preserves semantic meaning, according to methods borrowed from NLP.

Tabular (denormalized, relational)
- Values and structure: Medical and demographic variables aggregated in tabular format; continuous, ordinal, categorical and/or multi-categorical features.
- Feature engineering: Medical history is aggregated into a fixed-size vector of binary or aggregated counts of occurrences and combined with demographic features.
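The binning strategy listed for time-series can be sketched as follows: sporadic, time-stamped observations are aggregated into fixed-size intervals, and empty bins are filled by a simple imputation rule. This is a minimal NumPy sketch; the interval size, the mean aggregation and the last-observation-carried-forward rule are illustrative assumptions, not prescriptions from any reviewed paper.

```python
import numpy as np

def bin_and_impute(timestamps, values, bin_size, n_bins):
    """Aggregate sporadic observations into fixed-size time bins.

    Each bin holds the mean of the observations falling inside it;
    empty bins are imputed by carrying the last observed bin forward.
    """
    binned = np.full(n_bins, np.nan)
    idx = (np.asarray(timestamps) // bin_size).astype(int)
    for b in range(n_bins):
        in_bin = np.asarray(values, float)[idx == b]
        if in_bin.size:
            binned[b] = in_bin.mean()
    # Last-observation-carried-forward imputation for empty bins.
    for b in range(1, n_bins):
        if np.isnan(binned[b]):
            binned[b] = binned[b - 1]
    return binned

# Heart-rate-like observations recorded at irregular times (in hours).
hr = bin_and_impute([0.5, 1.2, 1.8, 5.0], [80, 84, 88, 90], bin_size=1.0, n_bins=6)
```

A combination of binning and imputation, as in the table, amounts to choosing both the aggregation per bin and the fill rule for gaps.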

3.4 Data oriented GAN development

3.4.1 Auto-encoders and categorical features

In what is, to the best of our knowledge, the first attempt at developing a GAN for OHD, Choi et al. focus on the problem posed by the incompatibility of categorical and ordinal features with back-propagation. Their solution is to pretrain an AE to project the samples to and from a continuous latent space representation. They keep the decoder portion along with its trained weights to form a component of medGAN (Choi et al., 2017a). It is incorporated into the generator and maps the randomly sampled input vectors from the real-valued latent space representation back to discrete features. This first exemplar of synthetic OHD generated by GAN inspires a series of enhancements.
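The medGAN arrangement can be sketched as follows: a decoder, pretrained as part of an autoencoder, is appended to the generator so that the adversarial game happens entirely in a continuous latent space while the output still looks like discrete medical codes. The toy weights, dimensions and activation choices below are illustrative assumptions, not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, feat_dim = 4, 8

# Stand-ins for trained weights: in medGAN the decoder comes from a
# pretrained autoencoder, and only the generator is trained adversarially.
W_gen = rng.normal(size=(latent_dim, latent_dim))
W_dec = rng.normal(size=(latent_dim, feat_dim))

def generator(z):
    # Maps noise to a point in the continuous latent space.
    return np.tanh(z @ W_gen)

def decoder(h):
    # Maps a latent point to probabilities of binary medical codes.
    return 1.0 / (1.0 + np.exp(-(h @ W_dec)))

z = rng.normal(size=(5, latent_dim))
probs = decoder(generator(z))          # continuous end-to-end, so gradients flow
synthetic = (probs > 0.5).astype(int)  # discretized only after generation
```

Because the discretization happens only at sampling time, back-propagation never has to cross a non-differentiable step.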

Early efforts aimed to improve the performance of medGAN. Among the first, Camino et al. developed MC-medGAN, changing the AE component by splitting its output into a Gumbel-Softmax (Jang et al., 2016) activation layer for each categorical variable and concatenating the results (Camino et al., 2018). The authors also developed an adaptation based on recent training techniques: WGAN (Arjovsky et al., 2017) and a WGAN with Gradient Penalty (Gulrajani et al., 2017) (briefed in Panel 1). MC-WGAN-GP is the equivalent of MC-medGAN but with Softmax layers. The authors report that the choice of a model will depend on data characteristics, particularly sparsity.
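The Gumbel-Softmax layer used in MC-medGAN can be illustrated in isolation: it draws an approximately one-hot sample from a categorical distribution in a way that remains differentiable with respect to the logits. A minimal NumPy sketch; the temperature value and the example probabilities are illustrative assumptions.

```python
import numpy as np

def gumbel_softmax(logits, temperature, rng):
    """Differentiable approximation of sampling one category.

    Adds Gumbel(0, 1) noise to the logits and applies a tempered softmax;
    low temperatures yield near one-hot vectors.
    """
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / temperature
    y = np.exp(y - y.max(axis=-1, keepdims=True))
    return y / y.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(42)
logits = np.log(np.array([0.7, 0.2, 0.1]))  # a 3-category variable
sample = gumbel_softmax(logits, temperature=0.1, rng=rng)
```

By the Gumbel-max property, the argmax of the perturbed logits follows the original categorical distribution, so repeated draws reproduce the category frequencies while each draw stays differentiable.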

Owing to the propensity of OHD to induce mode collapse, subsequent authors widely adopted the Wasserstein distance. Baowaly et al. developed MedWGAN, also based on WGAN, and MedBGAN, borrowing from Boundary-seeking GAN (BGAN) (Hjelm et al., 2017), which pushes the generator to produce samples that lie on the decision boundary of the discriminator, expanding the search space. Both led to improved data quality, in particular MedBGAN (Baowaly et al., 2019; Baowaly et al., 2018). Jackson and Lussetti tested medGAN on an extended dataset containing demographic and health system usage information, obtaining results similar to the original (Jackson and Lussetti, 2019). HealthGAN, based on WGAN-GP, includes a data transformation method adapted from the Synthetic Data Vault (Patki et al., 2016) to map categorical features to and from the unit numerical range (Yale et al., 2020).

Panel 1. Wasserstein's distance

In brief, the Wasserstein distance is a measure between two Probability Distributions (PDs) that has the property of always providing a smooth gradient. Used as the loss function of the discriminator, this property improves training stability and mitigates mode collapse. To make the equation tractable, a 1-Lipschitz constraint must be introduced, creating another problem. In the words of the authors:

"Weight clipping is a clearly terrible way to enforce a Lipschitz constraint. If the clipping parameter is large, then it can take a long time for any weights to reach their limit, [...] If the clipping is small, this can easily lead to vanishing gradients [...] However, we do leave the topic of enforcing Lipschitz constraints in a neural network setting for further investigation, and we actively encourage interested researchers to improve on this method." (Arjovsky et al., 2017)

At times the clipping prevented the network from modelling the optimal function, but gradient penalty, a less restrictive regularization, has since replaced it (Petzka et al., 2018).
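For one-dimensional samples, the Wasserstein-1 distance discussed in Panel 1 has a closed form: sort both samples and average the absolute differences of the matched order statistics. A small NumPy sketch, with equal sample sizes assumed for simplicity:

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples.

    In 1-D the optimal transport plan simply matches sorted values, so the
    distance is the mean absolute difference of the order statistics.
    """
    a, b = np.sort(np.asarray(a, float)), np.sort(np.asarray(b, float))
    assert a.shape == b.shape, "equal sample sizes assumed in this sketch"
    return float(np.mean(np.abs(a - b)))

# Identical samples have distance 0; a constant shift moves all mass by the shift.
x = np.array([0.0, 1.0, 2.0, 3.0])
d_zero = wasserstein_1d(x, x)
d_shift = wasserstein_1d(x, x + 0.5)
```

Unlike divergences that saturate when supports do not overlap, this distance varies smoothly as one distribution is shifted relative to the other, which is the property that stabilizes WGAN training.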

3.4.2 Forgoing the autoencoder and introducing conditional training

Claiming that the use of an AE introduces noise, Zhang et al. dispose of the AE component of previous algorithms in EMR-WGAN and introduce a conditional training method, along with conditioned BN and LN techniques to stabilise training (Zhang et al., 2020). The algorithm was further adapted by

Yan et al. as HGAN to better account for the conditional distributions between multiple data types and

enforce record-wise consistency. A recognized problem with medGAN was that it produced common-sense

inconsistencies, such as gender mismatches in medical codes (Yan et al.,2020;Choi et al.,2017a). HGAN

enforces constraints by adding speciﬁc penalties to the loss function, such as limit ranges for numerical

categorical pairs and mutual exclusivity for pairs of binary features (Yan et al.,2020). The algorithm also

performs well on regular time-series of sleep patterns (Dash et al., 2019).

To develop CTGAN, Xu et al. presume that tabular data poses a challenge to GANs owing to the non-Gaussian multi-modal distribution of real-valued columns and imbalanced discrete columns (Xu et al., 2019). The fully connected layers have adaptations to deal with both real-valued and categorical features. For real-valued features, CTGAN uses mode-specific normalization to capture the multiplicity of modes. For discrete features, the authors introduce conditional training-by-sampling to re-sample discrete attributes evenly during training, while recovering the real distribution when generating data.
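Mode-specific normalization can be illustrated with pre-fitted mode parameters: each value is represented by the index of its closest mode plus the value standardized by that mode's mean and standard deviation. The hard mode assignment and the hand-picked parameters below are simplifying assumptions (CTGAN fits a variational Gaussian mixture and samples the mode probabilistically).

```python
import numpy as np

# Pretend a mixture fit found two modes in a lab-value column.
mode_means = np.array([10.0, 100.0])
mode_stds = np.array([2.0, 15.0])

def mode_specific_normalize(x):
    """Encode a value as (mode index, value standardized by that mode)."""
    mode = int(np.argmin(np.abs(x - mode_means)))  # hard assignment (simplified)
    alpha = (x - mode_means[mode]) / mode_stds[mode]
    return mode, alpha

def mode_specific_denormalize(mode, alpha):
    """Invert the encoding to recover the original value."""
    return alpha * mode_stds[mode] + mode_means[mode]

m, a = mode_specific_normalize(94.0)
```

The generator then only has to produce a small standardized scalar per mode, rather than fit one global Gaussian to a multi-modal column.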

In other efforts, Torﬁ and Beyki develop corGAN, with a 1-dimensional Convolutional AE (1D-CAE)

to capture neighboring feature correlations of the input vectors (Torﬁ and Beyki,2019). Chin-Cheong et al.

use a Feed-forward Network (FFN) based on Wasserstein’s distance to evaluate the capacity of GANs to

model heterogeneous data of dense and sparse medical features (Chin-Cheong et al.,2020). Ozyigit et al.

use the same approach, focusing on reproducing statistical properties (Ozyigit et al.,2020).

3.4.3 Time-series

Esteban et al. devise the LSTM-based RGAN and RC-GAN to generate regular time-series of physiological measurements from bedside monitors (Esteban et al., 2017). Curiously, the authors explicitly dismiss the Wasserstein distance, and generate each dimension of their time-series independently, where one would assume they are correlated. They observe a considerable loss of accuracy on their utility metric.

Panel 2. Transitional distribution

The ehrGAN generator is trained to decode a random vector z mixed with the latent space representation of a real patient h to produce a synthetic sample x̃ (Che et al., 2017). A standard autoencoder is first trained to encode a real patient x to and from a latent representation h, minimizing the reconstruction error with the reconstruction x̄. The decoder portion is then trained to produce realistic synthetic samples x̃ from a combination of the random latent vector z and the latent space encoding of a real patient x. The generator thus learns a transition distribution p(x̃|x) with x ∼ p_data(x). The contribution of the real sample is controlled by a random mask m according to h̃ = m ∗ z + (1 − m) ∗ h. This method, inspired by Variational Contrastive Divergence, prevents mode collapse by design and learns an information-rich transition distribution p(x̃|x) around real samples x.

3.5 Task oriented GAN development

3.5.1 Semi-supervised learning

ehrGAN was developed by Che et al. for sequences of medical codes. It learns a transitional distribution, combining an Encoder-Decoder CNN (Rankin et al., 2020) with VCD (Che et al., 2017). The ehrGAN generator is trained to decode a random vector mixed with the latent space representation of a real patient (see Panel 2). The trained ehrGAN model is then incorporated into the loss function of a predictor, where it can help generalization by producing neighbors for each input sample.
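The masked mixing at the heart of ehrGAN's transition distribution (Panel 2) is a single line: a random binary mask decides, per latent dimension, whether the noise vector or the real patient's encoding contributes. A minimal NumPy sketch; the dimensionality and the mask probability are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
dim = 16

h = rng.normal(size=dim)            # latent encoding of a real patient
z = rng.normal(size=dim)            # random latent vector
m = rng.binomial(1, 0.5, size=dim)  # random binary mask

# Mixed latent code fed to the decoder: h_tilde = m*z + (1 - m)*h,
# so the synthetic sample stays in the neighborhood of the real one.
h_tilde = m * z + (1 - m) * h
```

Because part of every synthetic code is anchored to a real patient's encoding, the generator cannot collapse onto a single mode.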

SSL is commonly used to augment the minority class in imbalanced datasets, with techniques such as ST and CT. Yang et al. improve on both by incorporating a GAN in the procedure (Yang et al., 2018).

The GAN is ﬁrst trained on the labelled set and used to re-balance it. A prediction task with a classiﬁer

ensemble is then executed and the data points with highest prediction conﬁdence are labelled. The process

is iterated until labelling expansion ceases. As a ﬁnal step, the GAN is trained on the expanded labelled

set to generate an equal amount of augmentation data. The authors obtained improved performance in a

number of classiﬁcation tasks and multiple tabular datasets.
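The iterative labelling loop described above can be sketched independently of the GAN component: a classifier trained on the labelled set assigns labels to unlabelled points whose prediction confidence exceeds a threshold, and the process repeats until no new points qualify. The toy nearest-centroid classifier, the softmax-over-distances confidence score, and the threshold below are illustrative assumptions, not the authors' ensemble.

```python
import numpy as np

def self_train(X_lab, y_lab, X_unl, threshold=0.8, max_iter=10):
    """Confidence-thresholded self-training with a toy nearest-centroid classifier."""
    X_lab, y_lab, X_unl = map(np.array, (X_lab, y_lab, X_unl))
    for _ in range(max_iter):
        if len(X_unl) == 0:
            break
        centroids = np.array([X_lab[y_lab == c].mean(axis=0) for c in (0, 1)])
        d = np.linalg.norm(X_unl[:, None, :] - centroids[None, :, :], axis=2)
        # Softmax over negative distances as a crude confidence score.
        p = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)
        conf, pred = p.max(axis=1), p.argmax(axis=1)
        keep = conf >= threshold
        if not keep.any():
            break  # labelling expansion has ceased
        X_lab = np.vstack([X_lab, X_unl[keep]])
        y_lab = np.concatenate([y_lab, pred[keep]])
        X_unl = X_unl[~keep]
    return X_lab, y_lab

X_lab = [[0.0, 0.0], [5.0, 5.0]]
y_lab = [0, 1]
X_unl = [[0.2, 0.1], [4.9, 5.2], [2.5, 2.5]]  # last point is ambiguous
X2, y2 = self_train(X_lab, y_lab, X_unl)
```

In the procedure of Yang et al., a GAN re-balances the labelled set before each round and generates the final augmentation data; here only the expansion loop itself is shown.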

3.5.2 Domain translation

To address the heterogeneity of healthcare data originating from different sources, Yoon et al. combine the concepts of cycle-consistent domain translation from Cycle-GAN (Zhu et al., 2017b) and multi-domain translation from Star-GAN (Choi et al., 2017b) to build RadialGAN, which translates heterogeneous patient information from different hospitals, correcting feature and distribution mismatches (Yoon et al., 2018c). RadialGAN uses one encoder-decoder pair per data endpoint, trained to map records to and from a shared latent representation.

3.5.3 Individualized treatment effects

The task of estimating ITEs is an ongoing problem. ITEs refer to the response of a patient to a certain

treatment given a set of characterizing features. The problem is that counterfactual outcomes are never

observed or treatment selection is highly biased (Yoon et al.,2018a;McDermott et al.,2018;Walsh et al.,


2020). In GANITE, Yoon et al. propose a solution by using a pair of GANs: one for counterfactual imputation

and another for ITE estimation (Yoon et al.,2018a). The former captures the uncertainty in unobserved out-

comes by generating a variety of counterfactuals. The output is fed to the latter, which estimates treatment

effects and provides conﬁdence intervals.

McDermott et al. developed CWR-GAN to leverage large amounts of unpaired pre/post-treatment

time-series in Intensive Care Unit (ICU) data for the estimation of ITEs on physiological time-series (Mc-

Dermott et al.,2018). CWR-GAN is a joint regression-adversarial SSL approach inspired by Cycle-GAN.

The algorithm has the ability to learn from unpaired samples, with very few paired samples, to reversibly

translate the pre/post-treatment physiological series.

Chu et al. address the problem of data scarcity by designing ADTEP. The algorithm can harness the

large volume of EHR data formed by triples of non-task speciﬁc patient features, treatment interventions

and treatment outcomes (Chu et al.,2019). ADTEP learns representation and discriminatory features of the

patient, and treatment data by training an AE for each pair of features. In addition to AE reconstruction

loss, a second discriminator is tasked with identifying fake treatment feature reconstructions. Finally, a

fourth loss metric is calculated by feeding the concatenated latent representations of both AEs to a Logistic Regression (LR) model aimed at predicting the treatment outcome (Chu et al., 2019).

Like Esteban et al., Wang et al. demonstrated an algorithm to generate time series of patient state and medication dosage pairs using LSTMs. In contrast to RGAN and RC-GAN, in SC-GAN the patient's state at the current time-step informs the concurrent medication dosage, which in turn affects the patient state in

the upcoming time-step (Wang et al.,2019a). SC-GAN overcame a number of baselines on both statistical

and utility metrics.

3.5.4 Data imputation and augmentation

GANs are naturally suited for data imputation, and can mitigate missingness. Statistical models developed for the multiple imputation problem increase quadratically in complexity with the number of features,

while the expressiveness of deep neural networks can efﬁciently model all features with missing values si-

multaneously.

In that regard, Yoon et al. adapted the standard GAN to perform imputation on real-valued features

MaR in tabular datasets (Yoon et al.,2018b). In GAIN, the discriminator must classify individual variables

as real or fake (imputed), as opposed to the whole ensemble. Additional input, or hint, containing the

probability of each component being real or imputed is fed to the discriminator to resolve the multiplicity

of optimal distributions that the generator could reproduce. The model performs considerably better than

ﬁve state-of-the-art benchmarks. GAIN was later adapted by Yang et al. to also handle categorical features

using fuzzy binary encoding, the same technique employed in HealthGAN. In parallel, Camino et al. apply

the same VS technique they used for medGAN to adapt GAIN and run a benchmark against different types

of VAE.
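The hint mechanism of GAIN can be sketched on its own: given the mask of observed entries, the hint reveals the true observed/imputed status of most components to the discriminator, leaving a random subset undisclosed (encoded as 0.5). The hint rate below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_hint(mask, hint_rate, rng):
    """GAIN-style hint matrix: reveal each entry's observed/imputed status
    with probability hint_rate, and encode the undisclosed rest as 0.5."""
    reveal = rng.binomial(1, hint_rate, size=mask.shape)
    return mask * reveal + 0.5 * (1 - reveal)

mask = np.array([[1, 0, 1, 1],
                 [0, 1, 1, 0]])  # 1 = observed, 0 = missing (to be imputed)
hint = make_hint(mask, hint_rate=0.9, rng=rng)
```

Revealing most of the mask leaves the discriminator only a small set of entries to judge, which is what pins the generator to a single optimal imputation distribution.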

The distribution estimated by a generator can compensate for lack of diversity in a real sample, es-

sentially ﬁlling in the blanks in a manner comparable to data imputation. In such cases, data sampled from

this distribution has the potential to help improve generalization in training predictive models. As an example, we mentioned generating unobserved counterfactual outcomes (Yoon et al., 2018a), and generating

neighboring samples to help generalization in predictors (Che et al.,2017).

The adversarially trained Restricted Boltzmann Machine (RBM) developed by Fisher et al. enabled

them to simulate individualized patient trajectories based on their base state characteristics. Due to the

stochastic nature of the algorithm, generating a large number of trajectories for a single patient can provide


new insights on the inﬂuence of starting conditions on disease progression or quantify risk (Fisher et al.,

2019).

3.6 Model validation and data evaluation

To assess the solution to a generative modelling problem, it is necessary to validate the model and to verify its output. GANs aim to approximate a data distribution P using a parameterized model distribution Q (Borji, 2019). Thus, in evaluating the model, the goal is to validate that the learning process has led to a sufficiently close approximation. What this means in practice is hard to define. The concept of "realism" finds a more natural application to images or text, but becomes ambiguous when faced with the complexity of health data.

Walsh et al. employ the term "statistical indistinguishability" and define it as the inability of a classification algorithm to differentiate real from synthetic samples (Walsh et al., 2020). The term covers almost all evaluation methods employed in the publications, which can be divided into two broad categories: those aimed at evaluating the statistical properties of the data directly, and those aimed at doing so indirectly by quantifying the work that can be done with the data. There are, nonetheless, a few attempts of a qualitative nature, more in line with the concept of realism.

3.6.1 Qualitative evaluation

Visual inspection of projections of the SD is a common theme, serving mostly as a basic sanity check,

but occasionally presented as evidence. The formal qualitative evaluation approaches found in the liter-

ature are mainly Preference Judgement, Discrimination Tasks or Clinician Evaluation, and are generally carried out by medical professionals (Borji, 2018).

- Preference judgment: The task is choosing the most realistic of two data points in pairs of one real and one synthetic (Choi et al., 2017a).
- Discrimination tasks: Data points are shown one by one and must be classified as real or synthetic (Beaulieu-Jones et al., 2019).
- Clinician evaluation: Rather than classifying the data points, they must be rated for realism according to a predefined numerical scale (Beaulieu-Jones et al., 2019). Significance is determined with a statistical test such as Mann-Whitney.
- Visualized embedding: The real and synthetic data samples are plotted on a graph or projected into an embedding such as t-SNE or PCA and compared visually (Cui et al., 2019; Yu et al., 2019; Zhu et al., 2020a; Yale et al., 2019a; Yang et al., 2019c; Beaulieu-Jones et al., 2019; Tantipongpipat et al., 2019; Dash et al., 2019).
- Feature analysis: In certain fields, the data can be projected to representations that highlight patterns or properties that can be easily visually assessed. While this does not provide conclusive evidence of data realism, it can help get a better understanding of model behaviour during training. Examples include typical and easily distinguishable patterns in EEG and ECG bio-signals (Harada et al., 2019).

In general, qualitative evaluation methods based on visual inspection are weak indicators of data quality. At the dataset or sample level, quantitative metrics provide more convincing evidence of data quality (Borji, 2018).

3.6.2 Quantitative evaluation

Quantitative evaluation metrics can be categorized into three loosely deﬁned groups: comparing the

distributions of real and synthetic data as a whole, assessing the marginal and conditional distributions of


features, and evaluating the quality of the data indirectly by quantifying the amount of work that can be

done with the data, referred to as utility.

- Dataset distributions: A summary of metrics is presented in Table 6.
- Feature distributions: If the model has learned a realistic representation of the real data, it should produce SD that possesses the same quantity and type of information content. Authors attempt by various metrics to determine if the statistical properties of the SD agree with those of the real data. These metrics are presented in Table 7. Although statistical similarity provides strong support for the behavior of the learning process, it is not necessarily informative about validity. These metrics are often ambiguous and can be found to be misleading upon further investigation. Given the complexity of health data, low-level relations are unlikely to paint a full picture. Authors often state that no single metric taken on its own was sufficient, and that a combination of them allowed a deeper understanding of the data.
- Data utility: Utility-based metrics, presented in Table 8, often provide a more convincing indicator of data realism. On the other hand, they mostly lack the interpretability of statistical metrics. We took the liberty of placing these into one of two categories: tasks mostly defined only for evaluation (ad hoc utility metrics) or tasks based on real-world applications (application utility metrics). Note that this distinction is not based on a rigorous definition, but serves to facilitate comparison.
- Analytical: The analytical methods were mainly employed for evaluation, but can also provide a better understanding of the model and its behavior.
  - Feature importance: The important features (Random Forest (RF)) and model coefficients (LR, Support Vector Machine (SVM)) of predictors (Esteban et al., 2017; Xu et al., 2019; Yoon et al., 2020; Chin-Cheong et al., 2019; Beaulieu-Jones et al., 2019).
  - Ablation study: The performance of the model is compared against impaired versions. This helps determine if the novel component of the algorithm contributes significantly to performance (Cui et al., 2019; Che et al., 2017; McDermott et al., 2018; Yoon et al., 2018c; Chin-Cheong et al., 2020).


Table 6: Metrics employed to validate trained models based on the comparison of distributions.

- Kullback-Leibler divergence (KLD): Non-symmetric measure of difference between two PDs, related to relative entropy. Given a feature X, with p(x) and q(x) the PDs of the real and synthetic data respectively, the KLD of q(x) from p(x) is the amount of information lost when q(x) is trained to estimate p(x) (Jiawei, 2018; Goncalves et al., 2020).
- RDP: Alternative measure of divergence, which includes KLD as a special case. The RDP includes a parameter α that gives it an extra degree of freedom, becoming equivalent to the Shannon-Jensen divergence when α → 1. It showed a number of advantages when compared to the original GAN loss function, and removed the need for gradient penalty (Van Balveren et al., 2018; Tantipongpipat et al., 2019).
- Jaccard similarity: Measure of similarity and diversity defined on sets as the size of the intersection over the size of the union (Ozyigit et al., 2020; Yang et al., 2019c).
- 2-sample test (2-ST): Statistical test of the null hypothesis that the real and synthetic samples originate from the same distribution, through the use of a statistical test such as Kolmogorov-Smirnov (KS) or Maximum Mean Discrepancy (MMD) (Fisher et al., 2019; Baowaly et al., 2019; Baowaly et al., 2018; Esteban et al., 2017).
- Distribution of reconstruction error: Compares the distributions of reconstruction error for the SD and the training set versus the SD and a held-out testing set, calculated according to the nearest-neighbor metric or other measures of distance. A significant difference would indicate over-fitting and can be evaluated with a statistical test, such as KS (Esteban et al., 2017).
- Latent space projections: Real and synthetic samples are projected back into the latent space, or encoded with a β-VAE, comparing the dimensional mean of the variance or the distance between mode peaks (Zhang et al., 2020). See Section 5.5 for examples of how the latent space encoding can be interpreted.
- Domain Specific Measures (DSMs): Comparison of the PD with DSMs, for instance the Quantile-Quantile (Q-Q) plot for point-processes (Xiao et al., 2017). See Section 6.2 for a notion of how DSMs could apply to EHR data.
- Classifier accuracy: Accuracy of a classifier trained to discriminate real from synthetic units. Predictor accuracy around 0.5 would indicate indistinguishability (Fisher et al., 2019; Walsh et al., 2020).
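The first metric in Table 6 can be computed directly for a discrete feature: with p(x) the empirical category distribution in the real data and q(x) in the synthetic data, the KLD is the expectation under p of log(p/q). A small NumPy sketch; the tiny constant guarding against empty synthetic categories and the example probabilities are illustrative choices.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as probability vectors.

    eps guards against categories absent from the synthetic data, where
    the divergence would otherwise be infinite.
    """
    p, q = np.asarray(p, float), np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)))

# Empirical category frequencies of one feature in real vs synthetic data.
p_real = [0.5, 0.3, 0.2]
q_synth = [0.4, 0.4, 0.2]
kld = kl_divergence(p_real, q_synth)
identical = kl_divergence(p_real, p_real)
```

Note the asymmetry: KL(p‖q) and KL(q‖p) generally differ, which is why the direction (real against synthetic) must be reported.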


Table 7: Metrics based on evaluating the statistical properties of the synthetic data distribution.

- Dimension-wise distribution: The real and synthetic data are compared feature-wise according to a variety of methods, for example the Bernoulli success probability for binary features. The Student t-test for continuous variables and the Pearson chi-square test for binary variables are used to determine statistical significance (Beaulieu-Jones et al., 2019; Choi et al., 2017a; Chin-Cheong et al., 2019; Yan et al., 2020; Baowaly et al., 2019; Baowaly et al., 2018; Ozyigit et al., 2020; Tantipongpipat et al., 2019; Yoon et al., 2020; Fisher et al., 2019; Che et al., 2017; Wang et al., 2019a; Yale et al., 2019a; Chin-Cheong et al., 2020).
- Inter-dimensional correlation: Dimension-wise Pearson correlation coefficient matrices for both real and synthetic data (Beaulieu-Jones et al., 2019; Goncalves et al., 2020; Torfi and Beyki, 2019; Frid-Adar et al., 2018; Ozyigit et al., 2020; Yang et al., 2019c; Yoon et al., 2020; Zhu et al., 2020a; Walsh et al., 2020; Yale et al., 2019a; Dash et al., 2019; Bae et al., 2020b).
- Cross-type conditional distribution: Correlations between categorical and continuous features, comparing the mean and standard deviation of each conditional distribution (Yan et al., 2020).
- Time-lagged correlations: Measures the correlation between features over time intervals (Fisher et al., 2019; Walsh et al., 2020).
- Pairwise mutual information: Checks for the presence of multivariate relationships pair-wise for each feature, as a measure of mutual dependence (Rankin et al., 2020). Quantifies the amount of information obtained about a feature from observing another.
- First-order proximity metric: Defined over graphs, captures the direct neighbor relationships of vertices. Zhang et al. applied it to graphs built from the co-occurrence of medical codes and compared the results between real and synthetic data (Zhang et al., 2020).
- Log-cluster metric: Clustering is applied to the real and synthetic data combined. The metric is calculated from the number of real and synthetic samples that fall in the same clusters (Goncalves et al., 2020).
- Support coverage metric: Measures how much of the variables' support in the real data is covered in the synthetic data. Support is defined as the percentage of values found in the synthetic data, while coverage is the reverse operation. The metric is calculated as the average of the ratios over all features, penalizing less frequent categories that are underrepresented (Goncalves et al., 2020).
- Proportion of valid samples: Defined by Yang et al. as a requirement for records to contain both disease and medication instances (Yang et al., 2019c).
- PCA distributional Wasserstein distance: The Wasserstein distance is calculated over k-dimensional PCA projections of the real and synthetic data (Tantipongpipat et al., 2019).
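The support coverage idea from Table 7 can be sketched for categorical features: for each feature, compare the set of category values present in the synthetic data against those in the real data, and average the per-feature ratios. The exact formulation varies between publications; this is a simplified set-based reading.

```python
def support_coverage(real_cols, synth_cols):
    """Average, over features, of the fraction of real category values
    that also appear in the synthetic data (set-based simplification)."""
    ratios = []
    for real, synth in zip(real_cols, synth_cols):
        real_vals, synth_vals = set(real), set(synth)
        ratios.append(len(real_vals & synth_vals) / len(real_vals))
    return sum(ratios) / len(ratios)

# Two features: one with a rare category "B" the generator never produced.
real_cols = [["A", "B", "C", "C"], [0, 1, 1, 0]]
synth_cols = [["A", "C", "C", "A"], [0, 1, 0, 1]]
cov = support_coverage(real_cols, synth_cols)
```

A value below 1.0 flags exactly the failure mode the metric targets: infrequent categories dropped by the generator.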


Table 8: Metrics based on evaluating the utility of the synthetic data on practical tasks.

Data utility metrics

- DWP: Each variable is in turn chosen as the prediction target label and the remaining as features. Two predictors are trained to predict the label, one from the synthetic data and another from a portion of the real data. Their performance is compared on the left-out real data (Choi et al., 2017a; Camino et al., 2018; Goncalves et al., 2020; Yan et al., 2020; Tantipongpipat et al., 2019; Baowaly et al., 2019).
- ARM: ARM aims at the discovery of relationships among a large set of variables, commonly occurring variable-value pairs (Agrawal et al., 1993). The rules obtained from the real and synthetic data are compared (Baowaly et al., 2019; Baowaly et al., 2018; Bae et al., 2020a; Yan et al., 2020).
- Training utility: Performance of predictors trained on the synthetic data, often in comparison with the real data or data generated with DP (Bae et al., 2020a).
- TRTS: Accuracy on synthetic data of some form of predictor trained on real data (Beaulieu-Jones et al., 2019; Rankin et al., 2020; Yoon et al., 2020).
- TSTR: Accuracy on real data of some form of predictor trained on synthetic data (Bae et al., 2020a; Yoon et al., 2020; Jordon et al., 2019).
- Discriminator: A predictor is trained to discriminate synthetic from real samples. An accuracy value of 0.5 would indicate that they are indistinguishable (Fisher et al., 2019; Walsh et al., 2020; Yale et al., 2019b).
- Siamese discriminator: A pair of identical FFNs each receive either a real sample or a synthetic sample. Their output is passed to a third network which outputs a measure of similarity (Torfi and Beyki, 2019).

Applied utility metrics

- Data augmentation: A predictor is trained on a combined dataset of real and synthetic data, or real data with missing values imputed, and performance is compared with the same predictor trained on real data alone (Yoon et al., 2020; Yang et al., 2019b,c).
- Model augmentation: The trained generative model is incorporated into a predictor's activation function by generating an ensemble of proximate data points for each instance, thereby improving generalization (Che et al., 2017).
- Accuracy: The prediction performance of the model is compared against benchmarks of the same type on real data (Cui et al., 2019; Yoon et al., 2018a; Che et al., 2017; Yu et al., 2019; Zhu et al., 2020a; Baowaly et al., 2019; Wang et al., 2019a; Walsh et al., 2020; Yoon et al., 2018b; McDermott et al., 2018; Yang et al., 2019c; Yoon et al., 2018c; Xu et al., 2019; Beaulieu-Jones et al., 2019; Bae et al., 2020a). Models trained to make forward predictions from past observations, or from real data transformed with a known function, can simply be evaluated for accuracy, for example the RMSE on time-series (Xiao et al., 2018b; McDermott et al., 2018; Yoon et al., 2018b; Yang et al., 2019b; Zhu et al., 2020a).

3.7 Alternative evaluation

In their publications, Yale et al. propose refreshing approaches to evaluating the utility of SD. For example, they organized a hack-a-thon type challenge involving the data. During the event, students were tasked with creating classifiers, while provided only with SD (Yale et al., 2020). They were then scored on the accuracy of their model on real data.

In more rigorous initiatives, they attempted (successfully) to recreate the experiments published in

medical papers based on the MIMIC dataset using only data generated from their model HealthGAN. In a

subsequent version of their article, the authors evaluate the performance of their model against traditional

privacy preservation methods by using the trained discriminator component of HealthGAN to discriminate

real from synthetic samples.


3.8 Privacy

Some authors conducted a privacy risk assessment to evaluate the risk of reidentification. The empirical analyses were based on the definitions of MI and AD (Choi et al., 2017a; Goncalves et al., 2020; Yan et al., 2020; Chen et al., 2019b; Chin-Cheong et al., 2020) and the Reproduction rate (RR) (Zhang et al., 2020). Cosine similarities between pairs of samples were also used (Torfi and Beyki, 2019). Most studies report low success rates for these types of attacks, and little effect from the sample size, although Chen et al. note that sample sizes under 10k lead to higher risk. Goncalves et al. evaluated MC-medGAN against multiple non-adversarial generative models in a variety of privacy-compromising attacks, including AD, obtaining inconsistent results for MC-medGAN (Goncalves et al., 2020). While this is not mentioned by the authors, multiple results reported in the publication point to the fact that the GAN was not properly trained or suffered mode collapse. In black-box and white-box type attacks, including the LOGAN (Hayes et al., 2017) method, medGAN performed considerably better than WGAN-GP (Chen et al., 2019b), the algorithm which served as the basis for improvements to medGAN in publications discussed in Section 3.4.1. Overall, the authors note that releasing the full model poses a high risk of privacy breaches and that smaller training sets (under 10k) also lead to a higher risk.
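The cosine-similarity check mentioned above can be sketched as follows: compute, for each synthetic record, the maximum cosine similarity to any real training record; values near 1.0 would suggest memorized patients. The data and the implicit decision threshold are illustrative assumptions.

```python
import numpy as np

def max_cosine_similarity(real, synth):
    """For each synthetic record, the highest cosine similarity
    to any record in the real training set."""
    real = real / np.linalg.norm(real, axis=1, keepdims=True)
    synth = synth / np.linalg.norm(synth, axis=1, keepdims=True)
    return (synth @ real.T).max(axis=1)

rng = np.random.default_rng(1)
real = rng.normal(size=(50, 10))       # training records (10 features)
synth_ok = rng.normal(size=(5, 10))    # independent draws, no memorization
synth_leak = real[:5] + 1e-6           # near-copies of training records

sims_ok = max_cosine_similarity(real, synth_ok)
sims_leak = max_cosine_similarity(real, synth_leak)
```

A clear separation between the two regimes is what makes this a useful, if coarse, red flag for memorization.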

3.8.1 The status of fully synthetic data in regards to current privacy regulations

It seems intuitively possible that the artificial nature of SD essentially prevents associations with real patients; however, the question is never directly addressed in the publications. An extensive Stanford Technology Law Review legal analysis of SD concluded that laws and regulations should not treat SD indiscriminately from traditional privacy preservation methods (Bellovin et al., 2019). They state that current privacy statutes either overstate or downplay the potential for SD to leak secrets by implicitly treating it as the equivalent of anonymization.

3.8.2 Traditional privacy

Numerous attempts at applying traditional privacy guarantees, such as differentially-private stochastic gradient descent, can also be found in other fields, as well as in healthcare (Beaulieu-Jones et al., 2019; Esteban et al., 2017; Chin-Cheong et al., 2020; Bae et al., 2020a). By limiting the gradient amplitude at each step and adding random noise, AC-GAN could produce useful data with ε = 3.5 and δ < 10^-5 according to the definition of differential privacy.
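The mechanism described above is the core of differentially-private SGD and can be sketched in two steps: clip each per-example gradient to a maximum L2 norm C, then add Gaussian noise scaled by C before averaging. The values of C and the noise multiplier below are illustrative; obtaining the resulting (ε, δ) guarantee requires a separate privacy accountant, which this sketch does not include.

```python
import numpy as np

def dp_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """One DP-SGD step's gradient: clip per-example norms, average, add noise."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / norm))  # scale down if too large
    mean = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(per_example_grads),
                       size=mean.shape)
    return mean + noise

rng = np.random.default_rng(0)
grads = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]  # norms 5.0 and 0.5
g_priv = dp_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
```

Clipping bounds each patient's influence on the update, and the noise masks whatever influence remains, which is what yields the (ε, δ) guarantee when tracked over training.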

3.8.3 Moving forward safely

Some have put forward the notion that preventing over-fitting and preserving privacy may not be conflicting goals (Wu et al., 2019; Mukherjee et al., 2019; Zhu et al., 2020b). Letting go of the negative connotation, we can explore benefits such as improving generalization, stabilizing learning and building fairer models (Zhu et al., 2020b), and the use of GANs to optimize the trade-off (Chen et al., 2019c).

- Bae et al. ensure privacy with a probabilistic scheme that ensures indistinguishability but also maximizes utility. Specifically, they apply a multiplicative perturbation by random orthogonal matrices to input entries of k×m medical records, and add a second discriminator in the form of a pre-trained predictor (Bae et al., 2020a).

- In privGAN (Mukherjee et al., 2019), an adversary is introduced, forcing the generator to produce samples that minimize the risk of MI attacks, in addition to cheating the discriminator. The combination of both goals has the explicit effect of preventing over-fitting, and their algorithm produces samples of similar quality to a non-private GAN.
