Scalable and accurate deep learning for electronic health records


Alvin Rajkomar*1,2, Eyal Oren*1, Kai Chen1, Andrew M. Dai1, Nissan Hajaj1, Peter J. Liu1, Xiaobing Liu1, Mimi Sun1, Patrik Sundberg1, Hector Yee1, Kun Zhang1, Yi Zhang1, Gavin E. Duggan1, Gerardo Flores1, Michaela Hardt1, Jamie Irvine1, Quoc Le1, Kurt Litsch1, Jake Marcus1, Alexander Mossin1, Justin Tansuwan1, De Wang1, James Wexler1, Jimbo Wilson1, Dana Ludwig2, Samuel L. Volchenboum4, Katherine Chou1, Michael Pearson1, Srinivasan Madabushi1, Nigam H. Shah3, Atul J. Butte2, Michael Howell1, Claire Cui1, Greg Corrado1, and Jeff Dean1
1Google Inc, Mountain View, California
2University of California, San Francisco, San Francisco, California
3Stanford University, Stanford, California
4University of Chicago Medicine, Chicago, Illinois
October 2017
Predictive modeling with electronic health record (EHR) data is anticipated to drive per-
sonalized medicine and improve healthcare quality. Constructing predictive statistical models
typically requires extraction of curated predictor variables from normalized EHR data, a labor-
intensive process that discards the vast majority of information in each patient’s record. We
propose a representation of patients’ entire, raw EHR records based on the Fast Healthcare
Interoperability Resources (FHIR) format. We demonstrate that deep learning methods using
this representation are capable of accurately predicting multiple medical events from multiple
centers without site-specific data harmonization. We validated our approach using de-identified
EHR data from two U.S. academic medical centers with 216,221 adult patients hospitalized for
at least 24 hours. In the sequential format we propose, this volume of EHR data unrolled into a
total of 46,864,534,945 data points, including clinical notes. Deep learning models achieved high
accuracy for tasks such as predicting in-hospital mortality (AUROC across sites 0.93-0.94), 30-day
unplanned readmission (AUROC 0.75-0.76), prolonged length of stay (AUROC 0.85-0.86), and
all of a patient’s final diagnoses (frequency-weighted AUROC 0.90). These models outperformed
state-of-the-art traditional predictive models in all cases. We also present a case-study of a
neural-network attribution system, which illustrates how clinicians can gain some transparency
into the predictions. We believe that this approach can be used to create accurate and scalable
predictions for a variety of clinical scenarios, complete with explanations that directly highlight
evidence in the patient’s chart.
*These authors contributed equally
arXiv:1801.07860v1 [cs.CY] 24 Jan 2018
1 Introduction
The promise of digital medicine stems in part from the hope that, by digitizing health data, we
might more easily leverage computer information systems to understand and improve care. In fact,
routinely collected patient healthcare data is now approaching the genomic scale in volume and
complexity [53]. Unfortunately, most of this information is not yet used in the sorts of predictive
statistical models clinicians might use to improve care delivery. It is widely suspected that such
efforts, if successful, could provide major benefits not only for patient safety and quality but also in
reducing health care costs [4, 33, 26, 42].
In spite of the richness and potential of available data, scaling the development of predictive models
is difficult because, for traditional predictive modeling techniques, each outcome to be predicted
requires the creation of a custom dataset with specific variables [17]. It is widely held that 80% of the effort in an analytic model is spent preprocessing, merging, customizing, and cleaning datasets [45, 38], not analyzing them for insights. This profoundly limits the scalability of predictive models.
Another challenge is that the number of potential predictor variables in the electronic health
record (EHR) may easily number in the thousands, particularly if free-text notes from doctors,
nurses, and other providers are included. Traditional modeling approaches have dealt with this
complexity simply by choosing a very limited number of commonly-collected variables to consider [17].
This is problematic because the resulting models may produce imprecise predictions: false-positive
predictions can overwhelm physicians, nurses, and other providers with false alarms and subsequent
alert fatigue [14], which the Joint Commission identified as a national patient safety priority in 2014
[10]. False-negative predictions can miss significant numbers of clinically important events, leading
to poor clinical outcomes [29]. Incorporating the entire EHR, including clinicians’ free-text notes,
offers some hope of overcoming these shortcomings but is hopelessly unwieldy for most predictive
modeling techniques.
Recent developments in deep learning and artificial neural networks may allow us to address many
of these challenges and unlock the information in the EHR. Deep learning emerged as the preferred
machine learning approach in machine perception problems ranging from computer vision to speech
recognition, but has more recently proven useful in natural language processing, sequence prediction,
and mixed-modality data settings [36, 16, 21, 56]. These systems are known for their ability to handle
large volumes of relatively messy data, including errors in labels and large numbers of input variables.
A key advantage is that investigators do not generally need to specify which potential predictor variables to consider and in what combinations; instead, neural networks learn representations of the key factors and interactions from the data itself.
We hypothesized that these techniques would translate well to healthcare; specifically, that deep learning approaches could incorporate the entire electronic health record, including free-text notes, to produce predictions for a wide range of clinical problems and outcomes that outperform state-of-the-art traditional predictive models. Our central insight is that rather than explicitly harmonizing EHR data, mapping it into a highly curated set of structured predictor variables and then feeding
those variables into a statistical model, we can instead learn to simultaneously harmonize inputs and
predict medical events through direct feature learning [6].
2 Methods
We included EHR data from the University of California, San Francisco (UCSF) from 2012-2016,
and the University of Chicago Medicine (UCM) from 2009-2016. We refer to each health system
as Hospital A and Hospital B. All electronic health records were de-identified, except that dates
of service were maintained in the UCM dataset. Both datasets contained patient demographics,
provider orders, diagnoses, procedures, medications, laboratory values, vital signs, and flowsheet
data, which represents all other structured data elements (e.g. nursing flowsheets), from all inpatient
and outpatient encounters. The UCM dataset additionally contained de-identified, free-text medical
notes. Each dataset was kept in an encrypted, access-controlled, and audited sandbox.
Ethics review and institutional review boards approved the study with waiver of informed consent
or exemption at each institution.
Data representation and processing
We developed a single data structure that could be used for any prediction, rather than requiring
custom, hand-created datasets for every new prediction. This approach represents the entire EHR
in temporal order: data are organized by patient and by time. To represent events in a patient’s
timeline, we adopted the FHIR standard (Fast Healthcare Interoperability Resources) [23]. FHIR
defines the high-level representation of healthcare data into resources, but leaves values in each
individual site’s idiosyncratic codings [40]. Each event is derived from a FHIR resource and may
contain multiple attributes; for example, a medication-order resource could contain the trade name, generic name, ingredients, and other fields. Data in each attribute were split into discrete values, which we refer to as tokens. For notes, the text was split into a sequence of tokens, one for each word.
Numeric values were normalized, as detailed in the appendix. The entire sequence of time-ordered
tokens, from the beginning of a patient’s record until the point of prediction, formed the patient’s
personalized input to the model. This process is illustrated in Figure 1.
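The unrolling described above can be sketched in Python. This is a minimal sketch: the event fields, token format, and example values are illustrative assumptions, not the paper's exact scheme.

```python
from dataclasses import dataclass

@dataclass
class Event:
    """A single FHIR-derived event: one resource attribute at one time."""
    time: int          # seconds since the start of the record (illustrative)
    attribute: str     # e.g. "medication_order.trade_name" (hypothetical name)
    value: str         # the raw value recorded in the site's own coding

def tokenize_record(events, prediction_time):
    """Unroll a patient's events into a time-ordered token sequence.

    Only events up to the prediction point are kept, mirroring the rule
    that each prediction uses all data available when it is made.
    """
    usable = sorted((e for e in events if e.time <= prediction_time),
                    key=lambda e: e.time)
    tokens = []
    for e in usable:
        if e.attribute == "note.text":
            # Free-text notes are split into one token per word.
            tokens.extend(f"note:{w}" for w in e.value.lower().split())
        else:
            # Structured attributes become single discrete tokens.
            tokens.append(f"{e.attribute}:{e.value}")
    return tokens

events = [
    Event(3600, "medication_order.trade_name", "pleurx"),
    Event(1800, "note.text", "malignant pleural effusion"),
    Event(90000, "lab.lactate", "high"),  # after the prediction point; dropped
]
print(tokenize_record(events, prediction_time=86400))
```

Note that no cross-site harmonization happens here: the token namespace is simply each site's own attribute and value strings, placed in temporal order.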
We were interested in understanding whether deep learning could produce valid predictions across a wide range of clinical problems and outcomes. We therefore selected outcomes from divergent domains,
including an important clinical outcome (death), a standard measure of quality of care (readmissions),
a measure of resource utilization (length of stay), and a measure of understanding of a patient’s
problems (diagnoses).
Inpatient mortality
We predicted impending inpatient death, defined as a discharge disposition
of “expired” [52, 30, 54, 57].
30-day unplanned readmission
We predicted unplanned 30 day readmission, defined as an
admission within 30 days after discharge from an “index” hospitalization. A hospitalization was
considered a “readmission” if admission date was within thirty days after discharge of an eligible
index hospitalization. A readmission could only be counted once. There is no standard definition
of “unplanned” [15] so we used a modified form of the Centers for Medicare and Medicaid Services
(CMS) definition [1], which we detail in the appendix. Billing diagnoses and procedures from the index hospitalization were not used for the prediction because they are typically generated after discharge. We included only readmissions to the same institution.
Figure 1: Data from each health system were mapped to an appropriate FHIR (Fast Healthcare Interoperability Resources) resource and placed in temporal order. This conversion did not harmonize or standardize the data from each health system other than to map them to the appropriate resource. The deep learning model could use all data available prior to the point when the prediction was made. Therefore each prediction, regardless of the task, used the same data.
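The windowing logic for this label can be sketched as follows. This is a minimal sketch covering only the 30-day window and the count-once rule; it does not implement the modified CMS exclusions for planned admissions, which are detailed in the appendix.

```python
from datetime import date

def label_readmissions(admissions):
    """Label each hospitalization with a 30-day unplanned-readmission flag.

    `admissions` is a list of (admit_date, discharge_date) tuples for one
    patient, sorted by admit_date. A readmission is an admission within
    30 days of a prior discharge, and each admission counts as a
    readmission at most once.
    """
    labels = []
    used = [False] * len(admissions)   # admissions already counted as readmissions
    for i, (_, discharge) in enumerate(admissions):
        readmitted = False
        for j in range(i + 1, len(admissions)):
            if used[j]:
                continue
            if 0 <= (admissions[j][0] - discharge).days <= 30:
                readmitted = True
                used[j] = True
                break
        labels.append(readmitted)
    return labels

admissions = [
    (date(2015, 1, 1), date(2015, 1, 5)),    # index stay
    (date(2015, 1, 20), date(2015, 1, 25)),  # readmitted 15 days later
    (date(2015, 4, 1), date(2015, 4, 3)),    # >30 days out; not a readmission
]
print(label_readmissions(admissions))  # → [True, False, False]
```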
Long length of stay We predicted a length of stay of at least 7 days, which was approximately the 75th percentile of hospital stays for most services across the datasets. The length of stay was defined as
the time between hospital admission and discharge.
Discharge diagnoses We predicted the entire set of primary and secondary ICD-9 billing diagnoses (i.e. from a universe of 14,025 codes).
Prediction Timing
This was a retrospective study. To predict inpatient mortality, we stepped forward through each
patient’s time course, and made predictions every twelve hours starting 24 hours before admission
until 24 hours after admission. Since many clinical prediction models, such as APACHE, are rendered
24 hours after admission, our primary outcome prediction for in-patient mortality was at that
time-point. Unplanned readmission and the set of diagnosis codes were predicted at admission,
24 hours after admission, and at discharge. The primary endpoints for those predictions were at
discharge, when most readmission prediction scores are computed and when all information necessary
to assign billing diagnoses is available [28]. Long-length of stay was predicted at admission and 24
hours after admission. For every prediction we used all information available in the EHR up to the
time at which the prediction was made.
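The timing scheme for the inpatient-mortality predictions can be sketched as below; representing times in hours relative to admission is an illustrative convention, not the paper's implementation.

```python
def mortality_prediction_inputs(event_times, admit_time=0):
    """For each scheduled inpatient-mortality prediction point (every 12
    hours from 24 hours before to 24 hours after admission), collect the
    events usable at that point: everything recorded up to the prediction
    time, mirroring the study's rule that each prediction uses all
    information available in the EHR at that moment."""
    schedule = [admit_time + h for h in (-24, -12, 0, 12, 24)]
    return {t: [e for e in event_times if e <= t] for t in schedule}

# Events recorded at -30 h, -5 h, 3 h, 20 h, and 40 h relative to admission.
inputs = mortality_prediction_inputs([-30, -5, 3, 20, 40])
```

At the primary endpoint (24 hours after admission) this usable set includes everything except the 40-hour event; the prediction 24 hours before admission sees only the prior-record event.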
Study Cohort
We included all consecutive admissions for patients 18 years or older. We only
included hospitalizations of 24 hours or longer to ensure that predictions at various timepoints had
identical cohorts.
To simulate the accuracy of a real-time prediction system, we included patients typically removed
in studies of readmission, such as those discharged against medical advice, since these exclusion
criteria would not be known when making predictions earlier in the hospitalization.
For predicting the ICD-9 diagnoses, we excluded encounters without any ICD-9 diagnosis (2-12%
of encounters). These were generally encounters after October 2015, when hospitals switched to
ICD-10. We included such hospitalizations, however, for all other predictions.
Algorithm Development and Analysis
We used the same modeling algorithm on both hospitals’
datasets, but treated each hospital as a separate dataset and report results separately.
Patient records vary significantly in length and density of data-points (e.g. vital sign measurements
in an intensive care unit vs outpatient clinic), so we formulated three deep learning neural-network
model architectures that take advantage of such data in different ways: one based on recurrent neural
networks (LSTM) [24], one on an attention-based time-aware neural network model (TANN), and
one on a neural network with boosted time-based decision stumps. Details of these architectures
are explained in the appendix. Each model was trained on each hospital’s data separately for each
prediction and time-point. To optimize model accuracy, our final model is an ensemble of predictions
from the three underlying model architectures [46].
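The final ensembling step can be sketched as below. Uniform averaging of the three architectures' predicted probabilities is an assumption made for illustration; the paper's exact ensembling scheme is described in its appendix.

```python
def ensemble_risk(lstm_probs, tann_probs, stump_probs):
    """Combine the three architectures' predicted probabilities into a
    single risk score per patient by simple averaging (illustrative)."""
    return [(a + b + c) / 3.0
            for a, b, c in zip(lstm_probs, tann_probs, stump_probs)]

# Predicted risks from each model for two patients.
risks = ensemble_risk([0.21, 0.05], [0.18, 0.07], [0.24, 0.03])
```

Averaging the calibrated outputs of heterogeneous architectures is a standard way to reduce variance when, as here, the models exploit the data's temporal structure in different ways.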
Comparison to previously published algorithms
We implemented models based on previously
published algorithms to establish baseline performance on each dataset. For mortality, we used a
logistic model with variables inspired by the NEWS score [50], with additional variables added to make it more accurate: the most recent systolic blood pressure, heart rate, respiratory rate, and temperature, as well as 24 common lab tests, like the white-blood-cell count, lactate, and creatinine. We call this the augmented Early Warning Score (aEWS). For readmission, we used a logistic
model with variables used by the HOSPITAL [12] score, including the most recent sodium and
hemoglobin level, hospital service, occurrence of CPT codes, number of prior hospitalizations, and
length of the current hospitalization. We refer to this as the mHOSPITAL score. For long length
of stay, we used a logistic model with variables similar to those used by Liu [37]: the age, gender,
Hierarchical Condition Categories, admission source, hospital service, and the same 24 common lab
tests used in the aEWS score. We refer to this as the modified Liu (mLiu) score. Details for the
baseline models are in the appendix. We are not aware of any commonly used baseline model for all
diagnosis codes so we compare against known literature.
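As a rough sketch of what such a logistic baseline looks like in code: the five features and synthetic data below are illustrative stand-ins, not the actual aEWS variables or coefficients (the real model used 28 factors).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in: 5 features playing the role of recent vitals/labs.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))           # e.g. sbp, hr, rr, temp, wbc (illustrative)
true_coef = np.array([-0.8, 0.9, 1.1, 0.3, 0.7])
y = (X @ true_coef + rng.normal(size=500) > 0).astype(int)

# Fit the logistic baseline and score each patient.
baseline = LogisticRegression(max_iter=1000).fit(X, y)
risk = baseline.predict_proba(X)[:, 1]  # predicted probability of the outcome
auroc = roc_auc_score(y, risk)
```

Unlike the deep models, a baseline of this kind sees only the hand-picked columns of `X`; everything else in the record is discarded before fitting.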
Explanation of predictions
A common criticism of neural networks is that they offer little insight
into the factors that influence the prediction [9]. Therefore, we used attribution mechanisms to
highlight, for each patient, the data elements that influenced their predictions [2].
The LSTM and TANN models were trained with TensorFlow and the boosting model was
implemented with C++ code. Statistical analyses and baseline models were done in Scikit-learn
Python [43].
Technical details of the model architecture, training, variables, baseline models, and attribution
methods are provided in the appendix.
Model Evaluation and Statistical Analysis
Patients were randomly split into development
(80%), validation (10%) and test (10%) sets. Model accuracy is reported on the test set, and 1000
bootstrapped samples were used to calculate 95% confidence intervals. To prevent overfitting, the
test set remained unused (and hidden) until final evaluation.
We assessed model discrimination by calculating area under the receiver operating characteristic
curve (AUROC) and model calibration using comparisons of predicted and empirical probability curves (Pencina and D'Agostino 2015). We did not use the Hosmer-Lemeshow test as it may be
misleadingly significant with large sample sizes [32]. To quantify the potential clinical impact of an
alert with 80% sensitivity, we report the work-up to detection ratio, also known as the number needed
to evaluate [47]. For prediction of a patient's full set of diagnosis codes, which can range from 1 to
228 codes per hospitalization, we evaluated the accuracy for each class using macro-weighted-AUROC
[48] and micro-weighted F1 score [49] to compare with the literature. The F1 score is the harmonic
mean of positive-predictive-value and sensitivity; we used a single threshold picked on the validation
set for all classes. We did not create confidence intervals for this task given the computational
complexity of the number of possible diagnoses.
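The bootstrap confidence intervals and the work-up-to-detection ratio can be sketched as follows; the function names and exact threshold handling are ours, a minimal sketch rather than the study's implementation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, seed=0):
    """95% CI for the AUROC from bootstrap resampling of the test set
    (the paper used 1000 bootstrapped samples)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)              # resample with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                             # resample lacks one class; skip
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(aucs, 2.5), np.percentile(aucs, 97.5)

def work_up_to_detection(y_true, y_score, sensitivity=0.8):
    """Patients flagged per true positive detected (the 'number needed to
    evaluate') when the alert threshold is set to reach the given
    sensitivity."""
    order = np.argsort(-np.asarray(y_score))     # rank patients by risk
    hits = np.cumsum(np.asarray(y_true)[order])  # true positives among top-k
    needed = int(np.ceil(sensitivity * hits[-1]))
    flagged = int(np.searchsorted(hits, needed)) + 1
    return flagged / needed
```

A lower work-up-to-detection ratio means fewer false alarms per true event, which is the quantity halved by the deep model in the mortality results.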
3 Results
We included a total of 216,221 hospitalizations involving 114,003 unique patients. The percent of
hospitalizations with in-hospital deaths was 2.3% (4,930/216,221), unplanned 30-day readmissions was 12.9% (27,918/216,221), and long length of stay was 23.9%. Patients had a range of 1 to 228
discharge diagnoses. Demographics and utilization characteristics are summarized in Table 1. At the
time of admission, an average admission had 137,882 tokens, which increased markedly throughout the
patient’s stay to 216,744 at discharge (Figure 2). For predictions made at discharge, the information
considered across both datasets included 46,864,534,945 tokens of EHR data.
Inpatient Mortality
For predicting inpatient mortality, the AUROC at 24 hours after admission was 0.95
(95% CI 0.94-0.96) for Hospital A and 0.93 (95% CI 0.92-0.94) for Hospital B. This was significantly
more accurate than the traditional predictive model, aEWS which was a 28-factor logistic regression
model (AUROC 0.85 [95% CI 0.81-0.89] for Hospital A and 0.86 [95% CI 0.83-0.88] for Hospital B).
Table 1: Characteristics of Hospitalizations in Training and Test Sets
Training Data (n=194,470) Test Data (n=21,751)
Hospital A Hospital B Hospital A Hospital B
(n=85,522) (n=108,948) (n=9,624) (n=12,127)
Age, median (IQR) y 56 (29) 57 (29) 55 (29) 57 (30)
Female sex, No. (%) 46 848(54.8%) 62 004(56.9%) 5364(55.7%) 6935(57.2%)
Disease Cohort, No (%)
Medical 46 579(54.5%) 55 087(50.6%) 5263(54.7%) 6112(50.4%)
Cardiovascular 4616 (5.4%) 6903 (6.3%) 528 (5.5%) 749 (6.2%)
Cardiopulmonary 3498 (4.1%) 9028 (8.3%) 388 (4.0%) 1102 (9.1%)
Neurology 6247 (7.3%) 6653 (6.1%) 697 (7.2%) 736 (6.1%)
Cancer 14 544(17.0%) 19 328(17.7%) 1617(16.8%) 2087(17.2%)
Psychiatry 788 (0.9%) 339 (0.3%) 64 (0.7%) 35 (0.3%)
Obstetrics & newborn 8997(10.5%) 10 462 (9.6%) 1036(10.8%) 1184 (9.8%)
Other 253 (0.3%) 1148 (1.1%) 31 (0.3%) 122 (1.0%)
Previous Hospitalizations
0 hospitalizations 54 954(64.3%) 56 197(51.6%) 6123(63.6%) 6194(51.1%)
≥1 and <2 hospitalizations 14 522(17.0%) 19 807(18.2%) 1620(16.8%) 2175(17.9%)
≥2 and <6 hospitalizations 12 591(14.7%) 24 009(22.0%) 1412(14.7%) 2638(21.8%)
≥6 hospitalizations 3455 (4.0%) 8935 (8.2%) 469 (4.9%) 1120 (9.2%)
Discharge Location
Home 70 040(81.9%) 91 273(83.8%) 7938(82.5%) 10 109(83.4%)
Skilled Nursing Facility 6601 (7.7%) 5594 (5.1%) 720 (7.5%) 622 (5.1%)
Rehabilitation 2666 (3.1%) 5136 (4.7%) 312 (3.2%) 649 (5.4%)
Another Healthcare Facility 2189 (2.6%) 2052 (1.9%) 243 (2.5%) 220 (1.8%)
Expired 1816 (2.1%) 2679 (2.5%) 170 (1.8%) 265 (2.2%)
Other 2210 (2.6%) 2214 (2.0%) 241 (2.5%) 262 (2.2%)
Primary Outcomes
In-hospital deaths, No. (%) 1816 (2.1%) 2679 (2.5%) 170 (1.8%) 265 (2.2%)
30-day readmissions No. (%) 9136(10.7%) 15 932(14.6%) 1013(10.5%) 1837(15.1%)
Hospital stays
at least 7 days, No. (%) 20 411(23.9%) 26 109(24.0%) 2145(22.3%) 2931(24.2%)
Table 2: Prediction Accuracy of Each Task Made at Different Time Points
Hospital A Hospital B
Inpatient Mortality, AUROC1 (95% CI)
24 hours before admission 0.87 (0.85-0.89) 0.81 (0.79-0.83)
At admission 0.90 (0.88-0.92) 0.90 (0.86-0.91)
24 hours after admission 0.95 (0.94-0.96) 0.93 (0.92-0.94)
Baseline (aEWS2) at 24 hours after admission 0.85 (0.81-0.89) 0.86 (0.83-0.88)
30-day Readmission, AUROC (95% CI)
At admission 0.73 (0.71-0.74) 0.72 (0.71-0.73)
24 hours after admission 0.74 (0.72-0.75) 0.73 (0.72-0.74)
At discharge 0.75 (0.75-0.78) 0.76 (0.75-0.77)
Baseline (mHOSPITAL3) at discharge 0.70 (0.68-0.72) 0.68 (0.67-0.69)
Length of Stay at least 7 days, AUROC (95% CI)
At admission 0.81 (0.80-0.82) 0.80 (0.80-0.81)
24 hours after admission 0.86 (0.86-0.87) 0.85 (0.85-0.86)
Baseline (mLiu4) at 24 hours after admission 0.76 (0.75-0.77) 0.74 (0.73-0.75)
Discharge Diagnoses, (weighted AUROC)
At admission 0.87 0.86
24 hours after admission 0.89 0.88
At discharge 0.90 0.90
1Area under the receiver operator curve
2augmented early warning score
3modified HOSPITAL score
4modified Liu score
Figure 2: This boxplot displays the amount of data (on a log scale) in the EHR, along with its
temporal variation across the course of an admission. We define a token as a single data element
in the electronic health record, like a medication name, at a specific point in time. Each token is
considered as a potential predictor by the deep learning model. The line within the boxplot represents
the median, the box represents the interquartile range (IQR), and the whiskers are 1.5 times the
IQR. The number of tokens increased steadily from admission to discharge. At discharge, the median
number of tokens for Hospital A was 86,477 and for Hospital B was 122,961.
Figure 3: The area under the receiver operator curves are shown for predictions of inpatient mortality
made by deep learning and baseline models at twelve hour increments before and after hospital
admission. For inpatient mortality, the deep learning model achieves higher discrimination at every
prediction time compared to the baseline for both the University of California, San Francisco (UCSF)
and University of Chicago Medicine (UCM) cohorts. Both models improve in the first 24 hours,
but the deep learning model achieves a similar level of accuracy approximately 24 hours earlier for
UCM and even 48 hours earlier for UCSF. The error bars represent the bootstrapped 95% confidence intervals.
If a clinical team had to investigate patients predicted to be at high risk of dying, the rate of false
alerts at each point in time was roughly halved by our model: at 24 hours, the work-up to detection
ratio of our model compared to the aEWS was 7.4 vs 14.3 (Hospital A) and 8.0 vs 15.4 (Hospital B).
The deep learning model predicted events 24-48 hours earlier than the traditional predictive model
(Figure 3).
30-day Unplanned Readmission
For predicting unplanned readmissions within 30 days, the AUROCs at discharge
were 0.77 (95% CI 0.75-0.78) for Hospital A and 0.76 (95% CI 0.75-0.77) for Hospital B. These were
significantly higher than the traditional predictive model (mHOSPITAL) model at discharge, which
were 0.70 (95% CI 0.68-0.72) for Hospital A and 0.68 (95%CI 0.67-0.69) for Hospital B.
Long Length of Stay
For predicting long length-of-stay, the AUROCs at 24 hours after admission
were 0.86 (95% CI 0.86-0.87) for Hospital A and 0.85 (95% CI 0.84-0.86) for Hospital B. These were significantly higher than the traditional predictive model (mLiu) at 24 hours, which were 0.76
(95% CI 0.75- 0.77) for Hospital A and 0.74 (95% CI 0.73-0.75) for Hospital B.
Calibration curves for the three tasks are shown in the appendix.
Inferring Discharge Diagnoses The deep learning algorithm predicted patients’ discharge diag-
noses at three time points: at admission, after 24 hours of hospitalization, and at the time of discharge
(but before the discharge diagnoses were coded). For classifying all diagnosis codes, the weighted
AUROCs at admission were 0.87 for Hospital A and 0.86 for Hospital B. Accuracy increased somewhat
during the hospitalization, to 0.88-0.89 at 24 hours and 0.90 for both hospitals at discharge. For
classifying ICD-9 code predictions as correct, we required full-length code agreement. For example,
250.4 (“Diabetes with renal manifestations”) would be considered different from 250.42 (“Diabetes
with renal manifestations, type II or unspecified type, uncontrolled”). We also calculated the micro-F1
score at discharge which were 0.41 (Hospital A) and 0.40 (Hospital B).
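These two multilabel metrics can be sketched on a toy example with 3 codes instead of roughly 14,025; the labels and scores below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

# Toy multilabel setup: 6 hospitalizations, 3 diagnosis codes.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0],
                   [0, 1, 1]])
y_score = np.array([[0.9, 0.2, 0.8],
                    [0.1, 0.7, 0.3],
                    [0.8, 0.6, 0.2],
                    [0.2, 0.1, 0.9],
                    [0.7, 0.3, 0.1],
                    [0.3, 0.8, 0.6]])

# Frequency-weighted AUROC: per-code AUROC, weighted by code prevalence.
weighted_auroc = roc_auc_score(y_true, y_score, average="weighted")

# Micro-F1 at a single probability threshold shared across all codes, as in
# the paper (the real threshold was picked on the validation set; 0.5 here).
micro_f1 = f1_score(y_true, (y_score >= 0.5).astype(int), average="micro")
```

Micro-averaging pools true positives and false positives across every code before computing F1, so frequent codes dominate the score, matching how the literature reports this task.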
Case study of Model Interpretation
In Figure 4, we illustrate an example of attribution methods
on a specific prediction of inpatient-mortality made at 24 hours after admission. For this patient,
the deep learning model predicted the risk of death of 19.9% and the baseline model predicted
9.3%, and the patient ultimately died 10 days after admission. This patient’s record had 175,639
data points (tokens) which were considered by the model. The timeline in Figure 4 highlights the
elements to which the model attends, with a close-up view of the first 24 hours of the most recent
hospitalization. From all the data, the models picked the elements that are highlighted in Figure 4:
evidence of malignant pleural effusions and empyema from notes, antibiotics administered, and nursing documentation of a high risk of pressure ulcers (e.g. Braden index) (Bergstrom et al. 1987).
The model also placed high weights on concepts such as “pleurx,” the trade-name for a small chest
tube. The bolded sections are exactly what the model identified as discriminatory factors, not a
manual selection. In contrast, the top predictors for the baseline model (not shown in Figure 4) were
the values of the albumin, blood-urea-nitrogen, pulse, and white blood cell count. Note that for
demonstration purposes, this example was generated from TANNs trained on separate modalities
(e.g. flowsheets and notes), which is a common visualization technique to handle redundant features
in the data (e.g. medication orders are also referenced in notes).
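A toy sketch of this kind of token-level highlighting is shown below. The weights are made-up attribution scores standing in for whatever the model produces (the actual attribution methods are described in the appendix), and the marker format is ours.

```python
def highlight_top_tokens(tokens, weights, top_k=3):
    """Mark the tokens with the highest attribution weights, mimicking the
    red-highlighted spans in Figure 4. Non-highlighted tokens are kept so
    the clinician sees them in context."""
    ranked = sorted(range(len(tokens)), key=lambda i: -weights[i])
    keep = set(ranked[:top_k])
    return " ".join(f"**{t}**" if i in keep else t
                    for i, t in enumerate(tokens))

note = ["large", "malignant", "pleural", "effusions", "with", "empyema"]
w =    [0.01,    0.40,        0.25,      0.20,        0.02,   0.12]
print(highlight_top_tokens(note, w))
# → large **malignant** **pleural** **effusions** with empyema
```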
4 Discussion
A deep learning approach that incorporated the entire electronic health record, including free-text
notes, produced predictions for a wide range of clinical problems and outcomes that outperformed
Figure 4: The patient record shows a woman with metastatic breast cancer with malignant pleural
effusions and empyema. The patient timeline at the top of the figure contains circles for every
time-step for which at least a single token exists for the patient, and the horizontal lines show the
data-type. There is a close-up view of the most recent data-points immediately preceding a prediction
made 24 hours after admission. We trained models for each data-type and highlighted in red the
tokens which the models attended to – the non-highlighted text was not attended to but is shown for
context. The models pick up features in the medications, nursing flowsheets, and clinical notes to
make the prediction.
state-of-the-art traditional predictive models. Because we were interested in understanding whether
deep learning could scale to produce valid predictions across divergent healthcare domains, we used
a single data structure to make predictions for an important clinical outcome (death), a standard
measure of quality of care (readmissions), a measure of resource utilization (length of stay), and a
measure of understanding of a patient’s problems (diagnoses).
This method represents an important advance in the scalability of predictive models in clinical care
for several reasons. First, our study’s approach uses a single data-representation of the entire EHR
as a sequence of events, allowing this system to be used for any prediction that would be clinically or
operationally useful with minimal data preparation. Traditional predictive models require substantial
work to prepare a hand-crafted, tailored dataset with specific variables, selected by experts and
assembled by analysts for each new prediction [17]. This data preparation and cleaning typically
consumes up to 80% of the effort of any predictive analytics project [45, 38], limiting the scalability
of predictive models in healthcare. Second, using the entirety of a patient’s chart for every prediction
does more than promote scalability, it exposes more data with which to make an accurate prediction.
For predictions made at discharge, our deep learning models considered more than 46 billion pieces
of EHR data and achieved more accurate predictions, earlier in the hospital stay, than did traditional
models. The clinical impact of this improvement is suggested, for example, by the improvement
of number needed to evaluate for inpatient mortality: the deep learning model would fire half the
number of alerts of a traditional predictive model, resulting in many fewer false positives.
However, the novelty of the approach does not lie simply in incremental model performance
improvements. Rather, this predictive performance was achieved without hand-selection of variables
deemed important by an expert. Instead, the model had access to tens of thousands of predictors for
each patient, including free-text notes, and learned what was important for a particular prediction.
Our study also has important limitations. First, it is a retrospective study, with all of the usual
limitations. Second, although it is widely believed that accurate predictions can be used to improve
care [4], this is not a foregone conclusion and prospective trials are needed to demonstrate this [34,
20]. Third, a necessary implication of personalized predictions is that they leverage many small data
points specific to a particular EHR rather than a handful of common variables. Future research is
needed to determine how models trained at one site can be best applied to another site, [22] which
would be especially useful for sites with limited historical data to train a model with. As a first step,
we demonstrated that the same training algorithm yielded comparable models for two geographically
distinct health systems, but further research is needed on this point. Finally, our methods are
computationally intense and at present require specialized expertise to implement. However, the availability and accessibility of machine learning is rapidly expanding, both in healthcare and in other fields.
Perhaps the most challenging prediction in our study is that of predicting a patient’s full suite
of discharge diagnoses. The prediction is difficult for several reasons. First, a patient may have
between 1 and 228 diagnoses, and the number is not known at the time of prediction. Second, each
diagnosis may be selected from approximately 14,025 ICD-9 diagnoses codes, which makes the total
number of possible combinations exponentially large. Finally, many ICD-9 codes are clinically similar
but numerically distinct (for example, 011.30 “Tuberculosis of bronchus, unspecified” vs. 011.31
“Tuberculosis of bronchus, bacteriological or histological examination not done”). This has the effect
of introducing random error into the prediction. Our model’s micro-F1 score, a metric used when a
prediction has more than a single outcome (e.g. multiple diagnoses), is higher than that reported
in the literature for an ICU dataset with fewer diagnoses [44]. This is a proof-of-concept
that demonstrates that the diagnosis could be inferred from routine EHR data, which could aid with
triggering of decision support [5] or clinical trial recruitment.
The use of free text for prediction also allows a new level of explainability of predictions. Clinicians
have historically distrusted neural network models because of their opaqueness. We demonstrate how
our method can visualize what data the model “looked at” for each individual patient, which can
be used by a clinician to determine if a prediction was based on credible facts, and potentially help
decide actions. In our case study, the model identified elements of the patient’s history and radiology
findings to render its prediction, which are critical data-points that a clinician would also use [41].
This approach may help address concerns about whether such “black box” methods are trustworthy. However,
further research is needed regarding both the cognitive impact of this approach and its clinical utility.
5 Conclusions
Accurate predictive models can be built directly from EHR data for a variety of important clinical
problems with explanations highlighting evidence in the patient’s chart.
6 Acknowledgements
For data acquisition and validation, we thank the following: Julie Johnson, Sharon Markman,
Thomas Sutton, Brian Furner, Julie Johnson, Timothy Holper, Sharat Israni, Jeff Love, Doris Wong,
Michael Halaas, Juan Banda. For statistical consultation, we thank Farzan Rohani. For modeling
infrastructure, we thank Daniel Hurt and Zhifeng Chen. For help with visualizing Figures 1 and 3,
we thank Kaye Mao and Mahima Pushkarna.
A Data Representation
Data from each electronic health record was imported into a new schema based on the open-source
Fast Healthcare Interoperability Resources (FHIR) resource standards. We populated relevant
data into elements from the following resources: Patient, Encounter, Medication, Observation (e.g.
vital signs and nursing documentation), Composition (e.g. notes), Conditions (i.e. diagnoses),
MedicationAdministration, MedicationOrder, ProcedureRequest, and Procedure. We imported
the data directly from the health system, meaning we did not harmonize elements to a standard
terminology or ontology. If a health system included multiple terminologies, like a site-specific coding
scheme and an RxNorm code (a common medication coding scheme), we imported both. The only
exceptions were for diagnoses/procedures, which we mapped to ICD9/10 and CCS categories if the
health system did not already include them (e.g. for CPT codes), and for elements that were used to
define the primary outcomes, as described in the main manuscript.
In the electronic health record datasets, there was a category of data referred to as “flowsheets,”
which correspond to many structured data elements in clinical care, like vital signs and nursing
documentation. Depending on workflows, data may be collected at the bedside, like a temperature
reading, and then entered in the EHR later. This documentation provides (at least) two timestamps:
when the data was technically collected (recorded time) and when it was entered (entry time).
We specifically used the entry time in the EHR because, especially during emergent situations,
the recorded times are estimated. We found that using the recorded times significantly improved
prediction accuracy, but refrained from using them because the data is not actually available in the
EHR at that point in time.
For each categorical variable in each set of resources, we created a d-dimensional floating-point
embedding vector, with d picked as a hyperparameter. For clinical notes, we created sequences of
embeddings for words that appeared at least a minimum number of times, with that minimum as a
hyperparameter. All embeddings
are randomly initialized. For numeric variables, we also normalized the values. We used hyper-
parameter tuning to select the size of the buckets. We also did tuning to select the best way to
represent specific values as combinations of embeddings representing the nearest neighbors (e.g. linear
combinations of the nearest intervals). These embeddings were randomly initialized and updated
over the course of training the model.
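As an illustrative sketch of this scheme (the vocabulary, bucket boundaries, and embedding dimension below are hypothetical placeholders, not the tuned values from the study):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary of categorical tokens (e.g. medication or lab
# codes); each gets a randomly initialized d-dimensional embedding that
# would be updated during training.
d = 8  # embedding dimension, a hyperparameter
vocab = {"rxnorm:197361": 0, "loinc:718-7": 1, "icd9:428.0": 2}
embeddings = rng.normal(size=(len(vocab), d))

def embed_token(token):
    return embeddings[vocab[token]]

# Numeric values are normalized and assigned to buckets; each bucket has
# its own embedding. These bucket edges are illustrative only.
bucket_edges = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
bucket_embeddings = rng.normal(size=(len(bucket_edges) + 1, d))

def embed_numeric(value, mean, std):
    z = (value - mean) / std                   # normalize the raw value
    b = int(np.searchsorted(bucket_edges, z))  # pick the bucket index
    return bucket_embeddings[b]

vec = embed_numeric(140.0, mean=120.0, std=15.0)
assert vec.shape == (d,)
```

A tuned model could instead blend the embeddings of neighboring buckets (the linear-combination variant mentioned above), which this sketch omits.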
Embeddings representing all data prior to a prediction point were placed in chronological order
E_i, i = 1, ..., n, where n is the number of elements in the sequence. Each embedding vector was
concatenated with a time-delta value or embedding, ∆_i, representing the difference in time from the
data element occurring to the time the prediction was made.
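A minimal sketch of assembling such a sequence, with made-up event times and embeddings (a raw time-delta scalar stands in for the time-delta value or embedding):

```python
import numpy as np

d = 4
# Hypothetical (time-in-hours, embedding) pairs for one patient,
# already embedded as above.
events = [
    (0.0,  np.ones(d) * 0.1),
    (6.0,  np.ones(d) * 0.2),
    (30.0, np.ones(d) * 0.3),
]
prediction_time = 36.0  # hours since admission

# Sort chronologically, then append each event's time-delta
# (prediction time minus event time) to its embedding.
events.sort(key=lambda e: e[0])
sequence = np.stack([
    np.concatenate([emb, [prediction_time - t]]) for t, emb in events
])
assert sequence.shape == (len(events), d + 1)
```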
B Description of Inclusion Criteria and Outcomes
Inclusion Criteria
Inpatient encounters were defined as follows:
1. Encounter was confirmed as complete or non-cancelled.
2. Encounter had a start and end time.
3. Encounter class was defined as inpatient as defined in the dataset.
4. Administrative encounters were excluded (e.g. no primary diagnosis was documented); these encounters accounted for less than 1 percent of hospitalizations in the data received.
The following services were included in the Medical-Surgical Cohort: General Medicine, Cardi-
ology, Neurology, Critical Care Medicine, Hematology, Oncology, Hepatobiliary Medicine, Medical
Specialties, General Surgery, Colorectal Surgery, Otolaryngology, Gynecology, Gynecology-Oncology,
Neurosurgery, Oral-Maxillofacial surgery, Orthopedics, Plastic Surgery, Thoracic Surgery, Transplant
Surgery, Urology, and Vascular Surgery.
Cohort Definitions
We made the following modifications to the CMS cohort definitions [1] to ensure that every primary
diagnosis was listed in a cohort.
We added CCS code 150 (alcoholic liver disease) to the Medicine Cohort.
We created the following new cohorts: an Obstetric Cohort containing CCS codes 176-196; a
Rehabilitation Cohort containing CCS code 254; and an Injury and Poisoning Cohort containing CCS codes 260 and
Determining Unplanned Readmission
We implemented the logic used by CMS to define planned readmissions [1]. The logic evaluates
whether admissions were for reasons that are defined to be planned (e.g. bone marrow transplants
and chemotherapy), and it distinguishes between surgical procedures that were accompanied by an
acute condition (e.g. acute cholecystitis) or a non-acute condition, which were defined to be unplanned
or planned, respectively.
In the 2016 version of the CMS rules, some criteria were defined by a mix of CCS and ICD-9
procedure and diagnosis codes. Given that some hospitalizations only had ICD-10 diagnoses and
procedure codes, we mapped the ICD-9 CMS codes to ICD-10 and then applied the rules. We used
the mapping tables provided by the National Bureau of Economic Research [11].
Fewer than 1 percent of hospitalizations did not have a primary diagnosis marked in the raw data.
Based on a review of a random sample of these hospitalizations, these encounters lacked clinical data
about events in the hospitalization but did have administrative data indicating that they were
unplanned admissions. After confirming with the respective partner sites, we treated these cases as
ineligible to be index discharges given the missing data, but still counted them as unplanned
admissions. They were excluded from the mortality and diagnosis prediction tasks.
C Model Variants
Weighted Recurrent neural network model (RNN)
In the RNN model, sparse features of each category (such as medications or procedures) were embedded
into the same d-dimensional embedding space, with d for each category chosen based on the number of
possible features for that category. The embeddings from different categories were concatenated, and
embeddings from the same category at the same time were averaged according to an automatically
learned weighting.
The sequence of embeddings was further reduced to a shorter sequence. Typically, the
shorter sequences were split into time-steps of 12 hours, where the embeddings for all features within
a category in the same time-step were combined using weighted averaging. The weighted averaging was
done by associating each feature with a non-negative weight that was trained jointly with the model.
These weights were also used for prediction attribution. The log of the average time-delta at each
time-step was also embedded into a small floating-point vector (also randomly initialized) and
concatenated to the input embedding at each time-step.
This reduced sequence of embeddings was then fed to an n-layer recurrent neural network
(RNN), specifically a Long Short-Term Memory network (LSTM). An RNN consists of a sequence
of directed nodes. Embeddings are fed to the RNN one at a time, and at each time-step each node
computes its activation as a nonlinear function of the input embedding. Each subsequent node
receives as input the previous node’s activation and the embedding for that time-step. The LSTM
extends the RNN by adding three gates (an input gate, an output gate, and a forget gate) to determine
what information to pass on to the next node relative to the previous node’s activation and the current
time-step’s embedding. Each node in the LSTM computes a hidden state vector and a cell state vector.
The LSTM is defined by the following set of equations, where the W and U terms correspond to weight
matrices, the b terms to biases, and the subscripts f, i, and o represent the forget, input, and output
gates. h_t represents the hidden output at time t, x_t represents the input at time t, and c_t
the cell state at time t. σ_g represents the sigmoid function and σ_c the hyperbolic tangent.

f_t = σ_g(W_f x_t + U_f h_{t−1} + b_f)
i_t = σ_g(W_i x_t + U_i h_{t−1} + b_i)
o_t = σ_g(W_o x_t + U_o h_{t−1} + b_o)
c_t = f_t ∘ c_{t−1} + i_t ∘ σ_c(W_c x_t + U_c h_{t−1} + b_c)
h_t = o_t ∘ σ_c(c_t)
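These equations can be sketched as a toy NumPy step function (not the TensorFlow implementation used in the study; the stacked-parameter layout and random weights are conveniences of the sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold stacked parameters for the forget,
    input, and output gates and the cell candidate, in that order."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0*n:1*n])        # forget gate
    i = sigmoid(z[1*n:2*n])        # input gate
    o = sigmoid(z[2*n:3*n])        # output gate
    g = np.tanh(z[3*n:4*n])        # cell candidate
    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 3, 5
W = rng.normal(scale=0.1, size=(4 * d_h, d_in))
U = rng.normal(scale=0.1, size=(4 * d_h, d_h))
b = np.zeros(4 * d_h)

h = np.zeros(d_h)
c = np.zeros(d_h)
for x in rng.normal(size=(10, d_in)):  # a sequence of 10 input embeddings
    h, c = lstm_step(x, h, c, W, U, b)
assert h.shape == (d_h,) and c.shape == (d_h,)
```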
The hidden state of the final time-step of the LSTM was fed into an output layer, and the model
was optimized to minimize the log-loss (either a logistic regression or softmax loss depending on the
task). We applied a number of regularization techniques to the model, namely embedding dropout,
feature dropout, LSTM input dropout, and variational RNN dropout. We also used a small amount of
L2 weight decay, which adds a penalty for large weights into the loss. We trained with a batch size
of 128 and clipped the norm of the gradients to 50. Finally, we optimized everything jointly with
Adagrad. We trained using the TensorFlow framework on the Tesla P100 GPU. The regularization
hyperparameters and learning rate were found via a Gaussian-process based hyperparameter search
on each dataset’s validation performance.
Feedforward Model with Time-Aware Attention
To the sequence of embeddings, E_i, i = 1, . . . , n, we added an additional prior embedding E_0 at the
beginning, with associated ∆_0 = 0. For every embedding E_i, i = 0, . . . , n, we created an
attribution logit α_i using the process described below. Those logits were converted to weights β_i
using a softmax,

β_i = e^{α_i} / Σ_{j=0}^{n} e^{α_j}    (2)
We then took the d-dimensional vector of the weighted sum, Σ_i β_i E_i, along with the scalars
log(n + 1) and log(∆ + 1), and entered them into a feedforward neural network whose attributes
(e.g. number and dimensions of the layers) were determined by hyperparameter tuning.
For the attribution logits, we used a bank of k functions A_1(∆), . . . , A_k(∆), where each A_j(∆) takes
one of the following forms (typically not all forms in the same model):
A(∆) = 1 (constant);
A(∆) = ∆;
A(∆) = log(∆ + 1day);
A(∆) = piece-wise linear function with predetermined inflection points (based on exponential
backoff) and learned slopes.
We defined a k-dimensional projection of the embedding by learning a d × k dimensional matrix P
and, for every i = 0, . . . , n, multiplying it with E_i to get the projections p_{1,i}, . . . , p_{k,i}. We then
defined the attribution logits to be

α_i = Σ_{j=1}^{k} p_{j,i} A_j(∆_i)    (3)
The embedding dimension, d, ranged from 16 to 512. The number of layers of the feedforward
network ranged from 0 to 3, with the width of the networks from 10 to 512.
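Putting the softmax weighting and the attribution logits together, a small NumPy sketch with made-up dimensions and a three-function bank (the constant, linear, and log forms from the list above; the projection matrix is random here rather than learned):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5, 8, 3

E = rng.normal(size=(n + 1, d))  # embeddings E_0..E_n (E_0 is the prior)
deltas = np.array([0.0, 72.0, 48.0, 24.0, 6.0, 1.0])  # hours before prediction

def A_bank(delta):
    # Three of the listed time-function forms.
    return np.array([1.0, delta, np.log(delta + 1.0)])

P = rng.normal(scale=0.1, size=(d, k))  # stand-in for the learned projection

proj = E @ P                                     # p_{j,i}, shape (n+1, k)
A_vals = np.stack([A_bank(t) for t in deltas])   # shape (n+1, k)
alpha = np.einsum('ik,ik->i', proj, A_vals)      # attribution logits

beta = np.exp(alpha - alpha.max())               # numerically stable softmax
beta /= beta.sum()
context = beta @ E                               # weighted-sum vector
assert np.isclose(beta.sum(), 1.0) and context.shape == (d,)
```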
Boosted, embedded time-series model
For each feature tuple of the token name, value, and time-delta, we algorithmically created (described
below) a set of binary decision rules that partitioned examples into two classes.
There were ten types of decision rules.
The first was whether a variable, X, existed at any point in a patient’s timeline.
The second was whether a variable X existed more than C times in a patient’s timeline;
C was randomly picked from the range of integer values possible for each variable in the dataset.
The third introduced the time-sequence nature of the variable: was variable X greater or lower
than threshold V at any time t < T (i.e. x > V and t < T, or x < V and t < T). Again, V
and T were picked from the space of possible values in the dataset.
The fourth was a modification of the third rule, but rather than a simple binary cutoff, it
was a weighted sum of the number of times that rule (x > V or x < V) was satisfied, with the
weights determined by a Hawkes-process response with a time decay factor λ. A binary
rule was created by examining if this weighted sum was greater than an activation A, that is,
A_instance > A_template, where A_template is selected from a random user. Again, we use random
selection of a particular template instance to select the threshold. Then A is computed from the
instance by

A = Σ_i I{x_i > V} e^{−λ(T − t_i)}
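A small sketch of this exponentially decayed rule activation, with the decay form assumed as written above and toy heart-rate events:

```python
import math

def hawkes_activation(events, V, lam, T):
    """Activation A = sum_i 1{x_i > V} * exp(-lam * (T - t_i)) over
    (t_i, x_i) events occurring before time T: each rule satisfaction
    counts, discounted exponentially by how long ago it happened."""
    return sum(
        math.exp(-lam * (T - t))
        for t, x in events
        if t < T and x > V
    )

# Hypothetical (time, heart-rate) events for one patient.
events = [(1.0, 110.0), (2.0, 95.0), (3.0, 120.0)]
a = hawkes_activation(events, V=100.0, lam=0.5, T=4.0)
# Only the two readings above 100 contribute, weighted by recency.
assert a == math.exp(-1.5) + math.exp(-0.5)
```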
The fifth, sixth, and seventh rules were created by determining if the minimum, maximum, and
average of variable X was greater than V in time t < T.
The eighth and ninth types of rule captured changes in lab values over time (e.g. the decrease
in blood pressure over time). In particular, the eighth predicate checked the velocity, that is,
whether the change in a variable divided by a time window was greater or lower than a
threshold. The ninth predicate checked if the difference in values within time T was greater or
lower than a threshold V.
The tenth type of rule consists of conjunctions of previous predicates (e.g. does x > V hold,
and does the count of X exceed C). We call these decision-list predicates as, to preserve
interpretability, they encompass only the true branches of a decision tree. The conjunctions are
mined by picking the best predicate in a random selection of predicates and then, conditioned on
the best predicate, a second one that also maximizes the weighted information gain with
respect to the label.
The actual instances of each rule, including the selection of variables, value thresholds, and
time-thresholds, were generated by first picking a random patient, a random variable X, and a random
time t in the patient’s timeline. V is the corresponding value of X at time t, and C is the count of
times X occurs in the patient’s timeline.
Every binary rule, which we refer to as a binary predicate, was assigned a scalar weight, and the
weighted sum was passed through a softmax layer to create a prediction. To train, we first created a
bias binary predicate which was true for all examples and its weight was assigned as the log-odds
ratio of the positive label class across the dataset.
Next, we used rounds of boosting to select predicates. In each round, we picked 25,000 random
predicates from random patients in a batch of 500 patients. Importance-weighted information gain
with respect to the label was calculated for each and the top 50 predicates were picked. Additionally,
for each of those top 50 predicates, 50 more secondary predicates were selected using the same
information gain criteria, conditional on the primary predicate holding true. The best predicate and
second corresponding predicates were then joined together to create 50 more conjunction predicates
for a total of 100 predicates per round. Weights of these predicates were fitted using logistic regression
with L1 regularization. We then applied the model to all examples in the training dataset to create
prediction probabilities Q. Each example was then given an importance weight of |Label − Q|.
In the next round, we selected 25,000 new random predicates by sampling examples according to
the importance weight. The top 50 by information gain (and 50 more secondary ones) were added
to a new logistic model, which also included the previously determined predicates.
The weights of all predicates were re-calculated (i.e. not just the new predicates), which is known as
totally corrective boosting.
We used 100 rounds, so in total 100,000 predicates were selected from a pool of 5,000,000,
which were in turn randomly selected from a potential pool of
num_patients × num_features × num_discrete_values × num_time_steps
possible predicates. The L1 regularization was then applied, which could further cull the 100,000
selected predicates to a smaller set.
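The round structure can be sketched on a toy dataset; the gradient-descent fitter, the predicate columns, and the two-round loop below are simplified stand-ins (no information-gain mining or L1 penalty):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: rows are patients, columns are binary predicates already
# evaluated (hypothetical stand-ins for the mined rules).
X = rng.integers(0, 2, size=(200, 20)).astype(float)
y = (X[:, 0] + X[:, 3] > 1).astype(float)

def fit_logistic(X, y, lr=0.5, steps=200):
    """Minimal logistic regression by gradient descent (a stand-in for
    the paper's fitter)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Round 1: fit on a first set of predicates, then compute the
# importance weights |label - Q| used to bias sampling next round.
cols = [0, 1, 2]
w = fit_logistic(X[:, cols], y)
Q = 1.0 / (1.0 + np.exp(-(X[:, cols] @ w)))
importance = np.abs(y - Q)

# Round 2: add new predicates and refit *all* weights, not just the
# new ones (totally corrective boosting).
cols += [3, 4]
w = fit_logistic(X[:, cols], y)
assert w.shape == (5,) and importance.shape == (200,)
```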
The final binary predicates were then embedded into a 1024-dimensional vector space and then
fed to a feed-forward network of depth 2 with 1024 hidden units per layer and ELU non-linearity. For
regularization, Gaussian noise of mean 0 and standard deviation 0.8 was added to the input of the
feed-forward network. We also used multiplicative Bernoulli noise of p = 0.01 (also known as dropout)
at the input and output (just before applying the sigmoid function to the logits) of the feed-forward
layer. At test time, no Gaussian or Bernoulli noise was used. We optimized everything with
Adam. The union of predicates optimized for different tasks (e.g. readmission or different diagnosis
codes) was used together in the final model. These final binary predicates were mined from
different tasks (e.g. for the readmission task, many diagnosis-code models might contribute auxiliary
binary predicates that they mined as features for the feedforward network).
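A scaled-down sketch of this noisy feed-forward head (dimensions reduced from 1024, weights random rather than trained, and dropout applied only at the input here):

```python
import numpy as np

rng = np.random.default_rng(0)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

# Scaled-down stand-ins for the 1024-dim embedding and depth-2 network.
d_emb, d_h = 16, 32
W1 = rng.normal(scale=0.1, size=(d_h, d_emb))
W2 = rng.normal(scale=0.1, size=(d_h, d_h))
W3 = rng.normal(scale=0.1, size=(1, d_h))

def forward(x, train=True):
    if train:
        # Gaussian input noise (std 0.8), training time only.
        x = x + rng.normal(scale=0.8, size=x.shape)
        # Multiplicative Bernoulli (dropout) noise with drop prob 0.01.
        x = x * (rng.random(x.shape) > 0.01)
    h = elu(W2 @ elu(W1 @ x))      # depth-2 ELU network
    logit = (W3 @ h)[0]
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid output probability

p = forward(rng.normal(size=d_emb), train=False)
assert 0.0 < p < 1.0
```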
D Methods for All Techniques
Attribution Mechanisms
To explain predictions, we implemented attribution mechanisms. Inspired by recent results in natural
language processing [3], the feed-forward models implement an attention mechanism that identifies the
locations in a sequence of variables that may have played a significant role in affecting the prediction.
Notably, the same variable could be weighted differently depending on when it occurred relative to
other events in a given patient timeline. The RNN models implement a form of weighting that also
learns which variables are important for prediction relative to other variables. We use both of these
methods to perform attribution.
Illustrating the data that the models attended to is difficult because of the complexity of the
data, including thousands of time-steps with tens- to hundreds-of-thousands of tokens, representing a
large percentage of all the data that is viewable in a patient’s actual EHR record. Moreover, given
the correlation of the data (the heart-rate at time t is related to the rate at t + 1) and its redundancy
(a medication order for “norepinephrine” is redundant with the nurse’s documentation of the rate to
which it is titrated), the models could choose to attend to equivalent data arbitrarily. For visualization
purposes only, we re-trained feed-forward models on a single task, mortality, with models using only
a single data-type (e.g. notes, medications, observation data) to preclude the models from using
redundant data among different feature types. These models differ in predictive performance from
the full models reported in Table 2.
In Figure 4 of the main manuscript, we render the timeline, populating a circle for every time-step
where at least a single token exists for that patient. We have shown snippets of select time-steps
and highlight the tokens to which the model using that data-type chose to attend. For tokens
with significant attribution scores, we have “smeared” attribution to directly neighboring tokens for
visualization purposes. For patient privacy reasons, we have obscured information about the dates
and times of all tokens, although the relative time has been retained.
Automated Hyperparameter Tuning
There are many design choices in training neural networks that are beyond the scope of this manuscript
but are well described elsewhere [19]. The hyperparameters, which are settings that affect the
performance of all of the above neural networks, were tuned automatically using Google Vizier [18]
with a total of >201,000 GPU hours.
Ensembling
For a given prediction task, we could use a variety of algorithms to make a prediction. For example,
we could use a sequence model, feed-forward model, and a boosting model, and their predictions
would be different on the same example. Ensembling combines the multiple predictions to make a
final prediction; this is similar to tallying votes for an election result. We combined the predictions
(probabilities) from the three models of the ensemble by averaging.
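For example, with three illustrative (made-up) model probabilities:

```python
# Averaging the predicted probabilities of the three model families
# (the values here are illustrative, not results from the study).
p_rnn, p_ffn, p_boost = 0.82, 0.76, 0.79
p_ensemble = (p_rnn + p_ffn + p_boost) / 3
assert abs(p_ensemble - 0.79) < 1e-9
```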
E Baseline Models
To understand the performance of our models, we first created baseline models for each prediction
task using traditional modeling techniques. We used recent literature reviews to select commonly
used variables for each task [58, 51, 39]. These hand-engineered features are used only in the baseline
models; our actual models do not need such feature engineering.
We fitted each baseline model on the training set separately for each site and report results from
applying the model to the test set of each site.
Mortality Baseline Model - aEWS
Most existing models use a small set of lab measurements, vital signs, and mental status. Following
this approach, for the EHR datasets we created a model that used the most recent systolic blood
pressure, heart-rate, respiratory rate, and temperature in Fahrenheit (any temperature listed
below 90 degrees Fahrenheit was converted from Celsius to Fahrenheit). Because urine output and
mental status were not coded consistently between sites, we instead used the most recent white
blood cell count, hemoglobin [27], sodium [7], creatinine [8, 35], troponin [55, 31, 25], lactate, oxygen
saturation, oxygen source, glucose, calcium, potassium, chloride, blood urea nitrogen (BUN), carbon
dioxide, hematocrit, platelet count, magnesium, phosphorus, albumin, aspartate transaminase (AST),
alkaline phosphatase, total bilirubin, international normalized ratio, and absolute neutrophil
count (ANC). All values were log transformed and standardized to have a mean of zero and standard
deviation of 1 based on values for each site in the development set. We also added the hospital
service and age.
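The log-transform-and-standardize step can be sketched as follows, with hypothetical development-set creatinine values:

```python
import numpy as np

# Hypothetical creatinine values (mg/dL) from a development set.
dev_values = np.array([0.7, 0.9, 1.1, 1.4, 2.0, 3.5])
logs = np.log(dev_values)
mu, sigma = logs.mean(), logs.std()

def transform(value):
    """Log-transform, then standardize to mean 0 / std 1 using
    development-set statistics, as in the baseline models."""
    return (np.log(value) - mu) / sigma

# Applying the transform back to the development values recovers
# mean 0 and standard deviation 1.
z = transform(dev_values)
assert abs(z.mean()) < 1e-9 and abs(z.std() - 1.0) < 1e-9
```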
Readmission Baseline Model - modified HOSPITAL score
We created a modified HOSPITAL score [13] that included the most recent value of sodium and
hemoglobin, log transformed and standardized (to mean 0 and standard deviation of 1) based on
values for each site in the development set; binary indicators for hospital service; a binary indicator
for the occurrence of any CPT codes during the hospitalization; a binary indicator for the
hospitalization lasting at least 5 days; prior hospital admissions in the past year discretized to
0, 1, 2-5, and >5; and admission source.
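The prior-admissions discretization can be sketched as follows (the string bin labels are illustrative):

```python
def discretize_prior_admissions(count):
    """Map prior-year admission counts into the score's bins:
    0, 1, 2-5, and >5."""
    if count == 0:
        return "0"
    if count == 1:
        return "1"
    if count <= 5:
        return "2-5"
    return ">5"

assert discretize_prior_admissions(0) == "0"
assert discretize_prior_admissions(3) == "2-5"
assert discretize_prior_admissions(7) == ">5"
```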
Length of Stay Baseline Model - modified Liu
We created a baseline model similar to those created using electronic health record data for general
hospital populations [37]. In particular, we created a lasso logistic model with the following variables:
age, gender, HCC (Hierarchical Condition Categories) codes in the past year (counts for each one),
admission source, hospital service, and the lab predictors from the mortality baseline model.
We chose to create a baseline model similar to those created using electronic health record data
[37]. We created a lasso logistic model with the following variables: age, gender, prior HCC codes in
the timeline (counts for each one), the principal diagnosis coded as a CCS, hospital service, and the
most recent lab value of each possible lab.
Calibration curves. (a, b) Inpatient mortality predicted at 24 hours into hospitalization, for hospitals A and B. (c, d) Readmission predicted at discharge, for hospitals A and B. (e, f) Long length of stay predicted at 24 hours into hospitalization, for hospitals A and B.
References
“2016 Measure updates and specifications report: hospital-wide all-cause unplanned readmission
— version 5.0”. In: Yale–New Haven Health Services Corporation/Center for Outcomes Research
& Evaluation (May 2016).
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural Machine Translation by
Jointly Learning to Align and Translate”. In: (Sept. 2014). arXiv: 1409.0473 [cs.CL].
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural Machine Translation by
Jointly Learning to Align and Translate”. In: (Jan. 2014). arXiv: 1409.0473 [cs.CL].
D W Bates et al. “Big data in health care: using analytics to identify and manage high-risk
and high-cost patients”. In: Health Aff. 33.7 (2014), pp. 1123–1131.
David W Bates et al. “Ten commandments for effective clinical decision support: making the
practice of evidence-based medicine a reality”. en. In: J. Am. Med. Inform. Assoc. 10.6 (Nov.
2003), pp. 523–530.
Yoshua Bengio, Aaron Courville, and Pascal Vincent. “Representation Learning: A Review and
New Perspectives”. In: (June 2012). arXiv: 1206.5538 [cs.LG].
Scott W Biggins et al. “Serum sodium predicts mortality in patients listed for liver transplan-
tation”. en. In: Hepatology 41.1 (Jan. 2005), pp. 32–39.
Ion D Bucaloiu et al. “Increased risk of death and de novo chronic kidney disease following
reversible acute kidney injury”. en. In: Kidney Int. 81.5 (Mar. 2012), pp. 477–485.
Federico Cabitza, Raffaele Rasoini, and Gian Franco Gensini. “Unintended Consequences of
Machine Learning in Medicine”. In: JAMA (July 2017).
Vineet Chopra and Laurence F McMahon Jr. “Redesigning hospital alarms for patient safety:
alarmed and potentially dangerous”. en. In: JAMA 311.12 (Mar. 2014), pp. 1199–1200.
CMS’ ICD-9-CM to and from ICD-10-CM and ICD-10-PCS Crosswalk or General Equivalence
Mappings. icd-10-cm-and-pcs-crosswalk-general-equivalence-mapping.html. Accessed: 2017-7-21.
Jacques Donzé et al. “Potentially avoidable 30-day hospital readmissions in medical patients:
derivation and validation of a prediction model”. en. In: JAMA Intern. Med. 173.8 (Apr. 2013),
pp. 632–638.
Jacques Donzé et al. “Potentially avoidable 30-day hospital readmissions in medical patients:
derivation and validation of a prediction model”. en. In: JAMA Intern. Med. 173.8 (Apr. 2013),
Barbara J Drew et al. “Insights into the problem of alarm fatigue with physiologic monitor
devices: a comprehensive observational study of consecutive intensive care unit patients”. en.
In: PLoS One 9.10 (Oct. 2014), e110274.
Gabriel J Escobar et al. “Nonelective Rehospitalizations and Postdischarge Mortality: Predictive
Models Suitable for Use in Real Time”. en. In: Med. Care 53.11 (Nov. 2015), pp. 916–923.
Andrea Frome et al. “DeViSE: A Deep Visual-Semantic Embedding Model”. In: Advances in
Neural Information Processing Systems 26. Ed. by C J C Burges et al. Curran Associates, Inc.,
2013, pp. 2121–2129.
Benjamin A Goldstein et al. “Opportunities and challenges in developing risk prediction models
with electronic health records data: a systematic review”. en. In: J. Am. Med. Inform. Assoc.
24.1 (Jan. 2017), pp. 198–208.
Daniel Golovin et al. “Google Vizier: A Service for Black-Box Optimization”. In: Proceedings
of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
ACM, 2017.
[19] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Kevin Grumbach, Catherine R Lucey, and S Claiborne Johnston. “Transforming From Centers
of Learning to Learning Health Systems: The Challenge for Academic Health Centers”. In:
JAMA 311.11 (Mar. 2014), pp. 1109–1110.
Varun Gulshan et al. “Development and Validation of a Deep Learning Algorithm for Detection
of Diabetic Retinopathy in Retinal Fundus Photographs”. In: JAMA 316.22 (Dec. 2016),
pp. 2402–2410.
John D Halamka and Micky Tripathi. “The HITECH Era in Retrospect”. en. In: N. Engl. J.
Med. 377.10 (Sept. 2017), pp. 907–909.
[23] Health Level 7. Accessed: 2017-8-3. Apr. 2017.
Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In: Neural Comput.
9.8 (Nov. 1997), pp. 1735–1780.
P James et al. “Relation between troponin T concentration and mortality in patients presenting
with an acute stroke: observational study”. en. In: BMJ 320.7248 (Mar. 2000), pp. 1502–1504.
J Larry Jameson and Dan L Longo. “Precision medicine–personalized, problematic, and
promising”. en. In: N. Engl. J. Med. 372.23 (June 2015), pp. 2229–2234.
Paul R Kalra et al. “Hemoglobin and Change in Hemoglobin Status Predict Mortality, Cardio-
vascular Events, and Bleeding in Stable Coronary Artery Disease”. en. In: Am. J. Med. (19 1
Devan Kansagara et al. “Risk prediction models for hospital readmission: a systematic review”.
en. In: JAMA 306.15 (Oct. 2011), pp. 1688–1698.
Kirsi-Maija Kaukonen et al. “Systemic inflammatory response syndrome criteria in defining
severe sepsis”. en. In: N. Engl. J. Med. 372.17 (Apr. 2015), pp. 1629–1638.
John Kellett and Arnold Kim. “Validation of an abbreviated Vitalpac Early Warning
Score (ViEWS) in 75,419 consecutive admissions to a Canadian regional hospital”. en. In:
Resuscitation 83.3 (Mar. 2012), pp. 297–302.
Lauren J Kim et al. “Cardiac troponin I predicts short-term mortality in vascular surgery
patients”. en. In: Circulation 106.18 (Oct. 2002), pp. 2366–2371.
Andrew A Kramer and Jack E Zimmerman. “Assessing the calibration of mortality benchmarks
in critical care: The Hosmer-Lemeshow test revisited”. en. In: Crit. Care Med. 35.9 (Sept. 2007),
pp. 2052–2056.
Harlan M Krumholz. “Big data and new knowledge in medicine: the thinking, training, and
tools needed for a learning health system”. en. In: Health Aff. 33.7 (July 2014), pp. 1163–1170.
Harlan M Krumholz, Sharon F Terry, and Joanne Waldstreicher. “Data Acquisition, Curation,
and Use for a Continuously Learning Health System”. en. In: JAMA 316.16 (Oct. 2016),
pp. 1669–1670.
Jean-Philippe Lafrance and Donald R Miller. “Acute kidney injury associates with increased
long-term mortality”. en. In: J. Am. Soc. Nephrol. 21.2 (Feb. 2010), pp. 345–352.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. en. In: Nature 521.7553
(May 2015), pp. 436–444.
Vincent Liu et al. “Length of stay predictions: improvements through the use of automated
laboratory and comorbidity variables”. en. In: Med. Care 48.8 (Aug. 2010), pp. 739–744.
Steve Lohr. “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”. In: The New
York Times (Aug. 2014).
Mingshan Lu et al. “Systematic review of risk adjustment models of hospital length of stay
(LOS)”. en. In: Med. Care 53.4 (Apr. 2015), pp. 355–365.
Joshua C Mandel et al. “SMART on FHIR: a standards-based, interoperable apps platform for
electronic health records”. en. In: J. Am. Med. Inform. Assoc. 23.5 (Sept. 2016), pp. 899–908.
Ziad Obermeyer and Ezekiel J Emanuel. “Predicting the Future — Big Data, Machine Learning,
and Clinical Medicine”. In: N. Engl. J. Med. 375.13 (2016), pp. 1216–1219.
Ravi B Parikh, J Sanford Schwartz, and Amol S Navathe. “Beyond Genes and Molecules - A
Precision Delivery Initiative for Precision Medicine”. en. In: N. Engl. J. Med. 376.17 (Apr.
2017), pp. 1609–1612.
Fabian Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: J. Mach. Learn. Res.
12.Oct (2011), pp. 2825–2830.
Adler Perotte et al. “Diagnosis code assignment: models and evaluation metrics”. en. In: J. Am.
Med. Inform. Assoc. 21.2 (Mar. 2014), pp. 231–237.
Gil Press. Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task,
Survey Says. most-time-consuming-least-enjoyable-data-science-task-survey-says/. Accessed:
2017-10-22. Mar. 2016.
Lior Rokach. “Ensemble-based Classifiers”. In: Artif. Intell. Rev. 33.1-2 (Feb. 2010), pp. 1–39.
Santiago Romero-Brufau et al. “Why the C-statistic is not informative to evaluate early warning
scores and what metrics to use”. In: Crit. Care 19.1 (2015), p. 285.
SciKit Learn Documentation on Area Under the Curve scores. stable/modules/generated/sklearn.metrics.roc_auc_score.html. Accessed: 2017-8-3.
SciKit Learn Documentation on F1 Score. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html. Accessed: 2017-8-3.
Gary B Smith et al. “The ability of the National Early Warning Score (NEWS) to discriminate
patients at risk of early cardiac arrest, unanticipated intensive care unit admission, and death”.
en. In: Resuscitation 84.4 (Apr. 2013), pp. 465–470.
M E Beth Smith et al. “Early Warning System Scores for Clinical Deterioration in Hospitalized
Patients: A Systematic Review”. In: Ann. Am. Thorac. Soc. 11.9 (2014), pp. 1454–1465.
Ying P Tabak et al. “Using electronic health record data to develop inpatient mortality
predictive model: Acute Laboratory Risk of Mortality Score (ALaRMS)”. en. In: J. Am. Med.
Inform. Assoc. 21.3 (May 2014), pp. 455–463.
The Digital Universe: Driving Data Growth in Healthcare.
report/digital-universe-healthcare-vertical-report-ar.pdf. Accessed: 2017-2-23.
Carl van Walraven et al. “Derivation and validation of an index to predict early death or
unplanned readmission after discharge from hospital to the community”. en. In: CMAJ 182.6
(Apr. 2010), pp. 551–557.
Daniel A Waxman et al. “A model for troponin I as a quantitative predictor of in-hospital
mortality”. In: J. Am. Coll. Cardiol. 48.9 (2006), pp. 1755–1762.
Yonghui Wu et al. “Google’s Neural Machine Translation System: Bridging the Gap between
Human and Machine Translation”. In: (Sept. 2016). arXiv: 1609.08144 [cs.CL].
Hayato Yamana et al. “Procedure-based severity index for inpatients: development and validation
using administrative database”. en. In: BMC Health Serv. Res. 15 (July 2015), p. 261.
Huaqiong Zhou et al. “Utility of models to predict 28-day or 30-day unplanned hospital
readmissions: an updated systematic review”. en. In: BMJ Open 6.6 (June 2016), e011060.
... The hope for EHRs (discussed more thoroughly in Section 2.4.3) is that, through the use of secondary analysis of medical records, event prediction and decision support systems can be constructed to aid clinicians and improve the patient experience [139]. Deep learning models for secondary analysis of EHRs are the focus of increasing excitement in academic circles [56,139,233,235], and are used throughout Chapter 5. ...
... Rajkomar et al. [233] presented the first attempt to use all available patient data by mapping the entire EHR to a highly curated set of predictor variables, structured in line with the categories of available data. Although this method achieved strong performance, risk assessment was EHR-format-specific, static, and reliant on an ensemble of diverse model structures. ...
... Section 5.5 also addresses how recent state-of-the-art DNNs, trained using these clinical objectives, have employed recurrent models for sequence analysis that rely upon fixed prediction scheduling, carrying out extensive model analysis while overlooking the underlying choice of time-window length [43,233,292]. This early-stage modelling decision, necessitated by the structure of traditional recurrent neural networks (RNNs [76]), loses patient information and time-series granularity, and ignores the underlying timescales present in EHR data, which have been shown to boost model performance [191,37]. ...
The current generation of deep neural network-based models demonstrate tremendous capacity to learn distributions at scale. Given this success, deep learning and deep generative modelling have progressively been applied across a broader range of increasingly demanding applications, as well as in safety-critical domains such as healthcare. However, existing models are reliant upon restrictive theoretical assumptions, deriving from longstanding distributions and divergences at their core, which inhibit their continued advance. By leveraging wider distribution and divergence families, transferring broader parametric assumptions to deep generative models increases the scope of the functions they can approximate. In particular, Kullback-Leibler divergence and the Gaussian distribution are assumed at the heart of variational autoencoders and score-based models, and are central to their limitations. This thesis argues that both assumptions can be viewed through wider lenses—the skew-geometric Jensen-Shannon divergence family and the generalised normal distribution family respectively. Several contributions are made to both the theory of deep learning, specifically deep generative modelling, and its application to electronic health records (EHRs). Firstly, a new type of variational autoencoder is introduced, capitalising on the flexibility of the skew-geometric Jensen-Shannon divergence, to overcome the prior theoretical shortcomings and lack of interpretability of latent space constraints. JSGα-VAEs lead to better reconstruction and generation when compared to baseline VAEs and utilise a single hyperparameter which can be easily interpreted in latent space. Secondly, heavy-tailed denoising score matching (HTDSM) is proposed, motivated by superior concentration of measure for the noising distribution in high-dimensional space. HTDSM offers improved score estimation, controllable sampling convergence, and more class-balanced unconditional generative performance. 
Finally, several results are presented which indicate that generalising EHR pipelines and models leads to increased flexibility and greater utility in the clinic. Specifically, restrictions on data collation, curation, and chronology are removed while maintaining competitive performance on clinical objectives such as mortality prediction.
... Unfortunately, conventional models do not accurately predict readmissions; model c-statistics are rarely seen above 0.8 [8,9]. Additionally, most existing prediction models rely heavily on manual feature engineering [5,10–24], which is based on domain knowledge and experience. Those features are often dataset-dependent, thus limiting generalizability between datasets or jurisdictions. ...
... Recently, machine learning methods that automatically identify which parts of a given set of data are essential for prediction have gained popularity, and there exists such work applied in the domain of readmission prediction as well. Notably, Rajkomar et al. used electronic health records and deep learning models to predict 30-day readmissions and other outcomes [19]. However, their c-statistic for 30-day readmissions did not exceed 0.75 despite their c-statistics for other outcomes such as mortality being above 0.8. ...
... However, their c-statistic for 30-day readmissions did not exceed 0.75, despite their c-statistics for other outcomes such as mortality being above 0.8. There have been several similar studies, but their c-statistics are also moderate, below 0.8 [19,25,26]. Choi et al. explored word embeddings to represent medical concepts [27–29], often paired with recurrent neural networks for the prediction of clinical events. ...
Background Hospital readmissions are one of the costliest challenges facing healthcare systems, but conventional models fail to predict readmissions well. Many existing models use exclusively manually-engineered features, which are labor intensive and dataset-specific. Our objective was to develop and evaluate models to predict hospital readmissions using derived features that are automatically generated from longitudinal data using machine learning techniques. Methods We studied patients discharged from acute care facilities in 2015 and 2016 in Alberta, Canada, excluding those who were hospitalized to give birth or for a psychiatric condition. We used population-level linked administrative hospital data from 2011 to 2017 to train prediction models using both manually derived features and features generated automatically from observational data. The target value of interest was 30-day all-cause hospital readmissions, with the success of prediction measured using the area under the curve (AUC) statistic. Results Data from 428,669 patients (62% female, 38% male, 27% 65 years or older) were used for training and evaluating models: 24,974 (5.83%) were readmitted within 30 days of discharge for any reason. Patients were more likely to be readmitted if they utilized hospital care more, had more physician office visits, had more prescriptions, had a chronic condition, or were 65 years old or older. The LACE readmission prediction model had an AUC of 0.66 ± 0.0064 while the machine learning model’s test set AUC was 0.83 ± 0.0045, based on learning a gradient boosting machine on a combination of machine-learned and manually-derived features. Conclusion Applying a machine learning model to the computer-generated and manual features improved prediction accuracy over the LACE model and a model that used only manually-derived features. Our model can be used to identify high-risk patients, for whom targeted interventions may potentially prevent readmissions.
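As a rough illustration of the comparison described above, the following sketch trains a gradient boosting machine and a logistic-regression baseline on synthetic data and scores both with AUROC using scikit-learn (cited in the reference list above). All data and parameters here are hypothetical stand-ins for the study's derived administrative features; the class weights merely mimic the reported ~5.8% readmission rate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for derived readmission features; the real work
# used linked administrative hospital data from Alberta.
X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.94, 0.06], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# AUC (c-statistic) is computed from predicted probabilities, not labels.
auc_base = roc_auc_score(y_te, baseline.predict_proba(X_te)[:, 1])
auc_gbm = roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1])
print(f"baseline AUC={auc_base:.3f}  GBM AUC={auc_gbm:.3f}")
```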
Deep neural networks (DNNs) have transformed the field of computer vision and currently constitute some of the best models for representations learned via hierarchical processing in the human brain. In medical imaging, these models have shown human-level performance and even higher in the early diagnosis of a wide range of diseases. However, the goal is often not only to accurately predict group membership or diagnose but also to provide explanations that support the model decision in a context that a human can readily interpret. The limited transparency has hindered the adoption of DNN algorithms across many domains. Numerous explainable artificial intelligence (XAI) techniques have been developed to peer inside the “black box” and make sense of DNN models, taking somewhat divergent approaches. Here, we suggest that these methods may be considered in light of the interpretation goal, including functional or mechanistic interpretations, developing archetypal class instances, or assessing the relevance of certain features or mappings on a trained model in a post-hoc capacity. We then focus on reviewing recent applications of post-hoc relevance techniques as applied to neuroimaging data. Moreover, this article suggests a method for comparing the reliability of XAI methods, especially in deep neural networks, along with their advantages and pitfalls.
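One family of post-hoc relevance techniques such a review covers can be illustrated with permutation importance: shuffle one feature and measure how much model performance drops. This is a minimal synthetic sketch, not any specific method from the review.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 1000
informative = rng.normal(size=n)          # feature the label depends on
noise = rng.normal(size=n)                # irrelevant feature
X = np.column_stack([informative, noise])
y = (informative + 0.1 * rng.normal(size=n) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
base = accuracy_score(y, model.predict(X))

# Relevance of feature j = drop in accuracy after permuting column j.
drops = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    drops.append(base - accuracy_score(y, model.predict(Xp)))

print(drops)  # the informative feature should show the larger drop
```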
The ever-increasing availability of Electronic Health Records (EHRs) is the key enabling factor of precision medicine, which aims to provide therapies and diagnoses based not only on medical literature, but also on clinical experience and individual patient information (e.g. genomics, lifestyle, health history). The unstructured nature of EHRs has posed several challenges to their effective analysis, and heterogeneous graphs are the most suitable solution for handling the heterogeneity of information contained in EHRs. However, while EHRs are an extremely valuable data source, information from current medical literature has yet to be considered in clinical decision support systems. In this work, we build a heterogeneous graph from Italian EHRs provided by the Hospital of Naples Federico II, and we define a methodological workflow that allows us to predict the presence of a link between patients and diagnosed diseases. We empirically demonstrate that linking concepts to biomedical ontologies (e.g. UMLS, DBpedia) — which allow us to extract entities and relationships from medical literature — is significantly beneficial to our link-prediction workflow in terms of Area Under the ROC Curve (AUC) and Mean Reciprocal Rank (MRR).
In this work, a modified Unified Theory of Acceptance and Use of Technology (UTAUT) model is developed for the acceptance of Electronic Health Records (EHRs) by doctors in the United Arab Emirates (UAE). In this model, the factors affecting doctors' level of acceptance of EHR usage are identified and analysed using modified basic constructs of the UTAUT model. The collected pilot data are grouped according to the doctors' age, gender, experience, qualification, and assigned priorities. The weighted probabilities (WP) of the modified constructs are determined by applying a multi-class support vector machine (M-SVM) over each category of candidates. Based on the determined WP values, the EHR acceptance ratio (EAR) is estimated using linear equation (LE) and linear regression (LR) models. The statistical results show that the Expected Performance construct has the highest impact on EAR in both the LE and LR models.
Abstract: Finding causality in medicine is the greatest interest of scientific research, in order to then generate interventions that treat or cure disease. Most classical statistical models allow only inference of association, and few designs manage to demonstrate cause and effect with adequate methodology and solid evidence. Evidence-based medicine supports its findings with models that, starting from a hypothesis, go out to gather data to confirm or refute it. This also applies to building predictive models that are reliable and have some impact on clinical practice. The large volume of data being stored in electronic health records and the growth in computational power give machine learning techniques a leading role in developing new predictive analyses and recognizing previously unknown patterns. Along with shifting the focus from data to information, these computational models are increasingly being incorporated into daily clinical practice, with greater precision and speed in decision-making. This article aims to provide some theoretical foundations and evidence of how these modern machine learning techniques have achieved better results and are being used more and more widely.
Big data and (deep) machine learning have become ambitious tools in digital medicine, but these tools focus mainly on association. Intervention in medicine is about causal effects. The average treatment effect has long been studied as a measure of causal effect, assuming that all populations have the same effect size. However, no “one-size-fits-all” treatment seems to work in some complex diseases. Treatment effects may vary by patient. Estimating heterogeneous treatment effects (HTE) may have a high impact on developing personalized treatment. Many advanced machine learning models for estimating HTE have emerged in recent years, but there has been limited translational research into the real-world healthcare domain. To fill this gap, we reviewed and compared eleven recent HTE estimation methodologies, including meta-learners, representation learning models, and tree-based models. We performed a comprehensive benchmark experiment based on nationwide healthcare claim data with application to Alzheimer’s disease drug repurposing. We also highlight challenges and opportunities in HTE estimation analysis in the healthcare domain, to close the gap between innovative HTE models and their deployment to real-world healthcare problems.
Background Reducing hospital readmissions is a federal policy priority, and predictive models of hospital readmissions have proliferated in recent years; however, most such models tend to focus on the 30-day readmission time horizon and do not consider readmission over shorter (or longer) windows. Objectives To evaluate the performance of a predictive model of hospital readmissions over three different readmission timeframes in a commercially insured population. Design Retrospective multivariate logistic regression with an 80/20 train/test split. Participants A total of 2,213,832 commercially insured inpatient admissions from 2016 to 2017, comprising 782,768 unique patients from the Health Care Cost Institute. Main Measures Outcomes are readmission within 14 days, 15–30 days, and 31–60 days from discharge. Predictor variables span six domains: index admission, condition history, demographics, utilization history, pharmacy, and environmental controls. Key Results Our model generates C-statistics for holdout samples ranging from 0.618 to 0.915. The model’s discriminative power declines with readmission time horizon: discrimination for readmission within 14 days following discharge is higher than for readmissions 15–30 days following discharge, which in turn is higher than for readmissions 31–60 days following discharge. Additionally, the model’s predictive power increases nonlinearly with the inclusion of successive risk-factor domains: patient-level measures of utilization and condition history add substantially to the discriminative power of the model, while demographic information, pharmacy utilization, and environmental risk factors add relatively little. Conclusion It is more difficult to predict distant readmissions than proximal readmissions, and the more information the model uses, the better the predictions. Inclusion of utilization-based risk factors adds substantially to the discriminative ability of the model, much more than any other included risk-factor domain. Our best-performing models perform well relative to other published readmission prediction models. It is possible that these predictions could have operational utility in targeting readmission-prevention interventions among high-risk individuals.
While deep semi-supervised learning has gained much attention in computer vision, limited research exists on its applicability to the time-series domain. In this work, we investigate the transferability of state-of-the-art deep semi-supervised models from image to time-series classification. We discuss the necessary model adaptations, in particular an appropriate backbone architecture and the use of tailored data augmentation strategies. Based on these adaptations, we explore the potential of deep semi-supervised learning in the context of time-series classification by evaluating our methods on large public time-series classification problems with varying amounts of labeled samples. We perform extensive comparisons under a decidedly realistic and appropriate evaluation scheme, with a unified reimplementation of all algorithms considered, which the field has so far lacked. Further, in a series of experiments we shed light on the effect of different data augmentation strategies and backbone architectures in this context. We find that these transferred semi-supervised models show substantial performance gains over strong supervised, semi-supervised, and self-supervised alternatives, especially in scenarios with very few labeled samples.
Background Peripheral tears of the posterior horn of the medial meniscus, known as “ramp lesions,” are commonly found in anterior cruciate ligament (ACL)–deficient knees but are frequently missed on routine evaluation. Purpose To predict the presence of ramp lesions in ACL-deficient knees using machine learning methods with associated risk factors. Study Design Cohort study (diagnosis); Level of evidence, 2. Methods This study included 362 patients who underwent ACL reconstruction between June 2010 and March 2019. The exclusion criteria were combined fractures and multiple ligament injuries, except for medial collateral ligament injuries. Patients were grouped according to the presence of ramp lesions on arthroscopic surgery. Binary logistic regression was used to analyze risk factors including age, sex, body mass index, time from injury to surgery (≥3 or <3 months), mechanism of injury (contact or noncontact), side-to-side laxity, pivot-shift grade, medial and lateral tibial/meniscal slope, location of bone contusion, mechanical axis angle, and lateral femoral condyle (LFC) ratio. The receiver operating characteristic (ROC) curve and area under the curve (AUC) were also evaluated. Results Ramp lesions were identified in 112 patients (30.9%). The risk for ramp lesions increased with steeper medial tibial and meniscal slopes, higher knee laxity, and an increased LFC ratio. Comparing the final performance of all models, the random forest model yielded the best performance (AUC: 0.944), although there were no significant differences among the models (P > .05). The cut-off values for the presence of ramp lesions on ROC analysis were as follows: medial tibial slope >5.5° (P < .001), medial meniscal slope >5.0° (P < .001), and LFC ratio >71.3% (P = .033).
Conclusion Steep medial tibial and meniscal slopes, an increased LFC ratio, and higher knee rotatory laxity were observed risk factors for ramp lesions in patients with an ACL injury. The prediction model of this study could be used as a supplementary diagnostic tool for ramp lesions in ACL-injured knees. In general, care should be taken in patients with ramp lesions and its risk factors during ACL reconstruction.
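Cut-off values like “medial tibial slope >5.5°” are typically read off a ROC curve, for example by maximizing Youden's J statistic. The study's exact thresholding procedure is not stated here, so the following is a hedged sketch on synthetic slope measurements:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
# Synthetic continuous predictor (e.g., a slope measurement in degrees):
# knees with a ramp lesion tend to have steeper slopes than those without.
slope = np.concatenate([rng.normal(4.0, 1.5, 200),   # no ramp lesion
                        rng.normal(6.5, 1.5, 100)])  # ramp lesion
label = np.concatenate([np.zeros(200), np.ones(100)])

auc = roc_auc_score(label, slope)
fpr, tpr, thresholds = roc_curve(label, slope)
# Youden's J = sensitivity + specificity - 1 = tpr - fpr; the cut-off
# is the threshold where J is maximal.
cutoff = thresholds[np.argmax(tpr - fpr)]
print(f"AUC={auc:.3f}, cut-off={cutoff:.2f} degrees")
```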
Any sufficiently complex system acts as a black box when it becomes easier to experiment with than to understand. Hence, black-box optimization has become increasingly important as systems have become more complex. In this paper we describe Google Vizier, a Google-internal service for performing black-box optimization that has become the de facto parameter tuning engine at Google. Google Vizier is used to optimize many of our machine learning models and other systems, and also provides core capabilities to Google's Cloud Machine Learning HyperTune subsystem. We discuss our requirements, infrastructure design, underlying algorithms, and advanced features such as transfer learning and automated early stopping that the service provides.
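At its simplest, black-box optimization of the kind such a service performs treats the objective as an opaque function of its parameters and searches the parameter space within a trial budget. The following is a toy random-search sketch, not Vizier's actual algorithms (which include transfer learning and automated early stopping); all names and the objective are illustrative.

```python
import random

def objective(params):
    # Opaque "black box": e.g., validation loss of a model trained with
    # these hyperparameters. Here, a toy quadratic with optimum at
    # lr=0.1, units=32.
    return (params["lr"] - 0.1) ** 2 + ((params["units"] - 32) / 64) ** 2

space = {"lr": (1e-4, 1.0), "units": (8, 128)}
random.seed(0)

best_params, best_val = None, float("inf")
for _ in range(200):                      # trial budget
    trial = {"lr": random.uniform(*space["lr"]),
             "units": random.randint(*space["units"])}
    val = objective(trial)                # one trial = one evaluation
    if val < best_val:
        best_params, best_val = trial, val

print(best_params, best_val)
```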
Real-time prediction of clinical interventions remains a challenge within intensive care units (ICUs). This task is complicated by data sources that are noisy, sparse, heterogeneous and outcomes that are imbalanced. In this paper, we integrate data from all available ICU sources (vitals, labs, notes, demographics) and focus on learning rich representations of this data to predict onset and weaning of multiple invasive interventions. In particular, we compare both long short-term memory networks (LSTM) and convolutional neural networks (CNN) for prediction of five intervention tasks: invasive ventilation, non-invasive ventilation, vasopressors, colloid boluses, and crystalloid boluses. Our predictions are done in a forward-facing manner to enable "real-time" performance, and predictions are made with a six hour gap time to support clinically actionable planning. We achieve state-of-the-art results on our predictive tasks using deep architectures. We explore the use of feature occlusion to interpret LSTM models, and compare this to the interpretability gained from examining inputs that maximally activate CNN outputs. We show that our models are able to significantly outperform baselines in intervention prediction, and provide insight into model learning, which is crucial for the adoption of such models in practice.
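The “forward-facing” setup with a six-hour gap can be made concrete with a small windowing helper: features are built only from data up to time t, and the label is the intervention state at t + gap. Names, shapes, and signals below are illustrative assumptions, not the paper's actual feature pipeline.

```python
import numpy as np

def make_windows(signal, intervention, history=4, gap=6):
    """signal: (T,) hourly measurements; intervention: (T,) 0/1 flags."""
    X, y = [], []
    T = len(signal)
    for t in range(history, T - gap):
        X.append(signal[t - history:t])   # only the past `history` hours
        y.append(intervention[t + gap])   # outcome `gap` hours ahead
    return np.array(X), np.array(y)

rng = np.random.default_rng(0)
vitals = rng.normal(size=48)                   # 48 h of one vital sign
vent = (np.arange(48) >= 30).astype(int)       # ventilation starts hour 30
X, y = make_windows(vitals, vent)
print(X.shape, y.shape)  # (38, 4) (38,)
```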
Health care is one of the most exciting frontiers in data mining and machine learning. Successful adoption of electronic health records (EHRs) created an explosion in digital clinical data available for analysis, but progress in machine learning for healthcare research has been difficult to measure because of the absence of publicly available benchmark data sets. To address this problem, we propose four clinical prediction benchmarks using data derived from the publicly available Medical Information Mart for Intensive Care (MIMIC-III) database. These tasks cover a range of clinical problems including modeling risk of mortality, forecasting length of stay, detecting physiologic decline, and phenotype classification. We formulate a heterogeneous multitask problem where the goal is to jointly learn multiple clinically relevant prediction tasks based on the same time series data. To address this problem, we propose a novel recurrent neural network (RNN) architecture that leverages the correlations between the various tasks to learn a better predictive model. We validate the proposed neural architecture on this benchmark, and demonstrate that it outperforms strong baselines, including single task RNNs.
Viewing the trajectory of a patient as a dynamical system, a recurrent neural network (RNN) was developed to learn the course of patient encounters in the Pediatric Intensive Care Unit (PICU) of a major tertiary care center. Data extracted from the Electronic Medical Records (EMR) of about 12,000 patients admitted to the PICU over a period of more than 10 years were leveraged. The RNN model ingests a sequence of measurements, including physiologic observations, laboratory results, administered drugs, and interventions, and generates temporally dynamic predictions of in-ICU mortality at user-specified times. The RNN's ICU mortality predictions offer significant improvements over those from two clinically used scores and static machine learning algorithms.
The British Thoracic Society (BTS) guideline for the management of adults with community acquired pneumonia (CAP) published in 2009 was compared with the 2014 National Institute for Health and Care Excellence (NICE) Pneumonia Guideline. Of the 36 BTS recommendations that overlapped with NICE recommendations, no major differences were found in 31, including those covering key aspects of CAP management: timeliness of diagnosis and treatment, severity assessment and empirical antibiotic choice. Of the five BTS recommendations where major differences with NICE were identified, one related to antibiotic duration in low and moderate severity CAP, two to the timing of review of patients and two to legionella urinary antigen testing.
The HITECH Act has played an invaluable role in accelerating the adoption of electronic health records throughout the United States. But along the way, the effort to computerize health care lost the hearts and minds of clinicians.
Over the past decade, machine learning techniques have made substantial advances in many domains. In health care, global interest in the potential of machine learning has increased; for example, a deep learning algorithm has shown high accuracy in detecting diabetic retinopathy.¹ There have been suggestions that machine learning will drive changes in health care within a few years, specifically in medical disciplines that require more accurate prognostic models (eg, oncology) and those based on pattern recognition (eg, radiology and pathology).
The past decade has seen an explosion in the amount of digital information stored in electronic health records (EHR). While primarily designed for archiving patient clinical information and administrative healthcare tasks, many researchers have found secondary use of these records for various clinical informatics tasks. Over the same period, the machine learning community has seen widespread advances in deep learning techniques, which also have been successfully applied to the vast amount of EHR data. In this paper, we review these deep EHR systems, examining architectures, technical aspects, and clinical applications. We also identify shortcomings of current techniques and discuss avenues of future research for EHR-based deep learning.
The Precision Medicine Initiative’s advances may add complexity to delivering high-quality, cost-effective care in keeping with patients’ values. A complementary effort could investigate delivery-system interventions that are tailored to individual needs and wishes.