Machine Learning Model Interpretability for Precision Medicine

Gajendra J. Katuwal* and Robert Chen+
*Rochester Institute of Technology, Rochester, NY 14623
+Georgia Institute of Technology, Atlanta, GA 30332

Abstract
Interpretability of machine learning models is critical for data-driven precision medicine efforts. However, highly predictive models are generally complex and difficult to interpret. Here, using the locally-interpretable model-agnostic explanations (LIME) algorithm, we show that complex models such as random forest can be made interpretable. Using the MIMIC-II dataset, we predicted ICU mortality with 80% balanced accuracy and were also able to interpret the relative effect of the features on the prediction at the individual level.
I. INTRODUCTION
Precision medicine holds great promise for healthcare because it customizes medical care to an individual’s unique disease state [1]. The widespread adoption of electronic health records has resulted in a tsunami of data that can be leveraged with machine learning-based approaches to dissect clinical heterogeneity and aid physicians in targeted decision making.
In general, highly accurate machine learning models tend to be complex and hence difficult to interpret. In the trade-off between predictive and explanatory modeling [2], explanatory modeling is highly valued by healthcare practitioners because they want to understand the contribution of specific features to a model. Understanding the decision process of a predictive model is essential before its decisions can be used in a clinical setting, because those decisions can affect a patient’s life. A predictive model must either be interpretable or be transformed into an interpretable form for a user to understand its decision process. Model interpretability is therefore vital for the successful application of predictive models in healthcare, especially for data-driven precision medicine, since precision medicine requires understanding a patient’s unique disease state. Interpretable models would deliver actionable insights in line with precision medicine initiatives.
In this study, we present a case study of model interpretability for precision medicine on intensive care unit (ICU) data. ICUs can benefit from the rich information that can be extracted from more interpretable models. In particular, one large area where ICU physicians and staff can benefit is early prediction of mortality. This is a substantial problem: the average mortality rate at hospitals is between 8% and 19%, or around 500,000 deaths annually [3]. In this study, we demonstrate how a complex, highly predictive model trained for ICU mortality prediction can be approximated by a simple interpretable model for each patient. Through these approximated simple models, we show that the contributions of important features to the decision process of the complex predictive model can be understood individually for each patient.
II. METHODS
We extracted features from the Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II) dataset [4], which contains 8,315 patients who exhibited mortality and 23,974 patients who did not. For all patients, we extracted counts of medications, diagnoses, and lab tests.
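For concreteness, per-patient count features of this kind could be assembled along the following lines. This is only a sketch: the file names and column names (medevents.csv, subject_id, itemid, and so on) are hypothetical placeholders, not the exact MIMIC-II tables or extraction pipeline used in this study.

```python
# Sketch of per-patient count features. File and column names are illustrative
# placeholders, not the exact MIMIC-II schema or extraction pipeline used here.
import pandas as pd

def count_features(path, id_col="subject_id", item_col="itemid", prefix=""):
    """Count occurrences of each item (medication, diagnosis, or lab test) per patient."""
    events = pd.read_csv(path, usecols=[id_col, item_col])
    counts = events.groupby([id_col, item_col]).size().unstack(fill_value=0)
    counts.columns = [f"{prefix}{c}" for c in counts.columns]
    return counts

meds = count_features("medevents.csv", prefix="med_")
dx = count_features("diagnoses_icd9.csv", item_col="code", prefix="dx_")
labs = count_features("labevents.csv", prefix="lab_")

# One row per patient; patients absent from a table receive zero counts.
X_counts = meds.join(dx, how="outer").join(labs, how="outer").fillna(0)
```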
We used 75% of the data for training; the remaining 25% was used for testing. Feature selection and classification were performed with scikit-learn 0.17.1 [5]. The top predictive features were selected using the ANOVA F-value feature selection test under 10-fold cross-validation.
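A minimal sketch of the split and the ANOVA F-value selection with scikit-learn follows. The synthetic stand-in data, the candidate values of k, the final choice of k = 50, and the logistic-regression scorer inside the cross-validation loop are illustrative assumptions not stated in the paper.

```python
# Sketch of the 75/25 split and ANOVA F-value feature selection under 10-fold
# cross-validation. X and y are synthetic stand-ins for the count matrix and
# mortality labels; k values and the stand-in CV scorer are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(2000, 200)).astype(float)  # stand-in count features
y = rng.integers(0, 2, size=2000)                     # stand-in labels (1 = mortality)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Choose the number of top features k by 10-fold CV on the training set.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for k in (10, 20, 50, 100):
    pipe = Pipeline([("select", SelectKBest(f_classif, k=k)),
                     ("clf", LogisticRegression(max_iter=1000))])
    score = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc").mean()
    print(f"k={k}: mean AUC {score:.3f}")

# Keep the chosen top-k features for the models that follow.
selector = SelectKBest(f_classif, k=50).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
```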
Next, a random forest (RF) [6] model with 1,000 trees was trained to predict mortality status (where 0 indicates no mortality and 1 indicates mortality). Gini impurity was used as the splitting criterion while growing the decision trees. Grid tuning was performed to select the optimal number of predictors considered when splitting a node of a decision tree in the RF.
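The training and grid tuning described above could look roughly like the sketch below, continuing from the previous snippet. The max_features grid, the class_weight setting, and the built-in "balanced_accuracy" scorer (available only in scikit-learn releases newer than the 0.17.1 used in this study) are assumptions.

```python
# Continuing the sketch: a random forest with 1,000 trees and Gini impurity,
# with max_features (candidate predictors per split) tuned by grid search.
# Grid values, class_weight, and the balanced_accuracy scorer are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import balanced_accuracy_score

rf = RandomForestClassifier(n_estimators=1000, criterion="gini",
                            class_weight="balanced", random_state=0, n_jobs=-1)
grid = GridSearchCV(rf,
                    param_grid={"max_features": ["sqrt", "log2", 0.1, 0.3]},
                    scoring="balanced_accuracy", cv=10, n_jobs=-1)
grid.fit(X_train_sel, y_train)
rf_best = grid.best_estimator_

print("best max_features:", grid.best_params_["max_features"])
print("test balanced accuracy:",
      balanced_accuracy_score(y_test, rf_best.predict(X_test_sel)))
```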
Simple explanations of the contributions of important features to the decision process of the hard-to-interpret RF model were extracted using the locally-interpretable model-agnostic explanations (LIME) technique [7]. The classifier decision function (the probabilities predicted by the RF model for test subjects) was fed to a LIME model, and the LIME results were used to interpret the relative contribution of features for a particular patient. While RF is a complex model, LIME learns the “explanation” for an instance, i.e., a patient, by approximating the RF model with a sparse linear model local to the vicinity of that patient, thus providing a patient-specific explanation.
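As a sketch of this step, the lime package's tabular explainer can be wired to the RF's predicted probabilities roughly as follows, continuing from the previous snippets; the feature names, the chosen patient index, and the number of reported features are illustrative.

```python
# Continuing the sketch: patient-level explanation of the RF's predicted
# probabilities with the lime package (pip install lime). Feature names and
# the number of features shown are illustrative placeholders.
from lime.lime_tabular import LimeTabularExplainer

feature_names = [f"feature_{i}" for i in range(X_train_sel.shape[1])]
explainer = LimeTabularExplainer(X_train_sel,
                                 feature_names=feature_names,
                                 class_names=["no mortality", "mortality"],
                                 discretize_continuous=True)

i = 0  # index of an arbitrarily chosen test patient
exp = explainer.explain_instance(X_test_sel[i],
                                 rf_best.predict_proba,  # the RF decision function
                                 num_features=4)
print(exp.as_list())  # (feature, weight) pairs local to this patient
```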
A test subject was randomly selected to
demonstrate the individual-level model
interpretation derived from LIME. The non-linear
decision function of the RF model is approximated
by a sparse linear model in the neighborhood of the
test patient (red “X”; see Fig. 2). First, perturbed data points, or instances, are created around the test patient X. Then, a sparse linear model is fitted to the RF model’s predictions for these perturbed instances, where each perturbed instance is weighted inversely with its distance from the test patient X. Finally, using the sparse linear model unique to each patient, an explanation containing the important features’ contributions to the decision process of the RF model for that patient is extracted.
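To make this procedure concrete, the following bare-bones sketch re-implements the idea under simplifying assumptions (Gaussian perturbations scaled by the training data, and a weighted ridge fit standing in for LIME's sparse, feature-selected linear model); it is not the lime library's exact algorithm.

```python
# Bare-bones illustration of the local approximation described above. This is a
# simplification, not the lime library's exact sampling or feature-selection
# procedure: perturbations are Gaussian and a weighted ridge regression stands
# in for the sparse linear fit. Kernel width follows lime's default heuristic.
import numpy as np
from sklearn.linear_model import Ridge

def local_linear_explanation(model, x, X_ref, num_samples=5000, seed=0):
    rng = np.random.default_rng(seed)
    n_features = x.shape[0]
    kernel_width = np.sqrt(n_features) * 0.75
    # 1. Create perturbed instances around the test patient x, with noise scaled
    #    by each feature's spread in the reference (training) data.
    scale = X_ref.std(axis=0) + 1e-6
    Z = x + rng.normal(size=(num_samples, n_features)) * scale
    Z[0] = x                                   # keep the original instance
    # 2. Weight each perturbed instance by its proximity to x (exponential kernel).
    d = np.linalg.norm((Z - x) / scale, axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)
    # 3. Fit a weighted linear model to the complex model's predicted probabilities.
    p = model.predict_proba(Z)[:, 1]
    local_model = Ridge(alpha=1.0).fit(Z, p, sample_weight=w)
    return local_model.coef_                   # local feature contributions

coefs = local_linear_explanation(rf_best, X_test_sel[0], X_train_sel)
top4 = np.argsort(np.abs(coefs))[::-1][:4]     # the four locally most influential features
print(top4, coefs[top4])
```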
III. RESULTS
Our model yielded 80% balanced accuracy on the test data. For the test subject randomly selected to demonstrate individual-level model explanations, the top four most predictive features were temperature, total CO2, atrial fibrillation, and lactate level (Fig. 3). Via LIME, we identified that this particular patient, with a higher lactate level and more atrial fibrillation, was at higher risk (78%) of mortality, which is consistent with current medical understanding.
Figure 1: General outline of the study.

Figure 2: The non-linear decision function of the complex predictive model is represented by the orange/blue background. The red cross is the test patient being explained (call it X). Perturbed instances around X, weighted by their proximity to X, are fed into the model. A sparse linear model (red dashed line) is fitted to the model’s predictions on these perturbed instances. This linear model approximates the non-linear decision function of the predictive model locally, in the neighborhood of X.

IV. DISCUSSION
We accurately predicted the mortality of ICU
patients and were also able to uniquely identify the
contribution of the important features to the mortality
prediction for each patient. We achieved this by
combining the predictive power of RF and the
patient-level model interpretability of LIME, where
we linearly approximated the RF model in the
patient vicinity. The explanation generated from
the LIME model for our test patient is consistent
with current medical knowledge.
In this case study, we successfully demonstrated that simple explanations can be extracted from a complex predictive model trained to detect the occurrence of ICU mortality. Because the explanations generated from the model were consistent with medical understanding, this study shows that approximating complex models with simple interpretable models is one way of addressing the overarching problem of uninterpretable black-box models in healthcare. In addition, approximating the complex model separately for each patient provides a more faithful, patient-specific perspective on that patient. This patient-specific model-approximation technique can be used to gain knowledge of each patient’s unique disease state and hence can be very helpful for the success of data-driven precision medicine efforts. Moreover, it helps to answer the overarching “why?” question that arises when applying machine learning models in healthcare. The adoption of machine learning in healthcare has been slowed by uninterpretable complex models, because clinicians have not developed enough trust in these models. Using a model-approximation technique, simple and truthful explanations of the decision process of complex models can be generated.
Easily understood and faithful explanations about
the decision process of machine learning models
can help to gain the trust of clinicians and hence
accelerate the adoption of machine learning in
healthcare.
V. CONCLUSION
We constructed an interpretable predictive model for patient mortality using the locally-interpretable model-agnostic explanations (LIME) technique. We generated simple explanations from a complex model that were consistent with current medical understanding, which should motivate future work in
data-driven precision medicine.

Figure 3: Patient-specific model interpretation. A) Local model approximation in the vicinity of the patient: correlation of the features with mortality. Temperature, atrial fibrillation, and lactate level are positively correlated with mortality. B) Feature contributions to the prediction. Higher counts of atrial fibrillation and a higher lactate level contribute towards mortality for this particular patient. C) Value: the original value of each feature; Scaled: the scaled value. D) Class prediction probabilities. The random forest model predicts 78% mortality for this particular test patient.
REFERENCES
[1] F. S. Collins and H. Varmus, “A New
Initiative on Precision Medicine,” N. Engl.
J. Med., vol. 372, no. 9, pp. 793–795, Jan.
2015.
[2] G. Shmueli, “To explain or to predict?,”
Stat. Sci., vol. 25, no. 3, pp. 289–310, 2010.
[3] “ICU Outcomes (Mortality and Length of
Stay) Methods, Data Collection Tool and
Data, 2014.,” 2014. [Online]. Available:
http://healthpolicy.ucsf.edu/content/icu-
outcomes.
[4] M. Saeed, M. Villarroel, A. T. Reisner, G.
Clifford, L.-W. Lehman, G. Moody, T.
Heldt, T. H. Kyaw, B. Moody, and R. G.
Mark, “Multiparameter Intelligent
Monitoring in Intensive Care II (MIMIC-
II): A public-access intensive care unit
database,” Crit Care Med, vol. 39, no. 5,
pp. 952–960, 2011.
[5] F. Pedregosa, G. Varoquaux, A. Gramfort, et al., “Scikit-learn: Machine Learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.
[6] L. Breiman, “Random Forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[7] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier,” in Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2016.
    Statistical modeling is a powerful tool for developing and testing theories by way of causal explanation, prediction, and description. In many disciplines there is near-exclusive use of statistical modeling for causal explanation and the assumption that models with high explanatory power are inherently of high predictive power. Conflation between explanation and prediction is common, yet the distinction must be understood for progressing scientific knowledge. While this distinction has been recognized in the philosophy of science, the statistical literature lacks a thorough discussion of the many differences that arise in the process of modeling for an explanatory versus a predictive goal. The purpose of this article is to clarify the distinction between explanatory and predictive modeling, to discuss its sources, and to reveal the practical implications of the distinction to each step in the modeling process.