
A Meta-heuristic LASSO Model for Diabetic Readmission Prediction


Proceedings of the 2016 Industrial and Systems Engineering Research Conference
H. Yang, Z. Kong, and MD Sarder, eds.
Salih Tutun
Department of Systems Science and Industrial Engineering
Turkish Military Academy, Ankara, Turkey, and Binghamton University, Binghamton, NY
Sina Khanmohammadi, Lu He and Chun-An Chou
Department of Systems Science and Industrial Engineering
Binghamton University, Binghamton, NY
Abstract
Hospital readmission prediction continues to be a highly encouraged area of investigation, mainly because of the readmissions reduction program of the Centers for Medicare and Medicaid Services (CMS). The overall goal is to reduce the number of early hospital readmissions by identifying the key risk factors that cause them. This is especially important in the Intensive Care Unit (ICU), where patient readmission increases the likelihood of mortality due to the worsening of the patient's condition. Traditional approaches use simple logistic regression or other linear classification methods to identify the key features that provide high prediction accuracy. However, these methods are not sufficient since they cannot capture the complex patterns between different features. In this paper, we propose a hybrid Evolutionary Simulating Annealing LASSO Logistic Regression (ESALOR) model to accurately predict hospital readmission and identify the important risk factors. The proposed model combines the evolutionary simulated annealing method with a sparse LASSO logistic regression model. The ESALOR model was tested on a publicly available diabetes readmission dataset, and the results show that the proposed model provides better results than conventional classification methods including Support Vector Machines (SVM), Decision Tree, Naive Bayes, and Logistic Regression.
Keywords: Hospital Readmission, Diabetes, Classification, Metaheuristic Optimization, Regularization
1. Introduction
1.1 Background and Motivations
Nowadays, hospital readmission is one of the leading problems in health-care, mainly because of financial and clinical
repercussions. Hospital readmission reduction has become one of the main goals of health-care providers, especially
since the CMS introduced the reimbursement penalty for hospital readmissions that occur within 30 days of patient
discharge [1]. Hospital readmission is especially problematic for diabetes: diabetic patients account for 23% of annual hospitalizations in
the USA although they make up only 8% of the country’s population [2].
In the literature, many researchers have focused on qualitative research methods to explain readmission risk factors
[3, 4]. Some studies also assessed different variables for hospital readmission prediction [5]. They mostly used logistic
regression because it makes it easy to calculate the probability of readmission and to identify the importance of features [6–8].
Moreover, to improve prediction accuracy, some researchers combined logistic regression with other methods
such as artificial neural networks (ANN). However, logistic regression is prone to over-training on imbalanced data, and
the methods combined with it (e.g., ANN, fuzzy systems) are black boxes that cannot provide the probability of readmission and
are therefore not easily interpretable [9].
In this paper, we propose a hybrid model called Evolutionary Simulating Annealing LASSO Logistic Regression
(ESALOR) for hospital readmission prediction. The ESALOR model combines the evolutionary simulated annealing
optimization method with a least absolute shrinkage and selection operator (LASSO) regression approach. The pro-
posed model can be used to analyze the effect of different risk factors on hospital readmission and predict hospital
readmission. The proposed model is compared with traditional classification approaches, including Support Vector
Machines (SVM), Decision Tree (DT), Naive Bayes (NB), and Logistic Regression (LR), to show the improvement it offers.
The paper is organized as follows. Section 2 explains data preprocessing, followed by the
details of the proposed hybrid model. Section 3 presents results on the performance of the proposed model. Section 4
concludes the paper.
2. Materials and Methods
2.1 Data Preprocessing
The diabetes readmission dataset was retrieved from the Health Facts database, which is a public Electronic Health Record (EHR) data set concerning diabetes patients [10]. The data include 55 features (such as diagnoses and number of visits), and the class label indicates whether or not a patient was readmitted within 30 days of discharge. The data set was preprocessed by removing missing values and applying feature selection methods. We combined information from several filter methods (correlation and information gain) and a wrapper method (decision tree) to select the most relevant features of the data set. After checking all feature selection methods, 13 features, such as discharge disposition, number of inpatients, and diagnosis, were selected for our analysis. These selected features are further filtered in the LASSO component of our proposed hybrid model.
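To make the information gain filter mentioned above concrete, the following is a minimal sketch of how such a ranking could be computed for categorical features. The function names, the feature names, and the toy records are hypothetical and are not taken from the Health Facts data; the paper does not state which implementation was used.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def rank_features(columns, labels):
    """Return (name, gain) pairs sorted from most to least informative."""
    gains = {name: information_gain(vals, labels) for name, vals in columns.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)

# Toy, made-up records (hypothetical; not from the paper's data set).
toy_columns = {
    "prior_inpatient_visits": ["many", "many", "few", "few", "many", "few"],
    "gender": ["F", "M", "F", "M", "M", "F"],
}
toy_labels = [1, 1, 0, 0, 1, 0]  # 1 = readmitted within 30 days
ranking = rank_features(toy_columns, toy_labels)
```

In this toy example the first feature separates the two classes perfectly and is therefore ranked first, while the second carries almost no information about the label.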
2.2 Artificial Neural Network
Artificial Neural Networks (ANN) can address many difficulties that traditional software techniques struggle with, such as processing abnormal data or working with incomplete information [11]. In an ANN, information is stored in the network itself: what the network knows is encoded in the values of its connections. The network draws its own conclusions from the samples it is given, and can then make similar decisions in similar cases and handle incomplete information in uncertain cases. In other words, it learns the relevant relationships between events from data and uses them to make decisions. Once trained, an ANN can still produce results when newly arrived examples contain incomplete information. Because the information is spread across the whole network, an ANN is said to have a distributed memory. ANNs work with numeric information [11].
2.3 Support Vector Machine
The Support Vector Machine (SVM) is a powerful two-class classifier. The algorithm tries to find a separating hyperplane in the feature space. It calculates the distance between each data point and the hyperplane [12]; the minimum of these distances is called the margin. The aim of the SVM is to obtain the hyperplane with the optimal (largest) margin, as shown in Figure 1, which illustrates this aim for observations on two independent variables.
Figure 1: An example of a separable problem in a two dimensional space [12].
However, with a linear hyperplane, the algorithm does not work well in some cases. Researchers therefore use kernel functions (e.g., radial basis functions). In addition, the misclassification penalty coefficient is treated as a tuning parameter to improve the method [12, 13].
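The notion of margin described above can be illustrated with a small calculation. This is only a sketch under simple assumptions: the points, the labels (coded as +1 and -1), and the two candidate hyperplanes are made up, and the code compares given hyperplanes rather than training an SVM.

```python
import math

def distance_to_hyperplane(point, w, b):
    """Signed distance of `point` from the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm

def margin(points, labels, w, b):
    """Smallest label-signed distance; positive only if every point is on its correct side."""
    return min(y * distance_to_hyperplane(x, w, b) for x, y in zip(points, labels))

# Made-up, linearly separable observations on two variables.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

margin_a = margin(points, labels, w=(1.0, 1.0), b=-2.0)  # hyperplane x1 + x2 = 2
margin_b = margin(points, labels, w=(1.0, 0.0), b=-1.0)  # hyperplane x1 = 1
```

Both hyperplanes separate the two classes, but the first leaves a larger margin, so it is the one a maximum-margin criterion would prefer between these two candidates.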
2.4 Naive Bayes Algorithm
Naive Bayes is a generative model that assumes the features are generated independently of one another given the class. It is one of the simplest machine-learning algorithms, yet it works well in many real-world applications. The algorithm treats the unknown target function as p(y|x). To learn it, the training data are used to estimate p(x|y) and p(y), from which p(y|x) is calculated with Bayes' rule, as shown in Equation (1) [13].
$$P(Y = y_i \mid X = x_k) = \frac{P(X = x_k \mid Y = y_i)\,P(Y = y_i)}{\sum_j P(X = x_k \mid Y = y_j)\,P(Y = y_j)} \tag{1}$$
For instance, in the generative view an output y is first drawn from the prior distribution p(y). A sequence of events is then produced by selecting each event independently from the conditional distribution p(x|y) (an event can be repeated many times). The prior distribution p(y) and the conditional distribution p(x|y) can be estimated from the training data set. The algorithm then makes predictions for the test set by comparing the likelihoods under these distributions. The parameters can be estimated by maximum likelihood or by Bayesian estimation; alternatively, a smoothed estimate can be used [13].
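The estimation and prediction steps described above can be sketched for categorical features as follows. This is a minimal illustration, assuming add-one (Laplace) smoothing as the smoothed estimate; the function names and the toy records are hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count classes and per-class feature values from categorical data."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    vocab = defaultdict(set)              # feature index -> observed values
    for row, y in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, y)][value] += 1
            vocab[j].add(value)
    return class_counts, value_counts, vocab

def posterior(row, model, alpha=1.0):
    """P(y | x) via Bayes' rule, with add-alpha smoothed likelihoods."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    joint = {}
    for y, n_y in class_counts.items():
        p = n_y / total                                   # prior p(y)
        for j, value in enumerate(row):
            count = value_counts[(j, y)][value]
            p *= (count + alpha) / (n_y + alpha * len(vocab[j]))  # p(x_j | y)
        joint[y] = p
    evidence = sum(joint.values())                        # denominator of Bayes' rule
    return {y: p / evidence for y, p in joint.items()}

# Made-up records: (discharge destination, number of prior visits).
rows = [("home", "high"), ("home", "low"), ("facility", "high"), ("facility", "high")]
labels = [0, 0, 1, 1]  # 1 = readmitted
model = train_naive_bayes(rows, labels)
post = posterior(("facility", "high"), model)
```

The class with the largest posterior is the prediction; here the new record is assigned to class 1 because both of its feature values are more frequent in that class.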
2.5 Logistic Regression
Logistic regression (LR) learns a function of the form p(y|x) directly, where Y is a discrete value and x is a vector of discrete or continuous values. The algorithm estimates its parameters directly from the training data.
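$$\log\frac{p(x)}{1 - p(x)} = \beta_0 + x\beta \tag{2}$$
$$p(x; \beta_0, \beta) = \frac{e^{\beta_0 + x\beta}}{1 + e^{\beta_0 + x\beta}} \tag{3}$$
$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + X\beta)}} \tag{4}$$
$$P(Y = 0 \mid X) = \frac{e^{-(\beta_0 + X\beta)}}{1 + e^{-(\beta_0 + X\beta)}} \tag{5}$$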
As Equations (2)-(5) show, the model resembles linear regression; the difference lies in the output. In classification, the output must be a class label, and logistic regression assigns it using Equations (2)-(5). Here the classification is binary, with y = 1 and y = 0. The logistic regression equation yields a probability, and the algorithm then classifies each test observation by comparing that probability with a threshold. After the parameters of the equations have been optimized, they can be used to predict the output for the test data [14]. LR is a linear classifier in x. It is also a function approximation algorithm that uses the training data to estimate p(y|x) directly [14].
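The probability-then-threshold procedure described above can be sketched as follows, assuming the common convention that P(Y = 1 | x) is the logistic function of the linear score. The coefficient values and the 0.5 threshold are hypothetical and chosen only to illustrate the calculation.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, intercept, coefs):
    """P(Y = 1 | x) for the linear score intercept + x . coefs."""
    return sigmoid(intercept + sum(c * xi for c, xi in zip(coefs, x)))

def classify(x, intercept, coefs, threshold=0.5):
    """Label 1 when the predicted probability reaches the threshold, else 0."""
    return 1 if predict_proba(x, intercept, coefs) >= threshold else 0

# Hypothetical coefficients, not estimated from any data.
intercept, coefs = -1.0, [0.8, 0.5]
p = predict_proba([2.0, 1.0], intercept, coefs)   # linear score = -1 + 1.6 + 0.5 = 1.1
label = classify([2.0, 1.0], intercept, coefs)
```

Lowering the threshold trades precision for recall, which can matter when the readmitted class is the minority.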
2.6 Evolutionary Simulating Annealing LASSO Logistic Regression (ESALOR) Model
The objective of the proposed model is to optimize the coefficients of Logistic Regression (LR) using the evolutionary strategy (ES) and simulated annealing (SA) algorithms, and to prevent over-training using regularization (LASSO). Simulated annealing is a random-search technique that follows a single trajectory, i.e., it improves one candidate solution at a time. The algorithm searches the feasible solution space by exploring the neighborhood of the current solution, starting from an initial solution [15]. The initial point of the SA algorithm can be chosen randomly; however, because SA searches for points near the initial solution, it can easily get stuck in a local optimum. In our framework, we use another meta-heuristic optimization approach, the evolutionary strategy, to identify a good initial solution for simulated annealing. This concept is illustrated in Figure 2. The randomly initialized SA moves through solutions S0 to S3; on arriving at S3, the algorithm tends to accept this point as the optimal solution for the decision variables, although it is clearly a local optimum. When the algorithm is instead initialized with solutions found by the ES algorithm, SA does not get stuck in the local optimum and can find the optimal solution [16]. With this hybrid meta-heuristic optimization approach, the coefficients of the model are optimized to find the best model under the objective in Equation (7).
Figure 2: Coupling ES and SA [16]
The typical formulation of logistic regression is shown in Equation (6). This method is used in some hospital readmission studies [10, 17]; however, the traditional logistic regression model suffers from over-fitting. Regularization methods have proven effective against over-fitting by penalizing the absolute values of the regression coefficients. The mathematical formulation of LASSO is provided in Equation (7), where N is the number of observations, y_i is the response at observation i, X_i is the data point at observation i, λ is a non-negative regularization parameter, and the β values are the coefficients of the regression model. This formulation is used as the objective function and is optimized with the evolutionary strategy based simulated annealing algorithm, because it is nonlinear: it contains absolute-value and squared terms.
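For orientation, one common way to write an L1-penalized (LASSO) logistic regression objective with the symbols defined above is sketched below. This is an assumption based on the standard textbook formulation, with p denoting the number of features; the paper's own Equation (7) may use a different loss or scaling.

$$\min_{\beta_0,\,\beta}\; -\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i\,(\beta_0 + X_i^{\top}\beta) - \log\big(1 + e^{\beta_0 + X_i^{\top}\beta}\big)\Big] + \lambda \sum_{j=1}^{p} |\beta_j|$$

The first term is the average negative log-likelihood of the logistic model for labels y_i in {0, 1}, and the second term is the penalty that shrinks coefficients toward zero as λ grows.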
Considering the provided information, the proposed framework can be summarized in the following steps:
Step 1 Feature Selection: The best subset of features is selected using a combination of filter and wrapper feature
selection methods.
Step 2 Formulation: The LASSO-logistic regression formulation of the problem is identified.
Step 3 Initialization: The simulated annealing model is initialized using the evolutionary strategy algorithm.
Step 4 Optimization Level: The parameters (coefficients) of the LASSO model are optimized using a hybrid
evolutionary strategy based simulated annealing method.
Step 5 Identifying Solutions: We find the optimal solution by comparing all solutions.
Step 6 Prediction: Hospital readmission of a new patient is predicted using the LASSO model with optimal
coefficients.
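Steps 2 to 5 can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the objective is a standard L1-penalized logistic loss, the evolutionary strategy is a simple elitist (mu + lambda) scheme with Gaussian mutation, and the annealing schedule is geometric. All function names, population sizes, step sizes, the value of λ, and the toy data are hypothetical.

```python
import math
import random

def penalized_loss(beta, X, y, lam):
    """Mean logistic loss plus an L1 penalty on all coefficients except the intercept beta[0]."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = beta[0] + sum(b * v for b, v in zip(beta[1:], xi))
        # log(1 + e^z) computed stably, minus y * z
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - yi * z
    return total / len(y) + lam * sum(abs(b) for b in beta[1:])

def evolution_strategy(loss, dim, rng, generations=30, mu=5, offspring=20, sigma=0.5):
    """Elitist (mu + lambda) evolution strategy; returns the best vector found."""
    population = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(mu)]
    for _ in range(generations):
        children = [[g + rng.gauss(0.0, sigma) for g in rng.choice(population)]
                    for _ in range(offspring)]
        population = sorted(population + children, key=loss)[:mu]
    return population[0]

def simulated_annealing(loss, start, rng, steps=2000, t0=1.0, cooling=0.995, step=0.2):
    """Metropolis-style annealing from `start`; returns the best vector visited and its loss."""
    current, current_loss = list(start), loss(start)
    best, best_loss = list(current), current_loss
    temperature = t0
    for _ in range(steps):
        candidate = [c + rng.gauss(0.0, step) for c in current]
        cand_loss = loss(candidate)
        delta = cand_loss - current_loss
        # Always accept improvements; accept worse moves with probability e^(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_loss = candidate, cand_loss
            if current_loss < best_loss:
                best, best_loss = list(current), current_loss
        temperature *= cooling
    return best, best_loss

def fit_esalor_sketch(X, y, lam=0.05, seed=0):
    """ES supplies the starting point (Step 3); SA refines it (Step 4)."""
    rng = random.Random(seed)
    loss = lambda beta: penalized_loss(beta, X, y, lam)
    seed_point = evolution_strategy(loss, len(X[0]) + 1, rng)
    return simulated_annealing(loss, seed_point, rng)

# Made-up data: the label depends on the first feature only.
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0], [5.0, 0.0]]
y = [0, 0, 0, 1, 1, 1]
beta, loss_value = fit_esalor_sketch(X, y)
```

A random search of this kind pushes the coefficients of irrelevant features toward zero but rarely to exactly zero, so a small cutoff would be needed to read off the selected features; how the paper handles this is not stated.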
2.7 Performance Evaluation
The performance of the proposed model is evaluated using four performance criteria: accuracy, recall, precision, and F-measure. Equations (8)-(11) define these four criteria. Among them, the F-measure is generally preferred when the testing data set is imbalanced, because it combines precision and recall and so reflects how well the learning algorithm handles each class. These measures are based on True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) values.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{9}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{10}$$
$$\text{F-measure} = \frac{2TP}{2TP + FP + FN} \tag{11}$$
TP is the number of correct classifications for normal patient detection. FP is the number of incorrect classifications
for readmission detection. FN is the number of incorrect classifications for normal patient detection. TN is the number
of correct classifications for readmission detection.
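As a worked example of Equations (8)-(11), the four criteria can be computed from predicted and true labels as sketched below. The labels are made up, the function names are hypothetical, and the choice of which label counts as the positive class is left as a parameter.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for the chosen positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn

def performance(tp, tn, fp, fn):
    """Accuracy, recall, precision and F-measure; assumes no denominator is zero."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }

# Made-up labels for ten patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
scores = performance(*counts)
```

With these labels there are 3 true positives, 5 true negatives, 1 false positive and 1 false negative, giving an accuracy of 0.8 and a recall, precision and F-measure of 0.75 each.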
3. The Results and Discussion
In this section, initial feature selection and the proposed methods are used to predict hospital readmission rate, and
to identify the important risk factors (features). For initial feature selection, correlation-based and information gain-
based feature selections are used to select the best subset of features. After checking all feature selection methods,
it turns out that the most significant features indicating readmission include discharge disposition, diagnosis and the
number of inpatients.
Table 1: Features selected by gain ratio based and correlation based feature selection

| Gain ratio: ranked rate | Gain ratio: ranked attribute | Correlation: ranked rate | Correlation: ranked attribute |
|---|---|---|---|
| 0.0188 | Number of Inpatients | 0.1059 | Number of Inpatients |
| 0.0049 | Discharge Disposition ID | 0.0786 | Discharge Disposition ID |
| 0.0028 | Chlorpropamide | 0.0587 | Patient Number |
| 0.0021 | Miglitol | 0.0513 | Time in Hospital |
| 0.0020 | Diagnosis 1 | 0.0303 | Encounter ID |
| 0.0012 | Diagnosis 3 | 0.0280 | Number of Emergency |
| 0.0017 | Diagnosis 2 | 0.0231 | Metformin |
For gain ratio based feature selection, as can be seen in Table 1, number of inpatients, discharge disposition, chlorpropamide, miglitol, and diagnosis rank highest. For correlation based feature selection, as can also be seen in Table 1, number of inpatients, discharge disposition, patient number, time in hospital, encounter ID, number of emergencies, and metformin rank highest for readmission. From these results, the best initial features for the proposed model are number of inpatients, discharge disposition, time in hospital, miglitol, diagnosis, number of emergencies, metformin, and chlorpropamide. The second feature selection is performed by the LASSO shrinkage in the proposed model, as seen in Equation (7). After applying the proposed model, the coefficients of the other features shrink to zero because the absolute values of the regression coefficients are penalized. As a result, discharge disposition, number of inpatients, diagnosis 1, and diagnosis 2 are selected. The data are split into a training set (1/3) and a testing set (2/3).
Table 2: Comparison of the ESALOR model with traditional classifiers on the testing data.

| Method | Accuracy | Precision | Recall | F-measure |
|---|---|---|---|---|
| SVM | 75.11% | 0.70 | 0.75 | 0.67 |
| ANN | 75.85% | 0.68 | 0.75 | 0.65 |
| LR | 74.95% | 0.70 | 0.75 | 0.65 |
| NB | 74.48% | 0.68 | 0.74 | 0.67 |
| ESALOR | 76.20% | 0.77 | 0.77 | 0.86 |
The models are compared using the performance indicators for readmission prediction. Our approach shows better results than the four conventional methods it is compared with. More specifically, as seen in Table 2, the prediction accuracy of SVM, ANN, LR and NB on the testing data lies between 74.48% and 75.85%. Their precision is at most 0.70 and their recall at most 0.75. At the same time, their F-measure values, which should exceed 0.8, lie between 0.65 and 0.67. Therefore, when established methods such as SVM, ANN, LR and NB are used, prediction performance is inadequate for readmission.
In contrast, the proposed model performs better than the other methods on every criterion, most notably in F-measure (0.86 versus at most 0.67). This suggests that the proposed model copes better with imbalanced data, since its learning is not skewed toward one class. Therefore, the proposed model performs better in predicting readmission.
4. Conclusion
With the introduction of a reimbursement penalty by the Centers for Medicare and Medicaid Services (CMS), hospitals have
become strongly interested in reducing the readmission rate. In this study, we proposed a hybrid classification frame-
work called Evolutionary Simulating Annealing LASSO Logistic Regression (ESALOR) to improve the classification
of readmissions of diabetic patients. The ESALOR model can help health-care providers identify the key risk factors
that cause hospital readmission for diabetic patients. By using the identified risk factors, physicians can develop new
strategies to reduce readmission rates and costs for the care of individuals with diabetes.
References
1. Centers for Medicare and Medicaid Services, 2016, "Readmissions Reduction Program (HRRP)," retrieved from
2. Centers for Disease Control and Prevention (CDC), 2011,
"National Diabetes Fact Sheet: National Estimates and General Information on Diabetes and Prediabetes in the
United States," retrieved from
3. Long, T., Genao, I., and Horwitz, L. I., 2013, "Reasons for Readmission in an Underserved High-risk Population:
a Qualitative Analysis of a Series of Inpatient Interviews," BMJ open, 3(9),e003212.
4. Strunin, L., Stone, M., and Jack, B., 2007, "Understanding Rehospitalization Risk: Can Hospital Discharge be
Modified to Reduce Recurrent Hospitalization?, " Journal of Hospital Medicine, 2(5), 297–304.
5. Cooper, G. S., Sirio, C. A., Rotondi, A. J., Shepardson, L. B., and Rosenthal, G. E., 1999, "Are Readmissions to
the Intensive Care Unit a Useful Measure of Hospital Performance?, " Medical Care, 37(4), 399–408.
6. Garrison, G. M., Mansukhani, M. P., and Bohn, B., 2013, "Predictors of Thirty-day Readmission Among Hospi-
talized Family Medicine Patients, " The Journal of the American Board of Family Medicine, 26(1), 71–77.
7. Hasan, O., Meltzer, D. O., Shaykevich, S. A., Bell, C. M., Kaboli, P.J., Auerbach, A. D., Wetterneck, T. B., Arora,
V. M., Zhang, J., and Schnipper, J. L, 2010, "Hospital Readmission in General Medicine Patients: a Prediction
Model, " Journal of general internal medicine, 25(3), 211–219.
8. van Walraven, C., Dhalla, I. A., Bell, C., Etchells, E., Stiell, I. G., Zarnke, K., Austin, P. C., and Forster, A. J.,
2010, "Derivation and Validation of an Index to Predict Early Death or Unplanned Readmission After Discharge
from Hospital to the Community," Canadian Medical Association Journal, 182(6), 551–557.
9. Liu, Y., Zayas-Castro, J. L., Fabri, P., and Huang, S., 2014, "Learning High-dimensional Networks with Nonlinear
Interactions by a Novel Tree-embedded Graphical Model," Pattern Recognition Letters, 49, 207–213.
10. Strack, B., DeShazo, J. P., Gennings, C., Olmo, J. L., Ventura, S., Cios, K. J., and Clore, J. N., 2014, "Impact
of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records, "
retrieved from
11. Oztemel, E., 2006, "Yapay sinir ağları (Artificial neural networks)," Papatya Publishing, 2nd Edition, Istanbul,
12. Cortes, C., and Vapnik, V., 1995, "Support Vector Networks, " Machine Learning, 20(3) 273–297.
13. Dai, W., Brisimi, T. S., Adams, W. G., Mela, T., Saligrama, V., and Paschalidis, I. C., 2015, "Prediction of Hospi-
talization Due to Heart Diseases by Supervised Learning Methods, " International Journal of Medical Informatics,
84(3), 189-197.
14. Mitchell, T. M., 2016, "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression," retrieved from tom/mlbook/NBayesLogReg.pdf
15. Kirkpatrick, S., 1984, "Optimization by Simulated Annealing: Quantitative Studies, " Journal of Statistical
Physics, 34(5-6), 975–986.
16. Tutun, S., Chou, C. A., and Canıyılmaz, E., 2015, "A New Forecasting Framework for Volatile Behavior in Net
Electricity Consumption: A Case Study in Turkey," Energy, 93, 2406–2422.
17. Walsh, C., and Hripcsak, G., 2014, "The Effects of Data Sources, Cohort Selection, and Outcome Definition on a
Predictive Model of Risk of Thirty-day Hospital Readmissions, " Journal of Biomedical Informatics, 52, 418-426.
... Hospital readmissions are amply studied in a variety of medical conditions, however, they only recently have started to attract attention of researchers in the study of healthcare policies for diabetic patients [27]. Different machine learning approaches, including deep learning, have been attempted in order to predict a diabetic patient's risk of readmission based on their medical history with varying results [6,11,14,24,29,30]. ...
... While the aforementioned traditional approaches are good at identifying key features and achieving high prediction accuracy, they do not capture more complex patterns between features that may be hidden in the data. With this in mind, hybrid approaches, such as in [30], combine meta-heuristic methods, such as evolutionary simulated annealing, and sparse logistic Lasso regression to improve feature selection. Very briefly, the model optimises coefficients of Logistic Regression using evolutionary strategies and simulated annealing algorithms and use Lasso regularization to prevent over-training, a drawback of Logistic Regression when applied to unbalanced data. ...
... The most salient aspects of the dataset can be summarised very briefly as follows: each row corresponds to a hospital visit by a patient and each patient may have more than one visit, i.e., several rows may be associated to the same patient. Demographic information of the patient is stored as categorical variables, including gender and race as well as age, which appears as labels describing intervals measured in years (e.g., [0, 10), [10,20), [20,30) (i.e., readmission occurred after 30 days). Full details of the dataset, including detailed descriptions of the features mentioned earlier and others that have been omitted for brevity, can be found in the original study by [29]. ...
Hospital readmissions pose additional costs and discomfort for the patient and their occurrences are indicative of deficient health service quality, hence efforts are generally made by medical professionals in order to prevent them. These endeavors are especially critical in the case of chronic conditions, such as diabetes. Recent developments in machine learning have been successful at predicting readmissions from the medical history of the diabetic patient. However, these approaches rely on a large number of clinical variables thereby requiring deep learning techniques. This article presents the application of simpler machine learning models achieving superior prediction performance while making computations more tractable.
... α and λ are important hyperparameters of LR models. Tutun et al. applied the evolutionary strategy and simulated annealing to optimize the coefficients of LR [41]. ...
... We utilized Web Crawler techniques (Python 3) to collect a total of 7734 data (training set) points and to obtain various variables related to the target variable, as shown in Figure 3. In addition, we publicly uploaded the code used in this study to GitHub [41]. Our public dataset can be accessed using the following URL: ...
Full-text available
Health authorities have recommended the use of digital tools for home workouts to stay active and healthy during the COVID-19 pandemic. In this paper, a machine learning approach is proposed to assess the activity of users on a home workout platform. Keep is a home workout application dedicated to providing one-stop exercise solutions such as fitness teaching, cycling, running, yoga, and fitness diet guidance. We used a data crawler to collect the total training set data of 7734 Keep users and compared four supervised learning algorithms: support vector machine, k-nearest neighbor, random forest, and logistic regression. The receiver operating curve analysis indicated that the overall discrimination verification power of random forest was better than that of the other three models. The random forest model was used to classify 850 test samples, and a correct rate of 88% was obtained. This approach can predict the continuous usage of users after installing the home workout application. We considered 18 variables on Keep that were expected to affect the determination of continuous participation. Keep certification is the most important variable that affected the results of this study. Keep certification refers to someone who has verified their identity information and can, therefore, obtain the Keep certification logo. The results show that the platform still needs to be improved in terms of real identity privacy information and other aspects.
... However, they only recently have started to attract attention of researchers in the study of healthcare policies for diabetic patients [4]. Different machine learning approaches, including deep learning, have been attempted in order to predict a patient's risk of readmission based on their medical history with varying results [5]- [9]. The present investigation evaluates several machine learning models aimed at predicting readmission from clinical data recorded in previous visits by the diabetic patient. ...
... Random forests models trained on the dataset compiled by [5] have exhibited good precision-recall scores (0.65) [10] whereas other classifiers fine-tuned through evolutionary algorithms (EA) have yielded good performance in terms of 978-1-7281-1614-3/19/$31.00 ©2018 IEEE accuracy, recall and specificity (0.97, 1.00, 0.97, respectively) [11]. Hybrid approaches that can capture more complex patterns between different features have been achieved combining Evolutionary Simulated Annealing method with sparse LOgistic Regression model of Lasso (ESALOR) improving accuracy, precision, recall and F1 (0.76, 0.77, 0.77, 0.86, respectively) over SVN and other conventional methods [9]. Various classifiers have shown varying performance metrics when patients in this dataset are grouped by age [7]. ...
... Then, as the model is optimized, the coefficients of the LASSO model are adjusted using the SAbased hybrid evolution strategy. In the end, the most suitable solution is chosen, and using the LASSO model, the most distinctive items on the test are estimated with optimal coefficients [17]. ...
... Surprisingly, not until recently have EHRs been used in conjunction with advanced algorithms, [10,11] even though they have been shown to lead to better care. [12] Predictive methods, in particular, have been used for example in the context of heart-related problems, [13][14][15] hemodialysis, [16] diabetes in older adults, [17][18][19][20] and multiple disease prediction. [21] To the best of our knowledge, predicting diabetes-related hospitalizations based on EHR history using machine learning algorithms is a novel problem. ...
Objective: To derive a predictive model to identify patients likely to be hospitalized during the following year due to complications attributed to Type II diabetes. Methods: A variety of supervised machine learning classification methods were tested and a new method that discovers hidden patient clusters in the positive class (hospitalized) was developed while, at the same time, sparse linear support vector machine classifiers were derived to separate positive samples from the negative ones (non-hospitalized). The convergence of the new method was established and theoretical guarantees were proved on how the classifiers it produces generalize to a test set not seen during training. Results: The methods were tested on a large set of patients from the Boston Medical Center - the largest safety net hospital in New England. It is found that our new joint clustering/classification method achieves an accuracy of 89% (measured in terms of area under the ROC Curve) and yields informative clusters which can help interpret the classification results, thus increasing the trust of physicians to the algorithmic output and providing some guidance towards preventive measures. While it is possible to increase accuracy to 92% with other methods, this comes with increased computational cost and lack of interpretability. The analysis shows that even a modest probability of preventive actions being effective (more than 19%) suffices to generate significant hospital care savings. Conclusions: Predictive models are proposed that can help avert hospitalizations, improve health outcomes and drastically reduce hospital expenditures. The scope for savings is significant as it has been estimated that in the USA alone, about $5.8 billion are spent each year on diabetes-related hospitalizations that could be prevented.
Predicting hospital readmission with effective machine learning techniques has attracted a great attention in recent years. The fundamental challenge of this task stems from characteristics of the data extracted from electronic health records (EHR), which are imbalanced class distributions. This challenge further leads to the failure of most existing models that only provide a partial understanding for the learning problem and result in a biased and inaccurate prediction. To address this challenge, we propose a new graph-based class-imbalance learning method by fully making use of the data from different classes. First, we conduct graph construction for learning the pattern discrimination from between-class and within-class data samples. Then we design an optimization framework to incorporate the constructed graphs to obtain a class-imbalance aware graph embedding and further alleviate performance degeneration. Finally, we design a neural network model as the classifier to conduct imbalanced classification, i.e., hospital readmission prediction. Comprehensive experiments on six real-world readmission datasets show that the proposed method outperforms state-of-the-art approaches in readmission prediction task.
Email spam is a serious problem that annoys recipients and wastes their time. Machine-learning methods have been prevalent in spam detection systems owing to their efficiency in classifying mail as solicited or unsolicited. However, existing spam detection techniques usually suffer from low detection rates and cannot efficiently handle high-dimensional data. Therefore, we propose a novel spam detection method that combines the artificial bee colony algorithm with a logistic regression classification model. The empirical results on three publicly available datasets (Enron, CSDMC2010, and TurkishEmail) show that the proposed model can handle high-dimensional data thanks to its highly effective local and global search abilities. We compare the proposed model’s spam detection performance to those of support vector machine, logistic regression, and naive Bayes classifiers, in addition to the performance of the state-of-the-art methods reported by previous studies. We observe that the proposed method outperforms other spam detection techniques considered in this study in terms of classification accuracy.
Background: In 2008, the United States spent $2.2 trillion for healthcare, which was 15.5% of its GDP. 31% of this expenditure is attributed to hospital care. Evidently, even modest reductions in hospital care costs matter. A 2009 study showed that nearly $30.8 billion in hospital care cost during 2006 was potentially preventable, with heart diseases being responsible for about 31% of that amount. Methods: Our goal is to accurately and efficiently predict heart-related hospitalizations based on the available patient-specific medical history. To the best of our knowledge, the approaches we introduce are novel for this problem. The prediction of hospitalization is formulated as a supervised classification problem. We use de-identified Electronic Health Record (EHR) data from a large urban hospital in Boston to identify patients with heart diseases. Patients are labeled and randomly partitioned into a training and a test set. We apply five machine learning algorithms, namely Support Vector Machines (SVM), AdaBoost using trees as the weak learner, Logistic Regression, a naïve Bayes event classifier, and a variation of a Likelihood Ratio Test adapted to the specific problem. Each model is trained on the training set and then tested on the test set. Results: All five models show consistent results, which could, to some extent, indicate the limit of the achievable prediction accuracy. Our results show that with under 30% false alarm rate, the detection rate could be as high as 82%. These accuracy rates translate to a considerable amount of potential savings, if used in practice.
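The results above are stated as a detection rate at a given false alarm rate. A minimal sketch of how those two quantities follow from a confusion matrix is given below; the function name and the counts are hypothetical, chosen only so the rates land near the levels the abstract mentions, and are not taken from the study.

```python
def detection_and_false_alarm(tp, fn, fp, tn):
    """Return (detection rate, false alarm rate) from confusion-matrix counts.

    detection rate   = TP / (TP + FN)  (share of true hospitalizations flagged)
    false alarm rate = FP / (FP + TN)  (share of non-hospitalized patients flagged)
    """
    detection_rate = tp / (tp + fn)
    false_alarm_rate = fp / (fp + tn)
    return detection_rate, false_alarm_rate


# Hypothetical counts for illustration only (not study data).
rates = detection_and_false_alarm(tp=82, fn=18, fp=290, tn=710)
print(rates)
```

Sweeping the classifier's decision threshold trades one rate against the other, which is why the abstract quotes them as a pair.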
Management of hyperglycemia in hospitalized patients has a significant bearing on outcome, in terms of both morbidity and mortality. However, there are few national assessments of diabetes care during hospitalization which could serve as a baseline for change. This analysis of a large clinical database (74 million unique encounters corresponding to 17 million unique patients) was undertaken to provide such an assessment and to find future directions which might lead to improvements in patient safety. Almost 70,000 inpatient diabetes encounters were identified with sufficient detail for analysis. Multivariable logistic regression was used to fit the relationship between the measurement of HbA1c and early readmission while controlling for covariates such as demographics, severity and type of the disease, and type of admission. Results show that the measurement of HbA1c was performed infrequently (18.4%) in the inpatient setting. The statistical model suggests that the relationship between the probability of readmission and the HbA1c measurement depends on the primary diagnosis. The data suggest further that the greater attention to diabetes reflected in HbA1c determination may improve patient outcomes and lower cost of inpatient care.
To gather qualitative data to elucidate the reasons for readmissions in a high-risk population of underserved patients. We created an instrument with 27 open-ended questions based on current interventions. Yale-New Haven Hospital. Patients at the Yale Adult Primary Care Center (PCC). We conducted semi-structured qualitative interviews of patients who had four or more admissions in the previous 6 months and were currently readmitted to the hospital. We completed 17 interviews and identified themes relating to risk of readmission. We found that patients went directly to the emergency department (ED) when they experienced a change in health status without contacting their primary provider. Reasons for this included poor telephone or urgent care access and the belief that the PCC could not treat acute illness. Many patients could not name their primary provider. Conversely, every patient except one reported being able to obtain medications without undue financial burden, and every patient reported receiving adequate home care services. These high-risk patients were receiving the formal services that they needed, but were making the decision to go to the ED because of inadequate access to care and fragmented primary care relationships. Formal transitional care services are unlikely to be adequate in reducing readmissions without also addressing primary care access and continuity.
Purpose: Hospital readmissions within 30 days of initial discharge occur frequently. In studies of elderly patients receiving Medicare, readmissions have been associated with poor-quality inpatient care, ineffective hospital-to-home transitions, patient characteristics, disease burden, and socioeconomic status. Among adult family medicine patients spanning a wide age range, we hypothesize that previous hospitalizations, length of stay, number of discharge medications, medical comorbidities, and patient demographics are associated with a greater risk of hospital readmission within 30 days. A retrospective case-control study of 276 family medicine inpatients was conducted to determine the factors associated with 30-day readmission. Bivariate statistics were computed and a multivariate analysis using logistic regression was performed to determine the independent effects of each factor. Patients readmitted within 30 days had more hospitalizations, more emergency department visits, longer hospital stays, more comorbidities, and more discharge medications and were less likely to be married. Multivariate logistic regression found that hospitalization within the previous 12 months (odds ratio, 2.71) and long hospital stays (odds ratio, 2.16) were associated with 30-day readmission; being married (odds ratio, 0.54) had a protective effect. This study demonstrates that factors previously found to be associated with 30-day readmission among elderly patients receiving Medicare also apply to family medicine patients of all ages. It also demonstrates prior hospitalizations, length of stay, and marital status are useful proxies for many more complicated factors, such as disease burden, medical complexity, and social issues, that influence hospital readmission.
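To make the reported odds ratios more concrete, the sketch below converts a baseline readmission probability into the probability implied by an odds ratio. The 10% baseline and the function name `apply_odds_ratio` are assumptions for illustration. The odds ratios are those quoted in the abstract, but they are adjusted estimates from a multivariate model, so applying one to a single baseline is only a rough reading.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability after multiplying the baseline odds by an odds ratio."""
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


baseline = 0.10  # assumed baseline 30-day readmission probability (hypothetical)
for label, ratio in [("prior hospitalization", 2.71),
                     ("long hospital stay", 2.16),
                     ("married", 0.54)]:
    print(label, round(apply_odds_ratio(baseline, ratio), 3))
```

An odds ratio above 1 raises the implied probability and one below 1 lowers it, but the change in probability depends on the baseline, which is why odds ratios should not be read as risk multipliers.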
The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
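The idea of mapping inputs non-linearly into a higher-dimensional feature space can be illustrated with a degree-2 polynomial transformation. The sketch below checks that the kernel value (x·z + 1)^2 equals the dot product of explicit feature vectors for two-dimensional inputs. The function names and sample points are illustrative assumptions, and the snippet shows only the feature-map identity, not a full support-vector training procedure.

```python
import math


def poly_kernel(x, z):
    """Degree-2 polynomial kernel (x.z + 1)^2 computed in the input space."""
    return (x[0] * z[0] + x[1] * z[1] + 1.0) ** 2


def feature_map(x):
    """Explicit 6-dimensional feature map whose dot product matches poly_kernel."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1]]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, z = (1.0, 2.0), (3.0, -1.0)  # arbitrary sample points
print(poly_kernel(x, z), dot(feature_map(x), feature_map(z)))
```

Because the kernel gives the feature-space dot product directly, a linear decision surface in the mapped space can be built without ever forming the mapped vectors.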
Network models have been widely used in many domains to characterize relationships between physical entities. Although extensive research efforts have been conducted for learning networks from data, many of them were developed for learning networks with linear relationships. As both linear and nonlinear relationships may appear in many applications, in this paper, we developed a novel graphical model, the sparse tree-embedded graphical model (STGM), which is able to uncover both linear and nonlinear relationships from a large number of variables. We further proposed an efficient regression-based algorithm for learning the STGM from data. We conducted simulation studies that demonstrated the superiority of the STGM over other network learning methods and applied the STGM on a real-world application that demonstrated its efficacy on discovering interesting nonlinear relationships in practice.
Background: Hospital readmission risk prediction remains a motivated area of investigation and operations in light of the hospital readmissions reduction program through CMS. Multiple models of risk have been reported with variable discriminatory performances, and it remains unclear how design factors affect performance. Objectives: To study the effects of varying three factors of model development in the prediction of risk based on health record data: (1) reason for readmission (primary readmission diagnosis); (2) available data and data types (e.g. visit history, laboratory results, etc.); (3) cohort selection. Methods: Regularized regression (LASSO) to generate predictions of readmissions risk using prevalence sampling. Support Vector Machine (SVM) used for comparison in cohort selection testing. Calibration by model refitting to outcome prevalence. Results: Predicting readmission risk across multiple reasons for readmission resulted in ROC areas ranging from 0.92 for readmission for congestive heart failure to 0.71 for syncope and 0.68 for all-cause readmission. Visit history and laboratory tests contributed the most predictive value; contributions varied by readmission diagnosis. Cohort definition affected performance for both parametric and nonparametric algorithms. Compared to all patients, limiting the cohort to patients whose index admission and readmission diagnoses matched resulted in a decrease in average ROC from 0.78 to 0.55 (difference in ROC 0.23, p value 0.01). Calibration plots demonstrate good calibration with low mean squared error. Conclusion: Targeting reason for readmission in risk prediction impacted discriminatory performance. In general, laboratory data and visit history data contributed the most to prediction; data source contributions varied by reason for readmission. Cohort selection had a large impact on model performance, and these results demonstrate the difficulty of comparing results across different studies of predictive risk modeling.
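LASSO-regularized logistic regression, the method named in the abstract above, selects features by driving some coefficients exactly to zero. The following is a minimal sketch using proximal gradient descent (ISTA) with soft-thresholding. The solver choice, function names, step size, penalty values, and toy data are all assumptions for illustration and are not drawn from any of the cited studies.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_lasso_logistic(X, y, lam, step=0.1, iters=2000):
    """Minimize mean logistic loss + lam * ||w||_1 (intercept unpenalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        # Residuals p_i - y_i of the current model.
        resid = [sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - yi
                 for row, yi in zip(X, y)]
        grad_w = [sum(r * row[j] for r, row in zip(resid, X)) / n
                  for j in range(d)]
        grad_b = sum(resid) / n
        # Gradient step on the smooth loss, then soft-threshold the weights.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b


# Toy data: the first feature tracks the label, the second is weak noise.
X = [[2.0, 0.1], [1.5, -0.2], [1.0, 0.3],
     [-1.0, 0.2], [-1.5, -0.1], [-2.0, -0.3]]
y = [1, 1, 1, 0, 0, 0]
w, b = fit_lasso_logistic(X, y, lam=0.1)
print(w, b)
```

On this toy data the penalty keeps the noise feature's weight at zero while the informative feature receives a positive weight; a larger `lam` removes more features, up to the point where every weight is zero.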