Predicting Coronary Heart Disease Using
Machine Learning and Statistical Analysis
Ankur Patel, Drazen Zack, Mohit Supe
Saint Peter’s University
Abstract - Healthcare is an enormous domain, and data science is necessary for it to undergo meaningful transformation. The objective of this study is to use data science efficiently to discover hidden correlations among risk factors for coronary artery disease. The aim is to analyze coronary artery disease data sets and predict the likelihood that a given patient has heart disease. The first data set of this study comprises patients' observations across 14 features. The second data set, containing only male patients, was used for additional evaluation, since the proportion of males with heart disease in the first data set was more than twice that of females. This study analyzes the effect of each attribute on the heart disease outcome. The machine learning algorithms used for analysis were Logistic Regression, Support Vector Machines (SVM), and Random Forest. The models' features were tuned using the feature selection methods of Stepwise Regression, Variable Importance, Boruta, and Recursive Feature Elimination. The models were evaluated using cross-validation to find the best predictors of heart disease. The features in the data set were also evaluated using the statistical techniques of chi-square tests and ANOVA. The methodologies and results of the machine learning methods, tuning, cross-validation, and statistical analysis are described in detail. The goal of this study was to identify the most significant patient features and the most accurate machine learning algorithm, yielding an optimized and tuned method for heart disease prediction. This report includes the necessary visualizations, descriptions, comments, and results, and concludes with the significance of this study in helping combat heart disease.
I. Introduction
Coronary heart disease (CHD) is not only the most common type of heart disease, but also the leading cause of death for both men and women in the United States. Cardiovascular diseases, which include CHD, are currently the leading cause of death globally [2], and the World Health Organization estimates that the number of heart-related deaths will rise to 23.6 million by 2030 [3]. CHD occurs when blood flow to the heart muscle is restricted by the narrowing or blockage of the coronary arteries, the muscular-walled tubes that supply oxygen-rich blood to the heart muscle. A waxy substance called plaque builds up in the arteries and restricts blood flow. This results in angina, which is chest pain or discomfort, or in a heart attack if the oxygen-rich blood flow is cut off. Without quick treatment, a heart attack can permanently weaken the heart muscle.
Heart disease is caused by damage to the coronary arteries from the following factors: smoking, high blood pressure, high cholesterol, high blood sugar levels, blood vessel inflammation, and a sedentary lifestyle. These risk factors can be treated to reduce the chances of heart disease. Other factors that human beings cannot change, such as age and gender, also affect heart disease risk [4].
The heart disease data set used for this study, accessed from the UCI Machine Learning Repository, contained a subset of 303 observations of 14 attributes that were analyzed and used to predict the chance that a given patient has heart disease. Table 1 gives a brief description of the attributes, or features, with their types and ranges:
| Attribute | Short Description | Range |
|-----------|-------------------|-------|
| Age | Age in years | Continuous |
| Sex | Female or male | 0, 1 |
| ChestPain | Asymptomatic, nonanginal, nontypical, typical | 1, 2, 3, 4 |
| RestBP | Resting blood pressure | Continuous |
| Chol | Cholesterol | Continuous |
| Fbs | Fasting blood sugar > 120 mg/dl | 0, 1 |
| RestECG | Normal, ST-T abnormality, hypertrophy | 0, 1, 2 |
| MaxHR | Maximum heart rate | Continuous |
| ExAng | Exercise-induced angina | 0, 1 |
| Oldpeak | ST depression induced by exercise relative to rest | Continuous |
| Slope | Slope of the peak exercise ST segment | 1, 2, 3 |
| Ca | Number of major vessels (0-3) colored by fluoroscopy | 0, 1, 2, 3 |
| Thal | Normal, fixed, reversible | 1, 2, 3 |
| AHD | Heart disease: Yes (>50% diameter narrowing), No (<50% diameter narrowing) | 1, 2 |

Table 1: Description of Data Set Variables
Using R, the data was cleaned and exploratory data analysis was performed to understand the data and the relationships between the variables. Box plots and bar graphs were created to view and understand the interactions of the features in relation to the target feature of CHD. Features were then selected and tuned using feature selection methods. The machine learning methods of logistic regression, support vector machines, and random forest were used as effective techniques to predict the risk of the disease. Harnessing the power of machine learning, the models were tuned to best predict heart disease using different methods: stepwise regression, variable importance, Rpart, Boruta, and recursive feature elimination. These models were then used with cross-validation to gather an understanding of their behavior. After cross-validation, each model was evaluated on its results from the test data set. To further evaluate the models, statistical tests, namely the chi-square test and analysis of variance (ANOVA), were performed on this data set as a confirmation of the models. The classification results, showing the relevance and significance of the attributes, and the machine learning predictions illustrated the usefulness of these techniques for selecting significant features to predict and treat heart disease.
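As a rough illustration of the preparation step, the sketch below shows how such a data set might be loaded and cleaned in R; the file name, column names, and factor codings are assumptions for illustration, not details taken from the original analysis.

```r
# Sketch: load and clean the heart disease data (assumed file/column layout)
heart <- read.csv("Heart.csv", stringsAsFactors = FALSE)

# Drop rows with missing values (e.g., Ca and Thal may contain a few NAs)
heart <- na.omit(heart)

# Encode the categorical attributes as factors so models treat them correctly
cat_cols <- c("Sex", "ChestPain", "Fbs", "RestECG", "ExAng",
              "Slope", "Ca", "Thal", "AHD")
heart[cat_cols] <- lapply(heart[cat_cols], factor)

str(heart)  # verify types and ranges against Table 1
```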
The second data set was evaluated as an extension of this study. Since the proportion of males with heart disease was more than twice that of females, a males-only data set was used to extend the analysis. This data set, obtained from a machine learning software package called Tanagra for research and academic purposes [20], was itself an extension of the first data set, as it had more observations for a subset of the same features: Age, ChestPain, RestBP, Blood Sugar (Fbs), RestECG, MaxHR, ExAng, and the CHD outcome.
The purpose of this study was to evaluate the CHD data sets and compare different classification and machine learning techniques to arrive at an efficient and effective methodology. The risk of such heart problems can and should be addressed to help prevent the disease. Treatments involving lifestyle changes or medical procedures, chosen according to the patient's conditions and risk factors, can treat coronary heart disease.
II. Related Work
Healthcare institutions around the world store data that is largely accessible for research purposes. There are other studies that predict heart disease with this UCI data set and with other data sets in the medical and healthcare domain. Researchers have applied different data mining methods to find which ones successfully predict heart disease for patients.
For example, D. Chaki et al.'s research paper "A Comparison of Three Discrete Methods for Classification of Heart Disease" [5], using this data set, found that SVM outperformed naïve Bayes and the C4.5 classifier with an accuracy of 84.12%. The authors of "A Comprehensive Investigation and Comparison of Machine Learning Techniques in the Domain of Heart Disease" [9] obtained an accuracy of 84.15% using SVM, the best among their six other methods. On the other hand, the authors of "A Comparative Study for Predicting Heart Diseases Using Data Mining Classification Methods" [2] found that SVM was outperformed by Random Forest, Decision Tree, and Naïve Bayes on this data set. Results from the authors of "Heart Disease Diagnosis Using Predictive Data Mining" [17] showed that Naive Bayes outperformed Decision Tree.
These studies were all focused on finding the best classification method for predicting heart disease using the UCI data set. While our study also predicts whether a patient has heart disease according to these features, we additionally selected the features that are best suited to diagnose, predict, or treat heart disease.
III. Exploratory Data Analysis
Exploring the data, the values of the numeric attributes were first analyzed to check their correlation, as seen in Figure 1. Since the correlations were mostly close to 0, there was no strong correlation between the numerical variables. The most correlated numerical features were MaxHR and Age at -0.39, which shows that the maximum heart rate decreases with age.
Figure 1: Correlation of numeric variables
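A minimal sketch of how such a correlation check can be run in R, assuming the cleaned data frame from the preparation step is named heart:

```r
# Correlation among the numeric attributes (as in Figure 1)
num_cols <- c("Age", "RestBP", "Chol", "MaxHR", "Oldpeak")
round(cor(heart[num_cols]), 2)   # e.g., cor(MaxHR, Age) is about -0.39

# Optional heat-map style view via the corrplot package
library(corrplot)
corrplot(cor(heart[num_cols]), method = "number")
```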
To better understand the data, summary statistics of the numeric attributes were calculated. The mean showed more AHD outcomes at older ages, and the high standard deviation for no-AHD outcomes meant that age varied widely among patients without AHD. The following results were also significant: the mean and median of cholesterol were higher, and MaxHR was lower, in AHD patients.
Next, frequency tables of AHD were computed for each categorical attribute. Looking at Sex, the probability of males having AHD was higher than that of females. For ChestPain, the asymptomatic category showed the most Yes outcomes of the four categories; hence, most of those patients were unaware of their condition. The fasting blood sugar (Fbs) feature was roughly evenly divided between AHD and no AHD both above and below 120 mg/dl, which is the diabetes threshold value in the US [4]. The table for exercise-induced angina (ExAng) showed that AHD occurred least often without ExAng. Although the asymptomatic category occurred most for ChestPain, angina appears to be more commonly triggered by exercise than by AHD itself.
The graph below gives insight into some attributes of the data set and their interactions. It shows that MaxHR decreased with the patients' age while the chance of AHD increased, which is again understandable. The progressive decrease in MaxHR, independent of sex, physical activity, and other factors, is an inevitable result of aging [21].
Figure 2: Age vs MaxHR by AHD
IV. Methodology
With the growth of complex data and the field of statistics, data mining has become extremely beneficial for finding patterns in data using algorithms. This research focused on improving the diagnosis of heart disease by evaluating different risk factors using feature selection and data mining methods. Cross-validation was used to further evaluate how the models performed on the training set.
A) Feature Selection
In machine learning and statistics, feature selection is a key step that selects the relevant attributes for model construction to predict the outcome. These methods identify and remove unnecessary attributes from the model for better accuracy [6]. After cleaning the data set, four feature selection techniques were applied to the classification methods for predicting CHD: Stepwise Regression, Variable Importance, Boruta, and Recursive Feature Elimination.
Stepwise Regression consists of adding and removing features to arrive at the best subset of features according to some measure; here, the Akaike Information Criterion (AIC) was used. Using backward selection, the model starts with all candidate features, and the least contributive features are removed iteratively. This process repeats until no improvement is seen or no features remain. The resulting models are listed later in Table 2, together with those produced by the other feature selection methods.
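A minimal sketch of AIC-based backward elimination using stepAIC() from the MASS package; the formula and data frame name are assumptions:

```r
library(MASS)

# Full logistic regression model with all candidate features
full_model <- glm(AHD ~ ., data = heart, family = binomial)

# Backward stepwise selection: iteratively drop the feature whose removal
# most lowers the AIC, stopping when no removal improves the criterion
step_model <- stepAIC(full_model, direction = "backward", trace = FALSE)
summary(step_model)   # remaining features form the Stepwise Regression model
```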
Variable Importance was used to evaluate the relationship between the features and the outcome through importance calculations and rankings. The varImp() function was applied to the AHD ~ . model for logistic regression, SVM, and random forest. Importance was estimated using ROC curve analysis on each feature, except for random forest, which scored features using the Gini index. Figures 3, 4, and 5 show the importance calculation and rank of the variables for each machine learning method.
Figure 3: Logistic Regression Variable Importance
Figure 4: Support Vector Machine Variable Importance
Figure 5: Random Forest Variable Importance
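A hedged sketch of this step with caret; the resampling setup and object names are assumptions, and varImp(..., useModel = FALSE) is used to mirror the ROC-based ranking described above:

```r
library(caret)
set.seed(1)
ctrl <- trainControl(method = "cv", number = 5)

# Fit each learner through caret so varImp() can rank the predictors
glm_fit <- train(AHD ~ ., data = heart, method = "glm", trControl = ctrl)
svm_fit <- train(AHD ~ ., data = heart, method = "svmLinear", trControl = ctrl)
rf_fit  <- train(AHD ~ ., data = heart, method = "rf", trControl = ctrl)

# ROC-based filter ranking for logistic regression and SVM;
# the random forest ranking uses the model's own Gini-based importance
varImp(glm_fit, useModel = FALSE)
varImp(svm_fit, useModel = FALSE)
varImp(rf_fit)
```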
Boruta is another R package, with an algorithm built around random forest. It duplicates the columns of the data set and shuffles the values in the duplicates, called shadow features, then trains a Random Forest classifier on the extended data. It uses the mean decrease in accuracy to measure the importance of all features, where higher scores mean greater importance [7]. Iteratively, it removes the features that are not important, stopping when all features are either rejected or confirmed, or when the specified number of random forest runs has been reached. Figure 6 provides a boxplot of the importance of each feature.
Figure 6: Boruta Variable Importance Boxplot
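A short sketch of a Boruta run against the AHD outcome; the data frame name and maxRuns setting are assumptions:

```r
library(Boruta)
set.seed(1)

# Run Boruta; each feature ends up confirmed, tentative, or rejected
boruta_out <- Boruta(AHD ~ ., data = heart, maxRuns = 100)
print(boruta_out)

# Resolve any remaining tentative features, then inspect and plot
boruta_final <- TentativeRoughFix(boruta_out)
getSelectedAttributes(boruta_final)   # the Boruta feature subset (Table 2)
plot(boruta_final, las = 2)           # importance boxplots as in Figure 6
```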
Recursive Feature Elimination (RFE), a popular feature selection method in the caret R package that automatically selects features according to accuracy, was also used [8]. Similar to Boruta, it finds the best-performing subset of features by repeatedly constructing models and removing features iteratively; successive models are built with the remaining features until all features have been used. The features are ranked by their order of elimination, and the output shows the accuracy for each number of features. Figure 7 shows the accuracy for each number of features that RFE tested. Along with the number of features that works best, RFE also gave the best set of variables.
Figure 7: RFE Number of Variables with Accuracy
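A minimal RFE sketch with caret, assuming a random forest estimator (rfFuncs) and 5-fold cross-validation; these settings and names are assumptions:

```r
library(caret)
set.seed(1)

# RFE control: random forest fitting functions, 5-fold cross-validation
rfe_ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)

rfe_out <- rfe(x = heart[, setdiff(names(heart), "AHD")],
               y = heart$AHD,
               sizes = 1:13,          # candidate subset sizes to evaluate
               rfeControl = rfe_ctrl)

rfe_out               # accuracy for each number of features (as in Figure 7)
predictors(rfe_out)   # the best-performing subset of features
```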
Two more methods were used to build candidate models: Trial and Error, and P-value. The Trial and Error model was chosen from the exploratory data analysis and the best-performing combination of features. The P-value model's features were chosen by the condition of a p-value less than 0.05, the chosen criterion for statistical significance.
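A brief sketch of the P-value selection, assuming a full logistic regression fit; names are illustrative:

```r
# Keep terms whose glm coefficients have p < 0.05 (the P-value model)
fit   <- glm(AHD ~ ., data = heart, family = binomial)
coefs <- summary(fit)$coefficients
sig   <- rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]
setdiff(sig, "(Intercept)")  # dummy levels map back to their parent factors
```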
Table 2 lists the feature selection models, with the attributes each method selected, that were compared for performance.
| Method | Features |
|--------|----------|
| Control | All independent features |
| Stepwise Regression | Sex, ChestPain, RestBP, ExAng, Oldpeak, Slope, Ca, Thal |
| Variable Importance (Logistic Regression) | Ca, Sex, ChestPain, Slope, RestBP, Thal, ExAng, Oldpeak, MaxHR |
| Variable Importance (SVM) | Ca, MaxHR, Oldpeak, ChestPain, ExAng, Thal, Slope, Sex, Age |
| Variable Importance (Random Forest) | Age, ChestPain, MaxHR, Oldpeak, Ca, Thal, RestBP |
| Trial and Error | Age, Sex, ChestPain, RestBP, MaxHR, Oldpeak, Slope, Ca, Thal, Chol |
| P-value | Sex, ChestPain, RestBP, Slope, Ca |
| Recursive Feature Elimination | Ca, Thal, Sex, ChestPain, Oldpeak, ExAng, Slope, MaxHR |
| Boruta | Age, Sex, ChestPain, RestBP, MaxHR, ExAng, Oldpeak, Slope, Ca, Thal |

Table 2: Feature Selection Methods with Variables Picked
B) Data Mining
Data mining is the application of specific algorithms to find patterns in data [11]. Data mining techniques are divided into four groups: classification, clustering, regression, and association rule mining [1]. For this study, Logistic Regression, Support Vector Machines, and Random Forest were applied to the data set.
Logistic regression is suited for testing hypotheses about relationships between a categorical dependent variable, which is heart disease in our case, and one or more categorical or continuous independent variables [12]. Support Vector Machines are a supervised learning method for binary classification; they serve as a linear separator between data points in a multidimensional space to distinguish two classes in the data [13]. Random Forest is an ensemble learning method for classification that assembles a host of decision trees and outputs the class chosen by a vote of the individual trees. Its advantages include efficiency and improved prediction accuracy without a noteworthy jump in cost [14][1].
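A hedged sketch of fitting the three classifiers on a train/test split; the 70/30 split and object names are assumptions (the paper does not state the split), and AHD is assumed to be a Yes/No factor:

```r
library(e1071)          # svm()
library(randomForest)   # randomForest()
set.seed(1)

# Assumed 70/30 train/test split
idx   <- sample(nrow(heart), 0.7 * nrow(heart))
train <- heart[idx, ]
test  <- heart[-idx, ]

glm_fit <- glm(AHD ~ ., data = train, family = binomial)
svm_fit <- svm(AHD ~ ., data = train, kernel = "linear")
rf_fit  <- randomForest(AHD ~ ., data = train)

# Test-set class predictions for each model
glm_pred <- ifelse(predict(glm_fit, test, type = "response") > 0.5,
                   "Yes", "No")
svm_pred <- predict(svm_fit, test)
rf_pred  <- predict(rf_fit, test)
```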
C) Cross-Validation
Cross-validation is a common resampling technique to assess, compare, and select the least biased and most accurate machine learning models. In K-fold cross-validation, the data is partitioned into K subsamples and the process is repeated K times, with each subsample used for validation exactly once. The limited number of observations in this data set led to using K-fold cross-validation for data partitioning [9]. Each classification model produced by the tuning methods was cross-validated with 5 folds to further evaluate the multilevel performance of the predictors [10].
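As an illustration of this setup, the sketch below cross-validates one of the tuned models (here the Stepwise feature set from Table 2 under logistic regression) with caret; the formula and names are assumptions:

```r
library(caret)
set.seed(1)

# 5-fold cross-validation control
cv_ctrl <- trainControl(method = "cv", number = 5)

cv_fit <- train(AHD ~ Sex + ChestPain + RestBP + ExAng + Oldpeak +
                  Slope + Ca + Thal,
                data = heart, method = "glm", trControl = cv_ctrl)

cv_fit$resample                  # accuracy on each of the 5 folds
mean(cv_fit$resample$Accuracy)   # the averaged accuracy reported in Table 3
```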
With the limited number of observations in the UCI data set, cross-validation was implemented to better understand the models' accuracy and bias. It showed how each model performed on new data, which provided our research with useful insights into their performance. Table 3 shows the average accuracy of the models cross-validated with 5 folds.
| Method | Logistic Regression | SVM | Random Forest |
|--------|---------------------|-----|---------------|
| All | 0.81 | 0.81 | 0.79 |
| Stepwise | 0.79 | 0.77 | 0.80 |
| Variable Importance | 0.82 | 0.83 | 0.76 |
| Trial & Error | 0.80 | 0.81 | 0.78 |
| P-value | 0.83 | 0.80 | 0.81 |
| RFE | 0.82 | 0.82 | 0.76 |
| Boruta | 0.81 | 0.81 | 0.80 |

Table 3: Cross-Validated Models' Accuracy
All of the cross-validation results were factored in when evaluating the models and choosing the best one to predict CHD.
D) Statistical Analysis
To further evaluate the features of this data set, chi-squared tests and ANOVA were performed. The chi-squared test checks whether two categorical variables are related. Since that only indicates whether they are associated, Cramér's V was calculated to measure the strength of the association. The statistical relationship between CHD and the continuous variables was assessed using ANOVA, which tests whether the mean differences between groups are statistically significant.
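A minimal sketch of these three calculations in R for one categorical and one continuous feature; the data frame name is an assumption:

```r
# Chi-squared test: association between ChestPain and AHD
tab <- table(heart$ChestPain, heart$AHD)
chisq.test(tab)

# Cramér's V from the chi-squared statistic: V = sqrt(X^2 / (n * (k - 1)))
x2 <- chisq.test(tab)$statistic
n  <- sum(tab)
k  <- min(nrow(tab), ncol(tab))
sqrt(x2 / (n * (k - 1)))   # about .51 for ChestPain in this study

# One-way ANOVA: does mean MaxHR differ between AHD = Yes and AHD = No?
summary(aov(MaxHR ~ AHD, data = heart))
```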
Conducting the chi-squared tests on the UCI heart disease data set yielded some interesting insights. The categorical variables tested against CHD were Sex, ChestPain, Fbs, RestECG, ExAng, Slope, Ca, and Thal. The only categorical variable without a statistically significant relationship with CHD was Fbs; consistent with this, the feature selection methods did not retain Fbs. For all variables with a statistically significant association with CHD, Cramér's V was computed. The three strongest variables were Ca at .49, ChestPain at .51, and Thal at .52, and all three were used throughout the models from the feature selection. An ANOVA test was applied to see whether the mean difference between the CHD groups (Yes, No) was statistically significant for Age, RestBP, Chol, Oldpeak, and MaxHR. The results showed a significant difference in means for all continuous variables except Chol; consistent with this, Chol was not selected by any of the automated feature selection methods.
V. Test Results
The multilevel models were applied to the test set; their accuracies are compared in Table 4.
| Method | Logistic Regression | SVM | Random Forest |
|--------|---------------------|-----|---------------|
| All | 0.93 | 0.90 | 0.85 |
| Stepwise | 0.90 | 0.90 | 0.83 |
| Variable Importance | 0.92 | 0.93 | 0.85 |
| Trial & Error | 0.93 | 0.90 | 0.85 |
| P-value | 0.85 | 0.85 | 0.78 |
| RFE | 0.93 | 0.91 | 0.82 |
| Boruta | 0.92 | 0.83 | 0.85 |

Table 4: Tested Models' Accuracy
From these results, it is clear that Random Forest had the lowest performance. Although random forest offers efficiency and improved prediction accuracy without a noteworthy jump in cost, this data set was too small for it to predict accurately. Logistic Regression and SVM had similarly high accuracies, mostly in the low nineties with only one model in the eighties.
Figure 8 shows the calculations of the performance measures, which are also needed in order to compute the probability ratios.
Figure 8: Performance Measures
For this study, and in any case of predicting life-threatening disease, type II errors, or false negatives, should be limited. In this study, a type II error is predicting No when the actual outcome is Yes. Faulty clinical decisions can lead to disastrous and unacceptable consequences [15], and medical diagnosis is considered a significant task that must be carried out with the utmost precision and efficiency [16]. Limiting the number of type II errors therefore needs to be a key factor in clinical decision-making. Sensitivity and negative predictive value become important measures when examining type II errors: the higher the sensitivity and negative predictive value, the better the model handles type II error. Table 5 displays the four best models based on accuracy and the other performance measures.
| Model | Acc | Sen | Spec | PPV | NPV |
|-------|-----|-----|------|-----|-----|
| GLM Trial and Error | 0.93 | 0.87 | 0.97 | 0.95 | 0.92 |
| GLM RFE | 0.93 | 0.87 | 0.97 | 0.95 | 0.92 |
| SVM Variable Importance | 0.93 | 0.91 | 0.95 | 0.91 | 0.95 |
| SVM RFE | 0.93 | 0.91 | 0.95 | 0.91 | 0.95 |

Table 5: Best Multilevel Models' Performance Measures
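A brief sketch of how such measures can be computed with caret's confusionMatrix(); the prediction and test objects reuse the assumed names from the earlier train/test sketch:

```r
library(caret)

# Confusion-matrix measures for one model's test predictions;
# "Yes" (has heart disease) is treated as the positive class
cm <- confusionMatrix(data = factor(svm_pred, levels = c("No", "Yes")),
                      reference = test$AHD,
                      positive = "Yes")

cm$table    # false negatives (type II errors) appear in this table
cm$byClass[c("Sensitivity", "Specificity",
             "Pos Pred Value", "Neg Pred Value")]
```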
From the four best models, the SVM with Variable Importance was concluded to be the best model for this data set. Its features are as follows: Age, Sex, MaxHR, ChestPain, Ca, Oldpeak, ExAng, Thal, and Slope. Although the SVM models from Variable Importance and RFE produced exactly the same results, the Variable Importance model was chosen as the best because it included Age, which played a significant role in the exploratory data analysis.
VI. Males: Feature Selection & Results
In the first data set, the rate of heart disease for males was much higher than that for females, as seen in Figure 9. About 26% of the females had CHD, whereas about 56% of the males did; in other words, the rate for males was more than twice that for females.
Figure 9: AHD: Males vs Females
Furthermore, between 70% and 89% of sudden cardiac events occur in men [18]. Since the occurrence of heart disease is clearly higher in males, it was further evaluated using a data set of only males.
The males data set extended the features Age, ChestPain, RestBP, Blood Sugar (Fbs), RestECG, MaxHR, ExAng, and CHD from the first data set. The same feature selection and tuning methods were repeated, and only the best model's details are listed here. RFE and Boruta for Random Forest showed exactly the same accuracy and performance measures, so they are listed together as the best model.
| Best Model Features | Accuracy | Sensitivity | Specificity | Neg Pred Value | Pos Pred Value |
|---------------------|----------|-------------|-------------|----------------|----------------|
| ExAng, ChestPain, MaxHR | 0.8197 | 0.6944 | 0.8617 | 0.7937 | 0.7864 |

Table 6: Best Model Accuracy & Performance Measures
The best model's features showed that exercise-induced angina, chest pain, and maximum heart rate were the most significant predictors of heart disease in this data set. As mentioned earlier, the model's performance measures are also significant for decision-making. The significant features from this data set match the medical definition of angina, which is "a type of chest pain caused by reduced blood flow to the heart" and "a symptom of coronary artery disease" [19].
When implementing the chi-squared test, the relationships between CHD and ChestPain, RestECG, Blood Sugar, and ExAng were evaluated. The tests for ChestPain and RestECG were inconclusive because the small size of the data set made the test assumptions unreliable. Blood Sugar and ExAng, on the other hand, both showed a significant association with CHD: ExAng had a correlation of .79, while Blood Sugar, which had no significant relationship with CHD in the UCI data set, had a low correlation of .12 in this male data set. The ANOVA on the three continuous variables showed that Age and MaxHR had significant differences in their means, whereas RestBP did not.
Based on the test results from the UCI and males-only data sets, some supported insights were gained. ExAng had a strong relationship with the outcome in both data sets, which points to ExAng being a strong predictor of heart disease. Blood Sugar (Fbs) had a very low correlation, so it was insignificant for predicting heart disease. Among the continuous variables, Age and MaxHR showed a significant difference in means in both data sets, suggesting that Age and MaxHR are good predictors of heart disease. The RestBP results were inconclusive, however, since the difference was significant in the UCI data set but not in the males-only data set.
VII. Conclusion
In this paper, the CHD data set of 303 observations from the UCI Machine Learning Repository was first used to analyze and create methods for predicting heart disease in patients. After exploring the data set and observing that heart disease occurred more than twice as often in males as in females, a males-only data set was found and the process was repeated. Different combinations of features were gathered using the machine learning techniques of Stepwise Regression, Variable Importance, Recursive Feature Elimination, and Boruta, along with feature selection from p-values and exploratory data analysis, and they were cross-validated to get a comprehensive look at the models' metrics. Additionally, the models were evaluated with chi-squared tests and ANOVA as statistical confirmation. The trained models were then validated against the test data sets. Since this validation resulted in 93% accuracy for the SVM with Variable Importance, the features in this model - Age, Sex, MaxHR, ChestPain, Ca, Oldpeak, ExAng, Thal, and Slope - can be used to diagnose and predict heart disease. Physicians could use this study to screen patients for these specific features instead of performing full-body screenings, and patients could also use it themselves to estimate their chance of having the disease.
Most healthcare institutions around the world store their data in some electronic format. This data is building up at an alarming rate and has become very complex. This paper serves as a foundation for research on preventing and treating cardiovascular disease (CVD). Since CVD includes a range of medical diseases and conditions, this research can be expanded into further studies to combat heart disease. By thoroughly researching the role of a person's attributes, such as physical characteristics, lifestyle, or genetics, new tools or practices can be developed to prevent or treat heart disease. This medical research can lead to new and more effective screening or treatments such as therapies, medicines, medical and surgical procedures, or artificial intelligence technologies.
References
[1] UCI Machine Learning Repository: Heart Disease Data Set. Creators: 1. Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.; 2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.; 3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.; 4. V.A. Medical Center, Long Beach, and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
[2] Zriqat, Esraa & Altamimi, Ahmad & Azzeh,
Mohammad. (2017). A Comparative Study for Predicting
Heart Diseases Using Data Mining Classification Methods.
[3] Marikani, T. & Shyamala, K. "Prediction of Heart Disease Using Supervised Learning Algorithms." International Journal of Computer Applications, Volume 165, May 2017.
[4] Hajar, Rachel. “Risk Factors for Coronary Artery
Disease: Historical Perspectives.” Heart Views. 2017 Jun-
Sep; 18(3): 109-114.
[5] Chaki, Dipankar & Das, Amit & Moinul, Zaber. (2015).
A comparison of three discrete methods for classification of
heart disease data. Bangladesh Journal of Scientific and
Industrial Research. 50. 293. 10.3329/bjsir.v50i4.25839.
[6] M. Anbarasi, E. Anupriya, N.Ch.S.N. Iyengar. "Enhanced Prediction of Heart Disease with Feature Subset Selection using Genetic Algorithm." International Journal of Engineering Science and Technology, Vol. 2(10), 2010, pp. 5370-5376.
[7] Kursa, Miron B. & Witold B. Rudnicki. “Feature
Selection with Boruta Package.” Journal of Statistical
Software. September 2010. Volume 36, Issue 11.
[8] Chen, Q. et al. "Decision Variants for the Automatic Determination of Optimal Feature Subset in RF-RFE." Genes (Basel). 2018 Jun 15; 9(6): pii E301. doi: 10.3390/genes9060301.
[9] Pouriyeh, Seyedamin & vahid, sara & Sannino,
Giovanna & De Pietro, Giuseppe & Arabnia, Hamid &
Gutierrez, Juan. (2017). A Comprehensive Investigation
and Comparison of Machine Learning Techniques in the
Domain of Heart Disease. 10.1109/ISCC.2017.8024530.
[10] Afshartous, David & Leeuw, Jan de (2005). Prediction
in Multilevel Models.
[11] M. Fayyad, Usama & Piatetsky-Shapiro, Gregory &
Smyth, Padhraic. (1996). From Data Mining to Knowledge
Discovery in Databases. AI Magazine. 17. 37-54.
10.1609/aimag.v17i3.1230.
[12] Peng, C., Lee, K.L., & Ingersoll, G.M. (2003). An
Introduction to Logistic Regression Analysis and
Reporting.
[13]Chaitali Vaghela, Nikita Bhatt and Darshana Mistry.
Article: A Survey on Various Classification Techniques for
Clinical Decision Support System. International Journal of
Computer Applications 116(23):11-17, April 2015
[14] Lin Li, Y.W.a.M.Y., Experimental Comparisons of
Multi-class Classifiers. Informatica, 2015.
[15] S. A. Pattekari and A. Parveen, "Prediction System for Heart Disease Using Naive Bayes," International Journal of Advanced Computer and Mathematical Sciences, vol. 3, no. 3, pp. 290-294, 2012.
[16] K.Srinivas & B.Kavihta, Rani & Govardhan, Dr.
(2010). Applications of Data Mining Techniques in
Healthcare and Prediction of Heart Attacks. International
Journal on Computer Science and Engineering. 2.
[17] B.Venkatalakshmi, M.V Shivsankar, “Heart Disease
Diagnosis Using Predictive Data mining”, IJIRSET
Volume 3, Special Issue 3, March 2014 ,pp. 1873-1877.
[18] Roger VL, Go AS, Lloyd-Jones DM, Benjamin EJ, Berry JD, Borden WB, et al. Heart disease and stroke statistics, 2012 update: a report from the American Heart Association. Circulation. 2012;125(1):e2-e220.
[19] “Angina.” Mayo Clinic, Mayo Foundation for Medical
Education and Research, 18 Jan. 2018,
www.mayoclinic.org/diseases-
conditions/angina/symptoms-causes/syc-20369373.
[20] Rakotomalala, Ricco. (2005). "TANAGRA: a free
software for research and academic purposes".
[21] Christou, Demetra D. & Seals, Douglas R. "Decreased maximal heart rate with aging is related to reduced β-adrenergic responsiveness but is largely explained by a reduction in intrinsic heart rate." J Appl Physiol (1985). 2008 Jul; 105(1): 24-29.