International Journal of Electrical and Computer Engineering (IJECE)
Vol. 5, No. 6, December 2015, pp. 1569~1576
Comparing Performance of Data Mining Algorithms in
Prediction Heart Diseases
Moloud Abdar, Sharareh R. Niakan Kalhori, Tole Sutikno, Imam Much Ibnu Subroto, Goli Arji
Department of Engineering, Damghan University, Iran
Department of Health Information Management, Tehran University of Medical Sciences, Iran
Department of Electrical Engineering, Universitas Ahmad Dahlan, Yogyakarta, Indonesia
Department of Informatics Engineering, Universitas Islam Sultan Agung, Semarang, Indonesia
Health Information Management, Tehran University of Medical Sciences, Iran
Article Info
Article history:
Received Aug 4, 2015
Revised Oct 11, 2015
Accepted Oct 27, 2015
Heart diseases are among the nation’s leading causes of mortality and morbidity. Data
mining techniques can predict the likelihood of patients developing heart disease. The
purpose of this study is to compare different data mining algorithms for the prediction of
heart diseases. This work applied and compared data mining techniques to predict the
risk of heart diseases. After feature analysis, models based on decision tree, neural
network, support vector machine and k-nearest neighbor algorithms were developed and
validated. The C5.0 decision tree built the model with the greatest accuracy, 93.02%;
KNN, SVM and neural network reached 88.37%, 86.05% and 80.23%, respectively.
The results produced by the decision tree are easy to interpret and apply; its rules can
be readily understood by different clinical practitioners.
C5.0 Algorithm
Data Mining
Heart Disease
Neural Network
Corresponding Author:
Goli Arji,
Faculty of Allied Medical Sciences
Tehran University of Medical Sciences
Tehran, Iran
According to the latest statistics from the World Health Organization (WHO), heart diseases have
received a great deal of attention in medical research because of their impact on human health [1].
Cardiovascular disease is the number one cause of death in industrialized countries; it has a major impact
not only on individuals and their quality of life in general, but also on public health costs and the countries’
economies. Diagnosing heart disease is among the more costly diagnostic decisions. Artificial Intelligence
(AI) techniques have been used widely in medical diagnosis. With the advancement of science, the volume
of data accumulated in various fields has increased to such an extent that it is known as the explosion of
information [2]. Analyzing the accumulated data can reveal the useful information hidden in them. Data
mining, a relatively new science, makes it possible to extract this hidden knowledge. It reveals useful
relationships that exist among the data, and these rules can be applied to make the right decisions [3],[4].
Classification is one of the subdivisions of data mining and acts in accordance with If-Then rules. Its purpose
is to predict a variable based on other features, known as predictors. Neural network, support vector
machine and decision tree are different forms of classification algorithms [5-9]. The purpose of this study is
to compare different machine learning algorithms for the prediction of heart diseases.
This section summarises various technical articles on the KDD process and data mining classification
techniques applied to heart disease datasets:
Ram Bilas Pachori and his colleagues [10] studied the diagnosis of heart disease using the
tunable-Q wavelet transform of heart rate signals. Since manual data entry is error-prone and time
consuming, the Tunable-Q Wavelet Transform (TQWT) method is recommended in that study.
Using the least squares support vector machine (LS-SVM), they reported an accuracy of 96.8%,
a sensitivity of 100%, and a specificity of 93.7%.
Another study, conducted by Yongqiang Lyu et al. [11], was based on an evaluation model of
coronary artery disease using a data mining algorithm. That research proposes a new dynamic model that
makes lifetime assessment possible and suggests a linear time-invariant approach to assess CHD. The model
result based on SYNTAX scores indicates a possible error of 5%. In another study [12], the J4.8 decision tree
method was used, and the reported precision was 84.1%.
In another study, conducted by Sumit Bhatia et al. [13], a genetic algorithm, SVM and SSVM were
used to classify cardiac patients. The features were selected by the genetic algorithm to give the SSVM the
best choice of inputs. The precision obtained was 72.55%, while GA-SSVM improved the result to a
precision of 90.57%. Peter C. Austin and colleagues [14] discuss heart malfunctions in their paper. The
associated physicians divided the patients into two groups, "with" and "without" disease. They found that
using decision trees in data mining gives better results than the regression model. Saba Bashir [15] applied
the MV5 algorithm, and its precision was 88.52%.
Another study, by Jasmine Nahar et al. [16], looked for relationships between heart disease
risk factors in men and women. It notes that the risk of coronary heart disease is lower in women than in
men, and that by exercising, both men and women can readily relieve chest pain. One of the points extracted
in that paper is that "Rest ECG", in both its normal and hyper forms, and "Slope being flat" are risk factors.
However, the research results indicate that, for men, Rest ECG is a risk factor only in its hyper form. The
study concludes that Rest ECG should be considered an important factor for predicting heart disease in
women. The techniques Apriori, Predictive Apriori and Tertius were compared with each other, and the
precision of Predictive Apriori was 90%.
Kyle Walker [17] notes that heart disease is the principal cause of death in Texas, USA. The author
therefore performed a study on different areas of Texas using cluster analysis, and the results show that
factors such as poor hygiene, economic deprivation and other conditions affect the outbreak of the disease.
In the paper presented by K. Rajeswari and colleagues [18], heart disease is studied using a
neural network. They examined the influence of feature selection on the neural network algorithm in
identifying patients with ischemic heart disease, using 12 features. Their results show that when all the
features (attributes) are applied, the precision is 89.4% in training mode and 82.2% in test mode. An
interesting point in the conclusion is that any reduction in the input features decreases the precision in both
training and test modes. AV Senthil Kumar [19] applied a fuzzy mechanism to cardiac patients; the precision
calculated in that paper was 94.11%. Some examples of research done on cardiac patients with different
techniques have been briefly mentioned here.
The present study was conducted using data from the University of California, Irvine (UCI). The data
include 13 features and are classified into 2 classes, "with" and "without" heart disease. After feature
analysis, models based on decision tree, neural network, support vector machine and k-nearest neighbor
algorithms were developed and validated.
2.1. C5.0 Algorithm
The C5.0 algorithm, developed from the C4.5 algorithm, is one of the most important and widely used
algorithms in data mining. C4.5 itself is an extension of the ID3 algorithm. C5.0 can be applied for
classification either as a decision tree or as a set of rules. Because their rule sets are easy to understand,
such models are preferred in many applications. The strengths of the algorithm are its handling of missing
values and of large numbers of inputs, as well as the short time needed for training [20], [21], [22], [23].
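Let S be the training set, in which each sample belongs to one of N classes C_i (i = 1, 2, ..., N), and
let X be an attribute whose test divides S into n subsets (S_1, S_2, ..., S_n). To test the attributes, the
algorithm uses a criterion called the gain ratio [24].
The number of samples in S that belong to class C_i is denoted $\mathrm{freq}(C_i, S)$, and the
probability that a sample belongs to C_i is $\mathrm{freq}(C_i, S)/|S|$. The information of the training set
is calculated according to the formula:

$$\mathrm{Info}(S) = -\sum_{i=1}^{N} \frac{\mathrm{freq}(C_i, S)}{|S|} \log_2 \frac{\mathrm{freq}(C_i, S)}{|S|}$$

Info(S) is the information needed to identify the class of a sample in S. After S is divided into its
subsets by X, the gain ratio is calculated as follows:

$$\mathrm{Info}_X(S) = \sum_{j=1}^{n} \frac{|S_j|}{|S|}\, \mathrm{Info}(S_j)$$

$$\mathrm{Gain}(X) = \mathrm{Info}(S) - \mathrm{Info}_X(S)$$

$$\mathrm{SplitInfo}(X) = -\sum_{j=1}^{n} \frac{|S_j|}{|S|} \log_2 \frac{|S_j|}{|S|}$$

$$\mathrm{GainRatio}(X) = \frac{\mathrm{Gain}(X)}{\mathrm{SplitInfo}(X)}$$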
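The gain ratio criterion can be illustrated with a short calculation. The following is a minimal sketch, assuming a single categorical attribute and the usual entropy-based definitions of information and split information; the function names and the toy data are hypothetical, and this is only the splitting criterion, not an implementation of C5.0 itself.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels, i.e. Info(S)."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain ratio of one categorical attribute with respect to the class labels."""
    total = len(labels)
    # Partition the class labels by attribute value: one subset per value.
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    # Weighted information remaining after the split.
    info_x = sum(len(s) / total * entropy(s) for s in subsets.values())
    gain = entropy(labels) - info_x
    # Split information is the entropy of the partition sizes.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: a binary attribute and a two-class target (1 = absence, 2 = presence).
attribute = [1, 1, 1, 0, 0, 0, 0, 1]
target = [2, 2, 2, 1, 1, 1, 2, 1]
print(round(gain_ratio(attribute, target), 4))
```

An attribute that separates the classes perfectly gets a gain ratio of 1 in the two-class, two-value case, while an attribute that leaves the class mix unchanged gets 0.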
2.2. SVM Algorithm
Support Vector Machine (SVM) is a supervised learning algorithm introduced by Vapnik in 1995. The
algorithm is built on keeping the generalization error low. It constructs a "hyperplane" that divides the data
into classes so that all samples belonging to one class fall on one side and the rest on the other side. For the
SVM classification task a linear SVM classifier is defined, and the separating line is chosen so that it has
the largest possible margin [13], [25].
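To make the geometry concrete, the sketch below evaluates an already-fitted linear classifier: a sample is assigned to a class by the sign of w·x + b, and the margin between the two classes has width 2/||w||. The weights, bias and points are hypothetical values chosen for illustration; the sketch does not perform the optimization that finds the maximum-margin hyperplane.

```python
from math import sqrt


def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Distance between the two margin boundaries w.x + b = +1 and w.x + b = -1."""
    return 2.0 / sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane x1 + x2 - 3 = 0 and four labeled points lying on the margin.
w, b = [1.0, 1.0], -3.0
points = [([1.0, 1.0], -1), ([2.0, 0.0], -1), ([2.0, 2.0], 1), ([3.0, 1.0], 1)]

for x, label in points:
    print(x, label, predict(w, b, x))
print(margin_width(w))
```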
2.3. KNN Algorithm
The k-nearest neighbor algorithm is a classification method based on similarity to other cases. Cases
that are close to each other are called "neighbors". When a new case is presented, its distance from each of
the cases in the model is calculated. The cases at the smallest distances, which are the most similar, are its
nearest neighbors, and the new case is placed in the group that contains most of them. The algorithm can
also predict values of a continuous target. In this situation, the average or median target value of the nearest
neighbors is used as the predicted value for the new case [26].
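Both uses of the nearest neighbors, majority vote for a class and averaging for a continuous target, can be sketched in a few lines. This is a minimal illustration assuming Euclidean distance and unscaled features; the function names, the value of k and the data are hypothetical.

```python
from collections import Counter
from math import dist
from statistics import mean


def k_nearest(train, query, k):
    """Return the k training cases (features, target) closest to the query."""
    return sorted(train, key=lambda case: dist(case[0], query))[:k]


def knn_classify(train, query, k=3):
    """Predict the class held by the majority of the k nearest neighbors."""
    votes = Counter(target for _, target in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Predict a continuous target as the mean over the k nearest neighbors."""
    return mean(target for _, target in k_nearest(train, query, k))


# Hypothetical two-feature cases with a class label.
cases = [((1, 1), "without"), ((1, 2), "without"), ((2, 1), "without"),
         ((6, 6), "with"), ((7, 6), "with"), ((6, 7), "with")]
print(knn_classify(cases, (2, 2)))
print(knn_classify(cases, (6, 5)))

# Hypothetical one-feature cases with a continuous target.
numeric_cases = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
print(knn_regress(numeric_cases, (2.0,)))
```

In practice the features would usually be rescaled first, since a variable with a wide range would otherwise dominate the distance.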
2.4. Neural Network Algorithm
An artificial neural network is a data processing algorithm inspired by the human brain. The system
consists of a large number of tiny processors that handle the data processing. The processors act as an
interconnected network, working in parallel with each other to solve a problem. In these networks,
programming is used to design a data structure that acts like a neuron; this data structure is itself called a
neuron [27], [28], [29], [30].
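The neuron data structure can be sketched as a weighted sum followed by an activation function, with neurons connected in layers. The example below is an illustrative forward pass only, assuming a sigmoid activation and one hidden layer; the weights are hypothetical and no training is performed.

```python
from math import exp


def sigmoid(z):
    """Squash a real-valued input into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of the inputs plus bias, then activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def forward(inputs, hidden_layer, output_unit):
    """Pass the inputs through one hidden layer and a single output neuron."""
    hidden = [neuron(inputs, weights, bias) for weights, bias in hidden_layer]
    out_weights, out_bias = output_unit
    return neuron(hidden, out_weights, out_bias)


# Hypothetical weights: two inputs, two hidden neurons, one output neuron.
hidden_layer = [([0.5, -0.4], 0.1), ([-0.3, 0.8], 0.0)]
output_unit = ([1.2, -0.7], 0.05)
print(forward([1.0, 0.0], hidden_layer, output_unit))
```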
2.5. Accuracy Measurement
To evaluate the predictions, several indices such as specificity, sensitivity, precision and accuracy
are used to assess the models’ validity. These indices (equations 6-9) are calculated from
the confusion matrix (Figure 1). This matrix is a useful tool for analyzing how well a classification
method assigns data or observations to the various categories. In the ideal state, most of the
observations lie on the main diagonal of the matrix, and the remaining values of the
matrix are zero or near zero [31], [32].
FN = the number of positively labeled data that have falsely been classified as "Negative".
TN = the number of negatively labeled data that have correctly been classified as "Negative".
TP = the number of positively labeled data that have correctly been classified as "Positive".
FP = the number of negatively labeled data that have falsely been classified as "Positive".
Figure 1. Confusion matrix
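The four indices can be computed directly from the four confusion matrix counts. The sketch below assumes the standard definitions of sensitivity, specificity, precision and accuracy; the function name and the counts are hypothetical and are not taken from the study's data.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four validity indices from confusion matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of actual positives that were found
        "specificity": tn / (tn + fp),  # share of actual negatives that were found
        "precision": tp / (tp + fp),    # share of predicted positives that are correct
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical counts for a two-class problem.
metrics = classification_metrics(tp=40, tn=40, fp=4, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```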
2.6. Data Set
In this study, 270 records with 13 features were used [33]. The patient attributes used for
modeling, their definitions and their ranges of values are presented in Table 1.
Table 1. Patient attributes used for modeling, their definitions and their ranges of values
Variable | Variable Definition | Categories of Values
Age | age of the patient | [29-77]
Sex | gender of the patient | (1 = male; 0 = female)
CP | chest pain type | [1-4]
RBP | resting blood pressure | [94-200]
SC | serum cholesterol in mg/dl | [126-564]
FBS | fasting blood sugar > 120 mg/dl | [0-1]
RER | resting electrocardiographic results | [0-2]
MHRA | maximum heart rate achieved | [71-202]
EIA | exercise induced angina | [0-1]
Oldpeak | ST depression induced by exercise relative to rest | [0-6.2]
Slope | the slope of the peak exercise ST segment | [1-3]
NUM | number of major vessels (0-3) colored by fluoroscopy | [0-3]
Thal | normal, fixed defect, reversible defect | [3, 6, 7]
Variable to be predicted | class of heart disease | absence (1) or presence (2) of heart disease
Using logistic regression, the variables that are significantly correlated with the target variable were
selected as predictors (P <= 0.05). They are presented and defined in Table 2.
Table 2. Variables significantly correlated with the target variable according to logistic regression
Variable | Variable Definition | Categories of Values | B | Wald | Sig. | Exp(B)
Sex | gender | 1 = male; 0 = female | 1.104 | 6.337 | 0.012 | 3.018
CP | chest pain type | [1-4] | 0.731 | 13.648 | 0.000 | 2.077
RBP | resting blood pressure | [94-200] | 0.023 | 5.238 | 0.022 | 1.023
EIA | exercise induced angina | [0-1] | 1.236 | 10.182 | 0.001 | 3.442
NUM | number of major vessels (0-3) colored by fluoroscopy | [0-3] | 1.133 | 25.224 | 0.000 | 3.106
Thal | normal, fixed defect, reversible defect | [3, 6, 7] | 0.397 | 16.848 | 0.000 | 1.488
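The last column of Table 2 can be read as an odds ratio: exponentiating a logistic regression coefficient B gives the factor by which the odds of heart disease are multiplied for a one-unit increase in that variable. The sketch below recomputes this from the B values in Table 2, assuming the column labeled Exp is exp(B); the small differences from the reported values would be expected from rounding of B.

```python
from math import exp

# Coefficients (B) and reported Exp values, copied from Table 2.
coefficients = {"Sex": 1.104, "CP": 0.731, "RBP": 0.023,
                "EIA": 1.236, "NUM": 1.133, "Thal": 0.397}
reported = {"Sex": 3.018, "CP": 2.077, "RBP": 1.023,
            "EIA": 3.442, "NUM": 3.106, "Thal": 1.488}

# Odds ratio for a one-unit increase in each predictor.
odds_ratios = {name: exp(b) for name, b in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: exp(B) = {ratio:.3f}, reported = {reported[name]:.3f}")
```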
This section presents the experimental results and the analysis done for this study. In this work, four
classifiers were applied: C5.0, SVM, KNN and neural network. The data were divided into a training set and
a test set (70% and 30%, respectively). The training set is used to build the classifier and the test set is used
to validate it. Model development is assessed in two main steps: model fitness and model accuracy. To
calculate the model fitness criteria we used the training set; to compute the model accuracy measurements
we used the test set, which is much more valuable for judging the accuracy of our models. The results of
these experiments are shown in Table 3.
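A 70%/30% partition of the records can be sketched as a shuffle followed by a cut. This is a minimal illustration, assuming a simple random split with a fixed seed; the function name and seed are hypothetical, and the exact partition sizes obtained in practice depend on the tool and partitioning method used.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 270 patient records: here just their indices.
records = list(range(270))
train, test = train_test_split(records)
print(len(train), len(test))
```

The classifier is then fitted on the training records only, and the test records are held back for the accuracy measurements.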
Table 3. Comparison of model fitness and model accuracy of the applied machine learning algorithms
Model fitness is computed on the training set; model accuracy is computed on the testing set.
Algorithm | Specificity (training) | Sensitivity (training) | Precision (training) | Accuracy (training) | Specificity (testing) | Sensitivity (testing) | Precision (testing) | Accuracy (testing)
C5.0 | 89.62% | 84.61% | 85.71% | 87.50% | 90.90% | 95.23% | 90.90% | 93.02%
SVM | 84.90% | 79.48% | 79.48% | 82.61% | 90.90% | 80.95% | 89.47% | 86.05%
KNN | 91.50% | 79.48% | 87.32% | 86.41% | 88.63% | 88.09% | 88.09% | 88.37%
Neural Network | 91.50% | 78.20% | 87.14% | 85.87% | 86.36% | 73.80% | 83.78% | 80.23%
The C5.0 decision tree built the model with the greatest accuracy: its prediction accuracy is 93.02%.
The accuracies obtained from the other classifiers, KNN, SVM and neural network, were 88.37%, 86.05%
and 80.23%, respectively. Analyzing variable importance in the C5.0 model shows that features such as
Thal, CP and Slope are particularly important in the prediction of heart diseases (Figure 2).
Figure 2. Variable importance for heart disease prediction based on the C5.0 model
Figures 3 and 4 are comparative ROC curves based on the risk of heart diseases. These figures show
the ROC curves for logistic regression and the C5.0 decision tree. C5.0 outperformed logistic regression,
with an area under the curve (AUC) of 0.869; the AUC for logistic regression was 0.835. Overall, these AUC
results reveal the better performance of the C5.0 decision tree classification algorithm.
Figure 3. ROC curve for logistic regression
Figure 4. ROC curve for C5.0 decision tree
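The AUC values compared above have a simple interpretation: the AUC equals the probability that a randomly chosen patient with heart disease receives a higher predicted score than a randomly chosen patient without it. The sketch below computes the AUC in exactly that pairwise way, counting ties as one half; the function name, labels and scores are hypothetical and unrelated to the study's data.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = disease present) and predicted scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(round(roc_auc(labels, scores), 3))
```

A model that ranks every positive case above every negative case reaches an AUC of 1, and random scoring gives about 0.5.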
In studies comparing data mining tools on heart disease data sets [34], [35], variables such as blood
pressure, blood sugar, age and sex showed a significant association with heart diseases. The study conducted
by Jasmine Nahar and her colleagues [16] also pointed out that sex was highly important in predicting heart
disease, whereas in this study features such as resting blood pressure, sex, chest pain type, exercise induced
angina and number of major vessels played a major role. Zahra Alizadeh Sani et al. [36] used the C4.5 and
Bagging algorithms to diagnose coronary heart disease and reported the best accuracy rate for the C4.5
algorithm. K. Rajeswari et al. [18] applied a neural network to ischemic heart disease; the accuracy obtained
for training and testing was 89.4% and 82.2%, respectively. T. John Peter and K. Somasundaram [37] used a
hybrid attribute selection method for the prediction of heart disease; the accuracy obtained by this model was
83.62%. Kemal Polat and Salih Gunes [38], using the C4.5 decision tree algorithm, obtained 92.59%
accuracy.
In this study, KNN, SVM, C5.0, logistic regression and neural network were implemented on a
UCI dataset. Of the investigated methods, the decision tree achieved the best performance. Different issues
influence the performance of the applied models, including the type of problem and the type of input data
(discrete or continuous). The dataset was mainly discrete, and the decision tree is also able to handle
numerical data. Because the output variable was labeled with two classes, ‘with’ and ‘without’ heart disease,
the decision tree yielded better performance than the other algorithms. Decision trees are able to generate
understandable rules, can perform classification without requiring much computation, and clearly indicate
which fields are most important for prediction or classification.
