International Journal of Electrical and Computer Engineering (IJECE)
Vol. 5, No. 6, December 2015, pp. 1569~1576
ISSN: 2088-8708
Comparing Performance of Data Mining Algorithms in Prediction Heart Diseases
Moloud Abdar, Sharareh R. Niakan Kalhori, Tole Sutikno, Imam Much Ibnu Subroto, Goli Arji
Department of Engineering, Damghan University, Iran
Department of Health Information Management, Tehran University of Medical Sciences, Iran
Department of Electrical Engineering, Universitas Ahmad Dahlan, Yogyakarta, Indonesia
Department of Informatics Engineering, Universitas Islam Sultan Agung, Semarang, Indonesia
Health Information Management, Tehran University of Medical Sciences, Iran
Article Info
Article history:
Received Aug 4, 2015
Revised Oct 11, 2015
Accepted Oct 27, 2015
Heart diseases are among the nation's leading causes of mortality and morbidity. Data
mining techniques can predict the likelihood of patients developing heart disease. The
purpose of this study is to compare different data mining algorithms for the prediction of
heart diseases. This work applied and compared data mining techniques to predict the
risk of heart diseases. After feature analysis, models based on six algorithms, including
decision tree, neural network, support vector machine and k-nearest neighbor, were
developed and validated. The C5.0 decision tree built the model with the greatest
accuracy, 93.02%; the accuracies of KNN, SVM and the neural network were 88.37%,
86.05% and 80.23% respectively. The results produced by the decision tree are simple to
interpret and apply; its rules can be easily understood by different clinical practitioners.
C5.0 Algorithm
Data Mining
Heart Disease
Neural Network
Copyright © 2015 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Goli Arji,
Faculty of Allied Medical Sciences
Tehran University of Medical Sciences
Tehran, Iran
According to the latest statistics from the World Health Organization (WHO), heart diseases receive a
great deal of attention in medical research because of their impact on human health [1]. Cardiovascular
disease is the number one cause of death in industrialized countries; it has a major impact not only on
individuals and their quality of life but also on public health costs and national economies. Diagnosing
heart disease is a costly decision. Artificial Intelligence (AI) techniques have been widely used in medical
diagnosis. With the advancement of science, the volume of data accumulated in various fields has increased
to the point that it is known as the information explosion [2]. Analyzing the accumulated data can reveal
useful hidden information. Data mining, a relatively new science, makes it possible to extract this hidden
knowledge. It reveals useful relationships among the data, and these rules can be applied to support correct
decision making [3],[4]. Classification is one of the subdivisions of data mining and can be expressed in the
form of If-Then rules. Its purpose is to predict a variable based on other features, known as predictors.
Neural networks, support vector machines and decision trees are different forms of classification
algorithms [5-9]. The purpose of this study is to compare different machine learning algorithms for the
prediction of heart diseases.
This section summarises various technical articles on the KDD process and data mining classification
techniques applied to heart disease datasets:
Ram Bilas Pachori and his colleagues [10] studied the diagnosis of heart disease using the
tunable-Q wavelet transform applied to heart rate signals. Because manual data entry is error-prone and
time consuming, the Tunable-Q Wavelet Transform (TQWT) method is recommended in that study.
Using the least squares support vector machine (LS-SVM), they reported an accuracy of 96.8%,
a sensitivity of 100%, and a specificity of 93.7%.
Another study, conducted by Yongqiang Lyu et al. [11], presents an evaluation model of
coronary heart disease (CHD) based on a data mining algorithm. That research proposes a new dynamic
model, based on a linear time-invariant approach, that makes lifetime assessment of CHD possible. The
model result based on SYNTAX scores indicates a possible error of 5%. In another study [12], the J4.8
decision tree method was used, and the reported precision was 84.1 percent.
In another study, Sumit Bhatia et al. [13] used a genetic algorithm, SVM and SSVM for the
classification of cardiac patients. The features were selected by the genetic algorithm to give the SSVM the
best set of inputs. The precision obtained was 72.55%, while GA-SSVM improved the result to a precision
of 90.57%. Peter C. Austin and colleagues [14] discuss heart failure in their paper. The associated
physicians divided the patients into two groups, "with" and "without" disease. They found that the use of
decision trees in data mining gives better results than a regression model. Saba Bashir et al. [15] applied
the MV5 algorithm, and its precision was 88.52%.
Another study, by Jesmin Nahar et al. [16], looked for relationships between heart disease
risk factors in men and women. It reports that the risk of coronary heart disease is lower in women than in
men, and that by exercising both men and women can more easily overcome chest pain. The paper
introduces "Rest ECG" in both its normal and hyper forms, and "Slope being flat", as risk factors.
However, the research results indicate that Rest ECG is a risk factor for men only in its hyper form. The
study concludes that Rest ECG should be considered an important factor for predicting heart disease in
women. The techniques Apriori, Predictive Apriori and Tertius were compared with each other, and the
precision of Predictive Apriori was 90%.
Kyle E. Walker and Sean M. Crotty [17] note that heart disease is the principal cause of death in
Texas, USA. They therefore performed a study of different areas of Texas using cluster analysis, and the
results show that factors such as poor hygiene, economic deprivation and other conditions affect the
prevalence of the disease.
K. Rajeswari and colleagues [18] studied heart disease using a neural network. They examined the
influence of feature selection on a neural network algorithm for identifying patients with ischemic heart
disease. Twelve features were used in the paper. The results show that when all the features (attributes)
are applied, the precision is 89.4% in training mode and 82.2% in test mode. An interesting point in the
conclusion is that any reduction in the input features decreases the precision in both training and test
modes. A.V. Senthil Kumar [19] applied a fuzzy mechanism to cardiac patients; the precision calculated in
that paper was 94.11%. These are some examples of research done on cardiac patients with different
techniques.
The present study was conducted using data from the University of California, Irvine (UCI)
repository. The data include 13 features and are classified into 2 classes, "with" and "without" heart
disease. After feature analysis, models based on six algorithms, including decision tree, neural network,
support vector machine and k-nearest neighbor, were developed and validated.
2.1. C5.0 Algorithm
The C5.0 algorithm, developed from the C4.5 algorithm, is one of the most important and widely used
algorithms in data mining. C4.5 itself is an extension of the ID3 algorithm. C5.0 can be applied for
classification either as a decision tree or as a set of rules. Because rule sets are easy to understand,
they are preferred in many applications. The strengths of the algorithm are its handling of missing values
and of large numbers of inputs, as well as its short training time [20], [21], [22], [23].
Let S be the training set and let X be an attribute with n distinct values, so that S is divided into n
subsets (S1, S2, S3, ..., Sn). To test the attributes, the algorithm uses a criterion called the gain ratio [24].
The number of samples in S that belong to class Ci (i = 1, 2, 3, ..., N) is written $\mathrm{freq}(C_i, S)$ and
the number of samples in S is $|S|$, so the probability that an instance belongs to Ci is
$\mathrm{freq}(C_i, S)/|S|$.
The information of the training set can be calculated according to the formula

$$\mathrm{Info}(S) = -\sum_{i=1}^{N} \frac{\mathrm{freq}(C_i, S)}{|S|} \log_2 \frac{\mathrm{freq}(C_i, S)}{|S|}$$

Info(S) is the information needed to identify the class of the samples in S. After the division of S into
its subsets, the gain ratio is calculated as follows:

$$\mathrm{Info}_X(S) = \sum_{j=1}^{n} \frac{|S_j|}{|S|} \, \mathrm{Info}(S_j)$$

$$\mathrm{Gain}(X) = \mathrm{Info}(S) - \mathrm{Info}_X(S)$$

$$\mathrm{SplitInfo}(X) = -\sum_{j=1}^{n} \frac{|S_j|}{|S|} \log_2 \frac{|S_j|}{|S|}$$

$$\mathrm{GainRatio}(X) = \frac{\mathrm{Gain}(X)}{\mathrm{SplitInfo}(X)}$$
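The gain ratio criterion can be illustrated with a minimal Python sketch. It assumes a single categorical attribute and uses the standard entropy-based definitions; the function names and the toy data are hypothetical and are not taken from the study's data set or from the C5.0 implementation itself.

```python
import math
from collections import Counter


def entropy(labels):
    """Info(S): entropy of the class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain ratio of one categorical attribute X with respect to the class labels."""
    total = len(labels)
    # Partition S into subsets S_j, one per distinct value of X.
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    # Info_X(S): weighted entropy of the subsets.
    info_x = sum(len(s) / total * entropy(s) for s in subsets.values())
    gain = entropy(labels) - info_x
    # SplitInfo(X): entropy of the partition sizes.
    split_info = -sum(len(s) / total * math.log2(len(s) / total)
                      for s in subsets.values())
    # An attribute with a single value cannot split S; report 0 in that case.
    return gain / split_info if split_info > 0 else 0.0


# Toy data (hypothetical): one binary attribute and the class of eight cases.
attribute = [1, 1, 1, 0, 0, 0, 0, 1]
heart_disease = ["with", "with", "with", "without",
                 "without", "without", "with", "without"]
print(round(gain_ratio(attribute, heart_disease), 3))
```

A decision tree learner would evaluate this quantity for every candidate attribute and split on the one with the highest value.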
2.2. SVM Algorithm
Support Vector Machine (SVM) is a supervised learning algorithm introduced by Vapnik in 1995. The
algorithm aims to keep the generalization error low. It constructs a "hyperplane" that divides the data into
classes so that all samples belonging to one class lie on one side and the rest on the other side. A linear
SVM classifier chooses, among the lines that separate the classes, the one with the largest
margin [13], [25].
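The idea of the largest margin can be made concrete with a small sketch. It does not train an SVM; it only compares two hand-picked separating lines on hypothetical two-dimensional points and reports which one leaves the larger margin. All names, weights and data values are illustrative assumptions.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Class label (+1 or -1) assigned by the hyperplane (w, b)."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def geometric_margin(w, b, samples, labels):
    """Smallest signed distance of any sample to the hyperplane (labels are +1 / -1)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm
               for x, y in zip(samples, labels))


# Hypothetical, linearly separable points.
samples = [(1.0, 1.0), (2.0, 2.5), (4.0, 4.0), (5.0, 3.5)]
labels = [-1, -1, 1, 1]

# Two hypothetical separating lines: x1 + x2 = 6 and x1 = 2.5.
line_a = ((1.0, 1.0), -6.0)
line_b = ((1.0, 0.0), -2.5)

margin_a = geometric_margin(*line_a, samples, labels)
margin_b = geometric_margin(*line_b, samples, labels)
print(margin_a > margin_b)  # line_a separates the classes with more room
```

Both lines classify every point correctly, but a maximum-margin classifier would prefer the first because its closest point lies farther from the boundary.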
2.3. KNN Algorithm
The k-nearest neighbor algorithm is a method for classifying cases based on their similarity to other
cases. Cases that are close to each other are called "neighbors". When a new case is presented, its distance
from each of the cases in the model is calculated. The cases with the smallest distances are its nearest
neighbors, that is, the most similar cases, and the new case is placed into the group that contains most of
its nearest neighbors. The algorithm can also predict values for a continuous target. In this situation, the
average or the median target value of the nearest neighbors is used as the predicted value of the new
case [26].
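The procedure described above can be sketched in a few lines of Python. The sketch assumes Euclidean distance, majority voting for classification and the mean for a continuous target; the function names and the toy cases are hypothetical.

```python
import math
from collections import Counter


def nearest_neighbors(train_x, new_x, k):
    """Indices of the k training cases closest to new_x (Euclidean distance)."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], new_x))
    return order[:k]


def knn_classify(train_x, train_y, new_x, k=3):
    """Assign new_x to the class held by most of its k nearest neighbors."""
    idx = nearest_neighbors(train_x, new_x, k)
    return Counter(train_y[i] for i in idx).most_common(1)[0][0]


def knn_regress(train_x, train_y, new_x, k=3):
    """Predict a continuous target as the mean over the k nearest neighbors."""
    idx = nearest_neighbors(train_x, new_x, k)
    return sum(train_y[i] for i in idx) / k


# Hypothetical cases with two numeric features each.
train_x = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_class = ["without", "without", "without", "with", "with", "with"]
train_value = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_classify(train_x, train_class, (2, 2)))
print(knn_classify(train_x, train_class, (6, 5)))
print(knn_regress(train_x, train_value, (2, 2)))
```

In practice the features would usually be rescaled first, since a variable with a wide range (for example serum cholesterol) would otherwise dominate the distance.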
2.4. Neural Network Algorithm
An Artificial Neural Network is a data processing algorithm inspired by the human brain. The system
includes a large number of small processing units. The units act as an interconnected network, working in
parallel to solve a problem. In such networks, a data structure is programmed to behave like a neuron; this
data structure is called a neuron [27], [28], [29], [30].
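As an illustration of the neuron data structure, the following sketch computes the output of a tiny feed-forward network with one hidden layer. The sigmoid activation, the layer sizes and every weight are assumptions chosen for the example; they do not describe the network used in this study, and no training step is shown.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of the inputs passed through a sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def forward(inputs, hidden_layer, output_neuron):
    """Forward pass through one hidden layer and a single output neuron."""
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    w_out, b_out = output_neuron
    return neuron(hidden, w_out, b_out)


# Hypothetical weights: two hidden neurons, each with two inputs.
hidden_layer = [((0.8, -0.4), 0.1), ((-0.3, 0.9), -0.2)]
output_neuron = ((1.2, -0.7), 0.05)

score = forward((1.0, 0.0), hidden_layer, output_neuron)
print(0.0 < score < 1.0)  # the output can be read as a class score
```

Training a network means adjusting these weights and biases so that the output score matches the known class of the training cases.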
2.5. Accuracy Measurement
To evaluate the prediction rate, several indices such as specificity, sensitivity, precision and
accuracy are used to assess the models' validity. These indices (equations 6-9) are calculated from
the confusion matrix (Figure 1). This matrix is a useful tool for analyzing how well a classification
method assigns observations to the various categories. In the ideal state, most of the observations
are located on the main diagonal of the matrix, and the remaining values of the matrix are zero or
near zero [31], [32].
FN = The number of positively labeled data that have been falsely classified as "Negative".
TN = The number of negatively labeled data that have been correctly classified as "Negative".
TP = The number of positively labeled data that have been correctly classified as "Positive".
FP = The number of negatively labeled data that have been falsely classified as "Positive".
Figure 1. Confusion matrix
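The four indices follow directly from the confusion matrix counts. The sketch below uses the standard definitions; the function name is hypothetical, and the example counts are an assumption: they are not reported in the paper but were chosen because they reproduce the C5.0 testing-set percentages of Table 3 to within rounding.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard indices derived from the four confusion matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),                # positives correctly found
        "specificity": tn / (tn + fp),                # negatives correctly found
        "precision": tp / (tp + fp),                  # predicted positives that are right
        "accuracy": (tp + tn) / (tp + fn + tn + fp),  # all correct decisions
    }


# Assumed counts (inferred, not taken from the paper).
metrics = classification_metrics(tp=40, fn=2, tn=40, fp=4)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

With these assumed counts the accuracy is 80/86, which corresponds to the 93.02% reported for C5.0 on the testing set.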
2.6. Data Set
In this study, 270 records with 13 features were used [33]. The patients' attributes applied for
modeling, their definitions and their ranges of values are presented in Table 1.
Table 1. Patients' attributes applied for modeling, their definitions and their ranges of values

| Variable | Variable Definition | Categories of Values |
|---|---|---|
| Age | Age of patient | [29-77] |
| Sex | Gender of patient | 1 = male; 0 = female |
| CP | chest pain type | [1-4] |
| RBP | resting blood pressure | [94-200] |
| SC | serum cholesterol in mg/dl | [126-564] |
| FBS | fasting blood sugar > 120 mg/dl | [0-1] |
| RER | resting electrocardiographic results | [0-2] |
| MHRA | maximum heart rate achieved | [71-202] |
| EIA | exercise induced angina | [0-1] |
| Oldpeak | ST depression induced by exercise relative to rest | [0-6.2] |
| Slope | the slope of the peak exercise ST segment | [1-3] |
| NUM | number of major vessels (0-3) colored by fluoroscopy | [0-3] |
| Thal | Normal, fixed defect, reversible defect | [3, 6, 7] |
| Variable to be predicted | Class of heart disease | Absence (1) or presence (2) of heart disease |
Using logistic regression, the variables that are significantly correlated with the target variable were
selected as predictors (P<=0.05). They are presented and defined in Table 2.
Table 2. Variables significantly correlated with the target variable according to logistic regression

| Variable | Variable Definition | Categories of Values | B | Wald | Sig | Exp(B) |
|---|---|---|---|---|---|---|
| Sex | Gender | 1 = male; 0 = female | 1.104 | 6.337 | 0.012 | 3.018 |
| CP | chest pain type | [1-4] | 0.731 | 13.648 | 0.000 | 2.077 |
| RBP | resting blood pressure | [94-200] | 0.023 | 5.238 | 0.022 | 1.023 |
| EIA | exercise induced angina | [0-1] | 1.236 | 10.182 | 0.001 | 3.442 |
| NUM | number of major vessels (0-3) colored by fluoroscopy | [0-3] | 1.133 | 25.224 | 0.000 | 3.106 |
| Thal | Normal, fixed defect, reversible defect | [3, 6, 7] | 0.397 | 16.848 | 0.000 | 1.488 |
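The last column of Table 2 can be read as an odds ratio: in logistic regression, Exp(B) is the exponential of the coefficient B. The short check below recomputes it from the reported B values. The helper name is hypothetical, and small differences in the third decimal are expected because the B values in the table are themselves rounded.

```python
import math


def odds_ratio(b):
    """Odds ratio implied by a logistic regression coefficient B."""
    return math.exp(b)


# (B, reported Exp(B)) pairs copied from Table 2.
table2 = {
    "Sex": (1.104, 3.018),
    "CP": (0.731, 2.077),
    "RBP": (0.023, 1.023),
    "EIA": (1.236, 3.442),
    "NUM": (1.133, 3.106),
    "Thal": (0.397, 1.488),
}

for name, (b, reported) in table2.items():
    print(f"{name}: exp({b}) = {odds_ratio(b):.3f} (reported {reported})")
```

For example, the coefficient for exercise induced angina corresponds to odds of heart disease roughly 3.4 times higher when angina is present, all other predictors being equal.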
This section presents the experimental results and analysis done for this study. In this work, four
classifiers were applied: C5.0, SVM, KNN and Neural Network. The data were divided into a training set
and a testing set (70% and 30% respectively). The training set is used to build the classifier and the
testing set is used to validate it. Model development is conducted in two main steps, model fitness and
model accuracy. To calculate the model fitness criteria we used the training set; to compute the model
accuracy measurements we used the testing set, which is much more valuable for judging the models'
accuracy. The results of these experiments are shown in Table 3.
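A random 70%/30% partition of the records can be sketched as follows. The function name, the fixed seed and the use of integer stand-ins for patient records are assumptions made for the example; the study does not state how its partition was drawn, and a tool that assigns each record at random would give only approximately these proportions.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=0):
    """Shuffle the records and split them into a training and a testing part."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


records = list(range(270))  # stand-ins for the 270 patient records
train, test = train_test_split(records)
print(len(train), len(test))
```

Only the training part is shown to the classifier while it is being built, so the indices computed on the testing part estimate how the model behaves on unseen patients.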
Table 3. Comparison of model fitness (using the training set) and model accuracy (using the testing set) of the applied machine learning algorithms

| Algorithm | Specificity (training) | Sensitivity (training) | Precision (training) | Accuracy (training) | Specificity (testing) | Sensitivity (testing) | Precision (testing) | Accuracy (testing) |
|---|---|---|---|---|---|---|---|---|
| C5.0 | 89.62 % | 84.61 % | 85.71 % | 87.50 % | 90.90 % | 95.23 % | 90.90 % | 93.02 % |
| SVM | 84.90 % | 79.48 % | 79.48 % | 82.61 % | 90.90 % | 80.95 % | 89.47 % | 86.05 % |
| KNN | 91.50 % | 79.48 % | 87.32 % | 86.41 % | 88.63 % | 88.09 % | 88.09 % | 88.37 % |
| Neural Network | 91.50 % | 78.20 % | 87.14 % | 85.87 % | 86.36 % | 73.80 % | 83.78 % | 80.23 % |
The C5.0 decision tree built the model with the greatest accuracy, with a prediction accuracy of
93.02%. The accuracies obtained from the other classifiers were 88.37% for KNN, 86.05% for SVM and
80.23% for the neural network. By analyzing the variable importance in the C5.0 model, we find that
features such as Thal, CP and Slope are particularly important in the prediction of heart diseases
(Figure 2).
Figure 2. Variable importance for heart disease prediction based on the C5.0 model
Figures 3 and 4 show comparative ROC curves for the risk of heart diseases, one for logistic
regression and one for the C5.0 decision tree. C5.0 outperformed logistic regression, with an area under
the curve (AUC) of 0.869 versus 0.835 for logistic regression. Overall, these AUC results indicate better
performance of the C5.0 decision tree classification algorithm.
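The area under a ROC curve can be computed without drawing the curve, as the fraction of (positive, negative) pairs in which the positive case receives the higher score. The sketch below uses that pairwise definition; the function name and the example scores are hypothetical and are unrelated to the models compared in Figures 3 and 4.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted risks for three patients with (1) and three without (0) disease.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(round(roc_auc(labels, scores), 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the higher value is read as better discrimination.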
Figure 3. ROC curve for logistic regression
Figure 4. ROC curve for C5.0 decision tree
In studies comparing data mining tools on heart disease data sets [34], [35], variables such as
blood pressure, blood sugar, age and sex showed a significant association with heart diseases. The study
conducted by Jesmin Nahar and her colleagues [16] also pointed out that sex was highly important in
predicting heart disease, whereas in this study features such as resting blood pressure, sex, chest pain
type, exercise induced angina and number of major vessels played a major role. Zahra Alizadeh Sani
et al. [36] used the C4.5 and Bagging algorithms to diagnose coronary heart disease and reported the best
accuracy rate for the C4.5 algorithm. K. Rajeswari et al. [18] applied a neural network to ischemic heart
disease; the accuracy obtained for training and testing was 89.4 % and 82.2 % respectively. T. John Peter
and K. Somasundaram [37] used a hybrid attribute selection method for the prediction of heart disease.
The accuracy obtained by this model was 83.62 %. Kemal Polat and Salih Gunes [38] obtained 92.59 %
accuracy using the C4.5 decision tree algorithm.
In this study, KNN, SVM, C5.0, Logistic Regression and Neural Network were implemented on
the UCI dataset. Among the investigated methods, the decision tree achieved the best performance. Several
issues influence the performance of the applied models, including the type of problem and the type of
input data (discrete or continuous). The dataset was mainly discrete, and the decision tree is also able to
handle numerical data. Because the output variable was labeled with two classes, 'with' and 'without'
heart disease, the decision tree yielded better performance than the other algorithms. Decision trees
generate understandable rules, perform classification without requiring much computation, and clearly
indicate which fields are most important for prediction or classification.
[1] WHO Report, the Top 10 Causes of Death, last accessed 12/9/2013 from http://, ( accessed 01.04.2015).
[2] Hamid Bagheri, Abdusalam Abdullah Shaltooki. Big Data: Challenges, Opportunities and Cloud Based Solutions.
International Journal of Electrical and Computer Engineering (IJECE), 2014; 5(2): 340-343.
[3] Vijayajothi P, Tan SY, Sarinder KD, Amandeep SS. A methodological review of data mining techniques in
predictive medicine: An application in hemodynamic prediction for abdominal aortic aneurysm disease. Published by
Elsevier, Biocybernetics and Biomedical Engineering, 2014; 34(3):139-145.
[4] K.C. Tan, E.J. Teoh, Q. Yu, K.C. Goh. A hybrid evolutionary algorithm for attribute selection in data mining. Expert
Systems with Applications, 2009; 36: 8616–8630.
[5] Nikola K, Elisa C. Spiking neural network methodology for modelling classification and understanding of EEG
spatio-temporal data measuring cognitive processes. Information Sciences, 2015; 294: 565–575.
[6] F. Lotte, M. Congedo, A. Lécuyer, F. Lamarche, B.A. Arnaldi. Review of classification algorithms for EEG-based
brain–computer interfaces. J. Neural Eng. 2007; 4(2):1-25.
[7] C. Anderson, D. Peterson. Recent advances in EEG signal analysis and classification, in: R. Dybowski, V. Gant
(Eds.). Clinical Applications of Artificial Neural Networks, Cambridge University Press, UK. 2001: 175–191
(Chapter 8).
[8] C. Anderson, E. Stolz, S. Shamsunder. Multivariate autoregressive models for classification of spontaneous
electroencephalogram during mental tasks. IEEE Trans. Biomed. Eng. 1998; 45 (3): 277–286.
[9] K. Padmavathi, K. Sri Ramakrishna. Detection of Atrial Fibrillation using Autoregressive modeling. International
Journal of Electrical and Computer Engineering (IJECE), 2015; 5(1): 64-70.
[10] Shivnarayan P, Ram BP, U. Rajendra A. Automated diagnosis of coronary artery disease using tunable-Q wavelet
transform applied on heart rate signals. Knowledge-Based Systems, 2015; 82: 1-10.
[11] Yongqiang L, Jiaming H, Yiran W, Jijiang Y , Yida T, Wenyao W, Nazim A. Dynamic evaluation model of coronary
heart disease for ubiquitous healthcare. Computers in Industry, 2015; 69: 35-44.
[12] Mai Sh, Tim T, Rob S. Using Decision Tree for Diagnosing Heart Disease Patients. AusDM'11, Proceedings of the
9-th Australasian Data Mining Conference, Ballarat, Australia, 2011.
[13] Sumit B, Praveen P, G.N. Pillai. SVM Based Decision Support System for Heart Disease Classification with Integer-
Coded Genetic Algorithm to Select Critical Features. WCECS. Proceedings of the World Congress on Engineering
and Computer Science, San Francisco, USA, October 22 – 24, 2008.
[14] Peter C. Austin, Jack V. Tu, Jennifer E. Ho, Daniel Levy, Douglas S. Lee. Using methods from the data-mining and
machine-learning literature for disease classification and prediction: a case study examining classification of heart
failure subtypes. Journal of Clinical Epidemiology, 2013; 66(4): 398-407.
[15] Saba B, Usman Q, Farhan HK, M. Younus J. MV5: A Clinical Decision Support Framework for Heart Disease
Prediction Using Majority Vote Based Classifier Ensemble. Arab J Sci Eng, 2014; 39(11): 7771-7783.
[16] Jesmin N, Tasadduq I, Kevin ST, Yi-Ping Ph Ch. Association rule mining to detect factors which contribute to heart
disease in males and females. Expert Systems with Application, 2013; 40(4): 1086–1093.
[17] Kyle E. Walker, Sean M. Crotty. Classifying high-prevalence neighborhoods for cardiovascular disease in Texas.
Applied Geography, 2014; 57: 22-31.
[18] K.Rajeswari, V.Vaithiyanathan, T.R. Neelakantan. Feature Selection in Ischemic Heart Disease Identification using
Feed Forward Neural Networks. International Symposium on Robotics and Intelligent Sensors 2012 (IRIS 2012),
Procedia Engineering, 2012; 41: 1818–1823.
[19] A.V Senthil Kumar. Generating Rules for Advanced Fuzzy Resolution Mechanism to Diagnosis Heart Disease.
International Journal of Computer Applications, 2013; 77(11): 6-12.
[20] Quinlan J R. Induction of decision trees. Machine Learning, 1986; 4: 81–106.
[21] Quinlan J R. C4.5: Programs for machine learning. Machine Learning, 1994; 3:235–240.
[22] Quinlan J R. Bagging, Boosting and C4.5. Proceedings of 14th National Conference on Artificial Intelligence, 1996:
[23] Xindong W , Vipin K , J. Ross Q , Joydeep Gh, Qiang Y, Hiroshi M , Geoffrey J. M, Angus Ng, Bing L, Philip S.
Yu, Zhi-Hua Z, Michael S, David JH, Dan S. Top 10 algorithms in data mining. Springer, 2008; 14(1): 1-37.
[24] Shuonan H, Rongtao H, Xinming S, Jun W, Chengshang Y. Research on C5.0 Algorithm Improvement and the
Test in Lightning Disaster Statistics. International Journal of Control and Automation, vol. 7, no. 1, pp. 181-190.
[25] Vapnik, V. N. The nature of statistical learning theory. New York:Springer, 1995.
[26] Yazdani A, Ebrahimi T, Hoffmann U. Classification of EEG signals using Dempster Shafer theory and a K-nearest
neighbor classifier. IEEE. In: Proc of the 4th int EMBS conf on neural engineering, 2009: 327–30.
[27] Daubechies I. The wavelet transform, time-frequency localization and signal analysis. IEEE. Trans Inform Theor,
1990; 36: 961–1005.
[28] Demuth H, Beale M, Hagan M. Neural network Toolbox™ user’s guide. The MathWorks, Inc.; 2009.
[29] Leng, G., McGinnity, T.M., Prasad, G. Design for self-organizing fuzzy neural networks based on genetic
algorithms. IEEE. Trans. Fuzzy Syst. 2006; 14 (6): 755–766.
[30] Frank H. F. Leung, H. K. Lam, S. H. Ling, Peter K. S. Tam . Tuning of the structure and parameters of a neural
network using an improved genetic algorithm. IEEE. Trans. Neural Networks, 2003; 14 (1): 79–88.
[31] Alizadeh S, Ghazanfari M, Teimorpour B. Data Mining and Knowledge Discovery. Publication of Iran University
of Science and Technology. 2nd ed, 2011. [Persian].
[32] Han J, Kamber M. Chapter 1: Introduction. In: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. 2nd
ed, 2006.
[33] UCI Archive, Machine Learning Repository,
databases/statlog/heart/ (accessed 02.05.2015).
[34] G.Subbalakshmi, K. Ramesh, M. Chinna Rao. Decision Support in Heart Disease Prediction System using Naive
Bayes. Indian Journal of Computer Science and Engineering (IJCSE). 2011; 2(2): 170-176.
[35] Aditya M, Prince K, Himanshu A, Pankaj K. Early Heart Disease Prediction Using Data Mining Techniques.
Computer Science & Information Technology (CS & IT). 2014: 53-59.
[36] Roohallah A, Jafar H, Zahra A, Hoda M, Reihane B, Asma Gh, Fahime Kh, Fariba A. Diagnosing Coronary Artery
Disease via Data Mining Algorithms by Considering Laboratory and Echocardiography Features. Official Journal of
Rajaie Cardiovascular Medical and Research Center. 2013; 2(3): 133-139.
[37] T. John Peter, K. Somasundaram. Study and Development of Novel Feature Selection Framework for Heart Disease
Prediction. International Journal of Scientific and Research Publications. 2012; 2(10): 1-7.
[38] Kemal Polat, Salih Gunes. A hybrid approach to medical decision support systems: Combining feature selection,
fuzzy weighted pre-processing and AIRS. Computer Methods and Programs in Biomedicine. 2007; 88: 164–174.
Moloud Abdar. He received his undergraduate (Bachelor) degree in Computer
Engineering (Software Engineering) from the University of Damghan, Iran in 2015. He
has more than 7 conference and journal papers about data mining. Currently, his
research interests include data mining, web and text mining, Artificial Intelligence
and Image
Goli Arji. She is a PhD student in health information management at Tehran
University of Medical Sciences. She is interested in data mining, fuzzy logic,
clinical decision support systems, telemedicine and consumer health informatics.
... There are five algorithms including decision tree, neural network, support vector machine and k-nearest neighbour, logistic regression are used for classification and comparison. Jolliffe I, et.a1., [9] Principal component analysis (PCA) is a mathematical algorithm that reduces the dimensionality of the data while retaining most of the variation in the dataset . It accomplishes reduction by identifying directions, called principal components, along which the variation in the data is maximal. ...
... Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (or Sometimes, principal modes of variation. [9] To perform dimensionality reduction while preserving as much of the randomness in the high-dimensional space is possible. Principal Component Analysis is realized on ...
... istic regression, and decision trees in (Mythili et. al. 2013) by applying on the Cleveland Heart Disease database, it is claimed that combining classification rules on the classifier produces better accuracy than the classifier alone. A comparative study was conducted on the data mining techniques to predict heart disease using the UCI dataset in (Moloud et. al. 2015). The features of gender, chest pain type, resting blood pressure, exercise-induced angina, number of significant vessels colored by fluoroscopy, and thal were the most dominating features. They concluded that the decision tree with a small set of features gives better accuracy than other mining techniques. ...
Data is an asset in the digital era, and enormous data was generating day by day in all the fields, including the healthcare industry. The data on the healthcare industry data consists of personal information and disease-related information about a patient and stored in various formats and units. Machine learning and Artificial Intelligence techniques will help us analyze the voluminous amount of data to identify the hidden patterns of a specific disease from the healthcare data and help us predict a particular disease in the future. In this paper, we proposed a decision support system to predict heart disease, especially cardiovascular disease, through machine learning algorithms. This system experimented with the reduced set feature of the UCI Machine learning repository dataset using a linear kernel-based support vector machine algorithm. This system has also compared it with other machine learning algorithms such as K-Nearest Neighbours, Decision tree, and Random forest in Python. All four machine learning algorithms' performance has been evaluated based on accuracy, misclassification rate, precision, recall, and f-score value. From the experimental results, SVM with a linear kernel function classification algorithm produces better accuracy of 95.08% compared with others for predicting heart disease.
... There were many kinds of research studies that have analyzed the performance of data mining algorithms, PPDM algorithms, and side effects (Celik et al., 2017;Arboleda, 2019;Hussain, 2019;Nopour et al., 2021;Abdar et al., 2015). However, extremely limited research comparing all PPDM algorithms based on their side effects has been investigated. ...
Full-text available
The data mining sanitization process involves converting the data by masking the sensitive data and then releasing it to public domain. During the sanitization process, side effects such as hiding failure, missing cost and artificial cost of the data were observed. Privacy Preserving Data Mining (PPDM) algorithms were developed for the sanitization process to overcome information loss and yet maintain data integrity. While these PPDM algorithms did provide benefits for privacy preservation, they also made sure to solve the side effects that occurred during the sanitization process. Many PPDM algorithms were developed to reduce these side effects. There are several PPDM algorithms created based on different PPDM techniques. However, previous studies have not explored or justified why non-traditional side effects were not given much importance. This study reported the findings of the side effects for the PPDM algorithms in a newly created web repository. The research methodology adopted for this study was Design Science Research (DSR). This research was conducted in four phases, which were as follows. The first phase addressed the characteristics, similarities, differences, and relationships of existing side effects. The next phase found the characteristics of non-traditional side effects. The third phase used the Privacy Preservation and Security Framework (PPSF) tool to test if non-traditional side effects occur in PPDM algorithms. This phase also attempted to find additional unknown side effects which have not been found in prior studies. PPDM algorithms considered were Greedy, POS2DT, SIF_IDF, cpGA2DT, pGA2DT, sGA2DT. PPDM techniques associated were anonymization, perturbation, randomization, condensation, heuristic, reconstruction, and cryptography. The final phase involved creating a new online web repository to report all the side effects found for the PPDM algorithms. A Web repository was created using full stack web development. 
AngularJS, Spring, Spring Boot and Hibernate frameworks were used to build the web application. The results of the study implied various PPDM algorithms and their side effects. Additionally, the relationship and impact that hiding failure, missing cost, and artificial cost have on each other was also understood. Interestingly, the side effects and their relationship with the type of data (sensitive or non-sensitive or new) was observed. As the web repository acts as a quick reference domain for PPDM algorithms. Developing, improving, inventing, and reporting PPDM algorithms is necessary. This study will influence researchers or organizations to report, use, reuse, or develop better PPDM algorithms.
... This algorithm splits the tree on the elements that yield higher entropy differences than the others. It can therefore separate samples into clearly distinct subtrees, with each leaf node holding the samples of its most frequent output class [41][42][43]. Random Forest: The random forest is a hybrid decision tree algorithm that combines various subtrees, each with a specified depth and number of nodes, as classifiers. The algorithm builds its decision trees flexibly by using multiple randomly chosen features for splitting. ...
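The entropy-based splitting described in this excerpt can be illustrated with a minimal sketch. It assumes plain information gain on categorical features, which may differ from the exact criterion used by the cited algorithm. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature_index):
    """Entropy reduction obtained by splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    # Weighted entropy of the subsets produced by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_split(rows, labels):
    """Index of the feature with the largest information gain."""
    return max(range(len(rows[0])), key=lambda i: information_gain(rows, labels, i))


# Hypothetical toy data: (chest_pain, high_blood_pressure) -> diagnosis
rows = [("yes", "yes"), ("yes", "no"), ("no", "yes"), ("no", "no")]
labels = ["sick", "sick", "healthy", "healthy"]
split_feature = best_split(rows, labels)
```

In this toy data the first feature separates the two classes perfectly, so it is the one chosen for the split.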
... The confusion matrix is a valuable tool for evaluating the performance of a classification method in data mining. In the confusion matrix, the correctly classified observations lie on the main diagonal, and the remaining values of the matrix should be zero or near zero [12]. FN = the total number of positive instances that have been falsely categorized as "Negative". ...
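As a minimal sketch of the definitions in this excerpt, the four cells of a binary confusion matrix can be counted as follows. The label strings, function names, and sample data are assumptions chosen for illustration.

```python
def confusion_counts(actual, predicted, positive="Positive"):
    """Count the four cells of a binary confusion matrix."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # positive, correctly categorized
        elif a == positive:
            fn += 1  # positive, falsely categorized as "Negative"
        elif p == positive:
            fp += 1  # negative, falsely categorized as "Positive"
        else:
            tn += 1  # negative, correctly categorized
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


def accuracy(counts):
    """Share of observations on the main diagonal (TP and TN)."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())


# Hypothetical labels for six patients.
actual = ["Positive", "Positive", "Positive", "Negative", "Negative", "Negative"]
predicted = ["Positive", "Positive", "Negative", "Negative", "Negative", "Positive"]
counts = confusion_counts(actual, predicted)
```

A good classifier keeps the off-diagonal counts (`FP` and `FN`) close to zero, which pushes the accuracy toward 1.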
Because weather is a random phenomenon, predicting it has always been a challenge for meteorologists all over the world. There are a number of approaches to predicting weather from collected atmospheric data. Rain forecasting is a complex and demanding task. It has been a challenge since ancient times because it is influenced by numerous parameters such as temperature, wind speed and direction, rainfall, humidity, station-level pressure, mean sea-level pressure, dry bulb temperature, dew point temperature, and vapour pressure. Various data mining techniques have been implemented for rain forecasting. Compared with conventional methods of predicting rainfall rate, methods that apply historical records and data mining technology produce more accurate results. Many researchers have done excellent work constructing forecasting models with data mining methods, but most of them test the prediction accuracy in only one geographical area. In this paper, we analyzed the performance of the k-NN, Random Forest, C5.0, and AdaBoost algorithms at different locations and compared their performance using precision, recall, F-measure, and classification accuracy. The daily surface data were collected from the India Meteorological Department (IMD), Pune, for 3 stations over the period 2005 to 2015. The k-NN algorithm achieved its best accuracy, 98.02%, on the Jodhpur dataset compared with the other datasets, with a 90:10 ratio of training to testing records and K = 10. The highest accuracy, 99.270%, was achieved by the AdaBoost algorithm.
... The healthcare industry today produces tremendous amounts of complex data about patients, disease diagnosis, hospital resources, and medical devices, which is hard to process by manual methods [8]. Data mining provides a set of tools and techniques to discover patterns and extract knowledge in order to provide better patient care; it combines statistical analysis, machine learning, and database technology to extract hidden patterns and relationships from large databases [9]. The recognition of heart disease from different factors or symptoms is a multi-layered problem, which is not free from false assumptions and is frequently accompanied by unpredictable effects. ...
Cardiovascular disease (CVD) is possibly the greatest cause of morbidity and mortality among the world's population. Prediction of heart disease is regarded as one of the most important subjects in clinical data analysis. The amount of data in the healthcare industry is massive. The data mining process transforms the huge volume of raw healthcare data into meaningful information that can support informed decisions and predictions. Some recent studies have also applied data mining techniques to CVD prediction. However, only very few studies have identified the features that play a crucial role in predicting CVDs. It is essential to select a combination of correct and significant features that can improve the performance of the prediction models. This study aims to identify significant features and data mining techniques that can improve the accuracy of predicting CVDs. Prediction models were built using different combinations of feature selection, modified teaching-learning optimization techniques, SVM, and boosting classification. The proposed strategy gives highly accurate results compared with existing classification methods.
Teachers are the most significant part of the educational system in terms of improving student learning and ensuring students' future success. A teacher's performance has a direct impact on student learning and student progress. The performance of a teacher in the classroom depends on various factors such as lecture preparation, teaching method/communication ability, utilization of teaching aids, linking coursework to day-to-day living, distribution of study materials, subject-matter expertise, completion of the curriculum preparation, punctuality and regularity, class control, and behavior with students. The aim of this paper is to predict a teacher's performance by using various machine learning algorithms. To predict teacher performance, we develop models using Decision Tree (CART), k-Nearest Neighbor (KNN), Naïve Bayes Classifier, Support Vector Machine (SVM), and Artificial Neural Network (ANN). We use the above 10 independent variables to develop the models. We collected primary data from students through a questionnaire called a feedback form. Data analysis was done using RStudio. This study observes that the Artificial Neural Network (ANN) had higher accuracy than the other algorithms.
A fundamental difficulty in the healthcare system is a lack of manpower and machine power. Soft computing is critical for the advancement of healthcare technologies and essential for forecasting existing mobility concerns and cardiac disease. The heart is the second most essential organ in the body. It carries blood to all of the body's organs. The use of data analytics to predict heart disease is an important medical issue. By offering more information, data analysis helps medical organizations forecast illnesses. This study compares the accuracy of several classification algorithms in predicting heart disease. Data from the past can be used to forecast future illnesses. The basic objective is to predict illness based on recent diagnoses. Numerous scientific learning approaches and procedures have been evaluated on diverse data sets. This study evaluates the most effective learning algorithms for predicting heart disease from Electrocardiogram (ECG) signals, using a variety of learning methods and performance criteria. These ECG signals were mainly collected using a plastic optical fiber-based ECG sensor device. Three separate data sets are examined in this study. The data sets were from public sources and each had over 250 ECG samples. The accuracy, precision, and F1-value of each data set are evaluated. The machine learning algorithms we look at include Logistic Regression, k-Nearest Neighbor (KNN), Decision Tree, Random Forest, Support Vector Machine (SVM), Gaussian Naive Bayes (NB), Linear Discriminant Analysis, AdaBoost Classifier, Gradient Boosting Classifier, Quadratic Discriminant Analysis, and Multilayer Perceptron (MLP) Classifier. In addition, the outcomes for each data set were compared and shown.
Heart disease, also called cardiovascular disease, is one of the dangerous diseases that can cause death. As technology develops and machine learning grows in popularity, machine learning can be used to help detect heart disease from patient data. Various methods can be used to diagnose whether or not a person has heart disease. This study implements the logistic regression algorithm, which uses the logistic function to produce a binary output, zero or one, as the classification decision. Based on confusion matrix analysis, the experiments show that logistic regression has differing advantages over the other methods. On the training data, logistic regression has the highest sensitivity of all the methods, 88.54%. On the testing data, logistic regression has the highest specificity of all the methods, 87.50%.
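The way a logistic function yields a zero-or-one decision can be sketched as follows. This is a minimal illustration and not the study's model: the weights, bias, feature values, and the 0.5 threshold are all hypothetical.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function, mapping any real z into (0, 1).

    Note: math.exp can overflow for very large negative z; this sketch
    does not guard against that case.
    """
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias, threshold=0.5):
    """Return 1 if the estimated probability reaches the threshold, else 0."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if logistic(z) >= threshold else 0


# Hypothetical, already-fitted parameters for two scaled features.
weights = [0.5, 0.5]
bias = -1.0
label_a = predict([1.0, 2.0], weights, bias)  # z = 0.5
label_b = predict([0.0, 0.0], weights, bias)  # z = -1.0
```

Lowering the threshold would trade specificity for sensitivity, the two measures the abstract reports.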
Atrial fibrillation (AF) is an arrhythmia that commonly causes death in adults. We measured AR coefficients using Burg's method for each 15-second segment of ECG. These features are classified using different statistical classifiers: kernel SVM and KNN. The performance of the algorithm was evaluated on signals from the MIT PhysioNet database. The effect of AR model order and data length on the classification results was tested. The method shows better results and can be used in clinical practice.
Conference Paper
The successful application of data mining in highly visible fields like e-business, marketing, and retail has led to its application in other industries and sectors. Healthcare is among the sectors just discovering it. The healthcare industry is generally “information rich”, but unfortunately not all of the data needed for discovering hidden patterns and making effective decisions are mined. Hidden patterns and relationships often go unexploited. Advanced data mining modeling techniques can help remedy this situation. This research paper intends to use data mining classification modeling techniques, namely Decision Trees, Naïve Bayes, and Neural Network, along with the weighted association Apriori algorithm and the MAFIA algorithm, in heart disease prediction. Using medical profiles such as age, sex, blood pressure, and blood sugar, it can predict the likelihood of patients getting heart disease.
Data mining refers to using a variety of techniques to identify nuggets of information or decision-making knowledge in a database and extracting them in a way that they can be put to use in areas such as decision support, prediction, forecasting, and estimation. The healthcare industry collects huge amounts of healthcare data which, unfortunately, are not “mined” to discover hidden information for effective decision making. Discovering relations that connect variables in a database is the subject of data mining. This research has developed a Decision Support in Heart Disease Prediction System (DSHDPS) using a data mining modeling technique, namely Naïve Bayes. Using medical profiles such as age, sex, blood pressure, and blood sugar, it can predict the likelihood of patients getting heart disease. It is implemented as a web-based questionnaire application. It can serve as a training tool to train nurses and medical students to diagnose patients with heart disease.
In this paper we review classification algorithms used to design brain–computer interface (BCI) systems based on electroencephalography (EEG). We briefly present the commonly employed algorithms and describe their critical properties. Based on the literature, we compare them in terms of performance and provide guidelines to choose the suitable classification algorithm(s) for a specific BCI.
The application of computer data processing in the medical domain has witnessed several significant revolutions in recent years. Data mining in particular has played an important role in revealing hidden patterns in clinically relevant data sets, and the technique can be employed for the diagnosis of heart attack. New information on the nature of diseases and their diagnostic criteria has been increasing at a tremendous rate. Nevertheless, the data on certain illnesses are always heterogeneous in nature, and it is highly difficult to interpret such voluminous data and arrive at a strong conclusion. Hence organized data are mandatory. For some fatal diseases, reaching a diagnosis that supports suitable treatment is a difficult task. The doctor requires a precise diagnosis from the many clinical reports of the individual concerned. Therefore automated data processing and mining would help a medical professional initiate a treatment regime. Furthermore, computer processing avoids errors and excessive time consumption in prediction. Data mining techniques provide a comparative knowledge base and a user-friendly working environment, and they help improve the accuracy of heart disease diagnosis.
The aim of this study is to combine artificial neural networks (ANNs) and Fuzzy Logic (FL) into a powerful tool for diagnosing heart disease. By combining the fuzzy inference system and the neural network, the input values are passed through the input layer (by input membership functions) and the output can be seen in the output layer (by output membership functions). Training involves iterative adjustment of the parameters of the adaptive neuro-fuzzy inference system (ANFIS) using a hybrid learning procedure. The mechanism has five layers, and each layer has its own nodes. Layer 1 holds the input variables with their membership functions. A T-norm operator that performs the AND operation can be used in layer 2. The sum of all rule firing strengths is assigned in layer 3. The nodes in layer 4 are adaptive and compute the consequents of the rules. A single node computes the overall output in layer 5. The proposed method is tested with the Cleveland heart disease dataset. The ANFIS approach is implemented using MATLAB. The proposed mechanism can work more effectively for the diagnosis of heart disease and also improves accuracy. The results of the proposed method are compared with an earlier method using accuracy as the metric.
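The five-layer structure described in this abstract can be sketched as a single forward pass. The sketch makes several assumptions that the abstract does not state: Gaussian membership functions, the product as the T-norm, normalization of each firing strength by the sum in layer 3 (the common textbook ANFIS formulation), and first-order Sugeno consequents. All names and parameter values are hypothetical, and no training step is shown.

```python
import math


def gaussian(x, center, width):
    """Gaussian membership function (an assumed choice of shape)."""
    return math.exp(-((x - center) ** 2) / (2 * width ** 2))


def anfis_forward(x, y, rules):
    """One forward pass through a two-input, five-layer ANFIS sketch."""
    # Layer 1: membership degree of each input for each rule.
    memberships = [(gaussian(x, *r["mf_x"]), gaussian(y, *r["mf_y"])) for r in rules]
    # Layer 2: firing strength via a product T-norm (fuzzy AND).
    strengths = [mx * my for mx, my in memberships]
    # Layer 3: each firing strength divided by the sum of all strengths.
    total = sum(strengths)
    normalized = [s / total for s in strengths]
    # Layer 4: weighted rule consequents, here linear in the inputs.
    consequents = [
        w * (r["p"] * x + r["q"] * y + r["r"]) for w, r in zip(normalized, rules)
    ]
    # Layer 5: a single node sums the consequents into the overall output.
    return sum(consequents)


# Two hypothetical rules with constant consequents 0 and 1.
rules = [
    {"mf_x": (0.0, 1.0), "mf_y": (0.0, 1.0), "p": 0.0, "q": 0.0, "r": 0.0},
    {"mf_x": (1.0, 1.0), "mf_y": (1.0, 1.0), "p": 0.0, "q": 0.0, "r": 1.0},
]
midpoint_output = anfis_forward(0.5, 0.5, rules)
```

With constant consequents of 0 and 1 the output is a weighted average that always lies between the two values. A hybrid learning procedure such as the one the abstract mentions would adjust the membership and consequent parameters.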
Two different procedures are studied by which a frequency analysis of a time-dependent signal can be effected, locally in time. The first procedure is the short-time or windowed Fourier transform, the second is the "wavelet transform," in which high frequency components are studied with sharper time resolution than low frequency components. The similarities and the differences between these two methods are discussed. For both schemes a detailed study is made of the reconstruction method and its stability, as a function of the chosen time-frequency density. Finally the notion of "time-frequency localization" is made precise, within this framework, by two localization theorems.
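For orientation, the two transforms named in this abstract are commonly written in the following textbook form. The window function g, the wavelet h, and the scale and shift parameters a and b are notation assumed here and do not appear in the abstract itself.

```latex
% Windowed (short-time) Fourier transform of f with window g
(T^{\mathrm{win}} f)(\omega, t) = \int f(s)\, g(s - t)\, e^{-i \omega s}\, ds

% Wavelet transform of f with wavelet h, scale a \neq 0 and shift b
(T^{\mathrm{wav}} f)(a, b) = |a|^{-1/2} \int f(s)\, \overline{h\!\left(\frac{s - b}{a}\right)}\, ds
```

In the wavelet case a small |a| compresses h, which is why high frequency components are analyzed with sharper time resolution than low frequency ones.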
In the history of research of the learning problem one can extract four periods that can be characterized by four bright events: (i) Constructing the first learning machines, (ii) constructing the fundamentals of the theory, (iii) constructing neural networks, (iv) constructing the alternatives to neural networks.