Machine Learning-Based Prediction Models of
Coronary Heart Disease Using Gaussian Naïve
Bayes and Random Forest Algorithms
Charles Bernando
Information Systems Department,
School of Information Systems
Bina Nusantara University
Jakarta, Indonesia 11480
charles.bernando@binus.ac.id
Eka Miranda
Information Systems Department,
School of Information Systems
Bina Nusantara University
Jakarta, Indonesia 11480
ekamiranda@binus.ac.id
Mediana Aryuni
Information Systems Department,
School of Information Systems
Bina Nusantara University
Jakarta, Indonesia 11480
mediana.aryuni@binus.ac.id
Abstract—Coronary heart disease, a major type of
cardiovascular disease (CVD), is the number one cause of death
in the world. Accordingly, much research has been
conducted to predict the early diagnosis of heart disease
and to determine the most important risk factors associated
with it. Despite these considerable efforts, the accuracy of
the predictions has remained inadequate and the most
important risk factors have remained elusive. This
paper discusses risk factors associated with the disease
and presents prediction models of coronary heart disease
built with two supervised machine learning algorithms,
Gaussian Naïve Bayes and Random Forest. It uses
the public Cleveland dataset of coronary heart disease
patients from the UCI repository. The results show
that the Gaussian Naïve Bayes and Random Forest models
achieve accuracies of 85.00% and 75.00%, respectively.
Moreover, the precision, recall and F-measure of Gaussian
Naïve Bayes are higher than those of Random Forest,
suggesting that it is the more suitable of the two for
predicting the early diagnosis of the disease.
Keywords—heart disease, Gaussian Naïve Bayes, Random
Forest, machine learning, risk factors
I. INTRODUCTION
Coronary heart disease (CHD), a major type of
cardiovascular disease (CVD), is the number one cause of
death in the world, responsible for around 9 million deaths
worldwide or 16% of the world’s total deaths [1]. In
principle, cardiovascular disease is a general term for
conditions affecting the heart or blood vessels, the causes
of which are related to deposits of fat in the arteries and
to non-functional arteries in the patients’ brain, heart and
kidneys. Cardiovascular disease comprises four types of
disease, namely stroke, peripheral arterial disease, aortic
disease, and coronary heart disease, the last of which is the
focus of this paper. Patients with coronary heart
disease have blocked flow of blood to the heart muscle,
which may give rise to angina, heart attacks and heart failure.
These three conditions are primarily affected by lifestyle
risk factors, such as alcohol consumption, smoking, high
caffeine consumption and physical inactivity, and by
physiological risk factors, such as high cholesterol,
hypertension, overweight and obesity. These factors need to
be examined in order to predict the early diagnosis of the
disease. This examination can be conducted using machine
learning techniques [2].
The population of patients with cardiovascular disease
has been examined in Indonesia. According to the 2018
Riskesdas data, CVD is most prevalent among people aged
65-74 years and people aged 75 years or older, where it
accounts for 9.3% of the population [3]. Furthermore, the
prevalence of CVD is highest in Kalimantan Utara province
and lowest in Nusa Tenggara Timur. CVD is more prevalent
among women than men, among more educated than less
educated people, among government employees than people in
other occupations, and among people in urban than in rural
areas. In 2013, by comparison, CVD was most prevalent among
people aged 65 years and older, where it accounted for 6.8%
of the population; the 2018 figure therefore represents a
37% increase in patients with CVD in this age range [4].
Moreover, in 2013 CVD was most prevalent among less educated
people and people in rural areas, which indicates a
significant shift by 2018 towards highly educated people and
people who live in urban areas. These findings are
consistent with studies conducted by other researchers. CVD
risk factors among blue-collar and white-collar workers aged
40 to 69 years in Indonesia have been studied by several
researchers [5]. The results show that cardiovascular
disease was associated with occupation: white-collar workers
were 1.6 times as likely to be diagnosed with CVD as
blue-collar workers. In addition, the leading risk factor
for CVD in Indonesia is hypertension, which contributes to
20%-25% of all CHD and 36%-42% of all strokes in men
and women, followed by smoking, which causes 25% of
CHD and 17% of strokes [6]. Moreover, household
screening for cardiovascular risk factors in Malang District
found that 29.2% of adults aged 40 years and older had a
risk of coronary heart disease, stroke or other
atherosclerotic disease, with a greater prevalence of high
CVD risk among people in urban areas than among people in
semi-urban and rural areas [7]. Similarly, another study
found a socioeconomic disparity in CVD risk factors: the
prevalence of obesity, hypertension and diabetes is higher
in urban districts and in the richest and best-educated
districts, whereas physical inactivity and smoking are more
prevalent in rural and least educated districts [8]. The
increasing prevalence of CVD in Indonesia calls for a robust
technique to predict the early diagnosis of the disease. In
this paper, we present two machine learning techniques for
this purpose, namely Gaussian Naïve Bayes and Random Forest.
The study addresses two research questions: how to
accurately predict the early diagnosis of heart disease
using Gaussian Naïve Bayes and Random Forest models, and
which of the two models performs better.
II. LITERATURE STUDY
A. Gaussian Naïve Bayes
Naïve Bayes is a supervised classification technique
based on Bayes’ Theorem with an assumption of
independence among predictors. It can be used for
binary and multi-class classification problems. In short,
Bayes’ Theorem provides a way to calculate the
probability of a hypothesis given prior knowledge.
Bayes’ Theorem is written as:
\[ P(a \mid b) = \frac{P(b \mid a)\, P(a)}{P(b)} \]
in which P(a|b) is the probability of hypothesis a given the
data b (posterior probability), P(b|a) is the probability of
data b given that hypothesis a is true, P(a) is the
probability that hypothesis a is true (prior probability of
a), and P(b) is the probability of the data b.
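
As a small numerical illustration of Bayes’ Theorem, the sketch below computes a posterior probability in Python. The hypothesis, the evidence and all probability values are hypothetical numbers chosen only for illustration; they are not taken from the dataset used in this paper.

```python
def posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Return P(a|b) = P(b|a) * P(a) / P(b)."""
    if evidence_b <= 0:
        raise ValueError("P(b) must be positive")
    return likelihood_b_given_a * prior_a / evidence_b


# Hypothetical values (assumptions for illustration only):
# a = "patient has heart disease", b = "patient reports chest pain".
p_a = 0.30              # prior P(a)
p_b_given_a = 0.80      # likelihood P(b|a)
p_b_given_not_a = 0.20  # likelihood P(b|not a)

# The law of total probability gives the evidence P(b).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = posterior(p_a, p_b_given_a, p_b)
print(round(p_a_given_b, 4))  # prints 0.6316
```

With these assumed values, observing the evidence raises the probability of the hypothesis from the prior of 0.30 to a posterior of about 0.63.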
Naïve Bayes can be extended to Gaussian and Bernoulli
variants. Gaussian Naïve Bayes is applicable to attributes
with real values. For such attributes, the mean and
standard deviation of the input values are calculated for
each class. The probabilities of new input values are then
calculated using the Gaussian Probability Density Function
(PDF), which provides an estimate of the probability of
the new input value for that class. The PDF used in this
paper is shown below:
\[ P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right) \]
in which P(xi|y) is the Gaussian PDF, σy is the standard
deviation of the input variable for class y, xi is the new
input value for the input variable, and µy is the mean value
of the input variable for class y.
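
The procedure described above (per-class mean and standard deviation, then the Gaussian PDF combined with the class prior) can be sketched in plain Python as follows. This is a minimal illustration and not the implementation used in this paper; the function names, the two toy attributes and their values are hypothetical, and the population variance is an assumed choice.

```python
import math
from collections import defaultdict


def log_gaussian_pdf(x, mean, std):
    """Logarithm of the Gaussian PDF P(x | y) for one attribute."""
    var = std ** 2
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def gaussian_pdf(x, mean, std):
    """Gaussian PDF P(x | y) for one attribute."""
    return math.exp(log_gaussian_pdf(x, mean, std))


def fit(rows, labels):
    """Estimate the prior and the per-attribute (mean, std) of each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            # A small floor on std avoids division by zero for constant columns.
            stats.append((mean, max(math.sqrt(var), 1e-9)))
        model[label] = (len(members) / len(rows), stats)
    return model


def predict(model, row):
    """Return the class with the largest log prior plus summed log PDFs."""
    def log_posterior(label):
        prior, stats = model[label]
        return math.log(prior) + sum(
            log_gaussian_pdf(x, mean, std) for x, (mean, std) in zip(row, stats)
        )
    return max(model, key=log_posterior)


# Hypothetical toy records: (age, maximum heart rate); 1 = Exist, 0 = Not Exist.
rows = [(63, 150), (67, 108), (62, 120), (41, 172), (37, 187), (45, 170)]
labels = [1, 1, 1, 0, 0, 0]
model = fit(rows, labels)
print(predict(model, (60, 125)), predict(model, (40, 180)))  # prints 1 0
```

Summing log probabilities instead of multiplying the PDF values is a common design choice, since products of many small densities can underflow.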
Naïve Bayes has been extensively used to model
and predict the early diagnosis of heart disease [9]. A
study applying Naïve Bayes to 11 risk factors
reported an accuracy of 89.77% [10]. Other
researchers proposed a hybrid approach combining SVM with
Naïve Bayes [11]. They applied this approach to the
same parameters of the dataset from the UCI repository and
achieved an accuracy of 100%. However, as that approach
was hybrid, the authors of [12] proposed a model called
Hidden Naïve Bayes (HNB), which accounts for dependencies
among attributes, and with it also reached 100% accuracy.
These results suggest that Naïve Bayes is one of the most
powerful algorithms for classifying and predicting heart
disease. This paper extends Naïve Bayes to Gaussian Naïve
Bayes, applied to 13 risk factors [13].
B. Random Forest
Random Forest is a supervised learning algorithm that
classifies data through an ensemble of decision trees. Each
individual tree in the random forest produces a class
prediction, and the class with the most votes becomes
the model’s prediction. In short, Random Forest builds
numerous decision trees and merges them to obtain a more
accurate prediction. Random Forest also introduces
additional randomness into the model, since each tree
searches for the best features or risk factors among a
random subset of features.
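
The two ideas in the paragraph above, majority voting over many trees and searching for the best split within a random subset of features, can be sketched as follows. As a deliberate simplification, each "tree" here is a decision stump (a tree of depth one) trained on a bootstrap sample; a full Random Forest grows deeper trees. All function names, parameter values and data are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty sequence."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, feature_ids):
    """Depth-1 tree: the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            left_label, right_label = majority(left), majority(right)
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no valid split: fall back to a constant prediction
        label = majority(labels)
        return (feature_ids[0], float("inf"), label, label)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train stumps on bootstrap samples, each restricted to random features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]              # bootstrap sample
        feature_ids = rng.sample(range(len(rows[0])), n_features)  # random feature subset
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature_ids)
        )
    return forest


def predict_forest(forest, row):
    """Each stump votes; the class with the most votes is the prediction."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


# Hypothetical toy data with two features and two well-separated classes.
rows = [(1, 10), (2, 11), (3, 12), (7, 20), (8, 21), (9, 22)]
labels = [0, 0, 0, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict_forest(forest, (0, 9)), predict_forest(forest, (10, 25)))
```

Because each stump sees a different bootstrap sample and feature subset, individual stumps can disagree; the vote is what stabilizes the final prediction.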
Several researchers have predicted heart disease using
Random Forest. Recently, Random Forest was
applied to a dataset containing 303 samples and 14
attributes, resulting in 86.9% accuracy for the prediction of
heart disease [14]. Moreover, a study using Random Forest
on 498 patients, conducted at Xi’an Medical
University between 2011 and 2018, identified 9 variables
that greatly affected the heart disease prediction [15]. In
addition, the Random Forest algorithm was applied to the
Cleveland heart disease dataset, producing an accuracy
of 85.81% [16]. These results suggest the importance of
the Random Forest technique in predicting the early
diagnosis of heart disease.
III. DATA AND METHODS
A. Dataset
This research uses the Cleveland Heart Disease dataset
from the UCI repository [17]. The dataset was compiled
at the Cleveland Clinic Foundation and comprises
303 records, each having 76 attributes. These 76 attributes
were reduced to 13, which are used to
predict the Exist (value 1) or Not Exist (value 0) class of
heart disease. These 13 attributes are shown below:
B. Research Methods
In this paper, the original dataset consisting of 303
records and 76 columns is obtained from the Cleveland
dataset in the UCI repository. The dataset is loaded into a
Jupyter Notebook, and Python is used to build the machine
learning models. The steps taken to predict and classify the
existence of heart disease are:
1. Data Preprocessing
In the data preprocessing step, the data are cleaned
and transformed into a form that is ready to be used
as input to the machine learning models. The
preprocessing consists of selecting the 13
columns of interest described in Section III-A.
Afterwards, the data are checked for null values to ensure
the validity of the results, since the algorithms may
produce different results if the data contain null values.
Thereafter, cross tabulations of the data based on
several attributes may be examined to obtain a
descriptive view of the data.
2. Feature or attributes selection
The next step is to investigate the correlation
among the 13 attributes or risk factors. If the
attributes are not correlated with one another, all 13
attributes can be included in the model. However, if
there is a strong correlation between two attributes,
one of them should be dropped. The correlation
is measured using the Pearson correlation. The result
shows that the 13 attributes are only
weakly correlated with one another, which suggests that
they can be treated as independent attributes. Thus, all 13
attributes are included in the model.
3. Data Splitting
In this step, the data consisting of 303 records are
split into training and testing data. The split is
conducted randomly, with a train-test ratio of 80:20,
which is a commonly used ratio for the train-test
data split.
4. Model Training
This step builds the models. The training
data serve as the input data to the models. In this step,
the Gaussian Naïve Bayes and Random Forest
algorithms are used to build two models
independently. The hyperparameters of the
algorithms are set and adjusted to optimize the
accuracy of the models.
5. Model Evaluation
This last step applies the developed
models to the testing data. The result is the confusion
matrix, which cross tabulates the actual positive and
negative classes with the predicted positive and negative
classes, where positive denotes the Exist class of heart
disease and negative denotes the Not Exist class
of heart disease. The confusion matrix therefore
consists of 4 cells, namely TP (True Positive:
actual and predicted values are positive), FP (False
Positive: actual value is negative and predicted value
is positive), TN (True Negative: actual and predicted
values are negative), and FN (False Negative: actual
value is positive and predicted value is negative).
From the confusion matrix, the accuracy, precision,
recall and F-measure of Gaussian Naïve Bayes and
Random Forest can be obtained, according to:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2 × Recall × Precision / (Recall + Precision)
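
The four measures defined above translate directly into code. The sketch below is one possible formulation, with hypothetical function names and made-up labels; returning 0.0 when a denominator is zero is an assumed convention and not something this paper specifies.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, TN and FN for a binary problem."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F-measure from the confusion matrix."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = recall + precision
    f_measure = 2 * recall * precision / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Made-up labels for illustration: 1 = Exist, 0 = Not Exist.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(actual, predicted)
metrics = classification_metrics(*counts)
print(counts)   # prints (3, 1, 3, 1)
print(metrics)
```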
The complete research method is depicted in Fig. 1.
Fig. 1. Research Method
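
Steps 2 and 3 of the method (Pearson correlation screening and the random 80:20 split) can be sketched in plain Python as shown below. The function names, the 0.8 correlation threshold, the random seed and the toy columns are all assumptions made for illustration; the paper does not state the threshold or the seed that were used. Rounding the test size down is also an assumption, chosen because it yields the 60 test records implied by the confusion matrices reported in Section IV.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongly_correlated_pairs(columns, threshold=0.8):
    """Pairs of column names whose absolute correlation exceeds the threshold."""
    names = list(columns)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


def holdout_split(records, test_fraction=0.2, seed=42):
    """Shuffle the records and split them into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)  # rounded down (assumption)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy columns; "est_max_hr" is constructed as 220 - age, so it is
# perfectly (negatively) correlated with "age" and should be flagged.
columns = {
    "age": [29, 41, 54, 63, 70],
    "est_max_hr": [191, 179, 166, 157, 150],
    "chol": [210, 250, 190, 260, 220],
}
flagged = strongly_correlated_pairs(columns)
print(flagged)  # prints [('age', 'est_max_hr')]

train, test = holdout_split(range(303))
print(len(train), len(test))  # prints 243 60
```

If a pair is flagged, one of its two attributes would be dropped before training, as described in step 2.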
IV. RESULT AND DISCUSSION
This section presents the results of the models developed
using the Gaussian Naïve Bayes and Random Forest algorithms
to predict the early diagnosis of heart disease.
A. Gaussian Naïve Bayes
The model developed using Gaussian Naïve Bayes gives
the higher accuracy of the two models. Its
confusion matrix is shown in Fig. 2. The prediction outcome
shows that 20 positive records and 31 negative
records are correctly predicted by the model, while the
remaining 9 records are predicted incorrectly, so that 51 of
the 60 test records (85%) are classified correctly.
Fig. 2. Confusion Matrix of Gaussian Naïve Bayes model
From the cross tabulation of the actual and predicted
positive and negative values in the confusion matrix, the
performance measures of the model can be obtained. They are
shown in Fig. 3.
Fig. 3. Measurements of Gaussian Naïve Bayes model
B. Random Forest
The model developed using the Random Forest
algorithm is less accurate than the Gaussian Naïve
Bayes model. Its confusion matrix is shown in Fig. 4.
On the test data, the model correctly predicts
16 positive records and 29 negative
records and makes 15 incorrect predictions, consisting of
9 False Negative and 6 False Positive records.
Fig. 4. Confusion Matrix of Random Forest model
From the cross tabulation of the actual and predicted
positive and negative values in the confusion matrix, the
performance measures of the model can be obtained. They are
shown in Fig. 5.
Fig. 5. Measurements of Random Forest model
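
As a cross-check, the measures for the Exist (positive) class can be recomputed from the Random Forest counts stated above (TP = 16, TN = 29, FN = 9, FP = 6), using the formulas from Section III. The values in Fig. 5 may be reported differently, for example per class or averaged over both classes, so the numbers below are a derivation from the stated counts only.

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{16 + 29}{16 + 29 + 6 + 9} = \frac{45}{60} = 0.75 \\
\text{Precision} &= \frac{TP}{TP + FP} = \frac{16}{22} \approx 0.727 \\
\text{Recall}    &= \frac{TP}{TP + FN} = \frac{16}{25} = 0.64 \\
\text{F-measure} &= \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} = \frac{2\,TP}{2\,TP + FP + FN} = \frac{32}{47} \approx 0.681
\end{align*}
```

The accuracy of 0.75 agrees with the 75.00% reported for Random Forest in the abstract.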
In general, the False Negative and False Positive results
produced by the Gaussian Naïve Bayes and Random Forest
models can be attributed to the nature of the data and of
the algorithms. Gaussian Naïve Bayes assumes that the
input values follow a Gaussian-like distribution, whereas
the distribution of the values in the dataset is not
completely Gaussian. Moreover, Naïve Bayes assumes
that the risk factors associated with heart disease are
independent of one another. In principle, this may not be
true, since a risk factor may be correlated with another
risk factor. For instance, the maximum heart rate is closely
correlated with age. According to Dr. William Haskell in the
1970s, the maximum heart rate, measured in beats per minute
(bpm), is around 220 minus age; taking this dependency into
account may change the accuracy of the models.
On the other hand, the accuracy of Random
Forest is quite sensitive to the data. Moreover, the model
is less interpretable, since it is difficult to
visualize an ensemble of decision trees.
V. CONCLUSION AND FUTURE WORK
This research developed machine learning-based
prediction models of coronary heart disease using Gaussian
Naïve Bayes and Random Forest, applied to a
dataset consisting of 303 records and 13 selected attributes.
The holdout method was applied to split the dataset into
training and testing datasets with a ratio of 80%:20%,
respectively. The results show that Gaussian Naïve Bayes
has higher accuracy, precision, recall and F-measure values
than Random Forest. Future research will employ a larger
dataset, namely data from patients with heart disease in a
national hospital in Indonesia, involve more attributes or
risk factors, and explore feature extraction and other
classification techniques to increase the accuracy of the
models.
ACKNOWLEDGMENT
This work is supported by the Research and Technology
Transfer Office, Bina Nusantara University as part of Bina
Nusantara University’s International Research Grant entitled
"Aplikasi Prediksi Diagnosa Awal Penyakit Jantung Koroner
Berbasis Web Dengan Teknik Regresi Logistik" with
contract number: No: 017/VR.RTT/III/2021 and contract
date: 22 March 2021.
REFERENCES
[1] World Health Organization, “Cardiovascular Disease”, World Health
Organization, May 2021 [Online] Available:
https://www.who.int/health-topics/cardiovascular-diseases/#tab=tab_1
[Accessed 28 May 2021].
[2] D. Shah, S. Patel, S. K. Bharti, “Heart disease prediction using
machine learning techniques”, SN Computer Science, Vol. 1, No.
345, 2020. https://doi.org/10.1007/s42979-020-00365-y
[3] Kemkes, Badan Litbangkes, Laporan Nasional Riskesdas 2018, 1
December 2018 [Online] Available at:
http://labdata.litbang.kemkes.go.id/images/download/laporan/RKD/2018/Laporan_Nasional_RKD2018_FINAL.pdf
[Accessed 29 May 2021].
[4] Kemkes, Badan Litbangkes, Laporan Nasional Riskesdas 2013, 1
December 2013 [Online] Available at:
http://labdata.litbang.kemkes.go.id/images/download/laporan/RKD/2013/Laporan_riskesdas_2013_final.pdf
[Accessed 29 May 2021].
[5] N. A. Prihartono, Fitriyani, W. Riyadina, “Cardiovascular disease risk
factors among blue and white-collar workers in Indonesia”, Acta
Medica Indonesiana - The Indonesian Journal of Internal Medicine,
Vol. 50, No. 2, pp. 96-103, 2018.
[6] M. A. Hussain, A. A. Mamun, S. AE Peters, M. Woodward, R.R.
Huxley, “The burden of cardiovascular disease attributable to major
modifiable risk factors in Indonesia”, Journal of Epidemiology,
26(10), pp. 515-521, 2016.
[7] A. Maharani, Sujarwoto, D. Praveen, D. Oceandy, G. Tampubolon,
A. Patel, “Cardiovascular disease risk factor prevalence and estimated
10-year cardiovascular risk scores in Indonesia: The SMARThealth
Extend study”, PLoS ONE, 14(4): e0215219, 2019.
https://doi.org/10.1371/journal.pone.0215219
[8] W. Adisasmito, V. Amir, A. Atin, A. Megraini, D. Kusuma,
“Geographic and socioeconomic disparity in cardiovascular risk
factors in Indonesia: analysis of the Basic Health Research 2018”,
BMC Public Health, Vol. 20, No. 1004, 2020.
https://doi.org/10.1186/s12889-020-09099-1
[9] K. Tripathi, H. Garg, H. Sharma, “A comprehensive survey on
cardiovascular disease”, International Journal of Grid and Distributed
Computing, Vol. 13, No. 1, pp. 1772 – 1780, 2020.
[10] A. N. Repaka, S. D. Ravikanti, R. G. Franklin, “Design and
implementing heart disease prediction using Naives Bayesian”, 2019
3rd International Conference on Trends in Electronics and
Informatics (ICOEI), IEEE, Tirunelveli, India, pp. 292-297, 2019.
[11] R. R. Ade, D. S. Medhekar, M. P. Bote, “Heart disease prediction
system using SVM and naive Bayes”, International Journal of
Engineering Sciences & Research Technology, Vol. 2, No. 5, 2013.
[12] M. A. Jabbar, S. Samreen, “Heart disease prediction system based
on hidden naïve Bayes classifier”, 2016 International Conference on
Circuits, Controls, Communications and Computing (I4C), IEEE,
Bangalore, pp. 1-5, 2016.
[13] E. Miranda, E. Irwansyah, A. Y. Amelga, M. M. Maribondang, M.
Salim, “Detection of cardiovascular disease risk's level for adults
using naive Bayes classifier”, Healthcare Informatics Research
Journal, Vol. 22, Issue 3, pp. 196-205, 2016.
[14] M. Pal, S. Parija, “Prediction of Heart Diseases using Random
Forest”, Journal of Physics: Conference Series, Vol. 1817, No. 1, pp.
012009, 2021. https://doi.org/10.1088/1742-6596/1817/1/012009
[15] X. Su, Y. Xu, Z. Tan, X. Wang, P. Yang, Y. Su, Y. Jiang, S. Qin, L.
Shang, “Prediction for cardiovascular diseases based on laboratory
data: An analysis of random forest model”, Journal of Clinical
Laboratory Analysis, 34(9), e23421, 2020.
https://doi.org/10.1002/jcla.23421
[16] Y. K. Singh, N. Sinha, S. K. Singh, “Heart disease prediction system
using Random Forest”, In Proceedings of 2017 International
Conference on Advances in Computing and Data Sciences, pp. 613-
623, 2017.
[17] Kaggle, “Cleveland Clinic Heart Disease Dataset” [Online] Available
at: https://www.kaggle.com/aavigan/cleveland-clinic-heart-disease-dataset
[Accessed 28 May 2021].