
Machine-Learning-Based Prediction Models of Coronary Heart Disease Using Naïve Bayes and Random Forest Algorithms

Machine Learning-Based Prediction Models of
Coronary Heart Disease Using Gaussian Naïve
Bayes and Random Forest Algorithms
Charles Bernando
Information Systems Department,
School of Information Systems
Bina Nusantara University
Jakarta, Indonesia 11480
charles.bernando@binus.ac.id
Eka Miranda
Information Systems Department,
School of Information Systems
Bina Nusantara University
Jakarta, Indonesia 11480
ekamiranda@binus.ac.id
Mediana Aryuni
Information Systems Department,
School of Information Systems
Bina Nusantara University
Jakarta, Indonesia 11480
mediana.aryuni@binus.ac.id
Abstract: Coronary heart disease (CHD), a type of
cardiovascular disease (CVD), is the number one cause of death
in the world. Accordingly, much research has been
conducted to predict the early diagnosis of heart disease
and to determine the most important risk factors associated
with it. Despite these considerable efforts, the accuracy of
prediction has remained inadequate and the most
important risk factors have remained elusive. This
paper discusses the risk factors associated with the disease
and presents prediction models of coronary heart disease
built with two supervised machine learning algorithms,
Gaussian Naïve Bayes and Random Forest. It uses
the public Cleveland dataset of coronary heart disease
patients from the UCI repository. The results show
that the Gaussian Naïve Bayes and Random Forest models
achieve accuracies of 85.00% and 75.00%, respectively.
Moreover, the precision, F-measure and recall of Gaussian
Naïve Bayes are higher than those of Random Forest,
which indicates that it is the better of the two models for
predicting the early diagnosis of the disease.
Keywords: heart disease, Gaussian Naïve Bayes, Random
Forest, machine learning, risk factors
I. INTRODUCTION
Coronary heart disease (CHD), a type of cardiovascular
disease (CVD), is the number one cause of death in the world,
responsible for around 9 million deaths worldwide, or 16% of
the world’s total deaths [1]. In principle, cardiovascular
disease is a general term for conditions affecting the heart
or blood vessels, the causes of which are related to deposits
of fat in the arteries and to non-functional arteries in the
patients’ brain, heart and kidneys. Cardiovascular disease
comprises four types of disease, namely stroke, peripheral
arterial disease, aortic disease, and coronary heart disease,
the last of which is the focus of this paper. Patients with
coronary heart disease have blocked blood flow to the heart
muscle, which may give rise to angina, heart attacks and heart
failure. These three conditions are primarily affected by
lifestyle risk factors, such as alcohol consumption, smoking,
high caffeine consumption and physical inactivity, and by
physiological risk factors, such as high cholesterol,
hypertension, overweight and obesity. These factors need to be
examined to predict the early diagnosis of the disease. The
examination can be conducted using machine learning
techniques [2].
The population of patients with cardiovascular disease
has been examined in Indonesia. According to the 2018
Riskesdas data, CVD occurs mostly in people aged 65-74
years and in people aged 75 years or older, accounting for
9.3% of the population [3]. Furthermore, the prevalence of
CVD is highest in Kalimantan Utara province and lowest in
Nusa Tenggara Timur. CVD occurs more often in women than
in men, in the most educated people than in less educated
people, in government employees than in people with other
occupations, and in urban than in rural residents. In 2013,
by contrast, CVD occurred mostly in people aged 65 years and
older, accounting for 6.8% of the population [4]. The 2018
figure therefore represents a 37% relative increase in
patients with CVD in this age range compared with 2013.
Moreover, in 2013 CVD occurred mostly in less educated people
and in rural areas, which indicates a significant shift in
occurrence from less educated and rural populations in 2013
to highly educated and urban populations in 2018. These
findings are confirmed by other studies. CVD risk factors
among blue-collar and white-collar workers aged 40 to 69
years in Indonesia have been studied [5]. The results show
that cardiovascular disease was associated with occupation:
white-collar workers were 1.6 times as likely to be diagnosed
with CVD as blue-collar workers. In addition, the leading
risk factor for CVD in Indonesia is hypertension, which
contributes to 20%-25% of all CHD and 36%-42% of all strokes
in men and women, followed by smoking, which causes 25% of
CHD and 17% of strokes [6]. Moreover, household screening
for cardiovascular risk factors in Malang District found that
29.2% of adults aged 40 years and older were at risk of
coronary heart disease, stroke or other atherosclerotic
disease, with a greater prevalence of high CVD risk among
urban residents than among semi-urban and rural residents
[7]. Similarly, another study found socioeconomic disparity
in CVD risk factors: the prevalence of obesity, hypertension
and diabetes is higher in urban, richest and best-educated
districts, whereas physical inactivity and smoking are more
prevalent in rural and least educated districts [8]. The
increasing prevalence of CVD in Indonesia calls for a robust
technique to predict the early diagnosis of the disease. In
this paper, we present two machine learning techniques for
this purpose, Gaussian Naïve Bayes and Random Forest. The
study addresses two research questions: how to accurately
predict the early diagnosis of heart disease using Gaussian
Naïve Bayes and Random Forest models, and which of the two
models performs better.
II. LITERATURE STUDY
A. Gaussian Naïve Bayes
Naïve Bayes is a supervised classification technique
based on Bayes’ theorem with an assumption of
independence among predictors; it can be used for
binary and multi-class classification problems. In short,
Bayes’ theorem provides a way to calculate the
probability of a hypothesis given prior knowledge.
Bayes’ theorem is written as:
$$P(a \mid b) = \frac{P(b \mid a)\, P(a)}{P(b)}$$
in which P(a|b) is the probability of hypothesis a given the
data b (posterior probability), P(b|a) is the probability of data
b given that hypothesis a is true, P(a) is the probability
that hypothesis a is true (prior probability of a), and P(b) is
the probability of the data b.
Naïve Bayes can be extended to Gaussian and Bernoulli
variants. Gaussian Naïve Bayes is applicable to attributes
with real values. The mean and standard deviation of the
input values are calculated for each class, and the
probabilities of new input values are then calculated using
the Gaussian Probability Density Function (PDF). This PDF
provides an estimate of the probability of the new input
value for that class. The PDF used in this paper is shown
below:
$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$
in which P(xi|y) is the Gaussian PDF, σy is the standard
deviation of the input variable for class y, xi is the new
input value for the input variable, and µy is the mean value
of the input variable for class y.
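As an illustration of how the Gaussian PDF and Bayes’ theorem combine into a classifier, the following is a minimal sketch in plain Python. It is not the implementation used in this paper: the function names, the toy records and the use of the population standard deviation are assumptions made for the example, and each feature is assumed to vary within each class so that the standard deviation is never zero.

```python
import math
from statistics import mean, pstdev


def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density of x for a class with mean mu and std sigma."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def fit(X, y):
    """Estimate the prior and the per-feature (mean, std) for each class."""
    model = {}
    for label in sorted(set(y)):
        rows = [row for row, target in zip(X, y) if target == label]
        # zip(*rows) iterates over columns (one column per feature).
        stats = [(mean(col), pstdev(col)) for col in zip(*rows)]
        model[label] = (len(rows) / len(X), stats)
    return model


def predict(model, row):
    """Return the class with the highest log-posterior, ignoring the constant P(b)."""
    scores = {}
    for label, (prior, stats) in model.items():
        log_score = math.log(prior)
        for x, (mu, sigma) in zip(row, stats):
            # Logarithm of gaussian_pdf, written out directly to avoid underflow.
            log_score += (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                          - (x - mu) ** 2 / (2.0 * sigma ** 2))
        scores[label] = log_score
    return max(scores, key=scores.get)


# Hypothetical toy records: [age, maximum heart rate]; 1 = Exist, 0 = Not Exist.
X = [[63, 150], [67, 108], [60, 130], [41, 172], [37, 187], [45, 165]]
y = [1, 1, 1, 0, 0, 0]
model = fit(X, y)
prediction = predict(model, [65, 120])
```

Summing log-probabilities instead of multiplying densities is a common numerical choice; it leaves the predicted class unchanged because the logarithm is monotonic.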
Naïve Bayes has been extensively used to model and
predict the early diagnosis of heart disease [9]. A study
that applied Naïve Bayes to 11 risk factors achieved
89.77% accuracy [10]. Other researchers proposed a hybrid
approach combining SVM with Naïve Bayes [11]. They applied
this approach to the same parameters of the dataset from the
UCI repository and reported an accuracy of 100%. However, as
that approach was hybrid, the authors of [12] proposed a
model called Hidden Naïve Bayes (HNB) to obtain a dependent
algorithm, with which they also reported 100% accuracy.
These results suggest that Naïve Bayes is among the most
powerful algorithms for classifying and predicting heart
disease. This paper further extends Naïve Bayes to Gaussian
Naïve Bayes, applied to 13 risk factors [13].
B. Random Forest
Random Forest is a supervised learning algorithm that
classifies data through an ensemble of decision trees. Each
individual tree in the random forest produces a class
prediction, and the class with the most votes becomes
the model’s prediction. In short, random forest builds
numerous decision trees and merges them to obtain a more
accurate prediction. Random Forest adds randomness to the
model, since it searches for the best features or risk
factors among a random subset of features.
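The two ingredients described above, majority voting and random feature subsets, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names and the example votes are hypothetical, the decision trees themselves are not built here, and library implementations may aggregate tree outputs differently (for example by averaging class probabilities).

```python
import random
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the most trees.

    On a tie, Counter.most_common keeps the class that was seen first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


def random_feature_subset(n_features, subset_size, rng):
    """Pick the feature indices one tree may consider at a split."""
    return sorted(rng.sample(range(n_features), subset_size))


# Hypothetical predictions of five trees for one patient (1 = Exist, 0 = Not Exist).
votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(votes)

# Draw 3 candidate features out of 13 risk factors; the subset size is an assumption.
rng = random.Random(0)
subset = random_feature_subset(13, 3, rng)
```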
Several researchers have used Random Forest to predict
heart disease. Recently, Random Forest was applied to a
dataset containing 303 samples and 14 attributes, resulting
in 86.9% accuracy for the prediction of heart disease [14].
Moreover, a Random Forest study involving 498 patients,
conducted at Xi’an Medical University between 2011 and 2018,
identified 9 variables that greatly affected heart disease
prediction [15]. In addition, the Random Forest algorithm
was applied to the Cleveland heart disease dataset, producing
an accuracy of 85.81% [16]. These results suggest the
importance of the Random Forest technique in predicting the
early diagnosis of heart disease.
III. DATA AND METHODS
A. Dataset
This research uses the Cleveland Heart Disease dataset
from the UCI repository [17]. The dataset was compiled by
the Cleveland Clinic Foundation and comprises 303 records,
each having 76 attributes. These 76 attributes were reduced
to 13, which are used to predict the Exist (value 1) or Not
Exist (value 0) class of heart disease. These 13 attributes
are shown below:
B. Research Methods
In this paper, the original dataset of 303 records and 76
columns is obtained from the Cleveland dataset in the UCI
repository. The dataset is loaded into a Jupyter Notebook,
and Python is used to build the machine learning models. The
steps taken to predict and classify the existence of heart
disease are:
1. Data Preprocessing
In the data preprocessing step, the data are cleaned
and transformed into a form that is ready to be used
as an input to the machine learning models. The
preprocessing steps consist of selecting the 13
columns of interest, which are shown in Fig. 1.
Afterwards, the null values are checked to ensure the
validity of the results, since the algorithm may
produce different results if the data have null values.
Thereafter the cross tabulation of the data based on
several attributes may be examined to obtain
descriptive visualization of the data.
2. Feature or attribute selection
The next step is to investigate the correlation
between the 13 attributes or risk factors. If the
attributes are not correlated with one another, all 13
attributes can be included in the model. However, if
there is a strong correlation between two attributes,
one of them should be dropped. The correlation is
measured with the Pearson correlation coefficient. The
result shows that the 13 attributes are only weakly
correlated with one another, which is taken as an
indication that they can be treated as independent
attributes. Thus, all 13 attributes are included in the
model.
3. Data Splitting
In this step, the data consisting of 303 records are
split randomly into training and testing data with a
train-test ratio of 80:20, a commonly used choice for
this split.
4. Model Training
This step is used to build the models. The training
data serve as an input data to the models. In this step,
the Gaussian Naïve Bayes and Random Forest
algorithms are utilized to build the models
independently. The hyperparameters of the
algorithms are set and adjusted to optimize the
accuracy of the models.
5. Model Evaluation
This last step applies the developed models to the
testing data. The result is a confusion matrix, which
cross-tabulates the actual positive and negative
classes against the predicted positive and negative
classes, where positive denotes the Exist class of
heart disease and negative denotes the Not Exist
class. The confusion matrix therefore consists of 4
cells, namely TP (True Positive: actual and predicted
values are positive), FP (False Positive: actual value
is negative and predicted value is positive), TN (True
Negative: actual and predicted values are negative),
and FN (False Negative: actual value is positive and
predicted value is negative). From the confusion
matrix, the accuracy, precision, F-measure and recall
of Gaussian Naïve Bayes and Random Forest can be
obtained, according to:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = (2 x Recall x Precision) / (Recall +
Precision)
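The four measures above can be computed directly from the confusion matrix counts. The sketch below is a plain-Python illustration; the function names and the eight example labels are hypothetical and are not taken from the paper's data. It assumes at least one predicted positive and one actual positive, since precision and recall are otherwise undefined.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN and FN, with 1 as the positive (Exist) class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F-measure from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = (2 * recall * precision) / (recall + precision)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical labels for eight test records (1 = Exist, 0 = Not Exist).
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(actual, predicted)
metrics = classification_metrics(*counts)
```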
The complete research method is depicted in Fig. 1.
Fig. 1. Research Method
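The correlation screening and the 80:20 holdout split described in the steps above can be sketched as follows. This is a plain-Python illustration and not the notebook code used in this research: the column names, the toy values, the 0.8 threshold for a strong correlation and the random seed are all assumptions. The exact number of test records also depends on rounding and on any records dropped during preprocessing.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def strongly_correlated_pairs(columns, threshold=0.8):
    """List attribute pairs whose absolute correlation reaches the threshold."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) >= threshold:
                pairs.append((first, second, r))
    return pairs


def train_test_split_indices(n_records, test_fraction=0.2, seed=42):
    """Shuffle record indices and hold out a fraction of them for testing."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_test = round(n_records * test_fraction)
    return indices[n_test:], indices[:n_test]


# Hypothetical columns; "age_months" is deliberately redundant with "age".
ages = [63, 67, 41, 37, 56, 45]
columns = {
    "age": ages,
    "age_months": [12 * a for a in ages],
    "chol": [233, 286, 204, 250, 236, 199],
}
pairs = strongly_correlated_pairs(columns)

# Indices for a dataset of 303 records split 80:20.
train_idx, test_idx = train_test_split_indices(303)
```

In this toy example only the deliberately redundant pair is flagged, so one of those two columns would be dropped before training.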
IV. RESULT AND DISCUSSION
In this paper, the results are the models developed using
Gaussian Naïve Bayes and Random Forest algorithms to
predict the early diagnosis of heart disease.
A. Gaussian Naïve Bayes
The model developed using Gaussian Naïve Bayes gives
the best accuracy in diagnosing heart disease. The
confusion matrix is shown in Fig. 2. It shows that 20
positive records and 31 negative records are correctly
predicted by the model. The remaining 9 records, however,
are misclassified.
Fig. 2. Confusion Matrix of Gaussian Naïve Bayes model
From the cross tabulation of actual and predicted positive
and negative values in confusion matrix, the measurement
values of the model can be obtained, and are shown in Fig. 3.
Fig. 3. Measurements of Gaussian Naïve Bayes model
B. Random Forest
The model developed using the Random Forest algorithm
is less accurate than the Gaussian Naïve Bayes model. The
confusion matrix is shown in Fig. 4. On the test data, the
model correctly predicts 16 positive records and 29 negative
records and makes 15 incorrect predictions, which consist of
9 False Negative and 6 False Positive records.
Fig. 4. Confusion Matrix of Random Forest model
From the cross tabulation of actual and predicted positive
and negative values in confusion matrix, the measurement
values of the model can be obtained, and are shown in Fig. 5.
Fig. 5. Measurements of Random Forest model
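As a check on the reported accuracies, the counts stated in the text can be substituted into the evaluation formulas. The derivation below is a worked sketch that treats Exist as the positive class. For Gaussian Naïve Bayes the text gives only the total of 9 misclassified records, so only the accuracy can be derived. The Random Forest precision, recall and F-measure computed here are positive-class values and may differ from Fig. 5 if that figure reports values averaged over both classes.

$$\text{Accuracy}_{\text{GNB}} = \frac{20 + 31}{20 + 31 + 9} = \frac{51}{60} = 0.85$$

$$\text{Accuracy}_{\text{RF}} = \frac{16 + 29}{16 + 29 + 6 + 9} = \frac{45}{60} = 0.75$$

$$\text{Precision}_{\text{RF}} = \frac{16}{16 + 6} = \frac{16}{22} \approx 0.727, \qquad \text{Recall}_{\text{RF}} = \frac{16}{16 + 9} = \frac{16}{25} = 0.64$$

$$\text{F-measure}_{\text{RF}} = \frac{2 \times 0.64 \times 0.727}{0.64 + 0.727} = \frac{32}{47} \approx 0.681$$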
In general, the False Negative and False Positive results
produced by the Gaussian Naïve Bayes and Random Forest
models are attributable to the nature of the data and of the
algorithms. Gaussian Naïve Bayes assumes that the input
values follow a Gaussian-like distribution, whereas the
distribution of the values in the dataset is not completely
Gaussian. Moreover, Naïve Bayes assumes that each risk
factor associated with heart disease is independent of the
others. In principle, this may not be true, since one risk
factor may be correlated with another. For instance, the
maximum heart rate is closely correlated with age. According
to Dr. William Haskell in the 1970s, the maximum heart rate,
measured in beats per minute (bpm), is around 220 minus the
age in years; taking this relationship into account may
change the accuracy of the models. On the other hand, the
accuracy of Random Forest is quite sensitive to the data.
Moreover, Random Forest is less interpretable, because an
ensemble of decision trees is difficult to visualize.
V. CONCLUSION AND FUTURE WORK
This research developed machine learning-based
prediction models of coronary heart disease using Gaussian
Naïve Bayes and Random Forest, applied to a dataset of 303
records and 13 selected attributes. The holdout method is
used to split the dataset into training and testing datasets
with a ratio of 80%:20%, respectively. The results show that
Gaussian Naïve Bayes has higher accuracy, precision,
F-measure and recall than Random Forest. Future research
will employ more records, namely a dataset of patients with
heart disease from a national hospital in Indonesia, involve
more attributes or risk factors, and explore more precise
techniques, including feature extraction and other
classification techniques, to increase the accuracy of the
models.
ACKNOWLEDGMENT
This work is supported by the Research and Technology
Transfer Office, Bina Nusantara University as part of Bina
Nusantara University’s International Research Grant entitled
"Aplikasi Prediksi Diagnosa Awal Penyakit Jantung Koroner
Berbasis Web Dengan Teknik Regresi Logistik" with
contract number: No: 017/VR.RTT/III/2021 and contract
date: 22 March 2021.
REFERENCES
[1] World Health Organization, “Cardiovascular Disease”, World Health
Organization, May 2021 [Online] Available:
https://www.who.int/health-topics/cardiovascular-diseases/#tab=tab_1
[Accessed 28 May 2021].
[2] D. Shah, S. Patel, S. K. Bharti, “Heart disease prediction using
machine learning techniques”, SN Computer Science, Vol. 1, No.
345, 2020. https://doi.org/10.1007/s42979-020-00365-y
[3] Kemkes, Badan Litbangkes, Laporan Nasional Riskesdas 2018, 1
December 2018 [Online] Available at:
http://labdata.litbang.kemkes.go.id/images/download/laporan/RKD/2018/Laporan_Nasional_RKD2018_FINAL.pdf
[Accessed 29 May 2021].
[4] Kemkes, Badan Litbangkes, Laporan Nasional Riskesdas 2013, 1
December 2013 [Online] Available at:
http://labdata.litbang.kemkes.go.id/images/download/laporan/RKD/2013/Laporan_riskesdas_2013_final.pdf
[Accessed 29 May 2021].
[5] N. A. Prihartono, Fitriyani, W. Riyadina, “Cardiovascular disease risk
factors among blue and white-collar workers in Indonesia”, Acta
Medica Indonesiana - The Indonesian Journal of Internal Medicine,
Vol. 50, No. 2, pp. 96-103, 2018.
[6] M. A. Hussain, A. A. Mamun, S. AE Peters, M. Woodward, R.R.
Huxley, “The burden of cardiovascular disease attributable to major
modifiable risk factors in Indonesia”, Journal of Epidemiology,
26(10), pp. 515-521, 2016.
[7] A. Maharani, Sujarwoto, D. Praveen, D. Oceandy, G. Tampubolon,
A. Patel, “Cardiovascular disease risk factor prevalence and estimated
10-year cardiovascular risk scores in Indonesia: The SMARThealth
Extend study”, PLoS ONE, 14(4): e0215219, 2019.
https://doi.org/10.1371/journal.pone.0215219
[8] W. Adisasmito, V. Amir, A. Atin, A. Megraini, D. Kusuma,
“Geographic and socioeconomic disparity in cardiovascular risk
factors in Indonesia: analysis of the Basic Health Research 2018”,
BMC Public Health, Vol. 20, No. 1004, 2020.
https://doi.org/10.1186/s12889-020-09099-1
[9] K. Tripathi, H. Garg, H. Sharma, “A comprehensive survey on
cardiovascular disease”, International Journal of Grid and Distributed
Computing, Vol. 13, No. 1, pp. 1772-1780, 2020.
[10] A. N. Repaka, S. D. Ravikanti, R. G. Franklin, “Design and
implementing heart disease prediction using Naives Bayesian”, In
Proceedings of 2019 3rd International Conference on Trends in
Electronics and Informatics (ICOEI), IEEE, Tirunelveli, India, pp.
292-297, 2019.
[11] R. R. Ade, D. S. Medhekar, M. P. Bote, “Heart disease prediction
system using SVM and naive Bayes”, International Journal of
Engineering Sciences & Research Technology, Vol. 2, No. 5, 2013.
[12] M. A. Jabbar, S. Samreen, “Heart disease prediction system based
on hidden naïve Bayes classifier”, In Proceedings of 2016
International Conference on Circuits, Controls, Communications and
Computing (I4C), IEEE, Bangalore, pp. 1-5, 2016.
[13] E. Miranda, E. Irwansyah, A. Y. Amelga, M. M. Maribondang, M.
Salim, “Detection of cardiovascular disease risk's level for adults
using naive Bayes classifier”, Healthcare Informatics Research
Journal, Vol. 22, Issue 3, pp. 196-205, 2016.
[14] M. Pal, S. Parija, “Prediction of Heart Diseases using Random
Forest”, Journal of Physics: Conference Series, Vol. 1817, No. 1, pp.
012009, 2021. https://doi.org/10.1088/1742-6596/1817/1/012009
[15] X. Su, Y. Xu, Z. Tan, X. Wang, P. Yang, Y. Su, Y. Jiang, S. Qin, L.
Shang, “Prediction for cardiovascular diseases based on laboratory
data: An analysis of random forest model”, Journal of Clinical
Laboratory Analysis, 34(9), e23421, 2020.
https://doi.org/10.1002/jcla.23421
[16] Y. K. Singh, N. Sinha, S. K. Singh, “Heart disease prediction system
using Random Forest”, In Proceedings of 2017 International
Conference on Advances in Computing and Data Sciences, pp.
613-623, 2017.
[17] Kaggle, “Cleveland Clinic Heart Disease Dataset”, [Online] Available
at: https://www.kaggle.com/aavigan/cleveland-clinic-heart-disease-dataset
[Accessed 28 May 2021].
... As a result, several studies have been conducted on the early detection of cardiac problems and the identification of the most important associated risk factors. Despite extensive efforts, prediction accuracy has remained inadequate, and the identification of the most influential risk factors has been challenging [1]. Data analysis approaches have been used to help healthcare professionals detect early indications of cardiac disease. ...
Article
Full-text available
Coronary Heart Disease (CHD) is a persistent health issue, and risk prognosis is very important because it creates opportunities for doctors to provide early solutions. Despite such promising results, this type of analysis runs into several problems, such as accurately handling high-dimensional data because of the abundance of extracted information that hampers the prediction process. This paper presents a new approach that integrates Principal Component Analysis (PCA) and feature selection techniques to improve the prediction performance of CHD models, especially in light of dimensionality consideration. Feature selection is identified as one of the contributors to enhance model performance. Reducing the input space and identifying important attributes related to heart disease offers a refined approach to CHD prediction. Then four classifiers were used, namely PCA, Random Forest (RF), Decision Trees (DT), and AdaBoost, and an accuracy of approximately 96% was achieved, which is quite satisfactory. The experimentations demonstrated the effectiveness of this approach, as the proposed model was more effective than the other traditional models including the RF and LR in aspects of precision, recall, and AUC values. This study proposes an approach to reduce data dimensionality and select important features, leading to improved CHD prediction and patient outcomes.
... In [6], the authors compare Gaussian Naive Bayes, Bernoulli Naive Bayes, and Random Forest algorithms in predicting coronary heart disease. Using the Cleveland dataset from the UCI repository, the study evaluates these models based on accuracy, precision, F 1 score, and recall. ...
Preprint
Full-text available
Coronary Heart Disease affects millions of people worldwide and is a well-studied area of healthcare. There are many viable and accurate methods for the diagnosis and prediction of heart disease, but they have limiting points such as invasiveness, late detection, or cost. Supervised learning via machine learning algorithms presents a low-cost (computationally speaking), non-invasive solution that can be a precursor for early diagnosis. In this study, we applied several well-known methods and benchmarked their performance against each other. It was found that Random Forest with oversampling of the predictor variable produced the highest accuracy of 84%.
... To this end, several imbalance techniques have been discussed. Bemando et al. [39] predicted models for coronary cardiovascular disease (CHD), which is known as cardiovascular disease have been proposed. To this point, numerous supervised machine-learning algorithms, including Gaussian Naïve Bayes, Bernoulli Naïve Bayes, and Random Forest, are exercised in cardiovascular (heart) disease prediction. ...
Article
Heart disease remains one of the leading causes of mortality worldwide, with diagnosis and treatment presenting significant challenges, particularly in developing nations. These challenges stem from the scarcity of effective diagnostic tools, a lack of qualified medical personnel, and other factors that hinder good patient prognosis and treatment. The rise in cardiac disorders, despite their preventability, is primarily due to inadequate preventive measures and a shortage of skilled medical providers. In this study, we propose a novel approach to enhance the accuracy of cardiovascular disease prediction by identifying critical features using advanced machine learning techniques. Utilizing the Cleveland Heart Disease dataset, we explore various feature combinations and implement multiple well-known classification strategies. By integrating a Voting Classifier ensemble, which combines Logistic Regression, Gradient Boosting, and Support Vector Machine (SVM) models, we create a robust prediction model for heart disease. This hybrid approach achieves a remarkable accuracy level of 97.9%, significantly improving the precision of cardiovascular disease prediction and offering a valuable tool for early diagnosis and treatment.
... To this end, several imbalance techniques have been discussed. In [124], prediction models for coronary cardiovascular disease (CHD), which is known as cardiovascular disease have been proposed. To this end, various supervised machine learning algorithms, including Gaussian Naïve Bayes, Bernoulli Naïve Bayes, and Random Forest, are employed in cardiovascular disease prediction. ...
Article
Full-text available
Cardiovascular disease is the leading cause of global mortality and responsible for millions of deaths annually. The mortality rate and overall consequences of cardiac disease can be reduced with early disease detection. However, conventional diagnostic methods encounter various challenges, including delayed treatment and misdiagnoses, which can impede the course of treatment and raise healthcare costs. The application of artificial intelligence (AI) techniques, especially machine learning (ML) algorithms, offers a promising pathway to address these challenges. This paper emphasizes the central role of machine learning in cardiac health and focuses on precise cardiovascular disease prediction. In particular, this paper is driven by the urgent need to fully utilize the potential of machine learning to enhance cardiovascular disease prediction. In light of the continued progress in machine learning and the growing public health implications of cardiovascular disease, this paper aims to offer a comprehensive analysis of the topic. This review paper encompasses a wide range of topics, including the types of cardiovascular disease, the significance of machine learning, feature selection, the evaluation of machine learning models, data collection & preprocessing, evaluation metrics for cardiovascular disease prediction, and the recent trends & suggestion for future works. In addition, this paper offers a holistic view of machine learning’s role in cardiovascular disease prediction and public health. We believe that our comprehensive review will contribute significantly to the existing body of knowledge in this essential area.
Article
Full-text available
A visit to the doctor’s office usually starts with the nurse collecting patient symptoms, health information, and necessary lab tests. All the information will be presented to the doctor, and the doctor may collect additional information in order to do the right diagnosis. The doctor’s brain is like a complicated machine capable of quick processing of the information, relating it to previous patients, and mapping the information to diagnoses. This process resembles much to how machine learning works. In this article, we explore how machine learning could help predict different diseases and facilitate a doctor’s diagnosis. In particular, our study focuses on unbalanced, multiclass classification problems.
Thesis
Chronic heart disease is a leading global cause of mortality. Identifying diseases in a timely manner is crucial for reducing mortality rates. The COVID-19 pandemic, which began in 2020, prompted individuals to seek alternative methods of diagnosis such as online medical blogs and selfdiagnosis systems. Unfortunately, these methods often resulted in misinterpretations and inaccurate assumptions about the causes of heart disease. Various Machine Learning (ML) systems have been introduced for heart disease prediction. In this paper, we propose a Main System Model that uses either of the two proposed deep hybrid models for heart disease prediction, utilizing the Center for Disease Control (CDC) dataset. As it is a highly imbalanced dataset, so Synthetic Minority Oversampling Technique (SMOTE), Localized Random Affine Shadowsampling (LoRAS) and Proximity Weighted Random Affine Shadowsampling (ProWRAS) are used for data balancing. Standard scaling is used to transform raw data into a common scale, typically with a mean of 0 and a standard deviation of 1. In sub-system model 1, Deep Boltzmann Machine (DBM) and Residual Network (ResNet) are combined to form a hybrid model and the resultant model is termed as DeeRes. Whereas, in sub-system model 2, a hybrid of Variational Autoencoder (VAE) and GoogLeNet is developed, termed as VGoo. Both the models are evaluated using accuracy, precision, F1 score and recall classification metrics and validated using the 10-Fold Cross-Validation technique. An ablation study is performed to assess the impact of parameters influencing our model’s performance. Simulations are performed using SMOTE, LoRAS and ProWAS balancing techniques. DeeRes shows an accuracy of 79.67%, F1 score of 79%, precision of 77%, and recall of 80% using SMOTE. Whereas, VGoo with SMOTE shows an accuracy of 78.52%, F1 score of 79%, precision of 76%, and recall of 82%. 
In addition, DeeRes shows an accuracy of 87.98%, F1 score of 86%, precision of 91%, and recall of 83% using ProWRAS. Whereas, VGoo with ProWRAS shows an accuracy of 87.39%, F1 score of 86%, precision of 92%, and recall of 81%. When using LoRAS, DeeRes gives an accuracy of 95.14%, F1 score of 94%, precision of 97% and recall of 91%. Whereas, VGoo with LoRAS gives accuracy of 94.90%, precision of 98%, F1 score of 93%, and recall of 89%. The simulation results prove that DeeRes gives better results than VGoo, under the same conditions.
Article
Full-text available
The process of discovering or mining information from a huge volume of data is known as data mining technology. Today data mining has lots of application in every aspects of human life. Applications of data mining are wide and diverse. Among this health care is a major application of data mining. Medical field has get benefited more from data mining. Heart Disease is the most dangerous life-threatening chronic disease globally. The objective of the work is to predicts the occurrence of heart disease of a patient using random forest algorithm. The dataset was accessed from Kaggle site. The dataset contains 303 samples and 14 attributes are taken for features of the dataset. Then it was processed using python open access software in jupyter notebook. The datasets are classified and processed using machine learning algorithm Random forest. The outcomes of the dataset are expressed in terms of accuracy, sensitivity and specificity in percentage. Using random forest algorithm, we obtained accuracy of 86.9% for prediction of heart disease with sensitivity value 90.6% and specificity value 82.7%. From the receiver operating characteristics, we obtained the diagnosis rate for prediction of heart disease using random forest is 93.3%. The random forest algorithm has proven to be the most efficient algorithm for classification of heart disease and therefore it is used in the proposed system.
Heart disease, alternatively known as cardiovascular disease, encompasses various conditions that affect the heart and has been the primary cause of death worldwide over the past few decades. Many risk factors are associated with heart disease, and accurate, reliable, and sensible approaches to early diagnosis are needed to achieve prompt management of the disease. Data mining is a commonly used technique for processing enormous data in the healthcare domain. Researchers apply several data mining and machine learning techniques to analyse huge, complex medical data, helping healthcare professionals to predict heart disease. This research paper presents various attributes related to heart disease and models based on supervised learning algorithms, namely Naïve Bayes, decision tree, K-nearest neighbor, and random forest. It uses the existing dataset from the Cleveland database of the UCI repository of heart disease patients. The dataset comprises 303 instances and 76 attributes. Of these 76 attributes, only 14 are considered for testing, as they are important for substantiating the performance of the different algorithms. This research paper aims to estimate the probability of patients developing heart disease. The results show that the highest accuracy score is achieved with K-nearest neighbor.
Researchers have created several expert systems over the years to predict heart disease early and assist cardiologists in the diagnosis process. We present a diagnostic system in this paper that uses an optimized XGBoost (Extreme Gradient Boosting) classifier to predict heart disease. Proper hyper-parameter tuning is essential for any classifier’s successful application. To optimize the hyper-parameters of XGBoost, we used Bayesian optimization, which is a very efficient method for hyper-parameter optimization. We also used the One-Hot (OH) encoding technique to encode categorical features in the dataset to improve prediction accuracy. The efficacy of the proposed model is evaluated on the Cleveland heart disease dataset and compared with Random Forest (RF) and Extra Tree (ET) classifiers. Five evaluation metrics were used for performance evaluation: accuracy, sensitivity, specificity, F1-score, and AUC (area under the ROC curve). The experimental results showed its validity and efficacy in the prediction of heart disease. In addition, the proposed model displays better performance than previously suggested models, reaching a prediction accuracy of 91.8%. Our results indicate that the proposed method could be used reliably to predict heart disease in the clinic.
Background: To establish a prediction model for cardiovascular diseases (CVD) in the general population based on random forests. Methods: A retrospective study involving 498 subjects was conducted in Xi'an Medical University between 2011 and 2018. The random forest algorithm was used to select the variables that greatly affected the CVD prediction and to establish a prediction model. The important variables were included in the multifactorial logistic regression analysis. The area under the curve (AUC) was compared between the logistic regression model and the random forest model. Results: The random forest model revealed that the variables age, body mass index (BMI), fasting blood glucose (FBG), diastolic blood pressure (DBP), triglyceride (TG), systolic blood pressure (SBP), total cholesterol (TC), waist circumference, and high‐density lipoprotein‐cholesterol (HDL‐C) were more significant for CVD prediction; the AUC was 0.802 in CVD prediction. Multifactorial logistic regression analysis indicated that the risk factors for CVD included age [odds ratio (OR): 1.14, 95% confidence interval (CI): 1.10‐1.17, P < .001], BMI (OR: 1.13, 95% CI: 1.06‐1.20, P < .001), TG (OR: 1.11, 95% CI: 1.02‐1.22, P = .023), and DBP (OR: 1.04, 95% CI: 1.02‐1.06, P = .001); the AUC was 0.843 in CVD prediction. The established logistic regression prediction model was Logit P = Log[P/(1 − P)] = −11.47 + 0.13 × age + 0.12 × BMI + 0.11 × TG + 0.04 × DBP; P = 1/[1 + exp(−Logit P)]. People were prone to develop CVD when P > .51. Conclusions: A prediction model for CVD is developed in the general population based on random forests, which provides a simple tool for the early prediction of CVD.
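The logistic regression model quoted in this abstract can be evaluated directly. The sketch below uses the stated coefficients and the P > .51 cut-off; the abstract does not give measurement units, so years, kg/m², mmol/L and mmHg are assumptions here, and the function names and sample inputs are hypothetical.

```python
import math

# Coefficients as quoted in the abstract.
INTERCEPT = -11.47
COEF_AGE, COEF_BMI, COEF_TG, COEF_DBP = 0.13, 0.12, 0.11, 0.04
THRESHOLD = 0.51  # the abstract flags P > .51 as prone to CVD


def cvd_probability(age, bmi, tg, dbp):
    """Return P = 1 / (1 + exp(-Logit P)) for the quoted model.

    Units are assumed (not stated in the abstract): age in years,
    BMI in kg/m^2, TG in mmol/L, DBP in mmHg.
    """
    logit = (INTERCEPT + COEF_AGE * age + COEF_BMI * bmi
             + COEF_TG * tg + COEF_DBP * dbp)
    return 1.0 / (1.0 + math.exp(-logit))


def is_prone_to_cvd(probability):
    """Apply the P > .51 decision rule from the abstract."""
    return probability > THRESHOLD


# Hypothetical subject, for illustration only.
p = cvd_probability(age=60, bmi=25.0, tg=1.5, dbp=80)
print(f"P = {p:.3f}, prone to CVD: {is_prone_to_cvd(p)}")
```

The coefficients agree with the reported odds ratios up to rounding, since each coefficient is the natural logarithm of its odds ratio (for example, ln 1.14 ≈ 0.13 for age).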
Background: Cardiovascular diseases (CVDs) accounted for over 17 million deaths and 353 million disability-adjusted life years lost in 2016. The risk factors are also high and increasing, with high blood pressure, smoking, and high body mass index contributing up to 212 million disability-adjusted life years in 2016. To help reduce the burden, it is crucial to understand the geographic and socioeconomic disparities in CVD risk factors. Methods: Employing both geospatial and quantitative analyses, we analyzed the disparities in the prevalence of smoking, physical inactivity, obesity, hypertension, and diabetes in Indonesia. CVD data were from Riskesdas 2018, and socioeconomic data were from the World Bank. Results: Our findings show a very high prevalence of CVD risk factors, with the prevalence of smoking, physical inactivity, obesity, and hypertension ranging from 28 to 33%. Results also show geographic disparity in CVD risk factors across all five Indonesian regions. Moreover, results show socioeconomic disparity: the prevalence of obesity, hypertension, and diabetes is higher among urban districts and the richest and most educated districts, while that of physical inactivity and smoking is higher among rural and the least educated districts. Conclusion: The CVD burden is high and increasing, particularly in urban areas and in districts with higher income and education levels. While the government needs to continue tackling the persistent burden from maternal mortality and infectious diseases, it needs to put more effort into the prevention and control of CVDs and their risk factors.
Coronary heart disease (CHD) is one of the leading causes of death worldwide; patients with end-stage CHD require the most advanced treatments, such as heart surgery and heart transplant. Moreover, CHD is not easy to diagnose at an early stage; hospitals diagnose it based on various types of medical tests. Predicting which people are at high risk of CHD is therefore important for reducing the risk of developing it. In recent years, some research has used data mining to predict the risk of developing diseases based on medical tests. In this study, we propose reconstruction error (RE) based deep neural networks (DNNs); this approach uses a deep autoencoder (AE) model to estimate the RE. Initially, the training dataset is divided into two groups by their RE divergence on a deep AE model trained on the whole training dataset. Next, two DNN classifiers are trained separately on the two groups, combining a new RE-based feature with other risk factors to predict the risk of developing CHD. To create the new feature, we use a deep AE model trained only on the high-risk dataset. We performed an experiment to show how the components of our proposed method work together. In our experiment, the accuracy, precision, recall, F-measure, and AUC score reached 86.3371%, 91.3716%, 82.9024%, 86.9148%, and 86.6568%, respectively. These results show that the proposed AE-DNNs outperformed regular machine learning-based classifiers for CHD risk prediction.
Background: The brunt of cardiovascular disease (CVD) burden globally now resides within low- and middle-income countries, including Indonesia. However, little is known regarding cardiovascular health in Indonesia. This study aimed to estimate the prevalence of elevated CVD risk in a specific region of Indonesia. Methods: We conducted full household screening for cardiovascular risk factors among adults aged 40 years and older in 8 villages in Malang District, East Java Province, Indonesia, in 2016–2017. Ten-year cardiovascular risk scores were calculated based on the World Health Organization/International Society of Hypertension’s region-specific charts that use age, sex, blood pressure, diabetes status and smoking behaviour. Results: Among 22,093 participants, 6,455 (29.2%) had high cardiovascular risk, defined as the presence of coronary heart disease, stroke or other atherosclerotic disease; estimated 10-year CVD risk of ≥ 30%; or estimated 10-year CVD risk between 10% and 29% combined with a systolic blood pressure of > 140 mmHg. The prevalence of high CVD risk was greater in urban (31.6%, CI 30.7–32.5%) than in semi-urban (28.7%, CI 27.3–30.1%) and rural areas (26.2%, CI 25.2–27.2%). Only 11% and 1% of all the respondents with high CVD risk were on blood pressure lowering and statin treatment, respectively. Conclusions: High cardiovascular risk is common among Indonesian adults aged ≥40 years, and rates of preventive treatment are low. Population-based and clinical approaches to preventing CVD should be a priority in both urban and rural areas.
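The three-part definition of high cardiovascular risk in this abstract can be written as a small decision rule. This is a sketch only: it assumes the 10-year risk is available as a percentage and treats the 10% to 29% band as 10 ≤ risk < 30; the function and parameter names are hypothetical.

```python
def is_high_cv_risk(has_atherosclerotic_disease, ten_year_risk_pct, systolic_bp_mmhg):
    """Apply the high-risk definition quoted in the abstract.

    has_atherosclerotic_disease: True if coronary heart disease, stroke or
        other atherosclerotic disease is present.
    ten_year_risk_pct: estimated 10-year CVD risk, in percent.
    systolic_bp_mmhg: systolic blood pressure in mmHg.
    """
    if has_atherosclerotic_disease:
        return True
    if ten_year_risk_pct >= 30:
        return True
    # Intermediate risk counts as high only with raised systolic pressure.
    # The band is read as 10 <= risk < 30, which is an assumption.
    return 10 <= ten_year_risk_pct < 30 and systolic_bp_mmhg > 140


# Hypothetical participants, for illustration only.
print(is_high_cv_risk(False, 15, 150))  # intermediate risk, raised pressure
print(is_high_cv_risk(False, 15, 130))  # intermediate risk, normal pressure
```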
Coronary heart disease (CHD) is a significant medical disorder and one of the most prevalent forms of heart disease. Because a heart attack can occur without warning, an effective screening system is essential. This paper investigates a new CHD detection approach built on optimized machine learning techniques, namely classifier ensembles. First, to boost the efficiency of our system, we used the Feature-Selector optimization model to select the best subset of CHD features. Second, to solve the problem of imbalanced CHD data, we used optimized SMOTE over-sampling, a highly efficient approach embedded with an optimization model. The class label estimates of three optimized learners, namely random forest, the XGBoost API optimization model, and the SVM optimization model, are integrated in a stacked architecture. The identification model is validated using data from CHD patients. Finally, in terms of precision, F1, and ROC curve, our detection model outperformed existing ones based on ensembles of optimization models and on individual classifiers. With random forest optimization, we achieved 90% accuracy, and with the XGBoost API optimization model, we achieved 89% accuracy. Compared with previously reported research in the existing literature, this analysis indicates that our proposed model makes a substantial contribution.
Background: In Indonesia, coronary heart disease (CHD) and stroke are estimated to cause more than 470 000 deaths annually. In order to inform primary prevention policies, we estimated the sex- and age-specific burden of CHD and stroke attributable to five major and modifiable vascular risk factors: cigarette smoking, hypertension, diabetes, elevated total cholesterol, and excess body weight. Methods: Population attributable risks for CHD and stroke attributable to these risk factors individually were calculated using summary statistics obtained for prevalence of each risk factor specific to sex and to two age categories (<55 and ≥55 years) from a national survey in Indonesia. Age- and sex-specific relative risks for CHD and stroke associated with each of the five risk factors were derived from prospective data from the Asia-Pacific region. Results: Hypertension was the leading vascular risk factor, explaining 20%–25% of all CHD and 36%–42% of all strokes in both sexes and approximately one-third of all CHD and half of all strokes across younger and older age groups alike. Smoking in men explained a substantial proportion of vascular events (25% of CHD and 17% of strokes). However, given that these risk factors are likely to be strongly correlated, these population attributable risk proportions are likely to be overestimates and require verification from future studies that are able to take into account correlation between risk factors. Conclusions: Implementation of effective population-based prevention strategies aimed at reducing levels of major cardiovascular risk factors, especially blood pressure, total cholesterol, and smoking prevalence among men, could reduce the growing burden of CVD in the Indonesian population.
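A population attributable risk combines a risk factor's prevalence with its relative risk. The abstract does not state which formula was used; the sketch below assumes Levin's formula, PAR = p(RR − 1) / [1 + p(RR − 1)], a common choice for a single risk factor. The function name and the input values are hypothetical and are not the study's data.

```python
def population_attributable_risk(prevalence, relative_risk):
    """Levin's formula (assumed here, not confirmed by the abstract).

    prevalence: proportion of the population exposed, between 0 and 1.
    relative_risk: risk in the exposed divided by risk in the unexposed.
    """
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical inputs, for illustration only.
par = population_attributable_risk(prevalence=0.30, relative_risk=2.0)
print(f"PAR = {par:.1%}")
```

Because the formula treats each risk factor in isolation, summing the results across correlated factors overstates their joint contribution, which is the caveat the abstract raises.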