Intelligent heart disease prediction system using data mining techniques

IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.8, August 2008
Manuscript received August 5, 2008
Manuscript revised August 20, 2008
Intelligent Heart Disease Prediction System Using
Data Mining Techniques
Sellappan Palaniappan Rafiah Awang
Department of Information Technology Malaysia University of Science and Technology
Block C, Kelana Square, Jalan SS7/26 Kelana Jaya, 47301 Petaling Jaya, Selangor, Malaysia
sell@must.edu.my rafyea99@yahoo.com
Abstract
The healthcare industry collects huge amounts of healthcare
data which, unfortunately, are not “mined” to discover hidden
information for effective decision making. Discovery of hidden
patterns and relationships often goes unexploited. Advanced data
mining techniques can help remedy this situation. This research
has developed a prototype Intelligent Heart Disease Prediction
System (IHDPS) using data mining techniques, namely, Decision
Trees, Naïve Bayes and Neural Network. Results show that each
technique has its unique strength in realizing the objectives of the
defined mining goals. IHDPS can answer complex “what if”
queries which traditional decision support systems cannot. Using
medical profiles such as age, sex, blood pressure and blood
sugar it can predict the likelihood of patients getting a heart
disease. It enables significant knowledge, e.g. patterns,
relationships between medical factors related to heart disease, to
be established. IHDPS is Web-based, user-friendly, scalable,
reliable and expandable. It is implemented on the .NET platform.
1. Motivation
A major challenge facing healthcare organizations
(hospitals, medical centers) is the provision of quality
services at affordable costs. Quality service implies
diagnosing patients correctly and administering treatments
that are effective. Poor clinical decisions can lead to
disastrous consequences and are therefore unacceptable.
Hospitals must also minimize the cost of clinical tests.
They can achieve these results by employing appropriate
computer-based information and/or decision support
systems.
Most hospitals today employ some form of hospital
information system to manage their healthcare or patient
data [12]. These systems typically generate huge amounts
of data which take the form of numbers, text, charts and
images. Unfortunately, these data are rarely used to
support clinical decision making. There is a wealth of
hidden information in these data that is largely untapped.
This raises an important question: “How can we turn data
into useful information that can enable healthcare
practitioners to make intelligent clinical decisions?” This
is the main motivation for this research.
2. Problem statement
Many hospital information systems are designed to
support patient billing, inventory management and
generation of simple statistics. Some hospitals use decision
support systems, but they are largely limited. They can
answer simple queries like “What is the average age of
patients who have heart disease?”, “How many surgeries
had resulted in hospital stays longer than 10 days?”,
“Identify the female patients who are single, above 30
years old, and who have been treated for cancer.”
However, they cannot answer complex queries like
“Identify the important preoperative predictors that
increase the length of hospital stay”, “Given patient
records on cancer, should treatment include chemotherapy
alone, radiation alone, or both chemotherapy and
radiation?”, and “Given patient records, predict the
probability of patients getting a heart disease.”
Clinical decisions are often made based on doctors’
intuition and experience rather than on the knowledge-rich
data hidden in the database. This practice leads to
unwanted biases, errors and excessive medical costs, which
affect the quality of service provided to patients. Wu et al.
proposed that integrating clinical decision support with
computer-based patient records could reduce medical
errors, enhance patient safety, decrease unwanted practice
variation, and improve patient outcomes [17]. This
suggestion is promising as data modelling and analysis
tools, e.g., data mining, have the potential to generate a
knowledge-rich environment which can help to
significantly improve the quality of clinical decisions.
3. Research objectives
The main objective of this research is to develop a
prototype Intelligent Heart Disease Prediction System
(IHDPS) using three data mining modeling techniques,
namely, Decision Trees, Naïve Bayes and Neural Network.
IHDPS can discover and extract hidden knowledge
(patterns and relationships) associated with heart disease
from a historical heart disease database. It can answer
complex queries for diagnosing heart disease and thus
assist healthcare practitioners to make intelligent clinical
decisions which traditional decision support systems
cannot. By supporting effective treatment decisions, it also
helps to reduce treatment costs. To enhance visualization and
ease of interpretation, it displays the results in both tabular
and graphical forms.
4. Data mining review
Although data mining has been around for more than
two decades, its potential is only being realized now. Data
mining combines statistical analysis, machine learning and
database technology to extract hidden patterns and
relationships from large databases [15]. Fayyad defines
data mining as “a process of nontrivial extraction of
implicit, previously unknown and potentially useful
information from the data stored in a database” [4].
Giudici defines it as “a process of selection, exploration
and modelling of large quantities of data to discover
regularities or relations that are at first unknown with the
aim of obtaining clear and useful results for the owner of
database” [5].
Data mining uses two strategies: supervised and
unsupervised learning. In supervised learning, a training
set with known outcomes (labels) is used to learn model
parameters, whereas in unsupervised learning no labelled
training set is used (e.g., k-means clustering is
unsupervised) [12].
Each data mining technique serves a different purpose
depending on the modelling objective. The two most
common modelling objectives are classification and
prediction. Classification models predict categorical labels
(discrete, unordered) while prediction models predict
continuous-valued functions [6]. Decision Trees and
Neural Networks use classification algorithms while
Regression, Association Rules and Clustering use
prediction algorithms [3].
Decision Tree algorithms include CART (Classification
and Regression Tree), ID3 (Iterative Dichotomiser 3) and
C4.5. These algorithms differ in the selection of splits, when
to stop a node from splitting, and the assignment of a class
to a non-split node [7]. CART uses the Gini index to measure
the impurity of a partition or set of training tuples [6]. It can
handle high-dimensional categorical data. Decision Trees
can also handle continuous data (as in regression), but such
data must first be converted to categorical data.
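As a minimal sketch of the Gini index mentioned above, the following Python functions compute the impurity of a set of class labels and the weighted impurity of a candidate split. The function names and the toy labels are illustrative assumptions and are not taken from IHDPS.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(partitions):
    """Weighted Gini impurity of a split into several partitions."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * gini(p) for p in partitions)

# Toy labels (made up): 1 = heart disease, 0 = no heart disease.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]

print(gini(parent))               # impurity before the split
print(gini_split([left, right]))  # weighted impurity after the split
```

A split is attractive when the weighted impurity after the split is clearly lower than the impurity of the parent partition, as it is for these toy labels.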
Naïve Bayes, which is based on Bayes’ Rule, underlies many
machine-learning and data mining methods [14]. The rule
(algorithm) is used to create models with predictive
capabilities. It provides new ways of exploring and
understanding data. It learns from the “evidence” by
calculating how strongly the target (i.e., dependent)
variable is related to the other (i.e., independent) variables,
expressed as conditional probabilities.
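To make the idea concrete, here is a small sketch of a categorical Naïve Bayes classifier in plain Python. It is not the Microsoft implementation used by IHDPS: the function names, the Laplace smoothing and the six toy records are assumptions chosen only for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, target):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for row in rows:
        for attr, value in row.items():
            if attr == target:
                continue
            value_counts[(attr, row[target])][value] += 1
            values_seen[attr].add(value)
    return class_counts, value_counts, values_seen

def predict_proba(model, case):
    """Posterior P(class | case) via Bayes' rule, with Laplace smoothing (an assumption)."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # prior P(class)
        for attr, value in case.items():
            count = value_counts[(attr, cls)][value]
            score *= (count + 1) / (n_cls + len(values_seen[attr]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}

# Made-up records using two attribute names from Figure 1.
rows = [
    {"ChestPain": 4, "Exang": 1, "Diagnosis": 1},
    {"ChestPain": 4, "Exang": 1, "Diagnosis": 1},
    {"ChestPain": 4, "Exang": 0, "Diagnosis": 1},
    {"ChestPain": 2, "Exang": 0, "Diagnosis": 0},
    {"ChestPain": 3, "Exang": 0, "Diagnosis": 0},
    {"ChestPain": 4, "Exang": 0, "Diagnosis": 0},
]
model = train_naive_bayes(rows, target="Diagnosis")
posterior = predict_proba(model, {"ChestPain": 4, "Exang": 1})
print(posterior)
```

The posterior is the kind of quantity a "what if" query returns: the estimated probability of each diagnosis given the entered attribute values.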
A Neural Network typically consists of three layers of
units (variables): input, hidden and output. Connections
between input units and the hidden and output units carry
weights that reflect the relevance of each input unit: the
higher the weight, the more important the input. Neural
Network algorithms use Linear and Sigmoid transfer
functions. Neural Networks are suitable for training on large
amounts of data with few inputs. They are used when other
techniques are unsatisfactory.
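The layered structure can be sketched as a single forward pass through one hidden layer. This is only an illustration: the weights are made up, and using the sigmoid transfer function in both layers is an assumption rather than a description of the network trained in IHDPS.

```python
import math

def sigmoid(x):
    """Sigmoid transfer function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Propagate inputs through one hidden layer to a single output unit."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)

# Hypothetical network: 3 inputs, 2 hidden units, 1 output unit.
inputs = [1.0, 0.0, 1.0]
hidden_weights = [[0.8, -0.2, 0.5], [-0.4, 0.9, 0.1]]
hidden_biases = [0.0, 0.1]
output_weights = [1.2, -0.7]
output_bias = -0.3

score = forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias)
print(score)  # a value between 0 and 1, readable as a predicted likelihood
```

Inputs connected through larger absolute weights move the output more, which is the sense in which a higher weight marks a more important input.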
5. Methodology
IHDPS uses the CRISP-DM methodology to build the
mining models. It consists of six major phases: business
understanding, data understanding, data preparation,
modeling, evaluation, and deployment. Business
understanding phase focuses on understanding the
objectives and requirements from a business perspective,
converting this knowledge into a data mining problem
definition, and designing a preliminary plan to achieve the
objectives. The data understanding phase starts with the raw
data and proceeds to understand the data, identify its quality,
gain preliminary insights, and detect interesting subsets to
form hypotheses for hidden information. Data preparation
phase constructs the final dataset that will be fed into the
modeling tools. This includes table, record, and attribute
selection as well as data cleaning and transformation. The
modeling phase selects and applies various techniques, and
calibrates their parameters to optimal values. The
evaluation phase evaluates the model to ensure that it
achieves the business objectives. The deployment phase
specifies the tasks that are needed to use the models [3].
Data Mining Extension (DMX), a SQL-style query
language for data mining, is used for building and
accessing the models’ contents. Tabular and graphical
visualizations are incorporated to enhance analysis and
interpretation of results.
5.1. Data source
A total of 909 records with 15 medical attributes
(factors) were obtained from the Cleveland Heart Disease
database [1]. Figure 1 lists the attributes. The records were
split almost equally into two datasets: a training dataset (455
records) and a testing dataset (454 records). To avoid bias,
the records for each set were selected randomly.
For the sake of consistency, only categorical attributes
were used for all the three models. All the non-categorical
medical attributes were transformed to categorical data.
The attribute “Diagnosis” was identified as the
predictable attribute with value “1” for patients with heart
disease and value “0” for patients with no heart disease.
The attribute “PatientID” was used as the key; the rest are
input attributes. It is assumed that problems such as
missing data, inconsistent data, and duplicate data have all
been resolved.
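The data preparation described in this section can be sketched in Python as a random split followed by binning of a continuous attribute. The seed, the helper names and the blood-pressure cut points below are hypothetical; the paper does not state which bins were used.

```python
import bisect
import random

def split_records(records, seed=0):
    """Shuffle the records and split them into two nearly equal halves."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    half = (len(shuffled) + 1) // 2
    return shuffled[:half], shuffled[half:]

def discretize(value, cut_points):
    """Map a continuous value to the index of its bin (cut points ascending)."""
    return bisect.bisect_right(cut_points, value)

# 909 placeholder record identifiers stand in for the actual patient records.
records = list(range(909))
train_set, test_set = split_records(records)
print(len(train_set), len(test_set))

# Hypothetical cut points for resting blood pressure (mm Hg).
bp_cut_points = [120, 140, 160]
print(discretize(150, bp_cut_points))  # index of the bin holding 150
```

With 909 records this produces the 455/454 split reported above; fixing the seed keeps the random selection reproducible.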
Predictable attribute
1. Diagnosis (value 0: < 50% diameter narrowing (no heart disease); value 1: > 50% diameter narrowing (has heart disease))
Key attribute
1. PatientID – Patient’s identification number
Input attributes
1. Sex (value 1: Male; value 0: Female)
2. Chest Pain Type (value 1: typical angina; value 2: atypical angina; value 3: non-anginal pain; value 4: asymptomatic)
3. Fasting Blood Sugar (value 1: > 120 mg/dl; value 0: < 120 mg/dl)
4. Restecg – resting electrocardiographic results (value 0: normal; value 1: having ST-T wave abnormality; value 2: showing probable or definite left ventricular hypertrophy)
5. Exang – exercise-induced angina (value 1: yes; value 0: no)
6. Slope – the slope of the peak exercise ST segment (value 1: upsloping; value 2: flat; value 3: downsloping)
7. CA – number of major vessels colored by fluoroscopy (value 0 – 3)
8. Thal (value 3: normal; value 6: fixed defect; value 7: reversible defect)
9. Trest Blood Pressure (mm Hg on admission to the hospital)
10. Serum Cholesterol (mg/dl)
11. Thalach – maximum heart rate achieved
12. Oldpeak – ST depression induced by exercise relative to rest
13. Age in years
Figure 1. Description of attributes
5.2. Mining models
Data Mining Extension (DMX) query language was
used for model creation, model training, model prediction
and model content access. All parameters were set to the
default setting except for parameters “Minimum Support =
1” for Decision Tree and “Minimum Dependency
Probability = 0.005” for Naïve Bayes [10]. The trained
models were evaluated against the test datasets for
accuracy and effectiveness before they were deployed in
IHDPS. The models were validated using Lift Chart and
Classification Matrix.
5.3. Validating model effectiveness
The effectiveness of models was tested using two
methods: Lift Chart and Classification Matrix. The
purpose was to determine which model gave the highest
percentage of correct predictions for diagnosing patients
with a heart disease.
Lift Chart with predictable value. To determine if
there was sufficient information to learn patterns in
response to the predictable attribute, columns in the
trained model were mapped to columns in the test dataset.
The model, predictable column to chart against, and the
state of the column to predict patients with heart disease
(predict value = 1) were also selected. Figure 2 shows the
Lift Chart output. The X-axis shows the percentage of the
test dataset used to compare predictions while the Y-axis
shows the percentage of values predicted to the specified
state. The blue and green lines show the results for
random-guess and ideal model respectively. The purple,
yellow and red lines show the results of Neural Network,
Naïve Bayes and Decision Tree models respectively.
The top green line shows the ideal model; it captured
100% of the target population for patients with heart
disease using 46% of the test dataset. The bottom blue line
shows the random line which is always a 45-degree line
across the chart. It shows that if we randomly guess the
result for each case, 50% of the target population would be
captured using 50% of the test dataset. All three model
lines (purple, yellow and red) fall between the random-
guess and ideal model lines, showing that all three have
sufficient information to learn patterns in response to the
predictable state.
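The ideal-model and random-guess lines can be reproduced with a short cumulative-gains calculation. The sketch below assumes the test-set composition reported in the Classification Matrix discussion (208 patients with heart disease, 246 without); the function name and the scoring of the ideal model are illustrative assumptions.

```python
def cumulative_gains(scores, actuals, target=1):
    """Return (fraction of cases processed, fraction of targets captured) points."""
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    total_targets = sum(1 for _, actual in ranked if actual == target)
    points, captured = [], 0
    for i, (_, actual) in enumerate(ranked, start=1):
        captured += actual == target
        points.append((i / len(ranked), captured / total_targets))
    return points

# Test-set composition: 208 with heart disease (1), 246 without (0).
actuals = [1] * 208 + [0] * 246

# An ideal model ranks every patient with heart disease ahead of the rest.
ideal_scores = [float(a) for a in actuals]
ideal = cumulative_gains(ideal_scores, actuals)

# Share of the test dataset needed before all targets are captured.
first_full = next(x for x, y in ideal if y == 1.0)
print(round(first_full * 100, 1))
```

Under these assumptions the ideal line reaches 100% after 208 of 454 cases, which is consistent with the roughly 46% quoted above. A random ordering would instead capture targets in proportion to the cases processed, giving the 45-degree line.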
Lift Chart with no predictable value. The steps for
producing Lift Chart are similar to the above except that
the state of the predictable column is left blank. It does not
include a line for the random-guess model. It tells how
well each model fared at predicting the correct number of
the predictable attribute. Figure 3 shows the Lift Chart
output. The X-axis shows the percentage of test dataset
used to compare predictions while the Y-axis shows the
percentage of predictions that are correct. The blue, purple,
green and red lines show the ideal, Neural Network, Naïve
Bayes and Decision Trees models respectively. The chart
shows the performance of the models across all possible
states. The ideal model line (blue) is at a 45-degree angle,
showing that if 50% of the test dataset is processed, 50%
of the test dataset is predicted correctly.
Figure 2. Result of Lift Chart with predictable value
Figure 3. Result of Lift Chart without predictable value
The chart shows that if 50% of the population is
processed, Neural Network gives the highest percentage of
correct predictions (49.34%) followed by Naïve Bayes
(47.58%) and Decision Trees (41.85%). If the entire
population is processed, Naïve Bayes model appears to
perform better than the other two as it gives the highest
number of correct predictions (86.12%) followed by
Neural Network (85.68%) and Decision Trees (80.4%).
When less than 50% of the population is processed, the
Lift lines for Neural Network and Naïve Bayes are always
higher than that for Decision Trees, indicating that Neural
Network and Naïve Bayes give a higher percentage of
correct predictions than Decision Trees. Along the X-axis
the Lift lines for Neural Network and Naïve Bayes overlap,
indicating that both models are equally good at predicting
correctly. When more than 50% of the population is
processed, Neural Network and Naïve Bayes again appear
to perform better, as they give a higher percentage of
correct predictions than Decision Trees: the Lift line for
Decision Trees is always below those of Neural Network
and Naïve Bayes. For some population ranges, Neural
Network appears to fare better than Naïve Bayes and
vice versa.
Classification Matrix. Classification Matrix displays
the frequency of correct and incorrect predictions. It
compares the actual values in the test dataset with the
predicted values in the trained model. In this example, the
test dataset contained 208 patients with heart disease and
246 patients without heart disease. Figure 4 shows the
results of the Classification Matrix for all the three models.
The rows represent predicted values while the columns
represent actual values (“1” for patients with heart disease,
“0” for patients with no heart disease). The left-most
columns show values predicted by the models. The
diagonal values show correct predictions.
Figure 4. Results of Classification Matrix for all the three
models
Figure 5 summarizes the results of all three models.
Naïve Bayes appears to be most effective as it has the
highest percentage of correct predictions (86.53%) for
patients with heart disease, followed by Neural Network
(with a difference of less than 1%) and Decision Trees.
Decision Trees, however, appears to be most effective for
predicting patients with no heart disease (89%) compared
to the other two models.
Model Type       Prediction Attributes   No. of cases   Prediction
Decision Tree    +WHD, +PHD              146            Correct
                 –WHD, +PHD              27             Incorrect
                 –WHD, –PHD              219            Correct
                 +WHD, –PHD              62             Incorrect
Naïve Bayes      +WHD, +PHD              180            Correct
                 –WHD, +PHD              35             Incorrect
                 –WHD, –PHD              211            Correct
                 +WHD, –PHD              28             Incorrect
Neural Network   +WHD, +PHD              178            Correct
                 –WHD, +PHD              35             Incorrect
                 –WHD, –PHD              211            Correct
                 +WHD, –PHD              30             Incorrect
Legend
+WHD: Patients with heart disease
–WHD: Patients with no heart disease
+PHD: Patients predicted as having heart disease
–PHD: Patients predicted as having no heart disease
Figure 5. Model results
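The percentages quoted in the text can be recomputed from the counts in Figure 5. The sketch below is a plain Python check; the names "sensitivity" and "specificity" are standard terms introduced here for the share of correct predictions among patients with and without heart disease, and do not appear in the original.

```python
# Counts from Figure 5 as (TP, FP, TN, FN), where
# TP = +WHD,+PHD; FP = -WHD,+PHD; TN = -WHD,-PHD; FN = +WHD,-PHD.
counts = {
    "Decision Tree": (146, 27, 219, 62),
    "Naive Bayes": (180, 35, 211, 28),
    "Neural Network": (178, 35, 211, 30),
}

def summarize(tp, fp, tn, fn):
    """Overall accuracy plus per-class rates from classification matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,     # correct predictions overall
        "sensitivity": tp / (tp + fn),     # correct among patients with heart disease
        "specificity": tn / (tn + fp),     # correct among patients without heart disease
    }

for model_name, model_counts in counts.items():
    stats = summarize(*model_counts)
    print(model_name, {name: f"{value:.2%}" for name, value in stats.items()})
```

Under this reading of the table, the recomputed values agree with the reported figures up to rounding: about 86.5% of patients with heart disease are correctly predicted by Naïve Bayes, about 89% of patients without heart disease by Decision Trees, and the overall accuracies match the 86.12%, 85.68% and 80.4% read from the Lift Chart.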
5.4. Evaluation of Mining Goals
Five mining goals were defined based on exploration
of the heart disease dataset and objectives of this research.
They were evaluated against the trained models. Results
show that all three models had achieved the stated goals,
suggesting that they could be used to provide decision
support to doctors for diagnosing patients and discovering
medical factors associated with heart disease. The goals
are as follows:
Goal 1: Given patients’ medical profiles, predict those
who are likely to be diagnosed with heart disease. All
three models were able to answer this question using a
singleton query and a batch (prediction join) query. The
singleton query predicts on a single input case, while the
batch query predicts on multiple input cases. IHDPS
supports prediction using “what if” scenarios. Users enter
values of medical attributes to diagnose patients with heart
disease. For
example, entering values Age = 70, CA = 2, Chest Pain
Type = 4, Sex = M, Slope = 2 and Thal = 3 into the
models, would produce the output in Figure 6. All three
models showed that this patient has a heart disease. Naïve
Bayes gives the highest probability (95%) with 432
supporting cases, followed closely by Decision Tree
(94.93%) with 106 supporting cases and Neural Network
(93.54%) with 298 supporting cases. As these values are
high, doctors could recommend that the patient should
undergo further heart examination. Thus performing “what
if” scenarios can help prevent a potential heart attack.
Goal 2: Identify the significant influences and
relationships in the medical inputs associated with the
predictable state – heart disease. The Dependency viewer
in Decision Trees and Naïve Bayes models shows the
results from the most significant to the least significant
(weakest) medical predictors. The viewer is especially
useful when there are many predictable attributes. Figures
7 and 8 show that in both models, the most significant
factor influencing heart disease is “Chest Pain Type”.
Other significant factors include Thal, CA and Exang.
Decision Trees model shows “Trest Blood Pressure” as the
weakest factor while Naïve Bayes model shows “Fasting
Blood Sugar” as the weakest factor. Naïve Bayes appears
to fare better than Decision Trees as it shows the
significance of all input attributes. Doctors can use this
information to further analyze the strengths and
weaknesses of the medical attributes associated with heart
disease.
Goal 3: Identify the impact and relationship between
the medical attributes in relation to the predictable state –
heart disease. Identifying the impact and relationship
between the medical attributes in relation to heart disease
is only found in Decision Trees viewer (Figure 9). It gives
a high probability (99.61%) that patients with heart
disease are found in the relationship between the attributes
(nodes): “Chest Pain Type = 4 and CA = 0 and Exang = 0
and Trest Blood Pressure >= 146.362 and < 158.036.”
Doctors can use this information to perform medical
screening on these four attributes instead of on all
attributes on patients who are likely to be diagnosed with
heart disease. This will reduce medical expenses,
administrative costs, and diagnosis time. Information on
least impact (5.88%) is found in the relationship between
the attributes: “Chest Pain Type not = 4 and Sex = F”.
Also given is the relationship between attributes for
patients with no heart disease. Results show that the
relationship between the attributes: “Chest Pain Type not
= 4 and Sex = F” has the highest impact (92.58%). The
least impact (0.2%) is found in the attributes: “Chest Pain
Type = 4 and CA = 0 and Exang = 0 and Trest Blood
Pressure >= 146.362 and < 158.036”. Additional
information, such as patients’ medical profiles based on
selected nodes, can also be obtained by using the
drill-through function. Doctors can use the Decision Tree
viewer to perform further analysis.
Figure 6. Output for singleton query module
Figure 7. Decision Trees dependency network
Figure 8. Dependency network for Naïve Bayes
Figure 9. Decision Trees Viewer
Goal 4: Identify characteristics of patients with heart
disease. Only Naïve Bayes model identifies the
characteristics of patients with heart disease. It shows the
probability of each input attribute for the predictable state.
Figure 10 shows that 80% of the heart disease patients are
males (Sex = 1), of whom 43% are between ages 56 and 63.
Other significant characteristics are: a high probability of
a fasting blood sugar reading of less than 120 mg/dl,
asymptomatic chest pain type, flat slope of peak exercise,
etc.
Figure 11 shows the characteristics of patients with no
heart disease: a high probability of a fasting blood sugar
reading of less than 120 mg/dl, no exercise-induced angina,
zero major vessels, etc. These results can be further
analyzed.
Figure 10. Naïve Bayes Attribute Characteristics Viewer
in descending order for patients with heart disease
Figure 11. Naïve Bayes Attribute Characteristic Viewer in
descending order for patients with no heart disease
Goal 5: Determine the attribute values that
differentiate nodes favoring and disfavoring the
predictable states: (1) patients with heart disease (2)
patients with no heart disease. This query can be
answered by analyzing the results of attribute
discrimination viewer of Naïve Bayes and Neural Network
models. The viewer provides information on the impact of
all attribute values that relate to the predictable state.
Naïve Bayes model (Figure 12) shows the most important
attribute favoring patients with heart disease: “Chest Pain
Type = 4” with 158 cases and 56 patients with no heart
disease. The input attributes “Thal = 7” with 123 (75.00%)
patients, “Exang = 1” with 112 (73.68%) patients, “Slope
= 2” with 138 (66.34%) patients, etc. also favor the
predictable state. In contrast, the attributes “Thal = 3” with
195 (73.86%) patients, “CA = 0” with 198 (73.06%) patients,
“Exang = 0” with 206 (67.98%) patients, etc. favor the
predictable state for patients with no heart disease.
Figure 12. A Tornado Chart for Attribute Discrimination
Viewer in descending order for Naïve Bayes
Neural Network model (Figure 13) shows that the most
important attribute value that favors patients with heart
disease is “Old peak = 3.05 – 3.81” (98%). Other
attributes that favor heart disease include “Old peak >=
3.81”, “CA = 2”, “CA = 3”, etc. In contrast, attributes like
“Serum Cholesterol >= 382.37”, “Chest Pain Type = 2”,
“CA = 0”, etc. favor the predictable state for patients with
no heart disease.
Figure 13. Attribute Discrimination Viewer in descending
order for Neural Network
6. Benefits and limitations
IHDPS can serve as a training tool to train nurses and
medical students to diagnose patients with heart disease. It
can also provide decision support to assist doctors to make
better clinical decisions or at least provide a “second
opinion.”
The current version of IHDPS is based on the 15
attributes listed in Figure 1. This list may need to be
expanded to provide a more comprehensive diagnosis
system. Another limitation is that it only uses categorical
data. For some diagnoses, the use of continuous data may
be necessary. A further limitation is that it only uses three
data mining techniques. Additional data mining techniques
can be incorporated to provide better diagnosis. The size
of the dataset used in this research is still quite small. A
larger dataset would likely give better results. It is also
necessary to test the system extensively with input from
doctors, especially cardiologists, before it can be deployed
in hospitals. [Access to the system is currently restricted
to stakeholders.]
7. Conclusion
A prototype heart disease prediction system is
developed using three data mining classification modeling
techniques. The system extracts hidden knowledge from a
historical heart disease database. DMX query language
and functions are used to build and access the models. The
models are trained and validated against a test dataset. Lift
Chart and Classification Matrix methods are used to
evaluate the effectiveness of the models. All three models
are able to extract patterns in response to the predictable
state. The most effective model to predict patients with
heart disease appears to be Naïve Bayes followed by
Neural Network and Decision Trees.
Five mining goals are defined based on business
intelligence and data exploration. The goals are evaluated
against the trained models. All three models could answer
complex queries, each with its own strength with respect
to ease of model interpretation, access to detailed
information and accuracy. Naïve Bayes could answer four
out of the five goals; Decision Trees, three; and Neural
Network, two. Although not the most effective model,
Decision Trees results are easier to read and interpret. The
drill through feature to access detailed patients’ profiles is
only available in Decision Trees. Naïve Bayes fared better
than Decision Trees as it could identify all the significant
medical predictors. The relationship between attributes
produced by Neural Network is more difficult to
understand.
IHDPS can be further enhanced and expanded. For
example, it can incorporate other medical attributes
besides the 15 listed in Figure 1. It can also incorporate
other data mining techniques, e.g., Time Series, Clustering
and Association Rules. Continuous data can also be used
instead of just categorical data. Another area is to use Text
Mining to mine the vast amount of unstructured data
available in healthcare databases. Another challenge
would be to integrate data mining and text mining [16].
8. References
[1] Blake, C.L., Merz, C.J.: “UCI Machine Learning Databases”,
http://mlearn.ics.uci.edu/databases/heart-disease/, 2004.
[2] Chapman, P., Clinton, J., Kerber, R. Khabeza, T., Reinartz,
T., Shearer, C., Wirth, R.: “CRISP-DM 1.0: Step by step data
mining guide”, SPSS, 1-78, 2000.
[3] Charly, K.: “Data Mining for the Enterprise”, 31st Annual
Hawaii Int. Conf. on System Sciences, IEEE Computer, 7,
295-304, 1998.
[4] Fayyad, U.: “Data Mining and Knowledge Discovery in
Databases: Implications for scientific databases”, Proc. of the
9th Int. Conf. on Scientific and Statistical Database
Management, Olympia, Washington, USA, 2-11, 1997.
[5] Giudici, P.: “Applied Data Mining: Statistical Methods for
Business and Industry”, New York: John Wiley, 2003.
[6] Han, J., Kamber, M.: “Data Mining Concepts and
Techniques”, Morgan Kaufmann Publishers, 2006.
[7] Ho, T. J.: “Data Mining and Data Warehousing”, Prentice
Hall, 2005.
[8] Kaur, H., Wasan, S. K.: “Empirical Study on Applications of
Data Mining Techniques in Healthcare”, Journal of
Computer Science 2(2), 194-200, 2006.
[9] Mehmed, K.: “Data mining: Concepts, Models, Methods and
Algorithms”, New Jersey: John Wiley, 2003.
[10] Mohd, H., Mohamed, S. H. S.: “Acceptance Model of
Electronic Medical Record”, Journal of Advancing
Information and Management Studies. 2(1), 75-92, 2005.
[11] Microsoft Developer Network (MSDN).
http://msdn2.microsoft.com/en-us/virtuallabs/aa740409.aspx,
2007.
[12] Obenshain, M.K: “Application of Data Mining Techniques
to Healthcare Data”, Infection Control and Hospital
Epidemiology, 25(8), 690–695, 2004.
[13] Sellappan, P., Chua, S.L.: “Model-based Healthcare
Decision Support System”, Proc. Of Int. Conf. on
Information Technology in Asia CITA’05, 45-50, Kuching,
Sarawak, Malaysia, 2005
[14] Tang, Z. H., MacLennan, J.: “Data Mining with SQL Server
2005”, Indianapolis: Wiley, 2005.
[15] Thuraisingham, B.: “A Primer for Understanding and
Applying Data Mining”, IT Professional, 28-31, 2000.
[16] Weiguo, F., Wallace, L., Rich, S., Zhongju, Z.: “Tapping
the Power of Text Mining”, Communication of the ACM.
49(9), 77-82, 2006.
[17] Wu, R., Peters, W., Morgan, M.W.: “The Next Generation
Clinical Decision Support: Linking Evidence to Best
Practice”, Journal Healthcare Information Management.
16(4), 50-55, 2002.