Data mining in ECG data

Authors:
  • University of West Attica (formerly Piraeus University of Applied Sciences - Technological Educational Institute of Piraeus)
G. Nikolaou#1, S. Vasileiadou#2, C. Alafodimos#3, D. Tseles#4
# Piraeus University of Applied Sciences, Department of Automation Engineering
P.Ralli & Thivon 250, 122 44 Egaleo, Athens, Greece
1 nikolaou@teipir.gr 2 svasil@teipir.gr 3 calafod@teipir.gr 4 dtsel@teipir.gr
Abstract: Electrocardiography (ECG) is a data source that can contain valuable knowledge about
body function and can reveal information about several diseases. It has been used over the years to
diagnose cardiovascular diseases and to prognose upcoming abnormalities. Along with the
evolution of data processing methods, data mining algorithms have been applied to ECG data over
the last three decades. This work surveys this area of research with the aim of tracing its advances
through time and presenting the current research trend. The area of data mining is first presented
and a brief introduction to the main algorithms is given. The focus of this work is placed on the
different frameworks and models used in the area, namely signal processing approaches,
morphology-based processing and data mining. The literature shows a trend towards real-time
diagnosis based on portable devices. Technological evolution provides portable computational
power that can be used for real-time data processing, and a number of research studies on this topic
are presented. This work concludes with some remarks on this area and its future evolution.
Keywords: Medical Data Mining, ECG Data, Arrhythmia, Real-time diagnosis
I. INTRODUCTION
The twentieth century has been characterized as the century of information. Large amounts of data
were collected and information was exchanged in many domains. Information technology evolved
explosively, developing a number of methods to handle and process these data. Knowledge
Discovery in Databases (KDD), Machine Learning (ML), Pattern Recognition (PR) and Data Mining
(DM) are only some of the emerging methods developed for data processing and knowledge
extraction that can be used for automated decision making. As in every other field, these methods
were tested on medical data, in cases where diagnosis and prognosis can have a major effect on
human health. This work presents this field of research through a number of applications that use
different algorithms and different medical data models. The evolution of applications is traced,
concluding with the latest trend, which is real-time diagnosis through portable devices.
II. MEDICAL DATA
Medical data are gathered every day in large batches in hospitals and medical centers around
the world. They hide patterns and correlations with human health that are not always evident
at the time of collection. Since the early 1990s, along with the evolution of computers and
computer science, methods and algorithms have been developed to study the underlying patterns and
understand the human health system. Initially, statistical methods were used, followed by more
sophisticated data mining and machine learning algorithms. The introduction of data mining
brought a number of characteristics to data analysis that made those methods attractive.
Yoo et al. carry out a survey of the medical literature in which they define the four main
characteristics that differentiate data mining from statistics and that explain the dominance of
such methods in medical applications [1]. First, the use of heuristics in data mining allows
real-world problems to be resolved more effectively. Second, data mining algorithms can handle
large data sets more efficiently, while statistics aims to work with data samples of the
population. Third, statistics uses purely numeric data while data mining methods can use a
combination of categorical and numeric data, which is common in medical databases. Finally,
statistics is a deductive method while data mining draws inductive conclusions, meaning that data
mining can generalize from specific cases. The following part presents an overview of data
mining.
III. DATA MINING ALGORITHMIC OVERVIEW
Data mining has been given several definitions since it emerged in the 1990s. A commonly accepted
one has been given by Hand et al. and states that “data mining is the analysis of observational data
sets to find unsuspected relationships and to summarize the data in novel ways that are both
understandable and useful to the data owner” [2]. Data mining, knowledge discovery and machine
learning are interrelated areas, and algorithms from one category can fall into another.
The two main types of algorithms used in data mining are classification and clustering. In
classification, the algorithm trains a data model that can be used to assign any new data item to
one of the classes. The data classes are predefined and the data have to be annotated for the
algorithm to work. In other words, some initial knowledge, usually an expert’s domain knowledge, is
required to preprocess the data set. In clustering methods the algorithm receives a batch of data
that are not annotated and aims to find commonalities that allow the data to be grouped into
clusters. Medical data, and especially arrhythmia data for model training, are annotated with the
corresponding heart disease (class), so classification methods are commonly used. Some of the most
commonly used algorithm types are briefly presented in this part.
Bayesian classifier: This is a probabilistic classifier based on Bayes’ theorem. The classifier
relies on the assumption of conditional independence among features, which reduces the
computational complexity of the probability calculations. Its simplicity makes it one of the most
widely used classifiers; however, it requires a very strong assumption that rarely holds in
real-world applications.
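The independence assumption can be illustrated with a minimal Gaussian naive Bayes sketch. The Gaussian likelihood, the function names and the toy beats (RR interval and QRS duration in seconds) are assumptions made for illustration only; they are not taken from any of the surveyed studies.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels):
    """Estimate the prior and the per-feature mean/variance of every class."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[y] = (n / len(samples), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log-posterior for feature vector x."""
    def log_posterior(prior, means, variances):
        score = math.log(prior)
        # Conditional independence: the per-feature log-likelihoods simply add up.
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=lambda y: log_posterior(*model[y]))

# Hypothetical toy beats: (RR interval in s, QRS duration in s).
beats = [(0.80, 0.08), (0.82, 0.09), (0.78, 0.08),
         (0.45, 0.14), (0.50, 0.15), (0.48, 0.13)]
beat_labels = ["normal", "normal", "normal", "PVC", "PVC", "PVC"]
nb_model = fit_gaussian_nb(beats, beat_labels)
print(predict_gaussian_nb(nb_model, (0.79, 0.085)))
```

Because each feature contributes an independent term, training only needs one pass over the data per class, which is where the low computational cost comes from.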
Decision Trees: A decision tree classifier is a flow-chart-like structure that, based on some data
features, splits into branches which end in different classes. The choice of splitting points and
features depends on the algorithm and determines its efficiency. ID3 (Iterative Dichotomiser 3) is
one of the earliest decision tree algorithms and C4.5 is one of the most widely used. Reasons for
this popularity are the ease of understanding a decision tree due to its visual structure and its
very low computational cost, both when the tree is built and when it is used. This is a significant
advantage for real-time applications.
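How a splitting point can be chosen is sketched below using information gain, the criterion used by ID3. Searching thresholds on a numeric attribute follows the spirit of C4.5, but the gain ratio and pruning of C4.5 are omitted. The function names and the toy QRS durations are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on value <= threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    weighted = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

def best_threshold(values, labels):
    """Try the midpoints between sorted distinct values and keep the best one."""
    distinct = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(midpoints,
               key=lambda t: information_gain(values, labels, t),
               default=None)

# Hypothetical QRS durations in seconds with their annotated classes.
qrs_durations = [0.08, 0.09, 0.07, 0.14, 0.15, 0.13]
qrs_labels = ["normal", "normal", "normal", "abnormal", "abnormal", "abnormal"]
split = best_threshold(qrs_durations, qrs_labels)
print(split, information_gain(qrs_durations, qrs_labels, split))
```

A full tree would apply the same search recursively to each branch until the leaves are pure or a stopping rule is met.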
Support Vector Machines (SVM): The SVM is a classification algorithm reported in 1992 by Boser,
Guyon and Vapnik [3] and is one of the most popular classification algorithms. The SVM constructs a
hyperplane that separates the data points of different classes while maximizing its distance from
the closest elements of each class. In that way the separability between the class elements is
maximized, which is used as an indication of a good classifier. The SVM can generate a very good
classifier but it can be computationally expensive with large data sets, since training involves
solving a quadratic programming problem whose size grows with the number of training samples.
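For reference, the maximum-margin idea can be written as the standard hard-margin optimization problem for linearly separable data. The notation is an assumption introduced here: training points are x_i with labels y_i in {-1, +1}, the hyperplane is defined by w and b, and n is the number of training points.

```latex
\begin{aligned}
\min_{\mathbf{w},\, b} \quad & \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^{2} \\
\text{subject to} \quad & y_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) \ge 1,
\qquad i = 1, \dots, n
\end{aligned}
```

The distance between the two margin hyperplanes is 2 / ||w||, so minimizing ||w|| maximizes the margin. The problem has one constraint per training point, which is why its cost grows with the size of the data set.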
Nearest Neighbor (NN): The NN classifier bases its result on the neighboring elements of the
testing element. The algorithm computes the distance between the testing element and every training
element, and uses the classes of the closest elements to decide in which class the testing element
should be placed. The number k (the number of neighbor elements that take part in the decision) is
an important parameter of the algorithm, since the smaller the k, the more sensitive the algorithm
is to noise. A large k, on the other hand, makes the algorithm more computationally expensive. The
NN algorithm is very efficient on low-dimensional problems, but it suffers from the curse of
dimensionality and can become very computationally expensive in higher dimensions.
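A minimal k-NN sketch with Euclidean distance and majority voting is given below. The function name and the toy training beats are hypothetical, and in practice the features would usually be scaled first so that no single feature dominates the distance.

```python
import math
from collections import Counter

def knn_classify(train, labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    # The distance to every training element is computed, which is why the
    # cost grows with both the data set size and the number of features.
    distances = sorted((math.dist(x, query), y) for x, y in zip(train, labels))
    votes = Counter(y for _, y in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy beats: (RR interval in s, QRS duration in s).
knn_train = [(0.80, 0.08), (0.82, 0.09), (0.78, 0.08),
             (0.45, 0.14), (0.50, 0.15), (0.48, 0.13)]
knn_labels = ["normal", "normal", "normal", "PVC", "PVC", "PVC"]
print(knn_classify(knn_train, knn_labels, (0.79, 0.085), k=3))
```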
Every classifier has different characteristics, making each method attractive for different types
of applications. In medical data and arrhythmia detection, many different classifiers have been
used with very good results. In the following part the ECG signal is presented along with the main
arrhythmia types used for prognosis and diagnosis.
IV. ECG DATA
Electrocardiographic (ECG) data are used for the diagnosis of heart diseases but also for the
prediction of upcoming heart disorders. An ECG signal is what is called the P-QRS-T wave and
represents the cardiac function. The signal is composed of three main parts: the P wave, the QRS
complex and the T wave. Time intervals and durations are some of the ECG features used to
classify the respective heartbeat. For instance, the QRS duration is normally between 0.06 and 0.1
s, so a duration outside this range indicates an abnormality. The structure of the ECG, showing the
respective P, Q, R, S and T points, is presented in Figure 1 [5]. The heart rate signal obtained
from the ECG data is non-stationary, meaning that it varies through time and abnormalities do not
appear constantly [4]. The underlying pattern of a heart disease is not always straightforward. The
features of an ECG signal and the morphology of the heartbeat can be used to identify arrhythmia
(abnormal heart rhythm) cases and to diagnose heart diseases or prognose upcoming malfunctions.
Figure 1. The structure of an ECG signal
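The interval-based features mentioned above can be sketched as follows, assuming that R-peak times and QRS durations have already been extracted from the signal. The function names are hypothetical; the default 0.06 to 0.1 s range is the one quoted in the text.

```python
def rr_intervals(r_peak_times):
    """Time differences (in seconds) between consecutive R peaks."""
    return [b - a for a, b in zip(r_peak_times, r_peak_times[1:])]

def heart_rate_bpm(rr):
    """Mean heart rate in beats per minute from a list of RR intervals."""
    return 60.0 / (sum(rr) / len(rr))

def qrs_duration_is_typical(duration_s, low=0.06, high=0.10):
    """True when the QRS duration lies inside the normal range quoted above."""
    return low <= duration_s <= high

# Hypothetical R-peak times in seconds.
rr = rr_intervals([0.0, 0.8, 1.6, 2.4])
print(heart_rate_bpm(rr), qrs_duration_is_typical(0.14))
```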
A number of conditions can be identified that can trigger the treatment of the patient. There are
several heartbeat classes, and they vary from study to study. Some of the most commonly used are
ventricular tachycardia (VT), ventricular fibrillation (VF), atrial premature contraction (APC),
premature ventricular contraction (PVC) and supraventricular tachycardia (SVT). The chosen number
and types of classes are a good indication of which data mining method should be used, and can
affect its performance.
V. DATA MINING ALGORITHMS IN ARRHYTHMIA
Data mining and machine learning techniques for heartbeat classification and arrhythmia
diagnostics are an area of ongoing research that has shown promising results over the last two
decades. The literature presents several data mining algorithms, focusing on classification
techniques. The studies vary in methodology, number of classes, data features used and algorithm
evaluation. These criteria make a significant difference to the final result and the type of
application. The number of classes has grown from 5 in the studies of the early 1990s to 17 in the
latest reports. The evaluation criteria have varied over time and according to the application.
Accuracy, sensitivity and specificity are some of the main evaluation criteria.
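For a single class treated as "positive" against all others, these three criteria follow directly from the confusion counts. The sketch below uses the standard definitions; the function name and the counts are made up for illustration.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        # Share of all beats that were labelled correctly.
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        # Share of truly abnormal beats that were detected.
        "sensitivity": tp / (tp + fn),
        # Share of truly normal beats that were left alone.
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts for one arrhythmia class versus the rest.
metrics = binary_metrics(tp=90, fp=20, tn=880, fn=10)
print(metrics)
```

With many classes, the same counts are usually computed per class, which is one reason why results from studies with different numbers of classes are hard to compare directly.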
Some applications focus more on the signal processing that precedes the diagnostic stage. Cantzos
et al. present an algorithm for processing ECG recordings for diagnostic purposes [6]. The proposed
scheme is composed of two stages: first abnormality detection and then incident classification. The
algorithm detects the abnormal heartbeat segment using a recursive stochastic time-series
algorithm, and the rest of the signal is discarded. The abnormal heartbeat segment is then
classified as normal, supraventricular, ventricular or fusion. The presented results show behavior
comparable to other established algorithms. The advantages of this method are the less complex
signal and features used, and 50% less data. Ge et al. propose an autoregressive model to classify
normal sinus rhythms and cardiac arrhythmias. The proposed model shows a high detection accuracy,
varying from 93.2% to 100% [7].
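The idea behind autoregressive (AR) modeling is that each sample is approximated by a weighted sum of the previous ones, and the fitted weights serve as compact features for a classifier. The sketch below fits a second-order model by least squares. The model order, the fitting method and the function name are assumptions made for illustration; they are not the configuration reported in [7].

```python
import math

def fit_ar2(signal):
    """Least-squares fit of x[t] ~ a1 * x[t-1] + a2 * x[t-2]."""
    s11 = s12 = s22 = b1 = b2 = 0.0
    for t in range(2, len(signal)):
        x1, x2, y = signal[t - 1], signal[t - 2], signal[t]
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        b1 += x1 * y
        b2 += x2 * y
    # Solve the 2x2 normal equations with Cramer's rule.
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("signal is too short or degenerate for an AR(2) fit")
    a1 = (b1 * s22 - b2 * s12) / det
    a2 = (b2 * s11 - b1 * s12) / det
    return a1, a2

# A noiseless sinusoid obeys x[t] = 2*cos(w)*x[t-1] - x[t-2] exactly,
# which makes it a convenient sanity check for the fit.
w = 2 * math.pi / 20
a1, a2 = fit_ar2([math.cos(w * t) for t in range(200)])
print(a1, a2)
```

In a classification setting, the coefficients fitted on each heartbeat segment would form the feature vector passed to the classifier.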
Other approaches focus on analyzing the morphology of the heartbeat signal. This is because some
arrhythmias cannot be identified through heartbeat signal analysis or PQRST feature extraction.
For morphological arrhythmias it is more appropriate to study and compare the morphology of the
heartbeat to identify the abnormality [8]. Karimifard et al. present a study in which they develop
a heartbeat model based on Hermitian basis functions. The model parameters are then used as a
feature vector for a k-NN classifier to assess the accuracy of the model. The classifier uses seven
classes and achieves a sensitivity of 99% and a specificity of 99.84%. Some model signals are
presented in Figure 2 [8].
Figure 2. Heart beat and R beat modeling
A combination of morphological and dynamic features for arrhythmia classification is proposed by
Ye et al. [9]. In their work they use a combination of tools for feature selection and, finally,
heartbeat classification. First, the signal undergoes wavelet analysis to extract 136 morphological
features. Principal Component Analysis (PCA) is then used to reduce the number of features to
26. The 26 morphological features and 4 dynamic features are finally fed to an SVM classifier
that distinguishes 14 classes.
The medical literature also contains a number of studies based purely on data mining, without
processing the heartbeat signal or analyzing its morphological features. Guvenir et al. developed a
supervised learning algorithm (VFI5) using inductive learning based on expert-annotated data [10].
They compare the VFI5 algorithm to naïve Bayes and k-nearest neighbor algorithms, both of which it
outperforms. Tsipouras et al. propose a data mining approach and implement a Support Vector
Machine (SVM) algorithm on medical data to classify arrhythmic beats [11]. They use the BOXCQP
algorithm [12] to solve the quadratic programming problem formed by the SVM. Their work is based
on two constraints, namely the use of RR interval signals and of four classes. By sampling the ECG
data and using only RR signals, the proposed method is claimed to be faster and less affected by
the presence of noise. The classes used are VF and PVC, as described above, normal sinus rhythm
(NSR) and 2° heart block. The NSR class describes both normal heartbeats and heartbeats that do not
fall into any of the other categories but could still be problematic. The SVM algorithm performs
with very high accuracy, with the highest and lowest misclassification rates among the four classes
being 9.77% and 1% respectively. The experiments are carried out on the whole MIT-BIH arrhythmia
database, composed of 109,880 data items.
Tang et al. present a classification study on medical data from patients with coronary heart
disease [13]. They use a data set with 1723 cases and 71 attributes to compare the performance of
decision-tree-based algorithms. Additionally, they use a system reconstruction method to first
weight the data and study the effect on the generated decision tree. They anticipate that weighted
data can improve the tree’s correction rate and give better results. In the presented experiments
they test the induction tree algorithms ID3 and C4.5, the classification and regression tree
(CART), the Chi-square automatic interaction detector (CHAID) and exhaustive CHAID. Results show
that weighted data have a better correction rate (see Figure 3a), without, however, improving
either the tree’s depth or the number of leaves (see Figure 3b) [13].
Figure 3. a) Correction rate comparison of decision trees and b) Decision-tree algorithm parameters comparison
Another k-NN algorithm is reported by Park et al. [5]. The algorithm is adapted using locally
weighted regression (LWR) to classify heartbeats according to their extracted features. The main
contribution of their work lies not in the algorithm per se but in the large number of classes.
They use 17 classes, since they analyse the whole ECG signal without partitioning it as the RR
interval methods do. The large number of classes minimizes the unclassified cases and makes
diagnostics more accurate.
VI. REAL-TIME ECG CLASSIFICATION
Initial studies on ECG data analysis and on the classification of heart diseases and heart rate
aimed to use medical data for diagnostics and disease prognosis based on a number of features. The
initial success of off-the-shelf algorithm implementations triggered more detailed studies focusing
on algorithmic optimization. Another aspect brought out in more recent studies is the processing
time. Data mining algorithms can be used for decision support in prognosis, but they can
potentially also be used for real-time diagnosis. This has two prerequisites: real-time data
collection and real-time data processing.
As early as the 1980s there were attempts in this direction that motivated later work.
Pan and Tompkins developed an algorithm for real-time QRS detection [14]. The algorithm is
designed to be implemented in a hospital environment to analyse patients’ data in real time. It is
implemented on a Z80 microprocessor, which offers very little processing power in comparison to
contemporary microprocessors. The algorithm filters the signal to obtain slope, amplitude and width
information, on which the detection is based. Back in the 1980s, data mining techniques were not
easy to use on portable microcontrollers due to their low processing capabilities. However,
technological evolution has since provided the tools for more sophisticated algorithms and portable
data processing.
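A heavily simplified sketch of this kind of processing chain is given below: a first difference captures slope, squaring emphasizes amplitude, and a moving-window average captures width. It is not the published algorithm of [14], which additionally uses band-pass filtering, a five-point derivative and adaptive thresholds. The fixed threshold, the refractory period and all names are assumptions made for illustration.

```python
def qrs_feature_signal(ecg, window):
    """Slope -> squaring -> moving-window average over `window` samples."""
    slope = [b - a for a, b in zip(ecg, ecg[1:])]
    squared = [s * s for s in slope]
    integrated = []
    running = 0.0
    for i, value in enumerate(squared):
        running += value
        if i >= window:
            # Drop the sample that has just left the window.
            running -= squared[i - window]
        integrated.append(running / window)
    return integrated

def detect_qrs(feature, threshold, refractory):
    """Indices of local maxima above a fixed threshold, at least
    `refractory` samples apart."""
    peaks = []
    last = -refractory
    for i in range(1, len(feature) - 1):
        is_peak = feature[i - 1] <= feature[i] > feature[i + 1]
        if is_peak and feature[i] >= threshold and i - last >= refractory:
            peaks.append(i)
            last = i
    return peaks
```

Every step needs only a few additions and multiplications per sample, which is consistent with the algorithm running on a low-power microprocessor. Note that the detected indices lag the true R peaks by a few samples because of the integration window.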
Several works have followed this direction, aiming for real-time results. Karimifard et al. focus
their work on achieving real-time data processing and achieve an algorithmic processing time of
about 0.56 seconds [8]. They compare this to the ECG beat time, which is longer, meaning that the
algorithm’s result will be available before the next heartbeat. Going one step further,
Rodriguez et al. present results from real-time ECG data classification on a Personal Digital
Assistant (PDA) [15]. This application aims to assess the ability of evolving PDA devices to run
the steps required to convert a portable device into a medical device. The proposed framework aims
to classify electrocardiogram beats and also the type of rhythm. Two questions arise in such an
application: whether the classifier can be accurate and whether it can deliver results in real
time. To answer them, the authors test a number of classifiers in the WEKA software to choose the
most accurate and fastest one. Four algorithms have comparably good performance, namely the C4.5
decision tree, the IB1 nearest neighbour algorithm, a neural network algorithm and the LogitBoost
algorithm based on a regression learning scheme. Among these, C4.5 combined a number of
characteristics (accuracy of 92.73%, adequacy of representation language, rule flexibility, speed
efficiency) that led to it being chosen for further testing. Moreover, achieving real-time
processing on a PDA was further investigated. The authors recommend that, to achieve real-time
classification, the processing thread needs to end before the signal acquisition obtains the next
package. Based on that strong constraint, they run experiments while varying the acquisition time.
Experiments are run on a PDA and a PC to compare performance. Results show that optimal results
are achieved for 1 and 2 sec processing time, generating a detection delay of 4.43 sec and 6.66 sec
for the PC and PDA respectively.
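The real-time constraint described above, that processing must finish before the next package arrives, can be checked with a simple timing wrapper. This is a generic sketch and not the framework of [15]; the function name, the stand-in classifier and the one-second acquisition period are hypothetical.

```python
import time

def meets_real_time_constraint(process, packet, acquisition_period_s):
    """Run `process` on one packet and report whether it finished in time."""
    start = time.perf_counter()
    result = process(packet)
    elapsed = time.perf_counter() - start
    # Real-time operation requires the processing to end before the
    # acquisition stage delivers the next packet.
    return result, elapsed, elapsed < acquisition_period_s

# Hypothetical stand-in for a classifier: it only sums the samples.
result, elapsed, in_time = meets_real_time_constraint(sum, [1, 2, 3], 1.0)
print(result, in_time)
```

On a constrained device the same check would be run over many packets, since a single slow packet is enough to make the processing fall behind the acquisition.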
With the development of portable processing capabilities through smartphones, real-time,
on-the-spot heart disease detection becomes realistic. Oresko et al. work in this direction and
present a framework for real-time cardiovascular disease detection using a wearable
smartphone-based platform [16].
VII. CONCLUSION
The use of data mining techniques in medical applications shows very promising results for the
early diagnosis of potential diseases. ECG data have been widely used to diagnose such
abnormalities as well as for prognosis. Diagnosis of life-threatening diseases requires real-time
data processing, in contrast to prognosis of heart diseases, which happens offline. This is an
important aspect of the studies in the literature and can guide the choice of models and
algorithms. Execution speed becomes a significant characteristic of a successful algorithm. The
literature shows that real-time, on-the-spot diagnosis is possible with newly introduced
technological equipment. This can have a significant effect on human healthcare and help prevent
cardiovascular diseases. Current technology allows for real-time data capturing devices at low
cost. Therefore, research should focus on using data mining algorithms to optimize results and on
finding the appropriate framework for optimal ECG data processing.
Acknowledgements
This research has been co-financed by the European Union (European Social Fund ESF) and
Greek national funds through the Operational Program “Education and Lifelong Learning” of the
National Strategic Reference Framework (NSRF) Research Funding Program: ARCHIMEDES
III Investing in knowledge society through the European Social Fund.
REFERENCES
[1] I. Yoo, P. Alafaireet, M. Marinov, K. Pena-Hernandez, R. Gopidi, J.-F. Chang, and L. Hua, “Data Mining in
Healthcare and Biomedicine: A Survey of the Literature,” J. Med. Syst., vol. 36, no. 4, pp. 2431–2448, 2012.
[2] D. Hand, H. Mannila, and P. Smyth, Principles of data mining. 2001.
[3] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A Training Algorithm for Optimal Margin Classifiers,” Proc. 5th
Annu. ACM Work. Comput. Learn. Theory, pp. 144–152, 1992.
[4] U. Rajendra Acharya, K. Paul Joseph, N. Kannathal, C. M. Lim, and J. S. Suri, “Heart rate variability: a
review,” Med. Biol. Eng. Comput., vol. 44, no. 12, pp. 1031–1051, 2006.
[5] J. Park, K. Lee, and K. Kang, “Arrhythmia Detection from Heartbeat Using k-nearest neighbor classifier,” in
Bioinformatics and Biomedicine (BIBM), IEEE International Conference on, 2013.
[6] D. Cantzos, D. Dimogianopoulos, and D. Tseles, “ECG Diagnosis via a Sequential Recursive Time Series
Wavelet Classification Scheme,” in EUROCON, IEEE, 2013, no. July, pp. 1770–1777.
[7] D. Ge, N. Srinivasan, and S. M. Krishnan, “Cardiac arrhythmia classification using autoregressive modeling,”
Biomed. Eng. Online, vol. 1, p. 5, 2002.
[8] S. Karimifard, A. Ahmadian, M. Khoshnevisan, and M. S. Nambakhsh, “Morphological heart arrhythmia
detection using Hermitian basis functions and kNN classifier,” Conf. Proc. IEEE Eng. Med. Biol. Soc., vol. 1,
no. 4, pp. 1367–70, 2006.
[9] C. Ye, M. T. Coimbra, and B. K. Vijaya Kumar, “Arrhythmia detection and classification using morphological
and dynamic features of ECG signals,” Conf. Proc. IEEE Eng. Med. Biol. Soc., vol. 2010, pp. 1918–1921,
2010.
[10] H. A. Guvenir, B. Acar, G. Demiroz, and A. Cekin, “A supervised machine learning algorithm for arrhythmia
analysis,” Comput. Cardiol. 1997, vol. 24, pp. 433–436, 1997.
[11] M. G. Tsipouras, C. Voglis, I. E. Lagaris, and D. I. Fotiadis, “Cardiac arrhythmia classification using support
vector machines,” in The 3rd European Medical and Biological Engineering Conference, 2005, pp. 2–7.
[12] C. Voglis and I. E. Lagaris, “BOXCQP: an Algorithm for Bound Constrained Convex Quadratic Problems,” in
International Conference from Scientific Computing to Computational Engineering, 2004, no. September, pp.
8–10.
[13] T. Tang and P. Wang, “A Comparative Study of Medical Data Classification Methods Based on Decision Tree
and System Reconstruction Analysis,” Ind. Eng. Manag. Syst., vol. 4, no. 1, pp. 102–108, 2005.
[14] J. Pan and W. J. Tompkins, “A real-time QRS detection algorithm,” IEEE Trans. Biomed. Eng., vol. 32, no. 3,
pp. 230–236, 1985.
[15] J. Rodríguez, A. Goñi, and A. Illarramendi, “Real-time classification of ECGs on a PDA,” IEEE Trans. Inf.
Technol. Biomed., vol. 9, no. 1, pp. 23–34, 2005.
[16] J. J. Oresko, H. Duschl, and A. C. Cheng, “A wearable smartphone-based platform for real-time cardiovascular
disease detection via electrocardiogram processing,” IEEE Trans. Inf. Technol. Biomed., vol. 14, no. 3, pp.
734–40, May 2010.