Imbalance Reduction Techniques Applied
to ECG Classification Problem
Jędrzej Kozal and Paweł Ksieniewicz
Department of Systems and Computer Networks,
Wroclaw University of Science and Technology, Wroclaw, Poland
pawel.ksieniewicz@pwr.edu.pl
Abstract. In this work we explored the potential for improving the performance of deep learning models by reducing dataset imbalance. For our experiments the highly imbalanced MIT-BIH ECG dataset was used. Multiple approaches were considered. First we introduced multiclass UMCE, an ensemble designed to deal with imbalanced datasets. Secondly, we studied the impact of applying oversampling techniques to the training set. smote without prior majority class undersampling was used as one of the methods. Another method was smote with noise introduced into the synthetic learning examples. The baseline for our study was a single ResNet network with undersampling of the training set. Multiclass UMCE proved superior to the baseline model, but failed to beat the results obtained by a single model with smote applied to the training set. Introducing perturbations to the signals generated by smote did not bring significant improvement. Future work may consider combining multiclass UMCE with smote.
Keywords: Machine learning · ECG classification · Imbalanced data
1 Introduction
When dealing with imbalanced datasets we are faced with a significant disparity in the number of learning examples from different classes. We call the class with the higher number of samples the majority class, and the class with the lower number of samples the minority class. The majority class can easily dominate the predictions made by models, which degrades the performance of machine learning algorithms. It is easy for a neural network to overfit an imbalanced dataset simply by adjusting the biases of the last layer to always point to the majority class. In this case the accuracy of the model can be as high as the proportion of the majority class to all learning examples.
There are many recent papers regarding ECG signal classification with deep neural networks [5,8,12,13]. Some of them introduce innovative ideas or report very high metric values. However, only a few of them mention the problem of dataset imbalance. In this work we argue that, as in many medical problems, the imbalance of the ECG dataset is a key factor, and addressing it can bring significant improvement to model performance.
© Springer Nature Switzerland AG 2019
H. Yin et al. (Eds.): IDEAL 2019, LNCS 11872, pp. 323–331, 2019.
https://doi.org/10.1007/978-3-030-33617-2_33
One of the basic methods of dealing with imbalanced data is undersampling, i.e. drawing learning examples randomly from the majority class in order to equalise its size with the minority class. This is not sufficient when dealing with deep learning models, which require many learning examples. The purpose of this study is to examine the effect of imbalance reduction methods on the classification metrics of deep networks.
Single elements of a dataset are sometimes referred to as samples. Due to the temporal character of our data (each element of the dataset is a signal consisting of many samples) this may be confusing. To avoid misconceptions, from now on we will refer to a single element of the dataset as a learning example (or just example) and to the elements of a learning example as samples.
2 Related Work
Our work is heavily influenced by [8]. In that paper little emphasis was put on the imbalance of the data. In order to equalise the number of samples, a simple data augmentation technique was applied. Taking the imbalance of a dataset into consideration can greatly improve the performance of a model.
2.1 Residual Networks
Residual networks were proposed in [3]. They utilise skip connections to ease the learning task. Thanks to skip connections, the network layers must learn only the residual mapping F:

y = x + F(x)    (1)

instead of learning the whole mapping from x to y. The most common application of residual learning is convolutional neural networks (CNNs). CNNs can be applied to any data with a grid-like structure [7], including time series. Some papers have studied the impact of residual learning on the optimisation process [2]. These studies showed that ResNet performance does not degrade when using deeper architectures.
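As an illustration, a minimal sketch of such a block for 1D signals (assuming the Keras functional API; layer sizes are illustrative and the channel counts of x and F(x) are assumed to match):

from tensorflow.keras import layers

def residual_block(x, filters, kernel_size=5):
    # F(x): two convolutional layers with batch normalisation
    f = layers.Conv1D(filters, kernel_size, padding="same")(x)
    f = layers.BatchNormalization()(f)
    f = layers.Activation("relu")(f)
    f = layers.Conv1D(filters, kernel_size, padding="same")(f)
    f = layers.BatchNormalization()(f)
    # skip connection: the stacked layers learn only the residual F
    y = layers.Add()([x, f])
    return layers.Activation("relu")(y)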
2.2 UMCE
Undersampled Majority Class Ensemble (UMCE) for the binary classification task was introduced in [9]. UMCE is an ensemble classifier that transforms an imbalanced learning problem into a set of smaller, balanced problems. This is achieved by dividing the set of majority class examples into k folds, each containing a number of examples equal to the number of examples of the minority class. Next, a pool of base classifiers Ψ is trained, each on a training set consisting of one of the k majority class folds and all minority examples. Using the supports of Ψ, a fusing method can be applied to obtain the final decision of the ensemble.
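A minimal sketch of this scheme (assuming numpy and probabilistic scikit-learn base classifiers; the helper names are ours for illustration):

import numpy as np
from sklearn.base import clone

def train_umce(base_clf, X_maj, y_maj, X_min, y_min, k):
    # split the majority class into k folds of roughly minority size
    pool = []
    for idx in np.array_split(np.random.permutation(len(X_maj)), k):
        X = np.vstack([X_maj[idx], X_min])
        y = np.concatenate([y_maj[idx], y_min])
        pool.append(clone(base_clf).fit(X, y))
    return pool

def predict_umce(pool, X):
    # fuse the supports (predicted probabilities) by averaging
    supports = np.mean([clf.predict_proba(X) for clf in pool], axis=0)
    return supports.argmax(axis=1)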
Imbalance Reduction Techniques Applied to ECG Classification Problem 325
2.3 SMOTE
Synthetic Minority Over-sampling Technique (smote) was introduced in [4]. It has proven to be an effective technique. smote generates additional examples by using distances in the feature space. First it finds the k nearest neighbours of a given example. One of the neighbours is chosen at random. An additional example is generated by selecting a random point on the line in the feature space connecting the neighbour and the base example. smote can choose more than one neighbour, depending on the desired number of synthetic examples. It can also be combined with undersampling of the majority class.
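A minimal sketch of the generation step (assuming numpy and scikit-learn; in the experiments below this setting corresponds to imblearn.over_sampling.SMOTE with k_neighbors=5):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_one(X_min, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    i = np.random.randint(len(X_min))
    # drop index 0: the nearest neighbour of a point is the point itself
    neighbours = nn.kneighbors(X_min[i:i + 1], return_distance=False)[0][1:]
    j = np.random.choice(neighbours)
    lam = np.random.rand()  # random point on the connecting line segment
    return X_min[i] + lam * (X_min[j] - X_min[i])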
3 Methods
3.1 Undersampling and the Baseline Model
As the simplest method of solving this task we employed undersampling of all classes to the cardinality of the smallest class. This solution served as a baseline for comparing the performance of all methods. Also, the structure of the ResNet network was fine-tuned on the undersampled data. Networks with the same structure were used later in the experiments. Of course, the structure of the network could be tuned for each of the methods separately, but the sole purpose of this study is to compare the impact of the imbalance reduction methods, not to obtain the highest possible metric values.
Fig. 1. (Left) the schema of a block without a convolution on the skip connection - block A of ResNet. (Right) the schema of a block with a convolution on the skip connection - block B of ResNet.
Base Model. As the base model, a ResNet with 1D convolutions was utilised. The ResNet consists of two kinds of blocks: A and B (see Fig. 1). Block B is used when the number of filters increases. In that case a convolution is applied to the skip connection. This changes the tensor shape of the skip connection (its depth) and enables adding it to the output of the two convolutional layers with an increased number of filters.

The numbers of kernels in the convolutional layers were, respectively, 32-32-64-64-128. All kernels were of size 5 and were applied with stride 1. MaxPooling was applied with stride 2. Skip connections were applied over two convolutional layers with batch normalisation. The input of the whole model is passed to a single convolution layer with 32 filters. The output of this layer is fed into five blocks, respectively: A, A, B, A, B. After these blocks, three fully connected layers were applied as the classifier part of the network: the first two with the ReLU activation function and 512 neurons, and the last one with softmax.
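A minimal sketch of this architecture (assuming the Keras functional API; the exact placement of pooling within each block and the 1×1 kernel on the skip connection are our reading of Fig. 1, not taken from the original code):

from tensorflow.keras import layers, models

def block(x, filters, conv_on_skip):
    f = layers.Conv1D(filters, 5, strides=1, padding="same")(x)
    f = layers.BatchNormalization()(f)
    f = layers.Activation("relu")(f)
    f = layers.Conv1D(filters, 5, strides=1, padding="same")(f)
    f = layers.BatchNormalization()(f)
    if conv_on_skip:  # block B: match the increased number of filters
        x = layers.Conv1D(filters, 1, padding="same")(x)
    x = layers.Activation("relu")(layers.Add()([x, f]))
    return layers.MaxPooling1D(pool_size=2, strides=2)(x)

inputs = layers.Input(shape=(187, 1))        # 187 samples per example
x = layers.Conv1D(32, 5, padding="same")(inputs)
for filters, conv_on_skip in [(32, False), (32, False), (64, True),
                              (64, False), (128, True)]:  # A, A, B, A, B
    x = block(x, filters, conv_on_skip)
x = layers.Flatten()(x)
x = layers.Dense(512, activation="relu")(x)
x = layers.Dense(512, activation="relu")(x)
outputs = layers.Dense(5, activation="softmax")(x)
model = models.Model(inputs, outputs)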
3.2 Multiclass UMCE
For the purpose of this paper, a multiclass UMCE variant was developed based on the solution for binary classification. The algorithm is given as Algorithm 1 below. The majority and minority classes here are the classes with the highest and lowest number of examples, respectively. Please note that we limit the maximum number of base classifiers to 10 in order to restrict the training time of the whole ensemble. This limitation can be omitted or tuned when necessary. The number of folds for each class is different; therefore, during the training of the base classifiers we draw one of all folds available for each class. For each training set, the fold for a class is drawn once, so there is no risk that the training set will contain repeated examples. After training each of the base classifiers, the vector of all supports is used to provide the final decision. In this work the fusion technique was support averaging.

Algorithm 1. Multiclass UMCE training algorithm

#min = number of learning examples in the minority class;
#max = number of learning examples in the majority class;
IR = #max / #min;
number of classifiers = min(10, ⌈IR⌉);
for i from 0 to number of classes do
    #i = number of learning examples in class i;
    k_i = ⌊#i / #min⌋;
    perform a k_i-fold division of class i;
end
for j from 0 to number of classifiers do
    draw a random fold for each class;
    combine all drawn folds into a training set;
    sample base classifier Ψ_j;
    train Ψ_j on the training set
end
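For illustration, a minimal sketch of the fold bookkeeping from Algorithm 1 (assuming numpy arrays of per-class indices; the helper name and its interface are ours, not taken from the original code):

import math
import numpy as np

def class_folds(indices_per_class):
    # indices_per_class: one index array per class
    sizes = [len(ix) for ix in indices_per_class]
    n_min, n_max = min(sizes), max(sizes)
    # number of classifiers = min(10, ceil(IR))
    n_classifiers = min(10, math.ceil(n_max / n_min))
    folds = []
    for ix in indices_per_class:
        k_i = max(1, len(ix) // n_min)  # floor(#i / #min)
        folds.append(np.array_split(np.random.permutation(ix), k_i))
    return n_classifiers, folds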
3.3 SMOTE
In our experiments smote was utilised with the number of neighbours k = 5. All classes except the majority class were oversampled to the number of examples of the majority class. No undersampling of the majority class was performed. In order to improve the performance of the classification algorithm we introduced perturbations to the examples generated by smote. First we added random noise drawn from the U(−0.05, 0.05) distribution, and then we stretched the signals by resampling them to 1 + U(−0.5, 0.5)/3 of their original length. This augmentation method was used in [8]. Later we will refer to this method as smote with data augmentation.
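A minimal sketch of these perturbations (assuming the reconstructed ranges above and scipy resampling; cropping or zero-padding back to the fixed example length is our assumption):

import numpy as np
from scipy.signal import resample

def perturb(signal):
    # additive noise from U(-0.05, 0.05)
    noisy = signal + np.random.uniform(-0.05, 0.05, size=signal.shape)
    # stretch/squeeze to 1 + U(-0.5, 0.5)/3 of the original length
    factor = 1.0 + np.random.uniform(-0.5, 0.5) / 3.0
    stretched = resample(noisy, int(round(len(noisy) * factor)))
    # crop or zero-pad so every example keeps its original length
    out = np.zeros(len(signal))
    n = min(len(signal), len(stretched))
    out[:n] = stretched[:n]
    return out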
4 Experiment Setup
In order to reduce the impact of variance, each model was trained with 10-fold cross-validation. Tuning of hyperparameters was also performed using 10-fold CV, but without the test set examples from the standard train-test split of the dataset. Precision, recall and f1-score were reported to compare the performance of each model.
Statistical analysis was performed using the Kruskal-Wallis test with Conover post-hoc tests. We considered precision, recall and f1-score separately. Tests were performed using results from folds averaged over all labels.
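A minimal sketch of this protocol (assuming scipy and the scikit-posthocs package for the Conover test; the per-fold scores below are hypothetical placeholders):

import numpy as np
from scipy.stats import kruskal
import scikit_posthocs as sp

# one label-averaged f1-score per fold for each of the four methods
f1_per_method = [np.random.rand(10) for _ in range(4)]  # placeholder data
h_value, p_value = kruskal(*f1_per_method)      # Kruskal-Wallis H-test
pairwise_p = sp.posthoc_conover(f1_per_method)  # Conover post-hoc p-values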
4.1 Dataset
In our study the MIT-BIH Arrhythmia Dataset was used. It is available online¹. The ECG signals in this dataset are normalised to fit the [0, 1] interval, and each learning example is 187 samples long. There are 109446 learning examples with 5 classes. Details of class cardinality are given in Table 1. IR in this table is calculated with regard to the majority class (label 0).

Table 1. Number of examples and IR for each class.

Class               0      1       2        3         4
Number of examples  72470  2223    5788     641       6431
IR                  -      32.6:1  12.52:1  113.06:1  11.27:1

¹ https://www.kaggle.com/shayanfazeli/heartbeat.
4.2 Tools
All computations were performed using a computer with an Intel Core i7 8700 CPU and a GTX 1070Ti GPU. The dataset used for this study was downloaded from the kaggle service. The models were implemented using the keras [6] library with the tensorflow backend [1]. To develop the experiment framework, the scikit-learn [10], imbalanced-learn, numpy [11] and scipy packages were utilised.
4.3 Neural Network Training Process
All neural networks were trained for 20 epochs with an initial learning rate of 0.001, decreased at epochs 8, 13 and 18 by a factor of 10. The Adam optimiser was employed with the categorical cross-entropy loss function. The batch size was 32.
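A minimal sketch of this setup (assuming tf.keras; model, X_train and y_train are placeholders for the network and data described above):

from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.optimizers import Adam

def schedule(epoch, lr):
    # divide the learning rate by 10 at epochs 8, 13 and 18
    return lr / 10.0 if epoch in (8, 13, 18) else lr

model.compile(optimizer=Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, batch_size=32, epochs=20,
          callbacks=[LearningRateScheduler(schedule)])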
5 Results
Table 2. Test set results averaged over all folds.

Model                        Label  Precision  Recall  f1-score
ResNet with undersampling    0      0.994      0.892   0.940
                             1      0.306      0.890   0.455
                             2      0.822      0.934   0.874
                             3      0.225      0.927   0.361
                             4      0.933      0.982   0.957
                             avg    0.656      0.925   0.717
Multiclass UMCE              0      1.000      0.918   0.955
                             1      0.369      0.919   0.524
                             2      0.878      0.951   0.913
                             3      0.25       0.94    0.394
                             4      0.967      0.989   0.977
                             avg    0.693      0.943   0.753
ResNet with smote            0      0.994      0.996   0.995
                             1      0.917      0.893   0.905
                             2      0.976      0.972   0.974
                             3      0.871      0.818   0.843
                             4      0.993      0.993   0.993
                             avg    0.950      0.934   0.942
ResNet with smote            0      0.993      0.996   0.995
(with data augmentation)     1      0.927      0.884   0.905
                             2      0.978      0.969   0.974
                             3      0.867      0.817   0.841
                             4      0.995      0.994   0.994
                             avg    0.952      0.932   0.942
Table 3. H-values and p-values for the Kruskal-Wallis test.

Metric     H-value  p-value
Precision  33.354   2.710e-07
Recall     17.148   0.001
f1-score   32.975   3.258e-07
Table 4. Conover post-hoc test p-values. SwA is short for SMOTE with augmentation.

Metric     Undersampling  Undersampling  Undersampling  UMCE       UMCE       SMOTE
           vs UMCE        vs SMOTE       vs SwA         vs SMOTE   vs SwA     vs SwA
precision  2.492e-05      2.089e-13      5.056e-15      1.606e-07  1.562e-09  0.130
recall     6.985e-06      0.022          0.064          0.006      0.001      0.634
f1-score   3.990e-05      1.011e-13      6.389e-14      4.120e-08  2.341e-08  0.852
Results averaged over all folds for separate labels and averaged over all labels are given in Table 2. As expected, for all models, the most problematic classes were the ones with the lowest number of samples (1 and 3).
Based on the results of the Kruskal-Wallis tests presented in Table 3, with a significance level of α = 0.05, we can conclude that the algorithms differ in a significant way. We analyse the post-hoc test results provided in Table 4 with the same significance level. In the case of precision and f1-score, all the algorithms differ in a significant way, with the exception of smote and smote with augmentation. For recall, all algorithms differ except for undersampling versus smote with augmentation and smote versus smote with augmentation.
All of the analysed algorithms achieved a high value of recall. This property is desired in medical applications. Of all the methods, multiclass UMCE obtained the highest average recall and the highest recall for classes 1 and 3. Considering precision and f1-score, the UMCE proposed in this paper brings improvement compared to a single model with undersampling. However, smote with and without data augmentation obtains the best results, without a significant difference between these two approaches. Introducing augmentation in this case had only a small impact on the precision-recall tradeoff. The f1-score remains the same for smote with and without augmentation.
6 Conclusions
This research is compelling because it employs classic machine learning techniques to push forward the performance of deep learning models. By using this approach we were able to improve the precision, recall and f1-score metrics for ResNets.

More often than not, smote is used after applying undersampling to the majority class. That was not the case in our experiments. Deep learning models in general benefit greatly from an increased number of learning examples in a dataset. This dependence on the size of the dataset may also contribute to the worse performance of multiclass umce compared to smote. Future work may study the influence of applying smote to reduce IR to some degree, and then constructing a multiclass umce. Another possible direction of research is to increase the capacity of the ResNet model while applying smote, or to use a loss function designed to deal with imbalanced problems.
One may also find it surprising that smote applied to temporal signals can yield such good results, as smote does not take into consideration the temporal dependence of signal samples. We assume that the success of smote can be attributed to the great number of learning examples in the dataset and to the normalisation of each signal to the [0, 1] interval. The code used for the experiments is available online².

² https://github.com/jedrzejkozal/ecg_oversampling.
Acknowledgments. This work is supported by the Polish National Science Center under the Grant no. UMO-2015/19/B/ST6/01597 as well as the statutory funds of the Department of Systems and Computer Networks, Faculty of Electronics, Wrocław University of Science and Technology.

We also want to thank Michał Leś for lending his computing power resources. Thanks to him these results could be collected and presented.
References
1. Abadi, M.: TensorFlow: Large-scale machine learning on heterogeneous systems
(2015). http://tensorflow.org/
2. Li, H., et al.: Visualizing the loss landscape of neural nets. CoRR abs/1712.09913
(2017). http://arxiv.org/abs/1712.09913
3. He, K., et al.: Deep residual learning for image recognition. CoRR abs/1512.03385
(2015). http://arxiv.org/abs/1512.03385
4. Bowyer, K.W., et al.: SMOTE: synthetic minority over-sampling technique. CoRR
abs/1106.1813 (2011). http://arxiv.org/abs/1106.1813
5. Jun, T.J., et al.: ECG arrhythmia classification using a 2-D convolutional neural
network. CoRR abs/1804.06812 (2018). http://arxiv.org/abs/1804.06812
6. Chollet, F., et al.: Keras (2015). https://keras.io
7. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge
(2016). http://www.deeplearningbook.org
8. Kachuee, M., Fazeli, S., Sarrafzadeh, M.: ECG heartbeat classification: a deep
transferable representation. CoRR abs/1805.00794 (2018). http://arxiv.org/abs/
1805.00794
9. Ksieniewicz, P.: Undersampled majority class ensemble for highly imbalanced
binary classification. Proc. Mach. Learn. Res. 1, 1–13 (2010)
10. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
11. van der Walt, S., Colbert, S.C., Varoquaux, G.: The NumPy array: a structure for efficient numerical computation. Comput. Sci. Eng. 13(2), 22–30 (2011)
12. Xiong, Z., et al.: ECG signal classification for the detection of cardiac arrhythmias using a convolutional recurrent neural network. Physiol. Meas. (2018). https://doi.org/10.1088/1361-6579/aad9ed
13. Xu, S.S., Mak, M.W., Cheung, C.C.: Towards end-to-end ECG classification with raw signal extraction and deep neural networks. IEEE J. Biomed. Health Inform. (2019)