Heart Sound Segmentation - An Event Detection
Approach using Deep Recurrent Neural Networks
Elmar Messner, Student Member, IEEE, Matthias Zöhrer, and Franz Pernkopf, Senior Member, IEEE
Abstract - Objective: In this paper, we accurately
detect the state-sequence first heart sound (S1) - systole -
second heart sound (S2) - diastole, i.e. the positions of S1
and S2, in heart sound recordings. We propose an event
detection approach, without explicitly incorporating a priori
information of the state duration. This renders it also applicable
to recordings with cardiac arrhythmia and extendable to the
detection of extra heart sounds (third and fourth heart sound),
heart murmurs, as well as other acoustic events. Methods: We
use data from the 2016 PhysioNet/CinC Challenge, containing
heart sound recordings and annotations of the heart sound
states. From the recordings, we extract spectral and envelope
features and investigate the performance of different deep
recurrent neural network (DRNN) architectures to detect the
state-sequence. We use virtual-adversarial training (VAT),
dropout and data augmentation for regularization. Results:
We compare our results with the state-of-the-art method and
achieve an average score for the four events of the state-sequence
of F1 ≈ 96% on an independent test set. Conclusion: Our
approach shows state-of-the-art performance carefully evaluated
on the 2016 PhysioNet/CinC Challenge dataset. Significance: In
this work, we introduce a new methodology for the segmentation
of heart sounds, suggesting an event detection approach with
DRNNs using spectral or envelope features.
Index Terms—heart sound segmentation, acoustic event de-
tection, deep recurrent neural networks, gated recurrent neural
networks, bidirectional
I. INTRODUCTION
COMPUTER-AIDED heart sound analysis can be
considered as a twofold task: segmentation and
subsequent classification. The accurate segmentation
of the fundamental heart sounds, or more precisely
of the state-sequence first heart sound (S1) - systole -
second heart sound (S2) - diastole, is a challenging task.
In heart sound recordings of healthy adults only S1 and
S2 are present. However, extra heart sounds (third heart
sound - S3 and fourth heart sound - S4) can occur during
diastole, i.e. in the interval S2-S1, and heart murmurs during
systole, i.e. in the interval (S1-S2), and/or diastole, as shown
in Figure 1. Furthermore, the corruption by different noise
sources (e.g. motion artefacts, ambient noise) and other
body sounds (e.g. lung sounds, cough sounds) renders the
segmentation even more challenging.
Elmar Messner, Matthias Zöhrer and Franz Pernkopf are with the Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria.
This work was supported by the Austrian Science Fund (FWF) under the
project number P27803-N15. We acknowledge NVIDIA for providing GPU
computing resources.
Fig. 1. Examples of heart sound recordings: a) cardiac arrhythmia, b) extra (third) heart sound S3, and c) heart murmur (mitral valve prolapse). The marked events are first (S1), second (S2) and third (S3) heart sounds and heart murmurs.
According to [1], existing heart sound segmentation meth-
ods are classified into four groups: Envelope-based meth-
ods [2]–[7], feature based methods [8]–[14], machine learning
methods [15]–[21] and HMM methods [22]–[28]. The authors
in [28] introduced a logistic regression hidden semi-Markov
model (LR-HSMM) to predict the most likely sequence of
states in the order of S1 - systole - S2 - diastole, using a pri-
ori information about expected durations of the heart sound
states. In experiments, they achieve an average F-score of F1 = 95.63% on an independent test set. Due to the significant improvement in comparison to other reported methods in the literature, it is considered as the state-of-the-art method by the authors in [1]. A more extensive evaluation of the algorithm on the 2016 PhysioNet/CinC Challenge data [1] is presented in [29]. The authors report an average F-score of F1 = 98.5% for segmenting S1 and systole intervals and F1 = 97.2% for
segmenting S2 and diastole intervals. They observe detection
errors especially in the situations of long heart cycles and
irregular sinus rhythm. Also, the authors in [21] point out
that the LR-HSMM-method [28] may be unsuitable for the
segmentation in recordings with cardiac arrhythmia. Their
main objective is to investigate if S1 and S2 can be detected
without using a priori information about the state duration.
They propose a machine learning approach with a deep neural
network (DNN) in combination with mel frequency cepstral coefficient (MFCC) features for S1 and S2 heart sound
recognition. Using the K-means algorithm, they cluster the
MFCC features into two groups to refine their representation
and discriminative capability. The refined features are then fed
to a DNN. In experiments with a relatively small dataset, the
authors show that S1 and S2 can be detected with an accuracy
of 91 %, outperforming well-known classifiers, such as K-
nearest neighbor, Gaussian mixture models, logistic regression,
and support vector machines.
Within this paper, we exploit spectral information and
temporal dependencies of heart sounds for heart sound seg-
mentation. To this end, we propose an acoustic event detec-
tion approach with deep recurrent neural networks (DRNNs)
[30]–[32]. Recurrent neural networks (RNNs) are suitable to
process sequential input of variable length, and learn temporal
dependencies within the data [33], [34]. They are already used
in heart sound classification [35]–[37], but, to the best of
our knowledge, not introduced specifically for heart sound
segmentation. Compared to the LR-HSMM-method [28], we
do not directly incorporate a priori information about the
state durations, because the model is capable of learning the
temporal dependencies itself. Furthermore, we are flexible
regarding the order of occurring states enabling to model
additionally S3, S4 and heart murmurs. In acoustic event
detection, sound events are usually detected by onsets and
offsets, defining the beginning and ending of a particular
event within an audio recording. A distinction is made between polyphonic and monophonic event scenarios: in the first case, multiple events can occur at the same time, whereas in the second case no overlapping events exist. Within this work,
we consider heart sound segmentation as a monophonic event
scenario, although heart sound recordings can be contam-
inated with body sounds and different noise sources, and
therefore represent a polyphonic event scenario. DNNs show
a significant boost in performance when applied to acoustic
event detection. In particular, Gencoglu et al. [38] proposed
a DNN architecture for acoustic event detection. Although
DNNs are powerful network architectures, they do not model
temporal context explicitly. To account for temporal structure, long short-term memory (LSTM) networks, i.e. DNNs capable of modeling temporal dependencies, have been applied to
acoustic keyword spotting [39] and polyphonic sound event
detection [40]. Performance in recognition comes at the ex-
pense of computational complexity and the amount of labeled
data. LSTMs have a relatively high model complexity and
parameter tuning is not always simple. A simplification of
LSTMs are gated recurrent neural networks (GRNNs), which
have less parameters, but achieve comparable performance.
Due to this fact, we focus on GRNNs for the accurate seg-
mentation of fundamental heart sounds, or more precisely of
the state-sequence S1 - systole - S2 - diastole. GRNNs already
show promising results for acoustic event detection [41]. To
exploit future information as well, and not just information
from the past, we also consider bidirectional recurrent neural
networks [42], [43].
In particular, we extract spectral and envelope features
from heart sounds and investigate the performance of different
DRNN architectures to detect the state-sequence, i.e. acoustic
events. We use data from the 2016 PhysioNet/CinC Chal-
lenge [1], containing heart sound recordings and annotations
of the heart sound states. Our main contributions and results
are:
- We compare different recurrent neural network architectures.
- We evaluate bidirectional gated recurrent neural networks (BiGRNNs) in combination with virtual adversarial training (VAT), dropout and data augmentation for regularization.
- We show state-of-the-art performance on the 2016 PhysioNet/CinC Challenge dataset.
The paper is structured as follows: In Section II, we dis-
cuss common DRNN architectures. In particular, we explain
vanilla RNNs, LSTMs, GRNNs, and their implementations
as bidirectional networks. We introduce virtual adversarial
training (VAT), dropout and two data augmentation approaches
for regularization in Section III. In Section IV, we introduce
our processing framework for heart sound segmentation and
show experimental results, including the comparison with the
LR-HSMM-method [28]. Finally, we discuss our findings in
Section V and conclude the paper in Section VI.
II. RECURRENT NEURAL NETWORK ARCHITECTURES
RNNs are extensions of traditional feed forward neural
networks [44]. They are able to process sequential input of
variable length, and learn temporal dependencies within the
data. Various RNN architectures exist, such as Elman networks [45], Jordan networks [46], or Hopfield networks [47].
In this work, we focus on a more classical model, i.e. the
vanilla RNN, two very popular architectures, i.e. LSTMs and
GRNNs, and their bidirectional implementations.
A. Vanilla Recurrent Neural Networks
Figure 2 shows the flow-graph of an RNN unit. We consider a recurrent neural network with L layers, with l ∈ {1, ..., L-1} indexing the hidden layers of the network. Given the input vector x_f^l and the previous recurrent hidden state vector h_{f-1}^l, the sum of the dot product W_x^l x_f^l, the projected previous hidden state W_h^l h_{f-1}^l and the bias term b_h^l is computed. W_x^l is the input weight matrix and W_h^l the hidden weight matrix. A non-linear function g is applied to obtain the output h_f^l, as shown in Equation (1). The output h_f^l is used as the input for the next layer x_f^{l+1}. As shown in Equation (2), the output of the last hidden layer h_f^{L-1} is fed into the output layer. W_y is the output weight matrix and b_y the output bias term. A non-linear function m is applied to obtain the output y_f.

Fig. 2. Flow graph of a vanilla RNN unit. h denotes the activation.

h_f^l = g(W_x^l x_f^l + W_h^l h_{f-1}^l + b_h^l)    (1)

y_f = m(W_y h_f^{L-1} + b_y)    (2)
Multiple RNN layers can be stacked, forming a deep recurrent
neural network. They are trained via back-propagation through
time using a differentiable cost function.
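To make the recurrence concrete, the following minimal NumPy sketch (with hypothetical dimensions and randomly initialized weights, not a trained model) applies Equations (1) and (2) frame by frame for a single hidden layer:

```python
import numpy as np

def rnn_layer_forward(x, W_x, W_h, b_h, g=np.tanh):
    """Apply Eq. (1) over a feature sequence x of shape (F, D_in)."""
    n_frames = x.shape[0]
    n_hidden = W_h.shape[0]
    h = np.zeros((n_frames, n_hidden))
    h_prev = np.zeros(n_hidden)
    for f in range(n_frames):
        h[f] = g(W_x @ x[f] + W_h @ h_prev + b_h)  # Eq. (1)
        h_prev = h[f]
    return h

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: 60-dimensional MFCC frames, 200 hidden units, 5 output classes.
rng = np.random.default_rng(0)
x = rng.standard_normal((100, 60))                 # 100 frames
W_x = 0.01 * rng.standard_normal((200, 60))
W_h = 0.01 * rng.standard_normal((200, 200))
b_h = np.zeros(200)
h = rnn_layer_forward(x, W_x, W_h, b_h)
W_y, b_y = 0.01 * rng.standard_normal((5, 200)), np.zeros(5)
y = softmax(h @ W_y.T + b_y)                       # Eq. (2) with m = softmax
```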
B. Long Short Term Memory Networks
LSTMs [48], [49] are temporal recurrent neural networks using memory cells to store temporal information. In contrast to RNNs, LSTMs have memory cells, which store or erase their content using input gates i or forget gates r. An additional output gate o is used to access this information. Figure 3 shows the flow-graph of an LSTM unit.
Fig. 3. Flow graph of an LSTM unit [31]. i, r, and o are the input, forget and output gates, respectively. c denotes the memory cell and c̃ the new memory cell content.
In Equations (3-8), the network is mathematically described. The input states i_f^l are calculated by applying a sigmoid function σ to the sum of the dot-product of the input weight matrix W_{xi}^l and the inputs x_f^l, the projected previous hidden states W_{hi}^l h_{f-1}^l and the bias vector b_i^l of layer l (cf. Equation 3). The forget states r_f^l (cf. Equation 4) and output states o_f^l (cf. Equation 5) are computed in a similar way, except for using individual forget matrices W_{xr}^l, W_{hr}^l and the forget bias vector b_r^l and output matrices W_{xo}^l, W_{ho}^l and the output bias vector b_o^l, respectively. The new memory states c̃_f^l are obtained by applying a tanh activation function to the sum of the projected inputs W_{xc}^l x_f^l, previous hidden memory states W_{hc}^l h_{f-1}^l and the bias vector b_c^l in Equation (6). The memory cell states c_f^l are updated by the previous memory states c_{f-1}^l and c̃_f^l (cf. Equation (7)), weighted with the forget states r_f^l and the input states i_f^l, respectively (⊙ denotes an element-wise product). The outputs h_f^l are computed with the current memory states tanh(c_f^l) and the output states o_f^l in Equation (8).

i_f^l = σ(W_{xi}^l x_f^l + W_{hi}^l h_{f-1}^l + b_i^l)    (3)

r_f^l = σ(W_{xr}^l x_f^l + W_{hr}^l h_{f-1}^l + b_r^l)    (4)

o_f^l = σ(W_{xo}^l x_f^l + W_{ho}^l h_{f-1}^l + b_o^l)    (5)

c̃_f^l = tanh(W_{xc}^l x_f^l + W_{hc}^l h_{f-1}^l + b_c^l)    (6)

c_f^l = r_f^l ⊙ c_{f-1}^l + i_f^l ⊙ c̃_f^l    (7)

h_f^l = o_f^l ⊙ tanh(c_f^l)    (8)
In classical RNNs, the hidden activation is overwritten at
each time-step (cf. Equation 1). LSTMs are able to decide
whether to keep or erase existing information with the help
of their gates. If LSTMs detect important features from an input sequence at an early stage, they can easily carry this information over a long distance, hence capturing potential long-distance dependencies.
C. Gated Recurrent Neural Networks
GRNNs [31], [32] are simplifications of LSTMs, achieving
comparable performance, but having less parameters. Gated
recurrent units have reset- and update-gates, coupling static
and temporal information. This allows the network to learn
temporal information. Whenever an important event happens, the update-gate z decides to renew the current state of the model. The network can forget the previously computed information by deleting the current state of the model with the
reset-gate r. Figure 4 shows the flow graph of a gated recurrent
unit.
Fig. 4. Flow graph of a gated recurrent unit [31]. r and z denote the reset and update gates, and h and h̃ the activation and the candidate activation.
Equations (9-12) mathematically describe the network. Equation (9) starts with the output states h_f^l, which are computed as a linear interpolation between past states h_{f-1}^l and current information h̃_f^l, using the update-states z_f^l. The update-states z_f^l determine the update behavior of the units. According to Equation (10), they are computed as a sigmoid function of the weighted input x_f^l and the past hidden states h_{f-1}^l. W and b denote the weights and bias terms. In Equation (11), the states h̃_f^l are computed with a non-linear function g, applied to the affine transformed input and the previous hidden states h_{f-1}^l. This is similar to Equation (1), only differing in the additional reset-state r_f^l, which is element-wise multiplied with h_{f-1}^l. In Equation (12), the reset state is computed with the current inputs x_f^l and the provided hidden states h_{f-1}^l.

h_f^l = (1 - z_f^l) ⊙ h_{f-1}^l + z_f^l ⊙ h̃_f^l    (9)

z_f^l = σ(W_{xz}^l x_f^l + W_{hz}^l h_{f-1}^l + b_z^l)    (10)

h̃_f^l = g(W_{xh}^l x_f^l + W_{hh}^l (r_f^l ⊙ h_{f-1}^l) + b_h^l)    (11)

r_f^l = σ(W_{xr}^l x_f^l + W_{hr}^l h_{f-1}^l + b_r^l)    (12)
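For illustration, a single gated recurrent unit update for one frame can be sketched as follows (a minimal NumPy sketch with hypothetical weight shapes, using g = tanh):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_f, h_prev, W_xz, W_hz, b_z, W_xh, W_hh, b_h, W_xr, W_hr, b_r):
    """One gated recurrent unit update for a single frame (Equations 9-12)."""
    z = sigmoid(W_xz @ x_f + W_hz @ h_prev + b_z)              # update gate, Eq. (10)
    r = sigmoid(W_xr @ x_f + W_hr @ h_prev + b_r)              # reset gate, Eq. (12)
    h_cand = np.tanh(W_xh @ x_f + W_hh @ (r * h_prev) + b_h)   # candidate activation, Eq. (11)
    return (1.0 - z) * h_prev + z * h_cand                     # new hidden state, Eq. (9)
```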
D. Bidirectional Recurrent Neural Networks
Conventional RNNs are limited to previous context, i.e.
information in the past of a specific time frame. To make
use of future context as well, their extension to bidirectional
RNNs [42] can be used (see Figure 5). Bidirectional RNNs
process data in both directions with two separate hidden layers,
which are then fed into the same output layer.
Fig. 5. Bidirectional recurrent neural network [50].
Equations (13-15) specify the network mathematically. The forward hidden sequence →h_f^l is computed by iterating the forward layer from f = 1 to F. The backward hidden sequence ←h_f^l is computed by iterating the backward layer from f = F to 1. W_{x→h}^l, W_{x←h}^l are the input weight matrices, W_{→h→h}^l, W_{←h←h}^l the hidden weight matrices, and b_{→h}^l, b_{←h}^l the bias terms for the forward and backward hidden layer, respectively. Multiple bidirectional RNN layers can be stacked, forming a deep bidirectional RNN. Every hidden layer receives input from the previous forward and backward layers, i.e. →h_f^{l-1} and ←h_f^{l-1}. According to Equation (15), the output layer y_f is updated using the hidden activations →h_f^{L-1} and ←h_f^{L-1} of the last hidden layer L-1. W_{→h y}, W_{←h y} are the output weight matrices and b_y the output bias term.

BiRNNs can be combined with LSTMs or GRNNs, resulting in bidirectional long short term memory networks (BiLSTMs) or bidirectional gated recurrent neural networks (BiGRNNs) [50].

→h_f^l = g(W_{x→h}^l x_f^l + W_{→h→h}^l →h_{f-1}^l + b_{→h}^l)    (13)

←h_f^l = g(W_{x←h}^l x_f^l + W_{←h←h}^l ←h_{f+1}^l + b_{←h}^l)    (14)

y_f = m(W_{→h y} →h_f^{L-1} + W_{←h y} ←h_f^{L-1} + b_y)    (15)
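In a modern framework, the stacked bidirectional architecture with a per-frame softmax output can be sketched as follows. This is a PyTorch re-implementation sketch (the original experiments used Theano); the feature dimension and layer sizes are placeholders:

```python
import torch
import torch.nn as nn

class BiGRNN(nn.Module):
    """Stacked bidirectional GRU with a frame-wise output layer (cf. Eqs. 13-15)."""
    def __init__(self, n_features=45, n_hidden=200, n_layers=2, n_classes=5):
        super().__init__()
        # nn.GRU with bidirectional=True maintains the forward and backward
        # hidden sequences of Equations (13) and (14) internally.
        self.rnn = nn.GRU(n_features, n_hidden, num_layers=n_layers,
                          batch_first=True, bidirectional=True)
        # A linear layer on the concatenated directions corresponds to Equation (15).
        self.out = nn.Linear(2 * n_hidden, n_classes)

    def forward(self, x):          # x: (batch, frames, n_features)
        h, _ = self.rnn(x)         # h: (batch, frames, 2 * n_hidden)
        return self.out(h)         # per-frame class logits

model = BiGRNN()                                   # hypothetical feature dimension of 45
logits = model(torch.randn(4, 500, 45))            # 4 recordings, 500 frames each
probs = torch.softmax(logits, dim=-1)              # softmax output per frame
```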
III. REGULARIZERS FOR RNNS
Deep neural networks usually require many training sam-
ples. The training set of the PhysioNet/CinC Challenge
database [1] is limited to only 3153 heart sound recordings.
We consider three different approaches for regularization to
improve the ability of the model to generalize on test data, i.e.
virtual adversarial training, dropout and data augmentation.
A. Virtual Adversarial Training
Virtual adversarial training (VAT) [51], [52] is a regularization method, which makes the model robust against adversarial perturbations [53], [54]. It promotes local smoothness of the posterior distribution p(y_f|x_f) with respect to x_f. The posterior distribution, or more precisely the softmax activation of the network output h_f^l, should vary minimally for small, bounded perturbations of the input x_f. The adversarial perturbation δ_f is determined on frame-level by maximizing the Kullback-Leibler divergence KL(·||·) of the posterior distribution for unperturbed and perturbed inputs, i.e.

δ_f = arg max_{||δ|| < ε} KL(p(y|x_f) || p(y|x_f + δ)),    (16)

where ε > 0 limits the maximum perturbation, i.e. the noisy input x_f + δ lies within a radius ε around x_f. The smaller KL(p(y|x_f) || p(y|x_f + δ_f)), the smoother the posterior distribution is around x_f. Instead of maximizing the conditional likelihood p(y_f|x_f) of the model during training, we maximize the regularized objective

Σ_f log p(y_f|x_f) - λ Σ_f KL(p(y|x_f) || p(y|x_f + δ_f)),    (17)

where the tradeoff parameter λ and the radius ε have to be selected on a validation set.

For further details regarding the implementation, we refer to [51]. In our experiments, we tune the number of iterations I_p, the radius ε and the tradeoff parameter λ.
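For illustration, a simplified VAT penalty with a single power iteration (I_p = 1) could be implemented as follows; this is a PyTorch sketch under the assumption of a generic frame-level model and is not the implementation of [51]. The returned KL term is added to the negative log-likelihood with weight λ during training (Equation 17):

```python
import torch
import torch.nn.functional as F

def vat_penalty(model, x, eps=0.1, xi=1e-6, n_power=1):
    """Approximate KL(p(y|x) || p(y|x + delta)) for the virtual adversarial delta (Eq. 16)."""
    with torch.no_grad():
        log_p = F.log_softmax(model(x), dim=-1)          # posterior for the clean input

    d = torch.randn_like(x)                              # random initial direction
    d = d / (d.norm(p=2, dim=-1, keepdim=True) + 1e-12)
    for _ in range(n_power):                             # power iteration for the worst direction
        d.requires_grad_(True)
        log_p_hat = F.log_softmax(model(x + xi * d), dim=-1)
        kl = F.kl_div(log_p_hat, log_p.exp(), reduction="batchmean")
        grad = torch.autograd.grad(kl, d)[0]
        d = (grad / (grad.norm(p=2, dim=-1, keepdim=True) + 1e-12)).detach()

    log_p_hat = F.log_softmax(model(x + eps * d), dim=-1)   # perturbation of radius eps
    return F.kl_div(log_p_hat, log_p.exp(), reduction="batchmean")
```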
B. Dropout
The idea of dropout is to randomly drop units from the
neural network during training [55]. In this work, we consider
input dropout applied on the hidden layers. For simplicity, we show dropout just for the vanilla RNN. Equations (18-20) describe the feed-forward operation of the network with dropout.

r^l ~ Bernoulli(p)    (18)

x̃_f^l = r^l ⊙ x_f^l    (19)

h_f^l = g(W_x^l x̃_f^l + W_h^l h_{f-1}^l + b_h^l)    (20)

For any hidden layer l ∈ {1, ..., L-1}, r^l is a vector of independent Bernoulli random variables, each having a probability p of being 1, with p = [p, p, ..., p]^T. The vector r^l is multiplied element-wise with the inputs of the layer x_f^l to create the thinned inputs x̃_f^l. The thinned inputs are then used as inputs to the current layer. For training, the derivatives of the loss function are backpropagated through the sub-network. For testing, the network is used without dropout and the weights are scaled as W_{x,test}^l = p W_x^l.
C. Data Augmentation
We consider two approaches for data augmentation, i.e. noise injection and the generation of additional training data with various audio transformations.
1) Noise Injection: Noise injection to the inputs of a neural
network can be considered as a form of data augmenta-
tion [56]. The model should be capable of detecting the heart sound sequence, although random noise is added to the inputs and also applied to the hidden units. The authors in [57]
showed that noise injection can be very effective if the noise
magnitude is carefully tuned. Dropout (see Section III-B) can
be considered as a process of constructing new inputs by using
a particular type of noise [56]. We add zero mean Gaussian
noise to the inputs xfand the hidden units during training.
Standard deviation and noise level are tuned.
2) Audio Transformations: The best way to prevent over-
fitting is to train on more data. Therefore, we augment the
training data by using various audio transformations from
SoX [58], similar to [59]. We consider the following two transformations to slightly modify the heart sound recordings:
- Pitch: Change the audio pitch without changing the tempo.
- Tempo: Change the audio playback speed but not its pitch.
We provide an overview of the augmented training set in
Table IV in Section IV-B2.
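As a sketch, the two transformations can be applied with SoX from Python; the file names are hypothetical, and the parameter values correspond to the pitch shift of ± one semitone (± 100 cents) and the tempo change of ± 10 % used for the augmented training set in Section IV-B2:

```python
import subprocess

# Hypothetical clean recording; four transformed copies are generated from it.
for cents in ("100", "-100"):                  # pitch shift of +/- one semitone
    subprocess.run(["sox", "clean.wav", f"pitch_{cents}.wav", "pitch", cents], check=True)
for factor in ("1.10", "0.90"):                # tempo change of +/- 10 %
    subprocess.run(["sox", "clean.wav", f"tempo_{factor}.wav", "tempo", factor], check=True)
```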
IV. HEART SOUND SEGMENTATION - EXPERIMENTS
A. Audio Processing Framework
Figure 6 shows the basic steps of our heart sound segmentation framework. Given the raw audio data x_t = [x_1, . . . , x_T], we extract a sequence of feature frames x_f ∈ R^D. D indicates the dimension of the feature vector and f ∈ {1, ..., F} is the frame index, with F indicating the number of frames.
Fig. 6. Audio processing framework for heart sound segmentation with DRNNs (feature extraction, DRNN, frame-wise arg max, event-sequence).
We process the feature frames with a multi-label DRNN with a softmax output layer. The index of the maximum value (arg max) of the real-valued output vector ỹ_f determines the event class per frame. This results in a sequence of frame labels as output. We group consecutive identical frame labels as one event.
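The frame-wise decision and the grouping of consecutive identical frame labels into events could be implemented as in the following sketch (assuming per-frame posteriors from the DRNN and the 20 ms frame shift used in Section IV-C):

```python
import numpy as np

LABELS = ["no signal", "S1", "systole", "S2", "diastole"]
FRAME_SHIFT = 0.02  # seconds

def frames_to_events(probs):
    """probs: (F, 5) per-frame posteriors -> list of (label, onset_s, offset_s) events."""
    labels = probs.argmax(axis=1)              # arg max over the softmax outputs
    events, start = [], 0
    for f in range(1, len(labels) + 1):
        if f == len(labels) or labels[f] != labels[start]:
            events.append((LABELS[labels[start]], start * FRAME_SHIFT, f * FRAME_SHIFT))
            start = f
    return events
```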
B. Material
1) Heart Sound Database (Training Sets of the Phys-
ioNet/CinC Challenge 2016): For the experiments within this
section, we use heart sounds from the 2016 PhysioNet/CinC
Challenge [1]. The dataset is a collection of several heart
sound databases from different research groups, obtained in
different real-world clinical and nonclinical environments. It
contains recordings from normal subjects and pathological
patients, which are grouped as follows: Normal control group
(Normal), murmurs related to mitral valve prolapse (MVP),
innocent or benign murmurs (Benign), aortic disease (AD),
miscellaneous pathological conditions (MPC), coronary artery
disease (CAD), mitral regurgitation (MR), aortic stenosis (AS)
and pathological (Pathologic). The heart sounds were recorded
at the four common recording locations: aortic area, pulmonic
area, tricuspid area and mitral area. Due to the fact that
the database is a collection of several small databases from
different research groups, the recordings vary regarding several
aspects: recording hardware, recording locations, data quality
and patient types and methods for identifying gold standard
diagnoses. For further details, we refer to [1].
The training set includes data from six databases, with a
total of 3153 heart sound recordings from 764 subjects/patients
(see Table I). The recordings are sampled with f_s = 2 kHz and vary in length between 5 s and just over 120 s. The dataset is unbalanced, i.e. the number of normal recordings differs from that of abnormal recordings.

TABLE I
SUMMARY OF THE DATASET (TRAINING DATA FROM THE 2016 PHYSIONET/CINC CHALLENGE) [1].
Challenge set # Patients # Recordings # Beats
PN-training-a 121 409 14559
PN-training-b 106 490 3353
PN-training-c 31 31 1808
PN-training-d 38 55 853
PN-training-e 356 2054 59593
PN-training-f 112 114 4260
Total 764 3153 84426

Besides a binary diagnosis
(-1=normal, 1=abnormal) for each heart sound recording,
the challenge dataset further provides annotations for the
heart sound states (S1,systole,S2,diastole). The annotations
were generated with the LR-HSMM-based segmentation al-
gorithm [28] (trained on PN-training-a) and further manually
corrected. The annotations solely generated with the segmentation algorithm and those generated with the segmentation algorithm and subsequent hand correction are accessible separately. In total, 84426 beats were annotated in the PN-training set (after hand correction).

TABLE II
SUMMARY OF THE DATASET (TRAINING DATA FROM THE 2016 PHYSIONET/CINC CHALLENGE) [1] AFTER EXCLUDING AREAS LABELED AS noisy (LABELS: '(N', 'N)') AND FILES MARKED AS unsure.
Challenge set # Recordings # Beats
PN-training-a 392 14559
PN-training-b 368 3353
PN-training-c 27 1808
PN-training-d 52 853
PN-training-e 1926 59567
PN-training-f 109 4260
Total 2874 84400
Because the reference annotations for the four heart sound
states were not available for heart sound recordings marked
with unsure (=low signal quality), we excluded these record-
ings. We further excluded areas labeled as noisy (labels: ’(N’,
’N)’) by setting the respective areas of the signal to zero (no
signal). Table II shows the resulting number of recordings and
beats.
2) Training, Validation and Test Data: Due to the fact that
the original test set from the PhysioNet/CinC Challenge 2016
is not publicly available so far, we generated a new test-,
validation- and training-set out of the original PhysioNet (PN)-
training set (see Section IV-B1). In the test set, we put exclu-
sively PN-training-a and some recordings from PN-training-b
and PN-training-e. For the recordings from PN-training-b and
PN-training-e, we ensured their exclusivity in terms of subject
affiliation, i.e. each subject is either only in the training set or
the test set. We selected all recordings from the same subject
with increasing ’Subject ID’ (for PN-training-b) and increasing
’Raw record’ name (for PN-training-e). This additional infor-
mation is provided by the online appendix of the database.
The resulting test set contains 764 recordings with 21116 beats in total.

TABLE III
SUMMARY OF THE TEST, VALIDATION AND TRAINING SET. THE ASSIGNED NUMBER OF RECORDINGS (#R.) AND BEATS (#B.) ARE REPORTED. THE RECORDINGS ARE GROUPED AS FOLLOWS: NORMAL CONTROL GROUP (NORMAL), MURMURS RELATED TO MITRAL VALVE PROLAPSE (MVP), INNOCENT OR BENIGN MURMURS (BENIGN), AORTIC DISEASE (AD), MISCELLANEOUS PATHOLOGICAL CONDITIONS (MPC), CORONARY ARTERY DISEASE (CAD), MITRAL REGURGITATION (MR), AORTIC STENOSIS (AS) AND PATHOLOGICAL (PATHOLOGIC).
Dataset Challenge set Normal MVP Benign AD MPC CAD MR AS Pathologic Total
#R. #B. #R. #B. #R. #B. #R. #B. #R. #B. #R. #B. #R. #B. #R. #B. #R. #B. #R. #B.
Test PN-training-a 116 4419 126 4583 114 4200 13 425 23 932 - - - - - - - - 392 14559
PN-training-b 135 1278 - - - - - - - - 29 238 - - - - - - 164 1516
PN-training-e 205 5018 - - - - - - - - 3 23 - - - - - - 208 5041
Total 456 10715 126 4583 114 4200 13 425 23 932 32 261 - - - - - - 764 21116
Validation PN-training-b 12 100 - - - - - - - - 5 49 - - - - - - 17 149
PN-training-c 1 23 - - - - - - - - - - 3 125 1 18 - - 5 166
PN-training-d 3 32 - - - - - - - - - - - - - - 2 26 5 58
PN-training-e 156 4987 - - - - - - - - 15 344 - - - - - - 171 5331
PN-training-f 7 269 - - - - - - - - - - - - - - 5 162 12 431
Total 179 5411 - - - - - - - - 20 393 3 125 1 18 7 188 210 6135
Training PN-training-b 148 1313 - - - - - - - - 39 375 - - - - - - 187 1688
PN-training-c 6 340 - - - - - - - - - - 9 772 7 530 - - 22 1642
PN-training-d 23 302 - - - - - - - - - - - - - - 24 493 47 795
PN-training-e 1419 46564 - - - - - - - - 128 2631 - - - - - - 1547 49195
PN-training-f 71 2820 - - - - - - - - - - - - - - 26 1009 97 3829
Total 1667 51339 - - - - - - - - 167 3006 9 772 7 530 50 1502 1900 57149
From the residual recordings, we randomly selected
210 recordings (6135 beats) for the validation set and 1900
recordings (57149 beats) for the training set. Details about the
splitting are shown in Table III.
In Section III-C, we introduce two transformations for
data augmentation, pitch shifting and temporal stretch-
ing/compressing. We modify the recordings from the training
set with a pitch shift of ± one semitone, i.e. a fundamental frequency of 50 Hz varies by approximately ±3 Hz. We modify the time-scale of the recordings by ±10 %. In total,
we get an augmented dataset consisting of 9500 recordings
and 285745 beats, as shown in Table IV.
TABLE IV
AUGMENTED TRAINING SET.
Effect Parameters # Recordings # Beats
Clean 1900 57149
Pitch +semitone 1900 57149
Pitch -semitone 1900 57149
Tempo +10% 1900 57149
Tempo -10% 1900 57149
Total 9500 285745
3) Labeling: Based on the hand corrected annotations,
we generated the labeling for the state sequence first heart
sound (S1) - systole - second heart sound (S2) - diastole. Due
to the shift of 20 ms in our frame-wise processing framework
(see Section IV-C), we generated a label for each frame from
the annotation information. In addition to the state labeling,
we further added the label no signal, for areas with absent
signal due to zero-padding. Figure 7 shows an example of a
phonocardiogram (PCG) with the five labels.
Fig. 7. Example of a phonocardiogram (PCG) showing the five possible labels: no signal, first heart sound (S1), systole (sys), second heart sound (S2), and diastole (dia).
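A sketch of the frame-wise label generation from the segment annotations is given below; the annotation format (onset, offset, state) in seconds is an assumption for illustration, and frames not covered by any state receive the label no signal:

```python
FRAME_SHIFT = 0.02  # 20 ms frame shift, cf. Section IV-C

def make_frame_labels(annotations, n_frames):
    """annotations: list of (onset_s, offset_s, state) with state in
    {'S1', 'systole', 'S2', 'diastole'}."""
    labels = ["no signal"] * n_frames
    for onset, offset, state in annotations:
        first = int(round(onset / FRAME_SHIFT))
        last = int(round(offset / FRAME_SHIFT))
        for f in range(first, min(last, n_frames)):
            labels[f] = state
    return labels
```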
C. Feature Extraction
We resampled the heart sound recordings to a sampling frequency of f_s = 1 kHz and removed the DC offset with a high-pass filter with a cutoff frequency of f_c = 10 Hz (relevant for PN-training-e). We zero-padded the recordings according to the longest one in each set.

For the spectral features, we preprocessed all recordings with an STFT using a Hamming window with a window-size of 80 ms (i.e. 80 samples) and 75 % overlap (i.e. frame-shifts of 20 ms or 20 samples). To exploit the spectral information of the heart sounds, we consider the following two types of spectral features:
- Spectrogram: We extract 41-bin log magnitude spectrograms.
- Mel Frequency Cepstral Coefficients (MFCCs): MFCCs are used as features in various acoustic pattern recognition tasks, including heart sound classification [1] and heart sound segmentation [21]. We extract 20 static coefficients, 20 delta coefficients (Δ) and 20 acceleration coefficients (ΔΔ), resulting in a 60-bin vector per frame. We use 20 mel bands within a frequency range of 0-500 Hz. With a width of 9 frames, we calculate the delta and acceleration coefficients.

Furthermore, similarly to the LR-HSMM-method [28], we extract feature vectors for 20 ms-frames with the following features:
- Envelope features [28]: Homomorphic envelope, Hilbert envelope, wavelet envelope and power spectral density.
All features were normalized to zero-mean unit variance using
the training corpus.
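A possible front-end sketch for the spectrogram and MFCC features (using librosa; parameter values follow the description above, while the 10 Hz high-pass filter is omitted for brevity and the normalization statistics would in practice be taken from the training corpus) is:

```python
import numpy as np
import librosa

FS = 1000        # resampled to 1 kHz
N_FFT = 80       # 80 ms Hamming window at 1 kHz -> 41 frequency bins
HOP = 20         # 20 ms frame shift

def extract_features(wav_path):
    y, _ = librosa.load(wav_path, sr=FS)                         # resampling to 1 kHz
    # 41-bin log magnitude spectrogram.
    spec = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP, window="hamming"))
    log_spec = np.log(spec + 1e-10)
    # 20 static MFCCs (20 mel bands, 0-500 Hz) plus delta and acceleration coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=FS, n_mfcc=20, n_fft=N_FFT, hop_length=HOP,
                                n_mels=20, fmin=0, fmax=500)
    d1 = librosa.feature.delta(mfcc, width=9)
    d2 = librosa.feature.delta(mfcc, order=2, width=9)
    feats = np.vstack([log_spec, mfcc, d1, d2]).T                # (frames, 41 + 60)
    # Zero-mean, unit-variance normalization (training-set statistics in practice).
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-10)
```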
D. Evaluation Metrics
We evaluate the results event-based. We define an event as correctly detected if its temporal position overlaps with that of an identically labeled event in the hand annotated ground truth. For temporal onset and offset, we allow a tolerance of ±40 ms (i.e. ± two frame-shifts of 20 ms), respectively. We determine for all heart sound recordings:
- True positives (TP): Events, where system output and ground truth have a temporal overlap;
- False positives (FP): The ground truth indicates no event that the system outputs;
- False negatives (FN): The ground truth indicates an event that is not detected by the system;
- Substitutions (S): Events in the automatic segmentation with correct temporal position, but incorrect class label;
- Insertions (I): False positives minus the number of substitutions;
- Deletions (D): False negatives minus the number of substitutions;
- Reference states (N): Number of events in the ground truth.
We evaluate the performance of the segmentation algorithm
using Precision (Eqn. 21), Sensitivity (Equation 22), F-score
(Equation 23) and Error Rate (Equation 24). The Error Rate
should be small, while the F-score should be large. For a more
detailed description of the metrics, we refer to [60].
P+ = TP / (TP + FP)    (21)

Se = TP / (TP + FN)    (22)

F1 = 2 · P+ · Se / (P+ + Se)    (23)

ER = (S + I + D) / N    (24)
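Given the counts defined above (determined beforehand with the ±40 ms tolerance), the metrics of Equations (21)-(24) can be computed as in this minimal sketch:

```python
def event_metrics(tp, fp, fn, s, n_ref):
    """Precision, sensitivity, F-score and error rate (Equations 21-24)."""
    precision = tp / (tp + fp)                                     # Eq. (21)
    sensitivity = tp / (tp + fn)                                   # Eq. (22)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)   # Eq. (23)
    insertions = fp - s                                            # I
    deletions = fn - s                                             # D
    error_rate = (s + insertions + deletions) / n_ref              # Eq. (24)
    return precision, sensitivity, f1, error_rate
```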
E. Experiments and Results
For the experiments, we built a single multi-label classification system. We initialize the models with orthogonal weights [61] and use a softmax output layer. For optimizing the cross-entropy error (CEE) objective, we use ADAM [62]. We perform early stopping: we train each model for 200 epochs and use the parameter setting that yields the smallest validation error for the evaluation of the model. The reported scores are the average values over the events S1, systole, S2 and diastole. In addition to the average values, we report the scores for each event independently on the test set for the best setup. The reported scores are results on the validation set, except for the evaluation of the final setup on the test set in Section IV-E6. (We conducted the experiments using Python with Theano, and CUDA for GPU computing.)
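A training-loop skeleton matching this setup (ADAM, frame-wise cross-entropy, 200 epochs with early stopping on the validation error) could look as follows. This is a PyTorch sketch of the described procedure, not the original Theano code; train_loader and valid_error are hypothetical helpers:

```python
import copy
import torch
import torch.nn as nn

def train(model, train_loader, valid_error, n_epochs=200, lr=1e-3):
    """model: frame-wise classifier (e.g. the BiGRNN sketch in Section II-D)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # ADAM [62]
    criterion = nn.CrossEntropyLoss()                          # cross-entropy error (CEE)
    best_err, best_state = float("inf"), None
    for epoch in range(n_epochs):
        model.train()
        for x, y in train_loader:        # x: (B, F, D) features, y: (B, F) frame labels
            optimizer.zero_grad()
            logits = model(x)            # (B, F, n_classes)
            loss = criterion(logits.reshape(-1, logits.shape[-1]), y.reshape(-1))
            loss.backward()
            optimizer.step()
        err = valid_error(model)         # validation error after every epoch
        if err < best_err:               # early stopping: keep the best parameter setting
            best_err, best_state = err, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```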
1) Comparison of GRNN network size: We initiate our
experiments with finding an appropriate network size by using
GRNNs and MFCC features. We use rectifier activations for
the gated recurrent units. Figure 8 shows the results with
varying number of neurons per hidden layer and varying
number of hidden layers per model. For a 2-hidden-layer GRNN, we achieved the best score of F1 = 93.5% with 400 neurons/layer. For a GRNN with hidden layers of 200 neurons, we achieved the best score of F1 = 93.2% with 4 hidden layers. Due to the small difference regarding the F-
score, we choose a network size in favor of faster training.
For the subsequent experiments, we fix the model size to 2
hidden layers, and 200 neurons per layer.
Fig. 8. Comparison of network size: (a) shows the F-scores for a GRNN using {2, ..., 6} hidden layers of 200 neurons. (b) shows the F-score for a GRNN with two hidden layers using {100, 200, ..., 500} neurons per layer.
2) Comparison of GRNN activation functions: Table V
shows the results for different activation functions. In particu-
lar, we use sigmoid, tanh and rectifier non-linearities. Again,
we use (2 hidden layers, 200 neurons per layer) GRNNs and
MFCC features. Rectifier functions achieve the best average
score, i.e. F1 = 93.0%. This is consistent with the literature [63].
TABLE V
COMPARING DIFFERENT ACTIVATION FUNCTIONS USING GRNNS.
Model Features Activation P+ (%) Se (%) F1 (%) ER
GRNN MFCCs sigmoid 91.4 92.1 91.7 0.17
GRNN MFCCs tanh 92.4 93.3 92.8 0.15
GRNN MFCCs rectifier 92.2 93.8 93.0 0.14
3) Comparison of RNN architectures: Table VII shows the
results for different RNN architectures. We compare different
models, i.e. RNNs, LSTMs, GRNNs, and their bidirectional
versions, using MFCC features. The model size for the con-
ventional models is 2 hidden layers and 200 neurons per layer.
For the bidirectional models, we use 2 hidden layers and 100
neurons for the forward and backward layers, respectively. The
BiLSTM slightly outperforms the other models, achieving an average F-score of F1 = 94.1%. Due to the small difference between the BiLSTM and the BiGRNN, we choose the less complex BiGRNN for the subsequent experiments.
TABLE VII
COMPARISON OF DIFFERENT RECURRENT NEURAL NETWORK ARCHITECTURES USING MFCC FEATURES.
Model Features P+ (%) Se (%) F1 (%) ER
RNN MFCCs 90.2 93.1 91.6 0.17
LSTM MFCCs 91.8 93.3 92.5 0.15
GRNN MFCCs 92.2 93.8 93.0 0.14
BiRNN MFCCs 91.8 94.5 93.1 0.14
BiLSTM MFCCs 93.5 94.8 94.1 0.12
BiGRNN MFCCs 92.8 94.5 93.7 0.13
4) Comparison of BiGRNN input features: Table VIII
shows the results for BiGRNNs with MFCCs, spectrograms,
envelope features, and their combinations. Best results are
obtained with spectrograms, envelope features, and their com-
bination. The envelope features already show promising re-
sults in combination with the LR-HSMM-method. For this
reason, and with the assumption that spectrograms render
the segmentation more robust against artefacts, we use the
combination of spectrogram and envelope features for the
subsequent experiments.
TABLE VIII
COMPARISON OF MFCC, SPECTROGRAM, AND ENVELOPE FEATURES.
Model Features P+ (%) Se (%) F1 (%) ER
BiGRNN MFCCs 92.8 94.5 93.7 0.13
BiGRNN Spectrogram 95.0 95.7 95.4 0.09
BiGRNN Envelope 95.0 95.9 95.4 0.10
BiGRNN MFCCs + Envelope 93.7 94.6 94.2 0.12
BiGRNN Spectrogram + Envelope 94.9 95.8 95.4 0.09
5) Comparison of different regularizers: Table IX shows
the results using a BiGRNN with different regularization
approaches. For dropout, we dropped units in the hidden layers during training with a probability of p = 0.1. This value achieved the best results among p ∈ {0.1, 0.5, 0.7, 0.9}. For VAT, we used the parameter setting of λ = 0.1, ε = 0.1, and I_p = 1. For noise injection, we added zero mean Gaussian noise with standard deviation σ = 0.025 and magnitude m = 0.25. For data augmentation with audio transformations, we used the augmented training set for training (see Table IV). All regularization methods, except for data augmentation with audio transformations, improved the F-score. In particular, with dropout, we achieve the best result of F1 = 96.1%.
TABLE IX
COMPARISON OF DIFFERENT REGULARIZATION METHODS.
Model Regularizer Parameters P+ (%) Se (%) F1 (%) ER
BiGRNN - - 94.9 95.8 95.4 0.09
BiGRNN VAT λ = 0.1, ε = 0.1, I_p = 1 95.3 96.1 95.7 0.09
BiGRNN Dropout p = 0.1 95.8 96.3 96.1 0.08
BiGRNN Noise Injection σ = 0.025, m = 0.25 95.4 95.8 95.7 0.09
BiGRNN Audio Transformations - 95.0 95.7 95.4 0.09
6) Evaluation of the final setup on the test set: Table X
shows the results for the best setup (i.e. BiGRNN, 2 hidden
layers, 200 neurons per layer, rectifier activations, spectro-
gram+envelope features, dropout regularization) evaluated on
the test set. In addition to the metrics from the previous sections, we report in detail the numbers of reference states N_ref (ground truth), system states N_sys (BiGRNN-method), true positives N_TP, false negatives N_FN and false positives N_FP for each event, respectively.
TABLE X
DETAILED RESULTS PER EVENT EVALUATED WITH THE FINAL SETUP ON THE TEST SET.
Event N_ref N_sys N_TP N_FN N_FP P+ (%) Se (%) F1 (%) ER
S1 21115 21271 20659 456 612 97.1 97.8 97.5 0.05
Systole 21200 21453 20267 933 1186 94.5 95.6 95.0 0.10
S2 21073 21229 20102 971 1127 94.7 95.4 95.0 0.10
Diastole 21385 21758 20283 1102 1475 93.2 94.8 94.0 0.12
Average 94.9 95.9 95.4 0.09
TABLE VI
COMPARISON OF OUR BIGRNN WITH THE LR-HSMM-METHOD [28], EVALUATED ON 744 RECORDINGS FROM THE TEST SET.
Challenge set Disease #Recordings #Beats | BiGRNN: P+ (%) Se (%) F1 (%) ER | LR-HSMM: P+ (%) Se (%) F1 (%) ER
PN-training-a Normal 112 4270 95.6 96.7 96.1 0.08 96.4 96.0 96.2 0.08
MVP 119 4402 93.1 93.8 93.4 0.13 91.2 91.6 91.4 0.17
Benign 113 4163 96.4 97.2 96.8 0.06 96.8 96.8 96.8 0.06
AD 13 425 86.6 93.2 89.7 0.21 95.5 95.5 95.5 0.09
MPC 23 932 89.6 92.7 91.1 0.18 91.3 91.8 91.6 0.17
All 380 14192 94.4 95.6 95.0 0.10 94.5 94.6 94.6 0.11
PN-training-b Normal 132 1256 92.8 94.4 93.6 0.13 96.8 96.1 96.4 0.07
CAD 29 238 79.9 82.8 81.3 0.38 94.6 93.5 94.0 0.12
All 161 1494 90.6 92.5 91.6 0.17 96.5 95.7 96.1 0.08
PN-training-e Normal 201 4981 98.6 98.6 98.6 0.03 98.5 98.3 98.4 0.03
CAD 2 19 68.1 69.4 68.7 0.64 85.9 85.9 85.9 0.28
All 203 5000 98.5 98.5 98.5 0.03 98.4 98.2 98.3 0.03
All 744 20686 95.1 96.1 95.6 0.09 95.6 95.5 95.6 0.09
7) Comparison with the LR-HSMM-method: For this ex-
periment, we remove recordings from the training and test
set containing areas with no signal, because the LR-HSMM
is limited to the detection of four events in the order of S1 -
systole - S2 - diastole. This results in 1810 recordings for the
training set and 744 recordings for the test set.
For the LR-HSMM, we preprocess the recordings with resampling to f_s = 1 kHz and high-pass filtering with a cutoff frequency of f_c = 10 Hz (cf. Section IV-C). We process the audio signals with frames of 20 ms. We train
the LR-HSMM by using the four feature types provided:
homomorphic envelogram, Hilbert envelope, wavelet envelope,
and power spectral density (PSD) [28].
Table VI shows the results achieved with the BiGRNN (final
setup) compared with the LR-HSMM method.
Figure 9 shows nine examples of automatically segmented
heart sound recordings (snippets of four seconds each). In
each subfigure, we show the hand annotated ground truth
(GT), the segmentation with the LR-HSMM method and the
segmentation with the BiGRNN. We show five recordings
from PN-Training-a (Figure 9a to 9e), two recordings from
PN-Training-b (Figure 9f and 9g) and two recordings from
PN-Training-e (Figure 9h and 9i). For the visualization, we
normalized each heart sound recording according to its maxi-
mum amplitude.
V. DISCUSSION
In our experiments, we compare vanilla RNNs, LSTMs,
GRNNs, and their bidirectional implementations, with
BiGRNNs outperforming the rest. In subsequent experiments,
we find the final setup using spectrogram and envelope features
with a regularized BiGRNN. The network consists of 2 hidden
layers with 200 neurons each and rectifier activations (except
for the last layer). Regularization with dropout achieves the
best result. Data augmentation with audio transformations
does not result in any improvement.
In Section IV-E7, we compare our proposed method with
the state-of-the-art, the LR-HSMM-method. The BiGRNN-
method performs on par with the LR-HSMM-method with an overall F-score of F1 = 95.6% (cf. Table VI). We have to
remark that this is not a completely fair comparison, because the ground truth annotations, although generated by a model trained on less data (i.e. PN-training-a) and manually corrected, were produced with the LR-HSMM-method. This may introduce bias towards the
LR-HSMM-method. Furthermore, the hand annotated ground
truth is not always correct (cf. Figure 9a and 9h), also being
in favor of the LR-HSMM-method and in general causing
distortion in the scores.
Table VI shows detailed results for the test data in terms of
PN-training sets and diseases. We observe that the BiGRNN-
method outperforms the LR-HSMM-method for PN-training-a
and PN-training-e, but performs worse for PN-training-b.
Fig. 9. Examples of automatically segmented heart sound recordings (snippets of four seconds each) from the test set: (a) a0072.wav - MPC; (b) a0076.wav - Benign; (c) a0104.wav - MVP; (d) a0231.wav - Normal; (e) a0326.wav - MVP; (f) b0267.wav - CAD; (g) b0401.wav - Normal; (h) e01069.wav - Normal; (i) e01149.wav - Normal. In each subfigure, the first plot corresponds to the hand annotated ground truth (GT), the second to the logistic regression hidden semi-Markov model based (LR-HSMM) method, and the third to the proposed method (BiGRNN). The marked events are S1, systole, S2, and diastole.

Regarding the diseases in PN-training-a, only for MVP the
BiGRNN-method is outperforming the LR-HSMM-method,
and for benign murmurs (Benign) both methods perform on
par. For PN-training-b, the LR-HSMM-method outperforms
the BiGRNN-method for normal and CAD recordings. For
normal recordings of PN-training-e, the BiGRNN-method
outperforms the LR-HSMM-method. The LR-HSMM-method
is distinctly better than the BiGRNN-method for the two
recordings of CAD in PN-training-e.
The 2016 PhysioNet/CinC Challenge data does not provide
any labeling for cardiac arrhythmia. According to [64], mitral
valve prolapse is a source of arrhythmias. We refer to the results reported for MVP, with the BiGRNN-method (F1 = 93.4%) outperforming the LR-HSMM-method (F1 = 91.4%). Moreover, we visually inspected all test recordings of MVP and found 20 recordings with cardiac arrhythmia. On this selected set of recordings, the BiGRNN-method (F1 = 87.2%) outperforms the LR-HSMM-method (F1 = 75.7%). An exam-
ple for cardiac arrhythmia for MVP is shown in Figure 9e.
The examples in Figure 9 illustrate some observations for
both segmentation methods, and also for the ground truth
annotations. Figure 9b and 9c show examples, where both
methods perform well. In Figure 9d, we observe that the
LR-HSMM-method skips every second S2, and detects every
second S1 as S2. In contrast to this, in Figure 9i the LR-
HSMM-method detects too many events, i.e. S2 is always
detected as S1 and in between an additional state-sequence
S1 - systole - S2 - diastole is detected. Figure 9g shows some
segmentation errors for the BiGRNN method. Figure 9a,
9h and 9f are examples, where the ground truth labeling
is partially incorrect. In Figure 9h, we further observe that
both methods achieve partially incorrect segmentation results.
Figure 9e shows an example where the LR-HSMM method fails due to the irregular temporal occurrence of the events.
In our experiments, the proposed BiGRNN-method achieves
performance on par with the LR-HSMM-method. We suc-
cessfully show state-of-the-art performance without directly
incorporating a priori information of the state durations. The
proposed method is easily extendable to the detection of extra
heart sounds (third and fourth heart sound), heart murmurs,
as well as other acoustic events. However, this would re-
quire appropriate training data, i.e. heart sound recordings
containing the additional events and their proper labeling.
In a practical sense, our method features further advantages.
Without preprocessing, it can easily handle absence of the
signal, noise and irregularity of the temporal occurrence of
the events (like in cardiac arrhythmia).
VI. CONCLUSION
In this paper, we introduce an event detection approach
with deep recurrent neural networks (DRNNs) for heart sound
segmentation, i.e. the detection of the state-sequence first heart
sound (S1) - systole - second heart sound (S2) - diastole. We
carefully conduct experiments with heart sound recordings
from the 2016 Physionet/CinC Challenge and compare the
proposed method with the state-of-the-art by reporting event
based metrics and visualizing examples of segmented heart
sound recordings.
In particular, we trained a BiGRNN on heart sound record-
ings and appropriate labeling of the state-sequences. In our
final setup, we use spectrogram and envelope features and
dropout for regularization. We obtain an event-based F-score of F1 = 95.6%, evaluated on an independent test set. The
state-of-the-art, the logistic regression hidden semi-Markov
model based heart sound segmentation method, achieves the
same score. This result is, however, biased, since the LR-HSMM-method was used to annotate the dataset. Furthermore, the ground truth
for the heart sound segmentation is partially incorrect. Never-
theless, we show that the proposed method achieves state-of-
the-art performance, although we do not explicitly incorporate
a priori information about the state-durations. Furthermore,
the proposed method shows advantages regarding practical
aspects. In particular, it can handle absence of the signal,
noise and cardiac arrhythmia, i.e. irregularity of the temporal
occurrence of the events.
The proposed method represents a general solution for the
detection of different kinds of events in heart sound recordings.
The method is easily extendable to the detection of extra heart
sounds (third and fourth heart sound), heart murmurs, as well
as other acoustic events. This, however, requires appropriate
training data with thorough labeling of the events and further
experiments.
REFERENCES
[1] C. Liu et al., “An open access database for the evaluation of heart sound
algorithms,” Physiological Measurement, vol. 37, no. 12, p. 2181, 2016.
[2] H. Liang et al., “Heart sound segmentation algorithm based on heart
sound envelogram,” in Computers in Cardiology. IEEE, 1997, pp.
105–108.
[3] A. Moukadem et al., “A robust heart sounds segmentation module based
on S-transform,” Biomedical Signal Processing and Control, vol. 8,
no. 3, pp. 273–281, 2013.
[4] S. Sun et al., “Automatic moment segmentation and peak detection anal-
ysis of heart sound pattern via short-time modified Hilbert transform,”
Computer Methods and Programs in Biomedicine, vol. 114, no. 3, pp.
219–230, 2014.
[5] S. Choi and Z. Jiang, “Comparison of envelope extraction algorithms for
cardiac sound signal segmentation,” Expert Systems with Applications,
vol. 34, no. 2, pp. 1056–1069, 2008.
[6] Z. Yan et al., “The moment segmentation analysis of heart sound
pattern,” Computer Methods and Programs in Biomedicine, vol. 98,
no. 2, pp. 140–150, 2010.
[7] S. Ari et al., “A robust heart sound segmentation algorithm for com-
monly occurring heart valve diseases,Journal of Medical Engineering
& Technology, vol. 32, no. 6, pp. 456–465, 2008.
[8] H. Naseri and M. Homaeinezhad, “Detection and boundary identification
of phonocardiogram sounds using an expert frequency-energy based
metric,” Annals of Biomedical Engineering, vol. 41, no. 2, pp. 279–292,
2013.
[9] D. Kumar et al., “Detection of S1 and S2 heart sounds by high frequency
signatures,” in Proceedings of the 28th Annual International Conference
of the IEEE Engineering in Medicine and Biology Society (EMBC’06).
IEEE, 2006, pp. 1410–1416.
[10] V. N. Varghees and K. Ramachandran, “A novel heart sound activity
detection framework for automated heart sound analysis,Biomedical
Signal Processing and Control, vol. 13, pp. 174–188, 2014.
[11] J. Pedrosa et al., “Automatic heart sound segmentation and murmur
detection in pediatric phonocardiograms,” in Proceedings of the 36th
Annual International Conference of the IEEE Engineering in Medicine
and Biology Society (EMBC’14). IEEE, 2014, pp. 2294–2297.
[12] J. Vepa et al., “Segmentation of heart sounds using simplicity features
and timing information,” in Proceedings of the 33th IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP’08).
IEEE, 2008, pp. 469–472.
[13] C. D. Papadaniil and L. J. Hadjileontiadis, “Efficient heart sound
segmentation and extraction using ensemble empirical mode decompo-
sition and kurtosis features,” IEEE Journal of Biomedical and Health
Informatics, vol. 18, no. 4, pp. 1138–1152, 2014.
[14] A. Gharehbaghi et al., “An automatic tool for pediatric heart sounds
segmentation,” in Computing in Cardiology. IEEE, 2011, pp. 37–40.
[15] T. Oskiper and R. Watrous, “Detection of the first heart sound using a
time-delay neural network,” in Computers in Cardiology. IEEE, 2002,
pp. 537–540.
[16] A. A. Sepehri et al., “A novel method for pediatric heart sound
segmentation without using the ECG,” Computer Methods and Programs
in Biomedicine, vol. 99, no. 1, pp. 43–48, 2010.
[17] T. Chen et al., “Intelligent heartsound diagnostics on a cellphone using
a hands-free kit.” in AAAI Spring Symposium: Artificial Intelligence for
Development, 2010.
[18] C. N. Gupta et al., “Neural network classification of homomorphic
segmented heart sounds,” Applied Soft Computing, vol. 7, no. 1, pp.
286–297, 2007.
[19] H. Tang et al., “Segmentation of heart sounds based on dynamic
clustering,” Biomedical Signal Processing and Control, vol. 7, no. 5,
pp. 509–516, 2012.
[20] S. Rajan et al., “Unsupervised and uncued segmentation of the funda-
mental heart sounds in phonocardiograms using a time-scale representa-
tion,” in Proceedings of the 28th Annual International Conference of the
IEEE Engineering in Medicine and Biology Society (EMBC’06). IEEE,
2006, pp. 3732–3735.
[21] T.-E. Chen et al., “S1 and S2 heart sound recognition using deep neural
networks,” IEEE Transactions on Biomedical Engineering, vol. 64, no. 2,
pp. 372–380, 2017.
[22] L. Gamero and R. Watrous, “Detection of the first and second heart
sound using probabilistic models,” in Proceedings of the 25th Annual
International Conference of the IEEE Engineering in Medicine and
Biology Society (EMBC’03), vol. 3. IEEE, 2003, pp. 2877–2880.
[23] A. D. Ricke et al., “Automatic segmentation of heart sound signals using
hidden Markov models,” in Computers in Cardiology. IEEE, 2005, pp.
953–956.
[24] D. Gill et al., “Detection and identification of heart sounds using
homomorphic envelogram and self-organizing probabilistic model,” in
Computers in Cardiology. IEEE, 2005, pp. 957–960.
[25] P. Sedighian et al., “Pediatric heart sound segmentation using hid-
den Markov model,” in Proceedings of the 36th Annual International
Conference of the IEEE Engineering in Medicine and Biology Society
(EMBC’14). IEEE, 2014, pp. 5490–5493.
[26] A. Castro et al., “Heart sound segmentation of pediatric auscultations
using wavelet analysis,” in Proceedings of the 35th Annual International
Conference of the IEEE Engineering in Medicine and Biology Society
(EMBC’13). IEEE, 2013, pp. 3909–3912.
[27] S. Schmidt et al., “Segmentation of heart sound recordings by a duration-
dependent hidden Markov model,” Physiological Measurement, vol. 31,
no. 4, p. 513, 2010.
[28] D. B. Springer et al., “Logistic regression-hsmm-based heart sound
segmentation,” IEEE Transactions on Biomedical Engineering, vol. 63,
no. 4, pp. 822–832, 2016.
[29] C. Liu et al., “Performance of an open-source heart sound segmentation
algorithm on eight independent databases,” Physiological measurement,
vol. 38, no. 8, p. 1730, 2017.
[30] K. Cho et al., “On the properties of neural machine translation: Encoder-
decoder approaches,” CoRR, vol. abs/1409.1259, 2014.
[31] J. Chung et al., “Empirical evaluation of gated recurrent neural networks
on sequence modeling,” CoRR, vol. abs/1412.3555, 2014.
[32] ——, “Gated feedback recurrent neural networks,” CoRR, vol.
abs/1502.02367, 2015.
[33] I. Sutskever et al., “Sequence to sequence learning with neural net-
works,” in Advances in Neural Information Processing Systems 27, 2014,
pp. 3104–3112.
[34] A. Graves et al., “Hybrid speech recognition with deep bidirectional
LSTM,” in IEEE Workshop on Automatic Speech Recognition and
Understanding, 2013, pp. 273–278.
[35] T. c. I. Yang and H. Hsieh, “Classification of acoustic physiological
signals based on deep learning neural networks with augmented fea-
tures,” in Computing in Cardiology Conference (CinC), Sept 2016, pp.
569–572.
[36] C. Thomae and A. Dominik, “Using deep gated RNN with a con-
volutional front end for end-to-end classification of heart sound,” in
Computing in Cardiology Conference (CinC), Sept 2016, pp. 625–628.
[37] J. van der Westhuizen and J. Lasenby, “Bayesian LSTMs in medicine,
arXiv preprint arXiv:1706.01242, 2017.
[38] O. Gencoglu et al., “Recognition of acoustic events using deep neural
networks,” in Proceedings of the 22nd European Signal Processing
Conference, 2014, pp. 506–510.
[39] G. Chen et al., “Query-by-example keyword spotting using long short-
term memory networks,” in Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP’15),
2015, pp. 5236–5240.
[40] G. Parascandolo et al., “Recurrent neural networks for polyphonic sound
event detection in real life recordings,” in Proceedings of the IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP’16), 2016, pp. 6440–6444.
[41] M. Zöhrer and F. Pernkopf, "Gated recurrent networks applied to
acoustic scene classification and acoustic event detection,IEEE AASP
Challenge: Detection and Classification of Acoustic Scenes and Events,
2016.
[42] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural net-
works,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp.
2673–2681, 1997.
[43] E. Messner et al., “Crackle and breathing phase detection in lung sounds
with deep bidirectional gated recurrent neural networks,” in Proceedings
of the 40th Annual International Conference of the IEEE Engineering
in Medicine and Biology Society (EMBC’18). IEEE, 2018.
[44] D. E. Rumelhart et al., “Neurocomputing: Foundations of research,” J. A.
Anderson and E. Rosenfeld, Eds. Cambridge, MA, USA: MIT Press,
1988, ch. Learning Internal Representations by Error Propagation, pp.
673–695.
[45] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14,
no. 2, pp. 179–211, 1990. [Online]. Available: http://groups.lis.illinois.
edu/amag/langev/paper/elman90findingStructure.html
[46] M. I. Jordan, “Attractor dynamics and parallelism in a connectionist
sequential machine,” in Proceedings of the Eighth Annual Conference
of the Cognitive Science Society. Hillsdale, NJ: Erlbaum, 1986, pp.
531–546.
[47] J. J. Hopfield and D. W. Tank, “Computing with neural circuits: A
model,” Science, vol. 233, pp. 624–633, 1986.
[48] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[49] A. Graves et al., “Speech recognition with deep recurrent neural net-
works,” in Proceedings of IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP’13), 2013, pp. 6645–6649.
[50] ——, “Hybrid speech recognition with deep bidirectional LSTM,” in
IEEE Workshop on Automatic Speech Recognition and Understanding
(ASRU). IEEE, 2013, pp. 273–278.
[51] T. Miyato et al., “Distributional smoothing by virtual adversarial exam-
ples.” CoRR, vol. abs/1507.00677, 2015.
[52] M. Ratajczak et al., “Virtual adversarial training applied to neural higher-
order factors for phone classification.” in INTERSPEECH, 2016, pp.
2756–2760.
[53] A. Makhzani et al., “Adversarial autoencoders,CoRR, vol.
abs/1511.05644, 2015.
[54] I. Goodfellow et al., “Generative adversarial nets,” in Advances in Neural
Information Processing Systems 27, 2014, pp. 2672–2680.
[55] N. Srivastava et al., “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[56] I. Goodfellow et al., Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[57] B. Poole et al., “Analyzing noise in autoencoders and deep networks,” arXiv preprint arXiv:1406.1831, 2014.
[58] “Sound exchange,” http://sox.sourceforge.net, accessed: 2017-07-05.
[59] C. Thomae and A. Dominik, “Using deep gated RNN with a con-
volutional front end for end-to-end classification of heart sound,” in
Computing in Cardiology Conference (CinC). IEEE, 2016, pp. 625–
628.
[60] A. Mesaros et al., “Metrics for polyphonic sound event detection,”
Applied Sciences, vol. 6, no. 6, p. 162, 2016.
[61] A. M. Saxe et al., “Exact solutions to the nonlinear dynamics of learning
in deep linear neural networks,” International Conference of Learning
Representations (ICLR), 2014.
[62] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
CoRR, vol. abs/1412.6980, 2014.
[63] X. Glorot et al., “Deep sparse rectifier neural networks,” in Proceedings
of the Fourteenth International Conference on Artificial Intelligence and
Statistics (AISTATS), 2011, pp. 315–323.
[64] E. Van der Wall and M. Schalij, “Mitral valve prolapse: a source
of arrhythmias?” The international journal of cardiovascular imaging,
vol. 26, no. 2, pp. 147–149, 2010.