ROBUST DISCRIMINATIVE KEYWORD SPOTTING FOR EMOTIONALLY COLORED
SPONTANEOUS SPEECH USING BIDIRECTIONAL LSTM NETWORKS
Martin Wöllmer¹, Florian Eyben¹, Joseph Keshet², Alex Graves³, Björn Schuller¹, Gerhard Rigoll¹
¹Institute for Human-Machine Communication, Technische Universität München, Germany
²Idiap Research Institute, Martigny, Switzerland
³Institute for Computer Science VI, Technische Universität München, Germany
woellmer@tum.de
ABSTRACT
In this paper we propose a new technique for robust keyword spot-
ting that uses bidirectional Long Short-Term Memory (BLSTM) re-
current neural nets to incorporate contextual information in speech
decoding. Our approach overcomes the drawbacks of generative
HMM modeling by applying a discriminative learning procedure
that non-linearly maps speech features into an abstract vector space.
By incorporating the outputs of a BLSTM network into the speech
features, it is able to make use of past and future context for phoneme
predictions. The robustness of the approach is evaluated on a key-
word spotting task using the HUMAINE Sensitive Artificial Listener
(SAL) database, which contains accented, spontaneous, and emo-
tionally colored speech. The test is particularly stringent because the
system is not trained on the SAL database, but only on the TIMIT
corpus of read speech. We show that our method prevails over a
discriminative keyword spotter without BLSTM-enhanced feature
functions, which in turn has been proven to outperform HMM-based
techniques.
Index Terms: Speech recognition, Robustness, Recurrent neural networks
1. INTRODUCTION
The goal of keyword spotting is to reliably detect the presence of a
specific word in a given speech utterance. This is most commonly
done with Hidden Markov Models (HMM) [1, 2]. However, the
use of HMMs has various drawbacks, such as the need for an ad-
equate "garbage model" to handle non-keyword speech. Designing
a garbage model is a nontrivial problem since the garbage model can
potentially model any phoneme sequence, including the keyword
itself. Further disadvantages of HMM modeling are the suboptimal
convergence of the Expectation Maximization (EM) algorithm to lo-
cal maxima, the assumption of conditional independence of the ob-
servations, and the fact that HMMs do not directly maximize the
keyword detection rate.
For these reasons we follow [3] in using a supervised, discrim-
inative approach to keyword spotting that does not require the use
of HMMs. In general, discriminative learning algorithms are likely
to outperform generative models such as HMMs since the objective
function used during training more closely reflects the actual deci-
sion task. The discriminative method described in [3] uses feature
functions to non-linearly map the speech utterance, along with the
target keyword, into an abstract vector space. It was shown to pre-
vail over HMM modeling. However, in contrast to state-of-the-art
HMM recognizers which use triphones to incorporate information
from past and future speech frames, the discriminative system does
not explicitly consider contextual knowledge. In this work we build
in context information by including the outputs of a bidirectional
Long Short-Term Memory (BLSTM) recurrent neural network [4, 5]
in the feature functions. Similar neural network architectures have
been successfully applied to speech or emotion recognition related
tasks [6, 5, 7], where they exploit contextual information whenever
speech production or perception is influenced by emotion, strong ac-
cents, or background noise. In contrast to [6], our keyword spotting
approach uses BLSTM for phoneme discrimination and not for the
recognition of whole keywords. As well as reducing the complex-
ity of the network, the use of phonemes makes it applicable to any
keyword spotting task.
In the experimental section we evaluate the robustness of our
discriminative BLSTM keyword spotter on the Belfast Sensitive Ar-
tificial Listener (SAL) database [8]. We show that applying BLSTM
significantly increases the area under the Receiver Operating Char-
acteristics (ROC) curve, which is a common measure for keyword
spotting performance.
The paper is structured as follows: Section 2 describes the train-
ing algorithm of our keyword spotter, Section 3 explains the BLSTM
architecture, Section 4 introduces the various BLSTM enhanced fea-
ture functions, Section 5 presents the experimental setup as well as
the keyword spotting results, and conclusions are given in Section 6.
2. DISCRIMINATIVE KEYWORD SPOTTING
The goal of the discriminative keyword spotter applied in this work
is to determine the likelihood that a specific keyword is uttered in
a given speech sequence. Each keyword $k$ consists of a phoneme
sequence $\bar{p}^k = (p_1, \dots, p_L)$, with $L$ being the length of
the sequence and $p_l$ denoting a phoneme out of the domain $P$ of
possible phoneme symbols. The speech signal is represented by a
sequence of feature vectors $\bar{x} = (x_1, \dots, x_t, \dots, x_T)$, where $T$ is the
length of the utterance. $X$ and $K$ mark the domain of all possible
feature vectors and the lexicon of keywords, respectively. The align-
ment of the keyword phonemes is defined by the start times $s_l$ of
the phonemes as well as by the end time of the last phoneme, $e_L$:
$\bar{s}^k = (s_1, \dots, s_L, e_L)$. We assume that the start time of phoneme
$p_{l+1}$ corresponds to the end time of phoneme $p_l$, so that $e_l = s_{l+1}$.
The keyword spotter $f$ takes as input a feature vector sequence $\bar{x}$ as
well as a keyword phoneme sequence $\bar{p}^k$ and outputs a real-valued
confidence that the keyword $k$ is uttered in $\bar{x}$. In order to make the
final decision whether $k$ is contained in $\bar{x}$, the confidence score is
compared to a threshold $b$. The confidence calculation is based on
a set of non-linear feature functions $\{\phi_j\}_{j=1}^n$ (see Section 4) which
take a sequence of feature vectors $\bar{x}$, a keyword phoneme sequence
$\bar{p}^k$, and a suggested alignment $\bar{s}^k$, and compute a confidence measure
for the candidate keyword alignment.
The keyword spotting algorithm searches for the best alignment
$\bar{s}$ producing the highest possible confidence for the phoneme se-
quence of keyword $k$ in $\bar{x}$. Merging the feature functions $\phi_j$ into an
$n$-dimensional vector function $\phi$ and introducing a weight vector $w$,
the keyword spotter is given as

$$f(\bar{x}, \bar{p}^k) = \max_{\bar{s}} \; w \cdot \phi(\bar{x}, \bar{p}^k, \bar{s}). \qquad (1)$$
Consequently, $f$ outputs a weighted sum of feature function scores
maximized over all possible keyword alignments. This output then
corresponds to the confidence that the keyword $k$ is uttered in the
speech feature sequence $\bar{x}$. Since the number of possible alignments
is exponentially large, the maximization is calculated using dynamic
programming.
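To make the search concrete, the following sketch (Python; the helper name segment_score, its signature, and the duration bounds min_dur and max_dur are our illustrative assumptions, not part of the original system) maximizes the weighted feature-function score of Equation 1 over all segmentations of the utterance into the $L$ keyword phonemes:

    import numpy as np

    def best_alignment_score(segment_score, T, L, min_dur=1, max_dur=50):
        # segment_score(l, s, e): weighted score w . phi contributed by keyword
        # phoneme l occupying frames [s, e); T frames, L keyword phonemes.
        NEG = -np.inf
        # D[l, t]: best score with the first l phonemes ending exactly at frame t
        D = np.full((L + 1, T + 1), NEG)
        D[0, :] = 0.0                          # the keyword may start at any frame
        for l in range(1, L + 1):
            for t in range(1, T + 1):
                for dur in range(min_dur, min(max_dur, t) + 1):
                    s = t - dur
                    cand = D[l - 1, s] + segment_score(l - 1, s, t)
                    if cand > D[l, t]:
                        D[l, t] = cand
        return D[L, 1:].max()                  # ... and end at any frame

With a maximum phoneme duration of $D$ frames, this search takes $O(L \cdot T \cdot D)$ operations per keyword, which is what makes the exponentially large alignment space tractable.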
In order to evaluate the performance of a keyword spotter, it is
common to compute the Receiver Operating Characteristics curve
[1, 2] which shows the true positive rate as a function of the false
positive rate. The operating point on this curve can be adjusted by
changing the keyword rejection threshold $b$. If a high true positive
rate is to be obtained at a preferably low false positive rate, the area
under the ROC curve (AUC) has to be maximized. With $X_k^+$ denot-
ing a set of utterances that contain the keyword $k$ and $X_k^-$ a set of
utterances that do not, the AUC for keyword $k$ is calculated as

$$A_k = \frac{1}{|X_k^+| \, |X_k^-|} \sum_{\bar{x}^+ \in X_k^+} \sum_{\bar{x}^- \in X_k^-} I_{\{f(\bar{x}^+, \bar{p}^k) > f(\bar{x}^-, \bar{p}^k)\}} \qquad (2)$$
and can be thought of as the probability that an utterance contain-
ing keyword $k$ ($\bar{x}^+$) produces a higher confidence than a sequence
in which $k$ is not uttered ($\bar{x}^-$). Here, $I_{\{\cdot\}}$ denotes the indicator
function. When speaking of the average AUC, we refer to

$$A = \frac{1}{|K|} \sum_{k \in K} A_k. \qquad (3)$$
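In code, Equations 2 and 3 reduce to counting correctly ordered pairs of confidences. A minimal sketch (Python; the data layout is our assumption):

    import numpy as np

    def keyword_auc(pos_scores, neg_scores):
        # Eq. (2): fraction of (positive, negative) utterance pairs for which
        # the utterance containing the keyword gets the higher confidence.
        pos = np.asarray(pos_scores)[:, None]  # f(x+, p^k) for all x+ in X_k^+
        neg = np.asarray(neg_scores)[None, :]  # f(x-, p^k) for all x- in X_k^-
        return np.mean(pos > neg)

    def average_auc(per_keyword_scores):
        # Eq. (3): mean of the per-keyword AUC values over the lexicon K.
        # per_keyword_scores: dict mapping keyword k -> (pos_scores, neg_scores)
        return np.mean([keyword_auc(p, n)
                        for p, n in per_keyword_scores.values()])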
In [3] an algorithm for the computation of the weight vector $w$ in
Equation 1 is presented. The algorithm aims at training the weights
$w$ in such a way that they maximize the average AUC on unseen data.
One training example $\{\bar{p}^{k_i}, \bar{x}_i^+, \bar{x}_i^-, \bar{s}^{k_i}\}$ consists of an utterance in
which keyword $k_i$ is uttered, one sequence in which the keyword is
not uttered, the phoneme sequence of the keyword, and the correct
alignment of $k_i$. With

$$\bar{s}' = \arg\max_{\bar{s}} \; w_{i-1} \cdot \phi(\bar{x}_i^-, \bar{p}^{k_i}, \bar{s}) \qquad (4)$$

representing the most probable alignment of $k_i$ in $\bar{x}_i^-$ according to
the weights $w_{i-1}$ of the previous training iteration $i-1$, a term
$$\Delta\phi_i = \frac{1}{|X_{k_i}^+| \, |X_{k_i}^-|} \left( \phi(\bar{x}_i^+, \bar{p}^{k_i}, \bar{s}^{k_i}) - \phi(\bar{x}_i^-, \bar{p}^{k_i}, \bar{s}') \right) \qquad (5)$$
is computed, which is the difference of feature functions for $\bar{x}_i^+$ and
$\bar{x}_i^-$. For the update rule of $w$, the Passive-Aggressive algorithm for
binary classification (PA-I) outlined in [9] is applied. Consequently,
$w$ is updated according to

$$w_i = w_{i-1} + \alpha_i \Delta\phi_i \qquad (6)$$

where $\alpha_i$ can be calculated as

$$\alpha_i = \min\left\{ C, \; \frac{[1 - w_{i-1} \cdot \Delta\phi_i]_+}{\|\Delta\phi_i\|^2} \right\}. \qquad (7)$$

The parameter $C$ controls the "aggressiveness" of the update rule
and $[1 - w_{i-1} \cdot \Delta\phi_i]_+$ can be interpreted as the "loss" suffered on
iteration $i$. After every training step the AUC on a validation set is
computed, and the vector $w$ which achieves the best AUC on the
validation set is the final output of the algorithm.
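The PA-I step of Equations 6 and 7 is only a few lines once $\Delta\phi_i$ from Equation 5 is available; a sketch under that assumption:

    import numpy as np

    def pa1_update(w_prev, delta_phi, C=1.0):
        # Eq. (7): alpha_i = min(C, [1 - w_{i-1} . dphi_i]_+ / ||dphi_i||^2)
        loss = max(0.0, 1.0 - np.dot(w_prev, delta_phi))  # hinge loss on iteration i
        norm_sq = np.dot(delta_phi, delta_phi)
        alpha = min(C, loss / norm_sq) if norm_sq > 0.0 else 0.0
        return w_prev + alpha * delta_phi                 # Eq. (6)

A zero update results whenever the current weights already separate the positive and negative example with margin 1, which is the "passive" half of the algorithm.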
3. BIDIRECTIONAL LSTM
The basic idea of bidirectional recurrent neural networks [10] is to
use two recurrent network layers, one that processes the training se-
quence forwards and one that processes it backwards. Both networks
are connected to the same output layer, which therefore has access to
complete information about the data points before and after the cur-
rent point in the sequence. The amount of context information that
the network actually uses is learned during training, and does not
have to be specified beforehand. This makes bidirectional networks
a very flexible tool for sequence labeling, and they have been suc-
cessfully applied to areas as diverse as protein secondary structure
prediction [11] and speech recognition [10].
Analysis of the error flow in conventional recurrent neural nets
(RNNs) resulted in the finding that long time lags are inaccessible
to existing RNNs since the backpropagated error either blows up or
decays over time (the vanishing gradient problem). This led to the intro-
duction of Long Short-Term Memory (LSTM) RNNs [4]. An LSTM
layer is composed of recurrently connected memory blocks, each
of which contains one or more recurrently connected memory cells,
along with three multiplicative “gate” units: the input, output, and
forget gates. The gates perform functions analogous to read, write,
and reset operations. More specifically, the cell input is multiplied
by the activation of the input gate, the cell output by that of the out-
put gate, and the previous cell values by the forget gate. Their effect
is to allow the network to store and retrieve information over long
periods of time. If, for example, the input gate remains closed, the
activation of the cell will not be overwritten by new inputs and can
therefore be made available to the net much later in the sequence
by opening the output gate. This principle overcomes the vanishing
gradient problem and gives access to long-range context information.
Combining bidirectional networks with LSTM gives Bidirec-
tional LSTM (BLSTM), which has demonstrated excellent perfor-
mance in phoneme recognition [5] and keyword spotting [6].
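For concreteness, a frame-wise BLSTM phoneme classifier of this kind can be sketched in PyTorch as follows; the layer sizes match the setup reported in Section 5, but the PyTorch realization is our illustration, not the authors' implementation:

    import torch
    import torch.nn as nn

    class BLSTMPhonemeClassifier(nn.Module):
        def __init__(self, n_features=39, n_hidden=100, n_phonemes=39):
            super().__init__()
            # one forward and one backward LSTM layer, 100 memory blocks each
            self.blstm = nn.LSTM(n_features, n_hidden,
                                 batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * n_hidden, n_phonemes)

        def forward(self, x):                 # x: (batch, frames, n_features)
            h, _ = self.blstm(x)              # h: (batch, frames, 2 * n_hidden)
            return torch.softmax(self.out(h), dim=-1)  # activations o_p(x_t)

Because the forward and backward hidden states are concatenated at every frame, each output activation $o_p(x_t)$ is conditioned on the complete past and future of the utterance.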
4. FEATURE FUNCTIONS
As mentioned in Section 2, our keyword spotter is based on a set of
non-linear feature functions $\{\phi_j\}_{j=1}^n$ that map a speech utterance,
together with a candidate alignment, into an abstract vector space.
We use $n = 7$ feature functions which proved successful for the
keyword spotter described in Section 2 [12]. We experiment with
including the output activations of the BLSTM network described
in Section 3 in the first feature function. In one variant this is
extended to a two-dimensional function, giving an overall feature
dimension of $n = 8$. In what follows we describe five versions of
the first feature function, denoted $\phi_{1A}$ to $\phi_{1E}$.
Feature function $\phi_{1A}$ is the same as used in [3] and is based on
the hierarchical phoneme classifier described in [13]. The classifier
outputs a confidence $g_p(x)$ that phoneme $p$ is pronounced in $x$, which
is then summed over the whole phoneme sequence to give

$$\phi_{1A}(\bar{x}, \bar{p}, \bar{s}) = \sum_{i=1}^{|\bar{p}|} \sum_{t=s_i}^{s_{i+1}-1} g_{p_i}(x_t). \qquad (8)$$
Unlike $\phi_{1A}$, the feature function $\phi_{1B}$ incorporates contextual infor-
mation for the computation of the phoneme probabilities by replac-
ing the confidences $g_p(x)$ with the BLSTM output activations $o_p(x)$,
thus

$$\phi_{1B}(\bar{x}, \bar{p}, \bar{s}) = \sum_{i=1}^{|\bar{p}|} \sum_{t=s_i}^{s_{i+1}-1} o_{p_i}(x_t). \qquad (9)$$
Since the BLSTM outputs tend to produce high-confidence phoneme
probability distribution spikes for the recognized phoneme of a
frame while all other activations are close to zero, it is beneficial
to also include the probability distribution $g(x)$ (which, due to the
hierarchical structure of the classifier, consists of multiple rather
low-confidence spikes) in the first feature function, as in $\phi_{1C}$ to
$\phi_{1E}$. Therefore $\phi_{1C}$ expands the first feature function to a two-
dimensional function which can be written as

$$\phi_{1C}(\bar{x}, \bar{p}, \bar{s}) = \begin{pmatrix} \sum_{i=1}^{|\bar{p}|} \sum_{t=s_i}^{s_{i+1}-1} g_{p_i}(x_t) \\[4pt] \sum_{i=1}^{|\bar{p}|} \sum_{t=s_i}^{s_{i+1}-1} o_{p_i}(x_t) \end{pmatrix}. \qquad (10)$$
Alternatively, $\phi_{1D}$ consists of a linear combination of the distribu-
tions $g(x)$ and $o(x)$, so that

$$\phi_{1D}(\bar{x}, \bar{p}, \bar{s}) = \sum_{i=1}^{|\bar{p}|} \sum_{t=s_i}^{s_{i+1}-1} \left( G \cdot g_{p_i}(x_t) + O \cdot o_{p_i}(x_t) \right), \qquad (11)$$

where $G$ and $O$ are constant weighting factors.
The function $\phi_{1E}$ takes the maximum of the distributions $g(x)$
and $o(x)$. This maintains the high-confidence BLSTM output acti-
vations as well as the multiple rather low-confidence hypotheses of
$g(x)$ for $p$-$t$ coordinates where $o_{p_i}(x_t)$ is close to zero:

$$\phi_{1E}(\bar{x}, \bar{p}, \bar{s}) = \sum_{i=1}^{|\bar{p}|} \sum_{t=s_i}^{s_{i+1}-1} \max\left( g_{p_i}(x_t), \, o_{p_i}(x_t) \right). \qquad (12)$$
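All five variants are simple sums over the candidate segments. The sketch below assumes the scores are given as frame-by-phoneme matrices (our choice of data layout); the default weights follow the values reported for $\phi_{1D}$ in Section 5:

    import numpy as np

    def phi1_variants(g, o, phonemes, starts, ends, G=1.0, O=1.5):
        # g, o: (T, |P|) arrays of classifier confidences g_p(x_t) and BLSTM
        # output activations o_p(x_t); keyword phoneme i spans the frames
        # [starts[i], ends[i]) of the candidate alignment.
        seg = [(g[s:e, p], o[s:e, p])
               for p, s, e in zip(phonemes, starts, ends)]
        phi_a = sum(gs.sum() for gs, _ in seg)                      # Eq. (8)
        phi_b = sum(os_.sum() for _, os_ in seg)                    # Eq. (9)
        phi_c = np.array([phi_a, phi_b])                            # Eq. (10)
        phi_d = G * phi_a + O * phi_b                               # Eq. (11)
        phi_e = sum(np.maximum(gs, os_).sum() for gs, os_ in seg)   # Eq. (12)
        return phi_a, phi_b, phi_c, phi_d, phi_e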
The remaining feature functions $\phi_2$ to $\phi_7$ used in this work are the
same as in [3]. $\phi_2$ to $\phi_5$ measure the Euclidean distance between
feature vectors at both sides of the suggested phoneme boundaries,
assuming that the correct alignment will produce a large sum of dis-
tances, since the distances at the phoneme boundaries are likely to
be high compared to those within a phoneme. Function $\phi_6$ scores
the timing sequences based on typical phoneme durations, whereas
$\phi_7$ considers the speaking rate implied by the candidate phoneme
alignment, presuming that the speaking rate changes only slowly
over time (see [3] for formulas).
5. EXPERIMENTS AND RESULTS
For the training of our keyword spotter and for the comparison of
the different feature functions $\phi_{1A}$ to $\phi_{1E}$ we used the TIMIT cor-
pus. The TIMIT training set was divided into five parts, of which
1,500 utterances were used to train the frame-based phoneme recog-
nizer of the first feature function. 150 utterances served as training
set for the forced alignment algorithm which we applied to initialize
the weight vector $w$ (for details see [12]). 100 sequences formed the
validation set of the forced aligner, and from the remaining 1,946 ut-
terances two times 1,200 samples were selected for training and two
times 200 utterances for validation of the keyword spotter. From
the TIMIT test set 80 keywords were chosen randomly. For each
keyword we selected at most 20 utterances which contain the key-
word and 20 which do not contain the keyword. The feature vectors
consisted of cepstral mean normalized MFCC features 0 to 12 with
first and second order delta coefficients. As aggressiveness param-
eter $C$ for the update algorithm (see Equation 7) we used $C = 1$.
For the training of the BLSTM used for feature functions $\phi_{1B}$ to $\phi_{1E}$
we chose the same 1,500 utterances as for the phoneme recognizer
of $\phi_{1A}$; however, we split them into 1,400 sequences for training and
100 for validation. The BLSTM input layer had a size of 39 (one
for each MFCC feature) and the size of the output layer was also 39,
since we used the reduced set of 39 TIMIT phonemes. Both hid-
den LSTM layers contained 100 memory blocks of one cell each.
To improve generalization, zero-mean Gaussian noise with standard
deviation 0.6 was added to the inputs during training. We used a
learning rate of $10^{-5}$ and a momentum of 0.9.
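A possible realization of this 39-dimensional front end is sketched below; librosa is our choice of toolkit (the paper does not name one), TIMIT's 16 kHz sampling rate is assumed, and windowing parameters are left at library defaults:

    import librosa
    import numpy as np

    def mfcc_features(wav_path):
        y, sr = librosa.load(wav_path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # coefficients 0-12
        mfcc -= mfcc.mean(axis=1, keepdims=True)            # cepstral mean normalization
        d1 = librosa.feature.delta(mfcc)                    # first-order deltas
        d2 = librosa.feature.delta(mfcc, order=2)           # second-order deltas
        return np.vstack([mfcc, d1, d2]).T                  # (frames, 39)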
In [3] the keyword spotter applying feature function $\phi_{1A}$ was
shown to outperform a state-of-the-art left-right HMM with 5 emit-
ting states and 40 diagonal Gaussians, consisting of two sub-HMM
models, the keyword model and the garbage model (see Figure 1).
[Figure 1 (plot omitted): ROC curves, true positive rate over false positive rate.]
Fig. 1. ROC curve for the discriminative keyword spotter using $\phi_{1A}$ (DISC) and the HMM approach (results taken from [3])
For feature function $\phi_{1D}$ we used the parameters $G$ and $O$ that
resulted in the best phoneme recognition rate ($G = 1$ and $O = 1.5$).
Table 1 shows the average AUC for the different versions of the fea-
ture function $\phi_1$: best performance is achieved when using $\phi_{1D}$ or
$\phi_{1A}$, and there is no statistically significant difference between the re-
sults obtained with these two feature functions. Figure 2 illustrates
the ROC curve obtained with $\phi_{1D}$ (DISC-BLSTM) and $\phi_{1A}$ (DISC)
for the TIMIT experiment.

    version of $\phi_1$    AUC
    $\phi_{1D}$             0.981
    $\phi_{1A}$             0.980
    $\phi_{1E}$             0.970
    $\phi_{1C}$             0.965
    $\phi_{1B}$             0.942

Table 1. AUC for different versions of $\phi_1$ (TIMIT experiment)

Next, we compared the performance of the keyword spotter using
$\phi_{1A}$ with the best BLSTM keyword spotter using $\phi_{1D}$ on the Belfast
Sensitive Artificial Listener database. In
contrast to the TIMIT database which contains read utterances, the
SAL corpus contains spontaneous and emotionally colored speech.
Note that the SAL utterances have a length of up to 15 seconds which
[Footnote: 1,200 positive and 200 negative utterances.]
[Figure 2 (plot omitted): ROC curves, true positive rate over false positive rate.]
Fig. 2. ROC curve for the discriminative keyword spotter using $\phi_{1A}$ (DISC) and $\phi_{1D}$ (DISC-BLSTM)
is longer than the TIMIT sequences, increasing the probability of
false positives. For a more detailed description of the SAL database
see [8] or [7]. We randomly selected 24 keywords, and for each
keyword we chose 20 utterances in which the keyword is not uttered
and up to 20 utterances (depending on how often the keyword oc-
curs in the whole corpus) which include the keyword. On average, a
keyword consisted of 5.4 phonemes. Both the BLSTM network and
the keyword spotter were trained on the TIMIT database without
any further adaptation to the SAL corpus. For this task our BLSTM
approach (using $\phi_{1D}$) was able to outperform the keyword spotter
which does not use long-range dependencies via BLSTM output ac-
tivations. The average AUC was 0.80 for the BLSTM experiment
and 0.68 for the experiment using the original feature function $\phi_{1A}$.
The ROC for both experiments can be seen in Figure 3.
[Figure 3 (plot omitted): ROC curves, true positive rate over false positive rate.]
Fig. 3. ROC curve for the SAL experiment applying the BLSTM feature function $\phi_{1D}$ (DISC-BLSTM) and the original function $\phi_{1A}$ (DISC)
6. CONCLUSION
This work presented several methods for enhancing the robustness
of a discriminative keyword spotter with a BLSTM recurrent neural
network. The best method used a modified feature function that in-
cluded both the phoneme probability scores obtained from a BLSTM
network and those given by a hierarchical phoneme classifier. For
the TIMIT experiment, both the BLSTM keyword spotter and the
non-enhanced version gave almost perfect detection rates. However,
the BLSTM system gave an 18% relative improvement in average
AUC on the SAL database (0.68 to 0.80). This indicates the greater
robustness of BLSTM to spontaneous, emotionally colored speech.
For future experiments we will focus on retraining the BLSTM
keyword spotter on forced alignments from the SAL database, as
a next step towards further improving keyword detection rates for
spontaneous emotional speech.
7. ACKNOWLEDGMENT
The research leading to these results has received funding from the
European Community’s Seventh Framework Programme (FP7/2007-
2013) under grant agreement No. 211486 (SEMAINE).
8. REFERENCES
[1] H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard, "Posterior based keyword spotting with a priori thresholds," in Proceedings of Interspeech, Pittsburgh, Pennsylvania, 2006.
[2] Y. B. Ayed, D. Fohr, J. P. Haton, and G. Chollet, "Confidence measure for keyword spotting using support vector machines," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, 2004.
[3] J. Keshet, D. Grangier, and S. Bengio, "Discriminative keyword spotting," in Workshop on Non-Linear Speech Processing NOLISP, Paris, France, 2007.
[4] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[5] A. Graves, S. Fernandez, and J. Schmidhuber, "Bidirectional LSTM networks for improved phoneme classification and recognition," in Proceedings of ICANN, Warsaw, Poland, 2005, vol. 18, pp. 602–610.
[6] S. Fernandez, A. Graves, and J. Schmidhuber, "An application of recurrent neural networks to discriminative keyword spotting," in Proceedings of ICANN, Porto, Portugal, 2007, pp. 220–229.
[7] M. Wöllmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas-Cowie, and R. Cowie, "Abandoning emotion classes - towards continuous emotion recognition with modelling of long-range dependencies," in Proceedings of Interspeech, Brisbane, Australia, 2008.
[8] E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. McRorie, J. C. Martin, L. Devillers, S. Abrilian, A. Batliner, N. Amir, and K. Karpouzis, The HUMAINE Database, vol. 4738, pp. 488–500, 2007.
[9] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, "Online passive aggressive algorithms," Journal of Machine Learning Research, vol. 7, 2006.
[10] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, pp. 2673–2681, November 1997.
[11] P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri, "Exploiting the past and the future in protein secondary structure prediction," Bioinformatics, vol. 15, 1999.
[12] J. Keshet, Large Margin Algorithms for Discriminative Continuous Speech Recognition, Ph.D. thesis, Hebrew University, 2007.
[13] O. Dekel, J. Keshet, and Y. Singer, "Online algorithm for hierarchical phoneme classification," in Workshop on Multimodal Interaction and Related Machine Learning Algorithms, Martigny, Switzerland, 2004, pp. 146–159.