Content uploaded by Joseph Keshet
Author content
All content in this area was uploaded by Joseph Keshet on Mar 26, 2017
Content may be subject to copyright.
ROBUST DISCRIMINATIVE KEYWORD SPOTTING FOR EMOTIONALLY COLORED
SPONTANEOUS SPEECH USING BIDIRECTIONAL LSTM NETWORKS
Martin Wöllmer1, Florian Eyben1, Joseph Keshet2, Alex Graves3, Björn Schuller1, Gerhard Rigoll1
1Institute for Human-Machine Communication, Technische Universität München, Germany
2Idiap Research Institute, Martigny, Switzerland
3Institute for Computer Science VI, Technische Universität München, Germany
woellmer@tum.de
ABSTRACT
In this paper we propose a new technique for robust keyword spot-
ting that uses bidirectional Long Short-Term Memory (BLSTM) re-
current neural nets to incorporate contextual information in speech
decoding. Our approach overcomes the drawbacks of generative
HMM modeling by applying a discriminative learning procedure
that non-linearly maps speech features into an abstract vector space.
By incorporating the outputs of a BLSTM network into the speech
features, it is able to make use of past and future context for phoneme
predictions. The robustness of the approach is evaluated on a key-
word spotting task using the HUMAINE Sensitive Artificial Listener
(SAL) database, which contains accented, spontaneous, and emo-
tionally colored speech. The test is particularly stringent because the
system is not trained on the SAL database, but only on the TIMIT
corpus of read speech. We show that our method prevails over a
discriminative keyword spotter without BLSTM-enhanced feature
functions, which in turn has been proven to outperform HMM-based
techniques.
Index Terms—Speech recognition, Robustness, Recurrent neu-
ral networks
1. INTRODUCTION
The goal of keyword spotting is to reliably detect the presence of a
specific word in a given speech utterance. This is most commonly
done with Hidden Markov Models (HMM) [1, 2]. However, the
use of HMMs has various drawbacks, such as the need for an adequate
"garbage model" to handle non-keyword speech. Designing
a garbage model is a nontrivial problem since the garbage model can
potentially model any phoneme sequence — including the keyword
itself. Further disadvantages of HMM modeling are the suboptimal
convergence of the Expectation Maximization (EM) algorithm to lo-
cal maxima, the assumption of conditional independence of the ob-
servations, and the fact that HMMs do not directly maximize the
keyword detection rate.
For these reasons we follow [3] in using a supervised, discrim-
inative approach to keyword spotting, that does not require the use
of HMMs. In general, discriminative learning algorithms are likely
to outperform generative models such as HMMs since the objective
function used during training more closely reflects the actual deci-
sion task. The discriminative method described in [3] uses feature
functions to non-linearly map the speech utterance, along with the
target keyword, into an abstract vector space. It was shown to pre-
vail over HMM modeling. However, in contrast to state-of-the-art
HMM recognizers which use triphones to incorporate information
from past and future speech frames, the discriminative system does
not explicitly consider contextual knowledge. In this work we build
in context information by including the outputs of a bidirectional
Long Short-Term Memory (BLSTM) recurrent neural network [4, 5]
in the feature functions. Similar neural network architectures have
been successfully applied to speech or emotion recognition related
tasks [6, 5, 7], where they exploit contextual information whenever
speech production or perception is influenced by emotion, strong ac-
cents, or background noise. In contrast to [6], our keyword spotting
approach uses BLSTM for phoneme discrimination and not for the
recognition of whole keywords. As well as reducing the complex-
ity of the network, the use of phonemes makes it applicable to any
keyword spotting task.
In the experimental section we evaluate the robustness of our
discriminative BLSTM keyword spotter on the Belfast Sensitive Ar-
tificial Listener (SAL) database [8]. We show that applying BLSTM
significantly increases the area under the Receiver Operating Char-
acteristics (ROC) curve, which is a common measure for keyword
spotting performance.
The paper is structured as follows: Section 2 describes the train-
ing algorithm of our keyword spotter, Section 3 explains the BLSTM
architecture, Section 4 introduces the various BLSTM enhanced fea-
ture functions, Section 5 presents the experimental setup as well as
the keyword spotting results, and conclusions are given in Section 6.
2. DISCRIMINATIVE KEYWORD SPOTTING
The goal of the discriminative keyword spotter applied in this work
is to determine the likelihood that a specific keyword is uttered in
a given speech sequence. Each keyword $k$ consists of a phoneme
sequence $\bar{p}^k = (p_1, \ldots, p_L)$, with $L$ being the length of
the sequence and $p_l$ denoting a phoneme out of the domain
$\mathcal{P}$ of possible phoneme symbols. The speech signal is
represented by a sequence of feature vectors
$\bar{x} = (x_1, \ldots, x_t, \ldots, x_T)$, where $T$ is the length of
the utterance. $\mathcal{X}$ and $\mathcal{K}$ denote the domain of all
possible feature vectors and the lexicon of keywords, respectively.
The alignment of the keyword phonemes is defined by the start times
$s_l$ of the phonemes as well as by the end time $e_L$ of the last
phoneme: $\bar{s}^k = (s_1, \ldots, s_L, e_L)$. We assume that the start
time of phoneme $p_{l+1}$ corresponds to the end time of phoneme $p_l$,
so that $e_l = s_{l+1}$. The keyword spotter $f$ takes as input a
feature vector sequence $\bar{x}$ as well as a keyword phoneme sequence
$\bar{p}^k$ and outputs a real-valued confidence that the keyword $k$
is uttered in $\bar{x}$. To make the final decision whether $k$ is
contained in $\bar{x}$, the confidence score is compared to a
threshold $b$. The confidence calculation is based on a set of
non-linear feature functions $\{\phi_j\}_{j=1}^{n}$ (see Section 4),
which take a sequence of feature vectors $\bar{x}$, a keyword phoneme
sequence $\bar{p}^k$, and a suggested alignment $\bar{s}^k$ to compute a
confidence measure for the candidate keyword alignment.
The keyword spotting algorithm searches for the best alignment
$\bar{s}$ producing the highest possible confidence for the phoneme
sequence of keyword $k$ in $\bar{x}$. Merging the feature functions
$\phi_j$ into an $n$-dimensional vector function $\phi$ and introducing
a weight vector $w$, the keyword spotter is given as

$$f(\bar{x}, \bar{p}^k) = \max_{\bar{s}} \; w \cdot \phi(\bar{x}, \bar{p}^k, \bar{s}). \tag{1}$$

Consequently, $f$ outputs a weighted sum of feature function scores
maximized over all possible keyword alignments. This output then
corresponds to the confidence that the keyword $k$ is uttered in the
speech feature sequence $\bar{x}$. Since the number of possible
alignments is exponentially large, the maximization is calculated
using dynamic programming.
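The dynamic programming search can be illustrated with a minimal sketch. It assumes, purely for illustration, that the weighted score decomposes into per-frame terms (as the first feature function does); the full spotter also uses boundary, duration, and speaking-rate features, which this sketch ignores. The function name `best_alignment` and the matrix `frame_scores` are hypothetical and do not come from [3].

```python
import numpy as np

def best_alignment(frame_scores):
    """Find the contiguous alignment with the highest summed score.

    frame_scores[t, i] is an assumed per-frame score for the i-th keyword
    phoneme at frame t (shape T x L). Returns (score, alignment), where
    alignment = [s_1, ..., s_L, e_L] uses 0-based frames and an exclusive end.
    """
    T, L = frame_scores.shape
    dp = np.full((T, L), -np.inf)           # best score ending at frame t in phoneme i
    entered = np.zeros((T, L), dtype=bool)  # True if phoneme i starts at frame t
    for t in range(T):
        for i in range(L):
            stay = dp[t - 1, i] if t > 0 else -np.inf
            if i == 0:
                enter = 0.0                 # the keyword may start at any frame
            else:
                enter = dp[t - 1, i - 1] if t > 0 else -np.inf
            if enter >= stay:
                prev, entered[t, i] = enter, True
            else:
                prev = stay
            dp[t, i] = frame_scores[t, i] + prev

    end = int(np.argmax(dp[:, L - 1]))
    score = float(dp[end, L - 1])
    if not np.isfinite(score):
        raise ValueError("utterance is shorter than the keyword")

    # Trace back the phoneme start times from the best end frame.
    starts = [0] * L
    t, i = end, L - 1
    while True:
        if entered[t, i]:
            starts[i] = t
            if i == 0:
                break
            i -= 1
        t -= 1
    return score, starts + [end + 1]
```

The table has $T \times L$ entries, so the search is linear in the utterance length instead of enumerating every alignment.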
To evaluate the performance of a keyword spotter, it is common to
compute the Receiver Operating Characteristics (ROC) curve [1, 2],
which shows the true positive rate as a function of the false
positive rate. The operating point on this curve can be adjusted by
changing the keyword rejection threshold $b$. To obtain a high true
positive rate at a low false positive rate, the area under the ROC
curve (AUC) has to be maximized. With $\mathcal{X}_k^+$ denoting a set
of utterances that contain the keyword $k$ and $\mathcal{X}_k^-$ a set
of utterances that do not contain it, the AUC for keyword $k$ is
calculated as

$$A_k = \frac{1}{|\mathcal{X}_k^+|\,|\mathcal{X}_k^-|} \sum_{\substack{\bar{x}^+ \in \mathcal{X}_k^+ \\ \bar{x}^- \in \mathcal{X}_k^-}} \mathbb{I}\{ f(\bar{x}^+, \bar{p}^k) > f(\bar{x}^-, \bar{p}^k) \} \tag{2}$$

and can be thought of as the probability that an utterance containing
keyword $k$ ($\bar{x}^+$) produces a higher confidence than a sequence
in which $k$ is not uttered ($\bar{x}^-$). Here $\mathbb{I}\{\cdot\}$
denotes the indicator function. When speaking of the average AUC, we
refer to

$$A = \frac{1}{K} \sum_{k \in \mathcal{K}} A_k. \tag{3}$$
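Equations 2 and 3 translate directly into a pairwise comparison of confidence scores. The following sketch assumes the confidences $f(\bar{x}, \bar{p}^k)$ have already been computed; the function names and the dictionary layout are illustrative choices, not part of the original system.

```python
def keyword_auc(pos_scores, neg_scores):
    """Equation 2: fraction of (positive, negative) pairs ranked correctly."""
    wins = sum(1 for p in pos_scores for n in neg_scores if p > n)
    return wins / (len(pos_scores) * len(neg_scores))

def average_auc(scores_by_keyword):
    """Equation 3: mean of the per-keyword AUC values.

    scores_by_keyword maps a keyword to a pair
    (confidences on positive utterances, confidences on negative utterances).
    """
    aucs = [keyword_auc(pos, neg) for pos, neg in scores_by_keyword.values()]
    return sum(aucs) / len(aucs)
```

For example, positive confidences `[0.9, 0.4]` against negative confidences `[0.5, 0.1]` rank three of the four pairs correctly, giving an AUC of 0.75.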
In [3] an algorithm for the computation of the weight vector $w$ in
Equation 1 is presented. The algorithm aims at training the weights
$w$ such that they maximize the average AUC on unseen data. One
training example $\{\bar{p}^{k_i}, \bar{x}_i^+, \bar{x}_i^-, \bar{s}_i^{k_i}\}$
consists of an utterance in which keyword $k_i$ is uttered, one
sequence in which the keyword is not uttered, the phoneme sequence of
the keyword, and the correct alignment of $k_i$. With

$$\bar{s}' = \arg\max_{\bar{s}} \; w_{i-1} \cdot \phi(\bar{x}_i^-, \bar{p}^{k_i}, \bar{s}) \tag{4}$$

representing the most probable alignment of $k_i$ in $\bar{x}_i^-$
according to the weights $w_{i-1}$ of the previous training iteration
$i-1$, a term

$$\Delta\phi_i = \frac{1}{|\mathcal{X}_{k_i}^+|\,|\mathcal{X}_{k_i}^-|} \left( \phi(\bar{x}_i^+, \bar{p}^{k_i}, \bar{s}_i^{k_i}) - \phi(\bar{x}_i^-, \bar{p}^{k_i}, \bar{s}') \right) \tag{5}$$

is computed, which is the scaled difference of the feature functions
for $\bar{x}_i^+$ and $\bar{x}_i^-$. For the update rule of $w$ the
Passive-Aggressive algorithm for binary classification (PA-I) outlined
in [9] is applied. Consequently, $w$ is updated according to

$$w_i = w_{i-1} + \alpha_i \Delta\phi_i \tag{6}$$

where $\alpha_i$ is calculated as

$$\alpha_i = \min\left\{ C, \; \frac{[1 - w_{i-1} \cdot \Delta\phi_i]_+}{\|\Delta\phi_i\|^2} \right\}. \tag{7}$$

The parameter $C$ controls the "aggressiveness" of the update rule
and $[1 - w_{i-1} \cdot \Delta\phi_i]_+$ can be interpreted as the
"loss" suffered on iteration $i$. After every training step the AUC
on a validation set is computed, and the vector $w$ that achieves the
best AUC on the validation set is the final output of the algorithm.
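A single update step following Equations 6 and 7 can be sketched as below. The vector `delta_phi` is assumed to have been computed beforehand according to Equation 5; the function name `pa1_update` and the zero-norm guard are illustrative additions.

```python
import numpy as np

def pa1_update(w, delta_phi, C=1.0):
    """One PA-I step: returns the updated weights and the step size alpha."""
    loss = max(0.0, 1.0 - float(np.dot(w, delta_phi)))  # hinge loss [1 - w.dphi]_+
    norm_sq = float(np.dot(delta_phi, delta_phi))
    if loss == 0.0 or norm_sq == 0.0:
        return w.copy(), 0.0            # passive: the margin is already satisfied
    alpha = min(C, loss / norm_sq)      # aggressive step, capped at C
    return w + alpha * delta_phi, alpha
```

When the margin $w_{i-1} \cdot \Delta\phi_i \geq 1$ already holds, the loss is zero and the weights stay unchanged; otherwise the step is the smaller of $C$ and the amount needed to reach a margin of exactly 1.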
3. BIDIRECTIONAL LSTM
The basic idea of bidirectional recurrent neural networks [10] is to
use two recurrent network layers, one that processes the training se-
quence forwards and one that processes it backwards. Both networks
are connected to the same output layer, which therefore has access to
complete information about the data points before and after the cur-
rent point in the sequence. The amount of context information that
the network actually uses is learned during training, and does not
have to be specified beforehand. This makes bidirectional networks
a very flexible tool for sequence labeling, and they have been suc-
cessfully applied to areas as diverse as protein secondary structure
prediction [11] and speech recognition [10].
Analysis of the error flow in conventional recurrent neural nets
(RNNs) resulted in the finding that long time lags are inaccessible
to existing RNNs since the backpropagated error either blows up or
decays over time (vanishing gradient problem). This led to the introduction
of Long Short-Term Memory (LSTM) RNNs [4]. An LSTM
layer is composed of recurrently connected memory blocks, each
of which contains one or more recurrently connected memory cells,
along with three multiplicative “gate” units: the input, output, and
forget gates. The gates perform functions analogous to write, read,
and reset operations, respectively. More specifically, the cell input is multiplied
by the activation of the input gate, the cell output by that of the out-
put gate, and the previous cell values by the forget gate. Their effect
is to allow the network to store and retrieve information over long
periods of time. If, for example, the input gate remains closed, the
activation of the cell will not be overwritten by new inputs and can
therefore be made available to the net much later in the sequence
by opening the output gate. This principle overcomes the vanishing
gradient problem and gives access to long range context information.
Combining bidirectional networks with LSTM gives Bidirec-
tional LSTM (BLSTM), which has demonstrated excellent perfor-
mance in phoneme recognition [5] and keyword spotting [6].
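The gating mechanism and the bidirectional processing can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the network used in the experiments: it has one cell per block, no peephole connections, and packs all gate weights into a single hypothetical matrix `W` with bias `b`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, D + H); b has shape (4H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[:H])             # input gate: scales the new cell input (write)
    f = sigmoid(z[H:2 * H])        # forget gate: scales the previous cell value (reset)
    o = sigmoid(z[2 * H:3 * H])    # output gate: scales the cell output (read)
    g = np.tanh(z[3 * H:])         # candidate cell input
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

def run_direction(xs, W, b, H, reverse=False):
    """Run one recurrent layer over the sequence, forwards or backwards."""
    h, c = np.zeros(H), np.zeros(H)
    out = np.zeros((len(xs), H))
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    for t in order:
        h, c = lstm_step(xs[t], h, c, W, b)
        out[t] = h
    return out

def blstm_hidden(xs, fwd, bwd, H):
    """Concatenate forward and backward activations for every frame.

    fwd and bwd are (W, b) pairs; a shared output layer would read both halves.
    """
    return np.concatenate(
        [run_direction(xs, *fwd, H), run_direction(xs, *bwd, H, reverse=True)],
        axis=1,
    )
```

With the input gate closed and the forget gate open, `c` is carried over unchanged, which is the storage behavior described above. The second half of each row returned by `blstm_hidden` depends on later frames, giving the output layer access to future context.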
4. FEATURE FUNCTIONS
As mentioned in Section 2, our keyword spotter is based on a set of
non-linear feature functions $\{\phi_j\}_{j=1}^{n}$ that map a speech
utterance, together with a candidate alignment, into an abstract
vector space. We use $n = 7$ feature functions, which proved
successful for the keyword spotter described in Section 2 [12]. We
experiment with including the output activations of the BLSTM network
described in Section 3 in the first feature function. In one variant
the first feature function is extended to a two-dimensional function,
giving an overall feature dimension of $n = 8$. In what follows we
describe five versions of the first feature function, denoted
$\phi_{1A}$ to $\phi_{1E}$.
Feature function $\phi_{1A}$ is the same as used in [3] and is based
on the hierarchical phoneme classifier described in [13]. The
classifier outputs a confidence $g_p(x)$ that phoneme $p$ is
pronounced in $x$, which is then summed over the whole phoneme
sequence to give

$$\phi_{1A}(\bar{x}, \bar{p}, \bar{s}) = \sum_{i=1}^{|\bar{p}|} \sum_{t=s_i}^{s_{i+1}-1} g_{p_i}(x_t). \tag{8}$$
Unlike $\phi_{1A}$, the feature function $\phi_{1B}$ incorporates
contextual information for the computation of the phoneme
probabilities by replacing the confidences $g_p(x)$ by the BLSTM
output activations $o_p(x)$, thus

$$\phi_{1B}(\bar{x}, \bar{p}, \bar{s}) = \sum_{i=1}^{|\bar{p}|} \sum_{t=s_i}^{s_{i+1}-1} o_{p_i}(x_t). \tag{9}$$
Since the BLSTM outputs tend to produce high-confidence phoneme
probability spikes for the recognized phoneme of a frame while all
other activations are close to zero, it is beneficial to also include
the probability distribution $g(x)$ (which, due to the hierarchical
structure of the classifier, consists of multiple rather
low-confidence spikes) in the first feature function, as in
$\phi_{1C}$ to $\phi_{1E}$. Therefore $\phi_{1C}$ expands the first
feature function to a two-dimensional function, which can be written
as

$$\phi_{1C}(\bar{x}, \bar{p}, \bar{s}) = \begin{pmatrix} \sum_{i=1}^{|\bar{p}|} \sum_{t=s_i}^{s_{i+1}-1} g_{p_i}(x_t) \\ \sum_{i=1}^{|\bar{p}|} \sum_{t=s_i}^{s_{i+1}-1} o_{p_i}(x_t) \end{pmatrix}. \tag{10}$$
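Alternatively, $\phi_{1D}$ consists of a linear combination of the
distributions $g(x)$ and $o(x)$, so that

$$\phi_{1D}(\bar{x}, \bar{p}, \bar{s}) = \sum_{i=1}^{|\bar{p}|} \sum_{t=s_i}^{s_{i+1}-1} \left( G \cdot g_{p_i}(x_t) + O \cdot o_{p_i}(x_t) \right), \tag{11}$$

where $G$ and $O$ are constant weighting factors.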
The function $\phi_{1E}$ takes the maximum of the distributions
$g(x)$ and $o(x)$. This maintains the high-confidence BLSTM output
activations as well as the multiple rather low-confidence hypotheses
of $g(x)$ for $p$-$t$ coordinates where $o_{p_i}(x_t)$ is close to
zero:

$$\phi_{1E}(\bar{x}, \bar{p}, \bar{s}) = \sum_{i=1}^{|\bar{p}|} \sum_{t=s_i}^{s_{i+1}-1} \max\left( g_{p_i}(x_t), \, o_{p_i}(x_t) \right). \tag{12}$$
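The five variants in Equations 8 to 12 differ only in how the two frame-wise score matrices are combined before summing over the aligned segments. The sketch below assumes `g` and `o` are $T \times |\mathcal{P}|$ arrays holding $g_p(x_t)$ and $o_p(x_t)$, that phonemes are integer indices, and that the alignment uses 0-based frames with an exclusive end; all names are illustrative.

```python
import numpy as np

def segment_sum(frame_scores, phonemes, alignment):
    """Sum frame_scores[t, p_i] over the frames assigned to each phoneme p_i."""
    total = 0.0
    for i, p in enumerate(phonemes):
        total += frame_scores[alignment[i]:alignment[i + 1], p].sum()
    return float(total)

def phi1_variants(g, o, phonemes, alignment, G=1.0, O=1.5):
    """Return the first feature function in versions 1A to 1E."""
    sum_g = segment_sum(g, phonemes, alignment)
    sum_o = segment_sum(o, phonemes, alignment)
    return {
        "1A": sum_g,                                  # Equation 8
        "1B": sum_o,                                  # Equation 9
        "1C": np.array([sum_g, sum_o]),               # Equation 10 (two-dimensional)
        "1D": G * sum_g + O * sum_o,                  # Equation 11, by linearity of the sum
        "1E": segment_sum(np.maximum(g, o), phonemes, alignment),  # Equation 12
    }
```

The default values of `G` and `O` are the ones reported for the experiments in Section 5.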
The remaining feature functions $\phi_2$ to $\phi_7$ used in this
work are the same as in [3]. $\phi_2$ to $\phi_5$ measure the
Euclidean distance between feature vectors on both sides of the
suggested phoneme boundaries, assuming that the correct alignment
will produce a large sum of distances, since the distances at the
phoneme boundaries are likely to be high compared to those within a
phoneme. Function $\phi_6$ scores the timing sequences based on
typical phoneme durations, whereas $\phi_7$ considers the speaking
rate implied by the candidate phoneme alignment, presuming that the
speaking rate changes only slowly over time (see [3] for formulas).
5. EXPERIMENTS AND RESULTS
For the training of our keyword spotter and for the comparison of
the different feature functions $\phi_{1A}$ to $\phi_{1E}$ we used the
TIMIT corpus. The TIMIT training set was divided into five parts:
1,500 utterances were used to train the frame-based phoneme
recognizer of the first feature function; 150 utterances served as
the training set for the forced alignment algorithm, which we applied
to initialize the weight vector $w$ (for details see [12]); 100
sequences formed the validation set of the forced aligner; and from
the remaining 1,946 utterances, two times 1,200 samples were selected
for training and two times 200 utterances for validation of the
keyword spotter. From the TIMIT test set 80 keywords were chosen
randomly. For each keyword we selected at most 20 utterances that
contain the keyword and 20 that do not contain the keyword. The
feature vectors consisted of cepstral mean normalized MFCC features
0 to 12 with first and second order delta coefficients. As the
aggressiveness parameter for the update algorithm (see Equation 7) we
used $C = 1$. For the training of the BLSTM used for feature functions
$\phi_{1B}$ to $\phi_{1E}$ we chose the same 1,500 utterances as for
the phoneme recognizer of $\phi_{1A}$; however, we split them into
1,400 sequences for training and 100 for validation. The BLSTM input
layer had a size of 39 (one for each MFCC feature) and the size of
the output layer was also 39, since we used the reduced set of 39
TIMIT phonemes. Both hidden LSTM layers contained 100 memory blocks
of one cell each. To improve generalization, zero-mean Gaussian noise
with standard deviation 0.6 was added to the inputs during training.
We used a learning rate of $10^{-5}$ and a momentum of 0.9.
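The 39-dimensional input can be made concrete with a short sketch: 13 static coefficients (MFCC 0 to 12) plus first and second order deltas, with cepstral mean normalization and training-time noise. The delta computation via `np.gradient` is an assumption made for illustration, since the paper does not state the delta window; the function names are hypothetical as well.

```python
import numpy as np

def build_features(mfcc):
    """Turn T x 13 static MFCCs into T x 39 vectors (static, delta, delta-delta)."""
    static = mfcc - mfcc.mean(axis=0, keepdims=True)  # cepstral mean normalization
    delta = np.gradient(static, axis=0)               # assumed simple finite difference
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([static, delta, delta2])

def add_training_noise(features, rng, std=0.6):
    """Add zero-mean Gaussian noise to the network inputs (training only)."""
    return features + rng.normal(0.0, std, size=features.shape)
```

Subtracting a per-utterance mean leaves the delta coefficients unchanged, so the order of normalization and delta computation does not matter in this sketch.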
In [3] the keyword spotter applying feature function $\phi_{1A}$ was
shown to outperform a state-of-the-art left-right HMM with 5 emitting
states and 40 diagonal Gaussians, consisting of two sub HMM models,
the keyword model and the garbage model (see Figure 1).

Fig. 1. ROC curve (true positive rate versus false positive rate) for
the discriminative keyword spotter using $\phi_{1A}$ (DISC) and the
HMM approach (results taken from [3])
For feature function $\phi_{1D}$ we used the parameters $G$ and $O$
that resulted in the best phoneme recognition rate ($G = 1$ and
$O = 1.5$). Table 1 shows the average AUC for the different versions
of the feature function $\phi_1$: best performance is achieved when
using $\phi_{1D}$ or $\phi_{1A}$, and there is no statistically
significant difference between the results obtained with these two
feature functions. Figure 2 illustrates the ROC curves obtained with
$\phi_{1D}$ (DISC-BLSTM) and $\phi_{1A}$ (DISC) for the TIMIT
experiment.

version of $\phi_1$    AUC
$\phi_{1D}$            0.981
$\phi_{1A}$            0.980
$\phi_{1E}$            0.970
$\phi_{1C}$            0.965
$\phi_{1B}$            0.942

Table 1. AUC for different versions of $\phi_1$ (TIMIT experiment)

Fig. 2. ROC curve (true positive rate versus false positive rate) for
the discriminative keyword spotter using $\phi_{1A}$ (DISC) and
$\phi_{1D}$ (DISC-BLSTM)

Footnote: 1200 positive and 200 negative utterances

Next, we compared the performance of the keyword spotter using
$\phi_{1A}$ with the best BLSTM keyword spotter using $\phi_{1D}$ on
the Belfast Sensitive Artificial Listener database. In contrast to
the TIMIT database, which contains read utterances, the SAL corpus
contains spontaneous and emotionally colored speech. Note that the
SAL utterances have a length of up to 15 seconds, which is longer
than the TIMIT sequences, increasing the probability of false
positives. For a more detailed description of the SAL database see
[8] or [7]. We randomly selected 24 keywords; for each keyword we
chose 20 utterances in which the keyword is not uttered and up to 20
utterances (depending on how often the keyword occurs in the whole
corpus) that include the keyword. On average, a keyword consisted of
5.4 phonemes. Both the BLSTM network and the keyword spotter were
trained on the TIMIT database without any further adaptation to the
SAL corpus. For this task our BLSTM approach (using $\phi_{1D}$)
outperformed the keyword spotter that does not use long-range
dependencies via BLSTM output activations. The average AUC was 0.80
for the BLSTM experiment and 0.68 for the experiment using the
original feature function $\phi_{1A}$. The ROC curves for both
experiments can be seen in Figure 3.
Fig. 3. ROC curve (true positive rate versus false positive rate) for
the SAL experiment applying the BLSTM feature function $\phi_{1D}$
(DISC-BLSTM) and the original function $\phi_{1A}$ (DISC)
6. CONCLUSION
This work presented several methods for enhancing the robustness
of a discriminative keyword spotter with a BLSTM recurrent neural
network. The best method used a modified feature function that in-
cluded both the phoneme probability scores obtained from a BLSTM
network and those given by a hierarchical phoneme classifier. For
the TIMIT experiment, both the BLSTM keyword spotter and the
non-enhanced version gave almost perfect detection rates. However,
the BLSTM system gave an 18% relative improvement in average AUC
(0.80 versus 0.68) on
the SAL database. This indicates the greater robustness of BLSTM
to spontaneous, emotionally colored speech.
For future experiments we will focus on retraining the BLSTM
keyword spotter on forced alignments from the SAL database, as
a next step towards further improving keyword detection rates for
spontaneous emotional speech.
7. ACKNOWLEDGMENT
The research leading to these results has received funding from the
European Community’s Seventh Framework Programme (FP7/2007-
2013) under grant agreement No. 211486 (SEMAINE).
8. REFERENCES
[1] H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard, “Posterior
based keyword spotting with a priori thresholds,” in Proceed-
ings of Interspeech, Pittsburgh, Pennsylvania, 2006.
[2] Y. B. Ayed, D. Fohr, J. P. Haton, and G. Chollet, “Confidence
measure for keyword spotting using support vector machines,”
in Proceedings of International Conference on Audio, Speech
and Signal Processing, Montreal, Canada, 2004.
[3] J. Keshet, D. Grangier, and S. Bengio, “Discriminative key-
word spotting,” in Workshop on Non-Linear Speech Processing
NOLISP, Paris, France, 2007.
[4] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
Neural Computation, vol. 9(8), pp. 1735–1780, 1997.
[5] A. Graves, S. Fernandez, and J. Schmidhuber, “Bidirectional
LSTM networks for improved phoneme classification and recognition,”
in Proceedings of ICANN, Warsaw, Poland, 2005,
vol. 18, pp. 602–610.
[6] S. Fernandez, A. Graves, and J. Schmidhuber, “An application
of recurrent neural networks to discriminative keyword spot-
ting,” in Proceedings of ICANN, Porto, Portugal, 2007, pp.
220–229.
[7] M. Wöllmer, F. Eyben, S. Reiter, B. Schuller, C. Cox,
E. Douglas-Cowie, and R. Cowie, “Abandoning emotion
classes - towards continuous emotion recognition with mod-
elling of long-range dependencies,” in Proceedings Inter-
speech, Brisbane, Australia, 2008.
[8] E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry,
M. McRorie, J. C. Martin, L. Devillers, S. Abrilian, A. Batliner,
N. Amir, and K. Karpouzis, The HUMAINE Database, vol.
4738, pp. 488–500, 2007.
[9] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and
Y. Singer, “Online passive aggressive algorithms,” Journal
of Machine Learning Research, vol. 7, 2006.
[10] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural
networks,” IEEE Transactions on Signal Processing, vol. 45,
pp. 2673–2681, November 1997.
[11] P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri, “Ex-
ploiting the past and the future in protein secondary structure
prediction,” BIOINF: Bioinformatics, vol. 15, 1999.
[12] J. Keshet, Large Margin Algorithms for Discriminative Con-
tinuous Speech Recognition, Ph.D. thesis, Hebrew University,
2007.
[13] O. Dekel, J. Keshet, and Y. Singer, “Online algorithm for hier-
archical phoneme classification,” in Workshop on Multimodal
Interaction and Related Machine Learning Algorithms, Mar-
tigny, Switzerland, 2004, pp. 146–159.