Multi-Domain Joint Semantic Frame Parsing using Bi-directional RNN-LSTM


Dilek Hakkani-Tür, Gokhan Tur, Asli Celikyilmaz, Yun-Nung Chen,
Jianfeng Gao, Li Deng, and Ye-Yi Wang
Microsoft, Redmond, WA, USA
National Taiwan University, Taipei, Taiwan
{dilek, gokhan.tur, asli, y.v.chen}, {jfgao, deng, yeyiwang}
Sequence-to-sequence deep learning has recently emerged
as a new paradigm in supervised learning for spoken language
understanding. However, most of the previous studies ex-
plored this framework for building single domain models for
each task, such as slot filling or domain classification, com-
paring deep learning based approaches with conventional ones
like conditional random fields. This paper proposes a holistic
multi-domain, multi-task (i.e. slot filling, domain and intent
detection) modeling approach to estimate complete semantic
frames for all user utterances addressed to a conversational sys-
tem, demonstrating the distinctive power of deep learning meth-
ods, namely bi-directional recurrent neural network (RNN) with
long-short term memory (LSTM) cells (RNN-LSTM) to handle
such complexity. The contributions of the presented work are
three-fold: (i) we propose an RNN-LSTM architecture for joint
modeling of slot filling, intent determination, and domain clas-
sification; (ii) we build a joint multi-domain model enabling
multi-task deep learning where the data from each domain re-
inforces each other; (iii) we investigate alternative architectures
for modeling lexical context in spoken language understanding.
In addition to the simplicity of the single model framework, ex-
perimental results show the power of such an approach on Mi-
crosoft Cortana real user data over alternative methods based on
single domain/task deep learning.
Index Terms: recurrent neural networks, long short term mem-
ory, multi-domain language understanding, joint modeling
1. Introduction
In the last decade, a variety of practical goal-oriented conver-
sation understanding systems have been built for a number of
domains, such as the virtual personal assistants Microsoft’s Cor-
tana and Apple’s Siri. Three key tasks in such targeted un-
derstanding applications are domain classification, intent deter-
mination and slot filling [1], aiming to form a semantic frame
that captures the semantics of user utterances/queries. Domain
classification is often completed first in spoken language under-
standing (SLU) systems, serving as a top-level triage for subse-
quent processing. Intent determination and slot filling are then
run for each domain to fill a domain specific semantic template.
An example semantic frame for a movie-related utterance, “find
recent comedies by James Cameron”, is shown in Figure 1.
  W: find   recent   comedies   by   james   cameron
  S: O      B-date   B-genre    O    B-dir   I-dir
  I: find movie

Figure 1: An example utterance with annotations of semantic
slots in IOB format (S), domain (D), and intent (I). B-dir and
I-dir denote the director name.

This modular design approach (i.e., modeling SLU as 3
tasks) has the advantage of flexibility; specific modifications
(e.g., insertions, deletions) to a domain can be implemented
without requiring changes to other domains. Another advantage
is that, in this approach, one can use task/domain specific
features, which often significantly improve the accuracy of these
task/domain specific models. Also, this approach often yields
more focused understanding in each domain since the intent
determination only needs to consider a relatively small set of
intent and slot classes over a single (or limited set) of domains,
and model parameters could be optimized for the specific set of
intent and slots. However, this approach also has disadvantages.
First of all, one needs to train these models for each domain.
This is an error-prone process, requiring careful engineering to
ensure consistency in processing across domains. Also, during
run-time, such pipelining of tasks results in transfer of errors
from one task to the following tasks. Furthermore, there is no
data or feature sharing between the individual domain models,
resulting in data fragmentation, whereas some semantic intents
(such as finding or buying a domain specific entity) and slots
(such as dates, times, and locations) could actually be common
to many domains [2, 3]. Finally, the users may not know which
domains are covered by the system and to what extent, so they
do not know what to expect from an interaction, which results
in user dissatisfaction [4, 5].
We propose a single recurrent neural network (RNN) archi-
tecture that integrates the three tasks of domain detection, intent
detection and slot filling for multiple domains in a single SLU
model. This model is trained using all available utterances from
all domains, paired with their semantic frames. The input of
this RNN is the input sequence of words (e.g., user queries) and
the output is the full semantic frame, including domain, intent,
and slots, as shown in Figure 1. Since the dependency between
the words is important for SLU tasks, we investigate alternative
architectures for integrating lexical context and dependencies.
We compare the single model approach to alternative ways of
building models for multi-task, multi-domain scenarios.
The next section sets the baseline RNN-LSTM architecture
based on the slot filling task [6], and explores various architec-
tures for exploiting lexical contexts. In Section 3, we extend this
architecture to model domains and intents of user utterances in
addition to slot filling, and propose a multi-domain multi-task
architecture for SLU. In the experiments, we first investigate the
performance of alternative architectures on the benchmark ATIS
data set [7], and then on the Microsoft Cortana multi-domain
data. We show that the single multi-domain, joint model ap-
proach is not only simpler, but also results in the best F-measure
in experimental results.
2. Deep Learning for SLU
A major task in spoken language understanding in goal-oriented
human-machine conversational understanding systems is to
automatically classify the domain of a user query along with
domain specific intents and fill in a set of arguments or “slots” to
form a semantic frame. In this study, we follow the popular IOB
(in-out-begin) format for representing the slot tags as shown in
Figure 1.
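As a small illustration of the IOB convention, the sketch below groups tagged words back into slot phrases. The function name `iob_to_slots` and the choice to ignore an I- tag that has no matching open slot are assumptions made for this example, not part of the original system.

```python
from typing import List, Tuple


def iob_to_slots(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged words into (slot_name, phrase) pairs."""
    slots: List[Tuple[str, List[str]]] = []
    current = None  # the (name, words) pair being extended, if any
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new slot.
            current = (tag[2:], [word])
            slots.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            # An I- tag extends the open slot of the same name.
            current[1].append(word)
        else:
            # "O", or an I- tag with no matching open slot, closes the slot.
            current = None
    return [(name, " ".join(parts)) for name, parts in slots]


words = "find recent comedies by james cameron".split()
tags = ["O", "B-date", "B-genre", "O", "B-dir", "I-dir"]
print(iob_to_slots(words, tags))
# [('date', 'recent'), ('genre', 'comedies'), ('dir', 'james cameron')]
```

The B-/I- distinction is what allows two adjacent slots of the same type to be kept apart, which a plain per-word label could not express.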
A detailed survey of pre-deep learning era approaches for
domain detection, intent determination, and slot filling can
be found in [1]. Basically, domain detection and intent de-
termination tasks are framed as classification problems, for
which researchers have employed support vector machines [8],
maximum entropy classifiers [9], or boosting based classi-
fiers [10, 11]. Similarly, slot filling is framed as a sequence
classification problem and hidden Markov models [12] and con-
ditional random fields [13, 14] have been employed.
With the advances in deep learning, deep belief networks
(DBNs) with deep neural networks (DNNs) were first employed
for intent determination in call centers [15], and later for
domain classification in personal assistants [16, 17, 18]. More
recently, an RNN architecture with LSTM cells has been
employed for intent classification [19].
For slot filling, deep learning research started as extensions
of DNNs and DBNs (e.g., [20]) and is sometimes merged
with CRFs [21]. One notable extension is the use of recursive
neural networks, framing the problem as semantic parsing [22].
To the best of our knowledge, RNNs were first employed
for slot filling by Yao et al. [23] and Mesnil et al. [24]
concurrently. We have compiled a comprehensive review of RNN
based slot filling approaches in [6].
Especially with the re-discovery of LSTM cells [25] for
RNNs, this architecture has started to emerge [26]. LSTM cells
are shown to have superior properties, such as faster conver-
gence and elimination of the problem of vanishing or exploding
gradients in sequence via self-regularization, as presented be-
low. As a result, LSTM is more robust than RNN in capturing
long-span dependencies.
2.1. RNN with LSTM cells for slot filling
Figure 2: LSTM cell, as depicted in [31].

To estimate the sequence of tags $Y = y_1, \ldots, y_n$
corresponding to an input sequence of tokens $X = x_1, \ldots, x_n$,
we use the Elman RNN architecture [27], consisting of an input
layer, a hidden layer (for the single layer version), and an output
layer. The input, hidden and output layers consist of a set of
neurons representing the input, hidden, and output at each time
step $t$, $x_t$, $h_t$, and $y_t$, respectively. The input is typically
represented by a 1-hot vector or word level embeddings. Given the
input layer $x_t$ at time $t$, and the hidden state from the previous
time step $h_{t-1}$, the hidden and output layers for the current time
step are computed as follows:

$$h_t = \phi\left(W_{xh} \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}\right) \qquad (1)$$
$$p_t = \mathrm{softmax}(W_{hy} h_t) \qquad (2)$$
$$\hat{y}_t = \arg\max p_t \qquad (3)$$

where $W_{xh}$ and $W_{hy}$ are the matrices that denote the weights
between the input and hidden layers and hidden and output
layers, respectively. $\phi$ denotes the activation function, i.e.,
tanh or sigm. The softmax is defined as
$\mathrm{softmax}(z_m) = e^{z_m} / \sum_i e^{z_i}$. The weights of the
model are trained using backpropagation to maximize the
conditional likelihood of the training set labels:

$$\prod_t p(y_t \mid x_1, \ldots, x_t). \qquad (4)$$
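A minimal sketch of one Elman RNN time step in plain Python may help make the forward computation concrete. It assumes that the hidden layer is computed from the concatenation of the current input and the previous hidden state, that the activation is tanh, and that there are no bias terms; the toy weights and sizes are illustrative only and do not come from the experiments.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(W: Matrix, v: Vector) -> Vector:
    """Matrix-vector product for plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(z: Vector) -> Vector:
    """softmax(z_m) = exp(z_m) / sum_i exp(z_i), shifted by max(z) for stability."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_step(W_xh: Matrix, W_hy: Matrix, x_t: Vector,
             h_prev: Vector) -> Tuple[Vector, Vector, int]:
    """One Elman step: new hidden state, tag distribution, argmax tag index."""
    h_t = [math.tanh(a) for a in matvec(W_xh, x_t + h_prev)]  # phi = tanh
    p_t = softmax(matvec(W_hy, h_t))
    y_hat = max(range(len(p_t)), key=lambda m: p_t[m])
    return h_t, p_t, y_hat


# Toy sizes (illustrative only): 3-word vocabulary, 2 hidden units, 2 tags.
W_xh = [[0.5, 0.1, -0.2, 0.0, 0.3],
        [-0.1, 0.4, 0.0, 0.2, 0.1]]
W_hy = [[1.0, -1.0],
        [-1.0, 1.0]]
h_t, p_t, y_hat = rnn_step(W_xh, W_hy, x_t=[1.0, 0.0, 0.0], h_prev=[0.0, 0.0])
```

Running the step over a whole utterance amounts to feeding each returned `h_t` back in as `h_prev` for the next word.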
Previous work [28] has shown that training model parameters
with backpropagation over time could result in exploding
or vanishing gradients. Exploding gradients could be alleviated
by gradient clipping [29], but this does not help vanishing
gradients. LSTM cells [25] were designed to mitigate the
vanishing gradient problem. In addition to the hidden layer vector
$h_t$, LSTMs maintain a memory vector, $c_t$, which they can choose
to read from, write to or reset using a gating mechanism and
sigmoid functions. The input gate, $i_t$, is used to scale down the
input; the forget gate, $f_t$, is used to scale down the memory
vector $c_t$; the output gate, $o_t$, is used to scale down the output
to reach the final $h_t$. Following the precise formulation of [30],
these gates in LSTMs are computed as follows, as also shown
in Figure 2:

$$\begin{bmatrix} i_t \\ f_t \\ o_t \\ g_t \end{bmatrix} = \begin{bmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{bmatrix} W_t \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where sigm and tanh are applied element-wise, $W_t$ is
the weight matrix, and $\odot$ denotes the element-wise product.
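The gating mechanism can be sketched as a single LSTM step in plain Python. This is a hedged illustration in the style of the formulation in [30]: it assumes one weight matrix applied to the concatenation of the input and the previous hidden state, a candidate vector named `g_t` (a name introduced here), and no bias terms.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def sigm(a: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-a))


def lstm_step(W: Matrix, x_t: Vector, h_prev: Vector,
              c_prev: Vector) -> Tuple[Vector, Vector]:
    """One LSTM step; W has 4n rows for n hidden units (gates i, f, o, then g)."""
    n = len(h_prev)
    v = x_t + h_prev  # concatenated input and previous hidden state
    z = [sum(w * a for w, a in zip(row, v)) for row in W]
    i_t = [sigm(a) for a in z[0:n]]               # input gate
    f_t = [sigm(a) for a in z[n:2 * n]]           # forget gate
    o_t = [sigm(a) for a in z[2 * n:3 * n]]       # output gate
    g_t = [math.tanh(a) for a in z[3 * n:4 * n]]  # candidate memory content
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# With an all-zero weight matrix every gate equals 0.5 and the candidate is 0,
# so the memory vector is simply halved: a quick check of the wiring.
W = [[0.0] * 5 for _ in range(8)]
h_t, c_t = lstm_step(W, x_t=[1.0, 0.0, 0.0], h_prev=[0.0, 0.0], c_prev=[1.0, -2.0])
```

Because the memory update is additive (`f * c + i * g`), the forget gate can carry information across many steps when it stays close to 1, which is the property that helps with vanishing gradients.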
2.2. Integration of context
In SLU, word tags are not only determined by the associated
terms, but also contexts [32]. For example, in ATIS data, the
city name Boston could be tagged as originating or destination
city, according to the lexical context it appears in. For capturing
such dependencies, we investigated two extensions to the RNN-
LSTM architecture (Figure 3.(a)): look-around LSTM (LSTM-
LA) and bi-directional LSTM (bLSTM) [33].
At each time step, in addition to $x_t$, LSTM-LA
(Figure 3.(b)) considers the following and preceding words as part
of the input, by concatenating the input vectors for the
neighboring words. In this work, our input at time $t$ consisted of a
single vector formed by concatenating $x_{t-1}$, $x_t$, $x_{t+1}$.
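The look-around input can be sketched in a few lines. Padding the first and last positions with zero vectors is an assumption made for this example, since the text does not say how utterance boundaries are handled; `look_around_inputs` is a hypothetical helper name.

```python
from typing import List

Vector = List[float]


def look_around_inputs(xs: List[Vector]) -> List[Vector]:
    """Concatenate x_{t-1}, x_t, x_{t+1} for every position t."""
    dim = len(xs[0])
    zero = [0.0] * dim  # assumed padding at the utterance boundaries
    padded = [zero] + xs + [zero]
    return [padded[t - 1] + padded[t] + padded[t + 1]
            for t in range(1, len(xs) + 1)]


# Two 1-hot word vectors over a toy 2-word vocabulary.
print(look_around_inputs([[1.0, 0.0], [0.0, 1.0]]))
# [[0.0, 0.0, 1.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]]
```

The input dimension triples, while the recurrent part of the network is left unchanged.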
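In bLSTM (Figure 3.(c)), two LSTM architectures are
traversed in a left-to-right and right-to-left manner, and their
hidden layers are concatenated when computing the output
sequence (we use the superscripts $b$ and $f$ for denoting parameters
for the backward and forward directions):

$$p_t = \mathrm{softmax}(W^f_{hy} h^f_t + W^b_{hy} h^b_t)$$

where forward and backward gates are computed respectively
as follows:

$$\begin{bmatrix} i^f_t \\ f^f_t \\ o^f_t \\ g^f_t \end{bmatrix} = \begin{bmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{bmatrix} W^f_t \begin{bmatrix} x_t \\ h^f_{t-1} \end{bmatrix}$$
$$\begin{bmatrix} i^b_t \\ f^b_t \\ o^b_t \\ g^b_t \end{bmatrix} = \begin{bmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{bmatrix} W^b_t \begin{bmatrix} x_t \\ h^b_{t+1} \end{bmatrix}$$

Figure 3: RNN-LSTM architectures used in this work. Panels:
(a) LSTM, (b) LSTM-LA, (c) bLSTM-LA, (d) Intent LSTM.

In order to make the implementation more efficient, many
of the shared computations are done once, such as input vector
preparation or top level gradient computation, $p_t - \mathrm{truth}_t$,
where $\mathrm{truth}_t$ is the 1-hot vector for the target tag.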
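The top level gradient has a simple closed form. As a short derivation, assuming the per-step loss is the negative log-likelihood of the target tag under the softmax output (the symbols $z_t$ for the pre-softmax scores and $\ell_t$ for the loss are introduced here for illustration only):

```latex
\ell_t = -\sum_m \mathrm{truth}_{t,m} \log p_{t,m},
\qquad
p_{t,m} = \frac{e^{z_{t,m}}}{\sum_i e^{z_{t,i}}}

\frac{\partial \ell_t}{\partial z_{t,m}}
  = -\mathrm{truth}_{t,m} + p_{t,m} \sum_j \mathrm{truth}_{t,j}
  = p_{t,m} - \mathrm{truth}_{t,m}
```

The last step uses the fact that a 1-hot vector sums to one. Since this quantity does not depend on the direction of traversal, it can be computed once and passed to both the forward and the backward network.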
Figure 3 depicts these three architectures, as well as the in-
tent LSTM architecture of [19] that we used for modeling of
intents and domains in isolation as the baseline.
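The bi-directional traversal itself can be sketched independently of the cell type. In the example below, `step_f` and `step_b` stand in for the forward and backward LSTM steps (each with its own parameters); toy string-concatenation steps are substituted so that the context visible at each position is easy to read off. All names are assumptions made for this illustration.

```python
from typing import Callable, List, Tuple, TypeVar

H = TypeVar("H")
X = TypeVar("X")


def bidirectional_states(step_f: Callable[[H, X], H],
                         step_b: Callable[[H, X], H],
                         xs: List[X], h0: H) -> List[Tuple[H, H]]:
    """Return the (forward, backward) hidden state pair for every position."""
    forward, h = [], h0
    for x in xs:            # left-to-right pass
        h = step_f(h, x)
        forward.append(h)
    backward, h = [], h0
    for x in reversed(xs):  # right-to-left pass
        h = step_b(h, x)
        backward.append(h)
    backward.reverse()      # re-align with word positions
    return list(zip(forward, backward))


# Toy steps: the "hidden state" is the string of inputs seen so far.
states = bidirectional_states(lambda h, x: h + x, lambda h, x: x + h,
                              list("abc"), "")
print(states)
# [('a', 'abc'), ('ab', 'bc'), ('abc', 'c')]
```

At every position the forward state summarizes the prefix and the backward state the suffix, so the output layer sees the whole utterance when it combines the two.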
3. Joint, Multi-Domain Modeling of
Domain, Intent and Slots
A commonly used approach to represent slot tags for slot
filling is associating each input word $w_t$ of utterance $k$ with an
IOB-style tag as exemplified in Figure 1, hence the input
sequence $X$ is $w_1, \ldots, w_n$ and the output is the sequence of slot
tags $s_1, \ldots, s_n$. We follow this approach and associate a slot tag
with each word.
For joint modeling of domain, intent, and slots, we
assume an additional token at the end of each input utterance $k$,
<EOS>, and associate a combination of domain and intent tags
$d_k$ and $i_k$ to this sentence final token by concatenating these
tags. Hence, the new input and output sequences are:

$$X = w_1, \ldots, w_n, \text{<EOS>}$$
$$Y = s_1, \ldots, s_n, d_k i_k$$

The main rationale of this idea is similar to the
sequence-to-sequence modeling approach, as used in machine
translation [34] or chit-chat [35] systems. The last hidden
layer of the query is supposed to contain a latent semantic
representation of the whole input utterance, so that it can be utilized
for domain and intent prediction ($d_k i_k$).
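Building the joint training sequences is straightforward; a minimal sketch follows. The separator used to join the domain and intent tags, the helper name `make_joint_example`, and the example domain and intent labels are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

EOS = "<EOS>"


def make_joint_example(words: Sequence[str], slot_tags: Sequence[str],
                       domain: str, intent: str,
                       sep: str = "_") -> Tuple[List[str], List[str]]:
    """Append <EOS> to the input and a combined domain+intent tag to the output."""
    if len(words) != len(slot_tags):
        raise ValueError("each word needs exactly one slot tag")
    x = list(words) + [EOS]
    y = list(slot_tags) + [domain + sep + intent]  # the tag predicted at <EOS>
    return x, y


x, y = make_joint_example(
    "find recent comedies by james cameron".split(),
    ["O", "B-date", "B-genre", "O", "B-dir", "I-dir"],
    domain="movies",      # hypothetical domain label
    intent="find_movie",  # hypothetical intent label
)
print(list(zip(x, y))[-1])
# ('<EOS>', 'movies_find_movie')
```

With this encoding the same tagger output layer covers slot tags and frame-level labels, so no separate classifier head is needed.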
4. Experiments
For training all architectures, we used mini-batch stochastic
gradient descent with a batch size of 10 examples and adagrad [36].
We experimented with different hidden layer sizes in {50, 75,
100, 125, 150} and a fixed learning rate in {0.01, 0.05, 0.1}
in all of the experiments. We used only lexical features (i.e.,
no dictionaries), and represented input with 1-hot word vectors,
including all the vocabulary terms. In addition to 1-hot word
vectors, we experimented with word2vec [37] and Senna [38]
embeddings, and did not observe significant performance
improvement, hence only results with 1-hot vectors are reported.
All parameters were uniformly initialized in [-0.01, 0.01].
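For readers unfamiliar with adagrad, the sketch below shows uniform initialization and one adagrad update on a flat parameter list. It assumes the initialization interval is symmetric around zero and uses the standard adagrad rule with a small stabilizing constant `eps`; the function names and the default learning rate are illustrative choices, not the authors' implementation.

```python
import math
import random
from typing import List


def init_uniform(n: int, scale: float = 0.01, seed: int = 0) -> List[float]:
    """Draw n parameters uniformly from [-scale, scale]."""
    rng = random.Random(seed)
    return [rng.uniform(-scale, scale) for _ in range(n)]


def adagrad_update(params: List[float], grads: List[float],
                   cache: List[float], lr: float = 0.05,
                   eps: float = 1e-8) -> None:
    """In-place adagrad step: per-parameter rate lr / sqrt(sum of squared grads)."""
    for k, g in enumerate(grads):
        cache[k] += g * g
        params[k] -= lr * g / (math.sqrt(cache[k]) + eps)


params = init_uniform(4)
cache = [0.0] * len(params)
adagrad_update(params, grads=[0.5, -0.5, 0.0, 1.0], cache=cache)
```

Parameters that receive frequent large gradients accumulate a large `cache` entry and therefore take smaller steps, which suits sparse 1-hot inputs where rare words are updated infrequently.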
4.1. Data sets
For investigating the integration of contexts for slot filling, we
have experimented with the benchmark ATIS data set [7] for
the air travel domain. For experiments related to joint domain,
intent, and slot modeling, four domains are chosen: alarm,
calendar, communication and technical, to create a diverse set in
terms of vocabulary size, number of intents and slots. The
number of training, development and test utterances, vocabulary
size, number of intents and slots for each of these data sets
are listed in Table 1. As seen in the last row of this table, the
number of intents and slots in the joined data set is less than
the sum of the number of intents and slots in individual
domains; this is because some of these are shared across different
domains.
4.2. Slot Filling Experiments
The ATIS data set comes with a commonly used training and test
split [7]. For tuning parameters, we further split the training set
into 90% training and 10% development set. After choosing the
parameters that maximize the F-measure on the development
set, we retrained the model with all of the training data with the
optimum parameter set with 10 different initializations and
averaged F-measures. The maximum F-measure (best F) is
computed on the test set when 90% of the training examples were
used and the average F-measure (avg. F) is computed by
averaging F-measure from the 10 runs when all the training examples
are used with the optimum parameters. These results are shown
in Table 2. We get the best F-measure with the bi-directional
LSTM architecture (though comparable with LSTM-LA). The
relative performances of RNN, LSTM, and LSTM-LA are in
parallel with our earlier work [39], though F-measure is slightly
lower due to differences in normalization.
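The slot F-measure is typically computed over whole slot chunks rather than individual tags. The sketch below assumes that convention (a predicted slot counts as correct only if its name and both boundaries match the reference); the exact scoring script and normalization used for the reported numbers are not specified here, so this is an illustration rather than a reproduction.

```python
from typing import List, Sequence, Set, Tuple

Span = Tuple[str, int, int]  # (slot name, start index, end index exclusive)


def iob_spans(tags: Sequence[str]) -> Set[Span]:
    """Extract slot chunks from one IOB tag sequence."""
    spans: Set[Span] = set()
    name, start = None, 0
    for t, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        continues = tag.startswith("I-") and name == tag[2:]
        if name is not None and not continues:
            spans.add((name, start, t))
            name = None
        if tag.startswith("B-"):
            name, start = tag[2:], t
    return spans


def slot_f1(gold_seqs: List[Sequence[str]], pred_seqs: List[Sequence[str]]) -> float:
    """Chunk-level F-measure accumulated over a set of utterances."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = iob_spans(gold), iob_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


gold = [["O", "B-date", "B-genre", "O", "B-dir", "I-dir"]]
pred = [["O", "B-date", "B-genre", "O", "B-dir", "O"]]  # director name cut short
score = slot_f1(gold, pred)  # 2 of 3 chunks match, so precision = recall = 2/3
```

Under chunk-level scoring a truncated slot counts as both a missed reference chunk and a spurious predicted chunk, so a single boundary error lowers precision and recall together.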
4.3. Multi-Domain, Joint Model Experiments
Following the slot filling experiments, we used bi-directional
LSTM for modeling slots alone and jointly modeling intent and
slots, and following [19], we use LSTM for modeling intents.
Table 1: Data sets used in the experiments. For each domain, the number of examples in the training, dev, and test sets, input vocabulary
size of the training set, and number of unique intents and slots.

Data Set        # Train   # Dev   # Test     |V|   # Intents   # Slots
ATIS              4,978       -      893     900          17        79
Alarm             8,096   1,057      846     433          16         8
Calendar         21,695   3,626    2,555   1,832          20        18
Communication    13,779   2,662    1,529   4,336          25        20
Technical         7,687     993      867   2,180           5        18
4 domains        51,257   8,338    5,797   6,680          59        42
Table 2: F-measure results using ATIS data. The first column
shows the best F-measure on the test set, when the model was
trained with 90% of the training examples; the second column
shows F-measure averaged over 10 random initializations with
parameters optimized in the development set (10%).

Model      best F    avg. F
RNN        93.06%    92.09%
LSTM       93.80%    93.09%
LSTM-LA    95.12%    94.68%
bLSTM      95.48%    94.70%
We experimented with 4 settings, and report slot F-measure
(SLOT F, Table 3), intent accuracy (INTENT A, Table 3) and
overall frame error rate (OVERALL E, Table 4) for each of
them:
• SD-Sep: For each domain, a separate intent detection
  and slot filling model was trained, resulting in 2 × |D|
  classifiers, where |D| is the number of domains. Optimum
  parameters were found on the development set for
  each experiment and used for computing performance on
  the test set. The outputs of all the classifiers were joined
  for overall error rates.
• SD-Joint: For each domain, a single model that estimates
  both intent and sequence of slots was used, resulting
  in |D| classifiers.
• MD-Sep: An intent detection model and a slot filling
  model were trained using data from all the domains,
  resulting in 2 classifiers. The output of intent detection
  was merged with the output of slot filling for computing
  overall template error rates.
• MD-Joint: A single classifier for estimating the full
  semantic frame that includes domain, intent, and slots for
  each utterance was trained using all the data.
The first two settings assume that the correct domain for
each example in the test set is provided. To estimate the domain
in these settings, we trained an LSTM model for domain
detection using all the data; the accuracy of the domain
detection is 95.5% on the test set. Table 3 shows results for intent
detection and slot filling when the true domain is known for the
first two settings, hence the performances of these two settings
seem higher. Table 4, however, shows overall frame error rates
when the domain estimation is integrated in the decision of the
final frame. In both single-domain and multi-domain settings,
intent detection accuracy improves with joint training (although
the improvement is small), but slot filling degrades. Overall, we
achieve the lowest error with the single model approach. The
13.4% semantic frame error rate on all the data is significantly
better than the 14.4% of the commonly used SD-Sep.
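The overall frame error rate can be sketched as follows, under the assumption that a frame counts as correct only when the domain, the intent, and every slot tag match the reference. The `Frame` type, the helper name, and the example labels are hypothetical.

```python
from typing import Sequence, Tuple

Frame = Tuple[str, str, Tuple[str, ...]]  # (domain, intent, slot tags)


def frame_error_rate(gold: Sequence[Frame], predicted: Sequence[Frame]) -> float:
    """Fraction of utterances whose predicted frame differs from the reference."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("need two non-empty frame lists of equal length")
    errors = sum(1 for g, p in zip(gold, predicted) if g != p)
    return errors / len(gold)


gold = [
    ("alarm", "set_alarm", ("O", "O", "B-time")),
    ("calendar", "find_event", ("O", "B-date")),
]
predicted = [
    ("alarm", "set_alarm", ("O", "O", "B-time")),     # fully correct frame
    ("calendar", "create_event", ("O", "B-date")),    # slots right, intent wrong
]
print(frame_error_rate(gold, predicted))
# 0.5
```

Because any single mistake invalidates the frame, this metric is stricter than slot F-measure or intent accuracy taken separately, which is why the ranking of the settings can differ between Table 3 and Table 4.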
Table 3: Slot F-measure and intent accuracy results in single
domain (SD) and multi domain (MD) joint and separate
modeling experiments.

SLOT F      SD-Sep   SD-Joint   MD-Sep   MD-Joint
Alarm       95.9%    93.9%      94.5%    94.3%
Cal.        94.5%    93.7%      92.6%    92.4%
Comm.       86.4%    83.8%      85.1%    82.7%
Tech.       90.4%    89.8%      89.6%    88.3%
All         91.8%    90.5%      90.0%    89.4%

INTENT A    SD-Sep   SD-Joint   MD-Sep   MD-Joint
Alarm       96.5%    96.2%      94.9%    94.3%
Cal.        97.2%    97.6%      94.2%    94.3%
Comm.       96.1%    95.8%      94.0%    95.4%
Tech.       94.6%    95.9%      93.9%    95.3%
All         96.4%    96.7%      94.1%    94.6%
Table 4: Overall frame level error rates.

Overall E   SD-Sep   SD-Joint   MD-Sep   MD-Joint
Alarm        9.5%     9.8%       9.1%     9.2%
Cal.        10.7%    11.1%      11.3%    10.1%
Comm.       19.8%    20.6%      16.3%    17.3%
Tech.       20.4%    20.6%      21.4%    20.2%
All         14.4%    14.9%      13.7%    13.4%
5. Conclusions
We propose a multi-domain, multi-task (i.e. domain and intent
detection and slot filling) sequence tagging approach to esti-
mate complete semantic frames for user utterances addressed to
a conversational system. First, we investigate alternative archi-
tectures for modeling lexical context for spoken language un-
derstanding. Then we present our approach that jointly mod-
els slot filling, intent determination, and domain classification
in a single bi-directional RNN with LSTM cells. User queries
from multiple domains are combined in a single model enabling
multi-task deep learning. We empirically show that the proposed
approach improves over the alternatives. In addition to the
simplicity of the single model framework for SLU, such an
architecture opens the way, in our future research, to handling
belief state update and other non-lexical contexts, such as user
contacts or dialogue history, in one holistic model [32].
Furthermore, an RNN-LSTM based language generation system [40]
can be jointly trained, enabling an end-to-end conversational
understanding framework.
6. Acknowledgments
The authors would like to thank Nikhil Ramesh, Bin Cao, Derek
Liu and the reviewers for useful discussions and feedback.
7. References
[1] G. Tur and R. D. Mori, Eds., Spoken Language Understanding:
Systems for Extracting Semantic Information from Speech. New
York, NY: John Wiley and Sons, 2011.
[2] Y.-B. Kim, K. Stratos, R. Sarikaya, and M. Jeong, “New trans-
fer learning techniques for disparate label sets,” in Proceedings of
[3] Y.-N. Chen, D. Hakkani-Tur, and X. He, “Zero-shot learning of
intent embeddings for expansion by convolutional deep structured
semantic models,” in Proceedings of ICASSP. IEEE, 2016.
[4] Y.-N. Chen, W. Y. Wang, and A. I. Rudnicky, “Unsupervised in-
duction and filling of semantic slots for spoken dialogue systems
using frame-semantic parsing,” in Proceedings of ASRU. IEEE,
2013, pp. 120–125.
[5] Y.-N. Chen, W. Y. Wang, A. Gershman, and A. I. Rudnicky, “Ma-
trix factorization with knowledge graph propagation for unsuper-
vised spoken language understanding,” in Proceedings of ACL-
IJCNLP. ACL, 2015, pp. 483–494.
[6] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-
Tur, X. He, L. Heck, G. Tur, D. Yu et al., “Using recurrent neu-
ral networks for slot filling in spoken language understanding,”
IEEE/ACM Transactions on Audio, Speech, and Language Pro-
cessing, vol. 23, no. 3, pp. 530–539, 2015.
[7] P. J. Price, “Evaluation of spoken language systems: The ATIS
domain,” in Proceedings of the DARPA Workshop on Speech and
Natural Language, Hidden Valley, PA, June 1990.
[8] P. Haffner, G. Tur, and J. Wright, “Optimizing SVMs for complex
call classification,” in Proceedings of the ICASSP, Hong Kong,
April 2003.
[9] C. Chelba, M. Mahajan, and A. Acero, “Speech utterance classi-
fication,” in Proceedings of the ICASSP, Hong Kong, May 2003.
[10] R. E. Schapire and Y. Singer, “Boostexter: A boosting-based
system for text categorization,” Machine Learning, vol. 39, no. 2/3,
pp. 135–168, 2000.
[11] B. Favre, D. Hakkani-Tür, and S. Cuendet, “Icsiboost,” 2007.
[12] R. Pieraccini, E. Tzoukermann, Z. Gorelov, J.-L. Gauvain,
E. Levin, C.-H. Lee, and J. G. Wilpon, “A speech understanding
system based on statistical representation of semantics,” in Pro-
ceedings of the ICASSP, San Francisco, CA, March 1992.
[13] Y.-Y. Wang, L. Deng, and A. Acero, “Spoken language
understanding - an introduction to the statistical framework,” IEEE
Signal Processing Magazine, vol. 22, no. 5, pp. 16–31, September
[14] C. Raymond and G. Riccardi, “Generative and discriminative al-
gorithms for spoken language understanding,” in Proceedings of
the Interspeech, Antwerp, Belgium, 2007.
[15] R. Sarikaya, G. E. Hinton, and B. Ramabhadran, “Deep belief nets
for natural language call-routing,” in Proceedings of the ICASSP,
Prague, Czech Republic, 2011.
[16] G. Tur, L. Deng, D. Hakkani-Tür, and X. He, “Towards deeper
understanding deep convex networks for semantic utterance
classification,” in Proceedings of the ICASSP, Kyoto, Japan, March
[17] L. Deng, G. Tur, X. He, and D. Hakkani-Tür, “Use of kernel deep
convex networks and end-to-end learning for spoken language
understanding,” in Proceedings of the IEEE SLT Workshop,
Miami, FL, December 2012.
[18] R. Sarikaya, G. E. Hinton, and A. Deoras, “Application of deep
belief networks for natural language understanding,” IEEE Trans-
actions on Audio, Speech, and Language Processing, vol. 22,
no. 4, April 2014.
[19] S. Ravuri and A. Stolcke, “Recurrent neural network and LSTM
models for lexical utterance classification,” in Interspeech, 2015.
[20] A. Deoras and R. Sarikaya, “Deep belief network based semantic
taggers for spoken language understanding,” in Proceedings
of the Interspeech, Lyon, France, August 2013.
[21] P. Xu and R. Sarikaya, “Convolutional neural network based tri-
angular CRF for joint intent detection and slot filling,” in Proceed-
ings of the IEEE ASRU, 2013.
[22] D. Guo, G. Tur, W.-T. Yih, and G. Zweig, “Joint semantic
utterance classification and slot filling with recursive neural networks,”
in Proceedings of the IEEE SLT Workshop, 2014.
[23] K. Yao, G. Zweig, M.-Y. Hwang, Y. Shi, and D. Yu, “Recurrent
neural networks for language understanding,” in Proceedings
of the Interspeech, Lyon, France, August 2013.
[24] G. Mesnil, X. He, L. Deng, and Y. Bengio, “Investigation of
recurrent-neural-network architectures and learning methods for
spoken language understanding,” in Proceedings of Interspeech,
[25] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[26] K. Yao, B. Peng, Y. Zhang, D. Yu, G. Zweig, and Y. Shi, “Spo-
ken language understanding using long short-term memory neural
networks,” in Proceedings of the IEEE SLT Workshop, 2014.
[27] J. Elman, “Finding structure in time,” Cognitive Science, vol. 14,
no. 2, 1990.
[28] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term de-
pendencies with gradient descent is difficult,” IEEE Transactions
on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[29] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of train-
ing recurrent neural networks,” arXiv preprint arXiv:1211.5063,
[30] A. Karpathy, J. Johnson, and L. Fei-Fei, “Visualizing and under-
standing recurrent networks,” arXiv preprint arXiv:1506.02078,
November 2015.
[31] R. Jozefowicz, W. Zaremba, and I. Sutskever, “An empirical ex-
ploration of recurrent network architectures,” in Proceedings of
the 32nd International Conference on Machine Learning (ICML-
15), 2015, pp. 2342–2350.
[32] Y.-N. Chen, D. Hakkani-Tür, G. Tur, J. Gao, and D. Li,
“End-to-end memory networks with knowledge carryover for multi-turn
spoken language understanding,” in Proceedings of Interspeech,
[33] A. Graves and J. Schmidhuber, “Framewise phoneme classifica-
tion with bidirectional LSTM and other neural network architec-
tures,” Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.
[34] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence
learning with neural networks,” in Advances in Neural Infor-
mation Processing Systems 27, Z. Ghahramani, M. Welling,
C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds., 2014,
pp. 3104–3112.
[35] O. Vinyals and Q. V. Le, “A neural conversational model,” in
ICML Deep Learning Workshop, 2015.
[36] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient meth-
ods for online learning and stochastic optimization,” Journal of
Machine Learning Research, no. 12, pp. 2121–2159, 2011.
[37] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient esti-
mation of word representations in vector space,” in Workshop at
ICLR, 2013.
[38] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu,
and P. Kuksa, “Natural language processing (almost) from
scratch,” Journal of Machine Learning Research, no. 12, pp.
2483–2537, 2011.
[39] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-
Tur, X. He, L. Heck, G. Tur, D. Yu, and G. Zweig, “Using re-
current neural networks for slot filling in spoken language under-
standing,” IEEE Transactions on Audio, Speech, and Language
Processing, vol. 23, no. 3, March 2015.
[40] T.-H. Wen, M. Gasic, N. Mrksic, P.-H. Su, D. Vandyke, and
S. Young, “Semantically conditioned LSTM-based natural lan-
guage generation for spoken dialogue systems,” arXiv preprint
arXiv:1508.01745, 2015.
... On the other hand, sequential tagging models are conventional methods for addressing the extraction of relational structures (Yang et al., 2019a;Hakkani-Tür et al., 2016;Ma et al., 2020;Zhang et al., 2018;Ramponi et al., 2020). These methods convert the relational structure into a sequential format and make predictions by labeling tokens in the input passage. ...
... Task-oriented semantic parsing. Most works on task-oriented semantic parsing focus on intent classification and slot filling tasks (Tür et al., 2010;Gupta et al., 2018;Hakkani-Tür et al., 2016;Zhang et al., 2018;Louvan and Magnini, 2020). Recently, some more advanced neural network based approaches have been proposed, such as MLP-mixer (Fusco et al., 2022) or sequence-to-sequence formulation (Desai et al., 2021). ...
Relational structure extraction covers a wide range of tasks and plays an important role in natural language processing. Recently, many approaches tend to design sophisticated graphical models to capture the complex relations between objects that are described in a sentence. In this work, we demonstrate that simple tagging models can surprisingly achieve competitive performances with a small trick -- priming. Tagging models with priming append information about the operated objects to the input sequence of pretrained language model. Making use of the contextualized nature of pretrained language model, the priming approach help the contextualized representation of the sentence better embed the information about the operated objects, hence, becomes more suitable for addressing relational structure extraction. We conduct extensive experiments on three different tasks that span ten datasets across five different languages, and show that our model is a general and effective model, despite its simplicity. We further carry out comprehensive analysis to understand our model and propose an efficient approximation to our method, which can perform almost the same performance but with faster inference speed.
... RNN with LSTM can be seen as an improved model of traditional RNN language model, which takes text sentences as input sequence to calculate the error of each model. But when the text sequence information is long, the RNN model with LSTM can effectively overcome the problem of sequence information decay [20]. Compared with traditional RNN language models, RNN with LSTM can fully cover longer sentences, and it performs well in multiple validation experiments, especially for English sentence structures with connectives. ...
Full-text available
Most international academic papers are written in English, and the use of tenses in English academic papers often follows some conventional rules. Automatically extracting and analyzing English tenses in scientific papers have begun to attract researchers’ attention for the global environment. In the analysis of the English tense of scientific papers, consider that the neural network model that combines attention mechanism and sequential input network such as Long Short-Term Memory (LSTM) network has a long training time, low extraction accuracy, and cannot parallelize text input. We propose an environmental affection-driven English tense analysis model, which includes an attention mechanism and LSTM model and conducts a temporal analysis of English texts based on an affective computing model. In this paper, our proposed method is verified based on the self-built healthcare exercise-based corpus over public English environment. By comparison, the experimental results show that the method proposed in this paper has better performance than ordinary Convolutional Neural Network (CNN), Support Vector Machine (SVM), and LSTM based on attention mechanism.
... In Fig. 1, both backward and forward information is clearly shown by the directed arrows in the hidden layer, w n represents the input and y n represents the output respectively. So that at any point in time, the information from both past and future are preserved using the hidden states [14,15]. This special feature of bidirectional LSTM increases the accuracy of the RNN model. ...
Machine learning methods have played a major role in improving the accuracy of prediction and classification of DNA (Deoxyribonucleic Acid) and protein sequences. In eukaryotes, splice-site identification and prediction is not a straightforward job because of numerous false positives. To solve this problem, in this paper we present a bidirectional Long Short Term Memory (LSTM) Recurrent Neural Network (RNN) based deep learning model that has been developed to identify and predict the splice-sites for the prediction of exons from eukaryotic DNA sequences. During the splicing mechanism of the primary mRNA transcript, the introns, the non-coding region of the gene, are spliced out and the exons, the coding region of the gene, are joined. This bidirectional LSTM-RNN model uses the intron features that start with splice site donor (GT) and end with splice site acceptor (AG) in order of its length constraints. The model has been improved by increasing the number of epochs while training. This designed model achieved a maximum accuracy of 95.5%. This model is compatible with huge sequential data such as the complete genome.
... The best model was chosen for each method based on the dev set F1 score. The sentence level semantic frame accuracy is also considered for correctness, where the correct intent label must be predicted and all input tokens must be assigned the correct slot labels without missing or incorrect predictions (Hakkani-Tur et al., 2016). ...
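The sentence-level semantic frame accuracy described in this snippet can be illustrated with a minimal sketch. The function name `frame_accuracy` and the list-based input layout are assumptions made for illustration, not taken from the cited work; the logic only encodes the stated rule that an utterance counts as correct when its intent label and every token-level slot label match the reference.

```python
from typing import List, Sequence


def frame_accuracy(
    gold_intents: Sequence[str],
    pred_intents: Sequence[str],
    gold_slots: Sequence[Sequence[str]],
    pred_slots: Sequence[Sequence[str]],
) -> float:
    """Fraction of utterances whose intent and all slot labels are correct.

    Hypothetical helper: names and data layout are illustrative assumptions.
    """
    n = len(gold_intents)
    if not (n == len(pred_intents) == len(gold_slots) == len(pred_slots)):
        raise ValueError("all inputs must describe the same number of utterances")
    if n == 0:
        return 0.0

    correct = 0
    for g_int, p_int, g_seq, p_seq in zip(
        gold_intents, pred_intents, gold_slots, pred_slots
    ):
        # A frame is correct only if the intent matches AND the predicted
        # slot sequence equals the reference sequence token for token
        # (a missing or extra label makes the sequences unequal).
        if g_int == p_int and list(g_seq) == list(p_seq):
            correct += 1
    return correct / n


# Two utterances: the first is fully correct, the second has one wrong slot.
gold_i: List[str] = ["find_flight", "set_alarm"]
pred_i: List[str] = ["find_flight", "set_alarm"]
gold_s = [["O", "B-to_city"], ["O", "B-time", "I-time"]]
pred_s = [["O", "B-to_city"], ["O", "B-time", "O"]]
score = frame_accuracy(gold_i, pred_i, gold_s, pred_s)
```

Because a single wrong slot label invalidates the whole frame, this metric is stricter than either intent accuracy or slot F1 taken alone.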
Recent joint intent detection and slot tagging models have seen improved performance when compared to individual models. In many real-world datasets, the slot labels and values have a strong correlation with their intent labels. In such cases, the intent label information may act as a useful feature to the slot tagging model. In this paper, we examine the effect of leveraging intent label features through 3 techniques in the slot tagging task of joint intent and slot detection models. We evaluate our techniques on benchmark spoken language datasets SNIPS and ATIS, as well as over a large private Bixby dataset and observe an improved slot-tagging performance over state-of-the-art models.
... Automatic speech recognition (ASR) has shown robustness in the presence of noise largely due to the adoption of neural network based acoustic models [1,2,3,4], large scale training [5,6,7], and improved data augmentation strategies [8,9,10]. Multiple speaker scenarios, however, still pose a challenge [11,12]. ...
One of the most challenging scenarios for smart speakers is multi-talker, when target speech from the desired speaker is mixed with interfering speech from one or more speakers. A smart assistant needs to determine which voice to recognize and which to ignore and it needs to do so in a streaming, low-latency manner. This work presents two multi-microphone speech enhancement algorithms targeted at this scenario. Targeting on-device use-cases, we assume that the algorithm has access to the signal before the hotword, which is referred to as the noise context. First is the Context Aware Beamformer which uses the noise context and detected hotword to determine how to target the desired speaker. The second is an adaptive noise cancellation algorithm called Speech Cleaner which trains a filter using the noise context. It is demonstrated that the two algorithms are complementary in the signal-to-noise ratio conditions under which they work well. We also propose an algorithm to select which one to use based on estimated SNR. When using 3 microphone channels, the final system achieves a relative word error rate reduction of 55% at -12dB, and 43% at 12dB.
... Robustness of automatic speech recognition in the presence of noise has made significant gains in recent years. This can be largely attributed to the adoption of neural network based acoustic models [1,2,3,4] and large scale training [5,6,7] coupled with improved data augmentation strategies [8,9,10]. However, conditions like reverberation, significant background noise, and competing speech still pose a formidable challenge for ASR models [11,12]. ...
This work introduces the Cleanformer, a streaming multichannel neural based enhancement frontend for automatic speech recognition (ASR). This model has a conformer-based architecture which takes as inputs a single channel each of raw and enhanced signals, and uses self-attention to derive a time-frequency mask. The enhanced input is generated by a multichannel adaptive noise cancellation algorithm known as Speech Cleaner, which makes use of noise context to derive its filter taps. The time-frequency mask is applied to the noisy input to produce enhanced output features for ASR. Detailed evaluations are presented with simulated and re-recorded datasets in speech-based and non-speech-based noise that show significant reduction in word error rate (WER) when using a large-scale state-of-the-art ASR model. It also will be shown to significantly outperform enhancement using a beamformer with ideal steering. The enhancement model is agnostic of the number of microphones and array configuration and, therefore, can be used with different microphone arrays without the need for retraining. It is demonstrated that performance improves with more microphones, up to 4, with each additional microphone providing a smaller marginal benefit. Specifically, for an SNR of -6dB, relative WER improvements of about 80% are shown in both noise conditions.
... ; Liu and Lane (2016a,b); Hakkani-Tür et al. (2016) consider an implicit joint mechanism using a multi-task framework by sharing an encoder for both tasks. Goo et al. (2018); Li et al. (2018); Qin et al. (2019) consider explicitly leveraging intent detection information to guide slot filling. Wang et al. (2018); E et al. (2019); Zhang et al. (2020); Qin et al. (2021a) use a bi-directional connection between slot filling and intent detection. ...
Due to high data demands of current methods, attention to zero-shot cross-lingual spoken language understanding (SLU) has grown, as such approaches greatly reduce human annotation effort. However, existing models solely rely on shared parameters, which can only perform implicit alignment across languages. We present Global--Local Contrastive Learning Framework (GL-CLeF) to address this shortcoming. Specifically, we employ contrastive learning, leveraging bilingual dictionaries to construct multilingual views of the same utterance, then encourage their representations to be more similar than negative example pairs, which explicitly aligns representations of similar sentences across languages. In addition, a key step in GL-CLeF is a proposed Local and Global component, which achieves a fine-grained cross-lingual transfer (i.e., sentence-level Local intent transfer, token-level Local slot transfer, and semantic-level Global transfer across intent and slot). Experiments on MultiATIS++ show that GL-CLeF achieves the best performance and successfully pulls representations of similar sentences across languages closer.
Intent detection and slot filling are two main tasks in natural language understanding and play an essential role in task-oriented dialogue systems. The joint learning of both tasks can improve inference accuracy and is popular in recent works. However, most joint models ignore the inference latency and cannot meet the need to deploy dialogue systems at the edge. In this paper, we propose a Fast Attention Network (FAN) for joint intent detection and slot filling tasks, guaranteeing both accuracy and latency. Specifically, we introduce a clean and parameter-refined attention module to enhance the information exchange between intent and slot, improving semantic accuracy by more than 2%. FAN can be implemented on different encoders and delivers more accurate models at every speed level. Our experiments on the Jetson Nano platform show that FAN inferences fifteen utterances per second with a small accuracy drop, showing its effectiveness and efficiency on edge devices.
The success of deep learning methods has stimulated the rapid development of many NLP research areas. Still, task-oriented dialogue modelling remains challenging due to both the inherent complexity of human language and task difficulty. Moreover, building such systems usually relies on large amounts of data with fine-grained annotations, and in many situations, it is difficult to obtain such data. It is thus important for dialogue systems to learn efficiently in low-resource scenarios so that the models can still effectively fulfill their tasks. This thesis aims to provide novel methods to tackle these difficulties in dialogue modelling. To communicate, most commonly, a dialogue system converts a semantic representation (e.g., a dialogue act) into natural language in a process known as Natural Language Generation (NLG). A tree-based NLG model is proposed and shown to be more easily adapted to unseen domains in comparison to other models. This desirable property arises due to the fact that modelling semantic structure facilitates knowledge sharing between source and target domains. We also show that the NLG task can be jointly learned with its dual task, the natural language understanding (NLU), which maps natural language utterances to semantic counterparts. Our approach consists of a stochastic generative model with a shared latent variable for two tasks, which can be trained with significantly less data than individual components. The focus then shifts to a more general setup of dialogue generation. In end-to-end dialogue modelling, systems consume user utterances and learn to directly generate responses, where intermediate dialogue acts are usually used as auxiliary learning signals for model optimisation. We show that semi-supervised methods that were proposed for computer vision tasks can be beneficial to dialogue modelling. We also address the problem of developing dialogue systems when little training data is available. 
To this end, we propose a learning framework where user and dialogue models are jointly optimised. We show that the data generated by their interaction can be used to further optimise the two models and leads to improved model performance. Yet again, this approach reduces the amount of data for end-to-end dialogue modelling on low-resource domains. Lastly, an understanding model is proposed to address the prevalent phenomena of coreference and ellipsis in dialogues. This model first performs coreference resolution and then rewrites the input user utterance into a complete sentence that resolves coreferent entities and omitted information. As a side contribution, the acquired data for model training is released to the research community.
Slot filling and intent prediction are basic tasks in capturing the semantic frame of human utterances. Slots and intent have a strong correlation for semantic frame parsing. For each utterance, a specific intent type is generally determined with the indication information of words having slot tags (called slot words), and in reverse the intent type decides that words of certain categories should be used to fill as slots. However, the Intent-Slot correlation is rarely modeled explicitly in existing studies, and hence may not be fully exploited. In this paper, we model Intent-Slot correlation explicitly and propose a new framework for joint intent prediction and slot filling. Firstly, we explore the effects of slot words on intent by differentiating them from the other words, and we recognize slot words by solving a sequence labeling task with the bi-directional long short-term memory (BiLSTM) model. Then, slot recognition information is introduced into attention-based intent prediction and slot filling to improve semantic results. In addition, we integrate the Slot-Gated mechanism into slot filling to model dependency of slots on intent. Finally, we obtain slot recognition, intent prediction and slot filling by training with joint optimization. Experimental results on the benchmark Airline Travel Information System (ATIS) and Snips datasets show that our Intent-Slot correlation model achieves state-of-the-art semantic frame performance with a lightweight structure.
Conference Paper
Spoken language understanding (SLU) is a core component of a spoken dialogue system. In the traditional architecture of dialogue systems, the SLU component treats each utterance independent of each other, and then the following components aggregate the multi-turn information in the separate phases. However, there are two challenges: 1) errors from previous turns may be propagated and then degrade the performance of the current turn; 2) knowledge mentioned in the long history may not be carried into the current turn. This paper addresses the above issues by proposing an architecture using end-to-end memory networks to model knowledge carryover in multi-turn conversations, where utterances encoded with intents and slots can be stored as embeddings in the memory and the decoding phase applies an attention model to leverage previously stored semantics for intent prediction and slot tagging simultaneously. The experiments on Microsoft Cortana conversational data show that the proposed memory network architecture can effectively extract salient semantics for modeling knowledge carryover in the multi-turn conversations and outperform the results using the state-of-the-art recurrent neural network framework (RNN) designed for single-turn SLU.
Natural language generation (NLG) is a critical component of spoken dialogue and it has a significant impact both on usability and perceived quality. Most NLG systems in common use employ rules and heuristics and tend to generate rigid and stylised responses without the natural variation of human language. They are also not easily scaled to systems covering multiple domains and languages. This paper presents a statistical language generator based on a semantically controlled Long Short-term Memory (LSTM) structure. The LSTM generator can learn from unaligned data by jointly optimising sentence planning and surface realisation using a simple cross entropy training criterion, and language variation can be easily achieved by sampling from output candidates. An objective evaluation in two differing test domains showed improved performance compared to previous methods with fewer heuristics. Human judges scored the LSTM system higher on informativeness and naturalness and overall preferred it to the other systems.
Conference Paper
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.7 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a strong phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which beats the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
Conference Paper
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
Conference Paper
The recent surge of intelligent personal assistants motivates spoken language understanding of dialogue systems. However, the domain constraint along with the inflexible intent schema remains a big issue. This paper focuses on the task of intent expansion, which helps remove the domain limit and make an intent schema flexible. A convolutional deep structured semantic model (CDSSM) is applied to jointly learn the representations for human intents and associated utterances. Then it can flexibly generate new intent embeddings without the need of training samples and model-retraining, which bridges the semantic relation between seen and unseen intents and further performs more robust results. Experiments show that CDSSM is capable of performing zero-shot learning effectively, e.g. generating embeddings of previously unseen intents, and therefore expand to new intents without retraining, and outperforms other semantic embeddings. The discussion and analysis of experiments provide a future direction for reducing human effort about annotating data and removing the domain constraint in spoken dialogue systems. Index Terms: zero-shot learning, spoken language understanding (SLU), spoken dialogue system (SDS), convolutional deep structured semantic model (CDSSM), embeddings, expansion.
One of the key problems in spoken language understanding (SLU) is the task of slot filling. In light of the recent success of applying deep neural network technologies in domain detection and intent identification, we carried out an in-depth investigation on the use of recurrent neural networks for the more difficult task of slot filling involving sequence discrimination. In this work, we implemented and compared several important recurrent-neural-network architectures, including the Elman-type and Jordan-type recurrent networks and their variants. To make the results easy to reproduce and compare, we implemented these networks on the common Theano neural network toolkit, and evaluated them on the ATIS benchmark. We also compared our results to a conditional random fields (CRF) baseline. Our results show that on this task, both types of recurrent networks outperform the CRF baseline substantially, and a bi-directional Jordan-type network that takes into account both past and future dependencies among slots works best, outperforming a CRF-based baseline by 14% in relative error reduction.
This paper investigates the use of deep belief networks (DBN) for semantic tagging, a sequence classification task, in spoken language understanding (SLU). We evaluate the performance of the DBN based sequence tagger on the well-studied ATIS task and compare our technique to conditional random fields (CRF), a state-of-the-art classifier for sequence classification. In conjunction with lexical and named entity features, we also use dependency parser based syntactic features and part of speech (POS) tags [1]. Under both noisy conditions (output of automatic speech recognition system) and clean conditions (manual transcriptions), our deep belief network based sequence tagger outperforms the best CRF based system described in [1] by an absolute 2% and 1% F-measure, respectively. Upon carrying out an analysis of cases where CRF and DBN models made different predictions, we observed that when discrete features are projected onto a continuous space during neural network training, the model learns to cluster these features leading to its improved generalization capability, relative to a CRF model, especially in cases where some features are either missing or noisy.
In natural language understanding (NLU), a user utterance can be labeled differently depending on the domain or application (e.g., weather vs. calendar). Standard domain adaptation techniques are not directly applicable to take advantage of the existing annotations because they assume that the label set is invariant. We propose a solution based on label embeddings induced from canonical correlation analysis (CCA) that reduces the problem to a standard domain adaptation task and allows use of a number of transfer learning techniques. We also introduce a new transfer learning technique based on pretraining of hidden-unit CRFs (HUCRFs). We perform extensive experiments on slot tagging on eight personal digital assistant domains and demonstrate that the proposed methods are superior to strong baselines.
In recent years, continuous space models have proven to be highly effective at language processing tasks ranging from paraphrase detection to language modeling. These models are distinctive in their ability to achieve generalization through continuous space representations, and compositionality through arithmetic operations on those representations. Examples of such models include feed-forward and recurrent neural network language models. Recursive neural networks (RecNNs) extend this framework by providing an elegant mechanism for incorporating both discrete syntactic structure and continuous-space word and phrase representations into a powerful compositional model. In this paper, we show that RecNNs can be used to perform the core spoken language understanding (SLU) tasks in a spoken dialog system, more specifically domain and intent determination, concurrently with slot filling, in one jointly trained model. We find that a very simple RecNN model achieves competitive performance on the benchmark ATIS task, as well as on a Microsoft Cortana conversational understanding task.