
Multi-Domain Joint Semantic Frame Parsing using Bi-directional RNN-LSTM

Dilek Hakkani-Tür, Gokhan Tur, Asli Celikyilmaz, Yun-Nung Chen,
Jianfeng Gao, Li Deng, and Ye-Yi Wang
Microsoft, Redmond, WA, USA
National Taiwan University, Taipei, Taiwan
{dilek, gokhan.tur, asli, y.v.chen}@ieee.org, {jfgao, deng, yeyiwang}@microsoft.com
Abstract
Sequence-to-sequence deep learning has recently emerged as a new
paradigm in supervised learning for spoken language understanding.
However, most of the previous studies explored this framework for
building single domain models for each task, such as slot filling or
domain classification, comparing deep learning based approaches with
conventional ones like conditional random fields. This paper proposes
a holistic multi-domain, multi-task (i.e., slot filling, domain and
intent detection) modeling approach to estimate complete semantic
frames for all user utterances addressed to a conversational system,
demonstrating the distinctive power of deep learning methods, namely
a bi-directional recurrent neural network (RNN) with long short-term
memory (LSTM) cells (RNN-LSTM), to handle such complexity. The
contributions of the presented work are three-fold: (i) we propose an
RNN-LSTM architecture for joint modeling of slot filling, intent
determination, and domain classification; (ii) we build a joint
multi-domain model enabling multi-task deep learning where the data
from the domains reinforce each other; (iii) we investigate
alternative architectures for modeling lexical context in spoken
language understanding. In addition to the simplicity of the single
model framework, experimental results show the power of such an
approach on Microsoft Cortana real user data over alternative methods
based on single domain/task deep learning.
Index Terms: recurrent neural networks, long short-term memory,
multi-domain language understanding, joint modeling
1. Introduction
In the last decade, a variety of practical goal-oriented conver-
sation understanding systems have been built for a number of
domains, such as the virtual personal assistants Microsoft’s Cor-
tana and Apple’s Siri. Three key tasks in such targeted un-
derstanding applications are domain classification, intent deter-
mination and slot filling [1], aiming to form a semantic frame
that captures the semantics of user utterances/queries. Domain
classification is often completed first in spoken language under-
standing (SLU) systems, serving as a top-level triage for subse-
quent processing. Intent determination and slot filling are then
run for each domain to fill a domain specific semantic template.
An example semantic frame for a movie-related utterance, “find
recent comedies by James Cameron”, is shown in Figure 1.
W   find   recent   comedies   by   james   cameron
S   O      B-date   B-genre    O    B-dir   I-dir
D   movies
I   find movie
Figure 1: An example utterance (W) with annotations of semantic
slots in IOB format (S), domain (D), and intent (I). B-dir and
I-dir denote the director name.
This modular design approach (i.e., modeling SLU as 3 tasks) has
the advantage of flexibility; specific modifications (e.g.,
insertions, deletions) to a domain can be implemented without
requiring changes to other domains. Another advantage is that, in
this approach, one can use task/domain specific features, which
often significantly improve the accuracy of these task/domain
specific models. Also, this approach often yields more focused
understanding in each domain since the intent determination only
needs to consider a relatively small set of intent and slot classes
over a single (or limited set) of domains, and model parameters
could be optimized for the specific set of intents and slots.
However, this approach also has disadvantages. First of all, one
needs to train these models for each domain. This is an error-prone
process, requiring careful engineering to ensure consistency in
processing across domains. Also, during run-time, such pipelining of
tasks results in transfer of errors from one task to the following
tasks. Furthermore, there is no data or feature sharing between the
individual domain models, resulting in data fragmentation, whereas
some semantic intents (such as finding or buying a domain specific
entity) and slots (such as dates, times, and locations) could
actually be common to many domains [2, 3]. Finally, the users may
not know which domains are covered by the system and to what extent,
so they do not know what to expect from an interaction, which
results in user dissatisfaction [4, 5].
We propose a single recurrent neural network (RNN) archi-
tecture that integrates the three tasks of domain detection, intent
detection and slot filling for multiple domains in a single SLU
model. This model is trained using all available utterances from
all domains, paired with their semantic frames. The input of
this RNN is the input sequence of words (e.g., user queries) and
the output is the full semantic frame, including domain, intent,
and slots, as shown in Figure 1. Since the dependency between
the words is important for SLU tasks, we investigate alternative
architectures for integrating lexical context and dependencies.
We compare the single model approach to alternative ways of
building models for multi-task, multi-domain scenarios.
The next section sets the baseline RNN-LSTM architecture
based on the slot filling task [6], and explores various architec-
tures for exploiting lexical contexts. In Section 3, we extend this
architecture to model domains and intents of user utterances in
addition to slot filling, and propose a multi-domain multi-task
architecture for SLU. In the experiments, we first investigate the
performance of alternative architectures on the benchmark ATIS
data set [7], and then on the Microsoft Cortana multi-domain
data. We show that the single multi-domain, joint model ap-
proach is not only simpler, but also results in the best F-measure
in experimental results.
2. Deep Learning for SLU
A major task in spoken language understanding in goal-oriented
human-machine conversational understanding systems is to
automatically classify the domain of a user query along with domain
specific intents and fill in a set of arguments or “slots” to form a
semantic frame. In this study, we follow the popular IOB
(in-out-begin) format for representing the slot tags as shown in
Figure 1.
A detailed survey of pre-deep learning era approaches for
domain detection, intent determination, and slot filling can
be found in [1]. Basically, domain detection and intent de-
termination tasks are framed as classification problems, for
which researchers have employed support vector machines [8],
maximum entropy classifiers [9], or boosting based classi-
fiers [10, 11]. Similarly, slot filling is framed as a sequence
classification problem and hidden Markov models [12] and con-
ditional random fields [13, 14] have been employed.
With the advances in deep learning, deep belief networks (DBNs)
with deep neural networks (DNNs) were first employed for intent
determination in call centers [15], and later for domain
classification in personal assistants [16, 17, 18]. More recently,
an RNN architecture with LSTM cells has been employed for intent
classification [19].
For slot filling, deep learning research started as extensions of
DNNs and DBNs (e.g., [20]) and is sometimes merged with CRFs [21].
One notable extension is the use of recursive neural networks,
framing the problem as semantic parsing [22]. To the best of our
knowledge, RNNs were first employed for slot filling by Yao et
al. [23] and Mesnil et al. [24] concurrently. We have compiled a
comprehensive review of RNN based slot filling approaches in [6].
Especially with the re-discovery of LSTM cells [25] for
RNNs, this architecture has started to emerge [26]. LSTM cells
are shown to have superior properties, such as faster conver-
gence and elimination of the problem of vanishing or exploding
gradients in sequence via self-regularization, as presented be-
low. As a result, LSTM is more robust than RNN in capturing
long-span dependencies.
2.1. RNN with LSTM cells for slot filling
To estimate the sequence of tags $Y = y_1, \ldots, y_n$
corresponding to an input sequence of tokens $X = x_1, \ldots, x_n$,
we use the Elman RNN architecture [27], consisting of an input
layer, a hidden layer (for the single layer version), and an output
layer. The input, hidden, and output layers consist of a set of
neurons representing the input, hidden, and output at each time step
$t$: $x_t$, $h_t$, and $y_t$, respectively. The input is typically
represented by a 1-hot vector or word level embeddings. Given the
input layer $x_t$ at time $t$, and the hidden state from the previous
time step $h_{t-1}$, the hidden and output layers for the current
time step are computed as follows:

$$h_t = \phi\left(W_{xh} \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix}\right) \tag{1}$$

$$p_t = \mathrm{softmax}(W_{hy} h_t) \tag{2}$$

$$\hat{y}_t = \arg\max p_t \tag{3}$$

where $W_{xh}$ and $W_{hy}$ are the matrices that denote the weights
between the input and hidden layers and the hidden and output
layers, respectively. $\phi$ denotes the activation function, i.e.,
tanh or sigm.

Figure 2: LSTM cell, as depicted in [31].

The softmax is defined as
$\mathrm{softmax}(z_m) = e^{z_m} / \sum_i e^{z_i}$. The weights of the
model are trained using backpropagation to maximize the conditional
likelihood of the training set labels:

$$\prod_t p(y_t \mid x_1, \ldots, x_t). \tag{4}$$
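As an illustration of Eqs. (1)-(3), the following is a minimal NumPy sketch of one Elman RNN time step applied to a short utterance. It is not the authors' implementation: the function names, the toy layer sizes, and the small uniform initialization range are assumptions made only for this example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def elman_step(x_t, h_prev, W_xh, W_hy, phi=np.tanh):
    """One time step: returns (h_t, p_t, index of the predicted tag)."""
    h_t = phi(W_xh @ np.concatenate([h_prev, x_t]))  # Eq. (1)
    p_t = softmax(W_hy @ h_t)                         # Eq. (2)
    return h_t, p_t, int(np.argmax(p_t))              # Eq. (3)

# Toy sizes (assumed): 5 vocabulary words, 4 hidden units, 3 tags.
rng = np.random.default_rng(0)
vocab_size, hidden_size, num_tags = 5, 4, 3
W_xh = rng.uniform(-0.01, 0.01, (hidden_size, hidden_size + vocab_size))
W_hy = rng.uniform(-0.01, 0.01, (num_tags, hidden_size))

h = np.zeros(hidden_size)            # initial hidden state
predictions = []
for word_id in [0, 3, 1]:            # a three-word utterance as word indices
    x = np.eye(vocab_size)[word_id]  # 1-hot input vector
    h, p, y_hat = elman_step(x, h, W_xh, W_hy)
    predictions.append(y_hat)
```

With untrained weights the predicted tags are arbitrary; the point is only the data flow from the previous hidden state and the current input to the tag posterior.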
Previous work [28] has shown that training model parameters with
backpropagation through time could result in exploding or vanishing
gradients. Exploding gradients could be alleviated by gradient
clipping [29], but this does not help vanishing gradients. LSTM
cells [25] were designed to mitigate the vanishing gradient problem.
In addition to the hidden layer vector $h_t$, LSTMs maintain a memory
vector, $c_t$, which they can choose to read from, write to, or reset
using a gating mechanism and sigmoid functions. The input gate,
$i_t$, is used to scale down the input; the forget gate, $f_t$, is
used to scale down the memory vector $c_t$; the output gate, $o_t$,
is used to scale down the output to reach the final $h_t$. Following
the precise formulation of [30], these gates in LSTMs are computed as
follows, as also shown in Figure 2:

$$\begin{bmatrix} i_t \\ f_t \\ o_t \\ g_t \end{bmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} W_t \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}, \tag{5}$$

where sigm and tanh are applied element-wise, $W_t$ is the weight
matrix, and

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t, \tag{6}$$

$$h_t = o_t \odot \tanh(c_t). \tag{7}$$
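The gate computation of Eqs. (5)-(7) can be sketched in a few lines of NumPy. This is a hedged illustration rather than the authors' code: the function names, the toy sizes, and the choice to stack the four gate blocks in the order i, f, o, g inside a single weight matrix are assumptions, and bias terms are omitted because the equations above do not show any.

```python
import numpy as np

def sigm(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step. W has shape (4*H, D+H) and stacks the i, f, o, g blocks."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev])  # shared affine map of Eq. (5)
    i_t = sigm(z[0:H])                     # input gate
    f_t = sigm(z[H:2 * H])                 # forget gate
    o_t = sigm(z[2 * H:3 * H])             # output gate
    g_t = np.tanh(z[3 * H:4 * H])          # candidate memory content
    c_t = f_t * c_prev + i_t * g_t         # Eq. (6)
    h_t = o_t * np.tanh(c_t)               # Eq. (7)
    return h_t, c_t

# Toy sizes (assumed): 5-dimensional 1-hot inputs, 4 hidden units.
rng = np.random.default_rng(0)
D, H = 5, 4
W = rng.uniform(-0.01, 0.01, (4 * H, D + H))

h, c = np.zeros(H), np.zeros(H)
for word_id in [0, 3, 1]:
    h, c = lstm_step(np.eye(D)[word_id], h, c, W)
```

Because the output gate lies in (0, 1) and tanh lies in (-1, 1), every component of the hidden state stays strictly inside (-1, 1), whatever the weights are.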
2.2. Integration of context
In SLU, word tags are not only determined by the associated
terms, but also contexts [32]. For example, in ATIS data, the
city name Boston could be tagged as originating or destination
city, according to the lexical context it appears in. For capturing
such dependencies, we investigated two extensions to the RNN-
LSTM architecture (Figure 3.(a)): look-around LSTM (LSTM-
LA) and bi-directional LSTM (bLSTM) [33].
At each time step, in addition to $x_t$, LSTM-LA (Figure 3.(b))
considers the following and preceding words as part of the input, by
concatenating the input vectors for the neighboring words. In this
work, our input at time $t$ consisted of a single vector formed by
concatenating $x_{t-1}$, $x_t$, $x_{t+1}$.
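The look-around input can be built with a shifted concatenation. The sketch below assumes zero vectors at the utterance boundaries, which the text does not specify; the function name and the toy vocabulary are also illustrative only.

```python
import numpy as np

def look_around_inputs(X):
    """Map an (n, d) matrix of word vectors to an (n, 3d) matrix.

    Row t of the result is [x_{t-1}, x_t, x_{t+1}]; missing neighbours at
    the first and last position are replaced by zero vectors (an assumption).
    """
    n, d = X.shape
    padded = np.vstack([np.zeros((1, d)), X, np.zeros((1, d))])
    return np.hstack([padded[:-2], padded[1:-1], padded[2:]])

# Three words from a toy vocabulary of size 4, as 1-hot rows.
X = np.eye(4)[[2, 0, 3]]
LA = look_around_inputs(X)
```

Here the middle row of `LA` is the concatenation of the 1-hot vectors for words 2, 0, and 3, so the LSTM sees one word of left and right context at every step without any change to the recurrence itself.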
Figure 3: RNN-LSTM architectures used in this work: (a) LSTM,
(b) LSTM-LA, (c) bLSTM-LA, (d) intent LSTM.

In bLSTM (Figure 3.(c)), two LSTM architectures are traversed in a
left-to-right and right-to-left manner, and their hidden layers are
concatenated when computing the output sequence (we use the
superscripts $b$ and $f$ for denoting parameters for the backward and
forward directions):

$$p_t = \mathrm{softmax}(W^f_{hy} h^f_t + W^b_{hy} h^b_t), \tag{8}$$

where forward and backward gates are computed respectively as
follows:

$$\begin{bmatrix} i^f_t \\ f^f_t \\ o^f_t \\ g^f_t \end{bmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} W^f_t \begin{bmatrix} x_t \\ h^f_{t-1} \end{bmatrix}, \tag{9}$$

$$\begin{bmatrix} i^b_t \\ f^b_t \\ o^b_t \\ g^b_t \end{bmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} W^b_t \begin{bmatrix} x_t \\ h^b_{t+1} \end{bmatrix}. \tag{10}$$
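A compact sketch of the bi-directional pass follows: one LSTM is run over the words in order, a second one over the reversed words, and the two hidden states at each position are combined as in Eq. (8). All names, sizes, and the random weights are assumptions for illustration; the gate layout (i, f, o, g stacked in one matrix, no biases) is also an assumption.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def lstm_pass(X, W, H):
    """Run an LSTM over the rows of X in the given order; return all hidden states."""
    h, c = np.zeros(H), np.zeros(H)
    states = []
    for x_t in X:
        z = W @ np.concatenate([x_t, h])
        i, f, o = sigm(z[:H]), sigm(z[H:2 * H]), sigm(z[2 * H:3 * H])
        g = np.tanh(z[3 * H:])
        c = f * c + i * g
        h = o * np.tanh(c)
        states.append(h)
    return np.array(states)

def blstm_tag_posteriors(X, W_f, W_b, W_hy_f, W_hy_b, H):
    """Per-word tag posteriors from a forward and a backward LSTM."""
    h_f = lstm_pass(X, W_f, H)              # left-to-right, Eq. (9)
    h_b = lstm_pass(X[::-1], W_b, H)[::-1]  # right-to-left, re-aligned, Eq. (10)
    return np.array([softmax(W_hy_f @ hf + W_hy_b @ hb)  # Eq. (8)
                     for hf, hb in zip(h_f, h_b)])

# Toy sizes (assumed): 6-word vocabulary, 4 hidden units, 3 tags, 5 words.
rng = np.random.default_rng(0)
D, H, K, n = 6, 4, 3, 5
X = np.eye(D)[rng.integers(0, D, size=n)]
W_f = rng.uniform(-0.1, 0.1, (4 * H, D + H))
W_b = rng.uniform(-0.1, 0.1, (4 * H, D + H))
W_hy_f = rng.uniform(-0.1, 0.1, (K, H))
W_hy_b = rng.uniform(-0.1, 0.1, (K, H))
P = blstm_tag_posteriors(X, W_f, W_b, W_hy_f, W_hy_b, H)
```

Summing the two projections is equivalent to multiplying the concatenated hidden state by a single concatenated output matrix, which is how the "concatenation" in the text and the sum in Eq. (8) fit together.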
In order to make the implementation more efficient, many of the
shared computations are done once, such as input vector preparation
or top level gradient computation, $p_t - \mathrm{truth}_t$, where
$\mathrm{truth}_t$ is the 1-hot vector for the target tag.
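The top-level gradient has a simple closed form when the loss is the negative log-likelihood of the softmax output: the derivative with respect to the pre-softmax scores is the posterior minus the 1-hot target. The short numerical check below assumes that loss (consistent with maximizing the likelihood in Eq. (4)) and uses made-up scores; it compares the closed form against central finite differences.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def neg_log_likelihood(z, target):
    """Loss for one time step given pre-softmax scores z and the gold tag index."""
    return -np.log(softmax(z)[target])

z = np.array([0.5, -1.0, 2.0])   # toy pre-softmax scores for 3 tags
target = 2                       # index of the gold tag
truth = np.eye(3)[target]        # 1-hot target vector

analytic = softmax(z) - truth    # closed-form gradient w.r.t. z

eps = 1e-6
numeric = np.array([
    (neg_log_likelihood(z + eps * np.eye(3)[k], target)
     - neg_log_likelihood(z - eps * np.eye(3)[k], target)) / (2 * eps)
    for k in range(3)
])
```

The two vectors agree to numerical precision, and the closed form needs no extra passes through the softmax, which is why it is a natural candidate for computing once and sharing.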
Figure 3 depicts these three architectures, as well as the in-
tent LSTM architecture of [19] that we used for modeling of
intents and domains in isolation as the baseline.
3. Joint, Multi-Domain Modeling of
Domain, Intent and Slots
A commonly used approach to represent slot tags for slot filling is
associating each input word $w_t$ of utterance $k$ with an IOB-style
tag as exemplified in Figure 1, hence the input sequence $X$ is
$w_1, \ldots, w_n$ and the output is the sequence of slot tags
$s_1, \ldots, s_n$. We follow this approach and associate a slot tag
with each word.
For joint modeling of domain, intent, and slots, we assume an
additional token at the end of each input utterance $k$, <EOS>, and
associate a combination of domain and intent tags $d_k$ and $i_k$ to
this sentence final token by concatenating these tags. Hence, the
new input and output sequences are:

$$X = w_1, \ldots, w_n, \langle\mathrm{EOS}\rangle$$

$$Y = s_1, \ldots, s_n, d_k i_k$$
The main rationale of this idea is similar to the
sequence-to-sequence modeling approach, as used in machine
translation [34] or chit-chat [35] systems. The last hidden layer
of the query is supposed to contain a latent semantic representation
of the whole input utterance, so that it can be utilized for domain
and intent prediction ($d_k i_k$).
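The construction of the joint training sequences can be sketched as follows, using the utterance of Figure 1. The helper name, the `+` separator between domain and intent, and the underscore spelling of the intent label are assumptions for illustration; the text only states that the two tags are concatenated and attached to the final <EOS> token.

```python
def make_joint_example(words, slot_tags, domain, intent, eos="<EOS>", sep="+"):
    """Append <EOS> to the words and a combined domain/intent tag to the slot tags."""
    if len(words) != len(slot_tags):
        raise ValueError("need exactly one slot tag per word")
    joint_tag = f"{domain}{sep}{intent}"  # one label for the sentence-final token
    return list(words) + [eos], list(slot_tags) + [joint_tag]

X, Y = make_joint_example(
    ["find", "recent", "comedies", "by", "james", "cameron"],
    ["O", "B-date", "B-genre", "O", "B-dir", "I-dir"],
    domain="movies",
    intent="find_movie",
)
```

After this step the tagger sees a single label vocabulary that mixes slot tags with combined domain/intent tags, so domain and intent prediction becomes one more tagging decision at the last position.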
4. Experiments
For training all architectures, we used mini-batch stochastic
gradient descent with a batch size of 10 examples and adagrad [36].
We experimented with different hidden layer sizes in {50, 75, 100,
125, 150} and a fixed learning rate in {0.01, 0.05, 0.1} in all of
the experiments. We used only lexical features (i.e., no
dictionaries), and represented input with 1-hot word vectors,
including all the vocabulary terms. In addition to 1-hot word
vectors, we experimented with word2vec [37] and Senna [38]
embeddings, and did not observe significant performance improvement,
hence only results with 1-hot vectors are reported. All parameters
were uniformly initialized in [−0.01, 0.01].
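For readers unfamiliar with adagrad [36], its per-parameter update can be sketched as below. This is a generic illustration of the method, not the training code used in the experiments: the function name, the stabilizing constant `eps`, and the toy quadratic objective are assumptions, and in real training `grad` would be the gradient averaged over a mini-batch of 10 utterances.

```python
import numpy as np

def adagrad_step(param, grad, hist, lr=0.1, eps=1e-8):
    """One adagrad update: scale each coordinate by its accumulated squared gradients."""
    hist = hist + grad ** 2
    param = param - lr * grad / (np.sqrt(hist) + eps)
    return param, hist

# Toy objective (assumed): 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0, 0.5])
hist = np.zeros_like(w)
start_norm = np.linalg.norm(w)
for _ in range(100):
    grad = w
    w, hist = adagrad_step(w, grad, hist)
end_norm = np.linalg.norm(w)
```

The base learning rate stays fixed, as in the text, while the effective step size of each coordinate shrinks as its squared gradients accumulate.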
4.1. Data sets
For investigating the integration of contexts for slot filling, we
experimented with the benchmark ATIS data set [7] for the air travel
domain. For experiments related to joint domain, intent, and slot
modeling, four domains were chosen: alarm, calendar, communication,
and technical, to create a diverse set in terms of vocabulary size
and number of intents and slots. The number of training,
development, and test utterances, vocabulary size, and number of
intents and slots for each of these data sets are listed in Table 1.
As seen in the last row of this table, the number of intents and
slots in the joined data set (59 and 42) is less than the sum of the
number of intents and slots in the individual domains (66 and 64),
because some of these are shared across different domains.
4.2. Slot Filling Experiments
The ATIS data set comes with a commonly used training and test
split [7]. For tuning parameters, we further split the training set
into a 90% training and a 10% development set. After choosing the
parameters that maximize the F-measure on the development set, we
retrained the model with all of the training data with the optimum
parameter set with 10 different initializations and averaged
F-measures. The maximum F-measure (best F) is computed on the test
set when 90% of the training examples were used, and the average
F-measure (avg. F) is computed by averaging F-measure from the 10
runs when all the training examples are used with the optimum
parameters. These results are shown in Table 2. We get the best
F-measure with the bi-directional LSTM architecture (though
comparable with LSTM-LA). The relative performances of RNN, LSTM,
and LSTM-LA are in line with our earlier work [39], though the
F-measure is slightly lower due to differences in normalization.
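The text does not define how the slot F-measure is computed. A common convention for IOB-tagged data, assumed here, is to score whole slot chunks with exact boundary and label match, as in CoNLL-style evaluation. The sketch below follows that assumption; the function names and the example prediction are illustrative.

```python
def iob_chunks(tags):
    """Return the set of (start, end_exclusive, slot_name) chunks in an IOB sequence."""
    chunks = set()
    start, name = None, None
    for pos, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing chunk
        continues = tag.startswith("I-") and tag[2:] == name
        if name is not None and not continues:
            chunks.add((start, pos, name))
            start, name = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and name is None):
            start, name = pos, tag[2:]
    return chunks

def slot_f_measure(gold_seqs, pred_seqs):
    """Chunk-level F-measure over a list of gold and predicted tag sequences."""
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = iob_chunks(gold), iob_chunks(pred)
        correct += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if correct == 0:
        return 0.0
    precision, recall = correct / n_pred, correct / n_gold
    return 2 * precision * recall / (precision + recall)

gold = [["O", "B-date", "B-genre", "O", "B-dir", "I-dir"]]
pred = [["O", "B-date", "B-genre", "O", "B-dir", "O"]]  # director name cut short
f = slot_f_measure(gold, pred)
```

In this example two of the three gold chunks are matched exactly and three chunks are predicted, so precision and recall are both 2/3; a truncated slot counts as a miss and as a false alarm at the same time.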
4.3. Multi-Domain, Joint Model Experiments
Following the slot filling experiments, we used bi-directional
LSTM for modeling slots alone and jointly modeling intent and
slots, and following [19], we use LSTM for modeling intents.
Table 1: Data sets used in the experiments. For each domain, the number of examples in the training, dev, and test sets, input vocabulary
size of the training set, and number of unique intents and slots.
Data Set        # Train   # Dev   # Test     |V|   # Intents   # Slots
ATIS              4,978       -      893     900          17        79
Alarm             8,096   1,057      846     433          16         8
Calendar         21,695   3,626    2,555   1,832          20        18
Communication    13,779   2,662    1,529   4,336          25        20
Technical         7,687     993      867   2,180           5        18
4 domains        51,257   8,338    5,797   6,680          59        42
Table 2: F-measure results using ATIS data. The first column
shows the best F-measure on the test set when the model was
trained with 90% of the training examples; the second column
shows F-measure averaged over 10 random initializations with
parameters optimized on the development set (10%).
Model      best F    avg. F
RNN        93.06%    92.09%
LSTM       93.80%    93.09%
LSTM-LA    95.12%    94.68%
bLSTM      95.48%    94.70%
We experimented with 4 settings, and report slot F-measure
(SLOT F, Table 3), intent accuracy (INTENT A, Table 3), and
overall frame error rate (OVERALL E, Table 4) for each of
these:
• SD-Sep: For each domain, a separate intent detection
  and slot filling model was trained, resulting in $2 \times |D|$
  classifiers, where $|D|$ is the number of domains. Optimum
  parameters were found on the development set for
  each experiment and used for computing performance on
  the test set. The outputs of all the classifiers were joined
  for overall error rates.
• SD-Joint: For each domain, a single model that estimates
  both intent and sequence of slots was used, resulting
  in $|D|$ classifiers.
• MD-Sep: An intent detection model and a slot filling
  model were trained using data from all the domains,
  resulting in 2 classifiers. The output of intent detection
  was merged with the output of slot filling for computing
  overall template error rates.
• MD-Joint: A single classifier for estimating the full
  semantic frame that includes domain, intent, and slots for
  each utterance was trained using all the data.
The first two settings assume that the correct domain for each
example in the test set is provided. To estimate the domain in these
settings, we trained an LSTM model for domain detection using all
the data; the accuracy of the domain detection is 95.5% on the test
set. Table 3 shows results for intent detection and slot filling
when the true domain is known for the first two settings, hence the
performances of these two settings seem higher. Table 4, however,
shows overall frame error rates when the domain estimation is
integrated in the decision of the final frame. In both single-domain
and multi-domain settings, intent detection accuracy improves with
joint training (although by a small margin), but slot filling
degrades. Overall, we achieve the lowest error with the single model
approach. The 13.4% semantic frame error rate on all the data is
significantly better than the 14.4% of the commonly used SD-Sep.
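The overall frame error rate is not formally defined in the text. A natural reading, assumed in the sketch below, is that an utterance counts as an error unless its domain, its intent, and every slot tag are all predicted correctly. The function name, the frame representation, and the example labels are hypothetical.

```python
def frame_error_rate(gold_frames, pred_frames):
    """Fraction of utterances whose predicted frame differs from the gold frame.

    Each frame is a tuple (domain, intent, slot_tags); any mismatch in any
    of the three parts makes the whole frame wrong (an assumed definition).
    """
    if not gold_frames or len(gold_frames) != len(pred_frames):
        raise ValueError("need two non-empty lists of equal length")
    errors = sum(g != p for g, p in zip(gold_frames, pred_frames))
    return errors / len(gold_frames)

gold = [
    ("alarm", "set_alarm", ("O", "O", "B-time")),
    ("calendar", "find_event", ("O", "B-date")),
    ("alarm", "delete_alarm", ("O", "O")),
    ("communication", "call", ("O", "B-contact")),
]
pred = [
    ("alarm", "set_alarm", ("O", "O", "B-time")),   # fully correct
    ("calendar", "find_event", ("O", "O")),         # one slot tag wrong
    ("calendar", "delete_alarm", ("O", "O")),       # domain wrong
    ("communication", "call", ("O", "B-contact")),  # fully correct
]
rate = frame_error_rate(gold, pred)
```

Under this definition a frame-level error can come from any of the three tasks, so the overall error rate is always at least as high as the error rate of the weakest individual task on the same data.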
Table 3: Slot F-measure and intent accuracy results in single
domain (SD) and multi domain (MD) joint and separate model-
ing experiments.
SLOT F SD-Sep SD-Joint MD-Sep MD-Joint
Alarm 95.9% 93.9% 94.5% 94.3%
Cal. 94.5% 93.7% 92.6% 92.4%
Comm. 86.4% 83.8% 85.1% 82.7%
Tech. 90.4% 89.8% 89.6% 88.3%
All 91.8% 90.5% 90.0% 89.4%
INTENT A SD-Sep SD-Joint MD-Sep MD-Joint
Alarm 96.5% 96.2% 94.9% 94.3%
Cal. 97.2% 97.6% 94.2% 94.3%
Comm. 96.1% 95.8% 94.0% 95.4%
Tech. 94.6% 95.9% 93.9% 95.3%
All 96.4% 96.7% 94.1% 94.6%
Table 4: Overall frame level error rates.
Overall E SD-Sep SD-Joint MD-Sep MD-Joint
Alarm 9.5% 9.8% 9.1% 9.2%
Cal. 10.7% 11.1% 11.3% 10.1%
Comm. 19.8% 20.6% 16.3% 17.3%
Tech. 20.4% 20.6% 21.4% 20.2%
All 14.4% 14.9% 13.7% 13.4%
5. Conclusions
We propose a multi-domain, multi-task (i.e., domain and intent
detection and slot filling) sequence tagging approach to estimate
complete semantic frames for user utterances addressed to a
conversational system. First, we investigate alternative
architectures for modeling lexical context for spoken language
understanding. Then we present our approach that jointly models slot
filling, intent determination, and domain classification in a single
bi-directional RNN with LSTM cells. User queries from multiple
domains are combined in a single model enabling multi-task deep
learning. We empirically show improvements with the proposed
approach over the alternatives. In addition to the simplicity of the
single model framework for SLU, such an architecture opens the way,
as part of our future research, to handling belief state update and
other non-lexical contexts, such as user contacts or dialogue
history, in one holistic model [32]. Furthermore, an RNN-LSTM based
language generation system [40] can be jointly trained, enabling an
end-to-end conversational understanding framework.
6. Acknowledgments
The authors would like to thank Nikhil Ramesh, Bin Cao, Derek
Liu, and the reviewers for useful discussions and feedback.
7. References
[1] G. Tur and R. D. Mori, Eds., Spoken Language Understanding:
Systems for Extracting Semantic Information from Speech. New
York, NY: John Wiley and Sons, 2011.
[2] Y.-B. Kim, K. Stratos, R. Sarikaya, and M. Jeong, “New trans-
fer learning techniques for disparate label sets,” in Proceedings of
ACL-IJCNLP. ACL, 2015.
[3] Y.-N. Chen, D. Hakkani-Tur, and X. He, “Zero-shot learning of
intent embeddings for expansion by convolutional deep structured
semantic models,” in Proceedings of ICASSP. IEEE, 2016.
[4] Y.-N. Chen, W. Y. Wang, and A. I. Rudnicky, “Unsupervised in-
duction and filling of semantic slots for spoken dialogue systems
using frame-semantic parsing,” in Proceedings of ASRU. IEEE,
2013, pp. 120–125.
[5] Y.-N. Chen, W. Y. Wang, A. Gershman, and A. I. Rudnicky, “Ma-
trix factorization with knowledge graph propagation for unsuper-
vised spoken language understanding,” in Proceedings of ACL-
IJCNLP. ACL, 2015, pp. 483–494.
[6] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-
Tur, X. He, L. Heck, G. Tur, D. Yu et al., “Using recurrent neu-
ral networks for slot filling in spoken language understanding,”
IEEE/ACM Transactions on Audio, Speech, and Language Pro-
cessing, vol. 23, no. 3, pp. 530–539, 2015.
[7] P. J. Price, “Evaluation of spoken language systems: The ATIS
domain,” in Proceedings of the DARPA Workshop on Speech and
Natural Language, Hidden Valley, PA, June 1990.
[8] P. Haffner, G. Tur, and J. Wright, “Optimizing SVMs for complex
call classification,” in Proceedings of the ICASSP, Hong Kong,
April 2003.
[9] C. Chelba, M. Mahajan, and A. Acero, “Speech utterance classi-
fication,” in Proceedings of the ICASSP, Hong Kong, May 2003.
[10] R. E. Schapire and Y. Singer, “Boostexter: A boosting-based
system for text categorization,” Machine Learning, vol. 39, no. 2/3,
pp. 135–168, 2000.
[11] B. Favre, D. Hakkani-Tür, and S. Cuendet, “Icsiboost,”
http://code.google.come/p/icsiboost, 2007.
[12] R. Pieraccini, E. Tzoukermann, Z. Gorelov, J.-L. Gauvain,
E. Levin, C.-H. Lee, and J. G. Wilpon, “A speech understanding
system based on statistical representation of semantics,” in Pro-
ceedings of the ICASSP, San Francisco, CA, March 1992.
[13] Y.-Y. Wang, L. Deng, and A. Acero, “Spoken language
understanding - an introduction to the statistical framework,” IEEE
Signal Processing Magazine, vol. 22, no. 5, pp. 16–31, September
2005.
[14] C. Raymond and G. Riccardi, “Generative and discriminative al-
gorithms for spoken language understanding,” in Proceedings of
the Interspeech, Antwerp, Belgium, 2007.
[15] R. Sarikaya, G. E. Hinton, and B. Ramabhadran, “Deep belief nets
for natural language call-routing,” in Proceedings of the ICASSP,
Prague, Czech Republic, 2011.
[16] G. Tur, L. Deng, D. Hakkani-Tür, and X. He, “Towards deeper
understanding deep convex networks for semantic utterance
classification,” in Proceedings of the ICASSP, Kyoto, Japan, March
2012.
[17] L. Deng, G. Tur, X. He, and D. Hakkani-Tür, “Use of kernel deep
convex networks and end-to-end learning for spoken language
understanding,” in Proceedings of the IEEE SLT Workshop, Miami, FL,
December 2012.
[18] R. Sarikaya, G. E. Hinton, and A. Deoras, “Application of deep
belief networks for natural language understanding,” IEEE Trans-
actions on Audio, Speech, and Language Processing, vol. 22,
no. 4, April 2014.
[19] S. Ravuri and A. Stolcke, “Recurrent neural network and LSTM
models for lexical utterance classification,” in Interspeech, 2015.
[20] A. Deoras and R. Sarikaya, “Deep belief network based semantic
taggers for spoken language understanding,” in Proceedings
of the Interspeech, Lyon, France, August 2013.
[21] P. Xu and R. Sarikaya, “Convolutional neural network based tri-
angular CRF for joint intent detection and slot filling,” in Proceed-
ings of the IEEE ASRU, 2013.
[22] D. Guo, G. Tur, W.-T. Yih, and G. Zweig, “Joint semantic
utterance classification and slot filling with recursive neural
networks,” in Proceedings of the IEEE SLT Workshop, 2014.
[23] K. Yao, G. Zweig, M.-Y. Hwang, Y. Shi, and D. Yu, “Recurrent
neural networks for language understanding,” in Proceedings
of the Interspeech, Lyon, France, August 2013.
[24] G. Mesnil, X. He, L. Deng, and Y. Bengio, “Investigation of
recurrent-neural-network architectures and learning methods for
spoken language understanding,” in Proceedings of Interspeech,
2013.
[25] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[26] K. Yao, B. Peng, Y. Zhang, D. Yu, G. Zweig, and Y. Shi, “Spo-
ken language understanding using long short-term memory neural
networks,” in Proceedings of the IEEE SLT Workshop, 2014.
[27] J. Elman, “Finding structure in time,” Cognitive Science, vol. 14,
no. 2, 1990.
[28] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term de-
pendencies with gradient descent is difficult,” IEEE Transactions
on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[29] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of train-
ing recurrent neural networks,” arXiv preprint arXiv:1211.5063,
2012.
[30] A. Karpathy, J. Johnson, and L. Fei-Fei, “Visualizing and under-
standing recurrent networks,” arXiv preprint arXiv:1506.02078,
November 2015.
[31] R. Jozefowicz, W. Zaremba, and I. Sutskever, “An empirical ex-
ploration of recurrent network architectures,” in Proceedings of
the 32nd International Conference on Machine Learning (ICML-
15), 2015, pp. 2342–2350.
[32] Y.-N. Chen, D. Hakkani-Tür, G. Tur, J. Gao, and D. Li,
“End-to-end memory networks with knowledge carryover for multi-turn
spoken language understanding,” in Proceedings of Interspeech,
2016.
[33] A. Graves and J. Schmidhuber, “Framewise phoneme classifica-
tion with bidirectional LSTM and other neural network architec-
tures,” Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.
[34] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence
learning with neural networks,” in Advances in Neural Infor-
mation Processing Systems 27, Z. Ghahramani, M. Welling,
C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds., 2014,
pp. 3104–3112.
[35] O. Vinyals and Q. V. Le, “A neural conversational model,” in
ICML Deep Learning Workshop, 2015.
[36] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient meth-
ods for online learning and stochastic optimization,” Journal of
Machine Learning Research, no. 12, pp. 2121–2159, 2011.
[37] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient esti-
mation of word representations in vector space,” in Workshop at
ICLR, 2013.
[38] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu,
and P. Kuksa, “Natural language processing (almost) from
scratch,” Journal of Machine Learning Research, no. 12, pp.
2483–2537, 2011.
[39] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-
Tur, X. He, L. Heck, G. Tur, D. Yu, and G. Zweig, “Using re-
current neural networks for slot filling in spoken language under-
standing,” IEEE Transactions on Audio, Speech, and Language
Processing, vol. 23, no. 3, March 2015.
[40] T.-H. Wen, M. Gasic, N. Mrksic, P.-H. Su, D. Vandyke, and
S. Young, “Semantically conditioned LSTM-based natural lan-
guage generation for spoken dialogue systems,” arXiv preprint
arXiv:1508.01745, 2015.
... Spoken language understanding is an important component of task-oriented dialog systems powering today's voice controlled AI agents, and chat bots. Intent classification and slot tagging are two main sub-tasks in SLU [1,2,3,4,5]. Human language is inherently compositional [6], and humans possess the ability to understand infinite new utterances by focusing on relevant informative sub-parts of the utterance which were learned previously [7]. ...
... The first benchmark dataset ATIS [16] contains utterances related to airline reservation. We consider the data split from [17,1] containing 4,978 training , and 893 test utterances in the standard split (Ttrain, Ttest). We also use the second benchmark dataset SNIPS [18] containing various utterances in entertainment, weather, and restaurant domains. ...
... For benchmarking, we use CoNLL03 dataset for resource-rich setting and MIT restaurant (Liu et al., 2013), MIT movie (Liu et al., 2013) and ATIS (Hakkani-Tur et al., 2016) datasets are used for low-resource and domain transfer settings. See Table 4 in Appendix for more details. ...
... One approach is to interpret slot, intent and domain as a hierarchy in the semantic frame thus produced. [19] used a private multi-domain data set and addressed the task by combining domain and intent labels into one tag. In [20] domain is classified at the level of the utterance using a simple attention mechanism over RNN hidden states. ...
... Fig. 1 shows an example of slot filling, given the utterance "what is the weather in koontz lake" in the "GetWeather" domain, models need to find the slot entity "koontz lake" corresponding to the slot type "city". Conventional methods take slot filling as a sequence labeling problem (i.e., a token-level classification task), and such fully supervised learning methods require a large amount of annotated data [3,4,5,6,7]. However, the construction of such data in real life is labor-intensive and time-consuming. ...
... Transformer encoder-based approaches are especially adept at this task, performing the classification step using sentence embeddings. Intent detection data is limited in task-oriented datasets, and most approaches [4,14,16] focus on single-utterance queries for voice assistants [9,6], forgoing multi-turn interactions. ...
Preprint
Dialogue systems need to deal with the unpredictability of user intents to track dialogue state and the heterogeneity of slots to understand user preferences. In this paper we investigate the hypothesis that solving these challenges as one unified model will allow the transfer of parameter support data across the different tasks. The proposed principled model is based on a Transformer encoder, trained on multiple tasks, and leveraged by a rich input that conditions the model on the target inferences. Conditioning the Transformer encoder on multiple target inferences over the same corpus, i.e., intent and multiple slot types, allows learning richer language interactions than a single-task model would be able to. In fact, experimental results demonstrate that conditioning the model on an increasing number of dialogue inference tasks leads to improved results: on the MultiWOZ dataset, the joint intent and slot detection can be improved by 3.2% by conditioning on intent, 10.8% by conditioning on slot and 14.4% by conditioning on both intent and slots. Moreover, on real conversations with Farfetch customers, the proposed conditioned BERT can achieve high joint-goal and intent detection performance throughout a dialogue.
Chapter
Spoken language understanding (SLU), which primarily entails slot filling and intent detection, has been studied for many years and has achieved significant results. However, in Chinese SLU tasks, some models fail to take word-level information into account, and there is insufficient interaction between slot information and intent information. To address these issues, we propose a novel bi-directional interaction graph framework with filter gate mechanism (BIG-FG) for Chinese spoken language understanding, which enables fine-grained interaction directly between slot information and intent information, while also effectively fusing character-word semantic information. The model consists of two core modules: (1) bi-directional interaction graph (BIG), which is based on a multi-layer graph attention network with bi-directional connections between intent information, slot information, and adjacent slot information, fully considering the correlation between slot filling and intent detection; (2) filter gate (FG), which enhances fusion performance by solving the problem of semantic ambiguity brought by direct fusion of character-word semantic information. Experiments on two datasets demonstrate that our model outperforms the best benchmark model by 0.39% and 2.65% in the Overall(Acc) evaluation metric, respectively, and achieves state-of-the-art performance.
Chapter
Few-shot Named Entity Recognition (NER) is the task of identifying new named entities using only a small number of labeled examples. Prompt-based learning has been successful in few-shot NER by using prompts to guide the labeling process and increase efficiency. However, previous prompt-based methods for few-shot NER have limitations such as high computational complexity and insufficient few-shot capability. To address these concerns, we propose a multi-task instruction framework called CotNER for Few-shot NER, which utilizes a chain-of-thought prompting generative approach. We introduce two auxiliary tasks, entity extraction and entity recognition, and integrate reasoning processes through chain-of-thought prompting. Our approach outperforms previous methods on various benchmarks, as demonstrated by extensive experiments.
Article
Recent advancements in Natural Language Processing (NLP) have drastically changed how people and computers communicate. This beginner-oriented research article explores NLP-based chat platforms, investigating how they work, what they can do, and the challenges they face. The study looks at modern techniques for creating chatbots, including rule-based systems, generative models, and retrieval methods. It dives into the technologies driving these platforms, like pre-trained language models and context management. Ethical concerns like bias and privacy are also covered. The article examines how these chat platforms affect fields like customer service, healthcare, education, and entertainment. By summarizing existing research, the article highlights current trends and future possibilities in NLP-based chat systems, offering insights to newcomers and interested parties.
Article
The study offers an algorithm designed to observe cognitive-communicative syntactic specifiers based on recursive comparison in the process of computer-aided translation of statements in agglutinating languages. The key method that enables fast and adequate search for translation equivalents in languages with different structures is the multilevel recursive comparison of interim forms with dictionary equivalents, as well as with contextual matches within the framework of prospective modelling in the course of an invariant thesaurus development according to dynamic complicated codes. The combination of cognitive-semantic and semantemic-morphological parallel comparison with spiral references will make it possible not only to create an equivalent version of the target text that meets the requirements of lexico-morphological correctness, but also to ensure the transfer of cognitive and consituational elements of the original utterance. The inclusion of cognitive-communicative syntactic samples in the recursive comparison algorithm during automatic processing of the utterance, as the final stage of generating the target utterance, is designed to solve the problem of polysemic, synonymous and homonymous barriers that arise at the stage of generating the target text. The described automated analysis algorithm is demonstrated on utterances in languages with different structures (Turkish, Russian and English). The article also provides examples of the interaction of the invariant thesaurus of commonly used constructions with variant correspondences while connecting recursive comparison with elements of cognitive-communicative syntax.
Conference Paper
Spoken language understanding (SLU) is a core component of a spoken dialogue system. In the traditional architecture of dialogue systems, the SLU component treats each utterance independently of the others, and then the following components aggregate the multi-turn information in separate phases. However, there are two challenges: 1) errors from previous turns may be propagated and then degrade the performance of the current turn; 2) knowledge mentioned in the long history may not be carried into the current turn. This paper addresses the above issues by proposing an architecture using end-to-end memory networks to model knowledge carryover in multi-turn conversations, where utterances encoded with intents and slots can be stored as embeddings in the memory and the decoding phase applies an attention model to leverage previously stored semantics for intent prediction and slot tagging simultaneously. The experiments on Microsoft Cortana conversational data show that the proposed memory network architecture can effectively extract salient semantics for modeling knowledge carryover in the multi-turn conversations and outperform the results using the state-of-the-art recurrent neural network framework (RNN) designed for single-turn SLU.
Article
Natural language generation (NLG) is a critical component of spoken dialogue and it has a significant impact both on usability and perceived quality. Most NLG systems in common use employ rules and heuristics and tend to generate rigid and stylised responses without the natural variation of human language. They are also not easily scaled to systems covering multiple domains and languages. This paper presents a statistical language generator based on a semantically controlled Long Short-term Memory (LSTM) structure. The LSTM generator can learn from unaligned data by jointly optimising sentence planning and surface realisation using a simple cross entropy training criterion, and language variation can be easily achieved by sampling from output candidates. An objective evaluation in two differing test domains showed improved performance compared to previous methods with fewer heuristics. Human judges scored the LSTM system higher on informativeness and naturalness and overall preferred it to the other systems.
Conference Paper
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.7 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a strong phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which beats the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
Conference Paper
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
Conference Paper
The recent surge of intelligent personal assistants motivates spoken language understanding of dialogue systems. However, the domain constraint along with the inflexible intent schema remains a big issue. This paper focuses on the task of intent expansion, which helps remove the domain limit and make an intent schema flexible. A convolutional deep structured semantic model (CDSSM) is applied to jointly learn the representations for human intents and associated utterances. It can then flexibly generate new intent embeddings without the need for training samples and model retraining, which bridges the semantic relation between seen and unseen intents and further yields more robust results. Experiments show that CDSSM is capable of performing zero-shot learning effectively, e.g. generating embeddings of previously unseen intents, and therefore expands to new intents without retraining, and outperforms other semantic embeddings. The discussion and analysis of experiments provide a future direction for reducing human effort in annotating data and removing the domain constraint in spoken dialogue systems. Index Terms: zero-shot learning, spoken language understanding (SLU), spoken dialogue system (SDS), convolutional deep structured semantic model (CDSSM), embeddings, expansion.
Article
One of the key problems in spoken language understanding (SLU) is the task of slot filling. In light of the recent success of applying deep neural network technologies in domain detection and intent identification, we carried out an in-depth investigation on the use of recurrent neural networks for the more difficult task of slot filling involving sequence discrimination. In this work, we implemented and compared several important recurrent-neural-network architectures, including the Elman-type and Jordan-type recurrent networks and their variants. To make the results easy to reproduce and compare, we implemented these networks on the common Theano neural network toolkit, and evaluated them on the ATIS benchmark. We also compared our results to a conditional random fields (CRF) baseline. Our results show that on this task, both types of recurrent networks outperform the CRF baseline substantially, and a bi-directional Jordan-type network that takes into account both past and future dependencies among slots works best, outperforming a CRF-based baseline by 14% in relative error reduction.
Article
This paper investigates the use of deep belief networks (DBN) for semantic tagging, a sequence classification task, in spoken language understanding (SLU). We evaluate the performance of the DBN based sequence tagger on the well-studied ATIS task and compare our technique to conditional random fields (CRF), a state-of-the-art classifier for sequence classification. In conjunction with lexical and named entity features, we also use dependency parser based syntactic features and part of speech (POS) tags [1]. Under both noisy conditions (output of automatic speech recognition system) and clean conditions (manual transcriptions), our deep belief network based sequence tagger outperforms the best CRF based system described in [1] by an absolute 2% and 1% F-measure, respectively. Upon carrying out an analysis of cases where CRF and DBN models made different predictions, we observed that when discrete features are projected onto a continuous space during neural network training, the model learns to cluster these features leading to its improved generalization capability, relative to a CRF model, especially in cases where some features are either missing or noisy.
Article
In natural language understanding (NLU), a user utterance can be labeled differently depending on the domain or application (e.g., weather vs. calendar). Standard domain adaptation techniques are not directly applicable to take advantage of the existing annotations because they assume that the label set is invariant. We propose a solution based on label embeddings induced from canonical correlation analysis (CCA) that reduces the problem to a standard domain adaptation task and allows use of a number of transfer learning techniques. We also introduce a new transfer learning technique based on pretraining of hidden-unit CRFs (HUCRFs). We perform extensive experiments on slot tagging on eight personal digital assistant domains and demonstrate that the proposed methods are superior to strong baselines.
Article
In recent years, continuous space models have proven to be highly effective at language processing tasks ranging from paraphrase detection to language modeling. These models are distinctive in their ability to achieve generalization through continuous space representations, and compositionality through arithmetic operations on those representations. Examples of such models include feed-forward and recurrent neural network language models. Recursive neural networks (RecNNs) extend this framework by providing an elegant mechanism for incorporating both discrete syntactic structure and continuous-space word and phrase representations into a powerful compositional model. In this paper, we show that RecNNs can be used to perform the core spoken language understanding (SLU) tasks in a spoken dialog system, more specifically domain and intent determination, concurrently with slot filling, in one jointly trained model. We find that a very simple RecNN model achieves competitive performance on the benchmark ATIS task, as well as on a Microsoft Cortana conversational understanding task.