
Evaluating Natural Language Understanding Services for Conversational Question Answering Systems


Abstract

Conversational interfaces have recently gained a lot of attention. One of the reasons for the current hype is the fact that chatbots (one particularly popular form of conversational interfaces) can nowadays be created without any programming knowledge, thanks to different toolkits and so-called Natural Language Understanding (NLU) services. While these NLU services are already widely used in both industry and science, they have so far not been analysed systematically. In this paper, we present a method to evaluate the classification performance of NLU services. Moreover, we present two new corpora, one consisting of annotated questions and one consisting of annotated questions with the corresponding answers. Based on these corpora, we conduct an evaluation of some of the most popular NLU services. Thereby we want to enable both researchers and companies to make more educated decisions about which service they should use.
Proceedings of the SIGDIAL 2017 Conference, pages 174–185, Saarbrücken, Germany, 15-17 August 2017. © 2017 Association for Computational Linguistics
Daniel Braun, Adrian Hernandez Mendez, Florian Matthes
Technical University of Munich
Department of Informatics
Manfred Langen
Siemens AG
Corporate Technology
1 Introduction
Long before the terms conversational interface or chatbot were coined, Turing (1950) described them as the ultimate test for artificial intelligence. Despite their long history, there is a recent hype about chatbots in both the scientific community (cf. e.g. Ferrara et al. (2016)) and industry (Gartner, 2016). While there are many related reasons for this development, we think that three key changes were particularly important:
• Rise of universal chat platforms (like Telegram, Facebook Messenger, Slack, etc.)
• Advances in machine learning (ML)
• Natural Language Understanding (NLU) as a service
In this paper, we focus on the latter. As we will show in Section 2, NLU services are already used by a number of researchers for building conversational interfaces. However, due to the lack of a systematic evaluation of these services, the decision why one service was preferred over another is usually not well justified. With this paper, we want to bridge this gap and enable both researchers and companies to make more educated decisions about which service they should use. We describe the functioning of NLU services and their role within the general architecture of chatbots. We explain how NLU services can be evaluated and conduct an evaluation of the most popular services, based on two different corpora consisting of nearly 500 annotated questions.
2 Related Work
Recent publications have discussed the usage of NLU services in different domains and for different purposes, e.g. question answering for localized search (McTear et al., 2016), form-driven dialogue systems (Stoyanchev et al., 2016), dialogue management (Schnelle-Walka et al., 2016), and the internet of things (Kar and Haldar, 2016).
However, none of these publications explicitly discusses why they chose one particular NLU service over another and how this decision may have influenced the performance of their system and hence their results. Moreover, to the best of our knowledge, there exists so far no systematic evaluation of a particular NLU service, let alone a comparison of multiple services.
Dale (2015) lists five NLP cloud services and describes their capabilities, but without conducting an evaluation. In the domain of spoken dialog systems, similar evaluations have been conducted for automatic speech recognizer services, e.g. by Twiefel et al. (2014) and Morbini et al. (2013).
Speaking about chatbots in general, Shawar and Atwell (2007) present an approach to conduct end-to-end evaluations; however, they do not take into account the single elements of a system. Resnik and Lin (2010) provide a good overview and evaluation of Natural Language Processing (NLP) systems in general. Many of the principles they apply for their evaluation (e.g. inter-annotator agreement and partitioning of data) play an important role in our evaluation too. A comprehensive and extensive survey of question answering technologies was presented by Kolomiyets and Moens (2011). However, there has been a lot of progress since 2011, including the NLU services presented here.
One of our two corpora was labelled using Amazon Mechanical Turk (AMT, cf. Section 5.2). While there have been long discussions about whether or not AMT can replace the work of experts for labelling linguistic data, the recent consensus is that, given enough annotators, crowd-sourced labels from AMT are as reliable as expert data (Snow et al., 2008; Munro et al., 2010;
3 Chatbot Architecture
In order to understand the role of NLU services
for chatbots, one first has to look at the general ar-
chitecture of chatbots. While there exist different
documented chatbot architectures for concrete use
cases, no universal model of how a chatbot should
be designed has emerged yet. Our proposal for a
universal chatbot architecture is shown in Figure
1. It consists of three main parts: Request Inter-
pretation, Response Retrieval and Message Gener-
ation. The Message Generation follows the classi-
cal Natural Language Generation (NLG) pipeline
described by Reiter and Dale (2000). In the con-
text of Request Interpretation, a “request” is not
necessarily a question, but can also be any user in-
put like “My name is John”. Equally, a “response”
to this input could e.g. be “What a nice name”.
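The three parts of the architecture can be pictured as three functions applied one after another. The following is only a minimal sketch under our own assumptions: the function names, the toy rule standing in for an NLU service, the in-memory knowledge base, and the templates are all hypothetical and not part of the proposed architecture itself.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Interpretation:
    """Structured result of the Request Interpretation step."""
    intent: str
    entities: Dict[str, str] = field(default_factory=dict)


def interpret_request(message: str) -> Interpretation:
    # Request Interpretation: a toy rule standing in for an NLU service (assumption).
    prefix = "my name is "
    if message.lower().startswith(prefix):
        return Interpretation("Introduce", {"name": message[len(prefix):].strip()})
    return Interpretation("Unknown")


def retrieve_response(interpretation: Interpretation) -> Dict[str, str]:
    # Response Retrieval: look up what to say in a hypothetical knowledge base.
    knowledge_base = {"Introduce": {"act": "compliment_name"}}
    return knowledge_base.get(interpretation.intent, {"act": "fallback"})


def generate_message(content: Dict[str, str]) -> str:
    # Message Generation: a template lookup standing in for a full NLG pipeline.
    templates = {
        "compliment_name": "What a nice name",
        "fallback": "Sorry, I did not understand that.",
    }
    return templates[content["act"]]


def chatbot(message: str) -> str:
    return generate_message(retrieve_response(interpret_request(message)))


print(chatbot("My name is John"))
```

In a real system, each of the three functions would be replaced by a far richer component; the sketch only shows how data flows from one part to the next.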
4 NLU Services
The general goal of NLU services is the extraction
of structured, semantic information from unstruc-
tured natural language input, e.g. chat messages.
They mainly do this by attaching user-defined la-
bels to messages or parts of messages. At the time
of writing, among the most popular NLU services
Watson Conversation2
Amazon Lex5
Moreover, there is a popular open source alternative called RASA. RASA offers the same functionality, while lacking the advantages of cloud-based solutions (managed hosting, scalability, etc.). On the other hand, it offers the typical advantages of self-hosted open source software (adaptability, data control, etc.).
Table 1 shows a comparison of the basic functionality offered by the different services. All of them, except for Amazon Lex, share the same basic concept: Based on example data, the user can train a classifier to classify so-called intents (which represent the intent of the whole message and are not bound to a certain position within the message) and entities (which can consist of a single or multiple characters).
Service   Intents   Entities   Batch import
LUIS      +         +          +
Watson    +         +          +
          +         +          +
          +         +          O
Lex       +         O          -
RASA      +         +          +
Table 1: Comparison of the basic functionality of NLU services
Figure 2 shows a labelled sentence in the LUIS web interface. The intent of this sentence was classified as FindConnection, with a confidence of 97%. The labelled entities are: (next, Criterion), (train, Vehicle), (Universität, StationStart), (Max-Weber-Platz, StationDest). Amazon Lex shares the concept of intents with the other services, but instead of entities, Lex uses so-called slots, which are not trained by concrete examples, but by example patterns like “When is the {Criterion} {Vehicle} to {StationDest}”. Moreover, all services, except for Amazon Lex, also offer an export and import functionality which uses a JSON format to export and import the training data. While offers this functionality, as of today, it only works reliably for creating backups and restoring them, but not importing new data.
Figure 1: General Architecture for Chatbots (parts: Request Interpretation, Response Retrieval, Message Generation; components shown: Knowledge Base, Query Generation, Candidate Retrieval, Candidate Selection, Text Planning, Sentence Planning, Linguistic Realizer)
Figure 2: Labelled sentence with intent and entities in Microsoft LUIS
When it comes to the core of the services, the machine learning algorithms and the data on which they are initially trained, all services are very secretive. None of them gives specific information about the used technologies and datasets. The exception in this case is, of course, RASA, which can either use MITIE (Geyer et al., 2016) or spaCy (Choi et al., 2015) as ML backend.
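To make the difference between intents, entities, and slot patterns more concrete, the following sketch stores the labels reported for the sentence in Figure 2 in a small data structure and lists the slot names of the example pattern. The class and function names are our own illustrative assumptions and do not correspond to the data model or API of any of the services.

```python
import string
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entity:
    value: str        # the labelled part of the message
    entity_type: str  # the user-defined label


@dataclass(frozen=True)
class LabelledMessage:
    intent: str       # one intent for the whole message
    confidence: float
    entities: List[Entity]

    def values_of(self, entity_type: str) -> List[str]:
        return [e.value for e in self.entities if e.entity_type == entity_type]


# Labels as reported for the sentence shown in Figure 2.
figure2 = LabelledMessage(
    intent="FindConnection",
    confidence=0.97,
    entities=[
        Entity("next", "Criterion"),
        Entity("train", "Vehicle"),
        Entity("Universität", "StationStart"),
        Entity("Max-Weber-Platz", "StationDest"),
    ],
)


def slot_names(pattern: str) -> List[str]:
    # Lists the {placeholders} of a slot-style example pattern.
    return [name for _, name, _, _ in string.Formatter().parse(pattern) if name]


print(figure2.values_of("StationDest"))
print(slot_names("When is the {Criterion} {Vehicle} to {StationDest}"))
```

The sketch only illustrates the shape of the labels; it makes no claim about how any service represents or trains them internally.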
5 Data Corpus
Our evaluation is based on two very different data corpora. The Chatbot Corpus (cf. Section 5.1) is based on questions gathered by a Telegram chatbot in production use, answering questions about public transport connections. The StackExchange Corpus (cf. Section 5.2) is based on data from two StackExchange platforms: ask ubuntu and Web Applications. Both corpora are available on GitHub under the Creative Commons CC BY-SA 3.0 license.
5.1 Chatbot Corpus
The Chatbot Corpus consists of 206 questions,
which were manually labelled by the authors.
There are two different intents (Departure Time,
Find Connection) in the corpus and five different
entity types (StationStart, StationDest, Criterion,
Vehicle, Line). The general language of the ques-
tions was English, however, mixed with German
street and station names. Example entries from
the corpus can be found in Appendix A.1. For
the evaluation, the corpus was split into a train-
ing dataset with 100 entries and a test dataset with
106 entries.
43% of the questions in the training dataset belong to the intent Departure Time and 57% to Find Connection. The distribution for the test dataset is 33% (Departure Time) and 67% (Find Connection). Table 2 shows how the different entity types are distributed among the two datasets. While some entity types occur very often, like StationStart, some occur very rarely, especially Line. We make this distinction to evaluate whether some services handle very common, or very rare, entity types better than others.
While in this corpus, there are more tagged entities in the training dataset than in the test dataset, it is the other way round in the other corpus, which will be introduced in the next section. Although one might expect that this leads to better results, the evaluation in Section 7 shows that this is not necessarily the case.
Entity Type    training   test     Σ
StationStart         91    102   193
StationDest          57     71   128
Criterion            48     34    82
Vehicle              50     35    85
Line                  4      2     6
Σ                   250    244   494
Table 2: Entity types within the chatbot corpus
5.2 StackExchange Corpus
For the generation of the StackExchange corpus, we used the StackExchange Data Explorer. We chose the most popular questions (i.e. questions with the highest scores and most views) from the two StackExchange platforms ask ubuntu and Web Applications, because they are likely to have a better linguistic quality and a higher relevance, compared to less popular questions. Additionally, we used only questions with an accepted, i.e. correct, answer. Although we did not use the answers in our evaluation, we included them in our corpus, in order to create a corpus that is not only useful for this particular evaluation, but also for research on question answering in general. In this way, we gathered 290 questions and answers in total, 100 from Web Applications and 190 from ask ubuntu.
The corpus was labelled with intents and enti-
ties using Amazon Mechanical Turk (AMT). Each
question was labelled by five different workers,
summing up to nearly 1,500 datapoints.
For each platform, we created a list of candidates for intents, which were extracted from the labels (i.e. tags) assigned to the questions by StackExchange users. For each question, the AMT workers were asked to choose one of these intents or “None”, if they thought no candidate was fitting.
For ask ubuntu, the possible intents were:
“Make Update”, “Setup Printer”, “Shutdown
Computer”, and “Software Recommendation”.
For Web Applications, the candidates were:
“Change Password”, “Delete Account”, “Down-
load Video”, “Export Data”, “Filter Spam”, “Find
Alternative”, and “Sync Accounts”.
Similarly, a set of entity type candidates was
given. By marking parts of the questions with
the mouse, workers could assign these entity
types to words (or characters) within the ques-
tion. For Web Applications the possible entity
types were: “WebService”, “OperatingSystem”
and “Browser”. For ask ubuntu, they were: “Soft-
wareName”, “Printer”, and “UbuntuVersion”.
Moreover, workers were asked to state how con-
fident they are in their assessment: very confident,
somewhat confident, undecided, somewhat uncon-
fident, or very unconfident.
For the generation of the annotated, final corpus, only submissions with a confidence level of “undecided” or higher were taken into account. A label, no matter if intent or entity, was only added to the corpus if the inter-annotator agreement among those confident annotators was 60% or higher. If no intent satisfying these criteria could be found for a question, this question was not added to the corpus. The final corpus was also checked for false positives by two experts, but none were found. The final corpus therefore consists of 251 entries, 162 from ask ubuntu and 89 from Web Applications. Example entries from the corpus are shown in Appendix A.2.
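The selection rule for intent labels can be sketched in a few lines. This is a minimal illustration, not the script used to build the corpus: the function name and input format are hypothetical, and we assume that "agreement" means the share of the sufficiently confident annotators who chose the most frequent label.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Confidence levels in ascending order, as offered to the workers.
CONFIDENCE_LEVELS = [
    "very unconfident",
    "somewhat unconfident",
    "undecided",
    "somewhat confident",
    "very confident",
]
MIN_LEVEL = CONFIDENCE_LEVELS.index("undecided")


def aggregate_label(submissions: List[Tuple[str, str]],
                    min_agreement: float = 0.6) -> Optional[str]:
    """Returns the agreed label, or Python's None if the criteria are not met.

    Note that the candidate intent "None" is an ordinary string label here.
    """
    # Keep only submissions with a confidence of "undecided" or higher.
    confident = [label for label, confidence in submissions
                 if CONFIDENCE_LEVELS.index(confidence) >= MIN_LEVEL]
    if not confident:
        return None
    label, votes = Counter(confident).most_common(1)[0]
    return label if votes / len(confident) >= min_agreement else None


example = [
    ("Delete Account", "very confident"),
    ("Delete Account", "somewhat confident"),
    ("Delete Account", "undecided"),
    ("None", "very confident"),
    ("Change Password", "very unconfident"),  # filtered out by confidence
]
print(aggregate_label(example))
```

In this example, three of the four remaining annotators agree (75%), so the label is accepted; with a 2/2/1 split among five confident annotators, no label would reach the 60% threshold.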
For the evaluation, we also split this corpus.
Four datasets were separated, one for training and
one for testing, for each platform. The distribution
of intents among these datasets is shown in Table
3, the distribution of entity types is shown in Ta-
ble 4. Again, we do this differentiation to compare
the classification results for frequently and rarely
occurring intents and entity types.
Intent             training   test     Σ
ChangePassword            2      6     8
DeleteAccount             7     10    17
DownloadVideo             1      0     1
ExportData                2      3     5
FilterSpam                6     14    20
FindAlternative           7     16    23
SyncAccounts              3      6     9
None                      2      4     6
Σ                        30     59    89
(a) Web Applications datasets
Intent             training   test     Σ
MakeUpdate               10     37    47
SetupPrinter             10     13    23
ShutdownComputer         13     14    27
S.Recommendation         17     40    57
None                      3      5     8
Σ                        53    109   162
(b) ask ubuntu datasets
Table 3: Intents within StackExchange corpus
dataset      Entity Type   training   test     Σ
web apps     WebService          33     64    97
             OS                   1      0     1
             Browser              1      0     1
             Σ                   35     64    99
ask ubuntu   Printer              8     12    20
             Software             3      4     7
             Version             24     78   102
             Σ                   35     94   129
Table 4: Entity types within the StackExchange corpus
6 Experimental Design
In order to compare the performance of the differ-
ent NLU services, we used the corpora described
in Section 5. We used the respective training
datasets to train the NLU services LUIS, Watson
Conversation,, and RASA. Amazon Lex
was not included in this comparison because, as
mentioned in Section 4, it does not offer a batch
import functionality, which is crucial in order to
effectively train all services with the exact same
data. For the same reason, was also excluded from the experiment. While it does offer an import option, it currently only works reliably for data which was created through the web interface and not altered, or even created, manually.
Afterwards, the test datasets were sent to the NLU services and the labels created by the services were compared against our human-created gold standard. For training, we used the batch import interfaces offered by all compared services. In this way, it was not only possible to train all the different services relatively fast, despite many hundred individual labels; it also guaranteed that all services were fed with exactly the same data. Since the data format differs from service to service, we used a Python script to automatically convert the training datasets from the format shown in the Appendix to the respective data format of the services. For retrieving the results for the test datasets from the NLU services, their respective REST APIs were used.
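One conversion step such a script typically needs is mapping the token indices used in the corpus format ("start"/"stop") to character offsets, which many import formats expect. The sketch below is not the script used in the experiment: the target layout is a hypothetical example rather than the format of any specific service, and it assumes simple whitespace tokenization and zero-based, inclusive token indices.

```python
from typing import Any, Dict, List, Tuple


def token_offsets(text: str) -> List[Tuple[int, int]]:
    """Character (start, end) offsets of the whitespace-separated tokens."""
    offsets = []
    position = 0
    for token in text.split():
        start = text.index(token, position)
        end = start + len(token)
        offsets.append((start, end))
        position = end
    return offsets


def to_char_offsets(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Converts token-indexed entities to a hypothetical character-offset layout."""
    text = entry["text"]
    offsets = token_offsets(text)
    entities = []
    for entity in entry["entities"]:
        start = offsets[entity["start"]][0]
        end = offsets[entity["stop"]][1]  # "stop" is treated as inclusive
        entities.append({"entity": entity["entity"],
                         "value": text[start:end],
                         "start": start,
                         "end": end})
    return {"text": text, "intent": entry["intent"], "entities": entities}


# Reduced version of an Appendix A.1 entry (only the Line entity is kept).
entry = {
    "text": "when is the next u6 leaving from garching?",
    "intent": "DepartureTime",
    "entities": [{"entity": "Line", "start": 4, "stop": 4}],
}
print(to_char_offsets(entry)["entities"])
```

With whitespace tokenization, a trailing question mark stays attached to the last token, so a real converter would need the same tokenization that was used when the corpus was labelled.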
In order to evaluate the results, we calculated
true positives, false positives, and false negatives,
based on exact matches. Based on this data, we
computed precision and recall as well as F-score
for single intents, entity types, and corpora, as well
as overall results. We will say one service is better
than another if it has a higher F-score.
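The metrics can be computed directly from the three counts. The following sketch assumes that the F-score is the balanced F1 measure (the harmonic mean of precision and recall), which is consistent with the values reported in the result tables; returning None for undefined values is our own convention. Summary rows are obtained by summing the counts first.

```python
from typing import List, Optional, Tuple

Scores = Tuple[Optional[float], Optional[float], Optional[float]]


def precision_recall_f(tp: int, fp: int, fn: int) -> Scores:
    """Precision, recall and F1 from counts; None where a value is undefined."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    if precision is None or recall is None or (precision + recall) == 0:
        return precision, recall, None
    return precision, recall, 2 * precision * recall / (precision + recall)


def summarize(rows: List[Tuple[int, int, int]]) -> Scores:
    """Scores for a group of (tp, fp, fn) rows, computed from the summed counts."""
    tp = sum(row[0] for row in rows)
    fp = sum(row[1] for row in rows)
    fn = sum(row[2] for row in rows)
    return precision_recall_f(tp, fp, fn)


# StationDest row of Table 6: 42 true positives, 75 false positives, 29 false negatives.
print([round(value, 3) for value in precision_recall_f(42, 75, 29)])

# Chatbot rows of Table 6 as (tp, fp, fn).
chatbot_rows = [(33, 1, 2), (70, 2, 1), (34, 0, 0), (1, 0, 1),
                (42, 75, 29), (65, 50, 37), (35, 0, 0)]
print([round(value, 3) for value in summarize(chatbot_rows)])
```

Note that the tables list false negatives before false positives, while the function takes the counts in the order (tp, fp, fn).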
6.1 Hypotheses
Before conducting the experiment, we had three main hypotheses:
1. The performance varies between services:
Although it might sound obvious, it is worth mentioning that one of the reasons for this evaluation is that we think there is a difference between the compared NLU services. Despite their very similar concepts and “look and feel”, we expect differences when it comes to annotation quality (i.e. F-scores), which should be taken into account when deciding on one service or another.
2. The commercial products will (overall) perform better:
The initial language model of RASA, which comes with MITIE, is about 300 MB of data. The commercial services, on the other hand, are fed with data by hundreds, if not thousands, of users every day. We therefore assume that the commercial products will perform better in the evaluation, especially when the training data is sparse.
3. The quality of the labels is influenced by
the domain:
We assume that, depending on the used
algorithms and models, individual services
will perform differently in different domains.
Therefore, we think it is not unlikely that
a service which performs well on the more
technical corpus from StackExchange will
perform considerably worse on the chatbot
corpus, which has a focus on spatial and time
data, and vice versa.
6.2 Limitations
One important limitation of this evaluation is the fact that the results will not be representative of other domains. On the contrary, as already mentioned in Hypothesis 3, we do believe that there are important differences in performance between different domains. Therefore, our final conclusion cannot be that one service is absolutely better than the others, but rather that on the given corpus, one service performed better than the others. However, we believe that the approach presented here will help developers to conduct evaluations of NLU services for their domain and thus empower them to make better-informed decisions.
With regard to the used corpora, we made an effort to make them as natural as possible by using only real data from real users. However, when analysing the results, one should keep in mind that the Chatbot Corpus consists of questions which were asked by users who were aware of communicating with a chatbot. It is, therefore, conceivable that they formulated their questions in a way which they expected to be more understandable for a chatbot.
Finally, NLU services, like all other services, can change over time (and hopefully improve). While it is easy to track these changes for locally installed software, changes to cloud-based services may happen without any notice to the user. Conducting the very same experiment described in this paper in six months’ time might, therefore, lead to different results. This evaluation can therefore only be a snapshot of the current state of the compared services. While this might decrease the reproducibility of our experiment, it is also a good argument for a formalized, repeatable evaluation process, as we describe it in this paper.
7 Evaluation
The detailed results of the evaluation, broken down by single intents, entity types, corpora, and overall, are shown in Tables 5 to 8. Each table shows the result from a different NLU service. Within the tables, each row represents one particular entity type or intent.
For each row, the corpus, type (intent/entity),
and true positives, false negatives, and false pos-
itives are given. From these values, precision, re-
call, and F-score have been calculated. The en-
tity types and intents are also sorted by the corpus
they appear in. For each corpus, there is a sum-
mary row, which shows precision, recall, and F-
score for the whole corpus. At the bottom of each
table, there is also an overall summary.
From a high-level perspective, LUIS performed
best with an F-score of 0.916, followed by RASA
(0.821), Watson Conversation (0.752), and
(0.687). LUIS also performed best on each in-
dividual dataset: chatbot, web apps, and ask
ubuntu. Similarly, performed worst on ev-
ery dataset, while the second place changes be-
tween RASA and Watson Conversation (cf. Figure
Based on this data, the second hypothesis can be
rejected. Although the best performance was in-
deed shown by a commercial product, RASA eas-
ily competes with the other commercial products.
The first hypothesis is supported by our find-
ings. We can see a difference between the ser-
vices, with the F-score of LUIS being nearly 0.3
higher than the F-score of However, a two-way ANOVA with the F-score as dependent variable and the NLU service and the entity type/intent as fixed factors does not show significance at the level of p < 0.05 (p = 0.234, df = 3). An even larger corpus might be necessary to get quantitatively more robust results.
With regard to the third hypothesis, the picture is less clear. Although we can see a clear influence of the domain on the F-score within each service, the ranking between different services is not much influenced. LUIS always performs best, independent of the domain, always worst, also independent of the domain; merely the second and third place change. Therefore, although the domain influences the results, it is not clear whether or not it should also influence the decision which service should be used.
Figure 3: F-scores for the different NLU services, grouped by corpus (overall, chatbot, web apps, ask ubuntu)
On a more detailed level, we also see differ-
ences between entities and intents. Especially seems to have big troubles identifying enti-
ties. On the web apps corpus, for example,
did not identify a single occurrence of the entity
type WebService, which occurred 64 times in the
dataset. If we calculate the F-score for this dataset
only based on the intents, it would increase from
0.519 to 0.803. The overall results of were
therefore heavily influenced by its shortcomings
regarding entity detection.
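The effect of the entity results on the web apps score can be reproduced from the counts in the web apps rows of Table 7. The sketch below sums the counts and uses the identity F1 = 2TP / (2TP + FP + FN); the tuple layout and function name are our own choices.

```python
from typing import List, Tuple

# Web apps rows of Table 7 as (name, type, true positives, false negatives, false positives).
Row = Tuple[str, str, int, int, int]

web_apps_rows: List[Row] = [
    ("ChangePassword", "Intent", 4, 2, 1),
    ("DeleteAccount", "Intent", 10, 0, 2),
    ("DownloadVideo", "Intent", 0, 0, 0),
    ("ExportData", "Intent", 1, 2, 2),
    ("FilterSpam", "Intent", 10, 4, 3),
    ("FindAlternative", "Intent", 16, 0, 2),
    ("None", "Intent", 2, 2, 1),
    ("SyncAccounts", "Intent", 4, 2, 0),
    ("WebService", "Entity", 0, 64, 0),
]


def f_score(rows: List[Row]) -> float:
    """F1 over the summed counts of the given rows."""
    tp = sum(row[2] for row in rows)
    fn = sum(row[3] for row in rows)
    fp = sum(row[4] for row in rows)
    return 2 * tp / (2 * tp + fp + fn)


intent_rows = [row for row in web_apps_rows if row[1] == "Intent"]

print(round(f_score(web_apps_rows), 3))  # all rows
print(round(f_score(intent_rows), 3))    # intents only
```

The 64 missed WebService entities account for most of the false negatives, which is why dropping that single row raises the score so much.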
If we look at intents and entity types with sparse
training data, like Line, ChangePassword, and Ex-
portData, other than we expected, we do not see
a significantly better performance of commercial
8 Conclusion
The evaluation of the NLU services LUIS, Wat-
son Conversation,, and RASA, based on the
two corpora we presented in Section 5, has shown
that the quality of the annotations differs between
the different services. Before using an NLU ser-
vice, no matter if for commercial or scientific pur-
poses, one should therefore compare the different
services with domain specific data.
For our two corpora, LUIS showed the best results; however, the open source alternative RASA could achieve similar results. Given the advantages of open source solutions (mainly adaptability), it might well be possible to achieve even better results with RASA after some customization.
With regard to absolute numbers, it is difficult
to decide whether an F-score of 0.916 or 0.821 is
satisfactory for productive use within a conversa-
tional question answering system. This decision
also depends strongly on the concrete use case.
We, therefore, focused on relative comparisons in
our evaluation and leave this decision to future
corpus      entity type / intent  type    true +  false -  false +  precision  recall  F-score
chatbot     DepartureTime         Intent      34        1        1      0.971   0.971    0.971
            FindConnection        Intent      70        1        1      0.986   0.986    0.986
            Criterion             Entity      34        0        0          1       1        1
            Line                  Entity       0        2        0                  0
            StationDest           Entity      65        6        3      0.956   0.915    0.935
            StationStart          Entity      90       17        5      0.947   0.841    0.891
            Vehicle               Entity      33        2        0          1   0.943    0.971
            Σ                                326       29       10      0.970   0.918    0.943
web apps    ChangePassword        Intent       3        3        0          1     0.5    0.667
            DeleteAccount         Intent       8        2        0          1     0.8    0.889
            DownloadVideo         Intent       0        0        0
            ExportData            Intent       3        0        1       0.75       1    0.857
            FilterSpam            Intent      12        2        0          1   0.857    0.923
            FindAlternative       Intent      14        2        2      0.875   0.875    0.875
            None                  Intent       3        1        8      0.273    0.75      0.4
            SyncAccounts          Intent       5        1        0          1   0.833    0.909
            WebService            Entity      29       30        5      0.853   0.492    0.624
            Σ                                 77       41       16      0.828   0.653     0.73
ask ubuntu  MakeUpdate            Intent      36        1        4      0.900   0.973    0.935
            SetupPrinter          Intent      12        1        2      0.857   0.923    0.889
            ShutdownComputer      Intent      14        0        0          1       1        1
            SRecommendation       Intent      36        4        5      0.878     0.9    0.889
            None                  Intent       0        5        0                  0
            SoftwareName          Entity       0        4        0                  0
            Printer               Entity       5        7        0          1   0.417    0.589
            UbuntuVersion         Entity      67       10       11      0.859    0.87    0.864
            Σ                                170       32       22      0.885   0.842    0.863
overall                                      820      102       48      0.945   0.889    0.916
Table 5: Results LUIS
corpus      entity type / intent  type    true +  false -  false +  precision  recall  F-score
chatbot     DepartureTime         Intent      33        2        1      0.971   0.943    0.957
            FindConnection        Intent      70        1        2      0.972   0.986    0.979
            Criterion             Entity      34        0        0          1       1        1
            Line                  Entity       1        1        0          1     0.5    0.667
            StationDest           Entity      42       29       75      0.359   0.592    0.447
            StationStart          Entity      65       37       50      0.565   0.637    0.599
            Vehicle               Entity      35        0        0          1       1        1
            Σ                                280       70      128      0.686     0.8    0.739
web apps    ChangePassword        Intent       5        1        0          1   0.833    0.909
            DeleteAccount         Intent       9        1        3      0.750     0.9    0.818
            DownloadVideo         Intent       0        0        1          0
            ExportData            Intent       2        1        2      0.500   0.667    0.572
            FilterSpam            Intent      13        1        2      0.867   0.929    0.897
            FindAlternative       Intent      15        1        1      0.938   0.938    0.938
            None                  Intent       0        4        1          0       0
            SyncAccounts          Intent       5        1        0          1   0.833    0.909
            WebService            Entity      23       41        5      0.821   0.359      0.5
            Σ                                 72       51       15      0.828   0.585    0.686
ask ubuntu  MakeUpdate            Intent      37        0        4      0.902       1    0.948
            SetupPrinter          Intent      13        0        1      0.929       1    0.963
            ShutdownComputer      Intent      14        0        0          1       1        1
            SRecommendation       Intent      35        5        3      0.921   0.875    0.897
            None                  Intent       1        4        1      0.500     0.2    0.286
            SoftwareName          Entity       0        4        0                  0
            Printer               Entity       0       12        0                  0
            UbuntuVersion         Entity      51        7       27      0.654   0.879     0.75
            Σ                                151       32       36      0.807   0.825    0.816
overall                                      503      153      179      0.738   0.767    0.752
Table 6: Results Watson Conversation
corpus      entity type / intent  type    true +  false -  false +  precision  recall  F-score
chatbot     DepartureTime         Intent      35        0        4      0.897       1    0.946
            FindConnection        Intent      60       11        0          1   0.845    0.916
            Criterion             Entity      31        3        0          1   0.912    0.954
            Line                  Entity       1        1        0          1     0.5    0.667
            StationDest           Entity       0       71        0                  0
            StationStart          Entity      28       79        4      0.875   0.262    0.403
            Vehicle               Entity      34        1        5      0.872   0.971    0.919
            Σ                                189      166       13      0.936   0.532    0.678
web apps    ChangePassword        Intent       4        2        1      0.800   0.667    0.727
            DeleteAccount         Intent      10        0        2      0.833       1    0.909
            DownloadVideo         Intent       0        0        0
            ExportData            Intent       1        2        2      0.333   0.333    0.333
            FilterSpam            Intent      10        4        3      0.769   0.714     0.74
            FindAlternative       Intent      16        0        2      0.889       1    0.941
            None                  Intent       2        2        1      0.667     0.5    0.572
            SyncAccounts          Intent       4        2        0          1   0.667      0.8
            WebService            Entity       0       64        0                  0
            Σ                                 47       76       11      0.810   0.382    0.519
ask ubuntu  MakeUpdate            Intent      36        1        3      0.923   0.973    0.947
            SetupPrinter          Intent      13        0        1      0.929       1    0.963
            ShutdownComputer      Intent      14        0        2      0.875       1    0.933
            SRecommendation       Intent      28       12        2      0.933     0.7      0.8
            None                  Intent       2        3        8      0.200     0.4    0.267
            SoftwareName          Entity       0        4        0                  0
            Printer               Entity       0       12        0                  0
            UbuntuVersion         Entity      48       30        0          1   0.615    0.762
            Σ                                141       46       32      0.815   0.754    0.783
overall                                      377      288       56      0.871   0.567    0.687
Table 7: Results
corpus      entity type / intent  type    true +  false -  false +  precision  recall  F-score
chatbot     DepartureTime         Intent      34        1        1      0.971   0.971    0.971
            FindConnection        Intent      70        1        1      0.986   0.986    0.986
            Criterion             Entity      34        0        0          1       1        1
            Line                  Entity       0        2        0                  0
            StationDest           Entity      65        6        3      0.956   0.915    0.935
            StationStart          Entity      90       17        5      0.947   0.841    0.891
            Vehicle               Entity      33        2        0          1   0.943    0.971
            Σ                                326       29       10      0.970   0.918    0.943
web apps    ChangePassword        Intent       4        2        0          1   0.667      0.8
            DeleteAccount         Intent       9        1        5      0.643     0.9     0.75
            DownloadVideo         Intent       0        0        1          0
            ExportData            Intent       0        3        0                  0
            FilterSpam            Intent      13        1        0          1   0.929    0.963
            FindAlternative       Intent      15        1        8      0.652   0.938    0.769
            None                  Intent       0        4        1          0       0
            SyncAccounts          Intent       3        3        0          1     0.5    0.667
            WebService            Entity      45       19       87      0.341   0.703    0.459
            Σ                                 89       34      102      0.466   0.724    0.567
ask ubuntu  MakeUpdate            Intent      34        3        2      0.944   0.919    0.931
            SetupPrinter          Intent      13        0        2      0.867       1    0.929
            ShutdownComputer      Intent      14        0        6      0.700       1    0.824
            SRecommendation       Intent      33        7        4      0.892   0.825    0.857
            None                  Intent       0        5        1          0       0
            SoftwareName          Entity       0        4       11          0       0
            Printer               Entity       8        4       11      0.421   0.667    0.516
            UbuntuVersion         Entity      65       13        7      0.903   0.833    0.867
            Σ                                167       36       44      0.791   0.823    0.807
overall                                      582       99      156      0.789   0.855    0.821
Table 8: Results RASA
References
Chris Callison-Burch. 2009. Fast, cheap, and creative: evaluating translation quality using amazon’s mechanical turk. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1. Association for Computational Linguistics, pages 286–295.
Jinho D. Choi, Joel R. Tetreault, and Amanda Stent. 2015. It depends: Dependency parser comparison using A web-based evaluation tool. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers. pages 387–396.
Robert Dale. 2015. Nlp meets the cloud. Natural Language Engineering 21(4):653–659.
Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. 2016. The rise of social bots. Communications of the ACM
Gartner. 2016. Hype cycle for emerging technologies, 2016. Technical report.
Kelly Geyer, Kara Greenfield, Alyssa Mensch, and Olga Simek. 2016. Named entity recognition in 140 characters or less. In 6th Workshop on Making Sense of Microposts (#Microposts2016). pages 78–79.
Rohan Kar and Rishin Haldar. 2016. Applying chatbots to the internet of things: Opportunities and architectural elements. arXiv preprint arXiv:1611.03799.
Oleksandr Kolomiyets and Marie-Francine Moens. 2011. A survey on question answering technology from an information retrieval perspective. Inf. Sci. 181(24):5412–5434.
Michael McTear, Zoraida Callejas, and David Griol. 2016. The Conversational Interface: Talking to Smart Devices, Springer International Publishing, Cham, chapter Implementing Spoken Language Understanding, pages 187–208.
Fabrizio Morbini, Kartik Audhkhasi, Kenji Sagae, Ron Artstein, Dogan Can, Panayiotis Georgiou, Shri Narayanan, Anton Leuski, and David Traum. 2013. Which asr should i choose for my dialogue system. In Proceedings of the 14th annual SIGdial Meeting on Discourse and Dialogue. pages 394–403.
Robert Munro, Steven Bethard, Victor Kuperman, Vicky Tzuyin Lai, Robin Melnick, Christopher Potts, Tyler Schnoebelen, and Harry Tily. 2010. Crowdsourcing and language studies: the new generation of linguistic data. In Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk. Association for Computational Linguistics, pages 122–
Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Studies in Natural Language Processing. Cambridge University Press.
Philip Resnik and Jimmy Lin. 2010. Evaluation of nlp systems. The handbook of computational linguistics and natural language processing 57.
Dirk Schnelle-Walka, Stefan Radomski, Benjamin Milde, Chris Biemann, and Max Mühlhäuser. 2016. Nlu vs. dialog management: To whom am i speaking? In Joint Workshop on Smart Connected and Wearable Things (SCWT’2016), co-located with
Bayan Abu Shawar and Eric Atwell. 2007. Different measurements metrics to evaluate a chatbot system. In Proceedings of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies. Association for Computational Linguistics, Stroudsburg, PA, USA, NAACL-HLT-Dialog ’07, pages 89–96.
Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast—but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, USA, EMNLP ’08, pages 254–263.
Svetlana Stoyanchev, Pierre Lison, and Srinivas Bangalore. 2016. Rapid prototyping of form-driven dialogue systems using an open-source framework. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Association for Computational Linguistics, Los Angeles, pages 216–219.
Alan M Turing. 1950. Computing machinery and intelligence. Mind 59(236):433–460.
Johannes Twiefel, Timo Baumann, Stefan Heinrich, and Stefan Wermter. 2014. Improving domain-independent cloud-based speech recognition with domain-dependent phonetic post-processing. In AAAI. pages 1529–1536.
A Supplemental Material
A.1 Examples Chatbot Corpus
{
  "text": "what is the cheapest connection between quiddestraße and
  "intent": "FindConnection",
  "entities": [
    {
      "entity": "Criterion",
      "start": 3,
      "stop": 3
    },
    {
      "entity": "StationStart",
      "start": 6,
      "stop": 6
    },
    {
      "entity": "StationDest",
      "start": 8,
      "stop": 8
    }
  ]
}
{
  "text": "when is the next u6 leaving from garching?",
  "intent": "DepartureTime",
  "entities": [
    {
      "entity": "Line",
      "start": 4,
      "stop": 4
    },
    {
      "entity": "StationStart",
      "start": 7,
      "stop": 7
    }
  ]
}
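The `start` and `stop` values in the examples above appear to be inclusive token positions rather than character offsets. The following minimal sketch shows how an entity's surface text could be recovered from such an annotation. The tokenizer (a regular expression that splits off punctuation) and the helper names `tokenize` and `entity_spans` are assumptions made for illustration; the tokenization used to build the corpus is not specified in this section.

```python
import json
import re

# Annotation copied from the second Chatbot Corpus example above.
EXAMPLE = json.loads("""
{
  "text": "when is the next u6 leaving from garching?",
  "intent": "DepartureTime",
  "entities": [
    {"entity": "Line", "start": 4, "stop": 4},
    {"entity": "StationStart", "start": 7, "stop": 7}
  ]
}
""")


def tokenize(text):
    # Assumed tokenizer: runs of word characters, with each punctuation
    # mark as its own token. The corpus may have been tokenized differently.
    return re.findall(r"\w+|[^\w\s]", text)


def entity_spans(example):
    # Treat "start" and "stop" as inclusive, zero-based token indices.
    tokens = tokenize(example["text"])
    return [
        (ent["entity"], " ".join(tokens[ent["start"]:ent["stop"] + 1]))
        for ent in example["entities"]
    ]


print(entity_spans(EXAMPLE))
# [('Line', 'u6'), ('StationStart', 'garching')]
```

Under this reading, a one-token entity has equal `start` and `stop` values, which matches every single-word entity in the examples.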
A.2 Examples StackExchange Corpus
A.2.1 Web Applications Dataset
{
  "text": "How can I delete my Twitter account?",
  "url": "http://
  "author": "Jared Harley",
  "answer": {
    "text": "[...]",
    "author": "Ken Pespisa"
  },
  "intent": "Delete Account",
  "entities": [
    {
      "text": "Twitter",
      "stop": 5,
      "start": 5,
      "entity": "WebService"
    }
  ]
}
{
  "text": "Is it possible to export my data from Trello to back it up?",
  "url": "http://
  "author": "Clare Macrae",
  "answer": {
    "text": "[...]",
    "author": "Daniel LeCheminant"
  },
  "intent": "Export Data",
  "entities": [
    {
      "text": "Trello",
      "stop": 8,
      "start": 8,
      "entity": "WebService"
    }
  ]
}
A.2.2 Ask Ubuntu Dataset
{
  "text": "How do I install the HP F4280 printer?",
  "url": "
  "author": "ok comp",
  "answer": {
    "text": "[...]",
    "author": "nejode"
  },
  "intent": "Setup Printer",
  "entities": [
    {
      "text": "HP F4280",
      "stop": 6,
      "start": 5,
      "entity": "Printer"
    }
  ]
}
{
  "text": "What is a good MongoDB GUI client?",
  "url": "
  "author": "Eyal",
  "answer": {
    "text": "[...]",
    "author": "Eyal"
  },
  "intent": "Software
  "entities": [
    {
      "text": "MongoDB",
      "stop": 4,
      "start": 4,
      "entity": "SoftwareName"
    }
  ]
}
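Because the StackExchange entries store both the entity text and its token positions, the two can be checked against each other. The sketch below is one possible consistency check, not part of the authors' tooling. The function name `mismatched_entities` and the punctuation-splitting tokenizer are assumptions made for illustration.

```python
import re


def tokenize(text):
    # Assumed tokenizer: word-character runs, punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def mismatched_entities(example):
    # Return (entity type, annotated text, text found at the token span)
    # for every entity whose inclusive start..stop span does not
    # reproduce its annotated "text" value.
    tokens = tokenize(example["text"])
    problems = []
    for ent in example["entities"]:
        span = " ".join(tokens[ent["start"]:ent["stop"] + 1])
        if span != ent["text"]:
            problems.append((ent["entity"], ent["text"], span))
    return problems


# Fields copied from the first Ask Ubuntu example above (url and answer omitted).
example = {
    "text": "How do I install the HP F4280 printer?",
    "intent": "Setup Printer",
    "entities": [
        {"text": "HP F4280", "stop": 6, "start": 5, "entity": "Printer"}
    ],
}

print(mismatched_entities(example))  # an empty list means the offsets agree
```

A check of this kind could be run over a whole corpus file before converting the annotations into the import format of a particular NLU service.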