Bridging the Gap: Academic and Industrial Research in Dialog Technologies Workshop Proceedings, pages 89–96,
NAACL-HLT, Rochester, NY, April 2007. © 2007 Association for Computational Linguistics
Different measurements metrics to evaluate a chatbot system
Bayan Abu Shawar
IT department
Arab Open University
[add]
b_shawar@arabou-jo.edu.jo
Eric Atwell
School of Computing
University of Leeds
LS2 9JT, Leeds-UK
eric@comp.leeds.ac.uk
Abstract
A chatbot is a software system, which can
interact or “chat” with a human user in
natural language such as English. For the
annual Loebner Prize contest, rival chat-
bots have been assessed in terms of ability
to fool a judge in a restricted chat session.
We are investigating methods to train and
adapt a chatbot to a specific user’s lan-
guage use or application, via a user-
supplied training corpus. We advocate
open-ended trials by real users, such as an
example Afrikaans chatbot for Afrikaans-
speaking researchers and students in
South Africa. This is evaluated in terms of
“glass box” dialogue efficiency metrics,
and “black box” dialogue quality metrics
and user satisfaction feedback. The other
examples presented in this paper are the
Qur'an and the FAQchat prototypes. Our
general conclusion is that evaluation
should be adapted to the application and
to user needs.
1 Introduction
“Before there were computers, we could distin-
guish persons from non-persons on the basis of an
ability to participate in conversations. But now, we
have hybrids operating between person and non
persons with whom we can talk in ordinary lan-
guage.” (Colby 1999a). Human-machine conversation is a technology that integrates different areas whose core is language: computational methodologies facilitate communication between users and computers in natural language.
A related term to machine conversation is the
chatbot, a conversational agent that interacts with
users turn by turn using natural language. Different
chatbots or human-computer dialogue systems
have been developed using text communication
such as Eliza (Weizenbaum 1966), PARRY (Colby
1999b), CONVERSE (Batacharia et al. 1999), and ALICE1. Chatbots have been used in different domains such as customer service, education, website help, and for fun.
Different mechanisms are used to evaluate Spoken Dialogue Systems (SDSs), ranging from glass box evaluation, which evaluates individual components, to black box evaluation, which evaluates the system as a whole (McTear 2002). For example, glass box evaluation was applied to the ARPA Spoken Language system (Hirschman 1995), and it showed that the error rate for sentence
understanding was much lower than that for sen-
tence recognition. On the other hand black box
evaluation evaluates the system as a whole based
on user satisfaction and acceptance. The black box
approach evaluates the performance of the system
in terms of achieving its task, the cost of achieving
the task in terms of time taken and number of
turns, and measures the quality of the interaction,
normally summarised by the term ‘user satisfac-
tion’, which indicates whether the user “gets the information s/he wants, is s/he comfortable with the system, and gets the information within acceptable elapsed time, etc.” (Maier et al. 1996).
The Loebner prize2 competition has been used
to evaluate machine conversation chatbots. The
Loebner Prize is a Turing test, which evaluates the
ability of the machine to fool people into believing that they are talking to a human. In essence, judges are allowed a
short chat (10 to 15 minutes) with each chatbot,
and asked to rank them in terms of “naturalness”.
ALICE (Abu Shawar and Atwell 2003) is the
Artificial Linguistic Internet Computer Entity, first
1 http://www.alicebot.org/
2 http://www.loebner.net/Prizef/loebner-prize.html
implemented by Wallace in 1995. ALICE’s knowledge about English conversation patterns is stored
in AIML files. AIML, or Artificial Intelligence
Mark-up Language, is a derivative of Extensible
Mark-up Language (XML). It was developed by
Wallace and the Alicebot free software community
during 1995-2000 to enable people to input dia-
logue pattern knowledge into chatbots based on the
A.L.I.C.E. open-source software technology.
In this paper we present other methods to evaluate chatbot systems. The ALICE chatbot system was used for this purpose: a Java program was developed to read text from a corpus and convert it to the AIML format. The Corpus of Spoken Afrikaans (Korpus Gesproke Afrikaans, KGA), the corpus of the holy book of Islam (the Qur’an), and the FAQ of the School of Computing at the University of Leeds3 were used to produce the KGA prototypes, the Qur’an prototype, and the FAQchat prototype respectively.
Section 2 presents the Loebner Prize contest, and section 3 illustrates the ALICE/AIML architecture. The evaluation techniques of the KGA prototype, the Qur’an prototype, and the FAQchat prototype are discussed in sections 4, 5, and 6 respectively. The conclusion is presented in section 7.
2 The Loebner Prize Competition
The story began with the “imitation game” presented in Alan Turing’s paper, which posed the question “Can machines think?” (Turing 1950). In the imitation game a human observer tries to guess the sex of two players, one of whom is a man and the other a woman, while screened so that the observer cannot tell which is which by voice or appearance. Turing suggested putting a machine in the place of one of the humans and essentially playing the same game. If the observer cannot tell which is the machine and which is the human, this can be taken as strong evidence that the machine can think.
Turing’s proposal provided the inspiration for
the Loebner Prize competition, which was an at-
tempt to implement the Turing test. The first con-
test, organized by Dr. Robert Epstein, was held in 1991 at Boston’s Computer Museum. In this in-
carnation the test was known as the Loebner con-
test, as Dr. Hugh Loebner pledged a $100,000
grand prize for the first computer program to pass
3 http://www.comp.leeds.ac.uk
the test. At the beginning it was decided to limit
the topic, in order to limit the amount of language
the contestant programs must be able to cope with,
and to limit the tenor. Ten agents were used, of which six were computer programs. Ten judges would con-
verse with the agents for fifteen minutes and rank
the terminals in order from the apparently least
human to most human. The computer with the
highest median rank wins that year’s prize. Joseph
Weintraub won the first, second and third Loebner
Prize in 1991, 1992, and 1993 for his chatbots, PC
Therapist, PC Professor, which discusses men ver-
sus women, and PC Politician, which discusses
Liberals versus Conservatives. In 1994 Thomas
Whalen (Whalen 2003) won the prize for his pro-
gram TIPS, which provides information on a par-
ticular topic. TIPS provides ways to store,
organize, and search the important parts of sen-
tences collected and analysed during system tests.
However, there are sceptics who doubt the effectiveness of the Turing Test and/or the Loebner Competition. These include Block, who thought that “the Turing test is a sorely inadequate test of intelligence because it relies solely on the ability to fool people”, and Shieber (1994), who argued that intelligence is not determinable simply by surface behavior.
Shieber claimed the reason that Turing chose natu-
ral language as the behavioral definition of human
intelligence is “exactly its open-ended, free-
wheeling nature”, which was lost when the topic
was restricted during the Loebner Prize. Epstein
(1992) admitted that they had trouble with the topic restriction, and they agreed “every fifth year
or so … we would hold an open-ended test - one
with no topic restriction.” They decided that the
winner of a restricted test would receive a small
cash prize while the one who won the unrestricted test would receive the full $100,000.
In his responses to these arguments, Loebner argued that the unrestricted test is simpler, less expensive, and the best way to conduct the Turing Test. Loebner presented three goals when constructing the Loebner Prize (Loebner 1994):
1. “No one was doing anything about the Turing Test, not AI.” The initial Loebner Prize contest was the first time that the Turing Test had ever been formally tried.
2. Increasing the public understanding of AI, which is a laudable goal of the Loebner Prize: “I believe that this contest will advance AI and serve as a tool to measure the state of the art.”
3. Performing a social experiment.
The first open-ended implementation of the
Turing Test was applied in the 1995 contest, and
the prize was granted to Weintraub for the fourth
time. Details of other winners over the years can be found on the Loebner Prize webpage4.
In this paper, we advocate alternative evalua-
tion methods, more appropriate to practical infor-
mation systems applications. We have investigated
methods to train and adapt ALICE to a specific
user’s language use or application, via a user-
supplied training corpus. Our evaluation takes ac-
count of open-ended trials by real users, rather than
controlled 10-minute trials.
3 The ALICE/AIML chatbot architecture
AIML consists of data objects called AIML ob-
jects, which are made up of units called topics and
categories. The topic is an optional top-level ele-
ment; it has a name attribute and a set of categories
related to that topic. Categories are the basic units
of knowledge in AIML. Each category is a rule for
matching an input and converting to an output, and
consists of a pattern, which matches against the
user input, and a template, which is used in gener-
ating the Alice chatbot answer. The format struc-
ture of AIML is shown in figure 1.
<aiml version="1.0">
  <topic name="the topic">
    <category>
      <pattern>PATTERN</pattern>
      <that>THAT</that>
      <template>Template</template>
    </category>
    ...
  </topic>
</aiml>

The <that> tag is optional and means that the current pattern depends on a previous bot output.

Figure 1. AIML format
4 http://www.loebner.net/Prizef/loebner-prize.html
The AIML pattern is simple, consisting only of
words, spaces, and the wildcard symbols _ and *.
The words may consist of letters and numerals, but
no other characters. Words are separated by a sin-
gle space, and the wildcard characters function like
words. The pattern language is case invariant. The
idea of the pattern matching technique is based on
finding the best, longest pattern match. Three types of AIML categories are used. Atomic categories are those whose patterns contain no wildcard symbols (_ and *); default categories are those whose patterns contain the wildcard symbol * or _. The wildcard symbols match any input, but differ in their matching priority, which follows alphabetical order. For example, given the input ‘hello robot’, if ALICE does not find a category with an exactly matching atomic pattern, it will try to find a category with a default pattern. The third type, recursive categories, are those whose templates contain <srai> and <sr> tags, which refer to simple recursive artificial intelligence and symbolic reduction. Recursive categories have many
applications: symbolic reduction that reduces com-
plex grammatical forms to simpler ones; divide
and conquer that splits an input into two or more
subparts, and combines the responses to each; and
dealing with synonyms by mapping different ways
of saying the same thing to the same reply.
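To make this matching behaviour concrete, the following minimal Java sketch (our own illustration, not the actual ALICE implementation) normalises an input, prefers an exact atomic match, and otherwise falls back to default patterns ending in a * wildcard, preferring the longest matching prefix; if nothing matches it returns the default reply “I have no answer for that”.

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of AIML-style matching: atomic patterns first, then wildcard defaults.
public class SimpleMatcher {
    // pattern -> template; patterns are stored upper case, as in AIML
    private final Map<String, String> categories = new LinkedHashMap<>();

    public void addCategory(String pattern, String template) {
        categories.put(pattern.toUpperCase(), template);
    }

    public String respond(String input) {
        String normalised = input.toUpperCase().replaceAll("[^A-Z0-9 ]", "").trim();
        // 1. Atomic match: a pattern equal to the whole normalised input.
        String template = categories.get(normalised);
        if (template != null) return template;
        // 2. Default match: a pattern ending in " *"; prefer the longest matching prefix.
        String best = null;
        int bestLen = -1;
        for (Map.Entry<String, String> e : categories.entrySet()) {
            String p = e.getKey();
            if (p.endsWith(" *")) {
                String prefix = p.substring(0, p.length() - 2);
                if (normalised.startsWith(prefix) && prefix.length() > bestLen) {
                    best = e.getValue();
                    bestLen = prefix.length();
                }
            }
        }
        return best != null ? best : "I have no answer for that";
    }

    public static void main(String[] args) {
        SimpleMatcher m = new SimpleMatcher();
        m.addCategory("HELLO", "Hello Donald");
        m.addCategory("HELLO *", "Hello there");
        System.out.println(m.respond("Hello."));      // atomic match: Hello Donald
        System.out.println(m.respond("hello robot")); // default match via HELLO *: Hello there
    }
}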
The knowledge bases of almost all chatbots are
edited manually, which restricts users to specific languages and domains. We developed a Java program to read text from a machine-readable corpus and convert it to AIML format. The chatbot-training program was built to be general: generality here means no restriction to a specific language, domain, or structure. Different
languages were tested: English, Arabic, Afrikaans,
French, and Spanish. We also trained with a range
of different corpus genres and structures, includ-
ing: dialogue, monologue, and structured text
found in the Qur’an, and FAQ websites.
The chatbot-training program is composed of four phases, as follows:
1. Reading module, which reads the dialogue text from the basic corpus and inserts it into a list.
2. Text preprocessing module, where all corpus and linguistic annotations such as overlaps, fillers and others are filtered out.
3. Converter module, where the preprocessed text is passed to the converter to treat the first turn as a pattern and the second as a template. All punctuation is removed from the patterns, and the patterns are transformed to upper case.
4. Producing the AIML files by copying the generated categories from the list to the AIML file.
An example of a sequence of two utter-
ances from an English spoken corpus is:
<u who=F72PS002>
<s n="32"><w ITJ>Hello<c PUN>.
</u>
<u who=PS000>
<s n="33"><w ITJ>Hello <w NP0>Donald<c
PUN>.
</u>
After the reading and the text processing
phase, the text becomes:
F72PS002: Hello
PS000: Hello Donald
The corresponding AIML atomic category that
is generated by the converter module looks like:
<category>
<pattern>HELLO</pattern>
<template>Hello Donald</template>
</category>
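The converter step for such a turn pair can be sketched in Java as follows. This is an illustrative reconstruction (class and method names are ours, not the authors’ actual training program); it only shows how a cleaned prompt/response pair becomes one atomic AIML category, with punctuation stripped from the pattern and the pattern upper-cased.

// Sketch of the converter phase: a cleaned (prompt, response) turn pair becomes one AIML category.
public class CategoryWriter {

    static String toCategory(String prompt, String response) {
        // Remove everything except letters, digits and spaces, then upper-case the pattern.
        String pattern = prompt.replaceAll("[^\\p{L}\\p{N} ]", "").trim().toUpperCase();
        return "<category>\n"
             + "  <pattern>" + pattern + "</pattern>\n"
             + "  <template>" + response.trim() + "</template>\n"
             + "</category>";
    }

    public static void main(String[] args) {
        // The example pair from the corpus extract above:
        System.out.println(toCategory("Hello.", "Hello Donald"));
    }
}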
As a result, different prototypes were developed; in each prototype different machine-learning techniques were used and a new chatbot was tested. The machine-learning techniques ranged from a simple technique such as single-word matching to more complicated ones such as matching the least frequent words. Building atomic categories and comparing the input with all atomic patterns to find a match is an instance-based learning technique. However, the learning approach does not stop at this level: the matching process is improved by using the most significant word (the least frequent word). This increases the ability to find a nearest match by extending the knowledge base used during the matching process.
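One plausible way to realise the first-word and least-frequent-word approaches is to generate additional default (wildcard) categories around those words, so that ALICE’s normal matching can fall back on them. The Java sketch below illustrates this idea under our own assumptions; the word frequencies and the Afrikaans example input are hypothetical.

import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;

// Sketch: derive extra default patterns from an utterance, using corpus word frequencies.
public class DefaultPatterns {

    // First-word approach: match any utterance starting with the same word.
    static String firstWordPattern(String pattern) {
        return pattern.split(" ")[0] + " *";
    }

    // Most-significant-word approach: wildcards around the least frequent (most informative) word.
    static String leastFrequentWordPattern(String pattern, Map<String, Integer> corpusFreq) {
        String word = Arrays.stream(pattern.split(" "))
                .min(Comparator.comparingInt(w -> corpusFreq.getOrDefault(w, 0)))
                .orElse(pattern);
        return "* " + word + " *";
    }

    public static void main(String[] args) {
        // Hypothetical corpus frequencies for an Afrikaans utterance ("where is the library").
        Map<String, Integer> freq = Map.of("WAAR", 120, "IS", 300, "DIE", 500, "BIBLIOTEEK", 3);
        String pattern = "WAAR IS DIE BIBLIOTEEK";
        System.out.println(firstWordPattern(pattern));                 // WAAR *
        System.out.println(leastFrequentWordPattern(pattern, freq));   // * BIBLIOTEEK *
    }
}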
Three prototypes will be discussed in this paper, as listed below:
1. The KGA prototype, trained on a corpus of spoken Afrikaans. In this prototype two learning approaches were adopted: the first-word approach and the most significant word (least frequent word) approach.
2. The Qur’an prototype, trained on the holy book of Islam (the Qur’an), where in addition to the first-word approach, two significant-word (least frequent word) approaches were used, and the system was adapted to deal with the Arabic language and the non-conversational nature of the Qur’an, as shown in section 5.
3. The FAQchat prototype, used for the FAQ of the School of Computing at the University of Leeds. The same learning techniques were used, where the question represents the pattern and the answer represents the template.
Instead of chatting for just 10 minutes as suggested by the Loebner Prize, we advocate alternative evaluation methods, more attuned and appropriate to practical information systems applications. Our evaluation takes account of open-ended trials by real users, rather than artificial 10-minute trials, as illustrated in the following sections.
The aims of the different evaluation methodologies are as follows:
1. To evaluate the success of the learning techniques in giving answers, based on dialogue efficiency, dialogue quality and user satisfaction, applied to the KGA prototype.
2. To evaluate the ability to use the chatbot as a tool to access an information source, and as a useful application for this, which was applied to the Qur’an corpus.
3. To evaluate the ability to use the chatbot as an information retrieval system, by comparing it with a search engine, which was applied to FAQchat.
4 Evaluation of the KGA prototype
We developed two versions of ALICE that speak Afrikaans: Afrikaana, which speaks only Afrikaans, and AVRA, which speaks both English and Afrikaans. This was inspired by our observation
that the Korpus Gesproke Afrikaans actually in-
cludes some English, as Afrikaans speakers are
generally bilingual and “code-switch” comfortably.
We mounted prototypes of the chatbots on websites using the Pandorabots service5, and encouraged open-ended testing and feedback from remote users in South Africa; this allowed us to refine the system more effectively.
5 http://www.pandorabots.com/pandora
We adopted three evaluation metrics:
1. Dialogue efficiency, in terms of matching type.
2. Dialogue quality metrics, based on response type.
3. Users’ satisfaction assessment, based on an open-ended request for feedback.
4.1 Dialogue efficiency metric
We measured the efficiency of four sample dialogues in terms of atomic match, first-word match, most-significant-word match, and no match. We wanted to measure the efficiency of the adopted learning mechanisms, to see whether they increase the ability to find answers to general user input, as shown in Table 1.

Matching type       D1   D2   D3   D4
Atomic               1    3    6    3
First word           9   15   23    4
Most significant    13    2   19    9
No match             0    1    3    1
Number of turns     23   21   51   17

Table 1. Response type frequency
The frequency of each type in each dialogue
generated between the user and the Afrikaans
chatbot was calculated; in Figure 2, these absolute
frequencies are normalised to relative probabilities.
No significance test was applied; this approach to evaluation via dialogue efficiency metrics illustrates that the first-word and most-significant-word approaches increase the ability to generate answers to users and let the conversation continue.
Figure 2. Dialogue efficiency: response type relative frequencies
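For clarity, the relative probabilities plotted in Figure 2 appear to be each count in Table 1 divided by the number of turns in that dialogue (the counts per dialogue sum to the number of turns):

$P_d(t) = \dfrac{\mathrm{count}(t, d)}{\mathrm{turns}(d)}$

For example, for Dialogue 1 the first-word relative frequency is $9/23 \approx 0.39$.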
4.2 Dialogue quality metric
In order to measure the quality of each re-
sponse, we wanted to classify responses according
to an independent human evaluation of “reason-
ableness”: reasonable reply, weird but understand-
able, or nonsensical reply. We gave the transcript
to an Afrikaans-speaking teacher and asked her to
mark each response according to these classes. The
number of turns in each dialogue and the frequen-
cies of each response type were estimated. Figure 3
shows the frequencies normalised to relative prob-
abilities of each of the three categories for each
sample dialogue. For this evaluator, it seems that
“nonsensical” responses are more likely than rea-
sonable or understandable but weird answers.
4.3 Users' satisfaction
The first prototypes were based only on literal
pattern matching against corpus utterances: we had
not implemented the first word approach and least-
frequent word approach to add “wildcard” default
categories. Our Afrikaans-speaking evaluators
found these first prototypes disappointing and frus-
trating: it turned out that few of their attempts at
conversation found exact matches in the training
corpus, so Afrikaana replied with a default “ja”
most of the time. However, expanding the AIML
pattern matching using the first-word and least-
frequent-word approaches yielded more favorable
feedback. Our evaluators found the conversations
less repetitive and more interesting. We measured user satisfaction on the basis of this kind of informal user feedback.
Figure 3. The quality of the dialogue: response type relative probabilities (reasonable, weird, nonsensical) for each sample dialogue
5 Evaluation of the Qur'an prototype
In this prototype a parallel English/Arabic corpus of the holy book of Islam was used. The aim of the Qur’an prototype is to explore the problems of using the Arabic language and of using a text which is not conversational in nature, like the Qur’an. The Qur’an is composed of 114 sooras (chapters), and each soora is composed of a different number of verses. The same learning techniques as in the KGA prototype were applied, where in this case if an input is a whole verse, the response is the next verse of the same soora; if an input is a question or a statement, the output is all verses which seem appropriate based on the significant word. To measure the quality of the answers of the Qur’an chatbot version, the following approach was applied:
1. Random sentences from Islamic websites were selected and used as inputs to the English/Arabic version of the Qur’an prototype.
2. The resulting transcripts, which have 67 turns, were given to 5 Muslim and 6 non-Muslim students, who were asked to label each turn as:
Related (R), in case the answer was correct and on the same topic as the input.
Partially related (PR), in case the answer was not correct, but on the same topic.
Not related (NR), in case the answer was not correct and on a different topic.
Proportions for each label and each class of users (Muslims and non-Muslims) were calculated as the total number of labels over the number of users times the number of turns. Four of the 67 turns returned no answers, so in fact 63 turns were used, as presented in Figure 4.
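In other words, for a label $l$ (R, PR or NR) and a user group $g$ (the 5 Muslims, the 6 non-Muslims, or all 11 users), the reported proportion is

$\mathrm{proportion}(l, g) = \dfrac{\sum_{u \in g} \mathrm{count}_u(l)}{|g| \times 63}$

where $\mathrm{count}_u(l)$ is the number of the 63 answered turns that user $u$ labelled $l$.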
In the transcripts used, more than half of the results were not related to their inputs. A small difference can be noticed between the Muslims’ and non-Muslims’ proportions. Approximately half of the answers in the sample were not related from the non-Muslims’ point of view, whereas this figure is 58% from the Muslims’ perspective. Explanations for this include:
1. The different interpretations of the answers.
2. The Qur’an uses traditional Arabic language, which is sometimes difficult to understand without knowing the meaning of some words and the historical story behind each verse.
3. The English translation of the Qur’an is not enough to judge whether a verse is related or not, especially given that non-Muslims do not have background knowledge of the Qur’an.
Using chatting to access the Qur’an may look like using a standard Qur’an search tool. In fact it is totally different: a search tool usually matches words, not statements. For example, if the input is “How shall I pray?”, the chatbot will give you all ayyas (verses) where the word “pray” is found, because it is the most significant word, whereas a search tool6 will not give you any match. If the input is just the word “pray”, chatting will give you the same answer as before, while the search tool will return all ayyas that have “pray” as a string or substring, so words such as “praying”, “prayed”, etc. will also match.
Another important difference is that in the
search tool there is a link between any word and
the document it is in, but in the chatting system
there is a link just for the most significant words, so if the input statement involves a significant word (or words), a match will be found; otherwise the chatbot’s answer will be: “I have no answer for that”.
Figure 4. Proportions of each answer type (related, partially related, not related) as judged by Muslim users, non-Muslim users, and overall
6 Evaluation of the FAQchat prototype
To evaluate FAQchat, an interface was built,
which has a box to accept the user input, and a but-
ton to send this to the system.
6 http://www.islamicity.com/QuranSearch/
The outcomes appear in two columns: one holds the FAQchat an-
swers, and the other holds the Google answers af-
ter filtering Google to the FAQ database only.
Google allows search to be restricted to a given
URL, but this still yields all matches from the
whole SoC website (http://www.comp.leeds.ac.uk)
so a Perl script was required to exclude matches
not from the FAQ sub-pages.
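The filtering step itself is straightforward. The authors used a Perl script; the Java sketch below shows the same idea purely for illustration, with a hypothetical URL prefix standing in for the actual FAQ sub-pages.

import java.util.List;
import java.util.stream.Collectors;

// Sketch of the result-filtering step: keep only hits from the FAQ sub-pages.
// The URL prefix below is assumed for illustration; it is not taken from the paper.
public class FaqFilter {
    static final String FAQ_PREFIX = "http://www.comp.leeds.ac.uk/faq";

    static List<String> keepFaqOnly(List<String> resultUrls) {
        return resultUrls.stream()
                .filter(url -> url.startsWith(FAQ_PREFIX))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> hits = List.of(
                "http://www.comp.leeds.ac.uk/faq/printing.html",
                "http://www.comp.leeds.ac.uk/research/index.html");
        System.out.println(keepFaqOnly(hits)); // only the FAQ page survives
    }
}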
An evaluation sheet was prepared which con-
tains 15 information-seeking tasks or questions on
a range of different topics related to the FAQ data-
base. The tasks were suggested by a range of users
including SoC staff and research students, to cover the three possibilities: where FAQchat could find a direct answer, where it could find links to more than one possible answer, and where it could not find any answer. In order not to restrict users to these
tasks, and not to be biased to specific topics, the
evaluation sheet included spaces for users to try 5
additional tasks or questions of their own choosing.
Users were free to decide exactly what input string to give to FAQchat to find an answer: they were not required to type questions verbatim, and they were free to try more than once; if no appropriate answer was found, users could reformulate the query.
The evaluation sheet was distributed among 21
members of the staff and students. Users were
asked to try using the system, and state whether
they were able to find answers using the FAQchat
responses, or using the Google responses; and
which of the two they preferred and why.
Twenty-one users tried the system: nine were members of staff and the rest were postgraduates. The analysis was tackled in two directions: the preference, and the number of matches found, per question and per user.
6.1 Number of matches per question
The number of evaluators who managed to find answers with FAQchat and with Google was counted for each question. The results in Table 2 show that overall 68% of our sample of users managed to find answers using FAQchat, while 46% found them using Google. Since there is no specific format for asking the questions, there are cases where some users could find answers while others could not; success in finding answers depends on the way the questions were presented to FAQchat.
              Mean of users finding answers   Proportion of finding answers
Users/Tool    FAQchat        Google           FAQchat        Google
Staff           5.53           3.87             61%            43%
Student         8.8            5.87             73%            49%
Overall        14.3            9.73             68%            46%

Table 2. Proportion of users finding answers
Within the overall sample, 61% of staff were able to find answers with FAQchat, whereas 73% of students managed to do so; students were more successful than staff.
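The proportions in Table 2 are consistent with dividing the mean number of users who found an answer per question by the size of each group (9 staff, 12 students, 21 users overall):

$5.53/9 \approx 61\%, \qquad 8.8/12 \approx 73\%, \qquad 14.3/21 \approx 68\%$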
6.2 The preferred tool per each question
For each question, users were asked to state
which tool they preferred to use to find the answer.
The proportion of users who preferred each tool
was calculated. The results in Figure 5 show that 51% of staff, 41% of students, and 47% overall preferred using FAQchat, against 11% who preferred Google.
Figure 5. Proportion of users who preferred each tool
6.3 Number of matches and preference found
per user
The number of answers each user had found was counted; the proportions found were much the same as in the per-question analysis. The evaluation sheet ended with an open section inviting general feedback. The following is a summary of the feedback we obtained:
Both staff and students preferred using FAQchat for two main reasons:
1. Its ability to give direct answers sometimes, while Google only gives links.
2. The number of links returned by FAQchat is smaller than the number returned by Google for some questions, which saves time browsing/searching.
Users who preferred Google justified their preference for two reasons:
1. Prior familiarity with using Google.
2. FAQchat seemed harder to steer with carefully chosen keywords, but more often did well on the first try. This happens because FAQchat gives answers if a keyword matches a significant word; the same will occur if you reformulate the question and FAQchat matches the same word, whereas Google may give different answers in this case.
To test the reliability of these results, t-tests were applied; the outcomes confirmed the previous results.
7 Conclusion
The Loebner Prize Competition has been used to evaluate the ability of chatbots to fool people into believing that they are speaking to humans. Comparing the dialogues generated by ALICE, which won the Loebner Prize, with real human dialogues shows that ALICE uses explicit dialogue-act linguistic expressions more than usual to reinforce the impression that users are speaking to a human.
Our general conclusion is that we should NOT
adopt an evaluation methodology just because a
standard has been established, such as the Loebner
Prize evaluation methodology adopted by most
chatbot developers. Instead, evaluation should be
adapted to the application and to user needs. If the
chatbot is meant to be adapted to provide a specific
service for users, then the best evaluation is based
on whether it achieves that service or task
References
Abu Shawar B and Atwell E. 2003. Using dialogue
corpora to retrain a chatbot system. In Proceedings of
the Corpus Linguistics 2003 conference, Lancaster
University, UK, pp681-690.
Batacharia, B., Levy, D., Catizone R., Krotov A. and
Wilks, Y. 1999. CONVERSE: a conversational com-
panion. In Wilks, Y. (ed.), Machine Conversations.
Kluwer, Boston/Dordrecht/London, pp. 205-215.
Colby, K. 1999a. Comments on human-computer con-
versation. In Wilks, Y. (ed.), Machine Conversations.
Kluwer, Boston/Dordrecht/London, pp. 5-8.
Colby, K. 1999b. Human-computer conversation in a
cognitive therapy program. In Wilks, Y. (ed.), Ma-
chine Conversations. Kluwer, Boston/Dordrecht/London, pp. 9-19.
Epstein R. 1992. Can Machines Think? AI Magazine, Vol. 13, No. 2, pp. 80-95.
Garner R. 1994. The idea of RED, [Online],
http://www.alma.gq.nu/docs/ideafred_garner.htm
Hirschman L. 1995. The Roles of language processing
in a spoken language interface. In Voice Communi-
cation Between Humans and Machines, D. Roe and J.
Wilpon (Eds), National Academy Press, Washington,
DC, pp217-237.
Hutchens, J. 1996. How to pass the Turing test by
cheating. [Online], http://ciips.ee.uwa.edu.au/Papers/,
1996
Hutchens, T., Alder, M. 1998. Introducing MegaHAL.
[Online],
http://cnts.uia.ac.be/conll98/pdf/271274hu.pdf
Loebner H. 1994. In Response to lessons from a re-
stricted Turing Test. [Online],
http://www.loebner.net/Prizef/In-response.html
Maier E, Mast M, and LuperFoy S. 1996. Overview.
In Elisabeth Maier, Marion Mast, and Susan Luper-
Foy (Eds), Dialogue Processing in Spoken Language
Systems, Springer, Berlin, pp. 1-13.
McTear M. 2002. Spoken dialogue technology: ena-
bling the conversational user interface. ACM Com-
puting Surveys. Vol. 34, No. 1, pp. 90-169.
Shieber S. 1994. Lessons from a Restricted Turing
Test. Communications of the Association for Com-
puting Machinery, Vol 37, No. 6, pp70-78
Turing A. 1950. Computing Machinery and intelli-
gence. Mind 59, 236, 433-460.
Weizenbaum, J. 1966. ELIZA-A computer program
for the study of natural language communication be-
tween man and machine. Communications of the
ACM, Vol. 9, No. 1, pp. 36-45.
Whalen T. 2003. My experience with 1994 Loebner
competition, [Online],
http://hps.elte.hu/~gk/Loebner/story94.htm