Bottester: Testing Conversational Systems with
Simulated Users
Marisa Vasconcelos
IBM Research
Sao Paulo, Brazil
marisaav@br.ibm.com
Heloisa Candello
IBM Research
Sao Paulo, Brazil
heloisacandello@br.ibm.com
Claudio Pinhanez
IBM Research
Sao Paulo, Brazil
csantop@br.ibm.com
Thiago dos Santos
IBM Research
Sao Paulo, Brazil
thiagodo@br.ibm.com
ABSTRACT
Recently, conversational agents have attracted the attention of many companies, such as IBM, Facebook, Google, and Amazon, which have focused on developing tools or APIs (Application Programming Interfaces) for developers to create their own chatbots. In this paper, we focus on new approaches to evaluate such systems, presenting some recommendations that resulted from evaluating a real chatbot use case. Testing conversational agents or chatbots is not a trivial task due to the multitude of aspects and tasks (e.g., natural language understanding, dialog management, and response generation) which must be considered both separately and in combination. Also, the creation of a general testing tool is a challenge, since evaluation is very sensitive to the application context. Finally, exhaustive testing can be a tedious task for the project team, which creates the need for a tool to perform it automatically. This paper opens a discussion about how testing tools for conversational systems are essential to ensure the well-functioning of such systems, as well as to help interface designers develop consistent conversational interfaces.
Author Keywords
chatbot, conversational agents, testing
ACM Classification Keywords
H.5.m Information interfaces and presentation (e.g., HCI):
Miscellaneous
INTRODUCTION
Conversational systems have been gaining popularity thanks
to advances in artificial intelligence and in other technologies
such as speech recognition. Chatbots and other text-based
agents are being used in several domains such as customer
service, education, financial advising, and others. Companies
such as Google, IBM and Facebook are jumping into the de-
velopment of platforms for building conversational interfaces
and also integrating various apps into chatbots.
As with any software system, testing is required to detect problems in the system and to compare the current version with previous versions [3]. However, conversational interfaces are complex and composed of many modules (e.g., dialog management, natural language processing, etc.), which makes testing non-trivial because of the diversity of interaction parameters and their interrelations and temporal dynamics.
Furthermore, a chatbot evaluation is very dependent on the context of the application; for instance, an education chatbot has different test cases from a travel-agent one. The goal of the chatbot is also relevant: task-oriented chatbots have specific goals which guide their interaction, while non-task-oriented chatbots have no fixed goal and aim to establish free conversations with users. Moreover, testing chatbots with human beings is a costly, time-consuming, and tedious task.
Previous work [1, 2] has proposed metrics which are mostly concerned with testing the modules of a chatbot system separately [3]. Another group of studies has focused only on the completion rate of task-oriented systems [6], while other researchers have performed qualitative evaluations with small groups of people [1, 5, 7] to assess user satisfaction. The famous Loebner Prize competition (http://www.loebner.net/Prizef/loebner-prize.html) has also been used to evaluate the ability of chatbots to have human-like conversations.
Here we explore the testing of chatbot systems from the user perspective, that is, by observing the resulting interaction of all chatbot modules together. Thus, we propose to simulate users with a chatbot-testing tool which interacts with chatbots and collects measures about the interactions. This is the approach we explored in a tool we call Bottester.
This is an initial report on indicators of quality and user satisfaction based on metrics collected during interactions of the Bottester with the chatbot system. The testing system simulates a large number of interactions with the chatbot, automatically creating dialogues which resemble real user interaction. It then automatically computes metrics related to user satisfaction as well as some related to the overall performance of the system. As a use case, we experiment with a finance chatbot system (CognIA) implemented to help users make basic investment decisions.
COGNIA CHATBOT SYSTEM
We developed a chatbot specialized in financial advice, named CognIA, which gives investment advice to people with limited financial knowledge. CognIA was designed as a mix of a free-conversation system and a goal-driven system, which means that it does not have the sole objective of completing a task: it also has to keep the user engaged and willing to come back for future interactions.
The CognIA implementation follows a multi-agent architecture composed of three chatbots: Cognia, which acts as a moderator; PoupancaGuru, which is responsible for advising about savings accounts; and CDBGuru, which is specialized in certificates of deposit. CognIA has a database consisting of 491 question-answer pairs in Portuguese, with question variations (e.g., “Tell me about savings?” or “What is savings?”), summing up to 38 different intents.
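As an illustration of this kind of moderator-plus-specialists arrangement, the sketch below shows a moderator routing a classified intent to one of two specialist agents. The intent names, the keyword-based classifier, and the agent interfaces are hypothetical stand-ins, since the paper does not describe CognIA's internal APIs.

```python
# Minimal sketch of a moderator dispatching to specialist agents.
# Intent names and the routing table are illustrative, not CognIA's real ones.

def classify_intent(question: str) -> str:
    """Toy intent classifier: keyword matching stands in for the real NLP module."""
    text = question.lower()
    if "cdb" in text or "certificate" in text:
        return "cdb_info"
    if "savings" in text or "poupanca" in text:
        return "savings_info"
    return "unknown"

# Each specialist maps an intent to an answer-producing function.
SPECIALISTS = {
    "savings_info": lambda q: "PoupancaGuru: savings accounts are a low-risk option ...",
    "cdb_info": lambda q: "CDBGuru: a CDB is a certificate of deposit issued by banks ...",
}

def moderator(question: str) -> str:
    """Cognia-like moderator: dispatch to a specialist or fall back."""
    handler = SPECIALISTS.get(classify_intent(question))
    if handler is None:
        return "Cognia: Sorry, could you rephrase your question?"
    return handler(question)

print(moderator("What is a CDB?"))
```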
THE BOTTESTER TOOL
Because the space of possible test cases for CognIA was growing too large to be explored manually, CognIA needed to be tested automatically. Thus, we needed a tool to simulate users interacting exhaustively with the system. The development of the Bottester tool was therefore carried out after CognIA was already implemented. The advantage of this approach was that, to simulate users, we could use a corpus of frequent questions and answers which had already been collected and built.
Figure 1 shows the Bottester interface. The inputs for Bottester (left part of the interface) are files containing all the questions to be submitted to the tested system, the respective expected answers, the configuration parameters for the tests, and the connection protocols to the chatbot system. There are optional parameters, such as the number of times the input scenario is to be executed and a sleep time to simulate a delay between user inputs. On the right, the interface shows the questions which were sent to the chatbot after pressing the play button. The Bottester also has a dashboard module to show the results.
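To make this setup concrete, the sketch below shows what a minimal simulated-user run of this kind could look like. The file name, the HTTP endpoint, the JSON shape of the chatbot's reply, and the parameter names are all assumptions, since the paper does not specify Bottester's actual formats or connection protocol.

```python
import csv
import time

import requests  # third-party HTTP client (pip install requests)

# Hypothetical configuration; Bottester's real parameter names are not published.
CONFIG = {
    "questions_file": "questions.csv",            # columns: question, expected_answer
    "chatbot_url": "http://localhost:8080/chat",  # placeholder endpoint
    "repetitions": 3,       # how many times to replay the whole input scenario
    "sleep_seconds": 1.5,   # simulated delay between user inputs
}

def run_scenario(config):
    """Send every question to the chatbot, recording the answer and response time."""
    with open(config["questions_file"], newline="", encoding="utf-8") as f:
        pairs = list(csv.DictReader(f))
    results = []
    for _ in range(config["repetitions"]):
        for row in pairs:
            start = time.time()
            reply = requests.post(config["chatbot_url"],
                                  json={"text": row["question"]}).json()
            results.append({
                "question": row["question"],
                "expected": row["expected_answer"],
                "answer": reply.get("text", ""),        # assumed reply field
                "response_time": time.time() - start,   # seconds, as seen by the "user"
            })
            time.sleep(config["sleep_seconds"])         # mimic a real user's pacing
    return results
```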
Evaluation Metrics
As mentioned earlier, the idea of the Bottester tool is to simulate a real user interacting with the system. Indeed, we collect our metrics at the interface level, which means that we cannot point out which chatbot module may have a performance problem, but rather identify the effects of that problem on the user interaction. The Bottester can be used during all stages of development, for instance, to compare different versions of the same chatbot system. The first time designers and developers use the Bottester, they can create scenarios based on previous user studies to build the first set of ground-truth questions and answers for the new messages. Otherwise, the Bottester will be biased toward developers and designers and may not address user needs or lead to a good user experience.

Figure 1. Bottester interface.
Initially, we focused on the presentation of content, and thus we started by measuring how answers were presented in the chatbot interface. For that, we collect the size (in characters and words) of each answer given by the chatbot. This metric indicates how verbose the answers are, pointing the project team to which ones should be shortened. Conciseness is often important since the chatbot can be accessed through several types of devices, and thus performance in many cases depends on what the screen can display.
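As a concrete sketch of this measurement, the snippet below computes character and word counts per answer and their means; the list of answers is a hypothetical stand-in for whatever the tool records during a run.

```python
from statistics import mean

# Hypothetical answers collected from a test run.
answers = [
    "Savings accounts are a low-risk investment.",
    "A CDB is a certificate of deposit issued by banks, usually paying a percentage of the CDI rate.",
]

char_counts = [len(a) for a in answers]          # size in characters
word_counts = [len(a.split()) for a in answers]  # size in words

print(f"mean answer size: {mean(char_counts):.1f} chars, {mean(word_counts):.1f} words")
# Answers far above the mean are candidates for shortening on small screens.
```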
Moreover, we can test the accuracy of the chatbot system's natural language intent (speech-act) classifier. Since we built the application, we have access to each bot agent's knowledge base and thus to the ground truth for all questions. In the current version of the tool, we consider an answer correct only if there is a total match between the answer and the expected answer defined in the input file. Eventually, other similarity metrics (e.g., perplexity and distance measures) will be implemented to account for partially correct answers. The number of correct and incorrect answers gives an idea of how relevant and appropriate the response to each question is, which indirectly assesses the performance of the natural language processing module.
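A minimal version of this correctness check could look like the sketch below, assuming each record pairs the observed answer with the expected one; the record field names are hypothetical.

```python
def is_correct(answer: str, expected: str) -> bool:
    # Strict total match, as in the current version of the tool;
    # a normalized comparison (e.g., answer.strip().lower()) could relax this later.
    return answer == expected

def count_correctness(records):
    """records: iterable of dicts with 'answer' and 'expected' keys (assumed names)."""
    flags = [is_correct(r["answer"], r["expected"]) for r in records]
    correct = sum(flags)
    return correct, len(flags) - correct

correct, incorrect = count_correctness([
    {"answer": "Savings accounts are low risk.",
     "expected": "Savings accounts are low risk."},
    {"answer": "Sorry, rephrase your question",
     "expected": "A CDB is a certificate of deposit."},
])
print(f"{correct} correct, {incorrect} incorrect")
```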
Orthogonally, we also measure the number of repeated answers. A high number of repeated answers can suggest that the bot has a limited knowledge base (e.g., if there is a significant number of “I don’t know” answers) or has a problem in its NLP/classification module. Figure 2 shows a word cloud created from CognIA's answers. We can observe that the word “savings” appears in many of the chatbot's answers.
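The repetition and vocabulary checks can be sketched with simple frequency counts, as below. The fallback phrase used to flag “I don't know”-style answers is an assumption, since CognIA's actual fallback wording is not given in the paper.

```python
from collections import Counter

# Hypothetical answers collected from a test run.
answers = [
    "Savings accounts are low risk.",
    "Sorry, rephrase your question",
    "Sorry, rephrase your question",
    "A CDB usually pays a percentage of the CDI rate.",
]

answer_freq = Counter(answers)   # how often each full answer is repeated
word_freq = Counter(w.lower().strip(".,?!") for a in answers for w in a.split())

FALLBACK = "Sorry, rephrase your question"   # assumed fallback phrase
fallback_rate = answer_freq[FALLBACK] / len(answers)

print(answer_freq.most_common(3))   # most repeated answers (knowledge base limitation)
print(word_freq.most_common(5))     # dominant vocabulary (cf. the word cloud)
print(f"fallback rate: {fallback_rate:.0%}")
```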
Finally, the Bottester tool also collects the response time experienced by the user, which is the time interval between the submission of a question and the arrival of the response. The response time is a key metric because it is often related to the user's perception of the quality of service. This measure is an aggregate evaluation, since it depends on several aspects such as the chatbot server's capabilities and load, network delays, and response processing on the user side. We are especially interested in identifying whether delays exist and, therefore, whether the user experience may be impaired by them. Future versions of the testing tool will allow identifying where a delay occurs, which will help developers understand what caused it and in which part of the code. On the other hand, very fast response times can also be problematic, since the user may not manage to read the answers if the scroll bar moves too fast and may become frustrated with the system [4].

Figure 2. Word cloud generated using CognIA answers.

Metric | Assessment
Mean answer size | Answer conciseness
Answer frequency | Knowledge base limitation
Word frequency | Vocabulary limitation
# of (in)correct answers | Ability to understand the language
Mean response time | Overall user perception of the service quality
Response time per question | Delay of each answer
Response time per agent | User perception of each agent's capacity
Table 1. Metrics used by the Bottester tool.
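To tie the metrics in Table 1 back to the data a simulated-user run produces, the sketch below aggregates response times overall, per question, and per agent from hypothetical interaction records; the field names are assumptions, not Bottester's actual schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical interaction records produced by a simulated-user run.
records = [
    {"question": "What is savings?", "agent": "PoupancaGuru", "response_time": 0.8},
    {"question": "What is a CDB?",   "agent": "CDBGuru",      "response_time": 2.4},
    {"question": "What is savings?", "agent": "PoupancaGuru", "response_time": 1.1},
]

overall = mean(r["response_time"] for r in records)   # mean response time

per_question, per_agent = defaultdict(list), defaultdict(list)
for r in records:
    per_question[r["question"]].append(r["response_time"])
    per_agent[r["agent"]].append(r["response_time"])

print(f"mean response time: {overall:.2f}s")
for q, times in per_question.items():
    print(f"  per question '{q}': {mean(times):.2f}s")   # delay of each answer
for a, times in per_agent.items():
    print(f"  per agent {a}: {mean(times):.2f}s")         # perceived agent capacity
```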
TESTING COGNIA
Initially, we performed a sanity check of the CognIA system, verifying whether all questions and answers implemented in it were being responded to. We mainly focused on the number of correct/incorrect answers, but we also used the response time to identify which types of questions were taking a significant amount of time. We also separated some of the metrics per chatbot agent and per type of question (see Figure 3). This separation allows us to assess and correlate how the user experiences each agent. Table 1 summarizes the metrics used by the Bottester and Figure 3 shows the visualization of the same metrics.

Figure 3. Bottester results interface.

Example | Scenario
Cognia: Hello, how can I help you? |
Bottester: I would like to invesg on CDB | typo testing
CDBGuru: Which amount would like to invest? |
Bottester: 1000 sandwichs | wrong answer
Cognia: Sorry, rephrase your question |
Table 2. Example of a dialog between the Bottester and CognIA.
Currently, we are refining the set of questions used in the sanity check by generating typos in the words of each question, to make sure the intent classifier can withstand simple variations in the way the user formulates questions. Moreover, we randomly select words from the answers and replace them with synonyms. With these two classes of experiments, we intend to better mimic real user interactions. Table 2 shows an example of a dialog in which we test different kinds of errors.
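A simple way to generate such perturbed inputs is sketched below: the typo generator drops or swaps characters at random, and the synonym substitution draws from a small hand-made dictionary, since the paper does not say which typo model or synonym source is used.

```python
import random

def add_typo(text: str) -> str:
    """Introduce one random character-level typo (deletion or adjacent swap)."""
    if len(text) < 2:
        return text
    i = random.randrange(len(text) - 1)
    if random.random() < 0.5:
        return text[:i] + text[i + 1:]                       # delete one character
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]   # swap adjacent characters

# Tiny hand-made synonym dictionary (illustrative only).
SYNONYMS = {"invest": "apply", "savings": "savings account"}

def replace_synonyms(text: str) -> str:
    return " ".join(SYNONYMS.get(w.lower(), w) for w in text.split())

random.seed(42)
question = "I would like to invest on CDB"
print(add_typo(question))          # a randomly perturbed variant of the question
print(replace_synonyms(question))  # "I would like to apply on CDB"
```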
CONCLUSION
Testing a chatbot is not trivial since the space of possible inputs is extremely large: each dialog is interactive, which means that each conversation may happen only once. Manual testing is time-consuming, often inaccurate, and, in scenarios with much repetition, a very tedious task for developers. Automated testing, or performing tests using a tool, is not only faster but also allows the coverage of an extensive number of scenarios. The idea is not to replace experiments with real users but to make the system robust enough, with no critical problems, before user experiments. Automatic testing can easily detect conversation flow stoppers, such as the bot answering “I don’t know”, producing nonsense statements, or repeating utterances, all of which clearly affect the user experience. Knowing these issues, experience and content designers can identify and fix them faster and earlier in the process.
Some of the main challenges in developing a chatbot testing tool are measuring user satisfaction, task success, and dialog cost. This is even more problematic in non-task-oriented situations, where it is challenging to define and predict the contents and topics of user utterances and, therefore, to specify conversation scenarios for testing user satisfaction. Our testing tool can assist developers and designers in evaluating single- and multi-agent chatbots in those contexts. As future work, we will evaluate the effectiveness of the Bottester tool by comparing its results with usability tests performed with real users.
REFERENCES
1. Karolina Kuligowska. 2015. Commercial chatbot:
performance evaluation, usability metrics, and quality
standards of embodied conversational agents.
Professional Center for Business Research (2) (February
2015), 1–16.
2. Chia-Wei Liu, Ryan Lowe, Iulian Serban, Michael
Noseworthy, Laurent Charlin, and Joelle Pineau. 2016.
How NOT To Evaluate Your Dialogue System: An
Empirical Study of Unsupervised Evaluation Metrics for
Dialogue Response Generation. In Proc. of the
Conference on Empirical Methods in Natural Language
Processing.
3. Michael McTear, Zoraida Callejas, and David Griol.
2016. Evaluating the Conversational Interface. Springer
International Publishing, 379–402.
4. Jakob Nielsen. 1993. Usability Engineering. Morgan
Kaufmann Publishers Inc.
5. Bayan Abu Shawar and Eric Atwell. 2007. Different
Measurements Metrics to Evaluate a Chatbot System. In
Proceedings of the Workshop on Bridging the Gap:
Academic and Industrial Research in Dialog
Technologies (NAACL-HLT-Dialog ’07).
6. Marilyn Walker, Diane Litman, Candace Kamm, and
Alicia Abella. 1997. PARADISE: A Framework for
Evaluating Spoken Dialogue Agents. In Proceedings of
EACL ’97.
7. Zhou Yu, Leah Nicolich-Henkin, Alan Black, and
Alexander Rudnicky. 2016. A Wizard-of-Oz Study on A
Non-Task-Oriented Dialog Systems That Reacts to User
Engagement. In Proc. of the 17th Annual Meeting of the
Special Interest Group on Discourse and Dialogue
(SIGDIAL).