Chapter 2
Designing Conversational Interfaces
Heloisa Candello, Claudio Pinhanez1
Abstract
Recent advances in artificial intelligence, natural language processing, and mobile
computing, together with the rising popularity of chat and messaging environments,
have enabled a boom in the deployment of interactive systems based on conversation
and dialogue. This course explores the design and evaluation of conversational
interfaces, here defined as interfaces that rely mostly on dialogue between human and
computational agents, both in speech and text. The course outlines the current state of
conversational interfaces and the basics of the underlying enabling technology, and
explores more deeply design issues and methods which address specific challenges of
interfaces based on dialogue.
1. Introduction
After many broken promises since the 1970s, conversational computer systems are finally starting to be incorporated into the daily routines of people. Speech-based interfaces such as Siri and Cortana are now available to everyone who has a smartphone. Large-scale deployment of text-based conversational systems has already taken place in China, in many e-commerce and government websites, and voice-based commands are becoming a common accessory in mainstream automobiles. Language-based interaction with computers seems to have become a reality of the 2010s and beyond.
But how do people prefer to converse with computers? When is text better than speech? Should the computer's voice be mechanical like the starship's computer in the early "Star Trek" television series or seductive like in the 2013 movie "Her"? Should the machine's voice be emotionless like HAL in the movie "2001" or hysterical like the robot in the classic 1960s series "Lost in Space"? Should computers talk in their own language like C-3PO in the "Star Wars" saga or the cute robots in "Wall-E" and leave humans to learn it? Should conversational systems be allowed to control their voices to deceive humans like the dangerously cute Ava of "Ex_Machina" (see Figure 1)? Should
1"Heloisa"Candelo"and"Claudio"Pinhanez"are"researchers"from"IBM"Research"Brazil"
those systems use the word “I” or the passive voice? Should they look intelligent or
stupid? Feminine or masculine terms? Extroverted or introverted? Emoticons? Humor?
Talking faces? Multiple agents or one?
As we can see, there are plenty of new challenges in the design of this new generation of computer interfaces based on conversational text and speech. In many ways, the scale of the departure from traditional graphical interfaces is so big that the rulebook of computer interface design will likely have to be rewritten. Even the skills needed in the design team are likely to change deeply, with an increasing emphasis on verbally talented professionals instead of visually oriented ones. Computer interface design is entering a time of transformation as exciting as the introduction of GUI-based interfaces in the 1990s.
The goal of this course is to present the state of the art in conversational interface design, discuss the challenges and opportunities, and explore some special design techniques in practice. Also, some actual cases of conversational systems designed and deployed by the authors of the course are examined and used to drive the discussion. The course focuses on text-based conversational systems, since their underlying technology is more ready for deployment than that of speech-based systems, which are still mostly restricted to very narrow domains. Nevertheless, most of the key design aspects are similar in speech and text interfaces, as are the design and evaluation processes of conversational interfaces. More importantly, both speech and text computerized interfaces seem to require the same level of radical departure from the methods professionals have been using to design computer interfaces since the 1990s. Every single precept of HCI may be challenged in this new context of conversing with computers, and we planned this course as a basic introduction to the new practical, theoretical, scientific, technological, and ethical challenges.
Figure 1. Ava in the "Ex_Machina" movie.

The course starts with a review of the earlier efforts in bringing conversation to computer interfaces, followed by a review of current systems and interface solutions deployed in practice or being investigated in research laboratories. We then suggest a
typology of conversational interfaces and their associated challenges to better organize
the discussion, and provide some basic insights into the underlying natural language understanding and generation technology. The main part of the course covers a range of
new or adapted design methods and ideas which have been used to address the
challenges of conversational interfaces, including examples of systems which deployed
new and interesting solutions to the challenges. Rather than being exhaustive, we concentrate on a small number of topics to allow a reasonable level of depth in the discussion. We
conclude by briefly discussing some cognitive and social challenges created by
conversational interfaces, including privacy and ethical issues.
2. The First Wave of Conversational Interfaces
In this section, we briefly describe some of the early conversational systems: STUDENT (1964); ELIZA (1964-1966); SHRDLU (1968-1970); Dr. Sbaitso (1992); A.L.I.C.E. (1995); and CLIPPY (1997). Alan Turing's classic paper "Computing Machinery and Intelligence", from 1950, contains one of the earliest discussions about machines that talk and think. In his view, if a machine could maintain a conversation (on a teletype) like a human being, it was reasonable to say that machines could think. Since then, the possibility of having a machine carry on conversations has sounded plausible.
The first well-known successful program to use natural language was STUDENT, written by Daniel Bobrow as his Ph.D. research in 1964. It was designed to solve the kind of word problems found in high school algebra books. It was created to understand a great deal of the input, rather than relying on pattern-matching techniques alone, and it was able to compute basic answers to questions, a remarkable achievement given the technology of its time (Norvig 1992).
Around the same time, ELIZA, a computer program created by Weizenbaum (1966), concentrated on keywords, using pattern matching techniques to process users' responses according to scripts. The most famous script of ELIZA was DOCTOR, which simulated a Rogerian psychotherapist having conversations with a patient. Rogerian psychotherapy is based on rephrasing the patient's statement as a question, and that was basically what DOCTOR did. People who used the system took its advice seriously, even though they knew that they were talking to a machine (Weizenbaum 1976). Some people contend that DOCTOR was the first computer program to pass the Turing test.
At the same research university, MIT, Terry Winograd worked on SHRDLU from 1968 to 1970, a computer system that understood basic English statements and executed commands given in English dialogues. People could give instructions to manipulate objects, and the system would ask for clarification whenever its heuristic programs could not understand a sentence through the use of context and physical knowledge (Winograd 1971).
A commercial instance of the early conversational interfaces was Dr. Sbaitso, released in 1992, an artificial intelligence speech synthesis program for MS-DOS-based personal computers. It was distributed with sound cards by Creative Labs with the aim of demonstrating the digitized voices the cards were able to produce. Dr. Sbaitso answered questions like a stereotypical psychologist and was based on pattern-matching techniques similar to ELIZA's. Users could type questions, and Dr. Sbaitso answered them not only in text but also in a mechanical voice produced by an internal synthesizer.
A.L.I.C.E. (Artificial Linguistic Internet Computer Entity), invented by Dr. Richard S. Wallace in 1995, is an award-winning chatbot that won the Loebner Prize (a formal Turing test contest) three times. Written in AIML (Artificial Intelligence Markup Language), A.L.I.C.E. was also inspired by ELIZA and applies pattern-matching algorithms to the human's input. It is open source software developed by more than 500 volunteers. According to Wallace (2001, 2009), it aspires to be an intelligent and self-aware robot, and most of the content in A.L.I.C.E.'s "brain" aims to keep users talking, maximizing dialogue length. See Figure 2.
Two years later, Clippit (1997), better known as Clippy, the Office Assistant, was included in Microsoft Office for Windows and Mac. Clippy was an animated paperclip character which popped up on the user's screen to give advice and offer help about content related to Microsoft Office. It was created by Kevan J. Atteberry, an illustrator. Users in general did not like Clippy's intrusiveness, and Microsoft removed it from Windows around 2008. The concept of Clippy, which had a previous instance called BOB, seems to have been the result of a misreading of the research results of Clifford Nass and Byron Reeves at Stanford University (Pratley 2004; Stanford University News Service 1995). Nass and Reeves's research shows that people react to computers with the same emotions with which they react to human beings. Inspired by this research, Microsoft decided to add human traits to the computer system, using Clippy as a way to anthropomorphize the computer. This did not work well, as people already react with emotions to systems even when those systems do not look like humans. The most likely reason for people hating Clippy was suggested by Nass in an interview (Chapman 2010): Clippy violated some key social rules, "he would interrupt you […] which was fine the first time and okay the second but the twentieth time, after you indicated you didn't need help […] he would still pop up again."
In 1987, Apple released the video concept of the Knowledge Navigator (Kownav 2008), a software agent which looked like a human, advised users of the day's appointments, helped a user review past letters and articles, and showed visual simulations of the Amazon area in Brazil, all while discussing issues with a colleague via video call and answering personal calls. The Knowledge Navigator would require text-to-speech capabilities and speech understanding beyond its time (possibly beyond current technology), but it nevertheless became a key reference for what the next generation of conversational systems needed to achieve.

Figure 2: Typical interaction with A.L.I.C.E. (1995)
In spite of their technological limitations, the early conversational interfaces were important because they showed the first signs that conversational systems involve fundamental differences in the way people respond and want to interact with them. They demonstrated the need to better understand the context of the user and to be able to answer open questions with valuable information, as well as how important it is for those systems not to violate social conventions and norms.
3. Current State of Conversational Interfaces
In this section we provide an overview of current conversational interfaces, highlighting commercially available systems and how they are changing the way people interact with machines. First, we discuss conversational systems which are part of, and an interface to, larger systems, such as SIRI, Cortana, Google Now, and Alexa. Second, we discuss conversational systems based on deep reading and understanding technologies, such as IBM's Watson. Lastly, we explore the emerging conversational systems built on top of instant messengers, such as Facebook's M, Telegram, WeChat, and Kik.
It is interesting that there is a considerable gap between the first, "heroic", early conversational systems of the 1960s-1980s and the recent commercially available systems mentioned in the previous paragraph. Although research continued vigorously in the area, several factors contributed to the few releases of conversational systems between the early generation and the present: among them, the spectacular failure of Clippy, the popularization of graphical interfaces, the widespread use of, and frustration with, IVR (Interactive Voice Response) systems, and, more generally, the demise of Artificial Intelligence as a whole.
Nevertheless, some important advances in fundamental technologies were happening. With the aim of having machines process a web of data, improving human-machine communication, Tim Berners-Lee coined the term Semantic Web, "in which information is given well-defined meaning, better enabling computers and people to work in cooperation" (Berners-Lee et al. 2001). Computer agents using Semantic Web protocols, according to the same authors, can assist the evolution of human knowledge, making meaningful analysis of any concept anyone can invent, linked to a universal Web. In addition to analyzing data and understanding concepts, semantic web agents would sense context and make well-suited plans.
The first personal conversation assistant launched in the USA was Apple's SIRI (McTear et al. 2016), which illustrates the characteristics of the semantic web agent introduced by Berners-Lee et al. (2001). SIRI was launched in 2011 and was built upon a system called CALO (Cognitive Assistant that Learns and Organizes), a DARPA project at SRI to help military commanders with both information overload and office chores (Bosker 2013). SIRI extracts information from Wikipedia, Yelp, Rotten Tomatoes, and Shazam. It is activated by voice, answers questions, sends messages, places calls, and makes dinner reservations, but, more importantly, it was widely released as part of the operating system of iPhones. In the blink of an eye, hundreds of millions of people were given a speech-based conversational system to interact with, and so they did, in a variety of ways (Luger and Sellen 2016).
Google Now, released in 2012, is a voice-enabled personal assistant developed by Google. Google Now uses a natural language user interface (speech and text) to answer questions, make recommendations, and perform actions by requesting information from web services. Its newer release, called Google Assistant, is available not only on mobile devices and the Chrome browser but also on Google Home, a smart voice-enabled wireless speaker, and on Android Wear watches. Google Assistant uses Google's natural language processing algorithms and proactively gives recommendations to users based on their search habits, common locations, and repeated calendar appointments. The current interface is structured around information cards that show the most relevant information for users (e.g. updates, weather forecasts, and recommendations). Third-party apps available on the user's phone may also display recommendations on cards. The system is based on Google's Knowledge Graph, which aims to understand real-world entities and their relationships to one another (Google 2016).
Cortana is a personal assistant for Microsoft's Windows-based phones. Launched in 2014 for Windows Phone 8.1, Cortana was inspired by the character "Cortana" in the Halo videogame series. Cortana has a Notebook which stores personal information pre-approved by users. It also learns from the user's phone usage, location, and communication. It is activated by voice and text input. Cortana also tries to understand the basic context of a conversation. For example, the user is able to say "call it" or "give me directions" after asking for the best restaurant in an area, and Cortana responds with directions to the restaurant found in the previous part of the query (Warren 2014). Cortana relies on Bing's backend services and is backed by thousands of servers.
Amazon's ECHO is a voice-enabled wireless speaker released in 2014. The speaker is a screen-less cylinder, just over 9 inches tall and 3.25 inches in diameter. Amazon's ECHO was an internal project of the "secret" Lab126, also known for the development of the Kindle e-reader, and it was formerly called Amazon Flash (Brustein 2016). ECHO has seven microphones which use beam-forming technology and noise cancellation. Users can ask questions even when music is playing in the background, thanks to far-field voice recognition. ECHO is activated by a wake word, "Alexa", which is the name of the voice service installed in ECHO. Alexa adapts to the user's speech patterns, vocabulary, and personal preferences. It is connected to third-party services such as Yelp, Uber, Google Calendar, and Audible. Alexa also connects to smart home devices and follows users' commands, for instance to switch on lamps.
In 2011, IBM demonstrated the Watson system in the TV quiz show Jeopardy! and beat the two reigning champions in the most complex natural language question-and-answer competition in the world (Kelly and Hamm 2013). Watson was the result of more than 10 years of intense research and development at the IBM Research division, starting with the development of the UIMA language processing technologies (Ferrucci and Lally 2004). Watson analyzes unstructured data, understands complex questions, and presents answers and solutions. Watson has evolved into a technology platform which uses natural language processing and machine learning to reveal insights from Big Data. Unlike the first group of conversational assistants described before, Watson's capabilities are available to developers as API services hosted in the WDC (Watson Developer Cloud). There are WDC APIs for supporting and developing conversational interfaces in areas such as language (e.g. the Conversation, Dialog, and Natural Language Classifier APIs) and speech (Speech to Text, Text to Speech), as well as other modalities such as vision (Visual Recognition) and data insights (AlchemyData News, Tradeoff Analytics). Conversation systems built on Watson technology are being applied mostly in professional contexts, such as helping military members transition to civilian life by asking questions to the Watson system; helping doctors diagnose diseases and find the best treatment options; training retail sales employees with a voice or text input interface; and helping veterinarians ask questions to Watson as they would ask another colleague (Olavsrud 2014).
A third group of conversational systems is increasingly becoming available, characterized by direct integration with smartphone messenger systems. Facebook's M is a virtual assistant launched in 2015, initially for a few of San Francisco's Facebook users. M started as a hybrid service, powered both by artificial intelligence and by Facebook employees called M trainers; the human employees help answer the questions which the system cannot (Hampel 2015). In April 2016, Facebook launched the Messenger platform service, which allows developers to create chatbots which can interact with Facebook users. Facebook's chatbots interact with users through a combination of natural language questions and point-and-click interfaces. The chatbots are proactive, e.g. sending news of interest to the user, or reactive to the user's intents. Some examples are chatbots which purchase products (for instance, for the popular 1-800-Flowers); answer fashion-related questions (Spring); access broadcast news based on the user's choices (Wall Street Journal and CNN); and connect doctors to users' networks in about 141 specialties to answer health questions and provide articles (HealthTap).
Telegram bots were launched in 2015 by the Telegram instant messaging service, based in Germany. It has been an open-source project, available as API services in the Telegram Bot Platform. Telegram bots started being created by messaging users to connect with other users, as dating services do. Users were also connecting to bots to learn about movies or bus timetables, according to Pavel Durov, Telegram's founder (Olson 2016). Considering this scenario, they decided to build a platform optimized for bot developers. Telegram bots interact through IRC-style slash commands and support text and emojis. Examples of Telegram chatbots are @triviabot, for answering trivia questions; @forbesbot, which pings users with new stories or runs searches; and @my_ali_bot, which lets people browse products on the Chinese e-commerce site MyAliExpress.
China seems to be the place where chatbot adoption and use is most advanced today. China's popular WeChat messaging platform can take payments, scan QR codes, and easily integrate chatbot systems. WeChat seamlessly integrates e-mail, chat, video calls, and the easy sharing of large multimedia files. Users can seamlessly book flights or hotels using a mixed, multimedia interaction with active bots. WeChat was first released in 2011 by Tencent, a Chinese online-gaming and social-media firm, and today more than 700 million people use it, making it one of the most popular messaging apps in the world (Economist 2016). WeChat has a mixture of live customer service agents and automated replies (Olson 2016). See Figure 3.
Kik is an instant messenger with 300 million users, created by a group of former University of Waterloo students. Since 2014, Kik Interactive has been working on a conversational bot platform. Instead of considering bots as assistants, Ted Livingston, Kik's founder, believes that "[chat] is the new distribution mechanism for all things digital — the portal through which we will access businesses, news, entertainment, games, and personal services, as well as one another. Interaction will be powered by chat, like the operating system for society", as paraphrased by Rosenberg (2016). Kik has a Bot Shop where users can communicate with new bots via a chat app. Bots can be accessed by scanning QR codes available at the Kik Bot Shop. There, users can find bots for entertainment (Katy Perry's Mad Love, to find a song that represents you); for lifestyle (H&M and Sephora, to have one's own personal shopper); and for games (Movie Line Game, Where Am I?, where users try to guess information by having a conversation).
This is only a small sample of the frenetic activity in the area since the early 2010s. Several other companies in the bot industry deserve to be mentioned in this section, such as Slack, Twilio, Operator, TaskRabbit, and Magic.
Figure 3: WeChat Jobot, a dating assistant.
4. Types of Conversational Interfaces
We consider a conversational interface to be any artificially intelligent system which interacts with humans through dialogue. Intelligent machines or systems are known by various names, such as virtual personal assistants, intelligent assistants, and cognitive advisers. In this course, for organizational purposes, we define four types of conversational interfaces, as described in the next paragraphs. There are two primary types of conversational interfaces: the ones where the user input is through audio and the ones where it is via text. Those two types may be used in a stand-alone dialogue or embodied in a human-like character or robot. That defines four categories:
1. Speech-based dialogues – We adopt the term voice assistants (VA) for interfaces based on speech where there is no human embodiment of the system. These are interfaces where the main input is the user's voice and the main output is audio. These interfaces need a speech recognition component to recognize the user's voice and translate it for the system. VAs help users accomplish tasks on their smart devices, for example setting an alarm, and engage in general or specific conversation, e.g. "Here's the weather today" or "Ok, setting your alarm for 7 a.m." Typical examples of such systems are Apple's SIRI (Figure 4), Amazon's Alexa (on the Amazon Echo), Google Now, Microsoft's Cortana, and systems based on IBM's Watson speech APIs.
2. Text-based dialogues – We use the term chatbot to refer to conversational agents that interact with users using natural language in text format. These can have the same functions as the intelligent voice assistants mentioned before, but are activated only by text. Dialogue-based interfaces might include point-and-click interaction, images, and video, and not only written text. Nowadays, there are chatbots such as TechCrunch (a Facebook Messenger bot), specialized in tech news (see Figure 5); KAI, a platform handling questions asked by customers about their banking experiences; and generalists such as Cleverbot, aimed at answering questions on different topics. Some chatbots are used in retailing, such as the H&M bot available at the Kik Bot Shop, which has its own visual language and point-and-click elements (Figure 6).

Figure 4: Speech-based SIRI answers: How's the weather today?
3. Interactive Virtual Agents – Interactive Virtual Agents (IVA), also known as embodied conversational agents, are those that have properties similar to those of humans in face-to-face conversation (Cassell 1999). IVAs are animated virtual bodies that look physically like humans and mimic basic human behaviors. Interactive virtual agents are present in games such as The Sims and Façade (Mateas 1997), and in customer service, helping people solve issues, such as Anna, IKEA's virtual assistant.

Figure 5: TechCrunch bot
Figure 6: H&M bot
Figure 7: Façade (Mateas 1997)
4. Conversational Robots – We consider here conversational machines which are embodied in actual physical bodies and/or have human-like behaviors. The level of human likeness varies. According to Mori (2012), people are averse to a high degree of human similarity, and Strait (2015) did a study which supports this hypothesis. At the moment, robots have mixed features of humans and machines. Conversational robots are present in various contexts: hospitality (Connie, the Hilton robot concierge), entertainment (Sphero, BB-8), supertoys (Lego Mindstorms, Furby, CogniToys), education and special needs education (Rapiro, NAO), and elderly care (Paro). See Figure 8 and Figure 9.
These four categories do not aim to cover the whole range of conversational interface types available today, but they provide a basic framework to explore the most common contexts. In this course we focus mostly on the design of voice assistants and chatbots, but most of what is discussed is also applicable to interactive virtual agents and conversational robots. A key difference is that in the latter cases it is possible and desirable to employ body and face expressions (Cassell 1999).
Figure 8: Robots with low and moderate human likeness, and a human (Strait 2015)
Figure 9: The ASK NAO initiative, developed for communication in special needs education (Pyros 2016)
5. Technology Context
How are conversational systems built? What are the most common technologies used to
build them? Although a comprehensive review and description of the technology
employed by conversational interfaces is beyond the scope of this course, a basic
understanding of how the interface is constructed in a computer is necessary to guide
the collaboration of designers and developers of chatbots and other dialog-based
systems.
In many ways, the technology behind most of today's commercial deployments is not too different from the one used by the early chatbots described in Section 2. But the first key dimension differentiating the capabilities of different technologies and platforms is whether the dialogue is driven by the user, known as user-initiative, by the computer, known as system-initiative, or by both, known as mixed-initiative (Zue and Glass 2000). It is therefore essential that the needs of the application and the interface, in terms of initiative, match the capabilities of the platform.
But independently of whether the user or the system has the initiative, most conversational systems today are built using an intent-action approach. Basically, the system is created by defining a basic set of user and system utterances and how they should match each other. In user-initiative systems (for example, typical Q&A systems), groups of questions from the user are mapped into a single answer from the system, sometimes with some variations. The term intent is used to describe the goal of the group of questions, so the basic task of the conversational platform is to identify the intent of a given question written or spoken by the user, and then output its associated answer or action.
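To make the structure concrete, here is a minimal sketch in Python of the intent-action approach for a user-initiative Q&A system. The intent names, example questions, and answers are invented for illustration, and the matching function is deliberately left abstract, since it can be implemented with any of the technologies discussed below.

# A minimal sketch of the intent-action approach for a user-initiative
# Q&A system. Intent names, examples, and answers are hypothetical.
INTENTS = {
    "store_hours": {
        "examples": ["what time do you open?", "when does the store close?"],
        "answer": "We are open from 9 a.m. to 6 p.m., Monday to Saturday.",
    },
    "return_policy": {
        "examples": ["can i return a product?", "how do returns work?"],
        "answer": "Products can be returned within 30 days with a receipt.",
    },
}

def respond(utterance, match_intent):
    # match_intent encapsulates whatever matching technology is used
    # (templates, an ML classifier, etc., as discussed below).
    intent = match_intent(utterance, INTENTS)
    if intent is None:
        return "Sorry, I did not understand. Can you rephrase?"
    return INTENTS[intent]["answer"]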
In system-initiative systems, the designers and developers of the conversational system have to provide sets of typical user answers for each question the system is going to ask. Based on the intent of the user's answer, an action is produced, often with the help of basic natural language parsing technology which helps the system extract needed information such as numbers, choices, etc. Notice that in both cases, as well as in mixed-initiative systems, the users' utterances always have to go through a process of matching against the set of available intents for that context. Intent matching is often the most important source of problems in the development of conversational systems, due to the complexity and difficulty of analyzing natural language.
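As an illustration of the parsing step in a system-initiative dialogue, the sketch below extracts a needed piece of information (here, a hypothetical party size for a restaurant reservation) from a user's answer; the patterns and the slot are our own invention.

import re

# Hedged sketch: extracting a number (the size of a dinner party) from
# the user's answer to a system question. Patterns are illustrative only.
WORDS_TO_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def extract_party_size(answer):
    match = re.search(r"\b(\d+)\b", answer)
    if match:
        return int(match.group(1))
    for word, number in WORDS_TO_NUMBERS.items():
        if re.search(r"\b" + word + r"\b", answer.lower()):
            return number
    return None  # no slot found: the system should ask a clarification

print(extract_party_size("A table for four, please"))  # -> 4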
There are many different technologies and platforms that can be used for intent matching. A common approach is to use template-based systems, in which the intent is determined by the presence of manually defined terms or groups of words in the user utterance. When the template matches the utterance, the associated intent is identified. Template-based systems, although often the simplest way to start developing a conversational system, suffer from two key problems. First, it is hard to capture in simple templates the many nuances of human language. Second, as the number of templates increases, typically beyond one hundred or more, it becomes very difficult to track the source of errors and to debug the system successfully.
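A minimal template-based matcher might look like the following sketch (the templates and intent names are ours, not from any particular platform); it can be plugged into the respond function above as the match_intent argument.

# A toy template-based intent matcher: an intent is identified by the
# presence of manually defined groups of words in the user utterance.
TEMPLATES = [
    ("store_hours", [{"open"}, {"close"}, {"hours"}]),
    ("return_policy", [{"return"}, {"refund"}, {"money", "back"}]),
]

def match_intent_templates(utterance, intents=None):
    tokens = set(utterance.lower().replace("?", "").split())
    for intent, groups in TEMPLATES:
        # a template matches if all the words of any one group appear
        if any(group <= tokens for group in groups):
            return intent
    return None

print(match_intent_templates("Can I get my money back?"))  # -> return_policy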
An alternative approach which is gaining increasing popularity is to use machine learning-based intent recognizers. In this approach, the developers of the conversational system provide a set of examples in natural language for each intent and use machine learning (ML) techniques to train an automatic classifier that is used at run time. Different types of classifiers can be used, such as Bayesian networks, support vector machines (SVM), and the currently very popular deep neural networks (DNN). The main differences, advantages, and properties of those technologies are beyond the scope of this course. Suffice it to say that often the key element of success when using machine learning-based intent recognizers is the quality and comprehensiveness of the data set provided to the ML classifier. Designers should not underestimate the importance of the often time-consuming task of collecting and organizing the training dataset.
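The sketch below illustrates this pipeline using scikit-learn (assumed available), training a linear SVM over TF-IDF features on a toy set of intent examples; a real system would need far more, and far more varied, training data.

# Hedged sketch of a machine learning-based intent recognizer.
# The training examples are toy data invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

examples = [
    ("what time do you open", "store_hours"),
    ("when do you close on sundays", "store_hours"),
    ("are you open on holidays", "store_hours"),
    ("can i return this product", "return_policy"),
    ("how do i get a refund", "return_policy"),
    ("i want my money back", "return_policy"),
]
texts, labels = zip(*examples)

# TF-IDF features fed into a linear SVM, one common choice among the
# classifier families mentioned above.
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
classifier.fit(texts, labels)

print(classifier.predict(["do you open tomorrow morning"]))  # expected: ['store_hours']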
An alternative method of creating conversational systems, still at the research stage, bypasses the definition of the intent-action sets and uses a large corpus of real dialogues to learn from scratch how the conversation system should behave, as in the works of Sordoni et al. (2015), Serban et al. (2016), and Weston (2016). Notice that some of those works use corpora with hundreds of millions of conversation samples, or, conversely, very narrow domains, so with current technology this approach can only be applied to the special cases which meet those constraints.
In practice, most conversational system developers employ either the framework provided by the messaging platform or APIs available from AI service vendors. Examples of the former case are Facebook's wit.ai and the WeChat developer platform. Useful API platforms for building dialogue systems are IBM's WDC and Microsoft's LUIS, as well as an increasing number of systems provided by startups. Similarly, there are many open source projects providing different types of software for the creation of conversational systems. Given the volatility of this ecosystem, we refrain from pointing to specific projects, but developers should look for advice and suggestions from reliable online sources.
6. Designing the Experience of Conversational Interfaces
The experience of conversing with human and non-human beings is complex. Humans are not aware of machine capabilities, and machine interfaces do not always show humans what they know. Such hidden interactions fail to set knowledge limits and user expectations. Influenced by sci-fi movies, humans expect a machine to behave like a human being. Clifford Nass and Youngme Moon (2000) conducted a set of experiments over 10 years showing that people behave with computers as they would behave with real people. They identified that people rely on social categories when interacting with computers. Perception experiments in which people listened to the same information narrated by male and female voice-tutors representing a machine resulted in diverse participant interpretations (e.g. the male voice was interpreted as more competent and friendlier). In another experiment, they identified that people are polite to computers. People are more polite when asked to directly evaluate another person in a face-to-face setting (e.g. "How do you like my new haircut?") because they do not want to hurt that person's feelings, and they behave the same way even when they know it is not a person but a computer.
Novielli et al. (2010) evaluated people's perception of interactive virtual agents (IVAs), examining the relationship between the input mode (written vs. speech-based) and the users' social attitude towards the agent. Speech-based interaction warms up the user's attitude towards the agent. This effect was more evident for users with a background in the humanities; computer scientists tended to be colder and more formal with the agents and to test the limits of the artificial intelligence with tricky questions. On the other hand, the evaluation of the system depends mainly on the agent's ability to answer the user's requests appropriately and on the IVA's persuasive strength. Li et al. (2016) conducted an experiment which adds strength to the hypothesis that people expect robots to behave like humans. The experiment consisted of instructions spoken by a robot, followed by 26 trials. Each trial had three parts: 1. The robot asks the participant to touch it. 2. The participant touches the robot's body part. 3. The robot teaches the participant the medical term for the body part. An affective Q-Sensor on the participant's finger captured emotional arousal during the experiment. As a result, participants were more emotionally aroused when the robot asked to be touched in areas that people usually do not touch, like the eyes and buttocks (Science Daily 2016).
This seems to point to a general trait: users expect their experience with machines to happen like it would with real people. How can we address this complexity of communication, with all the characteristics people expect from interactions with humans, such as greetings, understanding, and politeness? And how can we address this complexity in a dialogue where visual clues are not the main channel of communication and are not usually explored in text visual styles?
Designing affordances for human-computer communication was already a concern when designing traditional systems with buttons and navigation flows. As discussed by Norman (1988):
...the term affordance refers to the perceived and actual properties of the thing,
primarily those fundamental properties that determine just how the thing could
possibly be used. [...] Affordances provide strong clues to the operations of
things. Plates are for pushing. Knobs are for turning. Slots are for inserting
things into. Balls are for throwing or bouncing. When affordances are taken
advantage of, the user knows what to do just by looking: no picture, label, or
instruction needed. (Norman 1988, p.9)
As presented before, a new category of affordances is emerging, in which user interactions do not rely only on visual clues. But what makes a conversation between machine and human successful? How do users know what the computer already knows about them? And how can systems avoid users' perceptions of suspicious behavior? The user experience of conversational interfaces is based mainly on verbal communication (text and audio) and does not rely only on visual communication. Sometimes visual clues are not visible, as in speech systems, or do not vary, as in text-based conversations. The question is how to make communication clues easier for people, so that they understand what those systems know and are able to talk about.
In this section, we present the main challenges in designing the user experience of interacting with conversational systems. How are the user experience and the design process changed when interfaces are based on speech and/or text and do not rely exclusively on graphical elements and buttons? There are plenty of questions and new issues here, but we focus on four issues which we deem key to the design of conversational interfaces: (1) the persona portrayed in the interface; (2) the text used in the utterances of the interface; (3) the conversation flow and the modalities involved; and (4) trust and algorithm transparency.
6.1 Designing the Persona of a Conversational Interface
There is a lot of evidence that users tend to assign human traits such as gender, personality, and emotions to all kinds of systems, including cars, television sets, and traditional computer systems (Reeves and Nass 1996). However, designers should be aware that when it comes to conversational systems, it is almost unavoidable that users will perceive all kinds of typical human traits in the system. Conversation is an essential part of the social experience, and the human brain has evolved to process speech and to devote considerable resources to figuring out what kind of person is producing it, as amply discussed by Nass and Brave (2005). Moreover, as exemplified by the many experiments compiled by Reeves, Nass, and Brave in both books, there is a strong interdependence between the human traits perceived by the user and both the overall user experience and the ability of users to effectively achieve their goals when using a conversational interface. Conversation is social interaction, even between people and computers, and designers should expect that users will assign, and respond to, human traits in conversational systems.
The main consequence of this strong tendency of users to see people behind conversational systems is that interface designers have no alternative but to consider and include human traits as part of their list of design tasks. Ignoring this issue can lead to people perceiving conversational systems with "personas" which do not match the needs of the task and the user, with dire consequences, as discussed later.
A comprehensive review and discussion of the influence and design of the persona of conversational systems, i.e., the collection of human traits expressed or controlled by its interface, is beyond the scope of this course. Instead we focus on some key issues; a more comprehensive discussion can be found in (Nass and Brave 2005).
Similarity attraction
Decades of studies in social interaction have shown that people are amazingly good at determining whether other people are like them, as discussed, for instance, in the classical works by Tajfel (1981). Even in situations where very limited data is available, people are able to identify whether or not their interlocutors are similar to them in terms of gender, race, ethnicity, social class, personality, and interests. At the same time, people have a remarkable preference towards interacting with, and evaluating positively, people who are similar to them (Tajfel 1974). As shown by Nass and his team (Nass and Brave 2005, Reeves and Nass 1996), such behaviors are triggered by conversation even when people's interlocutor is a computer, and despite people's awareness that they are interacting with a computer instead of a person. Human beings just cannot avoid ascribing human traits to anything that talks or writes like a human being.
The key consequence of similarity attraction is that people will tend to like and trust more those conversational interfaces which are like them. In spite of some important exceptions discussed later, users prefer to interact with conversational systems which they recognize as similar to themselves: male users tend to prefer "masculine" conversational systems, while women are more likely to prefer "feminine" systems; extroverted people prefer "extroverted" systems; and similarly for race, ethnicity, emotions, and education levels (Nass and Brave 2005).
The similarity attraction principle brings to the forefront of the design of conversational interfaces three issues which have been only lightly dealt with in traditional computer interfaces: (1) personification, or how to develop a computer system able to express and manage different personas; (2) user modeling, or how to make computers understand people and recognize their key human characteristics; and (3) personalization, or how to make a computer system choose the right persona according to the user and task. Although the latter two issues have seen attention from the HCI community in the last decades, the similarity attraction principle creates needs of user modeling and personalization far above the requirements of traditional computer interfaces. Conversation involves "reading" the interlocutor and adapting to him or her while conveying a coherent personality. This is new to interface design and poses significantly new challenges. However, as mentioned before, the similarity attraction principle is not a definitive rule for the design of conversational interfaces, and some of its notable exceptions are discussed next. Nevertheless, it is an important guideline to be kept in mind during all phases of the design and evaluation process.
Persona consistency
People who manifest inconsistent personalities and traits are often perceived by their interlocutors as incapable, unpredictable, or liars. The human brain has evolved to manage complex social interaction, and creating a coherent mental model of the surrounding human beings is extremely helpful in social tasks. The same is true for computer systems and, in particular, for conversational systems. Nass and Brave (2005) describe a series of experiments where users interacting with conversational systems with inconsistent personas not only tend to dislike those systems more (in comparison to coherent personas), but the inconsistency also severely impacts the users' accomplishment of the task. Almost all evidence supports that a conversational system, either speech- or text-based, is better off when it portrays a "persona" which is consistent through time, which matches content and form, and which is coherent with the task and the user.
For instance, in the case of interactive virtual agents (IVAs) it seems to be very important to match the complexion of the portrayed agent with the race and ethnicity of the voice or text produced. In an experiment described by Nass and Najmi (2002), subjects listened to descriptions of products recorded by Caucasian Americans and first-generation Koreans, which were cross-matched with faces of Koreans and Caucasian Australians. When subjects heard descriptions with Korean accents matched to Caucasian faces, or vice-versa, they reacted negatively, not only disliking the voices but also rating the described products less favorably.
Gender and race
If we consider how much gender identity is a key human trait, the design of the persona of a conversational system should start with an unusual question in interface design: should the computer agent be perceived as male or female? In the case of speech-based systems, the question is unavoidable, since it is very hard to synthesize a genderless voice. But since men and women tend to write in different ways (Newman et al. 2008), the gender issue is also important for the design of text-based systems. Interestingly, even computers have no difficulty identifying gender in text-based systems (Peersman et al. 2011).
How much does defining the gender of a conversational system matter? In an experiment where people had to take advice from a speech-based system about a dilemma, Lee et al. (2000) showed that male users not only like interacting with "male" computers more but also trust them more (and vice-versa). So determining the gender of the user and matching him or her with the corresponding computer "gender" can have significant impacts on the success of a conversational system.
However, social and cultural dimensions are also at play here. An often-cited case is BMW's recall of a car with a female-voiced navigation system in Germany: basically, German male drivers did not want to take orders from women. The recall was not necessary in the United States, where female co-navigation seemed to be more generally accepted. More importantly, there is significant evidence that the preferred gender of an adviser is heavily dependent on the task at hand. People tend to stereotype typical areas of expertise by gender (Lee 2003), for instance preferring to take advice about cars from men and about beauty products from women.
People are also very good at determining their interlocutor's race and place of origin in conversations (Giles and Scherer 1979), remarkably through accents and use of slang, so designers of conversational systems should be careful about how the text and speech they produce sound to users. In particular, they should check whether text produced by different writers is inconsistent, which, as observed before, creates inconsistent personas that tend to be viewed negatively by users.
Personality
Although it is not easy to define personality, and even harder to measure it effectively, human beings quickly tend to determine whether other people are extroverted or introverted, judging or intuiting, kind or unkind, etc. (Fisk and Taylor 1991). For instance, extroverted people tend to use strong words in their language, such as "definitely" and "absolutely", while introverts prefer terms such as "maybe" and "perhaps" (Nass and Lee 2001). Nass et al. (1995) reported that this is also true when people are interacting with computer systems, notably conversational ones.
Personality is an area where similarity attraction is particularly strong (Nass and Lee 2001). Extroverted people like more and react better to extroverted interfaces, introverts prefer "shier" conversational systems, and so on. The main design implication is a strong need for personalization of the system, where different text, more extroverted or introverted, is produced according to the user's personality. However, determining a user's personality is far from easy, normally requiring extensive questionnaires. Although it has been shown that it is possible to extract personality from large social media feeds (Badenes et al. 2014), assigning personality traits as easily and quickly as humans do is beyond today's computers' abilities. However, if determining the user's personality is not possible, the experience of traditional media is that extroverted voices and text are more easily likable — think Bugs Bunny.
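As a toy illustration of this kind of personalization, the sketch below renders the same recommendation with stronger or more tentative wording depending on the (assumed already known) personality of the user; the word choices follow the extrovert/introvert cues cited above, and everything else is our own invention.

# Toy sketch of personality-matched text generation. The phrasing
# table and the recommendation are hypothetical.
PHRASINGS = {
    "extroverted": {"opening": "Definitely", "strength": "a great"},
    "introverted": {"opening": "Perhaps", "strength": "a reasonable"},
}

def render_answer(recommendation, personality):
    # fall back to the more tentative style if personality is unknown
    style = PHRASINGS.get(personality, PHRASINGS["introverted"])
    return "%s, %s is %s choice for you." % (
        style["opening"], recommendation, style["strength"])

print(render_answer("the 7 a.m. flight", "extroverted"))
# -> Definitely, the 7 a.m. flight is a great choice for you.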
Pinhanez (2014) describes one of the few efforts towards a design methodology for determining the most likable personality for a computer system. Contextualized as a service design problem, a method is presented and discussed based on identifying critical conflict situations and exploring which MBTI personality (Myers 1998) is most applicable to each situation.
Emotions and humor
Emotions play a fundamental role both in the control of a person's mind and body and in human relations. Human beings not only have special internal mental states, which we often label as their emotions, but also display them almost uncontrollably. Because of the social nature of conversational interfaces, the user's emotions can easily conflict with characteristics, traits, and behaviors of the conversational system. Therefore, recognizing the emotions displayed by the user (assuming they are a reasonable indication of the internal emotions) and reacting appropriately can be very important for a conversational system. But the converse is also true: conversational systems may, and often should, employ expressions of emotion as a way to be more effective.
Brave and Nass (2003) provide a good review of the basic theories of emotion and some of their uses in human-computer interaction. A key finding is that emotions play a role in users' interface preferences and, more importantly, in their ability to achieve certain tasks: often, people in a good mood are able to solve complex tasks better; at the same time, in some situations people in a good mood are more likely to fall prey to stereotyping other people.
An area which has received a considerable amount of research is the interaction of voice-based systems with car drivers. In general, drivers in a good mood drive better than grumpy ones (Groeger 2000). However, the similarity attraction principle is also fundamentally at play in car assistants, as demonstrated by Nass et al. (2005). In an experiment using driving simulators, the researchers found that if there is a mismatch between the mood of the driver (induced artificially before the experiment) and the emotions expressed by the voice of the car assistant, not only does the likability of the car assistant system decrease, but the number of accidents doubles. This result not only indicates the importance of emotion recognition and its use in critical conversational systems, but also signals that designers of conversational systems should be careful when employing emotion-loaded utterances. Designers interested in technologies which could enable their conversational system to determine the user's emotions should refer to the well-established field of affective computing (Picard 1997).
People have shown no restraint in recognizing and assigning emotions to computers (Reeves and Nass 1996). It is relatively simple to create text, as well as synthetic voice, with expressive emotional traits. Assuming that the conversational system is able to produce recognizable emotional expressions, the designer's question is when they should be used. The first guideline is to use the similarity attraction principle if there is data available about the user's emotions and mood. As exemplified by the car driving case described before, users in a good mood are receptive to happy voices, and angry people should get a neutral voice. But there is also evidence that consistency between emotional expression and the nature of the content is important. Bad news should not be uttered by happy voices. In an experiment conducted by Brave and Nass (2003), pairing happy stories with sad voices decreased the perceived happiness of the story, compared to when it was read by a happy voice. The converse was also true. Designers should be aware that employing emotion in conversational systems can lead to great gains but also to spectacular failures.
Finally, should humor be employed in conversational interfaces? Surprisingly, there is good evidence that the use of humor by computer interfaces is mostly positive (Picard 1997). But designers should be aware of many constraints, starting with the type of humor, which should be innocent, that is, light and not provocative. Self-deprecating, hostile, sarcastic, intellectual, and even word-play humor should not be used in general. Also, jokes get old the moment they are uttered, so controlling the user's exposure to the same joke is essential (keep in mind the failures of Microsoft's Clippy). And, of course, humor should be matched carefully with the type of content and with the mood of the users: no one wants to hear a joke when calling 911.
Humanness and the “I” dilemma
Should a conversational system refer to itself using the term "I" or use the passive voice? In many cultures and languages, using the first person strongly signals humanhood. Many of the most commonly deployed chatbots are not shy about claiming this humanity through liberal use of first-person terms, but there is some evidence that in many cases users do not feel comfortable with this. In an experiment by Huang et al. (2001), involving a speech-based system in the context of an auction website, it was found that users accept the use of the first person more readily from recorded voices than from synthetic ones.
However, given the relative lack of further research in this area, especially considering text-based conversational systems, designers are advised to test the different options with users to determine how comfortable they are with first-person references, and what the effects of using them are.
6.2 Designing the Text
The main output of a conversational interface is words and sentences, either in text or in voice form. As addressed in more detail later, it is also possible to include images, pictures, drawings, music, and special effects in conversational interfaces, notably in chatbots. However, by their very nature, such interfaces are centered on text, begging the question of how that text is going to be designed and produced.
Although GUI-based systems also employ text in menus and messages, the design of those systems is centered on the visual design and the flow of windows and messages (Pavlus 2016). As discussed later, conversational systems pose new opportunities and challenges in terms of the user experience flow and its design. But a key difference that designers should be aware of is the centrality of text and verbal expression in conversation, to the detriment of imagery and visual language. This is a fundamental change in terms of the skills required for the design of the system: unlike in GUI interfaces, visual design is not the central piece of the design process. Interaction designers with acute verbal and textual skills are essential to the success of the process.
For example, assume that the design process has determined a female, extroverted, humorous persona to be the most desirable for a given text-based conversational system. This guideline requires the creation of text which not only has those qualities but is also consistent with itself (it has to uphold the consistency principle). Being able to write text with such qualities, on any given subject, is closer to the skills of journalists and screenplay writers than to those of typical designers. Ideally, design teams of conversational systems should include talent with those writing skills.
Pinhanez (2014, 2015) discusses a design methodology, based on traditional theatre techniques, to expose and explore the design of web interfaces for conflict resolution. The methodology employs conflict battles, improvisational pantomimes in which users and designers repeatedly enact typical conflicts between users and attendants in call centers (see Figure 10). A variety of theatrical techniques are used to foster creativity, such as asking the participants to interact without words, to refrain from using gestures, or to converse blindfolded. The result of the pantomimes is then captured in a comics storyboard, a comics story or photonovel which describes and captures in detail the conflict pantomimes. For each pantomime, the comics storyboard shows the visual elements of the interaction, the emotional reactions of the user, and balloons with the thoughts and strategies of the customer and the employees (see Figure 10).

Figure 10. Scene from a conflict pantomime and its corresponding comics storyboard. Source: (Pinhanez 2016)
An alternative which is increasingly becoming an option is to use computational systems which transform written text and speech directly into different persona traits. For instance, by manipulating pitch and speed, it is possible to make a male recorded voice sound like a female one (Reeves and Nass 1996). Similarly, word distribution and use vary considerably between male- and female-produced texts (Argamon et al. 2003): males use more noun specifiers while females use more pronouns. Using techniques similar to those used in the automatic detection of genre in text (Kessler et al. 1997), it is possible to apply pattern matching algorithms to find and modify text so as to transform its gender. As the popularity of chatbots increases and the need for automatic persona generation and transformation rises, we can expect many of those technical difficulties to be addressed by research and deployed as open source libraries and API services. However, the need for well-written text is not going to go away, and we expect a growing number of screenwriters to join conversational interface design teams.
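As a very simplified illustration of such pattern-based transformation, the sketch below rewrites text toward a more pronoun-heavy style, following the word-distribution observation of Argamon et al. (2003); the two substitution rules are our own toy invention, and a usable system would need far richer linguistic knowledge.

import re

# Toy sketch of pattern matching to shift a text's style, inspired by
# the observation that noun specifiers vs. pronouns differ between
# male- and female-produced texts. The rules are hypothetical.
RULES = [
    # replace a specified noun phrase with a pronoun
    (r"\b[Tt]he (?:device|assistant|system)\b", "it"),
    # replace a possessive noun specifier with a possessive pronoun
    (r"\b[Tt]he user's\b", "your"),
]

def toward_pronoun_style(text):
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

print(toward_pronoun_style("The assistant saved the user's settings."))
# -> it saved your settings.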
6.3 Designing the Conversation Flow
Traditional user interfaces have menus, simple text, icons, and links which define an intrinsic, planned navigation flow created by designers. Information architecture is "a creation of systemic, structural, and orderly principles to make something work — the thoughtful making of either artifact, or idea, or policy that informs because it is clear" (Wurman 1997). However, structured and orderly principles often vary in practice according to context and are not always linear or predictable. Organizing information in graphical user interfaces required designers to plan ahead which paths users would take, in order to provide the best user experience. Several schemes and structures are available to organize information (Rosenfeld and Morville 1998). Sequence diagrams, hierarchies, and networks were the basic structures for designing all kinds of interactive experiences, from games to websites. Nowadays, the network structure is the most common, although many researchers have argued that it is not the best way to organize information:
Although the goal of this organization is to exploit the Web's power of linkage
and association to the fullest, web like structures can just as easily propagate
confusion. Ironically, associative organizational schemes are often the most
impractical structure for Web sites because they are so hard for the user to
understand and predict. Webs work best for small sites dominated by lists of
links and for sites aimed at highly educated or experienced users looking for
further education or enrichment and not for a basic understanding of a topic.
(Lynch, 2008).
With the increasing amount of information nowadays, dense structures are part of our
everyday life. Methods such as card sorting (Nawaz 2012) help to organize and
evaluate the information architecture of an interface. When designing websites or apps,
scenarios and storyboards (Carroll, 2000; Llitjós 2013) based on journey maps
(Stickdorn & Schneider, 2011) and blueprints (Polaine et al., 2013) assist in predicting
the user experience.
However, in a conversational interface the conversation flow is not linear; it may take
different courses according to the circumstances that influence the dialogue. One of the most
interesting features of human conversation is this ability to explore sidetracks and easily
go back to the main conversation objective. For instance, while making a
decision such as planning a trip, people can ask clarifying questions, explore a
similar case, get delighted by photos and comments, consult a friend, and then go back
to the trip decision. It is almost impossible to predict in which sequence a user will
interact with a machine and how the machine can still provide a satisfactory user experience.
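One simple way to reason about this sidetracking behavior during design is to model the dialogue context as a stack of topics, so a digression can be pursued and later popped to return to the main goal. The Python sketch below is a minimal illustration of this idea under assumed, hypothetical topic names; it is not drawn from any of the systems discussed in this course.

# A minimal sketch: dialogue context as a stack of topics, so sidetracks
# can be followed and the main goal recovered afterwards.
class DialogueContext:
    def __init__(self, main_goal: str):
        self.topic_stack = [main_goal]  # bottom of the stack = main goal

    def sidetrack(self, topic: str):
        """User digresses: push the new topic without losing the goal."""
        self.topic_stack.append(topic)

    def resume(self) -> str:
        """Sidetrack finished: pop back toward the main objective."""
        if len(self.topic_stack) > 1:
            self.topic_stack.pop()
        return self.topic_stack[-1]

ctx = DialogueContext("plan_trip")
ctx.sidetrack("compare_hotels")
ctx.sidetrack("ask_friend_opinion")
print(ctx.resume())  # -> "compare_hotels"
print(ctx.resume())  # -> "plan_trip"

Real dialogue managers are far more sophisticated, but even this simple structure makes visible a design question that linear storyboards hide: what should the system say when it pops back to a pending topic?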
Traditional design methods might help to envision graphical user interface
applications and to detect the topics that will embody the conversational system, but they have clear
limitations in supporting the design of the conversation flow of a conversational system.
We now explore two of the main challenges in designing the conversation flow and
some design methods which may help the process.
6.3.1 Conversation initiative
Conversation is a specialized form of interaction. According to Suchman (2007, p.101),
"a distinguishing feature of ordinary conversation is the local, moment-by-moment
management of the distribution of turns, of their size, and what gets done in them, those
things being accomplished in the course of each current speaker's turn." The management
of turns and of subject changes happens in real-life conversations based on circumstances
internal and external to the speakers in a dialogue.
Machines are not prepared, nowadays, to fully understand context and change the
course of conversations as humans do. Managing dialogues with machines is a challenge,
and the challenge increases even more when more than one conversational agent is part
of the same conversation.
Some of those challenges in the dialogue flow were addressed by Zue and Glass (2000).
According to them, as mentioned before, we have system-initiative, user-initiative, and
mixed-initiative systems. In the first case, system-initiative systems restrict user options
by asking direct questions, such as: "Please, say just the departure city." By doing so, those
types of systems are more often successful and easier to answer. On the other hand, user-
initiative systems are the ones where users have the freedom to ask whatever they wish. In this
context, users may feel uncertain of the capabilities of the system and start asking
questions or requesting information or services quite far from the
system domain, leading to user frustration. In their view there is also a mixed-initiative
approach, that is, a goal-oriented dialogue in which users and computers participate
interactively using a conversational paradigm. The challenges of this last category are
to understand interruptions, human utterances, and unclear sentences which are not
always goal-oriented.
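To make the contrast concrete, the Python sketch below caricatures how the same slot (a departure city) might be filled under the two strategies. The city list, function names, and dialogue policy are hypothetical illustrations, not the systems analyzed by Zue and Glass (2000).

# A minimal sketch of the two initiative styles for one slot.
KNOWN_CITIES = {"boston", "recife", "sao paulo"}

def system_initiative_turn(user_input: str):
    # The system asked "Please, say just the departure city.", so the
    # whole utterance is interpreted as the answer to that question.
    city = user_input.strip().lower()
    return city if city in KNOWN_CITIES else None

def mixed_initiative_turn(user_input: str):
    # The user may volunteer information freely; the system scans the
    # utterance for any slot value it can fill before asking back.
    utterance = user_input.lower()
    for city in KNOWN_CITIES:
        if city in utterance:
            return city
    return None  # fall back to a clarifying, system-initiative question

print(system_initiative_turn("Boston"))                               # boston
print(mixed_initiative_turn("I want to leave from Boston tomorrow"))  # boston

The user-initiative case is the hardest to sketch precisely because the space of possible requests is open; that openness is the source of the frustration discussed above.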
The key dilemma is: should we ask users to modify their behaviors and interact with the
system in a structured way? Or should we let users be more comfortable with systems
that have human characteristics? Or both? In our experience this is in fact one of the
earliest and often the most important decisions faced by designers of conversational
systems, and it should be explored in the design process by, for instance, using a Wizard of
Oz approach (Mateas 1999). Designers should also consider examining similar systems
which use human beings, such as call centers, and analyzing recorded conversations to
determine the dominant initiative strategy of users and attendants.
Another component of this decision is the technology available for the deployment of
the platform. As discussed in section 5, different conversational platforms support
different initiative models, so the designer may face application contexts where the
initiative strategy is predetermined by the platform. In this situation she should focus on
finding and identifying patterns of dialogue which make sense given a fixed initiative
strategy. For instance, if the only available platform has a Q&A structure (a typically
user-initiative one), the designer should consider answers which lead the user to specific
follow-up questions when more guidance is needed. In any case, since the decision of
the initiative strategy is closely tied to the capabilities of the deployment platform, it is
important to involve the development team in the process and, often, to make that decision
as early as possible in the design process.
6.3.2 Multimodal conversation
New navigation styles and supporting technology are emerging for what is often referred
to as natural dialogue systems (Berg 2014), that is, mixed-initiative conversational
systems with multimodal and traditional GUI interface capabilities. For instance, several
of the currently available chatbots employ point-and-click methods to simplify user
interaction and, often more importantly, to restrict user openness and avoid user frustration.
Scientific literature about those new approaches is limited, but there is active discussion
in online communities, where designers and writers have been sharing their expertise
and experiences of dealing with chatbots.
Grover (2016), a product manager at WeChat, has suggested that new chatbots should
learn from Chinese conversational bots, in the market since the early 2010s, and become
more user-friendly by using mixed approaches which minimize the steps users take to
achieve their goals. For instance, often in the middle of a conversation, when options are
limited, the user is prompted with a menu integrated into the chat (see Figure 11). Grover
argues that conversational interfaces should be integrated and use diverse multimedia
approaches to provide the best possible user experience. Not only dialogue, but also images,
maps, and easier ways of capturing information should be considered by designers. For
example, in WeChat it is common for business apps to use QR codes and voice commands
instead of typing.
Figure 11: Metro's WeChat interface. Source: Dan Grover (2016).
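In implementation terms, this mixed approach usually means that a bot's turn is not just a string but a structured message which the chat client renders as buttons. The Python sketch below shows one hypothetical payload format for such a turn; it illustrates the idea and is not WeChat's actual API.

# A minimal sketch: when options are limited, attach a menu to the turn
# instead of expecting free text. The payload schema is hypothetical.
def menu_turn(prompt, options):
    return {
        "type": "quick_reply",
        "text": prompt,
        "options": [{"label": o, "payload": o.lower().replace(" ", "_")}
                    for o in options],
    }

turn = menu_turn("Which ticket do you need?",
                 ["Single ride", "Day pass", "Monthly pass"])
print(turn["options"][0])  # {'label': 'Single ride', 'payload': 'single_ride'}

Because each button carries a machine-readable payload, the dialogue manager receives an unambiguous answer, sidestepping natural language understanding exactly at the points where misrecognition would be most frustrating.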
Mariansky (2016), a product designer at Meekan, suggests that chatbots should help users
learn the capabilities of the system, adopting the system-initiative approach as
introduced before. Users should learn how to interact with the robot during the dialogue
session through suggestions given by the robot. The robot should suggest, in the
introduction of the dialogue, the tasks it is able to perform; after that, it should describe
what it is doing and suggest next questions and/or how to get help;
subsequently, it should unlock new features, functions, and advanced tips based
on the user's interaction history. Mariansky (2016) also values point-and-click
interactions to perform tasks such as multiple selections, document browsing, and map
search. Sometimes one button can save the user from typing a long line of text (see Figure 12).
Figure 12: Options and buttons for Meekan, a scheduling robot assistant.
Source: http://meekan.com/slack/
Newton (2016) and Libov (2015) have noticed a shift from voice-powered assistants
(Siri, Alexa, Cortana) to text-based chatbots. Bot makers are creating chatbots
that look like SMS, possibly because messaging apps are a very familiar and
widely used interface. According to a Pew Research report (Smith 2015), 97% of
smartphone owners in the US used text messaging at least once over the course of the study,
making text messaging the most widely used basic feature of the smartphone. GUI interfaces are
based on interaction rules that can change from app to app, but conversational systems
keep the same interaction pattern: "The text I type is displayed on the right, the text
someone else typed is on the left, and there's an input field on bottom for me to compose
a message." (Libov 2015). He also claims text conversation is more comfortable than
voice conversation, although voice is more convenient:
“It can be indexed and searched efficiently, even by hand. It can be translated.
It can be produced and consumed at variable speeds. It is asynchronous. It can
be compared, diffed, clustered, corrected, summarized and filtered
algorithmically. It permits multiparty editing. It permits branching
conversations, lurking, annotation, quoting, reviewing, summarizing, structured
responses, exegesis, even fan fic. The breadth, scale and depth of ways people
use text is unmatched by anything.”(Libov 2015)
It is true that the use of text has advantages over voice in many scenarios, though some might
argue that convenience is more valuable than comfort. Also, text-based systems
can be less synchronous than voice-based systems: it is easy in chat to catch up with an
ongoing or past conversation by simply scrolling back, which is often cumbersome to do
in voice systems. As in the case of the decision about the best initiative strategy of a
conversational system, designers are likely to face the decisions of text vs. voice and
text-only vs. multimodal early in the process and, as in the former discussion, they
should work closely with the development team. To support this decision process,
designers should consider doing field or ethnographic studies to better understand the
contexts where the users will access the application and the possible interferences a text
or voice system may cause.
6.4 Trust and algorithm transparency
Designing computational systems which people can trust has always been a challenge
for interface designers. Given that many of the new, interesting applications of
conversational systems involve advice (financial, medical) or decision making (travel,
personal information), as seen in the examples presented before, creating conversational
systems which users feel they can trust is an important new concern for the designers of
those systems.
Trust and transparency are often tied together in everyday human life, and the same
seems to be true in users' relations with computers. Weiser (1994) coined the term
Invisible Interface in a keynote speech, stating that "the highest ideal is to make a
computer so imbedded, so fitting, so natural, that we use it without even thinking about
it". Since then, ubiquitous and hidden interfaces are terms which have been used by
scientists and designers to describe algorithmically-driven experiences with interfaces.
Systems give recommendations based on parameters gathered from users (e.g., Netflix
and Amazon) and learn parameters to help doctors diagnose diseases (e.g., IBM Watson
Health), to name a few. End users are not aware of how these systems work, and most of
their functions are not visible to them. The aim of the majority of those systems is to
provide the ultimate user experience, with tailored content and interaction delivered in an
implicit way. How to design this ultimate user experience is therefore a key design
concern. Gajendar (2016) argues "that nuances of user-product relationship have been
subtly distorted", considering that algorithms influence the way people behave and
make choices while, most of the time, people are not aware of that influence. Human-
Computer Interaction professionals should understand what the parameters are and how
inferences are made by the algorithms, creating empathy with smart algorithms:
knowing how the algorithm "thinks" and how a human would interact with it. HCI
professionals can contribute humanistic and social insights to the qualities of
computational algorithms (Gajendar 2016).
Empathizing with algorithms might be a way for designers and HCI professionals to
create better user experiences with artificially intelligent systems. Care should be taken
when making those algorithms transparent to end users. Research shows that people
become averse to algorithms, in favor of human beings, after seeing algorithms perform and
commit mistakes (Dietvorst 2015), even when the algorithms perform better than
humans doing the same task. The authors performed five lab-based studies with students
in which participants observed humans or algorithms making forecasts about admitting
new applicants to an academic institution. Participants then decided whether to tie their
incentives to the future predictions of the algorithm or of the human. Participants were
less confident in the algorithm's forecasts after seeing the algorithm make some mistakes,
even when, on the overall task, the algorithm performed better than the human beings.
Similarly, Eslami et al. (2015) examined users' perceptions of the Facebook News Feed
curation algorithm. Users felt betrayed when discovering an algorithm that they were
unaware of, having made incorrect inferences about relationships whose posts did not
appear in the News Feed.
Moreover, experiments show that user profiles might influence people's interpretation
of algorithms. Arshad et al. (2015) compared the confidence of expert and non-expert users
under varying levels of uncertainty presented in a prediction case study of water-pipe failure.
Participants performed three groups of tasks and received a viewgraph of overlapping and non-
overlapping uncertainty presentations as supplementary material for decision making.
Showing this supplementary material improved user confidence; uncertainty with
unknown probabilities decreased user confidence, while uncertainty with known
probabilities increased the confidence of expert users, although the same was not true for
non-experts.
In emergency situations, Robinette et al. (2016) found that people over-trust robots. The
authors first asked participants to follow a robot to a meeting room and then to follow it
in an emergency evacuation scenario, in which participants could choose between
following the robot to the exit or using a lighted emergency exit sign. All twenty-six
students who participated in the study followed the robot in the emergency scenario,
despite having observed the robot perform poorly in a task mere minutes before. In this
study, participants did not have a choice between human guidance and robot guidance,
which could have made a difference in the results.
Conventions, norms, and patterns from everyday real conversations are applied when
designing those systems so that they are adopted and match users' expectations. Ju
(2008) describes implicit interactions in a framework of interactions between humans
and machines. The framework is based on the theory of implicit interactions, which
posits that people rely on conventions of interaction to communicate queries, offers,
responses, and feedback to one another. Conventions and patterns drive our
expectations of interactive device behaviors, and the framework helps designers create
interactions that are more socially appropriate. According to the author, interfaces are
based on either explicit or implicit interactions: explicit interactions are those where
people rely on explicit input and output, while implicit interactions are those which
occur without the user being aware of the behavior of the computer.
Design methodology and theory for creating trustable systems are still in their infancy, at best.
In traditional GUI interfaces it is very hard to capture whether the user is confident
in a system while using it in a real situation. A key advantage of conversational
systems is that the verbal expression of the user can be captured naturally as part of the
process and analyzed offline (or even in real time by the system), using text analytics, to
determine whether the user trusts the system. Moreover, questions to investigate
the level of trust can easily be embedded in the conversation itself without disrupting it.
This new data is likely to allow researchers to investigate trust and algorithm
transparency to a level not possible before, so we should expect a lot of progress in this
area in the next few years.
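As a naive illustration of the idea, the Python sketch below scans user utterances in a dialogue log for doubt and confidence markers and keeps a running trust estimate. A real system would use proper text analytics; the marker lexicons and the scoring rule here are illustrative assumptions only.

# A minimal sketch: lexicon-based trust signals over a dialogue log.
DOUBT = {"sure?", "really?", "doubt", "wrong", "don't trust", "not sure"}
CONFIDENCE = {"thanks", "great", "perfect", "makes sense", "got it"}

def trust_signal(utterance: str) -> int:
    """+1 for a confidence marker, -1 for a doubt marker, 0 otherwise."""
    u = utterance.lower()
    if any(marker in u for marker in DOUBT):
        return -1
    if any(marker in u for marker in CONFIDENCE):
        return 1
    return 0

log = ["Are you sure? That rate looks wrong.", "Ok, got it, thanks."]
print(sum(trust_signal(u) for u in log))  # running trust estimate: 0

Even this crude estimate, tracked turn by turn, would give designers a signal that GUI interfaces simply do not produce.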
7 Final Remarks
Conversational interfaces are becoming increasingly popular, leaving the realm of
futuristic lab demos and sci-fi movies. Anyone with a computer or a smart device can
now interact with chatbots to get personal services, ask for advice, play games, and even
find a partner. Most conversational interfaces are available as mobile apps, and some can
be accessed through messaging platforms, in both text and speech formats.
A lot can be learned from past experiences with conversational systems. Those systems
were created at a time when technology was neither so advanced nor so available to ordinary
citizens. The first wave of conversational systems created stereotypes of how a machine
can maintain a conversation, but it also determined useful roadmaps for the advancement of the
fundamental technology and building blocks. Since 2010, technology has evolved to a
point where sophisticated machine learning algorithms are improving the abilities of
conversational systems, enabling users to rely on them as partners in decision making.
However, care should be taken, as conversational systems are often not transparent about
their data and privacy policies, and their users are not aware of the amount of data those
systems are capturing from them. Privacy and ethical issues should be considered when
designing those invisible interfaces. Studies have shown that people become very
disappointed after discovering that systems capture data without their permission (Gajendar
2016; Dietvorst 2015). Non-expert users have an even more negative reaction than expert
users when discovering how those smart systems work (Arshad 2015; Eslami 2015).
The challenges of designing those systems are evident, considering that people
react to them as they would react to other people (Reeves and Nass 1996), as
discussed extensively before. Understanding human behavior and reactions is key to
designing better conversational interfaces. As we saw, the new design need of
determining the persona of a conversational interface requires a range of
interdisciplinary professionals: journalists, screenwriters, programmers, scientists, and
designers. Text and voice are paramount elements of conversational interfaces, rather than
only the visual elements which used to be the main affordances.
We are witnessing the start of a new human-computer interaction era. Text-based
interfaces, point-and-click interfaces, and voice-based systems are interaction modes that
are going to be part of the same conversational system. The diversity of user input and
system output modes can make the user experience more attractive and immersive. As users
can ask questions of systems, possibilities open up to know more about the user: her
knowledge, her characteristics and personal traits, and even her reactions to system
outputs.
In point-and-click interfaces, HCI professionals had to capture metrics and design
studies to understand how people experienced interactions. Now those reactions can be
captured directly from the dialogues between systems and users. On the other hand,
capturing the navigation flow of conversational interfaces is a challenge for designers,
mainly when interfaces are not based on system initiative (Zue and Glass 2000). The
moment-by-moment space of the conversation, as explained by Suchman (2007), is
where interaction happens and where decisions of dialogue management are crucial. These
characteristics of conversational systems are even more important when more than one
chatbot participates in the dialogue with users.
We are still in the early stages of understanding those nuances of creating and deploying
conversational interfaces and of creating new design methods. We hope this short
course introduces readers to the exciting challenges of designing conversational
interfaces.
References
Argamon, S., Koppel, M., Fine, J. and Shimoni, A. Gender, genre, and writing style in
formal written texts. Text, 23, 3 (2003), 321-346.
Arshad, S.Z. et al. 2015. Investigating User Confidence for Uncertainty Presentation in
Predictive Decision Making. Proceedings of the Annual Meeting of the Australian
Special Interest Group for Computer Human Interaction (New York, NY, USA,
2015), 352–360.
Badenes, H., Bengualid, M. N., Chen, J., Gou, L., Haber, E., Mahmud, J., Nichols, J.
W., Pal, A., Schoudt, J., Smith, B. A., Xuan, Y., Yang, H. and Zhou, M. X. System
U: automatically deriving personality traits from social media for people
recommendation. In Proceedings of the 8th ACM Conference on Recommender
Systems, ACM (2014), 373-374.
Berg, M. M. Modelling of Natural Dialogues in the Context of Speech-based
Information and Control Systems. AKA, 2014.
Berners-Lee, T. et al. 2001. The semantic web. Scientific American. 284, 5 (2001), 28–
37.
Bosker, B. 2013. SIRI RISING: The Inside Story of Siri's Origins -- And Why She
Could Overshadow the iPhone. The Huffington Post. Retrieved August 16, 2016
from http://www.huffingtonpost.com/2013/01/22/siri-do-engine-apple-
iphone_n_2499165.html
Brave, S. and Nass, C. Emotion in human–computer interaction. Human-Computer
Interaction (2003), 53.
Brustein, J. 2016. The Real Story of How Amazon Built the Echo. Bloomberg.com.
Retrieved August 19, 2016 from http://www.bloomberg.com/features/2016-amazon-
echo/
Carroll, J.M. 2000. Five reasons for scenario-based design. Interacting with computers.
13, 1 (2000), 43–60.
Cassell, J., T. Bickmore, M. Billinghurst, L. Campbell, K. Chang, H. Vilhjálmsson, and
H. Yan. “Embodiment in Conversational Interfaces: Rea.” In Proceedings of the
SIGCHI Conference on Human Factors in Computing Systems, 520–27. CHI ’99.
New York, NY, USA: ACM, 1999. doi:10.1145/302979.303150.
Chapman, P. 2010. Learning from Clippy: an interview with Clifford Nass. rgbFilter.
http://www.rgbfilter.com/?p=10350.
Dietvorst, B.J. et al. 2015. Algorithm aversion: People erroneously avoid algorithms
after seeing them err. Journal of Experimental Psychology: General. 144, 1 (2015),
114–126.
Eslami, M. et al. 2015. "I always assumed that I wasn't really that close to [her]":
Reasoning about Invisible Algorithms in News Feeds. Proceedings of the 33rd
Annual ACM Conference on Human Factors in Computing Systems (2015), 153–162.
Fiske, S. T. and Taylor, S. E. Social Cognition: From Brains to Culture. SAGE, 2013.
Ferrucci, D. and Lally, A. UIMA: an architectural approach to unstructured information
processing in the corporate research environment. Natural Language Engineering,
10, 3-4 (2004), 327-348.
Gajendar, U. 2016. Empathizing with the Smart and Invisible: Algorithms! Interactions
23, 4 (June 2016), 24–25. doi:10.1145/2935195.
Giles, H. and Scherer, K. R. 1979. Social Markers in Speech. Cambridge University
Press, Cambridge/New York.
Google. The Knowledge Graph – Inside Search – Google. Retrieved August 19, 2016
from
https://www.google.com/intl/es419/insidesearch/features/search/knowledge.html
Groeger, J. A. Understanding driving: Applying cognitive psychology to a complex
everyday task. Psychology Press, 2000.
Grover, D. 2016. Bots won't replace apps. Better apps will replace apps. Retrieved
August 14, 2016 from http://dangrover.com/blog/2016/04/20/bots-wont-replace-
apps.html
Hempel, J. 2015. Facebook Launches M, Its Bold Answer to Siri and Cortana. WIRED.
Retrieved August 17, 2016 from http://www.wired.com/2015/08/facebook-launches-
m-new-kind-virtual-assistant/
Huang, A., Lee, F., Nass, C., Paik, Y. and Swartz, L. Can voice user interfaces say “I”?
An experiment with recorded speech and TTS. Unpublished manuscript (2001).
Ju, W. G. 2008. The design of implicit interactions. Stanford University.
Kelly, J.E. and Hamm, S. 2013. Smart Machines: IBM's Watson and the Era of
Cognitive Computing. Columbia Business School Publishing.
Kessler, B., Nunberg, G. and Schütze, H. Automatic detection of text genre. In
Proceedings of the Eighth Conference of the European Chapter of the Association
for Computational Linguistics, ACL (1997), 32-38.
Knownav. 2008. Knowledge Navigator. Retrieved August 13, 2016 from
https://www.youtube.com/watch?v=QRH8eimU_20
Lee, E.-J. 2003. Effects of “gender” of the computer on informational social influence:
the moderating role of task type. International Journal of Human-Computer Studies.
58, 4 (2003), 347–362.
Lee, E.J. et al. 2000. Can computer-generated speech have gender?: an experimental test
of gender stereotype. CHI’00 extended abstracts on Human factors in computing
systems (2000), 289–290.
Libov, J. 2015. Futures of Text. Whoops. Retrieved August 15, 2016 from
http://whoo.ps/2015/02/23/futures-of-text
Llitjós, Ariadna Font. 2013. IBM Design – A New Era at IBM: Lean UX Leading the
Way. https://submissions.agilealliance.org/system/attachments/attachments/000/000/
306/original/IBM_Design_Thinking_Agile_2013.pdf
Lynch, P.J. 2008. Web style guide. Yale University Press.
Luger, E. and Sellen, A. "Like Having a Really Bad PA": The Gulf between User
Expectation and Experience of Conversational Agents. In Proceedings of the 2016
CHI Conference on Human Factors in Computing Systems, ACM (2016), 5286-5297.
Mariansky, M. 2016. All Talk and No Buttons: The Conversational UI. A List Apart.
Retrieved August 3, 2016 from http://alistapart.com/article/all-talk-and-no-
buttons-the-conversational-ui
Mateas, M. 1999. An Oz-centric review of interactive drama and believable agents.
Artificial intelligence today. Springer. 297–328.
McTear, M. et al. 2016. The Conversational Interface. Springer International
Publishing.
Mori, M. et al. 2012. The uncanny valley [from the field]. IEEE Robotics & Automation
Magazine. 19, 2 (2012), 98–100.
Myers, I. B. MBTI Manual: a Guide to the Development and Use of the Myers-Briggs
Type Indicator. Consulting Psychologists Press, Palo Alto, Calif., 1998.
Nass, C., Moon, Y., Fogg, B., Reeves, B. and Dryer, C. Can computer personalities be
human personalities? In Conference Companion on Human Factors in Computing
Systems, ACM (1995), 228-229.
Nass, C., and Moon, Y. 2000. Machines and Mindlessness: Social Responses to
Computers. Journal of Social Issues 56, no. 1 (January 1, 2000): 81–103.
doi:10.1111/0022-4537.00153.
Nass, C. and Lee, K. M. Does computer-synthesized speech manifest personality?
Experimental tests of recognition, similarity-attraction, and consistency-attraction.
Journal of Experimental Psychology: Applied, 7, 3 (2001), 171.
Nass, C. and Najmi, S. Race vs. culture in computer-based agents and users:
Implications for internationalizing websites. Unpublished manuscript, Stanford
University, California. 2002.
Nass, C., Jonsson, I.-M., Harris, H., Reaves, B., Endo, J., Brave, S. and Takayama, L.
Improving automotive safety by pairing driver emotion and car voice emotion. In
CHI '05 Extended Abstracts on Human Factors in Computing Systems, ACM
(2005), 1973-1976.
Nass, C. and Brave, S. 2005. Wired for Speech: How Voice Activates and Advances the
Human-Computer Relationship. MIT Press, Cambridge, MA.
Nawaz, A. 2012. A comparison of card-sorting analysis methods. 10th Asia Pacific
Conference on Computer Human Interaction (Apchi 2012). Matsue-city, Shimane,
Japan (2012), 28–31.
Newman, M.L. et al. 2008. Gender differences in language use: An analysis of 14,000
text samples. Discourse Processes. 45, 3 (2008), 211–236.
Newton, C. 2016. Bots Are Here, They're Learning — and in 2016, They Might Eat the
Web. The Verge, January 6, 2016.
http://www.theverge.com/2016/1/6/10718282/internet-bots-messaging-slack-
facebook-m
Norman, D.A. 1988. The Design of Everyday Things. Doubleday, New York.
Norvig, P. 1992. Paradigms of Artificial Intelligence Programming: Case Studies in
Common Lisp. Morgan Kaufmann, San Francisco, California, 109–149.
ISBN 1-55860-191-0.
Novielli, N. et al. 2010. User attitude towards an embodied conversational agent:
Effects of the interaction mode. Journal of Pragmatics. 42, 9 (Sep. 2010), 2385–
2397.
Olavsrud. T. 2014. 10 IBM Watson-Powered Apps That Are Changing Our World.
CIO. Retrieved August 17, 2016 from http://www.cio.com/article/2843710/big-
data/10-ibm-watson-powered-apps-that-are-changing-our-world.html
Olson. P. 2016. Get Ready For The Chat Bot Revolution: They’re Simple, Cheap And
About To Be Everywhere. Forbes. Retrieved August 19, 2016 from
http://www.forbes.com/sites/parmyolson/2016/02/23/chat-bots-facebook-telegram-
wechat/
Pavlus, J. 2016. The Next Phase Of UX: Designing Chatbot Personalities. Retrieved
August 20, 2016 from http://www.fastcodesign.com/3054934/the-next-phase-of-ux-
designing-chatbot-personalities.
Peersman, C. et al. 2011. Predicting age and gender in online social networks.
Proceedings of the 3rd international workshop on Search and mining user-generated
contents (2011), 37–44.
Picard, R. W. Affective Computing. The MIT Press, Cambridge, Massachusetts, 1997.
Pinhanez, C. Borg-Human Interaction Design. In Proc. 4th Service Design and Service
Innovation Conference, LiU Electronic Press (2014), 100-109.
Pinhanez, C. Designing Conflict Resolution in Self-Service Interfaces. Unpublished
manuscript. Sao Paulo, Brazil. September 22, 2015.
Polaine, A. et al. 2013. Service Design: From Insight to Implementation. Rosenfeld
Media, New York.
Pratley, C. 2004. Clippy and User Experiences. Retrieved 2016 from
https://blogs.msdn.microsoft.com/chris_pratley/2004/05/05/clippy-and-user-
experiences/
Pyros, A. 2014. Robotics in Special Education: the Future is NAO. Certified Autism
Specialist. Retrieved August 11, 2016 from
http://www.certifiedautismspecialist.com/robotics-special-education-future-nao/
Reeves, B. and Nass, C. 1996. The Media Equation: How People Treat Computers,
Television, and New Media Like Real People and Places. CSLI Publications and
Cambridge University Press, Cambridge, UK.
Robinette, P. et al. 2016. Overtrust of Robots in Emergency Evacuation Scenarios. The
Eleventh ACM/IEEE International Conference on Human Robot Interaction
(Piscataway, NJ, USA, 2016), 101–108.
Rosenberg, S. 2016. How Kik Predicted the Rise of Chat Bots. Backchannel/Medium.
Retrieved August 18, 2016 from https://backchannel.com/how-kik-
predicted-the-rise-of-chat-bots-2eaf9027b86e
Rosenfeld, L. and Morville, P. 2002. Information architecture for the world wide web.
O’Reilly Media, Inc.
ScienceDaily. Touching a Robot Can Elicit Physiological Arousal in Humans:
Participants Were More Hesitant to Touch a Robot’s Intimate Parts When
Instructed. ScienceDaily. Accessed August 8, 2016.
https://www.sciencedaily.com/releases/2016/04/160405093057.htm.
Serban, I. V., Sordoni, A., Bengio, Y., Courville, A. and Pineau, J. Building end-to-
end dialogue systems using generative hierarchical neural network models. In
Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16)
(2016).
Smith, A. 2015. U.S. Smartphone Use in 2015. Pew Research Center: Internet, Science
& Tech.
Sordoni, A., Galley, M., Auli, M., Brockett, C., Ji, Y., Mitchell, M., Nie, J.-Y., Gao, J.
and Dolan, B. A Neural Network Approach to Context-Sensitive Generation of
Conversational Responses. In Proceedings of Human Language Technologies:
The 2015 Annual Conference of the North American Chapter of the ACL (2015).
Stanford University News Service. 1995. Social science research influences computer
product design. Retrieved 2016 from
http://web.stanford.edu/dept/news/pr/95/950106Arc5423.html
Stickdorn, M. et al. 2011. This is service design thinking: Basics, tools, cases. Wiley
Hoboken, NJ.
Strait, M. et al. 2015. Too much humanness for human-robot interaction: Exposure to
highly humanlike robots elicits aversive responding in observers. Proceedings of the
33rd Annual ACM Conference on Human Factors in Computing Systems (2015),
3593–3602.
Suchman, L. 2007. Human-machine reconfigurations: Plans and situated actions.
Cambridge University Press.
Tajfel, H. 1974. Social identity and intergroup behaviour. Social Science
Information/sur les sciences sociales. (1974).
Tajfel, H. 1981. Human groups and social categories: Studies in social psychology.
CUP Archive.
The Economist. 2016. WeChat’s world. The Economist. Retrieved August 18, 2016
from http://www.economist.com/news/business/21703428-chinas-wechat-shows-
way-social-medias-future-wechats-world
The Sims 4 – Official Site. Retrieved August 18, 2016 from
https://www.thesims.com/
Turing, A.M. 1950. Computing machinery and intelligence. Mind. 59, 236 (1950), 433–
460.
Wallace, R.S. 2009. The Anatomy of A.L.I.C.E. In Parsing the Turing Test, R. Epstein et
al., eds. Springer Netherlands, 181–210.
http://link.springer.com/chapter/10.1007/978-1-4020-6710-5_13
Warren, T. 2014. The story of Cortana, Microsoft’s Siri killer. The Verge. Retrieved
August 16, 2016 from http://www.theverge.com/2014/4/2/5570866/cortana-
windows-phone-8-1-digital-assistant
Weiser, M. 1994. Creating the Invisible Interface (Invited Talk). In Proceedings of the
7th Annual ACM Symposium on User Interface Software and Technology
(UIST '94). ACM, New York, NY, USA. doi:10.1145/192426.192428.
Weizenbaum, J. 1966. ELIZA—a Computer Program for the Study of Natural Language
Communication Between Man and Machine. Commun. ACM. 9, 1 (Jan. 1966), 36–
45.
Weizenbaum, J. 1976. Computer Power and Human Reason: From Judgment to
Calculation. W. H. Freeman & Co.
Weston, J. Dialog-based Language Learning. arXiv preprint arXiv:1604.06045 (2016).
Winograd, T. 1971. Procedures as a Representation for Data in a Computer Program for
Understanding Natural Language. (Jan. 1971).
http://dspace.mit.edu/handle/1721.1/7095.
Wright, W. and I. Maxis Software. (1989). “SimCity.” [Software], Windows PC/Mac.
Maxis Software. Information available from:
http://www.mobygames.com/game/simcity [Accessed 10.08.16]
Wurman, R. S. 1997. Information Architects (1st ed.). Graphis Inc. ISBN 1-888-00138-0.
Zue, V. W., & Glass, J. R. 2000. Conversational interfaces: Advances and challenges.
Proceedings of the IEEE, 88(8), 1166-1180.