DigiMo – towards developing an emotionally intelligent
chatbot in Singapore
Andreea I. Niculescu
Institute for Infocomm Research
Singapore
andreea-n@i2r.a-star.edu.sg
Ivan Kukanov
Institute for Infocomm Research
Singapore
ivan_kukanov@i2r.a-star.edu.sg
Bimlesh Wadhwa
National University of Singapore
Singapore
bimlesh@nus.edu.sg
ABSTRACT
This paper is a work-in-progress report on the development of DigiMo, a chatbot with emotional intelligence. The chatbot development is based on the collection and annotation of real dialogues between local Singaporeans expressing genuine emotions. The models were trained with Cakechat, an open-source sequence-to-sequence deep neural network framework. Perplexity measurements from automatic testing, as well as feedback from six expert evaluators, confirmed that the chatbot's answers have high accuracy. Future research directions and development are briefly discussed.
Author Keywords
Natural language interaction; deep learning; data annotation; emotion; chatbot; expert evaluation.
CCS Concepts
• Human-centered computing~Human computer
interaction (HCI); Natural language interfaces;
INTRODUCTION
Chatbots are conversational software agents that use natural language to interact with human users. Chatbots have been developed since the 1960s – ELIZA, the chatbot psychologist, being one famous example [14] – but only recently has there been growing interest in this particular technology across many industry sectors. It is estimated that around 80% of all businesses globally would like to use chatbots by 2020 [2]. The interest is motivated by increased customer demand for services accessible over messaging platforms. Studies show that customers prefer to contact service providers over instant messages rather than over phone or email [4]. Additionally, bots have 24/7 availability and are efficient in handling repetitive tasks, thereby cutting significant costs for companies. As a result, chatbots are an appealing asset for most organizations.
Another desirable feature the industry wants chatbots to have is emotional intelligence. Emotional intelligence goes beyond informational or transactional tasks and enables chatbots to be successfully deployed as customer service assistants. Such chatbots would "understand" users' feelings and respond accordingly. An example of a chatbot that uses emotion detection and reacts empathetically is Replika [11]. Replika learns a pattern of behavior from the user over time, aiming to become a virtual second self. However, Replika offers only emotional support and does not have any other task-oriented functionality.
Following the global trend, many chatbots have been developed and are currently used in Singapore, for example JIM, the DBS virtual bank recruiter [5]; AskJamie, the e-governance virtual assistant [9]; Kris, Singapore Airlines' chatbot for flights and travel queries [13]; the Bus Uncle, the joke-loving bus schedule assistant for commuters [1]; and SARA, the virtual assistant for tourists [10]. While being informative, helpful and even witty, these applications lack emotion embedding and empathic reactions when interacting with users.
When chatbots mimic humans, they can effectively provide emotional support. Emotionally intelligent chatbots could help, for example, in addressing sensitive issues, e.g. enabling people to report improper behaviour anonymously without conversing with a human. Among the many challenges in designing an emotionally intelligent chatbot, perhaps the hardest is modeling emotion across the conversation for effective, courteous response generation.
In this study, we present our work in progress on developing an emotionally intelligent chatbot for Singapore users that would eventually combine emotional "intelligence" with task-oriented capabilities. The chatbot development is based on data from real dialogues between local Singaporeans expressing genuine emotions. It is part of a larger project on integrative approaches to emotion recognition from multi-modal cues called "Digital Emotions" [6]. In our next research stage, we plan to answer specific questions such as: (i) how can the collected and annotated data best be incorporated into our training model to maximize our chatbot's emotional 'skills'? and (ii) how can we ascertain the effect of the local data used for training on the overall satisfaction of Singaporean users?
DATA COLLECTION
While standard information requests and transactional tasks
may show very similar patterns across businesses around
the world, emotion expression is intrinsically related to a
person's culture and personality. Therefore, it is fundamental to understand how local people express emotions when they chat with each other.
Research in this area usually deploys open data collections based on movie subtitles [12] or Twitter corpora [8]. The advantage is that the vast amount of data can generate solid models. On the other hand, the content of such data might not be optimal for the development of an emotionally intelligent chatbot: firstly, movie subtitles are transcribed spoken interactions with enacted, i.e. not real, emotions. Secondly, transcribed spoken interactions are significantly different from chat interactions. Thirdly, Twitter data are public messages to which people react with comments: even though in real time, the interaction is rather asynchronous. Therefore, in our study we opted for a different approach: we collected conversations in English exchanged between local Singaporeans, and we used them to adjust models pre-trained on Twitter data for our chatbot.
For the data collection, experts were engaged to lead conversations on three pre-selected topics with potential emotional load: customer experiences, weight management & nutrition, and events with psychological impact. Participants were encouraged to talk about their own experiences during a chat session that lasted 30 minutes.
A total of 60 participants interacted over a chat platform with our experts – a customer representative, a psychologist, and a nutritionist. Sixty dialogues with a total of 7,027 turns were collected. The sub-topics covered in the dialogues included experiences in retail and restaurant services, issues concerning home renovation, work, unemployment, personal relationships and family matters (e.g. death of a relative, getting married and going through break-ups), education (e.g. studies, course training, school), hobbies, army, health issues (e.g. lack of sleep, depression, illness, weight management, nutrition), and holidays and experiences abroad. The data was collected and annotated over a period of three months. An example extracted from our data collection is given below:
“09:41 Participant: they just assumed I couldn’t afford it
09:41 Expert: Omg that is bad
09:41 Expert: and you walked away?
09:42 Participant: Yup
09:42 Participant: i did something childish and totally immature after that
09:42 Expert: what was it?
09:42 Participant: went to withdrew $1000 and went back in and showed
them that I COULD HAVE BOUGHT IT IF I WANTED…”
ANNOTATION SCHEME
Humans can express emotions over single or mixed
channels. These channels make use of vocal cues, words or
facial expressions. Emotions expressed through mixed
channels, (e.g. vocal cues & words or vocal cues & facial
expressions, etc.) are easier to interpret, as they are less
ambiguous. Single channel emotion on the other side, such
as words in written communication can be more
challenging to “decode”.
To help our annotators uncover emotions in our data, we developed an annotation scheme based on Ekman's six-emotion model [7]: anger, disgust, fear, happiness, sadness, and surprise. The scheme was enhanced with additional values that the basic emotions could embrace (see Figure 1). Its role was to help annotators identify the correct emotion expressed in the dialogues. Three intensity levels for emotion expression were defined: low (1), medium (2) and high (3).
Figure 1 Emotion Annotation Scheme
We also defined the expression mode, i.e. whether the emotion was expressed empathetically. The scheme was developed in an iterative process based on the first dialogue samples from the newly collected data. A detailed guideline with numerous annotation examples was handed to the annotators at the end of the data collection. Table 1 shows an example extracted from our guideline: the annotation of 'surprise'.
Table 1 Examples of annotations for 'surprise'

<SURPRISE="1" VALUE=NEG>oh, I didn't know that</SURPRISE> – the surprise is signalled by "oh". The intensity value is 1 since there are no other markers of emphasis (no adjectives, no uppercase letters, no exclamation mark).

<SURPRISE="2" VALUE=POS MODE=EMPATHIC>wow great news</SURPRISE> – here the intensity is 2 because of "wow" and the adjective "great"; however, it is expressed empathically, so the intensity is below 3.

<SURPRISE="3" VALUE=POS>OMG, Really??</SURPRISE>

<SURPRISE="3" VALUE=NEG>He died?? OMG what a shock!</SURPRISE> – upper-case letters and double question or exclamation marks signal a higher level of emotion, in this case 3.
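Tags in this format are straightforward to process automatically. The following minimal Python sketch, assuming the exact tag syntax shown in Table 1, extracts the emotion, intensity, polarity and annotated text:

```python
import re

# Minimal sketch of parsing the annotation tags shown in Table 1, assuming
# the <EMOTION="intensity" VALUE=POS|NEG [MODE=...]>text</EMOTION> format.
TAG = re.compile(
    r'<(?P<emotion>[A-Z]+)="(?P<intensity>[123])"'
    r'\s+VALUE=(?P<value>POS|NEG)'
    r'(?:\s+MODE=(?P<mode>[A-Z]+))?>'
    r'(?P<text>.*?)</(?P=emotion)>'
)

sample = '<SURPRISE="3" VALUE=POS>OMG, Really??</SURPRISE>'
match = TAG.search(sample)
if match:
    print(match.group("emotion"), match.group("intensity"),
          match.group("value"), match.group("text"))
    # -> SURPRISE 3 POS OMG, Really??
```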
ANNOTATION RESULTS
Two annotators performed the annotations. Inter-annotator reliability was calculated using Krippendorff's alpha on 10% of the entire data set. The results indicate high agreement, with an alpha of 0.817.
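For reference, such an agreement score can be computed with the open-source krippendorff Python package; the labels below are illustrative, not our actual annotations:

```python
# Sketch of the inter-annotator agreement check, using the open-source
# `krippendorff` package (pip install krippendorff).
import krippendorff

# Each row holds one annotator's emotion labels over the same data sample;
# the values here are illustrative placeholders, not our real annotations.
annotator_1 = [1, 2, 2, 5, 3, 1, 4, 4, 6, 2]
annotator_2 = [1, 2, 3, 5, 3, 1, 4, 4, 6, 2]

alpha = krippendorff.alpha(reliability_data=[annotator_1, annotator_2],
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")  # the paper reports 0.817
```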
The data collection enabled us to tag 801 emotion instances, the most frequent emotions being angry (29%) and happy (28%), as shown in Figure 2.
A brief examination of the emotions expressed by our Singaporean participants revealed several interesting observations:
Happiness is often expressed using the verb "to feel" in combination with intensified positive adjectives: "I am feeling really happy", "I feel a sense of satisfaction", "very excited / proud / good / glad", "absolutely amazing / damn siok", "excellent", "magic" and "wonderful". Other verbs often used in sentences annotated as "happy" are "to like" and "to love". Happy expressions are sometimes combined with transliterations of laughter. Upper-case letters and exclamation marks are used to emphasize the expression of happiness.
Anger is expressed similarly to happiness but with the opposite polarity: verbs plus intensified negative adjectives, with upper-case letters and exclamation marks for emphasis. Interestingly, anger seems to deploy a larger variation of negative verbs, such as "to refuse", "to fail", "to anger", "to upset", "to insult", "to annoy", and "to feel" plus intensified negative adjectives: "really disappointed", "so angry", "very rude", and also "bad", "unfair", "really sucked", "spoiled mood". Apart from this larger linguistic diversity, statements expressing anger tend to be longer and more descriptive than their "happy" counterparts, sometimes spreading over more than one dialogue turn.
Figure 2 Emotion distribution frequencies
Sadness is often expressed in conjunction with the verb "to feel" plus an adjective ("sad", "lonely", "much worse", "guilty", "uncomfortable", "regretful") or using the passive construction "I was affected". Often, descriptions of events are presented as "It was" plus "depressing", "tough", "awful", "a pity". Feelings of despair are also described using expressions like "to cry" / "I was in tears" or through rhetorical questions and observations: "Why am I here?", "I have no meaning". Sad emotions are sometimes accompanied by a sad smiley.
Surprise is mostly marked through interjections such as "oh", "ah", "wow" and "OMG", followed by a statement of appreciation or disapproval. Surprise statements may also mention the cause of surprise: "I didn't know" or "I thought […]" in situations where the exact opposite was believed to be true. Surprise statements often end with one or more exclamation or question marks. Surprise can also be expressed using phrases like "I was surprised / puzzled / shocked", "better than I expected" or "surprisingly".
Disgust is more difficult to spot, as there are no distinctive linguistic patterns that would separate it from anger. Disgust is often expressed through sarcasm: "Only Asian love to ask such questions!" It appears in statements containing criticism of observed behaviors or events that are highly disliked or held in contempt; however, these events do not harm the observer directly. Negative adjectives are used, such as "bad urine smell", "pushy", "unfriendly", "poor", "selfish", "infamous airline". Only the context can determine whether a statement expresses anger or disgust.
ARCHITECTURE
To train DigiMo, our chatbot, we used Cakechat [3], an open-source project. Cakechat implements a hierarchical recurrent encoder-decoder (HRED) sequence-to-sequence dialogue model and provides pre-trained models, which were trained on carefully pre-processed Twitter data (11 GB).
We fed our own data collection to Cakechat, using a dictionary of 50k words. Statements with no emotional expression were labeled as 'neutral'. The learning rate was set to 0.01. A maximum of 30 tokens is allowed for the encoded/decoded sequences; accordingly, we trimmed and readjusted the turns to fit this constraint. Both the encoder and the decoder contain 2 GRU layers with 512 hidden units each.
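As an illustration of this configuration, the following PyTorch sketch mirrors the hyperparameters above. It is not Cakechat's actual implementation; the embedding size and the additive emotion conditioning are our own simplifying assumptions:

```python
# Illustrative PyTorch sketch (not Cakechat's actual code) of a GRU
# encoder-decoder with the hyperparameters reported above.
import torch
import torch.nn as nn

VOCAB_SIZE = 50_000    # dictionary of 50k words
MAX_TOKENS = 30        # maximum encoder/decoder sequence length
HIDDEN, LAYERS = 512, 2  # 2 GRU layers with 512 hidden units each
EMB_DIM = 128          # embedding size: a hypothetical choice, not reported
N_EMOTIONS = 7         # Ekman's six emotions plus 'neutral'

class DigiMoSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.emotion_embed = nn.Embedding(N_EMOTIONS, HIDDEN)
        self.encoder = nn.GRU(EMB_DIM, HIDDEN, num_layers=LAYERS, batch_first=True)
        self.decoder = nn.GRU(EMB_DIM, HIDDEN, num_layers=LAYERS, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, src, tgt, emotion):
        _, state = self.encoder(self.embed(src))      # encode the user turn
        state = state + self.emotion_embed(emotion)   # simplified emotion conditioning
        logits, _ = self.decoder(self.embed(tgt), state)
        return self.out(logits)                       # per-token vocabulary logits

model = DigiMoSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # learning rate from the paper
```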
As this study is a work in progress, we have not yet implemented an emotion classifier for the user input. However, we were able to test the chatbot's answers by manually adding the emotion label to each user input and testing the system over the terminal, e.g. <user_input: "I am really scared of this exam" [FEAR]>. The architecture of DigiMo is depicted in Figure 3.
Figure 3 DigiMo architecture
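A hypothetical sketch of this manual-tagging test loop is shown below; model_reply is a placeholder for the trained model's decoding step, not an actual Cakechat call:

```python
# Hypothetical sketch of the terminal test: the emotion label is supplied
# by hand because the input-side emotion classifier is not yet implemented.
EMOTIONS = {"neutral", "happy", "sad", "angry", "disgust", "fear", "surprise"}

def respond(text: str, emotion: str) -> str:
    assert emotion in EMOTIONS, f"unknown emotion tag: {emotion}"
    # model_reply is a placeholder for the trained model's decoding step.
    return model_reply(text, condition=emotion)

while True:
    raw = input("user_input: ")          # e.g. I am really scared of this exam [FEAR]
    text, _, tag = raw.rpartition("[")   # split off the manual emotion tag
    print("DigiMo:", respond(text.strip(), tag.rstrip("]").lower()))
```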
AUTOMATIC EVALUATION
We tested our model experimenting with different numbers of epochs and batch sizes. During training, we monitored context-sensitive and context-free perplexity values2. The best results – context-free and context-sensitive perplexity both at a low value of 29 – were achieved with a system configuration using a batch size of 128, 15 epochs, and a context size of 3.

2 Perplexity is a measure that shows how well a probability model predicts test data; in this case, it shows how good the language model is. A lower perplexity means a better model.
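In formula terms, the perplexity we track follows the standard definition over the N tokens of the test data, where p(w_i | w_{<i}) is the model's probability for token w_i given its preceding context:

\[
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(w_i \mid w_{<i}\right)\right)
\]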
EXPERT EVALUATION
To test how a perplexity of 29 translates into human judgment, we performed a test on a subset of emotions. We chose 'happy' (positive), 'angry' (negative) and 'no emotion' (neutral statements); we also chose 'disgust' (negative) as the emotion most difficult to detect and correctly annotate in our corpus. We manually generated 12x4 user input sentences that would be neutral or express the emotions happy, angry and disgust. Then, we programmed DigiMo to respond automatically to each of these sentences using the same repertoire of emotions in an experimental 4x4 matrix setting – see Table 2. In this way, a total of 192 (48x4) question & answer pairs were generated. Further, six experts – three linguists and three chatbot developers – were asked to classify DigiMo's replies as suitable, neutral3 or unsuitable. To calculate DigiMo's answer accuracy we used: Accuracy = (N_suitable + N_neutral) / N_total × 100.

3 Neutral responses are those suitable in certain contexts only.
From a total of 1152 evaluations (192x6), 665 were rated suitable, 236 neutral and 251 unsuitable; thus, the total accuracy over our sample data is 78.20%.
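These figures can be verified directly from the counts reported above:

```python
# Quick check of the evaluation arithmetic reported above.
n_inputs = 12 * 4            # 12 hand-written sentences per user emotion, 4 emotions
n_pairs = n_inputs * 4       # each input answered under 4 chatbot emotions -> 192
n_evaluations = n_pairs * 6  # 6 experts rated every pair -> 1152

n_suitable, n_neutral, n_unsuitable = 665, 236, 251
assert n_suitable + n_neutral + n_unsuitable == n_evaluations

accuracy = (n_suitable + n_neutral) / n_evaluations * 100
print(f"{accuracy:.2f}%")    # 78.21%, matching the reported 78.20% up to rounding
```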
Table 2 Expert evaluation of answer accuracy (rows: user emotion; columns: chatbot emotion)

User \ Chatbot   Neutral   Happy     Disgust   Angry     Avg.
Neutral          90.27%    88.88%    68.05%    61.11%    77.07%
Happy            75.00%    95.83%    72.22%    52.77%    73.95%
Disgust          83.33%    69.44%    79.16%    79.16%    77.77%
Angry            81.94%    88.88%    90.27%    75.00%    84.02%
Avg.             82.63%    85.75%    77.42%    67.01%    78.20%
The highest average values were achieved for chatbot answers expressing happiness (85.75%), with the highest single value obtained for the combination matching happy emotions for both user and DigiMo (95.83%).
An interesting case is the combination of user_angry and chatbot_happy (88.88%). Despite being a mismatch, this combination achieved many accurate responses. A closer look revealed that it produces sarcastic but at the same time suitable responses, e.g. (User: "my phone is out of battery", DigiMo: "awesome"). The lowest average values were achieved for chatbot answers loaded with angry emotions (67.01%), with the lowest single value obtained by the combination user_happy & chatbot_angry (52.77%).
DISCUSSION
Even though our chatbot is still under development, it shows very promising results. Thanks to our data collection and careful annotation, it responds to our input in an emotionally appropriate way. Our next target is to develop and test an emotion classifier and to incorporate the models into a task-oriented chatbot deployed in customer service.
Unlike sentiment analysis, emotion detection covers several emotion categories, going beyond a binary negative/positive classification. This classification complexity makes emotion detection a more challenging task. Moreover, relying on text only, i.e. in the absence of speech, which is the typical case for text-based chatbots, makes emotion detection much harder. Intensity varies according to user personality, chatting habits and culture. In Singapore, we would expect people to express emotions rather moderately, which is typical for Asian cultures known for their tendency towards introverted patterns of expression.
As mentioned earlier, in the long term we will investigate the following research questions:
(i) How can the collected and annotated data best be incorporated into our training model to maximize our chatbot's emotional 'skills'? At the moment, we took into account only 'emotion'-relevant dialogues; however, context could play an additional important role. We aim to investigate how much context would be required and how to incorporate it efficiently into our training model.
(ii) How can we evaluate the effect of the local data collected and used for training on the overall satisfaction of Singaporean users? In other words, is it worth collecting local data and modeling a chatbot on it, or would any type of data do a similar job? These questions will be answered in our future research.
ACKNOWLEDGMENTS
We thank our experts for participating in our study. This research was supported by a SERC Strategic Fund from the Science &
Engineering Research Council (SERC), A*STAR (project no.
a1718g0046).
REFERENCES
[1] Bus Uncle. Retrieved January 5, 2020 from https://www.busuncle.sg/
[2] Business Insider. Retrieved January 5, 2020 from https://www.businessinsider.com/80-of-businesses-want-chatbots-by-2020-2016-12?IR=T
[3] Cakechat GitHub. Retrieved January 5, 2020 from https://github.com/lukalabs/cakechat#network-architecture-and-features
[4] Chatbots Magazine. Retrieved January 5, 2020 from https://chatbotsmagazine.com/the-role-of-emotional-intelligence-in-ai-1e078ac0e328
[5] DBS website. Retrieved January 5, 2020 from https://www.dbs.com/newsroom/DBS_introduces_Jim_Southeast_Asias_first_virtual_bank_recruiter
[6] Digital Emotions. Retrieved January 5, 2020 from http://projectdigitalemotion.net/
[7] Paul Ekman. 1999. Basic Emotions. In: Handbook of Cognition and Emotion, T. Dalgleish & M.J. Power (eds.), Wiley, US, 45-60.
[8] Boris Galitsky. 2019. Developing Enterprise Chatbots: Learning Linguistic Structures. Springer.
[9] GovTech Singapore. Retrieved January 5, 2020 from https://www.tech.gov.sg/products-and-services/ask-jamie/
[10] A.I. Niculescu, K.H. Yeo, L.F. D'Haro, S. Kim, R. Jiang, R.E. Banchs. 2014. Design and evaluation of a conversational agent for the touristic domain. In: Proc. of APSIPA.
[11] Replika. Retrieved January 5, 2020 from https://replika.ai/
[12] C. Segura, A. Palau, J. Luque, M.R. Costa-Jussà, R.E. Banchs. 2019. Chatbol, a Chatbot for the Spanish "La Liga". In: D'Haro L., Banchs R., Li H. (eds.) 9th Int. Workshop on Spoken Dialogue System Technology. LNEE, vol. 579. Springer, Singapore, 319-330.
[13] SilverKris. Retrieved January 5, 2020 from https://www.silverkris.com/meet-kris-the-new-beta-chatbot-for-singapore-airlines/
[14] Joseph Weizenbaum. 1966. ELIZA - a computer program for the study of natural language communication between man and machine. Communications of the ACM, vol. 9, no. 1, 36-45.