The Design and Implementation of XiaoIce, an Empathetic Social Chatbot

This paper describes the development of the Microsoft XiaoIce system, the most popular social chatbot in the world. XiaoIce is uniquely designed as an AI companion with an emotional connection to satisfy the human need for communication, affection, and social belonging. We take into account both intelligence quotient (IQ) and emotional quotient (EQ) in system design, cast human-machine social chat as decision-making over Markov Decision Processes (MDPs), and optimize XiaoIce for long-term user engagement, measured in expected Conversation-turns Per Session (CPS). We detail the system architecture and key components, including the dialogue manager, core chat, skills, and an empathetic computing module. We show how XiaoIce dynamically recognizes human feelings and states, understands user intents, and responds to user needs throughout long conversations. Since its release in 2014, XiaoIce has communicated with over 660 million users and succeeded in establishing long-term relationships with many of them. Analysis of large-scale online logs shows that XiaoIce has achieved an average CPS of 23, which is significantly higher than that of other chatbots and even human conversations.
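As a rough illustration of the CPS metric, the sketch below averages the number of conversation turns over a set of logged sessions. The log format (a list of `(speaker, text)` pairs per session) and the rule that one turn is a user message followed by a bot reply are assumptions made for this example; the abstract does not specify how turns are counted.

```python
from statistics import mean


def turns_in_session(utterances):
    """Count turns in one session.

    Assumption: a turn is a user message that is followed by a bot reply.
    """
    turns = 0
    awaiting_reply = False
    for speaker, _text in utterances:
        if speaker == "user":
            awaiting_reply = True
        elif speaker == "bot" and awaiting_reply:
            turns += 1
            awaiting_reply = False
    return turns


def conversation_turns_per_session(sessions):
    """Average number of turns per session (CPS) over hypothetical logs."""
    return mean(turns_in_session(session) for session in sessions)


# Two toy sessions: one with two turns, one with a single turn.
example_sessions = [
    [("user", "hi"), ("bot", "hello"), ("user", "how are you"), ("bot", "fine")],
    [("user", "hi"), ("bot", "hey")],
]
example_cps = conversation_turns_per_session(example_sessions)
```

On the toy logs above, the two sessions contribute 2 and 1 turns, so the average is 1.5.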

For automatic chatting systems, it is a great challenge to reply to a given query while taking the conversation history into account, rather than relying on the query alone. This paper proposes a deep neural network that addresses the context-aware response ranking problem through end-to-end learning, so as to help select conversationally relevant candidates. By combining a multi-column convolutional layer with a recurrent layer, our model captures the semantics of the utterance sequence by grasping the semantic clues within the conversation, on the basis of an effective representation for each sentence. In particular, the network uses attention pooling to further emphasize the importance of essential words in conversations, so the representations of contexts tend to be more meaningful and the performance of candidate ranking is notably improved. Moreover, because of the attention pooling, it is possible to visualize the semantic clues. Experimental results on a large amount of conversation data from social media show that our approach is promising for quantifying the conversational relevance of responses, and indicate its good potential for building practical IR-based chatbots.
Since the late 1990s, when speech companies began bringing their customer-service software to market, people have gotten used to speaking to machines. As people interact more often with voice- and gesture-controlled machines, they expect the machines to recognize different emotions and to understand other high-level communication features such as humor, sarcasm and intention. To make such communication possible, the machines need an empathy module that can extract emotions from human speech and behavior and decide on the correct response of the robot. Although research on empathetic robots is still at an early stage, we describe our approach, which uses signal processing techniques, sentiment analysis and machine learning algorithms to make robots that can "understand" human emotion. We propose Zara the Supergirl as a prototype system of empathetic robots. It is a software-based virtual android, with an animated cartoon character to present itself on the screen. It will get "smarter" and more empathetic through its deep learning algorithms, and by gathering more data and learning from it. In this paper, we present our work so far in the areas of deep learning of emotion and sentiment recognition, as well as humor recognition. We hope to explore the future direction of android development and how it can help improve people's lives.
We propose Neural Responding Machine (NRM), a neural network-based response generator for Short-Text Conversation. NRM takes the general encoder-decoder framework: it formalizes the generation of a response as a decoding process based on the latent representation of the input text, while both encoding and decoding are realized with recurrent neural networks (RNN). The NRM is trained with a large amount of one-round conversation data collected from a microblogging service. Empirical study shows that NRM can generate grammatically correct and content-wise appropriate responses to over 75% of the input text, outperforming state-of-the-art methods in the same setting, including retrieval-based and SMT-based models.
Latent semantic models, such as LSA, intend to map a query to its relevant documents at the semantic level where keyword-based matching often fails. In this study we strive to develop a series of new latent semantic models with a deep structure that project queries and documents into a common low-dimensional space where the relevance of a document given a query is readily computed as the distance between them. The proposed deep structured semantic models are discriminatively trained by maximizing the conditional likelihood of the clicked documents given a query using the clickthrough data. To make our models applicable to large-scale Web search applications, we also use a technique called word hashing, which is shown to effectively scale up our semantic models to handle large vocabularies which are common in such tasks. The new models are evaluated on a Web document ranking task using a real-world data set. Results show that our best model significantly outperforms other latent semantic models, which were considered state-of-the-art in the performance prior to the work presented in this paper.
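The abstract mentions word hashing without describing it. One way to picture the idea, assuming the hashing breaks each boundary-marked word into letter trigrams so that a large word vocabulary maps onto a much smaller set of sub-word units, is the sketch below. The function names and the choice of `#` as the boundary marker are illustrative assumptions, not details taken from the text above.

```python
from collections import Counter


def letter_trigrams(word):
    """Split a word into letter trigrams after adding '#' boundary markers.

    Assumption: trigrams over '#word#' stand in for the word-hashing step.
    """
    marked = f"#{word.lower()}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def hash_text(text):
    """Represent a query or document as a bag of letter-trigram counts."""
    counts = Counter()
    for word in text.split():
        counts.update(letter_trigrams(word))
    return counts


trigrams = letter_trigrams("good")   # ['#go', 'goo', 'ood', 'od#']
bag = hash_text("good food")         # 'ood' and 'od#' each appear twice
```

Because many different words share the same trigrams, the resulting input vectors stay compact even when the word vocabulary is very large, which is the scaling property the abstract attributes to word hashing.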
A chatbot is a software system, which can interact or "chat" with a human user in natural language such as English. For the annual Loebner Prize contest, rival chatbots have been assessed in terms of ability to fool a judge in a restricted chat session. We are investigating methods to train and adapt a chatbot to a specific user's language use or application, via a user-supplied training corpus. We advocate open-ended trials by real users, such as an example Afrikaans chatbot for Afrikaans-speaking researchers and students in South Africa. This is evaluated in terms of "glass box" dialogue efficiency metrics, and "black box" dialogue quality metrics and user satisfaction feedback. The other examples presented in this paper are the Qur'an and the FAQchat prototypes. Our general conclusion is that evaluation should be adapted to the application and to user needs.
We present a new ranking algorithm that combines the strengths of two previous methods: boosted tree classification, and LambdaRank, which has been shown to be empirically optimal for a widely used information retrieval measure. Our algorithm is based on boosted regression trees, although the ideas apply to any weak learners, and it is significantly faster in both train and test phases than the state of the art, for comparable accuracy. We also show how to find the optimal linear combination for any two rankers, and we use this method to solve the line search problem exactly during boosting. In addition, we show that starting with a previously trained model, and boosting using its residuals, furnishes an effective technique for model adaptation, and we give significantly improved results for a particularly pressing problem in web search—training rankers for markets for which only small amounts of labeled data are available, given a ranker trained on much more data from a larger market.
We consider incorporating topic information into a sequence-to-sequence framework to generate informative and interesting responses for chatbots. To this end, we propose a topic aware sequence-to-sequence (TA-Seq2Seq) model. The model utilizes topics to simulate prior human knowledge that guides them to form informative and interesting responses in conversation, and leverages topic information in generation by a joint attention mechanism and a biased generation probability. The joint attention mechanism summarizes the hidden vectors of an input message as context vectors by message attention and synthesizes topic vectors by topic attention from the topic words of the message obtained from a pre-trained LDA model, with these vectors jointly affecting the generation of words in decoding. To increase the possibility of topic words appearing in responses, the model modifies the generation probability of topic words by adding an extra probability item to bias the overall distribution. Empirical studies on both automatic evaluation metrics and human annotations show that TA-Seq2Seq can generate more informative and interesting responses, significantly outperforming state-of-the-art response generation models.
Conversational systems have come a long way after decades of research and development, from Eliza and Parry in the 1960s and 1970s, to task-completion systems as in the ATIS project, to intelligent personal assistants such as Siri, and to today's social chatbots like XiaoIce. Social chatbots' appeal lies not only in their ability to respond to users' diverse requests, but also in their ability to establish an emotional connection with users. The latter is done by satisfying the users' essential needs for communication, affection, and social belonging. The design of social chatbots must focus on user engagement and take both intellectual quotient (IQ) and emotional quotient (EQ) into account. Users should want to engage with the social chatbot; as such, we define the success metric for social chatbots as conversation-turns per session (CPS). Using XiaoIce as an illustrative example, we discuss key technologies in building social chatbots from core chat to visual sense to skills. We also show how XiaoIce can dynamically recognize emotion and engage the user throughout long conversations with appropriate interpersonal responses. As we become the first generation of humans ever living with AI, social chatbots that are well-designed to be both useful and empathic will soon be ubiquitous.
The popularity of image sharing on social media reflects the important role visual context plays in everyday conversation. In this paper, we present a novel task, Image-Grounded Conversations (IGC), in which natural-sounding conversations are generated about shared photographic images. We investigate this task using training data derived from image-grounded conversations on social media and introduce a new dataset of crowd-sourced conversations for benchmarking progress. Experiments using deep neural network models trained on social media data show that the combination of visual and textual context can enhance the quality of generated conversational turns. In human evaluation, a gap between human performance and that of both neural and retrieval architectures suggests that IGC presents an interesting challenge for vision and language research.
We study the problem of question retrieval in community question answering (CQA). The biggest challenge in this task is the lexical gap between questions, since similar questions are usually expressed with different but semantically related words. To bridge the gap, state-of-the-art methods incorporate extra information such as word-to-word translation and categories of questions into the traditional language models. We find that the existing language-model-based methods can be interpreted using a new framework: they represent words and question categories in a vector space and calculate question-question similarities with a linear combination of dot products of the vectors. The problem is that these methods are either heuristic on data representation or difficult to scale up. We propose a principled and efficient approach to learning representations of data in CQA. In our method, we simultaneously learn vectors of words and vectors of question categories by optimizing an objective function naturally derived from the framework. In question retrieval, we incorporate the learnt representations into traditional language models in an effective and efficient way. We conduct experiments on large-scale data from Yahoo! Answers and Baidu Knows, and compare our method with state-of-the-art methods on two public data sets. Experimental results show that our method can significantly improve on baseline methods for retrieval relevance. On 1 million training data, our method takes less than 50 minutes to learn a model on a single multicore machine, while the translation-based language model needs more than 2 days to learn a translation table on the same machine.
Social media is an increasingly important part of modern life. We investigate the use of and usability of Twitter by blind users, via a combination of surveys of blind Twitter users, large-scale analysis of tweets from and Twitter profiles of blind and sighted users, and analysis of tweets containing embedded imagery. While Twitter has traditionally been thought of as the most accessible social media platform for blind users, Twitter's increasing integration of image content and users' diverse uses for images have presented emergent accessibility challenges. Our findings illuminate the importance of the ability to use social media for people who are blind, while also highlighting the many challenges such media currently present this user base, including difficulty in creating profiles, in awareness of available features and settings, in controlling revelations of one's disability status, and in dealing with the increasing pervasiveness of image-based content. We propose changes that Twitter and other social platforms should make to promote fuller access to users with visual impairments.
We present persona-based models for handling the issue of speaker consistency in neural response generation. A speaker model encodes personas in distributed embeddings that capture individual characteristics such as background information and speaking style. A dyadic speaker-addressee model captures properties of interactions between two interlocutors. Our models yield qualitative performance improvements in both perplexity and BLEU scores over baseline sequence-to-sequence models, with similar gain in speaker consistency as measured by human judges.
Sequence-to-sequence neural network models for generation of conversational responses tend to generate safe, commonplace responses (e.g., "I don't know") regardless of the input. We suggest that the traditional objective function, i.e., the likelihood of output (responses) given input (messages), is unsuited to response generation tasks. Instead we propose using Maximum Mutual Information (MMI) as the objective function in neural models. Experimental results demonstrate that the proposed objective function produces more diverse, interesting, and appropriate responses, yielding substantive gains in BLEU scores on two conversational datasets.
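To make the contrast between the two objectives concrete, the sketch below reranks candidate responses with one common form of an MMI-style score, log p(T|S) - lambda * log p(T), which penalizes responses that are likely regardless of the message. The exact form of the score, the weight `lam`, and the log-probabilities are assumptions chosen for illustration; the abstract itself only names the objective.

```python
def mmi_rerank(candidates, lam=0.5):
    """Sort candidates by an MMI-style score, best first.

    Each candidate is (response, log_p_given_source, log_p_response).
    Score (assumed form): log p(T|S) - lam * log p(T).
    With lam = 0 this reduces to plain likelihood ranking.
    """
    return sorted(candidates, key=lambda c: c[1] - lam * c[2], reverse=True)


# Hypothetical log-probabilities: the generic reply is more likely given the
# message, but it is also far more likely a priori.
candidates = [
    ("I don't know", -2.0, -1.0),
    ("The match starts at nine", -2.5, -6.0),
]
by_likelihood = mmi_rerank(candidates, lam=0.0)
by_mmi = mmi_rerank(candidates, lam=0.5)
```

With these made-up numbers, likelihood ranking puts the generic reply first, while the penalized score (-1.5 versus 0.5) promotes the more specific one.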
The recent progress on image recognition and language modeling is making automatic description of image content a reality. However, stylized, non-factual aspects of the written description are missing from the current systems. One such style is description with emotions, which is commonplace in everyday communication and influences decision-making and interpersonal relationships. We design a system to describe an image with emotions, and present a model that automatically generates captions with positive or negative sentiments. We propose a novel switching recurrent neural network with word-level regularization, which is able to produce emotional image captions using only 2000+ training sentences containing sentiments. We evaluate the captions with different automatic and crowd-sourcing metrics. Our model compares favourably in common quality metrics for image captioning. In 84.6% of cases the generated positive captions were judged as being at least as descriptive as the factual captions. Of these positive captions, 88% were confirmed by the crowd-sourced workers as having the appropriate sentiment.
We present a novel response generation system that can be trained end to end on large quantities of unstructured Twitter conversations. A neural network architecture is used to address sparsity issues that arise when integrating contextual information into classic statistical models, allowing the system to take into account previous dialog utterances. Our dynamic-context generative models show consistent gains over both context-sensitive and non-context-sensitive Machine Translation and Information Retrieval baselines.
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.7 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a strong phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which beats the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
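The source-reversal trick described at the end of this abstract amounts to a small preprocessing step. The sketch below, with a hypothetical helper name, reverses only the source tokens so that the first source word ends up next to the first target word when the pair is fed to an encoder-decoder model.

```python
def make_training_pair(source_tokens, target_tokens):
    """Reverse the source sentence but leave the target sentence unchanged."""
    return list(reversed(source_tokens)), list(target_tokens)


# Toy example: the source is fed as "c b a", the target stays "x y z".
pair = make_training_pair(["a", "b", "c"], ["x", "y", "z"])
```

After reversal, "a" is the last token the encoder reads and "x" is the first token the decoder emits, which is the kind of short-term dependency the abstract credits for easier optimization.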
Empathic computing is an emergent paradigm that enables a system to understand human states and feelings and to share this intimate information. The new paradigm is made possible by the convergence of affordable sensors, embedded processors and wireless ad-hoc networks. The power law for multi-resolution channels and mobile-stationary sensor webs is introduced to resolve the information avalanche problems. As empathic computing is sensor-rich computing, particular models such as semantic differential expressions and inverse physics are discussed. A case study of a wearable sensor network for detection of a falling event is presented. It is found that the results are sensitive to the location of the wearable sensor. With the machine learning algorithm, the accuracy reaches up to 90% in 21 simulated trials. Empathic computing is not limited to healthcare. It can also be applied to solve other everyday-life problems such as management of emails and stress.
A case of artificial paranoia has been synthesized in the form of a computer simulation model. The model and its embodied theory are briefly described. Several excerpts from interviews with the model are presented to illustrate its paranoid input-output behavior. Evaluation of the success of the simulation will depend upon indistinguishability tests.
ELIZA is a program operating within the MAC time-sharing system of MIT which makes certain kinds of natural language conversation between man and computer possible. Input sentences are analyzed on the basis of decomposition rules which are triggered by key words appearing in the input text. Responses are generated by reassembly rules associated with selected decomposition rules. The fundamental technical problems with which ELIZA is concerned are: (1) the identification of key words, (2) the discovery of minimal context, (3) the choice of appropriate transformations, (4) generation of responses in the absence of key words, and (5) the provision of an editing capability for ELIZA "scripts". A discussion of some psychological issues relevant to the ELIZA approach as well as of future developments concludes the paper.
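The decomposition and reassembly mechanism can be sketched in a few lines. The rules, templates, and fallback sentence below are invented for illustration and are not taken from any ELIZA script; regular expressions stand in for the keyword-triggered decomposition rules, and format templates stand in for the reassembly rules.

```python
import re

# Each entry pairs a decomposition rule (a pattern triggered by a key word)
# with a reassembly rule (a template filled from the captured fragments).
# These rules are hypothetical examples.
RULES = [
    (re.compile(r".*\bI am (.*)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r".*\bmy (mother|father)\b.*", re.IGNORECASE),
     "Tell me more about your {0}."),
]

# Response used when no key word is found in the input.
FALLBACK = "Please go on."


def respond(sentence):
    """Apply the first matching decomposition rule, else the fallback."""
    text = sentence.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            return template.format(*match.groups())
    return FALLBACK


reply_1 = respond("I am unhappy.")
reply_2 = respond("Well, my mother hates me")
reply_3 = respond("Hello")
```

The three calls exercise a keyword rule with a captured fragment, a second keyword rule, and the no-keyword case, corresponding to problems (1), (3), and (4) in the abstract's list.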
Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key, longstanding challenges for AI. In this paper we consider how these challenges can be addressed within the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action in this framework to include options: closed-loop policies for taking action over a period of time. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Overall, we show that options enable temporally abstract knowledge and action to be included in the reinforcement learning framework in a natural and general way. In particular, we show that options may be used interchangeably with primitive actions in planning methods such as dynamic programming and in learning methods such as Q-learning.
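A minimal way to picture an option is as a bundle of three pieces: where it may start, which primitive action its policy picks in each state, and when it stops. The sketch below uses that structure on a toy corridor environment. The class, the environment, and the deterministic stopping rule are assumptions for illustration; in particular, a stochastic termination condition would be equally consistent with the abstract.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Option:
    """A temporally extended action (illustrative structure)."""
    can_start: Callable[[int], bool]    # states where the option is available
    policy: Callable[[int], int]        # closed-loop choice of primitive action
    should_stop: Callable[[int], bool]  # termination test (deterministic here)


def run_option(option, state, step, max_steps=100):
    """Follow the option's policy until it terminates; return (state, steps)."""
    if not option.can_start(state):
        raise ValueError("option not available in this state")
    steps = 0
    while steps < max_steps:
        state = step(state, option.policy(state))
        steps += 1
        if option.should_stop(state):
            break
    return state, steps


def corridor_step(state, action):
    """Toy environment: positions 0..10 on a line, action is -1 or +1."""
    return max(0, min(10, state + action))


# Hypothetical option: keep moving right until the door at position 10.
walk_to_door = Option(
    can_start=lambda s: s < 10,
    policy=lambda s: 1,
    should_stop=lambda s: s == 10,
)

final_state, steps_taken = run_option(walk_to_door, 3, corridor_step)
```

Starting from position 3, the option takes seven primitive steps and ends at position 10. A planner can then treat the whole run as a single action, which is the interchangeability with primitive actions that the abstract describes.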
Richard S. Wallace. The anatomy of A.L.I.C.E. In Parsing the Turing Test, pages 181-210. Springer, 2009.
Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103-111, Doha, Qatar, October 2014.
Oriol Vinyals and Quoc Le. A neural conversational model. In ICML Deep Learning Workshop, July 2015.
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 101-110. ACM, 2014.
Jianfeng Gao, Michel Galley, and Lihong Li. Neural approaches to conversational AI. arXiv preprint arXiv:1809.08267, 2018.
Wen-Feng Cheng, Chao-Chung Wu, Ruihua Song, Jianlong Fu, Xing Xie, and Jian-Yun Nie. Image inspired poetry generation in XiaoIce. arXiv preprint arXiv:1808.03090, 2018.