Self-improving Chatbots based on Reinforcement Learning
Elena Ricciardelli, Debmalya Biswas
AI Center of Excellence
Philip Morris International
Lausanne, Switzerland
firstname.lastname@pmi.com
Abstract
We present a Reinforcement Learning (RL) model for self-improving chatbots, specifically targeting FAQ-type chatbots.
The model is not aimed at building a dialog system from scratch, but to leverage data from user conversations to improve
chatbot performance. At the core of our approach is a score model, which is trained to score chatbot utterance-response
tuples based on user feedback. The scores predicted by this model are used as rewards for the RL agent. Policy learning
takes place offline, thanks to a user simulator which is fed with utterances from the FAQ database. Policy learning is
implemented using a Deep Q-Network (DQN) agent with epsilon-greedy exploration, which is tailored to effectively
include fallback answers for out-of-scope questions.
The potential of our approach is shown on a small case extracted from an enterprise chatbot, where the performance
increases from an initial 50% success rate to 75% in 20-30 training epochs.
Keywords: Reinforcement Learning, Chatbots, NLP
Figure 1: Architecture of the RL model used in this work. The DQN agent is initially trained offline in a warm-up phase on the NLU.
The score model is also trained offline with the data from real user conversations. In the RL loop, the user state (user utterance) is
provided by the user simulator, the action (chatbot response) is provided by the DQN agent, and the reward is provided by the score
model. Each tuple (s_t, a_t, r_t) feeds the experience replay buffer, which is used to re-train the DQN after n_episodes episodes, where
n_episodes is a tunable parameter.
1 Introduction
The majority of dialog agents in an enterprise setting are domain specific, consisting of a Natural Language Understand-
ing (NLU) unit trained to recognize the user’s goal in a supervised manner. However, collecting a good training set for a
production system is a time-consuming and cumbersome process. Chatbots covering a wide range of intents often face
poor performance due to intent overlap and confusion. Furthermore, it is difficult to autonomously retrain a chatbot
taking into account the user feedback from live usage or a testing phase. Self-improving chatbots are challenging to
achieve, primarily because of the difficulty in choosing and prioritizing metrics for chatbot performance evaluation.
Ideally, one wants a dialog agent that is capable of learning from the user's experience and improving autonomously.
In this work, we present a reinforcement learning approach for self-improving chatbots, specifically targeting FAQ-type
chatbots. The core of such chatbots is an intent recognition NLU, which is trained with hard-coded examples of question
variations. When no intent is matched with a confidence level above 30%, the chatbot returns a fallback answer. For all
other utterances, the NLU engine returns the matched response along with the corresponding confidence level.
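As an illustration, this dispatch logic can be summarized by the following minimal sketch; the function names and the NLU result format are hypothetical, and only the 30% fallback threshold is taken from the system described above:

```python
# Minimal sketch of the FAQ chatbot's dispatch logic (hypothetical names;
# only the 0.30 fallback threshold reflects the system described in the text).
FALLBACK_THRESHOLD = 0.30
FALLBACK_ANSWER = "Sorry, I did not understand your question."

def respond(nlu, utterance):
    intent, confidence = nlu.parse(utterance)    # assumed NLU interface
    if confidence <= FALLBACK_THRESHOLD:
        return FALLBACK_ANSWER, confidence       # out-of-scope question
    return nlu.response_for(intent), confidence  # answer mapped to the intent
```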
Several research papers [2, 3, 7, 8] have shown the effectiveness of an RL approach in developing dialog systems. Critical
to this approach is the choice of a good reward model. A typical reward model is a penalty term applied at each dialog
turn. However, such a reward only applies to task-completion chatbots, where the purpose of the agent is to satisfy the
user's request in the shortest possible time; it is not suitable for FAQ-type chatbots, where the chatbot is expected to
provide a good answer in a single turn. The user's feedback can also be used as a reward in an online reinforcement
learning setting. However, applying RL on live conversations can be challenging, and it may incur a significant cost in
case of RL failure. A better approach for deployed systems is to perform the RL training offline and then update the NLU
policy once satisfactory levels of performance have been reached.
2 Reinforcement Learning Model
The RL model architecture is illustrated in Figure 1. The various components of the model are: the NLU unit, which is
used to initially train the RL agent in a warm-up phase; the user simulator, which randomly extracts user utterances
from the database of user experiences; the score model, trained on the user conversations with feedback; and the RL
agent, based on a Deep Q-Network (DQN).
2.1 Dialog System
We apply the reinforcement learning approach to a FAQ-type chatbot. At the core of the chatbot, there is an intent¹
recognition NLU, which is trained with hard-coded examples of question variations. For this work, we have chosen the
open-source NLU from Rasa², using the TensorFlow pipeline. However, the RL approach is independent of the NLU
chosen, and for systems in production it can easily be extended to NLU engines such as IBM Watson or Amazon Lex.
2.2 Real User Conversations
For this work, we used the user feedback gathered during the development of an actual internal chatbot. The scope of the
chatbot was to answer employee queries related to office building facilities, HR policies and benefits, etc. All 10 users
participating in the test phase were informed that their feedback would be used to improve the chatbot performance. The
testers provided a (binary) feedback after each conversation turn, thus rating the utterance-response tuples. The historical
data thus contains quadruples of the following format: (utterance, response, NLU confidence level, feedback). After
removing invalid conversations (i.e. those with missing or invalid feedback), we end up with 550 user conversations,
triggering about 120 intents. Although we have tested the score model on all the conversations, the RL model has been
applied only to a subsample of 155 conversations, triggering the top 5 intents. On this subset, the user satisfaction is
50%.
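For concreteness, a record in this historical data can be represented as in the small sketch below (the class and field names are illustrative, not taken from the paper):

```python
from dataclasses import dataclass

@dataclass
class ConversationTurn:
    """One labelled turn from the user-testing logs (illustrative field names)."""
    utterance: str         # what the user asked
    response: str          # what the chatbot answered
    nlu_confidence: float  # confidence of the matched intent, in [0, 1]
    feedback: int          # binary user feedback: 1 = helpful, 0 = not helpful
```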
Table 1: Example conversations from the database, with the score predicted by the model and the feedback provided by the user

Utterance | Response | Model score | User feedback
What is Interact? | I can help you mainly with GS service-related questions on topics like transportation (your mobility plan/benefit car), business travel, hospitality and events, gym, mailroom etc. | 0.30 | 0
where can I find my salary payment | I was getting bored, but now you are here. | 0.26 | 0
Have a nice day | Bye bye! Thanks for the chat! | 0.81 | 1
2.3 Reward Function: the Score Model
Evaluating chatbot performance is a long-standing issue in computational linguistics. Automatic metrics borrowed from
machine translation (e.g. [6]) do not perform well on short sentences (e.g. [4]), such as the chatbot utterance-response
tuples. On the other hand, human rating of chatbots is by now the de-facto standard to evaluate the success of a chatbot,
although such ratings are often difficult and expensive to gather.
To evaluate the correctness of chatbot responses, we propose a new approach which makes use of the user conversation
logs, gathered during the development and testing phases of the chatbot. Each user had been asked to provide a binary
feedback (positive/negative) at each chatbot turn. In order to use the user feedback in an offline reinforcement learning
setting, we have developed a score model, capable of modeling the binary feedback for unseen utterance-response tuples. In
a supervised fashion, the score model learns how to project the vector representations of utterance and response into a
linearly transformed space, such that similar vector representations yield a high score. As for the vector representation of
sentences, we compute sentence embeddings through the universal sentence encoder [1], available through TensorFlow
Hub³. To train the model, the optimization is done on a squared-error loss (between model prediction and human feedback)
with L2 regularization. To evaluate the model, the predicted scores are then converted into a binary outcome and
compared with the targets (the user feedback). For those utterances having a recognized intent with both
a positive feedback and an NLU confidence level close to 1, we perform data augmentation, assigning low scores to the
combination of the utterance and the fallback intent.
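As an illustration, a minimal sketch of such a score model is given below, assuming the utterance and response embeddings have already been computed with the universal sentence encoder and that the learned projection is a single bilinear form score = sigmoid(u^T M r); the exact parameterization and optimizer are simplifications, not the paper's implementation:

```python
import numpy as np

# Minimal sketch of the score model (illustrative, not the paper's exact code).
# Assumptions: U and R hold pre-computed sentence embeddings for utterances and
# responses; the model is score = sigmoid(u^T M r), trained with a squared-error
# loss and an L2 penalty on M.

def train_score_model(U, R, y, lr=0.05, l2=1e-3, epochs=200, seed=0):
    """U, R: (n, d) embedding arrays; y: (n,) binary user feedback."""
    U, R, y = np.asarray(U), np.asarray(R), np.asarray(y, dtype=float)
    d = U.shape[1]
    M = np.random.default_rng(seed).normal(scale=0.01, size=(d, d))
    for _ in range(epochs):
        logits = np.einsum("nd,de,ne->n", U, M, R)           # u_i^T M r_i
        scores = 1.0 / (1.0 + np.exp(-logits))               # squash to [0, 1]
        grad_logits = (scores - y) * scores * (1.0 - scores)  # squared-error grad
        grad_M = np.einsum("n,nd,ne->de", grad_logits, U, R) / len(y) + l2 * M
        M -= lr * grad_M                                      # gradient descent
    return M

def predict_score(M, u, r):
    """Predicted probability that response r is a good answer to utterance u."""
    return 1.0 / (1.0 + np.exp(-(u @ M @ r)))
```

At inference time, the predicted score can be thresholded (0.5 is an assumed choice) to recover the binary outcome used in the evaluation described above.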
A similar approach for chatbot evaluation has been suggested by [4]. The authors model the scores by using a labelled
set of conversations, which also includes model- and human-generated responses, collected through crowdsourcing. Our
approach differs in that it only requires a labelled set of utterance-response tuples, which are relatively straightforward
to gather during the chatbot development and user testing phases.
¹An intent is defined as the user's intention, which is formulated through the utterance.
²https://rasa.com/
³https://www.tensorflow.org/hub
Figure 2: Performance of the score model. Left-hand panel: cross-validated test-set accuracy with 95% confidence interval for
sub-samples with different numbers of intents. The horizontal red line indicates the performance for the entire sample.
Right-hand panel: ROC curves for the different sub-samples.
2.4 Policy Learning with DQN
To learn the policy, the RL agent uses a Q-learning algorithm with a DQN architecture [5]. In DQN, a neural network
is trained to approximate the state-action value function Q(s_t, a_t; θ), which represents the quality of an action a_t given
a state s_t, where θ are the trainable parameters. As for the DQN network, we have followed the approach proposed by [3],
using a fully-connected network, fed by an experience replay buffer that contains the one-hot representations of
utterance and response and the corresponding reward. A one-hot representation is possible in this case because there is a
finite set of possible values for utterances (given by the number of real users' questions in the logs) and responses (equal to
the number of intents used in our test case, 5). In a warm-up phase, the DQN is trained on the NLU, using the NLU
confidence level as the reward. The DQN training set is augmented whenever a state-action pair has a confidence above a
threshold, by assigning zero weight to the given state paired with each of the other available actions. Thus, at the start of
the RL training, the agent performs similarly to the NLU unit.
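A minimal sketch of this warm-up augmentation, under assumed interface names and an assumed confidence threshold (the exact value is not given in the text), is shown below:

```python
# Minimal sketch of the warm-up data augmentation (illustrative names).
# A confident NLU match keeps the NLU confidence as its reward, while the same
# state paired with every alternative action is added with zero weight.
CONFIDENCE_THRESHOLD = 0.8  # assumed value, not specified in the text

def warm_up_samples(nlu, utterances, actions):
    samples = []  # list of (state, action, reward) tuples for the DQN
    for utterance in utterances:
        intent, confidence = nlu.parse(utterance)  # assumed NLU interface
        samples.append((utterance, intent, confidence))
        if confidence > CONFIDENCE_THRESHOLD:
            for other in actions:
                if other != intent:
                    samples.append((utterance, other, 0.0))
    return samples
```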
During RL training, we use ε-greedy exploration, where random actions are explored with probability ε. We use a
time-varying ε, which facilitates exploration at the beginning of the training, with ε_t0 = 0.2 and ε_t = 0.05 during
the last epoch. To speed up learning when picking random actions, we also force a higher probability of selecting "No
intent detected", as several questions are actually out of the chatbot's scope but are erroneously matched to a wrong
intent by the NLU. During an epoch we simulate a batch of conversations of size n_episodes (ranging from 10 to 30 in our
experiments) and fill an experience replay buffer with the tuples (s_t, a_t, r_t). The buffer has a fixed size, and it is flushed the
first time the agent's performance rises above a specified threshold. In those episodes where the state-action
tuple gets a reward greater than 50%, we perform data augmentation by assigning zero reward to the assignment of any
other action to the current state.
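The overall training loop can be sketched as follows. This is a simplified illustration under assumed interfaces for the simulator, the score model, and the DQN agent; the linear ε decay, the fallback bias, and the buffer handling are assumptions rather than the exact implementation:

```python
import random

# Simplified sketch of one RL training run (assumed interfaces; the linear
# epsilon decay, the 50% fallback bias, and the buffer handling are
# illustrative, not the exact code used in the paper).
def train(agent, simulator, score_model, actions, n_epochs=30, n_episodes=10,
          eps_start=0.2, eps_end=0.05, no_intent="No intent detected"):
    replay_buffer = []
    for epoch in range(n_epochs):
        eps = eps_start + (eps_end - eps_start) * epoch / max(n_epochs - 1, 1)
        for _ in range(n_episodes):
            state = simulator.sample_utterance()         # user utterance
            if random.random() < eps:
                # Exploration: bias random picks towards the fallback action,
                # since many questions are actually out of the chatbot's scope.
                if random.random() < 0.5:
                    action = no_intent
                else:
                    action = random.choice(actions)
            else:
                action = agent.best_action(state)        # greedy w.r.t. Q
            reward = score_model.score(state, action)    # predicted feedback
            replay_buffer.append((state, action, reward))
        agent.retrain(replay_buffer)                     # re-fit the DQN
    return agent
```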
3 Model Evaluation
3.1 Score Model Evaluation
To evaluate the model, we select subsets of conversations triggering the top N intents, with N between 5 and 50. The
results of the score model are summarized in Figure 2, showing the cross-validated (5-fold CV) accuracy on the test set
and the ROC curve as a function of the number of intents. For the whole sample of conversations, we obtain a cross-
validated accuracy of 75% and an AUC of 0.84. However, by selecting only those conversations triggering the top 5
intents, thus including more examples per intent, we obtain an accuracy of 86% and an AUC of 0.94. For the RL model
evaluation, we have focused on the 5-intent subset, which ensures that we have the most reliable rewards.
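For reference, converting the predicted scores to a binary outcome and computing accuracy and AUC can be done as in the short sketch below, using scikit-learn metrics; the 0.5 threshold is an assumption:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# y_true: binary user feedback; y_score: score-model predictions in [0, 1].
def evaluate_scores(y_true, y_score, threshold=0.5):  # threshold is assumed
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return accuracy_score(y_true, y_pred), roc_auc_score(y_true, y_score)
```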
3.2 Reinforcement Learning Model Evaluation
The learning curve for the RL training is shown in Figure 3. In the left-hand panel, we compare the RL training using the
reward model with a test done with a direct reward (provided interactively), showing that the score model gives
performance similar to the reference case, where the reward is known. Large fluctuations in the average score are due to the
limited batch size (n_episodes = 10) and a relatively large ε. We also show the success rate on a test set of 20 conversations,
extracted from the full sample, for which a "golden response" has been manually provided for every utterance. The agent's
success rate increases from an initial 50% to 75% in only 30 epochs, showing the potential of this approach. In the right-hand
panel, we show the results using n_episodes = 30, which gives similar performance but a smoother learning curve.
Figure 3: Learning curves showing the DQN agent’s average score (continuous black line) per training epoch and success rate (purple
shaded area) based on a labelled test set of 20 conversations. Left-hand panel: learning curves for direct RL with interactive reward
(black line) and the reward model (blue dotted line), using 10 episodes per epoch. Right-hand panel: learning curves for the model
reward, using 30 episodes per epoch.
4 Conclusions
In this work, we have shown the potential of a reinforcement learning approach in improving the performance of FAQ-
type chatbots, based on the feedback from a user testing phase. To achieve this, we have developed a score model, which
is able to predict the user's satisfaction on utterance-response tuples, and implemented a DQN reinforcement learning model,
using the score model predictions as rewards. We have evaluated the model on a small, but real, test case, demonstrating
promising results. Further training over more epochs and with more data, as well as extensive tests of the model
hyper-parameters, are in progress. The value of our approach lies in providing a practical tool to improve large-scale
chatbots (with a large set of diverse intents) in an automated fashion, based on user feedback.
Finally, we note that although the reinforcement learning model presented in this work is suitable for FAQ-type chatbots,
it can be generalised to include the sequential nature of conversations by incorporating a more complex score model.
References
[1] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes,
Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. Universal sentence encoder. CoRR, abs/1803.11175,
2018.
[2] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue
generation. arXiv preprint arXiv:1606.01541, 2016.
[3] Xiujun Li, Yun-Nung Chen, Jianfeng Gao, and Asli Celikyilmaz. End-to-end task-completion neural dialogue systems. In 8th
International Joint Conference on Natural Language Processing, 2017.
[4] Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. Towards
an automatic turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 1116–1126. Association for Computational Linguistics, 2017.
[5] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Ried-
miller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dhar-
shan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning.
Nature, 518:529–533, 2015.
[6] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: A method for automatic evaluation of machine translation.
In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, 2002.
[7] Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, and Kam-Fai Wong. Deep dyna-q: Integrating planning for task-completion
dialogue policy learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), pages 2182–2192. Association for Computational Linguistics, 2018.
[8] Iulian Vlad Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim,
Michael Pieper, Sarath Chandar, Nan Rosemary Ke, Sai Mudumba, Alexandre de Brébisson, Jose Sotelo, Dendi Suhubdy, Vincent
Michalski, Alexandre Nguyen, Joelle Pineau, and Yoshua Bengio. A deep reinforcement learning chatbot. CoRR, abs/1709.02349,
2017.