Article

Data-driven user simulation for automated evaluation of spoken dialog systems


Abstract

This paper proposes a novel integrated dialog simulation technique for evaluating spoken dialog systems. A data-driven user simulation technique for simulating user intentions and utterances is introduced. A novel method for modeling and generating user intentions using a linear-chain conditional random field is proposed, together with a two-phase, data-driven, domain-specific user utterance simulation method and a linguistic-knowledge-based ASR channel simulation method. Evaluation metrics are introduced to measure the quality of user simulation at both the intention and utterance levels. Experiments using these techniques were carried out to evaluate the performance and behavior of dialog systems designed for car navigation dialogs and a building guide robot; the approach proved easy to set up and showed tendencies similar to those of real human users.
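The abstract above mentions a linear-chain model of user intentions. As a rough illustration of how decoding over such a chain works, the sketch below runs Viterbi over hand-set log-scores. The scores, labels, and function name are hypothetical; the paper's CRF learns its scores from annotated dialog data rather than taking them as given.

```python
import math

def viterbi_intentions(emit, states, trans, start):
    """Illustrative Viterbi decoding over a linear chain of user
    intentions. `emit[t][s]` is the (log-space) score of intention s at
    turn t, `trans[(q, s)]` the transition score, `start[s]` the initial
    score. All scores here are hypothetical stand-ins for what a
    linear-chain CRF would learn from data."""
    best = {s: start.get(s, -math.inf) + emit[0].get(s, -math.inf) for s in states}
    backptrs = []
    for t in range(1, len(emit)):
        prev, best, ptr = best, {}, {}
        for s in states:
            score, arg = max(
                (prev[q] + trans.get((q, s), -math.inf), q) for q in states
            )
            best[s] = score + emit[t].get(s, -math.inf)
            ptr[s] = arg
        backptrs.append(ptr)
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With per-turn scores that strongly favor "request" then "inform", the decoder recovers that sequence.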


... Because the vocabulary size of the RoboCup dataset is small, the training data of three games appear to be sufficient, as indicated by the reasonable recognition results. Starting from the written, normalized data, ASR errors were simulated roughly following Jung et al. [50], who presented a user simulation system that can be used to evaluate spoken dialogue systems and that also includes ASR channel simulation. ...
... In order to simulate an erroneous utterance, a correct input sequence is transformed by applying the following four steps. ...
... 4) Steps 1)-3) are repeated several times, and all simulated utterances are ranked using a language model. Finally, one of the top-ranked utterances is chosen randomly as the resulting erroneous utterance [50]. We used the recognition results from the previous experiment for simulation. ...
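The four-step procedure summarized in these excerpts (corrupt a correct word sequence several times, rank the candidates with a language model, pick one of the top-ranked at random) can be sketched as below. This is a rough illustration, not the authors' implementation: the corruption operations, error rates, and the stand-in language model are all hypothetical simplifications.

```python
import random

def simulate_asr_channel(words, vocab, lm_score, wer=0.2,
                         n_candidates=10, top_k=3, seed=0):
    """Rough sketch of an ASR channel simulation: (1-3) corrupt the
    correct word sequence with substitutions, deletions, and insertions,
    repeated several times; (4) rank the candidates with a language
    model and return one of the top-ranked at random. `lm_score` maps a
    word list to a higher-is-better score."""
    rng = random.Random(seed)

    def corrupt(ws):
        out = []
        for w in ws:
            r = rng.random()
            if r < wer / 3:                 # substitution
                out.append(rng.choice(vocab))
            elif r < 2 * wer / 3:           # deletion
                continue
            else:
                out.append(w)
                if r < wer:                 # insertion after the word
                    out.append(rng.choice(vocab))
        return out or list(ws)              # never return an empty utterance

    candidates = [corrupt(words) for _ in range(n_candidates)]
    candidates.sort(key=lm_score, reverse=True)
    return rng.choice(candidates[:top_k])
```

Any scoring function can play the language-model role here; in practice an n-gram model trained on in-domain transcripts would be used.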
Article
We present a computational model that allows us to study the interplay of different processes involved in first language acquisition. We build on the assumption that language acquisition is usage-driven and assume that there are different processes in language acquisition operating at different levels. Bottom-up processing allows a learner to identify regularities in the linguistic input received, while top-down processing exploits prior experience and previous knowledge to guide choices made during bottom-up processing. To shed light on the interplay between top-down and bottom-up processing in language acquisition, we present a computational model of language acquisition that is based on bootstrapping mechanisms and is usage-based in that it relies on discovered regularities to segment speech into word-like units. Based on this initial segmentation, our model induces a construction grammar that in turn acts as a top-down prior that guides the segmentation of new sentences into words. We spell out in detail these processes and their interplay, showing that top-down processing increases both understanding performance and segmentation accuracy. Our model thus contributes to a better understanding of the interplay between bottom-up and top-down processes in first language acquisition and thus to a better understanding of the mechanisms and architecture involved in language acquisition.
... Corpus-based learning is not the only approach to training dialogue systems. Researchers have also proposed training dialogue systems online through live interaction with humans, and offline using user simulator models and reinforcement learning methods (Mohan and Laird, 2014; Pietquin and Hastie, 2013; Jung et al., 2009; Georgila et al., 2006). However, these approaches are beyond the scope of this survey. ...
... The challenge with evaluating goal-driven dialogue systems without human intervention is that the process necessarily requires multiple steps: it is difficult to determine whether a task has been solved from a single utterance-response pair from a conversation. Thus, simulated data is often generated by a user simulator (Eckert et al., 1997; Schatzmann et al., 2007; Jung et al., 2009; Georgila et al., 2006; Pietquin and Hastie, 2013). Given a sufficiently accurate user simulation model, an interaction between the dialogue system and the user can be simulated, from which it is possible to deduce the desired metrics, such as the goal completion rate. ...
... More sophisticated probabilistic models have been proposed based on directed graphical models, such as hidden Markov models and input-output hidden Markov models (Cuayáhuitl et al., 2005), and undirected graphical models, such as conditional random fields based on linear chains (Jung et al., 2009). Inspired by Pietquin (2005), Pietquin (2007) and Rossignol et al. (2011) propose the following directed graphical model: ...
Article
Full-text available
During the past decade, several areas of speech and language understanding have witnessed substantial breakthroughs from the use of data-driven models. In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge. Nevertheless, several recent results suggest that data-driven approaches are feasible and quite promising. To facilitate research in this area, we have carried out a wide survey of publicly available datasets suitable for data-driven learning of dialogue systems. We discuss important characteristics of these datasets and how they can be used to learn diverse dialogue strategies. We also describe other potential uses of these datasets, such as methods for transfer learning between datasets and the use of external knowledge, and discuss appropriate choice of evaluation metrics for the learning objective.
... Previous user simulation studies can be roughly categorized into rule-based methods (Chung, 2005; Lopez-Cozar et al., 2006; Schatzmann et al., 2007a) and data-driven methods (Cuayahuitl et al., 2005; Eckert et al., 1997; Jung et al., 2009; Levin et al., 2000; Georgila et al., 2006; Pietquin, 2004). Rule-based methods generally allow for more control over their designs for the target domain, while data-driven methods afford more portability from one domain to another and are attractive for modeling user behavior based on real data. ...
... Although development costs for data-driven methods are typically lower than those of rule-based methods, previous data-driven approaches have still required a certain amount of human effort. Most intention-level models take a semantically annotated corpus to produce user intention without introducing errors (Cuayahuitl et al., 2005;Jung et al., 2009). Surface-level approaches need transcribed data to train their surface form and error generating models (Jung et al., 2009;Schatzmann et al., 2007b). ...
... A few studies have attempted to directly simulate the intention, surface, and error by applying their statistical methods on the recognized data rather than on the transcribed data (Georgila et al., 2006; Schatzmann et al., 2005). ...
Conference Paper
This paper proposes an unsupervised approach to user simulation in order to automatically furnish updates and assessments of a deployed spoken dialog system. The proposed method adopts a dynamic Bayesian network to infer the unobservable true user action from which the parameters of other components are naturally derived. To verify the quality of the simulation, the proposed method was applied to the Let's Go domain (Raux et al., 2005) and a set of measures was used to analyze the simulated data at several levels. The results showed a very close correspondence between the real and simulated data, implying that it is possible to create a realistic user simulator that does not necessitate human intervention.
... A user simulator is expected to have the following properties: to be statistically consistent with real users, to generate coherent sequences of actions, and to generalize to new contexts [10]. User simulation can be either at the intention level, i.e., generating dialogue acts [11,9], or at the utterance level [12]. In this work, we focus on the intention level. ...
... Many models have been designed in order to meet the requirements cited above [6,13,14,15,11,9]. These models typically suffer from important drawbacks, which include the inability to take dialogue history into account [6], the need of rigid structure to ensure coherent user behaviour [16], heavy dependence on a specific domain [13], the inability to output several user intentions during one dialogue turn [12], or the requirement of a summarized action space for tractability [11]. ...
Article
Full-text available
User simulation is essential for generating enough data to train a statistical spoken dialogue system. Previous models for user simulation suffer from several drawbacks, such as the inability to take dialogue history into account, the need of rigid structure to ensure coherent user behaviour, heavy dependence on a specific domain, the inability to output several user intentions during one dialogue turn, or the requirement of a summarized action space for tractability. This paper introduces a data-driven user simulator based on an encoder-decoder recurrent neural network. The model takes as input a sequence of dialogue contexts and outputs a sequence of dialogue acts corresponding to user intentions. The dialogue contexts include information about the machine acts and the status of the user goal. We show on the Dialogue State Tracking Challenge 2 (DSTC2) dataset that the sequence-to-sequence model outperforms an agenda-based simulator and an n-gram simulator, according to F-score. Furthermore, we show how this model can be used on the original action space and thereby models user behaviour with finer granularity.
... Signal-level simulation discussed in [Götze et al., 2010] focuses on generating user utterances in the form of speech signals. Word-level simulation as discussed in [Jung et al., 2009] aims at generating text-based user utterances. However, in order to perform dialogue optimization, simulation of user behavior is most often performed at the intention level [Eckert et al., 1997a]. ...
... In the case of user simulation evaluation, the n-grams considered are not sequences of words but sequences of intentions. The latter is termed Discourse-BLEU (d-BLEU) [Jung et al., 2009]. ...
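The Discourse-BLEU idea described above, treating each "gram" as a dialogue act rather than a word, can be sketched as a small scoring function. This is an illustrative simplification with a single reference and uniform n-gram weights, not the exact metric from the paper:

```python
import math
from collections import Counter

def discourse_bleu(hyp_acts, ref_acts, max_n=3):
    """BLEU-style score over dialogue-act sequences, in the spirit of
    Discourse-BLEU: modified n-gram precisions over acts, geometric
    mean, and a standard brevity penalty."""
    def ngrams(seq, n):
        return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

    log_p = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp_acts, n), ngrams(ref_acts, n)
        overlap = sum(min(c, r[g]) for g, c in h.items())  # clipped counts
        total = max(sum(h.values()), 1)
        log_p.append(math.log(max(overlap, 1e-9) / total))
    bp = 1.0 if len(hyp_acts) >= len(ref_acts) else math.exp(1 - len(ref_acts) / max(len(hyp_acts), 1))
    return bp * math.exp(sum(log_p) / max_n)
```

A simulated act sequence identical to the reference scores 1.0, while an incoherent one scores near zero.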
Article
Recent advancements in the area of spoken language processing and the wide acceptance of portable devices have attracted significant interest in spoken dialogue systems. These conversational systems are man-machine interfaces that use natural language (speech) as the medium of interaction. In order to conduct dialogues, computers must have the ability to decide when and what information has to be exchanged with the users. The dialogue management module is responsible for making these decisions so that the intended task (such as ticket booking or appointment scheduling) can be achieved. Learning a good strategy for dialogue management is therefore a critical task. In recent years, reinforcement learning-based dialogue management optimization has evolved to be the state of the art. A majority of the algorithms used for this purpose need vast amounts of training data; however, data generation in the dialogue domain is an expensive and time-consuming process. In order to cope with this, and also to evaluate the learnt dialogue strategies, user modelling in dialogue systems was introduced. These models simulate real users in order to generate synthetic data; being computational models, they introduce some degree of modelling error. In spite of this, system designers have been forced to employ user models because of the data requirements of conventional reinforcement learning algorithms. Algorithms that can learn optimal dialogue strategies from limited amounts of training data remove this requirement: user models are no longer needed for optimization, yet they continue to provide a fast and easy means of quantifying the quality of dialogue strategies. Since existing methods for user modelling are less realistic than real user behaviour, the focus is shifted towards user modelling by means of inverse reinforcement learning.
Using experimental results, the proposed method's ability to learn computational models with real-user-like qualities is showcased as part of this work.
... It is important to also mention the recent work introduced by Jung et al. (2009) where the authors demonstrated a data-driven case-based reasoning methodology based on an adaptation of conditional random fields (CRF) to the SDS context. Although the results looked very promising, the implemented simulation worked only on the intentions level, making use of simulated automatic speech recognition (ASR). ...
... Bergmann et al. (2003) provide a good overview of the case-based reasoning methodology in industrial application contexts. Moreover, Jung et al. (2009) have demonstrated the successful application of a data-driven case-based reasoning methodology in the SDS domain. Since this approach does not always lead to some particular decision, we have to use a back-off mechanism to support its application in real-time dialogs. ...
Article
This paper introduces a new data-driven approach for a realistic user simulation for spoken dialog system evaluation. It describes a simulator that calls commercial German spoken dialog systems via the phone and performs spoken dialogs in a realistic manner by submitting synthesized user utterances to this "black-box" dialog system. It also demonstrates the usefulness of a case-based reasoning approach to interaction management. Besides that, it explores a promising architecture implementation that allows usability engineers to run a user simulation at early stages of spoken dialog system development. This implementation is also extensible towards multimodal dialog evaluation purposes.
... While they are insightful, they do not consider the quality of the dialogue in terms of structure, naturalness, and coherence. Out of these aspects, naturalness has attracted the most attention and metrics such as D-BLEU [14] and SUPER [27] were proposed to assess it. ...
Preprint
User simulation is a promising approach for automatically training and evaluating conversational information access agents, enabling the generation of synthetic dialogues and facilitating reproducible experiments at scale. However, the objectives of user simulation for the different uses remain loosely defined, hindering the development of effective simulators. In this work, we formally characterize the distinct objectives for user simulators: training aims to maximize behavioral similarity to real users, while evaluation focuses on the accurate prediction of real-world conversational agent performance. Through an empirical study, we demonstrate that optimizing for one objective does not necessarily lead to improved performance on the other. This finding underscores the need for tailored design considerations depending on the intended use of the simulator. By establishing clear objectives and proposing concrete measures to evaluate user simulators against those objectives, we pave the way for the development of simulators that are specifically tailored to their intended use, ultimately leading to more effective conversational agents.
... Pseudo Conversations Simulation. Following previous simulation methods [8,19,50], we extract the high-frequency conversation patterns of seeking suggestions and giving recommendations to create response templates for the user and recommender, respectively. A sample example can be given as follows. ...
Preprint
Conversational recommender systems (CRS) aim to provide recommendation services via natural language conversations. To develop an effective CRS, high-quality CRS datasets are crucial. However, existing CRS datasets suffer from the long-tail issue, i.e., a large proportion of items are rarely (or even never) mentioned in the conversations; these are called long-tail items. As a result, the CRSs trained on these datasets tend to recommend frequent items, and the diversity of the recommended items is largely reduced, making users more easily bored. To address this issue, this paper presents LOT-CRS, a novel framework that focuses on simulating and utilizing a balanced CRS dataset (i.e., covering all the items evenly) for improving LOng-Tail recommendation performance of CRSs. In our approach, we design two pre-training tasks to enhance the understanding of simulated conversations for long-tail items, and adopt retrieval-augmented fine-tuning with a label-smoothness strategy to further improve the recommendation of long-tail items. Extensive experiments on two public CRS datasets have demonstrated the effectiveness and extensibility of our approach, especially for long-tail recommendation.
... The gold standard for evaluation of dialogues is human judgement (Meena et al., 2014; Ultes et al., 2013; Jang et al., 2020; Shim et al., 2021; Khalid et al., 2020b; Panfili et al., 2021), but human judgements are hard to obtain. Other than human judgements, dialogue simulations are used to judge different aspects of a model's behavior (Jung et al., 2009; Eckert et al., 1997; Cuayáhuitl et al., 2010; Khalid et al., 2020a; Kreyssig et al., 2018; Sun et al., 2021a). Neural models of dialogue rely on text similarity metrics like BLEU, ROUGE, or METEOR (Papineni et al., 2002; Lin, 2004; Banerjee and Lavie, 2005); however, these do not correlate well with human judgements (the cited work's Table 1 presents a problematic judgement by the DialogRPT metric). ...
... User simulation. User simulations have been extensively studied in the context of dialogue systems, where they were used for training (Schatzmann et al. 2007;Kreyssig et al. 2018) or evaluation (Scheffler and Young 2001;Jung et al. 2009; Crook and Marin 2017; Zhang and Balog 2020) purposes. In both cases, the focus has been on generating simulators for the specific task that closely resemble real users. ...
Article
Interactive AI (IAI) systems are increasingly popular as the human-centered AI design paradigm gains strong traction. However, evaluating IAI systems, a key step in building such systems, is particularly challenging, as their output highly depends on the performed user actions. Developers often have to rely on limited and mostly qualitative data from ad-hoc user testing to assess and improve their systems. In this paper, we present InteractEva, a systematic evaluation framework for IAI systems. We also describe how we have applied InteractEva to evaluate a commercial IAI system, leading to both quality improvements and better data-driven design decisions.
... Callejas et al. (2012) suggest overcoming the lack of interpretability of these approaches by using multidimensional subspace clustering to graphically show the similarity between generated and real data, but their metric is susceptible to the choice of features and clustering algorithm. Jung et al. (2009) propose to adapt the BLEU score (Papineni et al., 2002) to capture "dialogue-level naturalness" by considering a "gram" to be a user or system action, and show that this metric correlates well with human judgement. One drawback of applying this metric to compare the sequences generated by user models with references is the arbitrary ordering of action sequences. ...
... The user simulator can adopt either agenda-based approach or model-based approach for the dialog manager. While for NLG, the user simulator can use the dialog act to select pre-defined templates, retrieve user utterances from previously collected dialogs, or generate the surface form utterance directly with pre-trained language model (Jung et al., 2009). ...
... [21,3,1,6]. User simulators at the utterance level communicate with the agent directly through utterances [10,11,15]. Our user simulator can work at both levels. ...
Preprint
Recent reinforcement learning algorithms for task-oriented dialogue systems have attracted a lot of interest. However, an unavoidable obstacle to training such algorithms is that annotated dialogue corpora are often unavailable. One popular approach to addressing this is to train a dialogue agent with a user simulator. Traditional user simulators are built upon a set of dialogue rules and therefore lack response diversity, which severely limits the simulated cases for agent training. Later data-driven user models do better on diversity but suffer from the data scarcity problem. To remedy this, we design a new corpus-free framework that takes advantage of the benefits of both. The framework builds a user simulator by first generating diverse dialogue data from templates and then building a new State2Seq user simulator on that data. To enhance performance, we propose the State2Seq user simulator model to efficiently leverage dialogue state and history. Experimental results on an open dataset show that our user simulator helps agents achieve an improvement of 6.36% in success rate. The State2Seq model outperforms the seq2seq baseline by 1.9 F-score.
... The user simulator can adopt either agenda-based approach or model-based approach for the dialog manager. While for NLG, the user simulator can use the dialog act to select pre-defined templates, retrieve user utterances from previously collected dialogs, or generate the surface form utterance directly with pre-trained language model (Jung et al., 2009). ...
Preprint
User simulators are essential for training reinforcement learning (RL) based dialog models. The performance of the simulator directly impacts the RL policy. However, building a good user simulator that models real user behaviors is challenging. We propose a method of standardizing user simulator building that can be used by the community to compare dialog system quality using the same set of user simulators fairly. We present implementations of six user simulators trained with different dialog planning and generation methods. We then calculate a set of automatic metrics to evaluate the quality of these simulators both directly and indirectly. We also ask human users to assess the simulators directly and indirectly by rating the simulated dialogs and interacting with the trained systems. This paper presents a comprehensive evaluation framework for user simulator study and provides a better understanding of the pros and cons of different user simulators, as well as their impacts on the trained systems.
... As our systems have become more complicated and the statistical methods we use demand more and more data, proper system assessment becomes an increasingly difficult challenge. One of the easier approaches to goal-oriented system assessment is to employ user simulation (Jung et al., 2009;Pietquin and Hastie, 2013;Schatzmann et al., 2005). It aims at the overall assessment of the system by measuring goal completion. ...
... Just like hardware-in-the-loop simulations, where the hardware, or part thereof, is physical rather than virtual, these have the drawback that they must run in real time and cannot be accelerated to investigate usage over longer time intervals [18]. Moreover, user testing is known to be expensive [19]. Figure 1 illustrates how TPUI from real-life usage of fielded products can fill the gap by providing realistic human inputs and thus contribute to more realistic results [20], without the need to recruit human subjects and slow down to real-time execution. ...
Article
Full-text available
The real-life use of a product is often hard to foresee during its development. Fortunately, today’s connective products offer the opportunity to collect information about user actions, which enables companies to investigate the actual use for the benefit of next-generation products. A promising application opportunity is to input the information to engineering simulations and increase their realism to (i) reveal how use-related phenomena influence product performance and (ii) to evaluate design variations on how they succeed in coping with real users and their behaviors. In this article we explore time-stamped usage data from connected fridge-freezers by investigating energy losses caused by door openings and by evaluating control-related design variations aimed at mitigating these effects. By using a fast-executing simulation setup we could simulate much faster than real time and investigate usage over a longer time. We showed that a simple, single-cycle load pattern based on aggregated input data can be simulated even faster but only produce rough estimates of the outcomes. Our model was devised to explore application potential rather than producing the most accurate predictions. Subject to this reservation, our outcomes indicate that door openings do not affect energy consumption as much as some literature suggests. Through what-if studies we could evaluate three design variations and nevertheless point out that particular solution elements resulted in more energy-efficient ways of dealing with door openings. Based on our findings, we discuss possible impacts on product design practice for companies seeking to collect and exploit usage data from connected products in combination with simulations. http://asmedigitalcollection.asme.org/article.aspx?articleid=2722696&resultClick=1
... • Along the granularity dimension, the user simulator can operate either at the dialogue-act level (also known as the intention level) or at the utterance level (Jung et al., 2009). • Along the methodology dimension, the user simulator can be implemented using a rule-based approach, or a model-based approach with the model learned from a real conversational corpus. ...
Preprint
The present paper surveys neural approaches to conversational AI that have been developed in the last few years. We group conversational systems into three categories: (1) question answering agents, (2) task-oriented dialogue agents, and (3) chatbots. For each category, we present a review of state-of-the-art neural approaches, draw the connection between them and traditional approaches, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies.
... These have the drawback that they must run in real time and cannot be accelerated to investigate usage over a longer time interval [12]. Moreover, deploying real users in testing is known to be expensive [13]. Figure 1 illustrates how TPUI from real-life usage of fielded products can fill the gap by providing realistic human inputs and thus contribute to more realistic results [14], without the need to slow down to real-time execution or recruit human subjects. ...
Conference Paper
Full-text available
Today's connected products increasingly allow us to collect and analyze information on how they are actually used. An engineering activity where usage data can prove particularly useful, and be converted to actionable engineering knowledge, is simulation: user behavior is often hard to model, and collected data representing real user interactions as simulation input can increase realism of simulations. This is especially useful for (i) investigating use-related phenomena that influence the product's performance and (ii) evaluating design variations on how they succeed in coping with real users and their behaviors. In this paper we explored time-stamped usage data from connected refrigerators, investigating the influence of door openings on energy consumption and evaluating control-related design variations envisaged to mitigate negative effects of door openings. We used a fast-executing simulation setup that allowed us to simulate much faster than real time and investigate usage over a longer time. According to our first outcomes, door openings do not affect energy consumption as much as some literature suggests. Through what-if studies we could evaluate three design variations and nevertheless point out that particular solution elements resulted in better ways of dealing with door openings in terms of energy consumption.
... For part-of-speech tagging and emotion recognition, the most important component is the Chinese word segmentation scheme. At present, two kinds of methods are commonly used: one is string-matching technology; the other is based on statistical and machine-learning word segmentation, for which there are three common sequence tagging models: CRF, HMM, and MEHMM [18]. The statistical learning models for Chinese word segmentation can be roughly divided into two categories: word-based generative models and character-based discriminative models. ...
Article
Full-text available
As promotion competition among comprehensive e-commerce platforms becomes increasingly keen, this paper aims to find the differences and key elements of consumer perceptions of different products from the perspective of category subdivision, and to explore the transformation mechanism between text comments and graded comments in order to recommend products accurately. First, online commodities were divided into six broad categories, and a dictionary-based method was employed to calculate the emotional distribution of each category; then the key factors affecting user experience were identified through word-frequency analysis; next, by adjusting emotion intensity and emotion weights, the correlation between text comments and graded comments was studied; finally, a prediction model was built for grade correction. Significant differences exist in the emotional perceptions of consumers whose concerns have similar dimensions but different degrees. Text comments are correlated with graded comments, but deviation between the two occurs with external interference. Adjustment of emotion intensity and emotion weight has an impact on the comprehensive emotion value of products, based on which the recommendation sequencing can be optimized.
... There are also some user simulations built on multiple levels. For instance, Jung et al. (2009) integrated different data-driven approaches on the intention and word levels to build a novel user simulation. The user intention simulation generates user intention patterns; a two-phase data-driven domain-specific user utterance simulation then produces a set of structured utterances with sequences of words given a user intention and selects the best one using the BLEU score. ...
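The utterance-selection step described above can be sketched as follows, assuming a simplified smoothed sentence-level BLEU (unigram and bigram precision with a brevity penalty) rather than the exact variant used by Jung et al.; the reference and candidate utterances are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU with add-one smoothing and brevity penalty."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(log_prec)

def best_utterance(candidates, reference):
    """Pick the generated utterance closest to a reference realization."""
    return max(candidates, key=lambda c: bleu(c.split(), reference.split()))

ref = "set destination to city hall"
cands = ["destination city hall", "set destination to city hall please", "play music"]
print(best_utterance(cands, ref))
```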
Conference Paper
Full-text available
We motivate and describe a new freely available human-human dialogue data set for interactive learning of visually grounded word meanings through ostensive definition by a tutor to a learner. The data has been collected using a novel, character-by-character variant of the DiET chat tool (Healey et al., 2003; Mills and Healey, submitted) with a novel task, where a Learner needs to learn invented visual attribute words (such as "burchak" for square) from a tutor. As such, the text-based interactions closely resemble face-to-face conversation and thus contain many of the linguistic phenomena encountered in natural, spontaneous dialogue. These include self- and other-correction, mid-sentence continuations, interruptions, overlaps, fillers, and hedges. We also present a generic n-gram framework for building user (i.e. tutor) simulations from this type of incremental data, which is freely available to researchers. We show that the simulations produce outputs that are similar to the original data (e.g. 78% turn match similarity). Finally, we train and evaluate a Reinforcement Learning dialogue control agent for learning visually grounded word meanings, trained from the BURCHAK corpus. The learned policy shows comparable performance to a rule-based system built previously.
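A generic n-gram tutor simulation of the kind the abstract mentions could be sketched along the following lines; the toy corpus, bigram order, and boundary markers are illustrative choices, not the released framework.

```python
import random
from collections import defaultdict

def train_bigrams(turns):
    """Bigram successor lists over tutor turns, with <s>/</s> boundary markers."""
    model = defaultdict(list)
    for turn in turns:
        toks = ["<s>"] + turn.split() + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            model[a].append(b)
    return model

def simulate_turn(model, rng, max_len=12):
    """Sample one simulated tutor turn from the bigram model."""
    out, cur = [], "<s>"
    while len(out) < max_len:
        cur = rng.choice(model[cur])  # duplicates in the list encode frequency
        if cur == "</s>":
            break
        out.append(cur)
    return " ".join(out)

corpus = ["no that is a burchak", "yes it is a burchak", "no it is red"]
model = train_bigrams(corpus)
print(simulate_turn(model, random.Random(0)))
```

Sampling from successor lists (with repeats) reproduces the corpus bigram frequencies without explicit probability tables.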
... Therefore, to the best of our knowledge, there is no standard way to build a user simulator. Here, we summarize the literature of user simulation in different aspects: at the granularity level, the user simulator can operate either at the dialog-act level or at the utterance level [8]; at the methodology level, the user simulator could use a rule-based approach, or a model-based approach where the model is learned from training data. ...
Article
Full-text available
Despite widespread interest in reinforcement learning for task-oriented dialogue systems, several obstacles can frustrate research and development progress. First, reinforcement learners typically require interaction with the environment, so conventional dialogue corpora cannot be used directly. Second, each task presents specific challenges, requiring a separate corpus of task-specific annotated data. Third, collecting and annotating human-machine or human-human conversations for task-oriented dialogues requires extensive domain knowledge. Because building an appropriate dataset can be both financially costly and time-consuming, one popular approach is to build a user simulator based upon a corpus of example dialogues. Then, one can train reinforcement learning agents in an online fashion as they interact with the simulator. Dialogue agents trained on these simulators can serve as an effective starting point. Once agents master the simulator, they may be deployed in a real environment to interact with humans, and continue to be trained online. To ease empirical algorithmic comparisons in dialogues, this paper introduces a new, publicly available simulation framework, where our simulator, designed for the movie-booking domain, leverages both rules and collected data. The simulator supports two tasks: movie ticket booking and movie seeking. Finally, we demonstrate several agents and detail the procedure to add and test your own agent in the proposed framework.
... They simulate a binary value of users' frustration level according to simple hand-coded rules in addition to user actions. Jung et al. (2009) integrate different data-driven modeling techniques on the intention, lexical, and prosodic levels to build a user simulation that mimics user behaviors at all levels. ...
Article
A user simulation is a computer program which simulates human user behaviors. Recently, user simulations have been widely used in two spoken dialog system development tasks. One is to generate large simulated corpora for applying machine learning to learn new dialog strategies, and the other is to replace human users to test dialog system performance. Although previous studies have shown successful examples of applying user simulations in both tasks, it is not clear what type of user simulation is most appropriate for a specific task because few studies compare different user simulations in the same experimental setting. In this research, we investigate how to construct user simulations in a specific task for spoken dialog system development. Since most current user simulations generate user actions based on probabilistic models, we identify two main factors in constructing such user simulations: the choice of user simulation model and the approach to set up user action probabilities. We build different user simulation models which differ in their efforts in simulating realistic user behaviors and exploring more user actions. We also investigate different manual and trained approaches to set up user action probabilities. We introduce both task-dependent and task-independent measures to compare these simulations. We show that a simulated user which mimics realistic user behaviors is not always necessary for the dialog strategy learning task. For the dialog system testing task, a user simulation which simulates user behaviors in a statistical way can generate both objective and subjective measures of dialog system performance similar to human users. Our research examines the strengths and weaknesses of user simulations in spoken dialog system development. Although our results are constrained to our task domain and the resources available, we provide a general framework for comparing user simulations in a task-dependent context. In addition, we summarize and validate a set of evaluation measures that can be used in comparing different simulated users as well as simulated versus human users.
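The two approaches to setting up user action probabilities, trained versus manual, might be contrasted in a minimal sketch like this; the action names and toy corpus are invented.

```python
import random
from collections import Counter

def trained_policy(corpus_actions):
    """Action probabilities estimated from corpus frequencies (realistic user)."""
    counts = Counter(corpus_actions)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def uniform_policy(action_set):
    """Manually set, uniform probabilities (an exploring user)."""
    return {a: 1 / len(action_set) for a in action_set}

def sample_action(policy, rng):
    """Draw one user action according to the policy's probabilities."""
    actions = list(policy)
    return rng.choices(actions, weights=[policy[a] for a in actions], k=1)[0]

corpus = ["inform", "inform", "request", "confirm", "inform"]
realistic = trained_policy(corpus)        # mimics observed behavior
exploring = uniform_policy(set(corpus))   # covers more of the action space
print(sample_action(realistic, random.Random(0)))
```

The realistic policy favors frequent corpus actions, while the uniform one explores rare actions more often, mirroring the trade-off the abstract discusses between the two tasks.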
... Using the user simulator avoids the evaluation problems that can arise with human subjects, and experiments can be conducted effectively under various conditions without changing other control variables. The user simulator was implemented based on previous work [48] and includes user intention modeling using a linear-chain conditional random field (CRF), data-driven user utterance simulations, and ASR channel simulations that use linguistic knowledge. When a dialog state is given, the user simulator generates a user intention based on the CRF model. ...
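A rough sketch of generating a user intention given the dialog state: here a softmax over hand-set feature weights stands in for the trained linear-chain CRF, and the intention set, features, and weights are all hypothetical.

```python
import math
import random

INTENTS = ["request_route", "confirm", "provide_dest", "bye"]

# Illustrative feature weights: (previous intention, dialog-state feature,
# next intention) -> score. A trained linear-chain CRF learns these from data.
WEIGHTS = {
    ("<start>", "sys_greeting", "request_route"): 2.5,
    ("provide_dest", "sys_asked_confirm", "confirm"): 2.0,
    ("confirm", "sys_gave_route", "bye"): 1.5,
}

def next_intention(prev, state_feature, rng):
    """Sample the next user intention from a softmax over feature scores."""
    scores = [WEIGHTS.get((prev, state_feature, i), 0.0) for i in INTENTS]
    z = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / z for s in scores]
    return rng.choices(INTENTS, weights=probs, k=1)[0]

print(next_intention("<start>", "sys_greeting", random.Random(1)))
```

Because the output is sampled rather than argmaxed, repeated runs produce the varied but plausible intention sequences a simulator needs.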
Article
Full-text available
This study examines the dialog-based language learning game (DB-LLG) realized in a 3D environment built with game contents. We designed the DB-LLG so that users can conduct interactive conversations with game characters in various immersive environments. From the pilot test, several technologies were identified as essential to the construction of the DB-LLG, such as dialog management, hint generation, and grammar error detection and feedback. We describe the technical details of our system POSTECH immersive English study (Pomy). We evaluated the performance of each technology using a simulator and field tests with users.
... How to effectively measure the realism of simulated dialogs is still very much an open research question. Some measures are discussed for example in (Jung et al. 2009), based on comparing the simulated dialogs with real user dialogs using the BLEU metric and based on human judgments. In the absence of real user dialogs with the same SDS, we aim for greater variability in the simulated user behavior. ...
Article
In this paper, we introduce the SpeechEval system, a platform for the automatic evaluation of spoken dialog systems on the basis of learned user strategies. The increasing number of spoken dialog systems calls for efficient approaches for their development and testing. The goal of SpeechEval is the minimization of hand-crafted resources to maximize the portability of this evaluation environment across spoken dialog systems and domains. In this paper we discuss the architecture of SpeechEval, as well as the user simulation technique which allows us to learn general user strategies from a new corpus. We present this corpus, the VOICE Awards human-machine dialog corpus, and show how this corpus is used to semi-automatically extract the resources and knowledge bases on which SpeechEval is based.
... Ferri et al. used the method of [6] to match a multimodal dialog sentence and define the similarity of multimodal templates [8]. Jung et al. used the Needleman-Wunsch algorithm [3] to measure similarity between simulated spoken utterances in the user simulation problem [9]. ...
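For reference, the Needleman-Wunsch global alignment used there for utterance similarity can be written compactly; the match/mismatch/gap costs and the length normalization below are illustrative choices, not the scores from [9].

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score between two token sequences (dynamic programming)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]

def utterance_similarity(u1, u2):
    """Length-normalized alignment score between two utterances."""
    t1, t2 = u1.split(), u2.split()
    return needleman_wunsch(t1, t2) / max(len(t1), len(t2))

print(utterance_similarity("set destination to city hall", "set destination city hall"))
```

Aligning at the token level tolerates the omissions and inversions typical of short dialog sentences better than exact string matching.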
Conference Paper
The ability to accurately judge the similarity between sentences is important for dialog system development in various areas such as utterance verification, context reasoning, and utterance clustering. However, standard text similarity measures fail when directly applied to dialog sentences, which are usually very short and contain many ungrammatical omissions and inversions. This paper presents a sentence-similarity refining method that uses the discourse similarity of dialog sentences. First, we propose a novel discourse similarity based on a dialog act taxonomy. Given the discourse similarity, we then present a novel way of rescoring the original sentence score by explicitly adding a discourse score to it. Experiments on test data sets demonstrate that the proposed measure significantly outperforms traditional similarity scoring measures.
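The rescoring idea, adding a discourse score derived from a dialog-act taxonomy to the original sentence score, might look like this in a sketch; the toy taxonomy and the interpolation weight are assumptions, not the paper's taxonomy or formula.

```python
# Illustrative dialog-act taxonomy: act -> its path from the taxonomy root
TAXONOMY = {
    "wh_question": ("question", "wh_question"),
    "yn_question": ("question", "yn_question"),
    "statement": ("inform", "statement"),
}

def discourse_sim(act1, act2):
    """Similarity of two dialog acts: shared-prefix depth in the taxonomy."""
    p1, p2 = TAXONOMY[act1], TAXONOMY[act2]
    shared = 0
    for x, y in zip(p1, p2):
        if x != y:
            break
        shared += 1
    return shared / max(len(p1), len(p2))

def rescore(sentence_score, act1, act2, alpha=0.3):
    """Refine a raw sentence-similarity score with the discourse score."""
    return (1 - alpha) * sentence_score + alpha * discourse_sim(act1, act2)

print(rescore(0.6, "wh_question", "yn_question"))
```

Two short questions thus get a boost relative to a question/statement pair with the same surface overlap, which is the effect the paper exploits.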
... We evaluated the performance of these models by measuring dialogue similarity to the original data, based on the Kullback-Leibler (KL) divergence, as also used by, e.g. [58], [59], [60]. We compare the raw probabilities as observed in the data with the probabilities generated by our n-gram models using different discounting techniques, including Witten-Bell, Good-Turing, absolute and linear discounting [57], see Table III. ...
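A KL-divergence comparison of observed versus model probabilities can be sketched as below; a simple epsilon floor stands in for the discounting techniques (Witten-Bell, Good-Turing, absolute, linear) compared in the snippet, and the toy distributions are invented.

```python
import math
from collections import Counter

def kl_divergence(real_counts, model_probs, eps=1e-6):
    """KL(real || model) over the joint vocabulary, with an epsilon floor for
    events the model assigns no mass to (a crude stand-in for discounting)."""
    total = sum(real_counts.values())
    kl = 0.0
    for w in set(real_counts) | set(model_probs):
        p = real_counts.get(w, 0) / total
        q = max(model_probs.get(w, 0.0), eps)
        if p > 0:
            kl += p * math.log(p / q)
    return kl

real = Counter({"inform": 6, "request": 3, "confirm": 1})
good = {"inform": 0.6, "request": 0.3, "confirm": 0.1}  # matches the data
bad = {"inform": 0.1, "request": 0.1, "confirm": 0.8}   # far from the data
print(kl_divergence(real, good), kl_divergence(real, bad))
```

A lower divergence indicates simulated dialog distributions closer to the raw probabilities observed in the data.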
Article
Full-text available
We present and evaluate a novel approach to natural language generation (NLG) in statistical spoken dialogue systems (SDS) using a data-driven statistical optimization framework for incremental information presentation (IP), where there is a trade-off to be solved between presenting "enough" information to the user and keeping the utterances short and understandable. The trained IP model is adaptive to variation from the current generation context (e.g. a user and a non-deterministic sentence planner), and it incrementally adapts the IP policy at the turn level. Reinforcement learning is used to automatically optimize the IP policy with respect to a data-driven objective function. In a case study on presenting restaurant information, we show that an optimized IP strategy trained on Wizard-of-Oz data outperforms a baseline mimicking the wizard behavior in terms of total reward gained. The policy is then also tested with real users, and improves on a conventional hand-coded IP strategy used in a deployed SDS in terms of overall task success. The evaluation found that the trained IP strategy significantly improves dialogue task completion for real users, with up to an 8.2% increase in task success. This methodology also provides new insights into the nature of the IP problem, which has previously been treated as a module following dialogue management with no access to lower-level context features (e.g. from a surface realizer and/or speech synthesizer).
... He then hand selects parameters to ensure that the user's actions are in accordance with their goal. Jung et al. (2009) use large amounts of dialog state annotations (e.g. what information has been provided so far) to learn Conditional Random Fields over the user utterances, and assume that those features ensure user consistency. ...
Conference Paper
Full-text available
User simulation is frequently used to train statistical dialog managers for task-oriented domains. At present, goal-driven simulators (those that have a persistent notion of what they wish to achieve in the dialog) require some task-specific engineering, making them impossible to evaluate intrinsically. Instead, they have been evaluated extrinsically by means of the dialog managers they are intended to train, leading to circularity of argument. In this paper, we propose the first fully generative goal-driven simulator that is fully induced from data, without hand-crafting or goal annotation. Our goals are latent, and take the form of topics in a topic model, clustering together semantically equivalent and phonetically confusable strings, implicitly modelling synonymy and speech recognition noise. We evaluate on two standard dialog resources, the Communicator and Let's Go datasets, and demonstrate that our model has substantially better fit to held out data than competing approaches. We also show that features derived from our model allow significantly greater improvement over a baseline at distinguishing real from randomly permuted dialogs.
... For each dialog, the system gets 20 points if the dialog is successfully completed, and loses one point for each dialog turn. To evaluate the dialog system, we used the dialog simulator proposed in [17]. We used 1000 simulated dialogs and 10-best recognition hypotheses for automatic evaluation. ...
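The scoring scheme described in the snippet (+20 for success, one point lost per turn) is straightforward to reproduce for a batch of simulated dialogs; the example dialogs below are hypothetical.

```python
def dialog_reward(success, num_turns, success_bonus=20, turn_cost=1):
    """Scoring scheme from the snippet: +20 on task success, -1 per dialog turn."""
    return (success_bonus if success else 0) - turn_cost * num_turns

def average_reward(simulated_dialogs):
    """Mean reward over a batch of (success, num_turns) simulated dialogs."""
    return sum(dialog_reward(s, t) for s, t in simulated_dialogs) / len(simulated_dialogs)

# Three hypothetical simulated dialogs: two successes, one failure
print(average_reward([(True, 5), (True, 8), (False, 12)]))
```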
Conference Paper
In data-driven spoken dialog system development, developers should prepare a dialog corpus with semantic annotation. However, the labeling process is a laborious and time-consuming task. To reduce human effort, we propose an unsupervised approach based on a non-parametric Bayesian Hidden Markov Model for modeling user actions. With the non-parametric model, system designers do not need to determine the number and type of user actions. In the experiments, we evaluated the clustering results by comparing them to the human annotation. We also tested a dialog system that used models trained from the automatically annotated corpus with a user simulation.
Article
Full-text available
Dialogue policy is a crucial component in task-oriented Spoken Dialogue Systems (SDSs). As a decision function, it takes the current dialogue state as input and generates an appropriate system response. In this paper, we explore reinforcement learning approaches to solve this problem in an Indic language scenario. Recently, Deep Reinforcement Learning (DRL) has been used to optimise the dialogue policy. However, many DRL approaches are not sample-efficient. Hence, particular attention is given to actor-critic methods based on off-policy reinforcement learning that utilise the Experience Replay (ER) technique for reducing bias and variance to achieve high sample efficiency. ER-based actor-critic methods, such as Advantage Actor-Critic Experience Replay (A2CER), are proven to deliver competitive results in gaming environments that are fully observable and have a very small action set, whereas in SDSs the states are not fully observable and the policy often has to deal with a large action space. After describing the limitations of traditional methods, i.e., value-based and policy-based methods, such as high variance, low sample-efficiency, and frequent convergence to local optima, we firstly explore the use of A2CER in dialogue policy learning. It is shown to beat the current state-of-the-art deep learning methods for SDS. Secondly, to handle the issues of early-stage performance, we utilise a demonstration corpus to pre-train the models prior to on-line policy learning. We thus experiment with A2CER on a larger action space and find it significantly faster than the current state-of-the-art. Combining both approaches, we present a novel DRL-based dialogue policy optimisation method, A2CER, and demonstrate its effectiveness for a task-oriented SDS in the Indic language.
Chapter
Recent reinforcement learning algorithms for task-oriented dialogue systems have attracted a lot of interest. However, an unavoidable obstacle for training such algorithms is that annotated dialogue corpora are often unavailable. One popular approach to addressing this is to train a dialogue agent with a user simulator. Traditional user simulators are built upon a set of dialogue rules and therefore lack response diversity. This severely limits the simulated cases for agent training. Later data-driven user models work better in diversity but suffer from the data scarcity problem. To remedy this, we design a new corpus-free framework that takes advantage of both their benefits. The framework builds a user simulator by first generating diverse dialogue data from templates and then building a new State2Seq user simulator on the data. To enhance the performance, we propose the State2Seq user simulator model to efficiently leverage dialogue state and history. Experiment results on an open dataset show that our user simulator helps agents achieve an improvement of 6.36% on success rate. The State2Seq model outperforms the seq2seq baseline by 1.9 F-score.
Conference Paper
Full-text available
In this paper, we suggest the generalization of an Arabic Spoken Language Understanding (SLU) system in a multi-domain human-machine dialog. We are particularly interested in the domain portability of the SLU system for both structured (DBMS) and unstructured data (Information Extraction), across four domains. In this work, we used the thematic approach for four domains: School Management, Medical Diagnostics, Consultation, and Question-Answering (DAWQAS). We should note that two kinds of classifiers are used in our experiments, statistical and neural, namely: Gaussian Naive Bayes, Bernoulli Naive Bayes, Logistic Regression, SGD, Passive Aggressive Classifier, Perceptron, Linear Support Vector and Convolutional Neural Network.
Article
The present paper surveys neural approaches to conversational AI that have been developed in the last few years. We group conversational systems into three categories: (1) question answering agents, (2) task-oriented dialogue agents, and (3) chatbots. For each category, we present a review of state-of-the-art neural approaches, draw the connection between them and traditional approaches, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies.
Chapter
In this paper we provide an overview of the CLEF 2018 Dynamic Search Lab. The lab ran for the first time in 2017 as a workshop. The outcomes of the workshop were used to define the tasks of this year's evaluation lab. The lab strives to answer one key question: how can we evaluate, and consequently build, dynamic search algorithms? Unlike static search algorithms, which consider user requests independently and consequently do not adapt their ranking with respect to the user's sequence of interactions and the user's end goal, dynamic search algorithms try to infer the user's intentions based on their interactions and adapt their ranking accordingly. Session personalization, contextual search, conversational search, and dialog systems are some examples of dynamic search. Herein, we describe the overall objectives of the CLEF 2018 Dynamic Search Lab, the resources created, and the evaluation methodology designed.
Conference Paper
In this paper we provide an overview of the first edition of the CLEF Dynamic Search Lab. The CLEF Dynamic Search lab ran in the form of a workshop with the goal of approaching one key question: how can we evaluate dynamic search algorithms? Unlike static search algorithms, which essentially consider user requests independently and do not adapt the ranking w.r.t. the user's sequence of interactions, dynamic search algorithms try to infer the user's intentions from their interactions and then adapt the ranking accordingly. Personalized session search, contextual search, and dialog systems often adopt such algorithms. This lab provides an opportunity for researchers to discuss the challenges faced when trying to measure and evaluate the performance of dynamic search algorithms, given the context of available corpora, simulation methods, and current evaluation metrics. To seed the discussion, a pilot task was run with the goal of producing search agents that could simulate the process of a user interacting with a search system over the course of a search session. Herein, we describe the overall objectives of the CLEF 2017 Dynamic Search Lab, the resources created for the pilot task and the evaluation methodology adopted.
Chapter
Employing statistical models of users, generation contexts and of natural languages themselves has several potentially beneficial features: the ability to train models on real data, the availability of precise mathematical methods for optimisation, and the capacity to adapt robustly to previously unseen situations.
Chapter
The evaluation of conversational interfaces is a continuously evolving research area that encompasses a rich variety of methodologies, techniques, and tools. As conversational interfaces become more complex, their evaluation has become multifaceted. Furthermore, evaluation involves paying attention not only to the different components in isolation, but also to interrelations between the components and the operation of the system as a whole. This chapter discusses the main measures that are employed for evaluating conversational interfaces from a variety of perspectives.
Article
This paper proposes a novel technique to create scenarios that can be used by a user simulator for exhaustively evaluating spoken dialogue systems. The scenarios are automatically created from simple scenario templates that the systems' developers create manually, employing their knowledge about typical goals of the systems' users. The scenarios contain goals, which the user simulator will try to achieve through the interaction with the systems. The goals are represented in the form of semantic frames, which are associated with user utterances of sentences and are taken from utterance corpora. In this way, the scenarios enable speech-based interaction between the simulator and the spoken dialogue systems to be evaluated. Experiments have been carried out employing two spoken dialogue systems (Saplen and Viajero), a user simulator and two utterance corpora previously collected for two different application domains: fast-food ordering and bus travel information. Experimental results show that the technique has been useful for exhaustively evaluating the systems and finding out problems in their performance that must be addressed to improve them. Some of these problems are caused by acoustic similarity between some uttered words and by strong speaker accents. Thus, we think these problems would have been difficult to uncover employing the user simulation techniques typically used nowadays, as they do not employ real speech and just consider the semantics of user intentions.
Article
The concept of Deep Level Situation Understanding is proposed to realize human-like natural communication among agents (e.g., humans and robots/machines). It consists of surface level understanding (such as gesture/posture recognition, facial expression recognition, and speech/voice recognition), emotion understanding, intention understanding, and atmosphere understanding, achieved by applying the customised knowledge of each agent and by paying careful attention. It aims not to impose a burden on humans in human-machine communication, to realize harmonious communication by excluding unnecessary troubles or misunderstandings among agents, and finally to create a peaceful, happy, and prosperous human-robot society. A scenario is established to demonstrate several communication activities between a businessman and a secretary-robot/a human-boss/a waitress-robot/a human-partner/a therapy-robot (PARO) in one day.
Article
In this chapter we will describe a new approach to generating natural language in interactive systems - one that shares many features with more traditional planning approaches but that uses statistical machine learning models to develop adaptive natural language generation (NLG) components for interactive applications. We employ statistical models of users, of generation contexts, and of natural language itself. This approach has several potential advantages: the ability to train models on real data, the availability of precise mathematical methods for optimization, and the capacity to adapt robustly to previously unseen situations. Rather than emulating human behavior in generation (which can be sub-optimal), these methods can find strategies for NLG that improve on human performance. Recently, some very encouraging test results have been obtained with real users of systems developed using these methods. In this chapter we will explain the motivations behind this approach, and will present several case studies, with reference to recent empirical results in the areas of information presentation and referring expression generation, including new work on the generation of temporal referring expressions. Finally, we provide a critical outlook for future work on statistical approaches to adaptive NLG.
Article
Statistical approaches to dialogue management have steadily increased in popularity over the last decade. Recent evaluations of such dialogue managers have shown their feasibility for sizeable domains and their advantage in terms of increased robustness. Moreover, simulated users have been shown to be highly beneficial in the development and testing of dialogue managers and, in particular, for training statistical dialogue managers. Learning the optimal policy of a POMDP dialogue manager is typically done using reinforcement learning (RL), but with the RL algorithms that are commonly used today, this process still relies on the use of a simulated user. Data-driven approaches to user simulation have been developed to train dialogue managers on more realistic user behaviour. This chapter provides an overview of user simulation techniques and evaluation methodologies. In particular, recent developments in agenda-based user simulation, dynamic Bayesian network-based simulations and inverse reinforcement learning-based user simulations are discussed in detail. Finally, we will discuss ongoing work and future challenges for user simulation.
Chapter
A spoken dialogue system has a large number of complex problems to overcome. To simplify matters, two key assumptions are almost always made. First, only dialogues with exactly two participants are considered, and second, all interactions between the system and the user are in the form of turns.
Article
In building a spoken dialogue system (SDS), disambiguation of user speech input is an important and challenging task. In our research, we develop a new disambiguation technique. The core component of the technique is a user behavior model that is used to predict user dialogue actions. When ambiguity occurs, information about predicted user actions can be used for disambiguation. We apply a reinforcement learning algorithm for creating and online updating the user behavior model. In the previous stage of our research, the algorithm was based on the Markov Decision Process (MDP), which had limitations in dealing with uncertainty in perceiving states. In this paper, we present a new reinforcement learning algorithm for user behavior modeling, which is based on the POMDP (Partially Observable Markov Decision Process). We will describe how a learning agent creates and updates the user behavior model when it is uncertain about the current states, and how the agent applies the model for disambiguating user speech input. We will present an experimental system and initial experimental results as well.
Conference Paper
When interacting with information access systems, users typically have distinct interests, preferences and goals that may significantly influence the way they judge the relevance of these systems' outputs. User modeling aims at capturing such interests and preferences in the form of personalized user profiles, which could then be used by the systems to provide tailored information services to their users. In this context, we present in this paper a novel ontology-based user profile for personalized information access which, different from the majority of existing approaches, captures user preferences in the form of semantic relevance paths within the ontology graph. More importantly, we present an automated approach for learning and maintaining such profiles in dialogue systems by taking advantage of the dialogue-based interaction between the system and the user and eliciting from the latter important feedback regarding the relevance of the provided information.
Conference Paper
A recent trend in spoken dialogue research is the use of reinforcement learning to train dialogue systems in a simulated environment. Past researchers have shown that the types of errors that are simulated can have a significant effect on simulated dialogue performance. Since modern systems typically receive an N-best list of possible user utterances, it is important to be able to simulate a full N-best list of hypotheses. This paper presents a new method for simulating such errors based on logistic regression, as well as a new method for simulating the structure of N-best lists of semantics and their probabilities, based on the Dirichlet distribution. Off-line evaluations show that the new Dirichlet model results in a much closer match to the receiver operating characteristics (ROC) of the live data. Experiments also show that the logistic model gives confusions that are closer to the type of confusions observed in live situations. The hope is that these new error models will be able to improve the resulting performance of trained dialogue systems.
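The Dirichlet-based simulation of N-best list structure can be sketched with the standard Gamma-normalization trick, requiring only the standard library. Pairing the correct hypothesis with the top score is a simplification of the paper's model (its logistic-regression error model decides when a confusion outranks the truth), and the concentration parameter and confusion strings are invented.

```python
import random

def sample_dirichlet(alphas, rng):
    """Dirichlet sample via normalized Gamma draws (no numpy required)."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def simulate_nbest(true_hyp, confusions, rng, alpha=0.8):
    """Simulate an N-best list: candidate hypotheses paired with confidence
    scores drawn from a symmetric Dirichlet and sorted in descending order."""
    hyps = [true_hyp] + list(confusions)
    probs = sorted(sample_dirichlet([alpha] * len(hyps), rng), reverse=True)
    return list(zip(hyps, probs))

rng = random.Random(42)
nbest = simulate_nbest("set destination",
                       ["set decision", "get destination", "pet station"], rng)
print(nbest)
```

A concentration parameter below 1 tends to put most of the mass on one hypothesis, mimicking the peaked confidence distributions of real recognizers.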
Article
Full-text available
This paper describes a method for simulating mixed initiative human-machine dialogues using data collected by a prototype dialogue system. The behaviour of the user population is modelled probabilistically using an explicit representation of user state. Recognition and understanding errors are also modelled. The simulation can be used for evaluation of competing strategies, as well as automatic learning of dialogue strategies.
Article
Full-text available
Recent work in the area of probabilistic user simulation for training statistical dialogue managers has investigated a new agenda-based user model and presented preliminary experiments with a handcrafted model parameter set. Training the model on dialogue data is an important next step, but non-trivial since the user agenda states are not observable in data and the space of possible states and state transitions is intractably large. This paper presents a summary-space mapping which greatly reduces the number of state transitions and introduces a tree-based method for representing the space of possible agenda state sequences. Treating the user agenda as a hidden variable, the forward/backward algorithm can then be successfully applied to iteratively estimate the model parameters on dialogue data.
Article
Full-text available
A diagnostic evaluation of eight Switchboard-corpus recognition (and six forced-alignment) systems was conducted in order to ascertain whether the associated error patterns can be traced to a specific set of factors. Each recognition system's output was converted to a common format and scored relative to a reference transcript derived from phonetically hand-labeled data (comprising fifty-four minutes of material from several hundred speakers). This reference material was analyzed with respect to several dozen acoustic, linguistic and speaker characteristics, which in turn, were correlated with the recognition-error patterns via a decision-tree analysis. The decision trees indicate that the most consistent factors associated with superior recognition performance pertain to accurate classification of phonetic segments and features. These results suggest that future-generation recognition systems would benefit from improving the acoustic models used for phonetic classification, as well as the pronunciation models involved in lexical matching.
Article
Full-text available
This paper proposes a probabilistic framework for spoken dialog management using dialog examples. To overcome the complexity problems of the classic partially observable Markov decision process (POMDP) based dialog manager, we use a frame-based belief state representation that reduces the complexity of belief update. We also used dialog examples to maintain a reasonable number of system actions to reduce the complexity of optimizing the policy. We developed weather information and car navigation dialog systems that employed a frame-based probabilistic framework. This framework enables people to develop a spoken dialog system using a probabilistic approach without the complexity problems of POMDPs.
Article
Full-text available
In this demonstration, we present POSSDM (POSTECH Situation-based Dialogue Manager), a spoken dialogue system using new example- and situation-based dialogue management techniques for effective generation of appropriate system responses. A spoken dialogue system should generate cooperative responses to smoothly lead the dialogue with users. We introduce a new dialogue management technique incorporating dialogue examples and situation-based rules for the EPG (Electronic Program Guide) domain. For system response inference, we automatically construct and index a dialogue example database from a dialogue corpus, and the best dialogue example is retrieved for a proper system response using a query built from the current user utterance and the discourse history. When the dialogue corpus does not sufficiently cover the domain, we also apply manually constructed situation-based rules, mainly for meta-level dialogue management.
Article
Full-text available
The lack of suitable training and testing data is currently a major roadblock in applying machine-learning techniques to dialogue management. Stochastic modelling of real users has been suggested as a solution to this problem, but to date few of the proposed models have been quantitatively evaluated on real data. Indeed, there are no established criteria for such an evaluation. This paper presents a systematic approach to testing user simulations and assesses the most prominent domain-independent techniques using a large DARPA Communicator corpus of human-computer dialogues. We show that while recent advances have led to significant improvements in simulation quality, simple statistical metrics are still sufficient to discern synthetic from real dialogues.
Conference Paper
Full-text available
We report evaluation results for real users of a learnt dialogue management policy versus a hand-coded policy in the TALK project's "TownInfo" tourist information system. The learnt policy, for filling and confirming information slots, was derived from COMMUNICATOR (flight-booking) data using reinforcement learning (RL) as described in [2], ported to the tourist information domain (using a general method that we propose here), and tested using 18 human users in 180 dialogues, who also used a state-of-the-art hand-coded dialogue policy embedded in an otherwise identical system. We found that users of the (ported) learned policy had an average gain in perceived task completion of 14.2% (from 67.6% to 81.8% at p < .03), that the hand-coded policy dialogues had on average 3.3 more system turns (p < .01), and that the user satisfaction results were comparable, even though the policy was learned for a different domain. Combining these in a dialogue reward score, we found a 14.4% increase for the learnt policy (a 23.8% relative increase, p < .03). These results are important because they show a) that results for real users are consistent with results for automatic evaluation [2] of learned policies using simulated users [3, 4], b) that a policy learned using linear function approximation over a very large policy space [2] is effective for real users, and c) that policies learned using data for one domain can be used successfully in other domains. We also present a qualitative discussion of the learnt policy.
Conference Paper
Full-text available
Spoken language understanding (SLU) addresses the problem of mapping natural language speech into semantic frame for structure encoding of its meaning. Most of the SLU systems separate out the dialog act (DA) identification from the named entity (NE) recognition to generate the semantic frames. In previous works, these two subtasks are treated by independent or cascaded approaches. In the cascaded systems, however, DA and NE influence only to one side, rather than to both sides. In this paper, we develop a new joint SLU model with a triangular-chain conditional random field (CRF) to encode inter-dependence between DA and NE. On four real dialog data, we show that our joint approach outperforms both independent and cascaded approaches.
Conference Paper
Full-text available
The application of machine learning methods to the dialogue management component of spoken dialogue systems is a growing research area. Whereas traditional methods use hand-crafted rules to specify a dialogue policy, machine learning techniques seek to learn dialogue behaviours from a corpus of training data. In this paper, we identify the properties of a corpus suitable for training machine-learning techniques, and propose a framework for collecting dialogue data. The approach is akin to a "Wizard of Oz" set-up with a "wizard" and a "user", but introduces several novel variations to simulate the ASR communication channel. Specifically, a turn-taking model common in spoken dialogue systems is used, and rather than hearing the user directly, the wizard sees simulated speech recognition results on a screen. The simulated recognition results are produced with an error-generation algorithm which allows the target WER to be adjusted. An evaluation of the algorithm is presented.
Conference Paper
Full-text available
This paper describes and compares two methods for simulating user behaviour in spoken dialogue systems. User simulations are important for automatic dialogue strategy learning and the evaluation of competing strategies. Our methods are designed for use with "Information State Update" (ISU)-based dialogue systems. The first method is based on supervised learning using linear feature combination and a normalised exponential output function. The user is modelled as a stochastic process which selects user actions (⟨speech act, task⟩ pairs) based on features of the current dialogue state, which encodes the whole history of the dialogue. The second method uses n-grams of ⟨speech act, task⟩ pairs, restricting the length of the history considered by the order of the n-gram. Both models were trained and evaluated on a subset of the COMMUNICATOR corpus, to which we added annotations for user actions and Information States. The model based on linear feature combination has a perplexity of 2.08, whereas the best n-gram (4-gram) has a perplexity of 3.58. Each of the user models ran against a system policy trained on the same corpus with a method similar to the one used for our linear feature combination model. The quality of the simulated dialogues produced was then measured as a function of the filled slots, confirmed slots, and number of actions performed by the system in each dialogue. In this experiment both the linear feature combination model and the best n-grams (5-gram and 4-gram) produced similar-quality simulated dialogues.
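The n-gram variant can be sketched in a few lines: a bigram user model counts transitions between ⟨speech act, task⟩ pairs in a corpus and samples the next user action from those counts. The toy corpus and action names below are invented for illustration, not taken from the paper.

```python
import random
from collections import defaultdict

# Toy dialogue corpus: each dialogue is a sequence of <speech act, task> pairs.
corpus = [
    [("greet", "none"), ("request", "flight"), ("provide", "date"), ("bye", "none")],
    [("greet", "none"), ("request", "flight"), ("provide", "dest"), ("bye", "none")],
]

# Count bigram transitions between consecutive user actions.
bigrams = defaultdict(lambda: defaultdict(int))
for dialogue in corpus:
    for prev, nxt in zip(dialogue, dialogue[1:]):
        bigrams[prev][nxt] += 1

def next_action(prev, rng):
    """Sample the next user action conditioned on the previous one."""
    candidates = bigrams[prev]
    actions = list(candidates)
    weights = [candidates[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
print(next_action(("greet", "none"), rng))  # -> ("request", "flight")
```

A 4-gram model as used in the paper would simply condition on a tuple of the last three actions instead of one.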
Conference Paper
Full-text available
Describes an approach to the automatic evaluation of both the speech recognition and understanding capabilities of a spoken dialogue system for train timetable information. For performance judgement, we use word accuracy for recognition and concept accuracy for understanding. Both measures are calculated by comparing these modules' outputs with a correct reference answer. We report evaluation results for a spontaneous speech corpus with about 10,000 utterances. We observed a nearly linear relationship between word accuracy and concept accuracy
Conference Paper
Full-text available
In this paper, we describe a new methodology to develop mixed-initiative spoken dialog systems, which is based on the extensive use of simulations to accelerate the development process. With the help of simulations, a system providing information about a database of nearly 1000 restaurants in the Boston area has been developed. The simulator can produce thousands of unique dialogs which benefit not only dialog development but also provide data to train the speech recognizer and understanding components, in preparation for real user interactions. Also described is a strategy for creating cooperative responses to user queries, incorporating an intelligent language generation capability that produces content-dependent verbal descriptions of listed items.
Conference Paper
Full-text available
Previous studies evaluate simulated dialog corpora using evaluation measures which can be automatically extracted from the dialog systems' logs. However, the validity of these automatic measures has not been fully proven. In this study, we first recruit human judges to assess the quality of three simulated dialog corpora and then use human judgments as the gold standard to validate the conclusions drawn from the automatic measures. We observe that it is hard for the human judges to reach good agreement when asked to rate the quality of the dialogs from given perspectives. However, the human ratings give consistent rankings of the quality of simulated corpora generated by different simulation models. When building prediction models of human judgments using previously proposed automatic measures, we find that we cannot reliably predict human ratings using a regression model, but we can predict human rankings with a ranking model.
Conference Paper
Full-text available
This work presents an agenda-based approach to improve the robustness of the dialog manager by using dialog examples and n-best recognition hypotheses. This approach supports n-best hypotheses in the dialog manager and keeps track of the dialog state using a discourse interpretation algorithm with the agenda graph and focus stack. Given the agenda graph and n-best hypotheses, the system can predict the next system actions to maximize multi-level score functions. To evaluate the proposed method, a spoken dialog system for a building guidance robot was developed. Preliminary evaluation shows this approach would be effective in improving the robustness of example-based dialog modeling.
Conference Paper
Full-text available
This paper investigates the problem of bootstrapping a statistical dialogue manager without access to training data and proposes a new probabilistic agenda-based method for simulating user behaviour. In experiments with a statistical POMDP dialogue system, the simulator was realistic enough to successfully test the prototype system and train a dialogue policy. An extensive study with human subjects showed that the learned policy was highly competitive, with task completion rates above 90%.
Article
Full-text available
This paper proposes a new technique to test the performance of spoken dialogue systems by artificially simulating the behaviour of three types of user (very cooperative, cooperative and not very cooperative) interacting with a system by means of spoken dialogues. Experiments using the technique were carried out to test the performance of a previously developed dialogue system designed for the fast-food domain and working with two kinds of language model for automatic speech recognition: one based on 17 prompt-dependent language models, and the other based on one prompt-independent language model. The use of the simulated user enables the identification of problems relating to the speech recognition, spoken language understanding, and dialogue management components of the system. In particular, in these experiments problems were encountered with the recognition and understanding of postal codes and addresses and with the lengthy sequences of repetitive confirmation turns required to correct these errors. By employing a simulated user in a range of different experimental conditions, sufficient data can be generated to support a systematic analysis of potential problems and to enable fine-grained tuning of the system.
Conference Paper
Full-text available
We propose the "advanced" n-grams as a new technique for simulating user behaviour in spoken dialogue systems, and we compare it with two methods used in our prior work, i.e. linear feature combination and "normal" n-grams. All methods operate on the intention level and can incorporate speech recognition and understanding errors. In the linear feature combination model, user actions (lists of ⟨speech act, task⟩ pairs) are selected based on features of the current dialogue state, which encodes the whole history of the dialogue. The user simulation based on "normal" n-grams treats a dialogue as a sequence of lists of ⟨speech act, task⟩ pairs. Here the length of the history considered is restricted by the order of the n-gram. The "advanced" n-grams are a variation of the normal n-grams, where user actions are conditioned not only on speech acts and tasks but also on the current status of the tasks, i.e. whether
Conference Paper
Full-text available
Human-machine dialogue is heavily influenced by speech recognition and understanding errors and it is hence desirable to train and test statistical dialogue system policies under realistic noise conditions. This paper presents a novel approach to error simulation based on statistical models for word-level utterance generation, ASR confusions, and confidence score generation. While the method explicitly models the context-dependent acoustic confusability of words and allows the system specific language model and semantic decoder to be incorporated, it is computationally inexpensive and thus potentially suitable for running thousands of training simulations. Experimental evaluation results with a POMDP-based dialogue system and the Hidden Agenda User Simulator indicate a close match between the statistical properties of real and synthetic errors.
Conference Paper
Full-text available
This paper presents a probabilistic method to simulate task-oriented human-computer dialogues at the intention level, that may be used to improve or to evaluate the performance of spoken dialogue systems. Our method uses a network of hidden Markov models (HMMs) to predict system and user intentions, where a "language model" predicts sequences of goals and the component HMMs predict sequences of intentions. We compare standard HMMs, input HMMs and input-output HMMs in an effort to better predict sequences of intentions. In addition, we propose a dialogue similarity measure to evaluate the realism of the simulated dialogues. We performed experiments using the DARPA communicator corpora and report results with three different metrics: dialogue length, dialogue similarity and precision-recall
Conference Paper
Full-text available
The field of spoken dialogue systems has developed rapidly. However, optimisation, evaluation and rapid development of systems remain problematic. This paper describes a method of producing a probabilistic simulation of mixed initiative dialogue with recognition and understanding errors. Both user behaviour and system errors are modelled using a data-driven approach, and the quality of the simulations are evaluated by comparing them to real human-machine dialogues. The simulation system can be used to perform rapid evaluations of prototype systems, thus aiding the development process. It is also envisaged that it will be used as a tool for automation of dialogue design
Conference Paper
Full-text available
Automatic speech dialogue systems are becoming common. In order to assess their performance, a large sample of real dialogues has to be collected and evaluated. This process is expensive, labor intensive, and prone to errors. To alleviate this situation we propose a user simulation to conduct dialogues with the system under investigation. Using stochastic modeling of real users we can both debug and evaluate a speech dialogue system while it is still in the lab, thus substantially reducing the amount of field testing with real users
Article
Full-text available
We propose a quantitative model for dialog systems that can be used for learning the dialog strategy. We claim that the problem of dialog design can be formalized as an optimization problem with an objective function reflecting different dialog dimensions relevant for a given application. We also show that any dialog system can be formally described as a sequential decision process in terms of its state space, action set, and strategy. With additional assumptions about the state transition probabilities and cost assignment, a dialog system can be mapped to a stochastic model known as Markov decision process (MDP). A variety of data driven algorithms for finding the optimal strategy (i.e., the one that optimizes the criterion) is available within the MDP framework, based on reinforcement learning. For an effective use of the available training data we propose a combination of supervised and reinforcement learning: the supervised learning is used to estimate a model of the user, i.e., the MDP parameters that quantify the user's behavior. Then a reinforcement learning algorithm is used to estimate the optimal strategy while the system interacts with the simulated user. This approach is tested for learning the strategy in an air travel information system (ATIS) task. The experimental results we present in this paper show that it is indeed possible to find a simple criterion, a state space representation, and a simulated user parameterization in order to automatically learn a relatively complex dialog behavior, similar to one that was heuristically designed by several research groups
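The supervised-plus-reinforcement recipe described above can be sketched with a toy slot-filling MDP: a stochastic simulated user stands in for the estimated user model, and tabular Q-learning estimates the strategy against it. The states, actions, rewards, and probabilities below are invented for illustration.

```python
import random

# Toy slot-filling MDP: state = number of filled slots; 2 filled slots = done.
# Actions: "ask" (the simulated user fills a slot with prob 0.8) or "wait".
# Hypothetical rewards: -1 per system turn, +10 on task completion.
N_SLOTS, ALPHA, GAMMA, EPSILON = 2, 0.5, 0.95, 0.2
Q = {s: {"ask": 0.0, "wait": 0.0} for s in range(N_SLOTS + 1)}
rng = random.Random(1)

def simulated_user(state, action):
    """Stochastic user model: a slot is filled only after an 'ask'."""
    if action == "ask" and rng.random() < 0.8:
        return state + 1
    return state

for _ in range(2000):                                  # training episodes
    s = 0
    while s < N_SLOTS:
        if rng.random() < EPSILON:                     # epsilon-greedy exploration
            a = rng.choice(["ask", "wait"])
        else:
            a = max(Q[s], key=Q[s].get)
        s2 = simulated_user(s, a)
        r = -1 + (10 if s2 == N_SLOTS else 0)
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2].values()) - Q[s][a])
        s = s2

print(max(Q[0], key=Q[0].get))  # the learned strategy prefers "ask" at the start
```

In the paper's setting the simulated user is itself estimated from corpus data by supervised learning rather than hand-specified as here.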
Article
Full-text available
Pronunciation modeling in automatic speech recognition systems has had mixed results in the past; one likely reason for poor performance is the increased confusability in the lexicon from adding new pronunciation variants. In this work, we propose a new framework for determining lexically confusable words based on inverted finite state transducers (FSTs); we also present experiments designed to test some of the implementation details of this framework. The method is evaluated by looking at how well the algorithm predicts the errors in an ASR system. We see from the confusions learned in a training set that we are able to generalize this information to predict errors in an unseen test set.
Article
Full-text available
The limitations of traditional knowledge representation methods for modeling complex human behaviour led to the investigation of statistical models. Predictive statistical models enable the anticipation of certain aspects of human behaviour, such as goals, actions and preferences. In this paper, we motivate the development of these models in the context of the user modeling enterprise. We then review the two main approaches to predictive statistical modeling, content-based and collaborative, and discuss the main techniques used to develop predictive statistical models. We also consider the evaluation requirements of these models in the user modeling context, and propose topics for future research.
Article
Full-text available
Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that cannot be reused.
Article
Full-text available
The DIPPER architecture is a collection of software agents for prototyping spoken dialogue systems. Implemented on top of the Open Agent Architecture (OAA), it comprises agents for speech input and output, dialogue management, and further supporting agents. We define a formal syntax and semantics for the DIPPER information state update language.
Conference Paper
We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
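For a linear-chain model of this kind, finding the best label sequence reduces to the Viterbi recursion over label-transition and per-position scores. The sketch below shows only the decoding step, with invented scores; in a trained CRF these scores would come from learned feature weights.

```python
# Viterbi decoding for a linear-chain model: given per-position label scores
# and label-transition scores, find the highest-scoring label sequence.
def viterbi(obs_scores, trans):
    labels = list(trans)
    best = {y: obs_scores[0][y] for y in labels}       # scores ending in label y
    back = []                                          # backpointers per position
    for t in range(1, len(obs_scores)):
        ptr, new = {}, {}
        for y in labels:
            prev = max(labels, key=lambda yp: best[yp] + trans[yp][y])
            ptr[y] = prev
            new[y] = best[prev] + trans[prev][y] + obs_scores[t][y]
        back.append(ptr)
        best = new
    y = max(labels, key=best.get)                      # best final label
    path = [y]
    for ptr in reversed(back):                         # follow backpointers
        y = ptr[y]
        path.append(y)
    return path[::-1]

# Toy scores for a two-label (begin/inside) tagging problem.
trans = {"B": {"B": 0.0, "I": 1.0}, "I": {"B": 0.0, "I": 0.5}}
obs = [{"B": 2.0, "I": 0.0}, {"B": 0.0, "I": 1.0}, {"B": 0.5, "I": 1.0}]
print(viterbi(obs, trans))  # -> ['B', 'I', 'I']
```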
Article
We study the numerical performance of a limited memory quasi-Newton method for large scale optimization, which we call the L-BFGS method. We compare its performance with that of the method developed by Buckley and LeNir (1985), which combines cycles of BFGS steps and conjugate direction steps. Our numerical tests indicate that the L-BFGS method is faster than the method of Buckley and LeNir, and is better able to use additional storage to accelerate convergence. We show that the L-BFGS method can be greatly accelerated by means of a simple scaling. We then compare the L-BFGS method with the partitioned quasi-Newton method of Griewank and Toint (1982a). The results show that, for some problems, the partitioned quasi-Newton method is clearly superior to the L-BFGS method. However we find that for other problems the L-BFGS method is very competitive due to its low iteration cost. We also study the convergence properties of the L-BFGS method, and prove global convergence on uniformly convex problems.
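In current practice the L-BFGS family is available off the shelf; a minimal sketch using SciPy's `minimize` with the `L-BFGS-B` method on the Rosenbrock test function (the SciPy usage is illustrative and not from the paper, which predates the library):

```python
import numpy as np
from scipy.optimize import minimize

# Minimize the Rosenbrock function, a standard test problem for
# quasi-Newton methods, using limited-memory BFGS.
def rosenbrock(x):
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

result = minimize(rosenbrock, x0=np.array([-1.2, 1.0]), method="L-BFGS-B")
print(result.x)  # close to the minimizer [1, 1]
```

The memory size (the number of stored curvature pairs studied in the paper) is exposed as the `maxcor` option of the method.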
Article
In a spoken dialog system, determining which action a machine should take in a given situation is a difficult problem because automatic speech recognition is unreliable and hence the state of the conversation can never be known with certainty. Much of the research in spoken dialog systems centres on mitigating this uncertainty and recent work has focussed on three largely disparate techniques: parallel dialog state hypotheses, local use of confidence scores, and automated planning. While in isolation each of these approaches can improve action selection, taken together they currently lack a unified statistical framework that admits global optimization. In this paper we cast a spoken dialog system as a partially observable Markov decision process (POMDP). We show how this formulation unifies and extends existing techniques to form a single principled framework. A number of illustrations are used to show qualitatively the potential benefits of POMDPs compared to existing techniques, and empirical results from dialog simulations are presented which demonstrate significant quantitative gains. Finally, some of the key challenges to advancing this method – in particular scalability – are briefly outlined.
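The core of the POMDP formulation is the belief update performed after each observation: b'(s') ∝ P(o | s') · Σ_s P(s' | s, a) · b(s). A toy two-goal sketch, with invented state names and probabilities:

```python
# Belief-state update for a two-state toy dialogue POMDP:
# the hidden state is the user's goal, observations are noisy ASR results.
def belief_update(belief, trans, obs_prob, observation):
    new = {}
    for s2 in belief:
        predicted = sum(trans[s][s2] * belief[s] for s in belief)
        new[s2] = obs_prob[s2][observation] * predicted
    z = sum(new.values())                     # normalization constant
    return {s: p / z for s, p in new.items()}

# The goal stays fixed within a dialogue; ASR confuses the goals 20% of the time.
trans = {"flight": {"flight": 1.0, "hotel": 0.0},
         "hotel": {"flight": 0.0, "hotel": 1.0}}
obs_prob = {"flight": {"flight": 0.8, "hotel": 0.2},
            "hotel": {"flight": 0.2, "hotel": 0.8}}

b = {"flight": 0.5, "hotel": 0.5}
b = belief_update(b, trans, obs_prob, "flight")
print(round(b["flight"], 2))  # -> 0.8
```

Repeating the update with consistent observations concentrates the belief, which is exactly the uncertainty-tracking behaviour the POMDP framework unifies.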
Conference Paper
We address the problem of estimating the word error rate (WER) of an automatic speech recognition (ASR) system without using acoustic test data. This is an important problem which is faced by the designers of new applications which use ASR. Quick estimate of WER early in the design cycle can be used to guide the decisions involving dialog strategy and grammar design. Our approach involves estimating the probability distribution of the word hypotheses produced by the underlying ASR system given the text test corpus. A critical component of this system is a phonemic confusion model which seeks to capture the errors made by ASR on the acoustic data at a phonemic level. We use a confusion model composed of probabilistic phoneme sequence conversion rules which are learned from phonemic transcription pairs obtained by leave-one-out decoding of the training set. We show reasonably close estimation of WER when applying the system to test sets from different domains.
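The confusion-rule idea can be illustrated with a Monte-Carlo sketch: hypothetical phoneme substitution rules are applied to a phonemic transcription, and the expected number of phoneme errors is estimated by sampling. The rules and probabilities below are made up; the paper learns them from leave-one-out decoding of the training set.

```python
import random

# Hypothetical phoneme confusion rules: phoneme -> (substitute, confusion prob).
confusions = {"t": ("d", 0.3), "s": ("z", 0.2), "ih": ("iy", 0.25)}

def corrupt(phonemes, rng):
    """Apply the confusion model to a phonemic transcription."""
    out = []
    for p in phonemes:
        sub, prob = confusions.get(p, (p, 0.0))
        out.append(sub if rng.random() < prob else p)
    return out

def expected_errors(phonemes, n_samples=10000, seed=0):
    """Monte-Carlo estimate of the expected number of phoneme errors."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        hyp = corrupt(phonemes, rng)
        total += sum(h != r for h, r in zip(hyp, phonemes))
    return total / n_samples

# For t-eh-s-t the analytic expectation is 0.3 + 0.2 + 0.3 = 0.8 errors.
print(round(expected_errors(["t", "eh", "s", "t"]), 2))
```

Aggregating such per-word estimates over a text test corpus is what yields a WER prediction without acoustic test data.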
Conference Paper
In this work we propose a procedure model for rapid automatic strategy learning in multimodal dialogs. Our approach is tailored for typical task-oriented human-robot dialog interactions, with no prior knowledge about the expected user and system dynamics being present. For such scenarios, we propose the use of stochastic dialog simulation for strategy learning, where the user and system error models are solely trained through the initial execution of an inexpensive Wizard-of-Oz experiment. We argue that for the addressed dialogs, already a small data corpus combined with a low-conditioned simulation model facilitates learning of strong and complex dialog strategies. To validate our overall approach, we empirically show the supremacy of the learned strategy over a hand-crafted strategy for a concrete human-robot dialog scenario. To the authors' knowledge, this work is the first to perform strategy learning from multimodal dialog simulation. Index Terms: strategy learning, multimodal human-robot dialogs
Article
We present a new methodology of user simulation applied to the evaluation and refinement of stochastic dialog systems. Common weaknesses of these systems are the scarceness of the training corpus and the cost of an evaluation made by real users. We have considered the user simulation technique as an alternative way of testing and improving our dialog system. We have developed a new dialog manager that plays the role of the user. This user dialog manager incorporates several knowledge sources, combining statistical and heuristic information in order to define its dialog strategy. Once the user simulator is integrated into the dialog system, it is possible to enhance the dialog models by an automatic strategy learning. We have performed an extensive evaluation, achieving a slight but clear improvement of the dialog system.
Article
This paper describes the response planning and generation components of the mercury flight reservation system, a mixed-initiative spoken dialogue system that supports both voice-only interaction and multi-modal interaction augmenting spoken inputs with typing or clicking at a displayed Web page. mercury is configured using the Galaxy Communicator architecture (Seneff, Hurley, Lau, Schmid, & Zue, 1998), where a suite of servers interact via program control mediated by a central hub. Language generation is performed in two steps: response planning, or deep-structure generation, is carried out by the dialogue manager, and is well-integrated with other aspects of dialogue control; control flow is specified by a dialogue control table (Seneff & Polifroni, 2000a). Response generation, or surface-form generation, is executed by a separate language generation server, under the guidance of a set of recursive generation rules and an associated lexicon (Baptist & Seneff, 2000). The generation of the textual string for the graphical interface and the marked-up synthesis string for spoken outputs are controlled by a shared set of generation rules (Seneff & Polifroni, 2000b). Thus there is a direct meaning-to-speech mapping that eliminates the need to analyze linguistic structure for synthesis. To date, we have collected over 25 000 utterances from users interacting with the mercury system. We report here on both the results of user satisfaction studies conducted by the National Institute of Standards and Technology (NIST), and on our own tabulation of a number of different measures of dialogue success.
Article
This paper is concerned mainly with the choice of a figure of merit for representing the performance of connected-word recognisers when DP word-symbol sequence matching is used for the scoring. Properties of the DP scoring method are discussed. Experimental tests using data from the DARPA Resource Management Task confirm a prediction made from random number simulations that DP scoring overestimates substitution errors and underestimates insertion and deletion errors. As a result, the commonly used total error measure has a particularly large bias. The use of an alternative measure, percent correct, results in lower bias but ignores insertion errors. A new figure of merit, weighted total errors, takes all three kinds of errors into account and minimises bias. Finally, some more sophisticated figures of merit are discussed briefly.
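The DP scoring discussed here aligns reference and hypothesis word sequences and attributes each edit to a substitution, insertion, or deletion; a minimal sketch with invented example sentences:

```python
def align_counts(ref, hyp):
    """Levenshtein alignment of word sequences, returning
    (substitutions, insertions, deletions) under DP scoring."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (total edits, S, I, D) for ref[:i] vs hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        e, s, ins, d = cost[i - 1][0]
        cost[i][0] = (e + 1, s, ins, d + 1)            # deletion of a ref word
    for j in range(1, m + 1):
        e, s, ins, d = cost[0][j - 1]
        cost[0][j] = (e + 1, s, ins + 1, d)            # insertion of a hyp word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = ref[i - 1] == hyp[j - 1]
            e, s, ins, d = cost[i - 1][j - 1]
            sub = (e + (0 if match else 1), s + (0 if match else 1), ins, d)
            e, s, ins, d = cost[i - 1][j]
            dele = (e + 1, s, ins, d + 1)
            e, s, ins, d = cost[i][j - 1]
            inse = (e + 1, s, ins + 1, d)
            cost[i][j] = min(sub, dele, inse)          # fewest total edits wins
    return cost[n][m][1:]

ref = "show me flights to boston".split()
hyp = "show flights to the boston".split()
S, I, D = align_counts(ref, hyp)
print(S, I, D)  # 0 substitutions, 1 insertion, 1 deletion
```

From the three counts one can form total errors, percent correct (which ignores insertions), or the weighted total-error figure of merit the paper proposes.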
Article
In recent years, a question of great interest has been the development of tools and techniques to facilitate the evaluation of dialogue systems. The latter can be evaluated from various points of view, such as recognition and understanding rates, dialogue naturalness and robustness against recognition errors. Evaluation usually requires compiling a large corpus of words and sentences uttered by users, relevant to the application domain the system is designed for. This paper proposes a new technique that makes it possible to reuse such a corpus for the evaluation and to check the performance of the system when different dialogue strategies are used. The technique is based on the automatic generation of conversations between the dialogue system, together with an additional dialogue system called user simulator that represents the user’s interaction with the dialogue system. The technique has been applied to evaluate a dialogue system developed in our lab using two different recognition front-ends and two different dialogue strategies to handle user confirmations. The experiments show that the prompt-dependent recognition front-end achieves better results, but that this front-end is appropriate only if users limit their utterances to those related to the current system prompt. The prompt-independent front-end achieves inferior results, but enables front-end users to utter any permitted utterance at any time, irrespective of the system prompt. In consequence, this front-end may allow a more natural and comfortable interaction. The experiments also show that the re-prompting confirmation strategy enhances system performance for both recognition front-ends.
Article
A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed. From these findings it is possible to determine whether significant homology exists between the proteins. This information is used to trace their possible evolutionary development. The maximum match is a number dependent upon the similarity of the sequences. One of its definitions is the largest number of amino acids of one protein that can be matched with those of a second protein allowing for all possible interruptions in either of the sequences. While the interruptions give rise to a very large number of comparisons, the method efficiently excludes from consideration those comparisons that cannot contribute to the maximum match. Comparisons are made from the smallest unit of significance, a pair of amino acids, one from each protein. All possible pairs are represented by a two-dimensional array, and all possible comparisons are represented by pathways through the array. For this maximum match only certain of the possible pathways must be evaluated. A numerical value, one in this case, is assigned to every cell in the array representing like amino acids. The maximum match is the largest number that would result from summing the cell values of every pathway.
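With a score of one for like residues and no gap penalty, the maximum match described above is computed by the now-standard dynamic-programming recurrence; a minimal sketch on toy sequences:

```python
def max_match(a, b, gap=0, match=1):
    """Needleman-Wunsch maximum-match score: the largest number of residues
    of one sequence that can be paired with the other, allowing
    interruptions (gaps) in either sequence."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = M[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else 0)
            # Either pair the two residues, or skip one residue in a or in b.
            M[i][j] = max(pair, M[i - 1][j] + gap, M[i][j - 1] + gap)
    return M[n][m]

print(max_match("GCATGCU", "GATTACA"))  # maximum match for these toy sequences -> 4
```

The same recurrence underlies the DP word-sequence scoring used for speech recognizer evaluation elsewhere in this list.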
Conference Paper
Though the field of spoken dialogue systems has developed quickly in the last decade, rapid design of dialogue strategies remains difficult. Several approaches to the problem of automatic strategy learning have been proposed, and the use of reinforcement learning introduced by Levin and Pieraccini (see Pieraccini, R. et al., IEEE Trans. on Speech and Audio Proc., vol.8, p.11-23, 2000) is becoming part of the state of the art in this area. However, the quality of the strategy learned by the system depends on the definition of the optimization criterion and on the accuracy of the environment model. We propose to bring a model of an ASR system into a simulated environment in order to enhance the learned strategy. To do so, we introduced recognition error rates and confidence levels produced by ASR systems into the optimization criterion.
Conference Paper
In stochastic language modeling, backing-off is a widely used method to cope with the sparse data problem. In the case of unseen events, this method backs off to a less specific distribution. In this paper we propose to use distributions which are especially optimized for the task of backing-off. Two different theoretical derivations lead to distributions which are quite different from the probability distributions that are usually used for backing-off. Experiments show an improvement of about 10% in terms of perplexity and 5% in terms of word error rate.
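The general backing-off scheme the abstract refers to can be illustrated with a minimal sketch: a bigram model with absolute discounting that falls back to the unigram distribution for unseen events. This shows the standard scheme the paper takes as its baseline, not the specially optimized back-off distributions it proposes; the function name and discount value are illustrative assumptions:

```python
from collections import Counter

def backoff_bigram(tokens, discount=0.5):
    """Bigram model with absolute discounting that backs off to the
    unigram distribution for unseen bigrams (the generic scheme, not
    the optimized back-off distributions proposed in the paper)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())

    def prob(w, h):
        # seen bigram: discounted relative frequency
        if (h, w) in bigrams:
            return (bigrams[(h, w)] - discount) / unigrams[h]
        # unseen bigram: redistribute the discounted mass via unigrams
        seen = [b for b in bigrams if b[0] == h]
        alpha = discount * len(seen) / unigrams[h] if unigrams[h] else 1.0
        back = sum(unigrams[x] for x in unigrams
                   if (h, x) not in bigrams) / total
        return alpha * (unigrams[w] / total) / back if back else 0.0

    return prob
```

For the toy corpus `"a b a c"`, the probabilities conditioned on history `a` sum to one: the seen continuations `b` and `c` each get 0.25 after discounting, and the freed mass of 0.5 backs off to the unseen continuation `a`.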
Article
The design of Spoken Dialog Systems cannot be considered as the simple combination of speech processing technologies. Indeed, speech-based interface design has long been an expert job. It necessitates good skills in speech technologies and low-level programming, and rapid development and reusability of previously designed systems remain difficult. This makes optimality and objective evaluation of design very hard to achieve. The design process is therefore a cyclic process composed of prototype releases, user satisfaction surveys, bug reports and refinements. It is well known that human intervention for testing is time-consuming and above all very expensive. This is one of the reasons for the recent interest in dialog simulation for evaluation as well as for design automation and optimization. In this paper we expose a probabilistic framework for a realistic simulation of spoken dialogs in which the major components of a dialog system are modeled and parameterized thanks to independent data or expert knowledge. In particular, an Automatic Speech Recognition (ASR) system model and a User Model (UM) have been developed. The ASR model, based on articulatory similarities in language models, provides task-adaptive performance prediction and Confidence Level (CL) distribution estimation. The user model relies on the Bayesian Networks (BN) paradigm and is used both for user behavior modeling and Natural Language Understanding (NLU) modeling. The complete simulation framework has been used to train a reinforcement-learning agent on two different tasks. These experiments helped to point out several potentially problematic dialog scenarios.
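The ASR-channel idea recurring in these works (corrupting a correct utterance with substitutions, deletions, and insertions at a target error rate, and attaching a confidence level to the hypothesis) can be illustrated with a toy sketch. The error rate, vocabulary, confidence formula, and function name here are all illustrative assumptions, not the articulatory-similarity model of the paper above:

```python
import random

def simulate_asr(words, vocab, wer=0.2, seed=0):
    """Toy ASR channel: corrupt each word with probability `wer`,
    choosing uniformly among substitution, deletion, and insertion,
    and return the hypothesis with a crude confidence score."""
    rng = random.Random(seed)  # seeded for reproducible simulations
    hyp, errors = [], 0
    for w in words:
        if rng.random() < wer:
            errors += 1
            op = rng.choice(["sub", "del", "ins"])
            if op == "sub":
                hyp.append(rng.choice(vocab))      # replace the word
            elif op == "ins":
                hyp.extend([w, rng.choice(vocab)])  # keep it, insert noise
            # "del": drop the word entirely
        else:
            hyp.append(w)
    confidence = 1.0 - errors / max(len(words), 1)
    return hyp, confidence
```

A realistic channel model would instead condition errors on acoustic or articulatory confusability and emit per-word confidence distributions, but even a sketch like this lets a dialog strategy be exercised against noisy input.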
Article
Though the field of spoken dialogue systems has developed quickly in the last decade, rapid design of dialogue strategies remains difficult. Several approaches to the problem of automatic strategy learning have been proposed, and the use of Reinforcement Learning introduced by Levin and Pieraccini is becoming part of the state of the art in this area. However, the quality of the strategy learned by the system depends on the definition of the optimization criterion and on the accuracy of the environment model.
Article
We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
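As a concrete illustration of the linear-chain case (used in the paper above for user intention modeling), the decoding step of a linear-chain CRF can be sketched with Viterbi search over per-position label scores and label-transition scores. In a trained CRF these potentials would be weighted sums of features estimated by the iterative algorithms the abstract mentions; here they are supplied directly as made-up assumptions:

```python
def crf_viterbi(obs_scores, trans):
    """Viterbi decoding for a linear-chain CRF.
    obs_scores: list of {label: emission potential} per position,
    trans: {(prev_label, label): transition potential}.
    Returns the globally highest-scoring label sequence."""
    labels = list(obs_scores[0])
    # best[l] = (score of best path ending in label l, that path)
    best = {l: (obs_scores[0][l], [l]) for l in labels}
    for scores in obs_scores[1:]:
        new = {}
        for l in labels:
            # extend the best predecessor path with label l
            s, p = max((best[q][0] + trans.get((q, l), 0.0) + scores[l],
                        best[q][1]) for q in labels)
            new[l] = (s, p + [l])
        best = new
    return max(best.values())[1]
```

Because the path score is global, a strong transition potential can override a locally preferred label, which is exactly the advantage over per-position classification that conditional random fields provide.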
Probabilistic simulation of human–machine dialogues / Corpus-based dialogue simulation for automatic strategy learning and evaluation
  • K Scheffler
  • S Young
Scheffler, K., Young, S., 2000. Probabilistic simulation of human–machine dialogues. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2. Scheffler, K., Young, S., 2001. Corpus-based dialogue simulation for automatic strategy learning and evaluation. In: NAACL Workshop on Adaptation in Dialogue Systems, pp. 64–70.
Towards understanding spontaneous speech: Word accuracy vs. concept accuracy / DIPPER: Description and formalisation of an information-state update dialogue system architecture / Developing a Flexible Spoken Dialog System Using Simulation
  • M Boros
  • W Eckert
  • F Gallwitz
  • G Gorz
  • G Hanrieder
  • H Niemann
Boros, M., Eckert, W., Gallwitz, F., Gorz, G., Hanrieder, G., Niemann, H., 1996. Towards understanding spontaneous speech: Word accuracy vs. concept accuracy. In: 4th International Conference on Spoken Language Processing, vol. 2. Bos, J., Klein, E., Lemon, O., Oka, T., 2003. DIPPER: Description and formalisation of an information-state update dialogue system architecture. In: 4th SIGdial Workshop on Discourse and Dialogue, Sapporo, Japan. Chung, G., 2004. Developing a Flexible Spoken Dialog System Using Simulation. Association for Computational Linguistics, pp. 63–70.
User modeling for spoken dialogue system evaluation
  • W Eckert
  • E Levin
  • R Pieraccini
Eckert, W., Levin, E., Pieraccini, R., 1997. User modeling for spoken dialogue system evaluation. In: IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 80–87.
Statistical user simulation with a hidden agenda
  • J Schatzmann
  • B Thomson
  • S Young
Schatzmann, J., Thomson, B., Young, S., 2007c. Statistical user simulation with a hidden agenda. In: SIGdial.