Antoine Raux

Honda Research Institute USA, Inc., Mountain View, California, United States

Publications (34)

  • ABSTRACT: In this paper, we address issues in situated language understanding in a rapidly changing environment – a moving car. Specifically, we propose methods for understanding user queries about specific target buildings in their surroundings. Unlike previous studies on physically situated interactions such as interaction with mobile robots, the task is very sensitive to timing because the spatial relation between the car and the target changes while the user is speaking. We collected situated utterances from drivers using our research system, Townsurfer, which is embedded in a real vehicle. Based on this data, we analyze the timing of user queries, spatial relationships between the car and targets, the head pose of the user, and linguistic cues. Optimized on the data, our algorithms improved the target identification rate by 24.1% absolute.
    SIGDIAL 2014; 06/2014
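    The paper's core problem, ranking candidate buildings against a timed, multi-modal query, can be illustrated with a small sketch. Everything below (the geometry, the linear weights, the function names) is an assumption for illustration, not Townsurfer's actual scoring.

    ```python
    # Hypothetical target scoring: combine the car-to-building bearing at
    # query time with the driver's head pose; weights are invented.
    import math

    def target_score(car_xy, car_heading_deg, head_pose_deg, building_xy):
        dx = building_xy[0] - car_xy[0]
        dy = building_xy[1] - car_xy[1]
        bearing = math.degrees(math.atan2(dy, dx))      # direction to building
        gaze = car_heading_deg + head_pose_deg          # absolute gaze direction
        diff = abs((bearing - gaze + 180) % 360 - 180)  # smallest angular gap
        dist = math.hypot(dx, dy)
        return -0.5 * diff - 0.01 * dist                # higher is better

    buildings = {"cafe": (30.0, 40.0), "bank": (-50.0, 10.0)}
    best = max(buildings,
               key=lambda b: target_score((0, 0), 90.0, 10.0, buildings[b]))
    print(best)  # -> "cafe": roughly in the gaze direction and nearby
    ```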
  • ABSTRACT: In this paper, we present Townsurfer, a situated multi-modal dialog system for vehicles. The system integrates multi-modal inputs of speech, geo-location, gaze (face direction) and dialog history to answer drivers' queries about their surroundings. To select the appropriate data source for answering queries, we apply belief tracking across the above modalities. We conducted a preliminary data collection and an evaluation focusing on the effect of gaze (head direction) and geo-location estimation. We report the results and our analysis of the data.
    Proceedings of the 6th workshop on Eye gaze in intelligent human machine interaction: gaze in multimodal interaction; 12/2013
  • ABSTRACT: Many modern spoken dialog systems use probabilistic graphical models to update their belief over the concepts under discussion, increasing robustness in the face of noisy input. However, such models are ill-suited to probabilistic reasoning about spatial relationships between entities. In particular, a car navigation system that infers users' intended destination using nearby landmarks as descriptions must be able to use distance measures as a factor in inference. In this paper, we describe a belief tracking system for a location identification task that combines a semantic belief tracker for categorical concepts, based on the DPOT framework (Raux and Ma, 2011), with a kernel density estimator that incorporates landmark evidence from multiple turns and landmark hypotheses into a posterior probability over candidate locations. We evaluate our approach on a corpus of destination setting dialogs and show that it significantly outperforms a deterministic baseline.
    Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue; 07/2012
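    The kernel density idea, accumulating a kernel around each landmark hypothesis to form a posterior over candidate destinations, is easy to sketch. The coordinates, confidences and bandwidth below are invented for illustration; this is not the paper's implementation.

    ```python
    # Hypothetical KDE-style fusion of landmark evidence across turns.
    import numpy as np

    def location_posterior(landmark_obs, candidates, bandwidth=200.0):
        """landmark_obs: iterable of (x, y, confidence) hypotheses (meters).
        candidates: (n, 2) array of candidate destination coordinates.
        Returns a normalized posterior over the candidates."""
        candidates = np.asarray(candidates, dtype=float)
        scores = np.zeros(len(candidates))
        for x, y, conf in landmark_obs:
            d2 = np.sum((candidates - np.array([x, y])) ** 2, axis=1)
            # Gaussian kernel around each hypothesis, weighted by confidence.
            scores += conf * np.exp(-d2 / (2.0 * bandwidth ** 2))
        return scores / scores.sum()

    # Two turns mention landmarks near (0, 0) and (50, 0); the nearby
    # candidate absorbs almost all of the probability mass.
    print(location_posterior([(0.0, 0.0, 0.9), (50.0, 0.0, 0.6)],
                             [(10.0, 0.0), (500.0, 500.0)]))
    ```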
  • ABSTRACT: Developing interactive robots is an extremely challenging task which requires a broad range of expertise across diverse disciplines, including robotic planning, spoken language understanding, belief tracking and action management. While there has been a boom in recent years in the development of reusable components for robotic systems within common architectures, such as the Robot Operating System (ROS), little emphasis has been placed on developing components for human-robot interaction. In this paper we introduce HRItk (the Human-Robot-Interaction toolkit), a framework consisting of messaging protocols, core components, and development tools for rapidly building speech-centric interactive systems within the ROS environment. The proposed toolkit was specifically designed for extensibility, ease of use, and rapid development, allowing developers to quickly incorporate speech interaction into existing projects.
    NAACL-HLT Workshop on Future Directions and Needs in the Spoken Dialog Community: Tools and Data; 06/2012
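    For a flavor of what a speech-centric ROS component looks like, here is a minimal subscriber node. This is not HRItk's actual API; the topic name and message type are assumptions, and only the standard rospy calls are real.

    ```python
    # Hypothetical ROS node consuming speech-recognition hypotheses.
    import rospy
    from std_msgs.msg import String

    def on_hypothesis(msg):
        # A real system would pass this on to understanding/belief tracking.
        rospy.loginfo("heard hypothesis: %s", msg.data)

    rospy.init_node("speech_consumer")
    rospy.Subscriber("/speech/hypothesis", String, on_hypothesis)  # assumed topic
    rospy.spin()
    ```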
  • Antoine Raux, Maxine Eskenazi
    ABSTRACT: Even as progress in speech technologies and task and dialog modeling has allowed the development of advanced spoken dialog systems, the low-level interaction behavior of those systems often remains rigid and inefficient. Based on an analysis of human-human and human-computer turn-taking in naturally occurring task-oriented dialogs, we define a set of features that can be automatically extracted and show that they can be used to inform efficient end-of-turn detection. We then frame turn-taking as decision making under uncertainty and describe the Finite-State Turn-Taking Machine (FSTTM), a decision-theoretic model that combines data-driven machine learning methods and a cost structure derived from Conversation Analysis to control the turn-taking behavior of dialog systems. Evaluation results on CMU Let's Go, a publicly deployed bus information system, confirm that the FSTTM significantly improves the responsiveness of the system compared to a standard threshold-based approach, as well as previous data-driven methods.
ACM Transactions on Speech and Language Processing (TSLP); 01/2012
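    The decision-theoretic core of the FSTTM, comparing the expected cost of grabbing the floor against the expected cost of waiting, can be sketched in a few lines. The cost values and interface below are illustrative assumptions, not the paper's parameters.

    ```python
    # Hypothetical FSTTM-style end-of-turn decision rule.
    def should_take_turn(p_user_done, silence_ms,
                         cost_cut_in=1000.0, latency_cost_per_ms=1.0):
        """Speak when the expected cost of cutting the user off drops
        below the expected cost of the latency incurred by waiting."""
        expected_cost_speak = (1.0 - p_user_done) * cost_cut_in
        expected_cost_wait = p_user_done * latency_cost_per_ms * silence_ms
        return expected_cost_speak < expected_cost_wait

    # As silence grows (and end-of-turn becomes likelier), the balance tips:
    for ms in (200, 600, 1200):
        print(ms, should_take_turn(p_user_done=0.7, silence_ms=ms))
    ```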
  • AI Magazine. 01/2011; 32:93-100.
  • AI Magazine. 01/2011; 32(4):15-16. · 0.73 Impact Factor
  • Antoine Raux, Yi Ma
    INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, August 27-31, 2011; 01/2011
  • ABSTRACT: We introduce a novel approach for robust belief tracking of user intention within a spoken dialog system. The space of user intentions is modeled by a probabilistic extension of the underlying domain ontology called a probabilistic ontology tree (POT). POTs embody a principled approach to leverage the dependencies among domain concepts and incorporate corroborating or conflicting dialog observations in the form of interpreted user utterances across dialog turns. We tailor standard inference algorithms to the POT framework to efficiently compute the user intentions in terms of m-best most probable explanations. We empirically validate the efficacy of our POT and compare it to a hierarchical frame-based approach in experiments with users of a tourism information system.
Proceedings of the SIGDIAL 2010 Conference, The 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 24-25 September 2010, Tokyo, Japan; 01/2010
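    The POT idea, a domain ontology whose nodes carry probability distributions so that evidence about one concept reshapes the joint belief, can be sketched with a two-node tree. The domain, probabilities and noise model below are invented for illustration.

    ```python
    # Hypothetical two-node probabilistic ontology tree with a noisy update.
    from itertools import product

    p_cuisine = {"italian": 0.5, "japanese": 0.5}  # root concept
    p_price_given = {                              # child given root
        "italian": {"cheap": 0.3, "expensive": 0.7},
        "japanese": {"cheap": 0.6, "expensive": 0.4},
    }

    def update(obs_concept, obs_value, p_correct=0.8):
        """Joint posterior over (cuisine, price) after one noisy observation."""
        posterior = {}
        for c, p in product(p_cuisine, ("cheap", "expensive")):
            prior = p_cuisine[c] * p_price_given[c][p]
            value = {"cuisine": c, "price": p}[obs_concept]
            likelihood = p_correct if value == obs_value else 1.0 - p_correct
            posterior[(c, p)] = prior * likelihood
        z = sum(posterior.values())
        return {k: v / z for k, v in posterior.items()}

    # Hearing "cheap" shifts belief toward japanese even at the root concept:
    print(update("price", "cheap"))
    ```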
  • Antoine Raux, Mikio Nakano
    ABSTRACT: In spoken communications, correction utterances, which are utterances correcting other participants' utterances and behaviors, play crucial roles, and detecting them is one of the key issues. Previously, much work has been done on automatic detection of correction utterances in human-human and human-computer dialogs, but it mostly dealt with the correction of erroneous utterances. However, in many real situations, especially in communications between humans and mobile robots, misunderstandings manifest themselves not only through utterances but also through physical actions performed by the participants. In this paper, we focus on action corrections and propose a classification of such utterances into Omission, Commission, and Degree corrections. We present the results of our analysis of correction utterances in dialogs between two humans engaged in a kind of online computer game, where one participant plays the role of the remote manager of a convenience store and the other plays the role of a robot store clerk. We analyzed the linguistic content, prosody, and timing of correction utterances and found that all features were significantly correlated with action corrections.
Proceedings of the SIGDIAL 2010 Conference, The 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 24-25 September 2010, Tokyo, Japan; 01/2010
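    The features the paper correlates with action corrections (linguistic content, prosody, timing) lend themselves to a simple classifier. The feature set, data and labels below are fabricated purely to illustrate the setup.

    ```python
    # Hypothetical detector for action-correction utterances.
    from sklearn.linear_model import LogisticRegression

    # Each row: [has_negation_word, mean_pitch_z, seconds_after_robot_action]
    X = [[1, 1.2, 0.4], [0, -0.3, 2.5], [1, 0.8, 0.6], [0, 0.1, 3.0]]
    y = [1, 0, 1, 0]  # 1 = action correction, 0 = other utterance

    clf = LogisticRegression().fit(X, y)
    print(clf.predict([[1, 1.0, 0.5]]))  # likely flagged as a correction
    ```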
  • INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010; 01/2010
  • Antoine Raux, Maxine Eskenazi
    ABSTRACT: This paper introduces the Finite-State Turn-Taking Machine (FSTTM), a new model to control the turn-taking behavior of conversational agents. Based on a non-deterministic finite-state machine, the FSTTM uses a cost matrix and decision-theoretic principles to select a turn-taking action at any time. We show how the model can be applied to the problem of end-of-turn detection. Evaluation results on a deployed spoken dialog system show that the FSTTM provides significantly higher responsiveness than previous approaches.
    Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, May 31 - June 5, 2009, Boulder, Colorado, USA; 01/2009
  • Antoine Raux, Maxine Eskenazi
    ABSTRACT: This paper describes a novel algorithm to dynamically set endpointing thresholds based on a rich set of dialogue features to detect the end of user utterances in a dialogue system. By analyzing the relationship between silences in users' speech to a spoken dialogue system and a wide range of automatically extracted features from discourse, semantics, prosody, timing and speaker characteristics, we found that all features correlate with pause duration and with whether a silence indicates the end of the turn, with semantics and timing being the most informative. Based on these features, the proposed method reduces latency by up to 24% over a fixed threshold baseline. Offline evaluation results were confirmed by implementing the proposed algorithm in the Let's Go system.
    06/2008;
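    The gist of the algorithm, a silence threshold that shrinks or grows with dialogue features instead of staying fixed, can be sketched as follows. The features, multipliers and base threshold are illustrative guesses, not the learned model.

    ```python
    # Hypothetical feature-dependent endpointing threshold.
    def endpoint_threshold_ms(parse_is_complete, user_barged_in, base_ms=700.0):
        threshold = base_ms
        if parse_is_complete:   # semantics: a complete parse makes end-of-turn likelier
            threshold *= 0.5
        if user_barged_in:      # timing: barge-in turns tend to be short
            threshold *= 0.8
        return threshold

    def is_end_of_turn(silence_ms, **features):
        return silence_ms >= endpoint_threshold_ms(**features)

    print(is_end_of_turn(400, parse_is_complete=True, user_barged_in=False))
    ```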
  • Maxine Eskenazi, Antoine Raux
    ABSTRACT: The CMU Let's Go Spoken Dialogue System has been used daily for about three years to answer calls to the Pittsburgh Port Authority for bus information in the evening and on weekends. This has resulted in a database of over 50,000 spoken dialogues as of January 2008, one of the largest publicly available sets of this type of data. While retraining the system with part of this data, it became apparent that there are times of the day, of the week and of the year when the average number of successful calls is significantly higher. We will present evidence, using these three measures of time (hour, day of week, month of year) and criteria such as signal-to-noise ratio, estimated success rate, number of turns per dialogue, number of non-understandings per dialogue, and barge-in rate, to detect the regular, predictable appearance of high and low success rates, and suggest methods for palliating this effect in order to increase overall dialogue success rates.
    The Journal of the Acoustical Society of America 06/2008; 123(5):3881. · 1.65 Impact Factor
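    Surfacing such periodic patterns from call logs is a straightforward aggregation; the toy log below is fabricated and is not the Let's Go corpus.

    ```python
    # Hypothetical per-hour success-rate aggregation over a call log.
    import pandas as pd

    calls = pd.DataFrame({
        "hour":    [19, 19, 23, 23, 23, 8],
        "success": [1,  1,  0,  0,  1,  1],
    })
    print(calls.groupby("hour")["success"].mean())  # success rate by hour
    ```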
  • A. Raux, M. Eskenazi
    ABSTRACT: We present a new architecture for spoken dialogue systems that explicitly separates the discrete, abstract representation used in the high-level dialogue manager from the continuous, real-time nature of real-world events. We propose to use the concept of the conversational floor as a means to synchronize the internal state of the dialogue manager with the real world. To act as the interface between these two layers, we introduce a new component called the Interaction Manager. The proposed architecture was implemented as a new version of the Olympus framework, which can be used across different domains and modalities. We confirmed the practicality of the approach by porting Let's Go, an existing deployed dialogue system, to the new architecture.
IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2007; 01/2008
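    The layering this paper proposes, an interaction manager that tracks the conversational floor in real time and forwards only discrete events upward, can be sketched as below. The class names and event interface are assumptions, not Olympus code.

    ```python
    # Hypothetical interaction-manager layer mediating the conversational floor.
    from enum import Enum

    class Floor(Enum):
        USER = 1
        SYSTEM = 2
        FREE = 3

    class InteractionManager:
        def __init__(self, dialog_manager):
            self.dm = dialog_manager
            self.floor = Floor.FREE

        def on_user_speech_start(self):
            self.floor = Floor.USER       # continuous event, absorbed here

        def on_user_utterance_end(self, hypothesis):
            self.floor = Floor.FREE
            self.dm.handle(hypothesis)    # discrete event, forwarded to the DM

    class EchoDM:
        def handle(self, hypothesis):
            print("DM received:", hypothesis)

    im = InteractionManager(EchoDM())
    im.on_user_speech_start()
    im.on_user_utterance_end("when is the next 61C")
    ```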
  • ABSTRACT: This short paper is intended to advertise Let's Go Lab, a platform for the evaluation of spoken dialog research. Unlike other dialog platforms, in addition to example dialog data and a portable software system, Let's Go Lab affords evaluation with real users. Let's Go has served the Pittsburgh public with bus schedule information since 2005, answering more than 52,000 calls to date.
    INTERSPEECH 2008, 9th Annual Conference of the International Speech Communication Association, Brisbane, Australia, September 22-26, 2008; 01/2008
  • ABSTRACT: This tutorial will give a practical description of the free Carnegie Mellon Olympus 2 Spoken Dialog Architecture. Building real working dialog systems that are robust enough for the general public to use is difficult. Most frequently, the functionality of the conversations is severely limited, often down to simple question-answer pairs. While off-the-shelf toolkits help the development of such simple systems, they do not support more advanced, natural dialogs, nor do they offer the transparency and flexibility required by computational linguistics researchers. Olympus 2, by contrast, offers a complete dialog system with automatic speech recognition (Sphinx) and synthesis (SAPI, Festival) and has been used, along with previous versions of Olympus, for teaching and research at Carnegie Mellon and elsewhere for some five years. Overall, a dozen dialog systems have been built using various versions of Olympus, handling tasks ranging from providing bus schedule information, to guiding users through maintenance procedures for complex machinery, to personal calendar management. In addition to simplifying the development of dialog systems, Olympus provides a transparent platform for teaching and conducting research on all aspects of dialog systems, including speech recognition and synthesis, natural language understanding and generation, and dialog and interaction management. The tutorial will give a brief introduction to spoken dialog systems before going into detail about how to create your own dialog system within Olympus 2, using the Let's Go bus information system as an example. Further, we will provide guidelines on how to use an actual deployed spoken dialog system such as Let's Go to validate research results in the real world. As a possible testbed for such research, we will describe Let's Go Lab, which provides access to both the Let's Go system and its genuine user population for research experiments.
    ACL 2008, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, June 15-20, 2008, Columbus, Ohio, USA, Tutorial Abstracts; 01/2008
  • ABSTRACT: We introduce Olympus, a freely available framework for research in conversational interfaces. Olympus' open, transparent, flexible, modular and scalable nature facilitates the development of large-scale, real-world systems, and enables research leading to technological and scientific advances in conversational spoken language interfaces. In this paper, we describe the overall architecture, several systems spanning different domains, and a number of current research efforts supported by Olympus.
    04/2007;
  • ABSTRACT: Spoken dialog systems typically use a limited number of non-understanding recovery strategies and simple heuristic policies to engage them (e.g. first ask the user to repeat, then give help, then transfer to an operator). We propose a supervised, online method for learning a non-understanding recovery policy over a large set of recovery strategies. The approach consists of two steps: first, we construct runtime estimates of the likelihood of success of each recovery strategy, and then we use these estimates to construct a policy. An experiment with a publicly available spoken dialog system shows that the learned policy produced a 12.5% relative improvement in the non-understanding recovery rate.
IEEE Spoken Language Technology Workshop, 2006; 01/2007
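    The two-step approach, maintaining online success estimates per strategy and greedily picking the best one at each non-understanding, is easy to sketch. The strategy names and counts below are made up for illustration.

    ```python
    # Hypothetical online policy over non-understanding recovery strategies.
    success_counts = {"ask_repeat": [3, 10], "give_help": [6, 10], "move_on": [4, 10]}

    def pick_strategy():
        # successes / attempts as a runtime estimate of P(success | strategy)
        return max(success_counts,
                   key=lambda s: success_counts[s][0] / success_counts[s][1])

    def record_outcome(strategy, succeeded):
        success_counts[strategy][1] += 1
        success_counts[strategy][0] += int(succeeded)

    print(pick_strategy())  # -> "give_help" with these counts
    ```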
  • ABSTRACT: We describe ConQuest, an open-source, reusable spoken dialog system that provides technical program information during conferences. The system uses a transparent, modular and open infrastructure, and aims to enable applied research in spoken language interfaces. The conference domain is a good platform for applied research since it permits periodical redeployments and evaluations with a real user-base. In this paper, we describe the system's functionality and overall architecture, and we discuss two initial deployments.
    Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, April 22-27, 2007, Rochester, New York, USA; 01/2007

Publication Stats

404 Citations
2.37 Total Impact Points

Institutions

  • 2009–2013
    • Honda Research Institute USA, Inc.
      Mountain View, California, United States
  • 2003–2008
    • Carnegie Mellon University
      • Language Technologies Institute
      Pittsburgh, PA, United States
    • Kyoto University
Kyoto, Kyōto, Japan