Maxine Eskenazi’s research while affiliated with Carnegie Mellon University and other places


Publications (189)


Examining Prosody in Spoken Navigation Instructions for People with Disabilities
  • Conference Paper

January 2024

Cathy Jiao · Aaron Steinfeld · Maxine Eskenazi

Overview of the Ninth Dialog System Technology Challenge: DSTC9
  • Article
  • Full-text available

January 2024 · 16 Reads · 2 Citations

IEEE/ACM Transactions on Audio Speech and Language Processing

Seokhwan Kim · [...] · Rajen Subba

This paper introduces the Ninth Dialog System Technology Challenge (DSTC-9). This edition of the DSTC focuses on applying end-to-end dialog technologies to four distinct tasks in dialog systems: (1) task-oriented dialog modeling with unstructured knowledge access, (2) multi-domain task-oriented dialog, (3) interactive evaluation of dialog, and (4) situated interactive multimodal dialog. This paper describes the task definition, provided datasets, baselines, and evaluation setup for each track. We also summarize the results of the submitted systems to highlight the general trends in state-of-the-art technologies for these tasks.


Figures: (1) large language models, comparison of select approximate sizes; (2) turn-level fine-grained metrics on the FED dataset for manually chosen examples over the TNLGv2, BLOOM, OPT, Flan-T5, and InstructGPT models; (3) an example of a prompt with examples from DSTC 10.
Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation

January 2023 · 487 Reads

Language models have steadily increased in size over the past few years. They achieve a high level of performance on various natural language processing (NLP) tasks such as question answering and summarization. Large language models (LLMs) have been used for generation and can now output human-like text. As a result, other downstream tasks in the realm of dialog can now harness the LLMs' language understanding capabilities. Dialog evaluation is one such task, and it is the focus of this paper, which concentrates on prompting with LLMs: BLOOM, OPT, GPT-3, Flan-T5, InstructDial and TNLGv2. The paper shows that the choice of datasets used for training a model contributes both to how well it performs on a task and to how the prompt should be structured. Specifically, the more diverse and relevant the group of datasets a model is trained on, the better it performs at dialog evaluation. This paper also investigates how the number of examples in the prompt and the type of example selection used affect the model's performance.
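As a rough illustration of the prompting setup this abstract describes, the sketch below builds a few-shot prompt that asks an LLM to rate a dialog turn on a 1-5 scale. The prompt wording, the example pool, and the query_llm helper are illustrative stand-ins rather than the paper's actual templates or example-selection strategies; the stub would be replaced with a real call to whichever model (e.g., OPT, Flan-T5, GPT-3) is being evaluated.

```python
# Minimal sketch: few-shot prompting for turn-level dialog evaluation.
# `query_llm` is a hypothetical stand-in, stubbed so the example runs on its own.

FEW_SHOT_EXAMPLES = [
    {
        "context": "User: Can you suggest a quiet cafe nearby?",
        "response": "Sure! Java House on 5th Street is usually quiet on weekdays.",
        "rating": 5,
    },
    {
        "context": "User: What time does the museum close?",
        "response": "I like turtles.",
        "rating": 1,
    },
]

def build_prompt(context: str, response: str) -> str:
    """Assemble an in-context prompt asking the LLM to rate response quality 1-5."""
    lines = ["Rate the quality of the system response on a scale of 1 (poor) to 5 (excellent)."]
    for ex in FEW_SHOT_EXAMPLES:
        lines.append(f"\nContext: {ex['context']}\nResponse: {ex['response']}\nRating: {ex['rating']}")
    lines.append(f"\nContext: {context}\nResponse: {response}\nRating:")
    return "\n".join(lines)

def query_llm(prompt: str) -> str:
    # Placeholder: replace with a real completion call (API or local model).
    return "4"

def rate_turn(context: str, response: str) -> int:
    """Send the few-shot prompt to the model and parse the numeric rating."""
    completion = query_llm(build_prompt(context, response))
    return int(completion.strip().split()[0])

if __name__ == "__main__":
    score = rate_turn("User: Book me a table for two tonight.",
                      "I have reserved a table for two at 7 pm. Anything else?")
    print(f"Predicted turn quality: {score}")
```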


Figures: (1) the DialPort Portal interface, showing the dialog history, an input field for user responses, and feedback buttons ("Like", "Dislike", "Feedback?", "Improve Response?", "End Conversation"); (2) DialCrowd examples and counterexamples with explanations; (3) the home page for a system on the DialPort dashboard, where sections such as "Words and Phrases" and "Graphs" can be expanded to view additional information.
The DialPort tools

August 2022 · 14 Reads

The DialPort project http://dialport.org/, funded by the National Science Foundation (NSF), covers a group of tools and services that aim at fulfilling the needs of the dialog research community. Over the course of six years, several offerings have been created, including the DialPort Portal and DialCrowd. This paper describes these contributions, which will be demoed at SIGDIAL, including implementation, prior studies, corresponding discoveries, and the locations at which the tools will remain freely available to the community going forward.


Figure 1: Facebook advertisement used to recruit users to interact with systems on DialPort.
Interactive Evaluation of Dialog Track at DSTC9

July 2022 · 14 Reads

The ultimate goal of dialog research is to develop systems that can be effectively used in interactive settings by real users. To this end, we introduced the Interactive Evaluation of Dialog Track at the 9th Dialog System Technology Challenge. This track consisted of two sub-tasks. The first sub-task involved building knowledge-grounded response generation models. The second sub-task aimed to extend dialog models beyond static datasets by assessing them in an interactive setting with real users. Our track challenges participants to develop strong response generation models and explore strategies that extend them to back-and-forth interactions with real users. The progression from static corpora to interactive evaluation introduces unique challenges and facilitates a more thorough assessment of open-domain dialog systems. This paper provides an overview of the track, including the methodology and results. Furthermore, it provides insights into how to best evaluate open-domain dialog models.


Figures and tables: (1) visualization of LAD: domain-agnostic algorithms use the schema to create a seed dataset that conveys the necessary structural constraints, large LMs reformulate individual utterances to add linguistic diversity, and validation heuristics ensure adherence to the schema; (2) two example dialogs used to pre-screen AMT workers, with slots sampled from the STAR corpus (Mosig et al., 2020); (3) intent prediction accuracy when training CBEO on one utterance per intent (the seed data) versus the synthetic data produced by LAD, with the full-data results of Mehri and Eric (2021) shown for reference; (4) results on the STAR corpus comparing SAM + LAD with prior zero-shot work and with SAM trained on the full corpus.
LAD: Language Models as Data for Zero-Shot Dialog

July 2022 · 51 Reads

To facilitate zero-shot generalization in task-oriented dialog, this paper proposes Language Models as Data (LAD). LAD is a paradigm for creating diverse and accurate synthetic data which conveys the necessary structural constraints and can be used to train a downstream neural dialog model. LAD leverages GPT-3 to induce linguistic diversity. LAD achieves significant performance gains in zero-shot settings on intent prediction (+15%), slot filling (+31.4 F-1) and next action prediction (+11 F1). Furthermore, an interactive human evaluation shows that training with LAD is competitive with training on human dialogs. LAD is open-sourced, with the code and data available at https://github.com/Shikib/lad.
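The abstract outlines a three-stage recipe (seed data from a schema, LLM reformulation for linguistic diversity, validation heuristics), and the sketch below mirrors that shape in a minimal way. The schema, the seed template, and the paraphrase_with_llm stub are assumptions made for illustration; they are not taken from the released LAD code, which uses GPT-3 for the reformulation step.

```python
# Sketch of the LAD-style recipe: generate seed utterances from a schema, ask a
# large LM to paraphrase them for diversity, then keep only paraphrases that
# still satisfy the schema's structural constraints.

SCHEMA = {
    "intent": "book_restaurant",
    "slots": {"cuisine": "italian", "party_size": "four"},
}

def seed_utterance(schema: dict) -> str:
    """Template-based seed that encodes the structural constraints."""
    slots = schema["slots"]
    return f"I want to book a {slots['cuisine']} restaurant for {slots['party_size']} people."

def paraphrase_with_llm(utterance: str, n: int = 3) -> list[str]:
    # Placeholder for an LLM call that rewrites the utterance more naturally.
    return [
        "Could you get me a table for four at an italian place?",
        "Book an italian restaurant, party of four, please.",
        "Table for four people somewhere italian tonight?",
    ][:n]

def passes_validation(candidate: str, schema: dict) -> bool:
    """Heuristic check: every slot value must survive the rewrite."""
    return all(value.lower() in candidate.lower() for value in schema["slots"].values())

def generate_synthetic_data(schema: dict) -> list[dict]:
    """Seed, reformulate, validate; return labeled synthetic training examples."""
    seed = seed_utterance(schema)
    return [
        {"text": cand, "intent": schema["intent"], "slots": schema["slots"]}
        for cand in paraphrase_with_llm(seed)
        if passes_validation(cand, schema)
    ]

if __name__ == "__main__":
    for example in generate_synthetic_data(SCHEMA):
        print(example)
```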


Table: payment statistics for HITs.
DialCrowd 2.0: A Quality-Focused Dialog System Crowdsourcing Toolkit

July 2022 · 15 Reads

Dialog system developers need high-quality data to train, fine-tune and assess their systems. They often use crowdsourcing for this since it provides large quantities of data from many workers. However, the data may not be of sufficiently good quality. This can be due to the way that the requester presents a task and how they interact with the workers. This paper introduces DialCrowd 2.0 to help requesters obtain higher quality data by, for example, presenting tasks more clearly and facilitating effective communication with workers. DialCrowd 2.0 guides developers in creating improved Human Intelligence Tasks (HITs) and is directly applicable to the workflows used currently by developers and researchers.
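DialCrowd 2.0 itself is a web-based toolkit, so the snippet below is not its API. It is a generic, hypothetical sketch of one quality-control practice the abstract alludes to: screening workers against gold-labeled pre-screening questions before accepting their annotations.

```python
# Generic illustration (not DialCrowd's API): screen crowd workers against
# gold-labeled pre-screening questions and keep only those above a threshold.

GOLD_ANSWERS = {"q1": "complete", "q2": "missing_info", "q3": "redundant_info"}

def passes_prescreen(worker_answers: dict, threshold: float = 0.66) -> bool:
    """Accept a worker only if they agree with enough of the gold answers."""
    correct = sum(worker_answers.get(q) == a for q, a in GOLD_ANSWERS.items())
    return correct / len(GOLD_ANSWERS) >= threshold

submissions = {
    "workerA": {"q1": "complete", "q2": "missing_info", "q3": "redundant_info"},
    "workerB": {"q1": "incomplete", "q2": "missing_info", "q3": "no_issue"},
}

qualified = [w for w, answers in submissions.items() if passes_prescreen(answers)]
print(qualified)  # ['workerA']
```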


Figures and tables: (1) overview of instruction tuning on dialogue, in which a model trained on a mixture of tasks defined through natural language instructions exhibits zero-shot or few-shot generalization to new tasks; (2) the INSTRUCTDIAL task taxonomy, split into classification and generation tasks; (3) model performance improving with the number of seen tasks during training, reported as average accuracy across Eval Selection, Answer Selection, Relation Classification, and Dialfact Classification and average ROUGE-L for Knowledge Grounded Generation and Begins With Generation; (4) zero-shot slot filling results on the Restaurant8k corpus.
Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning

May 2022 · 385 Reads

Instruction tuning is an emergent paradigm in NLP wherein natural language instructions are leveraged with language models to induce zero-shot performance on unseen tasks. Instructions have been shown to enable good performance on unseen tasks and datasets in both large and small language models. Dialogue is an especially interesting area to explore instruction tuning because dialogue systems perform multiple kinds of tasks related to language (e.g., natural language understanding and generation, domain-specific interaction), yet instruction tuning has not been systematically explored for dialogue-related tasks. We introduce InstructDial, an instruction tuning framework for dialogue, which consists of a repository of 48 diverse dialogue tasks in a unified text-to-text format created from 59 openly available dialogue datasets. Next, we explore cross-task generalization ability on models tuned on InstructDial across diverse dialogue tasks. Our analysis reveals that InstructDial enables good zero-shot performance on unseen datasets and tasks such as dialogue evaluation and intent detection, and even better performance in a few-shot setting. To ensure that models adhere to instructions, we introduce novel meta-tasks. We establish benchmark zero-shot and few-shot performance of models trained using the proposed framework on multiple dialogue tasks.
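To make the "unified text-to-text format" concrete, the sketch below casts one dialogue task (intent detection) as an instruction-plus-input string paired with a target string. The instruction wording and field layout are assumptions for illustration, not the exact InstructDial templates.

```python
# Illustrative sketch of casting a dialogue task (here, intent detection) into
# an instruction-based text-to-text format: the model sees an instruction plus
# the dialogue and must generate the intent label as text.

from dataclasses import dataclass

@dataclass
class Text2TextExample:
    source: str  # instruction + input fed to the model
    target: str  # expected generation

def make_intent_example(dialogue: str, options: list[str], intent: str) -> Text2TextExample:
    instruction = (
        "Instruction: Read the dialogue and choose the intent of the last user turn "
        f"from the following options: {', '.join(options)}."
    )
    return Text2TextExample(
        source=f"{instruction}\nDialogue: {dialogue}\nIntent:",
        target=intent,
    )

example = make_intent_example(
    dialogue="User: I'd like to move my dentist appointment to Friday.",
    options=["book_flight", "reschedule_appointment", "check_weather"],
    intent="reschedule_appointment",
)
print(example.source)
print(example.target)
```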


Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges

March 2022 · 91 Reads · 1 Citation

This is a report on the NSF Future Directions Workshop on Automatic Evaluation of Dialog. The workshop explored the current state of the art along with its limitations and suggested promising directions for future work in this important and very rapidly changing area of research.



Citations (58)


... Identifying knowledge sources was also used in the Ninth Dialog System Technology Challenge (DSTC9) with a track called "Beyond domain APIs - Task-oriented conversational modeling with unstructured knowledge access". This track aimed to expand different task-oriented dialog systems by incorporating external unstructured knowledge sources (Gunasekara et al., 2020). The track's purpose was to investigate how to support frictionless task-oriented situations so that the flow of the conversation does not break when users have questions that are out of the scope of APIs/DB but possibly are available in external knowledge sources. ...

Reference:

Conversational Information Seeking
Overview of the Ninth Dialog System Technology Challenge: DSTC9

IEEE/ACM Transactions on Audio Speech and Language Processing

... If the applications were made available on the Play Store, 70 % of the students indicated that they would download them; however, only 5 % were willing to pay for the download. This is contrary to the findings of a study done at Carnegie Mellon University in the United States of America, where non-native English students evaluated an application used during the preparation of scientific presentations [7]. The students thought the application could be used in real-life situations and they were willing to pay between $1 and $2 for it [7]. ...

POLLI: a handheld-based aid for non-native student presentations
  • Citing Conference Paper
  • August 2013

... This is especially true when the evaluation is carried out through user studies, which compensate users for their participation [4]. Therefore, quite a lot of efforts are made, aimed at automating the evaluation, or at least automating certain aspects of the evaluation [19,20,21,22]. But still, as automated metrics do not necessarily capture all aspects of the system's quality, a human evaluation is performed, which usually asks about the naturalness and quality of the generated utterances and flow of dialogue [6,17]. ...

Unsupervised Evaluation of Interactive Dialog with DialoGPT
  • Citing Conference Paper
  • January 2020

... A long-standing goal in task-oriented dialogue research has been zero-shot transfer of critical modules such as the NLU and DST to previously unseen domains and backend APIs (Mehri et al., 2022). To achieve this goal, we need a way to represent new domains and APIs in a format that can be fed to a machine learning model. ...

LAD: Language Models as Data for Zero-Shot Dialog
  • Citing Conference Paper
  • January 2022

... Previous studies primarily focused on text-based zero-shot natural language understanding (NLU) [2,3,4,5], which processes transcripts produced by an automatic speech recognition (ASR) model to create a modular solution to zero-shot SLU. Among these studies, the prompt-based question-answering (QA) framework [6,7,8] has gained popularity, driven by the recent advancements in generative large language models (LLMs) [9,10,11]. This approach involves crafting a descriptive question for each semantic label (e.g. ...

GenSF: Simultaneous Adaptation of Generative Pre-trained Models and Slot Filling
  • Citing Conference Paper
  • January 2021

... To address the generalization issues in neural networks, particularly in task-oriented dialogue systems, various neuro-symbolic methodologies have been investigated. (Mehri and Eskenazi, 2021) proposes schema graphs to generalize across various unseen domains and tasks. In (Romero et al., 2021), the authors fine-tuned GPT-2 to generate the text and symbolic representations. ...

Schema-Guided Paradigm for Zero-Shot Dialog
  • Citing Conference Paper
  • January 2021

... To facilitate zero-shot inference across varied tasks and datasets, Google introduced instruction tuning (Wei et al., 2021). This method trains language models to execute tasks based on natural language instructions (Chakrabarty et al., 2022; Gupta et al., 2022). Whereas the traditional classification paradigm requires that a specific new head is trained for each task, the instruction tuning paradigm maintains the language generation head trained during pre-training and finetunes the model by transforming classification tasks into language generation tasks. ...

InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning
  • Citing Conference Paper
  • January 2022

... A growing body of work has shown that people mirror linguistic patterns produced by technology, as well. For example, people adopt the words and syntactic structures produced by a computer system [9,10] and the pronunciation patterns of text-to-speech (TTS) voices presented across a variety of forms [14,17,19,24,30,49,54,56,57]. However, the magnitude of mirroring often differs when making direct comparisons between a human and technological interlocutor. ...

Prosodic entrainment in an information-driven dialog system
  • Citing Conference Paper
  • September 2012

... Dialogue evaluation research has raised awareness of measuring flexibility and understanding among many other criteria. There exist automated metrics based on NLP models for assessing the quality of dialogues, but their correlation with human judgments needs to be improved on (Mehri et al., 2022; Siro et al., 2022). While TTM is focused on usability metrics (easiness, confidence, speed, likeliness to use), we target dialogue and explanation quality metrics. ...

Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges

... Related work for our study is relatively sparse. Although automatic evaluation of dialogue systems is an active field of research (Yeh, Eskenazi, and Mehri 2021; Khalid and Lee 2022), most of the metrics and approaches focus on evaluating a dialogue at the utterance level (Ghazarian et al. 2020). However, our work focuses on the evaluation of dialogues at the conversation level, mostly produced by AI algorithms such as Graph2Bot, introduced by Bouraoui et al. (2019), which is a tool for assisting conversational agent designers. ...

A Comprehensive Assessment of Dialog Evaluation Metrics
  • Citing Conference Paper
  • January 2021