Arthur Szlam’s research while affiliated with Meta and other places


Publications (134)


CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory
  • Conference Paper

July 2023 · 22 Reads · 106 Citations

Nur (Mahi) Shafiullah · Lerrel Pinto · [...] · Arthur Szlam

A Data Source for Reasoning Embodied Agents

June 2023 · 9 Reads · 3 Citations

Proceedings of the AAAI Conference on Artificial Intelligence

Recent progress in using machine learning models for reasoning tasks has been driven by novel model architectures, large-scale pre-training protocols, and dedicated reasoning datasets for fine-tuning. In this work, to further pursue these advances, we introduce a new data generator for machine reasoning that integrates with an embodied agent. The generated data consists of templated text queries and answers, matched with world-states encoded into a database. The world-states are a result of both world dynamics and the actions of the agent. We show the results of several baseline models on instantiations of train sets. These include pre-trained language models fine-tuned on a text-formatted representation of the database, and graph-structured Transformers operating on a knowledge-graph representation of the database. We find that these models can answer some questions about the world-state, but struggle with others. These results hint at new research directions in designing neural reasoning models and database representations. Code to generate the data and train the models will be released at github.com/facebookresearch/neuralmemory
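
The idea of templated queries whose answers are read off a world-state database can be illustrated with a minimal sketch. This is not the authors' generator: the record fields, the two templates, and the function name `generate_examples` are all assumptions made up for illustration.

```python
import random

# Hypothetical world-state "database": one record per object.
# The field names are assumptions for illustration only.
WORLD_STATE = [
    {"name": "cow", "color": "brown", "x": 3, "z": 7},
    {"name": "cube", "color": "red", "x": 0, "z": 2},
    {"name": "agent", "color": "blue", "x": 1, "z": 1},
]

# Each template pairs a question pattern with a function
# that reads the answer off a single database record.
TEMPLATES = [
    ("Where is the {name}?", lambda r: f"({r['x']}, {r['z']})"),
    ("What color is the {name}?", lambda r: r["color"]),
]


def generate_examples(world_state, templates, n, seed=0):
    """Sample n (query, answer) pairs; answers are computed from the database."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        record = rng.choice(world_state)
        pattern, answer_fn = rng.choice(templates)
        examples.append((pattern.format(name=record["name"]), answer_fn(record)))
    return examples


if __name__ == "__main__":
    for query, answer in generate_examples(WORLD_STATE, TEMPLATES, n=3):
        print(query, "->", answer)
```

Because every answer is derived from the stored state rather than written by hand, a generator of this shape can produce arbitrarily many consistent training pairs as the state changes.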


Figure 2: The architecture of the developed data crowdsourcing collection tool
Figure 3: The example of the game from the multi-turn dataset, where Architect can see the target structure and needs to provide instructions for the Builder (Mehta et al., 2023)
Statistics of Single-Turn Dataset
Examples of pairs of unclear instructions and clarifying questions
Results of the baselines on 'What': Clarification Need Prediction task
Transforming Human-Centered AI Collaboration: Redefining Embodied Agents Capabilities through Interactive Grounded Language Instructions
  • Preprint
  • File available

May 2023 · 158 Reads

Human intelligence's adaptability is remarkable, allowing us to adjust to new tasks and multi-modal environments swiftly. This skill is evident from a young age as we acquire new abilities and solve problems by imitating others or following natural language instructions. The research community is actively pursuing the development of interactive "embodied agents" that can engage in natural conversations with humans and assist them with real-world tasks. These agents must possess the ability to promptly request feedback in case communication breaks down or instructions are unclear. Additionally, they must demonstrate proficiency in learning new vocabulary specific to a given domain. In this paper, we made the following contributions: (1) a crowd-sourcing tool for collecting grounded language instructions; (2) the largest dataset of grounded language instructions; and (3) several state-of-the-art baselines. These contributions are suitable as a foundation for further research.


Figure 1: (top) Baseline vanilla LM directly generates the answer (A) given the context (C) and the question (Q). (middle) Scratchpad allows the model to generate intermediate reasoning tokens before answering the question but after it has seen the context. (bottom) Our Self-Notes method allows the model to deviate from the input context at any time to reason and take notes.
Test Accuracy (in %) for the reasoning and state-tracking tasks. "*" indicates out-of-distribution harder test settings.
Toy-Story setting without ground-truth notes.
Algorithm unsupervised
Learning to Reason and Memorize with Self-Notes

May 2023 · 100 Reads

Large language models have been shown to struggle with limited context memory and multi-step reasoning. We propose a simple method for solving both of these problems by allowing the model to take Self-Notes. Unlike recent scratchpad approaches, the model can deviate from the input context at any time to explicitly think. This allows the model to recall information and perform reasoning on the fly as it reads the context, thus extending its memory and enabling multi-step reasoning. Our experiments on multiple tasks demonstrate that our method can successfully generalize to longer and more complicated instances from their training setup by taking Self-Notes at inference time.
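
The difference between interleaved notes and a scratchpad written after the whole context can be shown with a toy sketch. A hand-written inference rule stands in for the language model here, so this is only an illustration of the interleaving idea, not the method from the paper; the relations `has` and `contains` and the note format are assumptions.

```python
def read_with_self_notes(statements):
    """Read (subject, relation, object) triples in order and insert a note
    as soon as a new fact can be inferred, instead of waiting for the question.

    The relations "has" and "contains" are hypothetical; a fixed rule
    ("whoever has a container also has its contents") replaces the model.
    """
    holder = {}    # item -> who currently has it
    contents = {}  # container -> list of items inside it
    transcript = []
    for subj, rel, obj in statements:
        sentence = f"{subj} {rel} {obj}."
        transcript.append(sentence[0].upper() + sentence[1:])
        if rel == "has":
            holder[obj] = subj
        elif rel == "contains":
            contents.setdefault(subj, []).append(obj)
        # Take notes immediately, repeating until nothing new follows.
        changed = True
        while changed:
            changed = False
            for container, items in contents.items():
                owner = holder.get(container)
                for item in items:
                    if owner is not None and holder.get(item) != owner:
                        holder[item] = owner
                        transcript.append(f"[note: {owner} has {item}]")
                        changed = True
    return transcript, holder


if __name__ == "__main__":
    story = [("Alice", "has", "the box"), ("the box", "contains", "the key")]
    lines, state = read_with_self_notes(story)
    print("\n".join(lines))
```

In this sketch the inferred fact is written into the transcript at the point where it becomes derivable, so a later question about the key can be answered by lookup rather than by multi-step reasoning over the full context.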


Multi-Party Chat: Conversational Agents in Group Settings with Humans and Models

April 2023 · 6 Reads · 3 Citations

Current dialogue research primarily studies pairwise (two-party) conversations, and does not address the everyday setting where more than two speakers converse together. In this work, we both collect and evaluate multi-party conversations to study this more general case. We use the LIGHT environment to construct grounded conversations, where each participant has an assigned character to role-play. We thus evaluate the ability of language models to act as one or more characters in such conversations. Models require two skills that pairwise-trained models appear to lack: (1) being able to decide when to talk; (2) producing coherent utterances grounded on multiple characters. We compare models trained on our new dataset to existing pairwise-trained dialogue models, as well as large language models with few-shot prompting. We find that our new dataset, MultiLIGHT, which we will publicly release, can help bring significant improvements in the group setting.


Infusing Commonsense World Models with Graph Knowledge

January 2023 · 2 Reads

While language models have become more capable of producing compelling language, we find there are still gaps in maintaining consistency, especially when describing events in a dynamically changing world. We study the setting of generating narratives in an open world text adventure game, where a graph representation of the underlying game state can be used to train models that consume and output both grounded graph representations and natural language descriptions and actions. We build a large set of tasks by combining crowdsourced and simulated gameplays with a novel dataset of complex actions in order to construct such models. We find it is possible to improve the consistency of action narration models by training on graph contexts and targets, even if graphs are not present at test time. This is shown both in automatic metrics and human evaluations. We plan to release our code, the new set of tasks, and best performing models.


Collecting Interactive Multi-modal Datasets for Grounded Language Understanding

November 2022 · 16 Reads

Human intelligence can adapt remarkably quickly to new tasks and environments. Starting from a very young age, humans acquire new skills and learn how to solve new tasks either by imitating the behavior of others or by following provided natural language instructions. To facilitate research that can enable similar capabilities in machines, we made the following contributions: (1) formalized the task of a collaborative embodied agent using natural language; (2) developed a tool for extensive and scalable data collection; and (3) collected the first dataset for interactive grounded language understanding.


CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory

October 2022 · 10 Reads

We propose CLIP-Fields, an implicit scene model that can be trained with no direct human supervision. This model learns a mapping from spatial locations to semantic embedding vectors. The mapping can then be used for a variety of tasks, such as segmentation, instance identification, semantic search over space, and view localization. Most importantly, the mapping can be trained with supervision coming only from web-image and web-text trained models such as CLIP, Detic, and Sentence-BERT. When compared to baselines like Mask-RCNN, our method performs better on few-shot instance identification or semantic segmentation on the HM3D dataset with only a fraction of the examples. Finally, we show that using CLIP-Fields as a scene memory, robots can perform semantic navigation in real-world environments. Our code and demonstrations are available here: https://mahis.life/clip-fields/
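
Semantic search over a mapping from locations to embeddings can be sketched as a nearest-neighbor query by cosine similarity. The sketch below replaces the learned implicit field with a small lookup table and uses made-up 3-dimensional vectors, so the names `FIELD` and `semantic_search` and all numbers are assumptions; it is not the CLIP-Fields implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical stand-in for the learned field: (x, y, z) -> embedding.
# A real system would query a trained model at arbitrary locations.
FIELD = {
    (0.0, 0.0, 0.0): (0.9, 0.1, 0.0),
    (2.0, 0.5, 0.0): (0.1, 0.9, 0.1),
    (4.0, 1.0, 0.5): (0.0, 0.2, 0.9),
}


def semantic_search(field, query_embedding):
    """Return the location whose embedding is most similar to the query."""
    return max(field, key=lambda point: cosine(field[point], query_embedding))


if __name__ == "__main__":
    # In practice the query vector would come from a text encoder.
    print(semantic_search(FIELD, (0.0, 1.0, 0.0)))
```

A robot using such a memory would navigate to the returned location; the text encoder that produces the query vector is outside the scope of this sketch.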


Reducing Conversational Agents’ Overconfidence Through Linguistic Calibration

August 2022 · 22 Reads · 117 Citations

Transactions of the Association for Computational Linguistics

While improving neural dialogue agents’ factual accuracy is the object of much research, another important aspect of communication, less studied in the setting of neural dialogue, is transparency about ignorance. In this work, we analyze to what extent state-of-the-art chit-chat models are linguistically calibrated in the sense that their verbalized expression of doubt (or confidence) matches the likelihood that the model’s responses are factually incorrect (or correct). We find that these models are poorly calibrated, yet we show that likelihood of correctness can accurately be predicted. By incorporating such metacognitive features into the training of a controllable generation model, we obtain a dialogue agent with greatly improved linguistic calibration.
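
One way to check whether verbalized confidence matches correctness is to group responses by the confidence they express and compare empirical accuracy across groups. The sketch below assumes each response has already been annotated with a confidence label and a correctness flag; the labels `high` and `low` and the sample data are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_confidence(responses):
    """Compute the fraction of correct responses per verbalized-confidence label.

    responses: iterable of (confidence_label, is_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for label, is_correct in responses:
        totals[label] += 1
        correct[label] += int(is_correct)
    return {label: correct[label] / totals[label] for label in totals}


if __name__ == "__main__":
    # Made-up annotations: a well-calibrated agent would be right
    # more often when it sounds confident than when it expresses doubt.
    annotated = [
        ("high", True), ("high", False), ("high", False),
        ("low", True), ("low", False),
    ]
    print(accuracy_by_confidence(annotated))
```

In this made-up sample the confident answers are correct less often than the hedged ones, which is the kind of mismatch a calibration analysis is meant to surface.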


BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage

August 2022 · 141 Reads · 16 Citations

We present BlenderBot 3, a 175B parameter dialogue model capable of open-domain conversation with access to the internet and a long-term memory, and having been trained on a large number of user defined tasks. We release both the model weights and code, and have also deployed the model on a public web page to interact with organic users. This technical report describes how the model was built (architecture, model and training scheme), and details of its deployment, including safety mechanisms. Human evaluations show its superiority to existing open-domain dialogue agents, including its predecessors (Roller et al., 2021; Komeili et al., 2022). Finally, we detail our plan for continual learning using the data collected from deployment, which will also be publicly released. The goal of this research program is thus to enable the community to study ever-improving responsible agents that learn through interaction.


Citations (61)


... A growing trend in generation tasks is to employ a Transformer-based language model with a decoder component to create new datasets [11].

[Table: surveyed augmentation methods by category]
Simple Augmentation: TransformersDA [10], DAGAM [11], GenAug [12], AuGPT [13], COCA [14], Selection-DA [15], LAMBADA [16], LeCA [17], G-DAUGc [18], MRC-QA [19]
Prompt-based Augmentation: GPT3Mix [20], DA-intent [21], WANLI [22], FlipDA [4], AugESC [23], AugGPT [2], Read-Com [24], DAIL [25], DA-NMT [26], EPA [27], ZeroShotDataAug [28], Dialogue-Convert [29], HiPSTG [30], SUNGEN [31], LLM-powered [1], LLM-PTM [32], Generative-DA [33], ICLEF [34], LLM-DA [35], Synthetic-DA [36], LLM-Assisted [3], LLM2LLM [37], PromptMix [38], Unnatural-instructions [39], GENIUS [5], TAPP [40], X-GEAR [41], InPars [42], ConvAug [43], Promptagator [44], DAPDR [45], UDAPDR [46]
Retrieval-based Augmentation: AugmentedSBERT [47], zicl [48], RetGen [7], Internet-Aug [49], DialogGen [50], ChatPLUG [51], EDGE [52], RGQA [53], CGRG [54], IM-RAG [55], EAE-RAG [56], SeeKeR [57], Efficient-RAG [58], LAPDOG [59], Personae-DA [60]
Hybrid Augmentation: DAICL [61], KAPING [62], ALCE [63], RADA [64], UniMS-RAG [65], QA-Internet [8], ReAct [66]

Among decoder-only approaches, Read-Com [24] leverages GPT-4 [75] to obtain synthesised datasets similar to the original data's style and semantics. LLM-Assisted [3] adopts Llama2 [76] to produce three different level augmentations based on a provided sample in the training set. ...

Reference:

Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities
Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion
  • Citing Conference Paper
  • January 2022

... Large language models (LLMs) have recently shown remarkable capabilities in reasoning-intensive tasks such as coding (Chen et al., 2021; Li et al., 2022; Rozière et al., 2023) and solving complex mathematical problems (Shao et al., 2024; Azerbayev et al., 2024). Prompting strategies like chain-of-thought prompting (Nye et al., 2021; Wei et al., 2022; Kojima et al., 2022; Adolphs et al., 2022) and self-consistency sampling (Wang et al., 2023) enhance these models' final-answer accuracy by encouraging them to articulate intermediate reasoning steps. However, a significant issue remains: even when these methods boost final-answer correctness, the internal reasoning steps are often unreliable or logically inconsistent (Uesato et al., 2022; Lightman et al., 2024). ...

Reason first, then respond: Modular Generation for Knowledge-infused Dialogue
  • Citing Conference Paper
  • January 2022

... Recent visual SLAM approaches further capture visual appearance, drawing on advances in Neural Radiance Fields (NeRF) [42] and its variants [29,43], allowing for photorealistic image synthesis of environments. These advances enable new possibilities in complex downstream tasks, including detailed semantic scene understanding [23], language-guided manipulation [59], and visual navigation [58]. Additionally, neural representations have the advantage of filling unseen regions with smooth geometric estimation and offering a low-memory footprint [45,63,71]. ...

CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory
  • Citing Conference Paper
  • July 2023

... In support of this position, the authors cite data from neurocognitive studies that record similar neuronal mechanisms for spatial and cognitive navigation. As a reasonable alternative to language models, they propose the broad concept of embodied artificial intelligence and, in particular, the actively developing multimodal generative models [4][5][6]. ...

A Data Source for Reasoning Embodied Agents
  • Citing Article
  • June 2023

Proceedings of the AAAI Conference on Artificial Intelligence

... Therefore, it is crucial to develop mechanisms that allow CA to accurately gauge the context and dynamics of group interactions. To decide the timing of interventions, there were prior approaches such as speaker prediction [16,21,102], mentioning with wake word [48], turn-based intervention [52], and proactive intervention strategies [61]. The decision of natural intervention timing allows the AI to craft and present input that integrates smoothly with the flow of the conversation. ...

Multi-Party Chat: Conversational Agents in Group Settings with Humans and Models
  • Citing Preprint
  • April 2023

... In contrast to static evaluation protocols of LLMs, LLM-based agents are tested on open tasks within interactive environments. Among these, systematic benchmarks include API environments for tool learning [47,50,42], text-based game environments [40,9,17,59] for language agent evaluation and multi-modal simulators for embodied agents [54,11,48]. CivRealm [45], a project built upon Freeciv (another similar but unpopular strategy game), primarily focuses on the comparison between LLMs and RL approaches in unit control and economic development. ...

Learning to speak and act in a fantasy text adventure game
  • Citing Article
  • March 2019

... Since we typically care about the LM's confidence in the "semantic space" due to semantic invariance, instead of manipulating logits, a popular approach is to perform additional training for confidence estimation. This could be done on the base LM (either full LM [55,60,161] or partial [82]) with a different loss, or using a separate model on the internal or external representations from the base LM [3,52,89,111,128]. On the other end of the spectrum, without any training, prompting could be used to elicit verbalized confidence values [124,139], or to recalibrate LLM confidence for a particular distribution [71] via in-context learning. ...

Reducing Conversational Agents’ Overconfidence Through Linguistic Calibration

Transactions of the Association for Computational Linguistics

... We created a straightforward Python script to facilitate the AAS. Figure 3 illustrates the AAS approach. To develop the AAS, we designed a prompt chaining system inspired by the method used in Blender Bot-3 (Shuster et al., 2022). Specifically, we treated each prompt as a modular component and linked these modules into a chain for automated assessment. ...

BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage
  • Citing Preprint
  • August 2022

... Here, early work focused on conditioning generation on particular control 'codes' (Keskar et al., 2019). More recently, work focuses on classifier guidance, in which a separate classifier guides token-by-token generation (Krause et al., 2020; Yang and Klein, 2021; Shuster et al., 2021; Arora et al., 2022). Note that our focus is not to introduce a novel algorithm for classifier guidance, but rather to demonstrate how it can be combined with implicit negative feedback to solve a problem in AI-mediated communication, namely the lack of easy integration between different modes of interaction. ...

Am I Me or You? State-of-the-Art Dialogue Models Cannot Maintain an Identity
  • Citing Conference Paper
  • January 2022