Chapter

Evaluation of Pretrained Large Language Models in Embodied Planning Tasks


Abstract

Modern pretrained large language models (LLMs) are increasingly being used in zero-shot or few-shot learning modes. Recent years have seen growing interest in applying such models to embodied artificial intelligence and robotics tasks. When a task is given in natural language, the agent needs to build a plan based on this prompt. The best existing solutions use LLMs through APIs or models that are not publicly available, making their results difficult to reproduce. In this paper, we use publicly available LLMs to build plans for an embodied agent and evaluate them in three modes of operation: 1) the subtask evaluation mode, 2) full autoregressive plan generation, and 3) step-by-step autoregressive plan generation. We use two prompt settings: a prompt containing examples of a single task, and a mixed prompt with examples of different tasks. Through extensive experiments, we show that the subtask evaluation mode, in most cases, outperforms the others with a task-specific prompt, whereas step-by-step autoregressive plan generation performs better in the mixed prompt setting.

Keywords: Large language models, Plan generation, Planning for embodied agents
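As a rough illustration of the first of these modes, the sketch below scores a fixed set of candidate subtasks by their log-likelihood under a public LLM and picks the best one. This is not the authors' code: the model choice (GPT-J, cited below), the prompt format, and the candidate list are all illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-J-6B is one public model that could fill this role; loading it
# requires roughly 24 GB of RAM.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
model.eval()

def log_likelihood(prompt: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    # The token at position i is predicted by the logits at position i - 1.
    # (Assumes tokenization does not merge across the prompt boundary; the
    # leading space in each candidate below makes that likely with BPE.)
    return sum(logprobs[i - 1, full_ids[0, i]].item()
               for i in range(prompt_len, full_ids.shape[1]))

def choose_subtask(prompt: str, candidates: list[str]) -> str:
    """Subtask evaluation mode: pick the candidate the LLM scores highest."""
    return max(candidates, key=lambda c: log_likelihood(prompt, c))

plan_so_far = "Task: bring me a cup of water.\nStep 1:"
candidates = [" walk to the kitchen", " open the fridge", " pick up the cup"]
print(choose_subtask(plan_so_far, candidates))

In the two autoregressive modes, by contrast, the model generates the plan text (whole or one step at a time) rather than scoring predefined candidates.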

... Large language models (LLMs), capable of solving tasks for which they were not originally trained [3], have found widespread use in embodied AI and robotics tasks [6, 12, 16-18, 25, 27]. In particular, they find applications in task planning. ...
... Nevertheless, countless recent examples of using LLMs in robotic task planning exist. According to the authors of [35], an LLM planner can be used in three modes: the subtask evaluation mode, full autoregressive plan generation, and step-by-step autoregressive plan generation. However, this classification is only sufficient when using the LLM as the planner. ...
Article
Full-text available
This paper presents an experimental study regarding the use of OpenAI’s ChatGPT [1] for robotics applications. We outline a strategy that combines design principles for prompt engineering and the creation of a high-level function library which allows ChatGPT to adapt to different robotics tasks, simulators, and form factors. We focus our evaluations on the effectiveness of different prompt engineering techniques and dialog strategies towards the execution of various types of robotics tasks. We explore ChatGPT’s ability to use free-form dialog, parse XML tags, and synthesize code, in addition to the use of task-specific prompting functions and closed-loop reasoning through dialogues. Our study encompasses a range of tasks within the robotics domain, from basic logical, geometrical, and mathematical reasoning all the way to complex domains such as aerial navigation, manipulation, and embodied agents. We show that ChatGPT can be effective at solving several such tasks, while allowing users to interact with it primarily via natural language instructions. In addition to these studies, we introduce an open-sourced research tool called PromptCraft, which contains a platform where researchers can collaboratively upload and vote on examples of good prompting schemes for robotics applications, as well as a sample robotics simulator with ChatGPT integration, making it easier for users to get started with using ChatGPT for robotics. Videos and blog: aka.ms/ChatGPT-Robotics. PromptCraft and AirSim-ChatGPT code: https://github.com/microsoft/PromptCraft-Robotics
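A minimal sketch of the strategy this paper describes: the prompt documents a small high-level function library, and the model is asked to answer with code wrapped in XML tags, which is then parsed and executed. The robot API names (move_to, grasp, release) and the ask_llm stub are placeholders, not the paper's actual library.

def move_to(x: float, y: float) -> None:
    print(f"moving gripper to ({x}, {y})")

def grasp(obj: str) -> None:
    print(f"grasping {obj}")

def release() -> None:
    print("releasing")

SYSTEM_PROMPT = """You control a robot arm. Use ONLY these functions:
move_to(x, y): move the gripper to position (x, y) in meters.
grasp(obj): close the gripper on the named object.
release(): open the gripper.
Answer with Python code only, wrapped in <code>...</code> XML tags."""

def ask_llm(system: str, user: str) -> str:
    # Placeholder for a chat-completion call to ChatGPT or any chat LLM.
    raise NotImplementedError

def run_instruction(instruction: str) -> None:
    reply = ask_llm(SYSTEM_PROMPT, instruction)
    # Parse the XML-tagged code block out of the free-form reply.
    code = reply.split("<code>")[1].split("</code>")[0]
    # Execute it against the whitelisted function library only.
    exec(code, {"move_to": move_to, "grasp": grasp, "release": release})

# run_instruction("Pick up the block at (0.3, 0.1) and drop it at (0.5, 0.4).")

Restricting the executed code to a named function library is what lets the same prompt scheme transfer across simulators and robot form factors.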
Article
Full-text available
A feature of tasks in embodied artificial intelligence is that a query to an intelligent agent is formulated in natural language. As a result, natural language processing methods have to be used to transform the query into a format convenient for generating an appropriate action plan. There are two basic approaches to the solution of this problem. One is based on specialized models trained with particular instances of instructions translated into agent-executable format. The other approach relies on the ability of large language models trained with a large amount of unlabeled data to store common sense knowledge. As a result, such models can be used to generate an agent’s action plan in natural language without preliminary learning. This paper provides a detailed review of models based on the second approach as applied to embodied artificial intelligence tasks.
Article
Full-text available
There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment. We hypothesize that semantic propositional content is an important component of human caption evaluation, and propose a new automated caption evaluation metric defined over scene graphs coined SPICE. Extensive evaluations across a range of models and datasets indicate that SPICE captures human judgments over model-generated captions better than other automatic metrics (e.g., system-level correlation of 0.88 with human judgments on the MS COCO dataset, versus 0.43 for CIDEr and 0.53 for METEOR). Furthermore, SPICE can answer questions such as "which caption-generator best understands colors?" and "can caption-generators count?"
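At its core the metric is an F-score over matched scene-graph tuples (objects, attributes, relations). The toy sketch below computes that F-score on hand-written tuples; it omits the caption parsing and WordNet synonym matching that real SPICE performs.

def spice_f1(candidate: set, reference: set) -> float:
    """F1 over scene-graph tuples shared by candidate and reference."""
    if not candidate or not reference:
        return 0.0
    matched = candidate & reference
    precision = len(matched) / len(candidate)
    recall = len(matched) / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# 1-tuples are objects, 2-tuples attributes, 3-tuples relations.
ref = {("girl",), ("girl", "young"), ("girl", "standing"),
       ("table",), ("girl", "next-to", "table")}
cand = {("girl",), ("table",), ("girl", "sitting"),
        ("girl", "next-to", "table")}
print(f"SPICE (toy) = {spice_f1(cand, ref):.3f}")  # 0.667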
Article
Full-text available
Byte pair encoding (BPE) is a simple universal text compression scheme. Decompression is very fast and requires a small work space. Moreover, it is easy to decompress an arbitrary part of the original text. However, it has not been so popular, since compression is rather slow and the compression ratio is not as good as that of other methods such as Lempel-Ziv type compression. In this paper, we bring out a potential advantage of BPE compression. We show that it is very suitable from the practical viewpoint of compressed pattern matching, where the goal is to find a pattern directly in compressed text without decompressing it explicitly. We compare running times to find a pattern in (1) BPE compressed files, (2) Lempel-Ziv-Welch compressed files, and (3) original text files, in various situations. Experimental results show that pattern matching in BPE compressed text is even faster than matching in the original text. Thus BPE compression reduces not only the disk space but also the searching…
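For intuition, here is a minimal sketch of the BPE scheme the abstract refers to: the most frequent adjacent symbol pair is repeatedly replaced by a fresh symbol, and decompression replays the substitution table in reverse. Real BPE reuses unused byte values where this sketch uses string stand-ins.

from collections import Counter
from itertools import count

def bpe_compress(text: str):
    symbols = list(text)
    table = {}  # new symbol -> the pair it replaced, in creation order
    fresh = (f"<{i}>" for i in count())  # stand-ins for unused byte values
    while True:
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:  # no pair worth replacing
            break
        sym = next(fresh)
        table[sym] = pair
        out, i = [], 0
        while i < len(symbols):  # greedy left-to-right replacement
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(sym)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols, table

def bpe_decompress(symbols, table):
    # Expand in reverse creation order; later symbols may contain earlier ones.
    for sym in reversed(list(table)):
        a, b = table[sym]
        out = []
        for s in symbols:
            out.extend((a, b) if s == sym else [s])
        symbols = out
    return "".join(symbols)

syms, tbl = bpe_compress("abababcabab")
assert bpe_decompress(syms, tbl) == "abababcabab"

Because each fresh symbol stands for a fixed pair, a pattern can likewise be translated into the compressed alphabet and matched without decompressing, which is the advantage the paper exploits.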
Article
Language models (LMs) have demonstrated their capability in possessing commonsense knowledge of the physical world, a crucial aspect of performing tasks in everyday life. However, it remains unclear whether they have the capacity to generate grounded, executable plans for embodied tasks. This is a challenging task as LMs lack the ability to perceive the environment through vision and feedback from the physical environment. In this paper, we address this important research question and present the first investigation into the topic. Our novel problem formulation, named G-PlanET, inputs a high-level goal and a data table about objects in a specific environment, and then outputs a step-by-step actionable plan for a robotic agent to follow. To facilitate the study, we establish an evaluation protocol and design a dedicated metric, KAS, to assess the quality of the plans. Our experiments demonstrate that the use of tables for encoding the environment and an iterative decoding strategy can significantly enhance the LMs' ability in grounded planning. Our analysis also reveals interesting and non-trivial findings.
Ahn, M., Brohan, A., Brown, N., Chebotar, Y., et al.: Do as I can, not as I say: grounding language in robotic affordances (2022)
Brown, T., et al.: Language models are few-shot learners. In: NeurIPS (2020)
Huang, W., Abbeel, P., Pathak, D., Mordatch, I.: Language models as zero-shot planners: extracting actionable knowledge for embodied agents. In: ICML (2022)
Wei, J., et al.: Finetuned language models are zero-shot learners (2021)
Wei, J., et al.: Chain of thought prompting elicits reasoning in large language models (2022)
Wang, B., Komatsuzaki, A.: GPT-J-6B: a 6 billion parameter autoregressive language model (2021). https://github.com/kingoflolz/mesh-transformer-jax