Preprint

LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench


Abstract

The ability to plan a course of action that achieves a desired state of affairs has long been considered a core competence of intelligent agents and has been an integral part of AI research since its inception. With the advent of large language models (LLMs), there has been considerable interest in the question of whether or not they possess such planning abilities. PlanBench, an extensible benchmark we developed in 2022, soon after the release of GPT3, has remained an important tool for evaluating the planning abilities of LLMs. Despite the slew of new private and open source LLMs since GPT3, progress on this benchmark has been surprisingly slow. OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs--making it a new kind of model: a Large Reasoning Model (LRM). Using this development as a catalyst, this paper takes a comprehensive look at how well current LLMs and new LRMs do on PlanBench. As we shall see, while o1's performance is a quantum improvement on the benchmark, outpacing the competition, it is still far from saturating it. This improvement also brings to the fore questions about accuracy, efficiency, and guarantees which must be considered before deploying such systems.
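PlanBench scores models by mechanically checking each generated plan against the domain's action semantics rather than by eyeballing the text. As a toy illustration of that validation step, here is a simplified Blocksworld plan checker in Python; the benchmark itself works with PDDL domains and an external plan validator, and every name and encoding below is illustrative, not PlanBench's actual code.

```python
# Toy Blocksworld: a state maps each block to what it sits on ("table" or a block).

def apply(state, action):
    """Apply a (move, block, dest) action; return the new state, or None if illegal."""
    op, block, dest = action
    if any(on == block for on in state.values()):
        return None  # block must be clear (nothing on top of it)
    if dest != "table" and any(on == dest for on in state.values()):
        return None  # destination block must also be clear
    new_state = dict(state)
    new_state[block] = dest
    return new_state

def validate_plan(state, plan, goal):
    """Execute the plan step by step; succeed only if every action is legal
    and all goal conditions hold in the final state."""
    for action in plan:
        state = apply(state, action)
        if state is None:
            return False
    return all(state[b] == on for b, on in goal.items())

# A on table, B on A; goal: A on B.
init = {"A": "table", "B": "A"}
goal = {"A": "B"}
plan = [("move", "B", "table"), ("move", "A", "B")]
print(validate_plan(init, plan, goal))  # True
```

A plan that tries to move A while B still sits on it fails the first legality check, which is exactly the kind of subtle error an LLM-generated plan can contain and a text-level reading can miss.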


Article
While humans sometimes do show the capability of correcting their own erroneous guesses with self‐critiquing, there seems to be no basis for that assumption in the case of LLMs.
Article
Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning, owing to their reputed difficulty among the world’s best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges, resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. On a test set of 30 latest olympiad-level problems, AlphaGeometry solves 25, outperforming the previous best method that only solves ten problems and approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist. Notably, AlphaGeometry produces human-readable proofs, solves all geometry problems in the IMO 2000 and 2015 under human expert evaluation and discovers a generalized version of a translated IMO theorem in 2004.
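The division of labor in such a neuro-symbolic prover can be caricatured in a few lines of Python: a symbolic engine forward-chains deduction rules to a fixpoint, and when the goal is out of reach, a proposer (the neural model's role) injects an auxiliary construction and deduction resumes. All rules and facts below are placeholders, not geometric statements, and this is a sketch of the control flow only, not AlphaGeometry's machinery.

```python
def deduce(facts, rules):
    """Forward-chain: repeatedly apply rules (premises -> conclusion)
    until no new fact can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if set(premises) <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

def prove(facts, rules, goal, propose_auxiliary):
    """Deduce to a fixpoint; if stuck, ask the proposer for one
    auxiliary construction and deduce again."""
    facts = deduce(facts, rules)
    if goal in facts:
        return True
    facts.add(propose_auxiliary(facts))
    return goal in deduce(facts, rules)

rules = [(("p", "aux"), "q"), (("q",), "goal")]
print(prove({"p"}, rules, "goal", propose_auxiliary=lambda f: "aux"))  # True
```

Without the auxiliary fact the symbolic engine alone never reaches the goal, which mirrors the abstract's point that the language model's job is to supply the constructions the deduction engine cannot invent on its own.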
Article
Large language models (LLMs) have demonstrated tremendous capabilities in solving complex tasks, from quantitative reasoning to understanding natural language. However, LLMs sometimes suffer from confabulations (or hallucinations), which can result in them making plausible but incorrect statements. This hinders the use of current large models in scientific discovery. Here we introduce FunSearch (short for searching in the function space), an evolutionary procedure based on pairing a pretrained LLM with a systematic evaluator. We demonstrate the effectiveness of this approach to surpass the best-known results in important problems, pushing the boundary of existing LLM-based approaches. Applying FunSearch to a central problem in extremal combinatorics—the cap set problem—we discover new constructions of large cap sets going beyond the best-known ones, both in finite dimensional and asymptotic cases. This shows that it is possible to make discoveries for established open problems using LLMs. We showcase the generality of FunSearch by applying it to an algorithmic problem, online bin packing, finding new heuristics that improve on widely used baselines. In contrast to most computer search approaches, FunSearch searches for programs that describe how to solve a problem, rather than what the solution is. Beyond being an effective and scalable strategy, discovered programs tend to be more interpretable than raw solutions, enabling feedback loops between domain experts and FunSearch, and the deployment of such programs in real-world applications.
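The evaluate-then-evolve loop this abstract describes can be sketched in a few lines. The snippet below is a deliberately tiny stand-in: the candidate "programs" are reduced to a single threshold parameter of a toy bin-packing rule, and the LLM's proposal step is replaced by random perturbation of the best candidate; every name and number here is illustrative, not FunSearch's actual machinery.

```python
import random

def evaluator(threshold, items, capacity=10):
    """Systematic evaluator: count bins used by a packing heuristic that opens
    a new bin whenever remaining space drops below `threshold` (fewer is better)."""
    bins, space = 0, 0
    for item in items:
        if item > space or space < threshold:
            bins, space = bins + 1, capacity
        space -= item
    return bins

def search(items, rounds=20, seed=0):
    """Evolutionary loop: score candidates, keep the best, mutate it."""
    rng = random.Random(seed)
    population = [0.0, 5.0]  # initial candidate thresholds
    for _ in range(rounds):
        best = min(population, key=lambda t: evaluator(t, items))
        # Stand-in for the LLM step: propose a mutated variant of the best.
        population = [best, best + rng.uniform(-1, 1)]
    return min(population, key=lambda t: evaluator(t, items))

items = [6, 4, 3, 7, 5, 2]
print(evaluator(search(items), items))  # → 3
```

The key design point survives the simplification: the evaluator is exact and external, so the generator (here a random mutator, in FunSearch an LLM) can be wrong most of the time without corrupting the result, since only candidates that score well are kept.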
Conference Paper
We report on the results of applying classical planning techniques to the problem of analyzing computer network vulnerabilities. Specifically, we are concerned with the generation of Adversary Courses of Action, which are extended sequences of exploits leading from some initial state to an attacker's goal. In this application, we have demonstrated the generation of attack plans for a simple but realistic web-based document control system, with excellent performance compared to the prevailing state of the art in this area. In addition to the new capabilities gained in the area of vulnerability analysis, this implementation provided some insights into performance and modeling issues for classical planning systems, both specifically with regard to Metric-FF and other forward heuristic planners, and more generally for classical planning. To facilitate additional work in this area, the domain model on which this work was done will be made freely available. See the paper's Conclusion for details.
Article
Fast Downward is a classical planning system based on heuristic search. It can deal with general deterministic planning problems encoded in the propositional fragment of PDDL2.2, including advanced features like ADL conditions and effects and derived predicates (axioms). Like other well-known planners such as HSP and FF, Fast Downward is a progression planner, searching the space of world states of a planning task in the forward direction. However, unlike other PDDL planning systems, Fast Downward does not use the propositional PDDL representation of a planning task directly. Instead, the input is first translated into an alternative representation called multi-valued planning tasks, which makes many of the implicit constraints of a propositional planning task explicit. Exploiting this alternative representation, Fast Downward uses hierarchical decompositions of planning tasks for computing its heuristic function, called the causal graph heuristic, which is very different from traditional HSP-like heuristics based on ignoring negative interactions of operators. In this article, we give a full account of Fast Downward's approach to solving multi-valued planning tasks. We extend our earlier discussion of the causal graph heuristic to tasks involving axioms and conditional effects and present some novel techniques for search control that are used within Fast Downward's best-first search algorithm: preferred operators transfer the idea of helpful actions from local search to global best-first search, deferred evaluation of heuristic functions mitigates the negative effect of large branching factors on search performance, and multi-heuristic best-first search combines several heuristic evaluation functions within a single search algorithm in an orthogonal way.
We also describe efficient data structures for fast state expansion (successor generators and axiom evaluators) and present a new non-heuristic search algorithm called focused iterative-broadening search, which utilizes the information encoded in causal graphs in a novel way. Fast Downward has proven remarkably successful: it won the classical (i.e., propositional, non-optimising) track of the 4th International Planning Competition at ICAPS 2004, following in the footsteps of planners such as FF and LPG. Our experiments show that it also performs very well on the benchmarks of the earlier planning competitions and provide some insights about the usefulness of the new search enhancements.
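The progression (forward state-space) search scheme this abstract describes can be sketched compactly. Below is a minimal greedy best-first search in Python over propositional states, with a simple goal-count heuristic standing in for Fast Downward's causal graph heuristic; the operators and state encoding are illustrative, not the planner's actual data structures.

```python
import heapq
import itertools

def best_first_search(init, goal, operators, heuristic):
    """Greedy best-first progression search: expand states in the forward
    direction, ordered by the heuristic estimate to the goal."""
    counter = itertools.count()  # tie-breaker so the heap never compares states
    frontier = [(heuristic(init, goal), next(counter), init, [])]
    seen = {init}
    while frontier:
        _, _, state, plan = heapq.heappop(frontier)
        if goal <= state:           # all goal propositions hold
            return plan
        for name, pre, add, delete in operators:
            if pre <= state:        # operator is applicable
                succ = (state - delete) | add
                if succ not in seen:
                    seen.add(succ)
                    heapq.heappush(frontier,
                                   (heuristic(succ, goal), next(counter), succ, plan + [name]))
    return None                     # no plan exists

def goal_count(state, goal):
    """Toy heuristic: number of goal propositions not yet true."""
    return len(goal - state)

# Operators: (name, preconditions, add effects, delete effects).
ops = [
    ("pick", frozenset({"at-home"}), frozenset({"have-keys"}), frozenset()),
    ("drive", frozenset({"have-keys", "at-home"}), frozenset({"at-work"}), frozenset({"at-home"})),
]
plan = best_first_search(frozenset({"at-home"}), frozenset({"at-work"}), ops, goal_count)
print(plan)  # → ['pick', 'drive']
```

Fast Downward's contributions sit on top of exactly this skeleton: a much more informed heuristic, preferred operators, and deferred heuristic evaluation all modify how the frontier is ordered and expanded, not the basic progression loop.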
Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. PlanBench: An extensible benchmark for evaluating large language models on planning and reasoning about change. Advances in Neural Information Processing Systems, 36, 2024.
OpenAI. OpenAI o1 system card. Preprint, 2024.
Subbarao Kambhampati. My (pure) speculation about what OpenAI o1 might be doing, 2024. URL https://x.com/rao2z/status/1834354533931385203. Accessed: 2024-09-14.
OpenAI. Introducing OpenAI o1-preview, 2024. URL https://openai.com/index/introducing-openai-o1-preview/. Accessed: 2024-09-19.
Benj Edwards. Ban warnings fly as users dare to probe the "thoughts" of OpenAI's latest model.
Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. Chain of thoughtlessness: An analysis of CoT in planning. arXiv preprint arXiv:2405.04776, 2024.
Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. GPT3-to-plan: Extracting plans from text using GPT-3. arXiv preprint arXiv:2106.07131, 2021.
Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Kaya Stechly, Mudit Verma, Siddhant Bhambri, Lucas Saldyt, and Anil Murthy. LLMs can't plan, but can help planning in LLM-modulo frameworks. arXiv preprint arXiv:2402.01817, 2024.
Noam Brown. o1 gets it right almost always, 2024. URL https://x.com/polynoamial/status/1834280720493412724. Accessed: 2024-09-14.
Michael Katz, Harsha Kokel, Kavitha Srinivas, and Shirin Sohrabi. Thought of Search: Planning with language models through the lens of efficiency, 2024. URL https://arxiv.org/abs/2404.11833.
Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. AI agents that matter, 2024. URL https://arxiv.org/abs/2407.01502.
Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. LLM+P: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477, 2023.
Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models: a critical investigation. arXiv preprint arXiv:2305.15771, 2023.
Tan Zhi Xuan. Log probs returned by OpenAI's API are *incredibly* unstable, 2023. URL https://x.com/xuanalogue/status/1653280462935146496. Accessed: 2024-09-14.
Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. TravelPlanner: A benchmark for real-world planning with language agents. arXiv preprint arXiv:2402.01622, 2024.
Huaixiu Steven Zheng, Swaroop Mishra, Hugh Zhang, Xinyun Chen, Minmin Chen, Azade Nova, Le Hou, Heng-Tze Cheng, Quoc V Le, Ed H Chi, et al. Natural Plan: Benchmarking LLMs on natural language planning. arXiv preprint arXiv:2406.04520, 2024.