March 2025 · 1 Read
Artificial Intelligence
March 2025 · 1 Read
Artificial Intelligence
November 2024 · 7 Reads
Previous work has attempted to boost Large Language Model (LLM) performance on planning and scheduling tasks through a variety of prompt engineering techniques. While these methods can work within the distributions tested, they are neither robust nor predictable. This limitation can be addressed through compound LLM architectures, in which LLMs work in conjunction with other components to ensure reliability. In this paper, we present a technical evaluation of one such compound architecture: the LLM-Modulo framework, in which an LLM is paired with a complete set of sound verifiers that validate its output and re-prompt it when it fails. This approach ensures that the system never emits a fallacious output, and therefore that every output it does generate is guaranteed correct--something previous techniques have not been able to claim. Our results, evaluated across four scheduling domains, demonstrate significant performance gains with the LLM-Modulo framework across various models. Additionally, we explore modifications to the base configuration of the framework and assess their impact on overall system performance.
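To make the verification loop concrete, here is a minimal sketch of the generate-verify-re-prompt cycle the abstract describes. The call_llm callable, the verifier signature, and the re-prompt wording are assumptions made for illustration, not the framework's actual implementation:

    from typing import Callable, List, Optional, Tuple

    Verifier = Callable[[str], Tuple[bool, str]]  # returns (is_valid, critique)

    def llm_modulo(prompt: str,
                   call_llm: Callable[[str], str],
                   verifiers: List[Verifier],
                   max_rounds: int = 10) -> Optional[str]:
        """Return a candidate that passes every verifier, or None if the budget runs out."""
        current_prompt = prompt
        for _ in range(max_rounds):
            candidate = call_llm(current_prompt)
            critiques = [msg for ok, msg in (v(candidate) for v in verifiers) if not ok]
            if not critiques:
                return candidate  # every sound verifier accepted the output
            # Re-prompt with the collected critiques appended as feedback.
            current_prompt = (prompt + "\n\nYour previous answer was rejected:\n"
                              + "\n".join(critiques) + "\nPlease revise it.")
        return None  # no verified output found within the budget

Because the loop only returns a candidate that every sound verifier accepts, any answer it does return carries the correctness guarantee referred to above; the price is additional LLM calls for each rejected attempt.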
October 2024 · 8 Reads
The ability to plan a course of action that achieves a desired state of affairs has long been considered a core competence of intelligent agents and has been an integral part of AI research since its inception. With the advent of large language models (LLMs), there has been considerable interest in the question of whether or not they possess such planning abilities, but -- despite the slew of new private and open source LLMs since GPT3 -- progress has remained slow. OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs -- making it a new kind of model: a Large Reasoning Model (LRM). In this paper, we evaluate the planning capabilities of two LRMs (o1-preview and o1-mini) on both planning and scheduling benchmarks. We find that while o1 does offer significant improvements over autoregressive LLMs, these come at a steep inference cost, and the model still fails to provide any guarantees over what it generates. We also show that combining o1 models with external verifiers -- in a so-called LRM-Modulo system -- guarantees the correctness of the combined system's output while further improving performance.
September 2024 · 47 Reads · 1 Citation
The ability to plan a course of action that achieves a desired state of affairs has long been considered a core competence of intelligent agents and has been an integral part of AI research since its inception. With the advent of large language models (LLMs), there has been considerable interest in the question of whether or not they possess such planning abilities. PlanBench, an extensible benchmark we developed in 2022, soon after the release of GPT3, has remained an important tool for evaluating the planning abilities of LLMs. Despite the slew of new private and open source LLMs since GPT3, progress on this benchmark has been surprisingly slow. OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs--making it a new kind of model: a Large Reasoning Model (LRM). Using this development as a catalyst, this paper takes a comprehensive look at how well current LLMs and new LRMs do on PlanBench. As we shall see, while o1's performance is a quantum improvement on the benchmark, outpacing the competition, it is still far from saturating it. This improvement also brings to the fore questions about accuracy, efficiency, and guarantees which must be considered before deploying such systems.
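For context on what sound, external checking of a plan involves, the following is a toy validator in the spirit of the checkers used to score such benchmarks: it simulates a STRIPS-like plan step by step and accepts it only if every action's preconditions hold when it is applied and the goal holds at the end. The domain encoding is a made-up Blocksworld fragment, not PlanBench's own code:

    # Actions map a name to (preconditions, add effects, delete effects), all sets of facts.
    def validate_plan(init: set, goal: set, plan: list, actions: dict) -> bool:
        state = set(init)
        for name in plan:
            pre, add, delete = actions[name]
            if not pre <= state:          # a precondition is unmet at this step
                return False
            state = (state - delete) | add
        return goal <= state              # the goal must hold in the final state

    # Example: stack block a on b starting from both blocks on the table.
    actions = {
        "pickup_a": ({"clear_a", "ontable_a", "handempty"},
                     {"holding_a"}, {"clear_a", "ontable_a", "handempty"}),
        "stack_a_b": ({"holding_a", "clear_b"},
                      {"on_a_b", "clear_a", "handempty"}, {"holding_a", "clear_b"}),
    }
    init = {"clear_a", "clear_b", "ontable_a", "ontable_b", "handempty"}
    print(validate_plan(init, {"on_a_b"}, ["pickup_a", "stack_a_b"], actions))  # True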
July 2024 · 4 Reads · 1 Citation
Artificial Intelligence
May 2024 · 204 Reads · 1 Citation
As the applicability of Large Language Models (LLMs) extends beyond traditional text processing tasks, there is a burgeoning interest in their potential to excel in planning and reasoning assignments, realms traditionally reserved for System 2 cognitive competencies. Despite their perceived versatility, the research community is still unraveling effective strategies to harness these models in such complex domains. The recent discourse introduced by the paper on LLM Modulo marks a significant stride, proposing a conceptual framework that enhances the integration of LLMs into diverse planning and reasoning activities. This workshop paper delves into the practical application of this framework within the domain of travel planning, presenting a specific instance of its implementation. We use the TravelPlanning benchmark by the OSU NLP group, which evaluates the performance of LLMs in producing valid itineraries from user queries posed in natural language. While popular methods of enhancing the reasoning abilities of LLMs such as Chain of Thought, ReAct, and Reflexion achieve a meager 0%, 0.6%, and 0% with GPT3.5-Turbo respectively, our operationalization of the LLM-Modulo framework for the TravelPlanning domain provides a remarkable improvement, enhancing baseline performance by 4.6x for GPT4-Turbo and even more for older models such as GPT3.5-Turbo (from 0% to 5%). Furthermore, we highlight the other useful roles of LLMs in the planning pipeline suggested in LLM-Modulo that can be reliably operationalized, such as extracting useful critics and reformulating critic feedback.
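As an illustration of the critic-plus-reformulation pattern mentioned above, the sketch below shows one hypothetical hard critic (a budget check) and a reformulator that turns its raw critique into a back-prompt. The itinerary schema and helper names are assumptions for the example, not the benchmark's or the paper's API:

    def budget_critic(itinerary: list, budget: float):
        """Each itinerary entry is a dict with a 'cost' field (assumed schema)."""
        total = sum(item.get("cost", 0.0) for item in itinerary)
        if total <= budget:
            return True, ""
        return False, f"Total cost {total:.2f} exceeds the budget of {budget:.2f}."

    def reformulate_for_prompt(critiques: list) -> str:
        """Turn raw critic messages into a back-prompt the LLM can act on."""
        return ("The proposed itinerary is invalid:\n"
                + "\n".join(f"- {msg}" for msg in critiques)
                + "\nPlease revise the itinerary to satisfy these constraints.")

    ok, msg = budget_critic([{"cost": 320.0}, {"cost": 180.0}], budget=400.0)
    if not ok:
        print(reformulate_for_prompt([msg]))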
May 2024 · 19 Reads
Reinforcement Learning (RL) suffers from sample inefficiency in sparse reward domains, and the problem is pronounced if there are stochastic transitions. To improve sample efficiency, reward shaping is a well-studied approach to introduce intrinsic rewards that can help the RL agent converge to an optimal policy faster. However, designing a useful reward shaping function specific to each problem is challenging, even for domain experts. They would either have to rely on task-specific domain knowledge or provide an expert demonstration independently for each task. Given that Large Language Models (LLMs) have rapidly gained prominence across a multitude of natural language tasks, we aim to answer the following question: Can we leverage LLMs to construct a reward shaping function that can boost the sample efficiency of an RL agent? In this work, we aim to leverage off-the-shelf LLMs to generate a guide policy by solving a simpler deterministic abstraction of the original problem, which can then be used to construct the reward shaping function for the downstream RL agent. Given the ineffectiveness of directly prompting LLMs, we propose MEDIC: a framework that augments LLMs with a Model-based feEDback critIC, which verifies LLM-generated outputs, to generate a possibly sub-optimal but valid plan for the abstract problem. Our experiments across domains from the BabyAI environment suite 1) show the effectiveness of augmenting LLMs with MEDIC, 2) demonstrate a significant improvement in the sample complexity of PPO- and A2C-based RL agents when guided by our LLM-generated plan, and 3) point toward further exploration of how these models can be used to augment existing RL pipelines.
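A minimal sketch of how a guide plan over a deterministic abstraction could be turned into a potential-based shaping bonus, under the assumption that progress along the plan is measured by an abstraction function; MEDIC's actual components are not reproduced here:

    def make_shaping_fn(guide_plan: list, abstract, gamma: float = 0.99):
        """guide_plan: ordered abstract states the LLM-generated plan passes through."""
        index = {s: i for i, s in enumerate(guide_plan)}

        def potential(state) -> float:
            # Potential grows with progress along the guide plan (0 if off-plan).
            return float(index.get(abstract(state), -1) + 1)

        def shaped_bonus(state, next_state) -> float:
            # Potential-based shaping (Ng et al., 1999) preserves the optimal policy.
            return gamma * potential(next_state) - potential(state)

        return shaped_bonus

Usage inside a standard RL loop would be r_total = r_env + shaped_bonus(s, s_next); because the bonus is potential-based, it accelerates learning without changing which policy is optimal.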
May 2024 · 21 Reads · 1 Citation
The reasoning abilities of Large Language Models (LLMs) remain a topic of debate. Some methods, such as ReAct-based prompting, have gained popularity for claiming to enhance the sequential decision-making abilities of agentic LLMs. However, the source of any improvement in LLM reasoning with ReAct-based prompting is unclear. In this paper, we examine the claims that ReAct-based prompting improves the sequential decision-making of agentic LLMs. By introducing systematic variations to the input prompt, we perform a sensitivity analysis along the claims of ReAct and find that performance is minimally influenced by the "interleaving reasoning trace with action execution" or by the content of the generated reasoning traces in ReAct, contrary to the original claims and common usage. Instead, the performance of LLMs is driven by the similarity between the input example tasks and the queries, implicitly forcing the prompt designer to provide instance-specific examples and significantly increasing the cognitive burden on the human. Our investigation shows that the perceived reasoning abilities of LLMs stem from exemplar-query similarity and approximate retrieval rather than any inherent reasoning ability.
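To illustrate the exemplar-query similarity effect described above, here is a deliberately simple stand-in measure a prompt designer could use to pick the closest few-shot example; the bag-of-words Jaccard score is an assumption for illustration, not the similarity metric used in the paper's analysis:

    def jaccard(a: str, b: str) -> float:
        """Bag-of-words overlap between two task descriptions."""
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    def pick_exemplar(query: str, exemplars: list) -> str:
        """Return the few-shot example most similar to the query task."""
        return max(exemplars, key=lambda ex: jaccard(ex, query))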
May 2024 · 16 Reads
From its inception, AI has had a rather ambivalent relationship with humans -- swinging between their augmentation and replacement. Now, as AI technologies enter our everyday lives at an ever-increasing pace, there is a greater need for AI systems to work synergistically with humans. One critical requirement for such synergistic human-AI interaction is that the AI systems be explainable to the humans in the loop. To do this effectively, AI agents need to go beyond planning with their own models of the world, and take into account the mental model of the human in the loop. Drawing from several years of research in our lab, we will discuss how the AI agent can use these mental models to either conform to human expectations, or change those expectations through explanatory communication. While the main focus of the book is on cooperative scenarios, we will point out how the same mental models can be used for obfuscation and deception. Although the book is primarily driven by our own research in these areas, in every chapter, we will provide ample connections to relevant research from other groups.
March 2024 · 8 Reads · 2 Citations
Proceedings of the AAAI Conference on Artificial Intelligence
Our work aims at efficiently leveraging ambiguous demonstrations for the training of a reinforcement learning (RL) agent. An ambiguous demonstration can usually be interpreted in multiple ways, which severely hinders the RL agent from learning stably and efficiently. Since an optimal demonstration may also suffer from being ambiguous, previous works that combine RL and learning from demonstration (RLfD works) may not work well. Inspired by how humans handle such situations, we propose to use self-explanation (an agent generates explanations for itself) to recognize valuable high-level relational features as an interpretation of why a successful trajectory is successful. This way, the agent can leverage the explained important relations as guidance for its RL learning. Our main contribution is the Self-Explanation for RL from Demonstrations (SERLfD) framework, which can overcome the limitations of existing RLfD works. Our experimental results show that an RLfD model can be improved in terms of training stability and performance by using our SERLfD framework. To foster further research in self-explanation-guided robot learning, we have made our demonstrations and code publicly accessible at https://github.com/YantianZha/SERLfD. For a deeper understanding of our work, interested readers can refer to our arXiv version at https://arxiv.org/pdf/2110.05286.pdf, including an accompanying appendix.
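One way to picture "explained important relations as guidance" is to score candidate relational predicates by how well they separate successful from failed demonstrations and reward the agent for making highly scored predicates true. The sketch below is only an illustration of that intuition under assumed trajectory and predicate formats; it is not SERLfD's algorithm:

    def predicate_scores(successes, failures, predicates):
        """Each trajectory is a list of states; each predicate maps a state to a bool."""
        def rate(trajs, p):
            return sum(any(p(s) for s in t) for t in trajs) / max(len(trajs), 1)
        # Positive score: the relation appears more often in successful trajectories.
        return {name: rate(successes, p) - rate(failures, p)
                for name, p in predicates.items()}

    def guidance_bonus(state, scores, predicates, scale=0.1):
        """Small intrinsic bonus for satisfying relations that explain success."""
        return scale * sum(w for name, w in scores.items() if predicates[name](state))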
... Perceived inconsistencies can arise for other reasons, for instance, when the user's mental model of the world does not match up with the information the agent is acting on. Even if an agent is taking actions that align with the user's goals, its actions may appear misaligned if the human's model of the world is different [72]; think of a shopping agent purchasing what appears to be an overly expensive widget because it knows that the cheaper model is incompatible with the user's needs, but fails to consider the user's budget limitations. ...
Reference:
Challenges in Human-Agent Communication
July 2024
Artificial Intelligence
... In particular, in the context of deep RL, Guan et al. (2021) provide coarse symbolic feedback in the form of object-centric image regions to accompany binary feedback on an agent's proposed actions. Another interesting use of symbolic explanations in RL is that of Zha et al. (2021), in which an RL agent learns to better understand human demonstrations by grounding these in human-aligned symbolic representations. ...
March 2024
Proceedings of the AAAI Conference on Artificial Intelligence
... We are interested in providing explanations that aid operators in interpreting solutions generated by complex multirobot systems that incorporate task allocation, scheduling, and motion planning into their decision making. Prior XAI work has addressed this challenge by introducing techniques for generating explanations for task allocation [36], [35], scheduling [24], [9], and motion planning [15] independently. However, recent work in the multi-robot community has shown that the close interdependency between these three subproblems (i.e., determining which robots should perform which tasks affects the timing/schedule of those tasks and, in turn, the motion plans required for their execution) is most effectively addressed by holistic solutions that consider all three challenges together [31], [27], [30]. ...
March 2024
Proceedings of the AAAI Conference on Artificial Intelligence
... Beyond language, appropriate social cues require inferring and predicting the beliefs of others (Smith, 2010; Bradford et al., 2015). While the extent to which language models truly possess a theory of mind remains a subject of debate (Ullman, 2023; Verma et al., 2024; Strachan et al., 2024), recent advancements in instruction fine-tuning and alignment techniques have enhanced AI capabilities to infer user intent and respond appropriately to communicative cues (Ouyang et al., 2022). ...
March 2024
... The default objective in non-optimal planning (or search) is to simply produce any plan as quickly as possible, i.e., without regard for quality. A bounded-cost search algorithm takes as input a cost bound, and aims to find a solution within that bound as quickly as possible, i.e., without expending effort on achieving a better-quality solution than required by the bound [Stern et al., 2011]. Bounded suboptimal search algorithms, the most famous of which is Weighted A* [Pohl, 1970], take a relative bound parameter w and ensure the solution found is within a factor w of optimal (see the sketch after this reference). ...
Reference:
A Survey on Plan Optimization
July 2023
Proceedings of the International Conference on Automated Planning and Scheduling
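The w-bounded-suboptimality idea in the snippet above can be made concrete with a short Weighted A* sketch: nodes are ordered by f(n) = g(n) + w·h(n), and with an admissible heuristic the returned solution cost is within a factor w of optimal. The graph and heuristic interfaces below are illustrative assumptions, not code from the cited survey:

    import heapq, itertools

    def weighted_astar(start, goal, neighbors, h, w=1.5):
        """neighbors(n) -> iterable of (successor, edge_cost); h(n) -> heuristic value."""
        tie = itertools.count()                  # breaks ties without comparing states
        open_list = [(w * h(start), next(tie), 0.0, start, [start])]
        best_g = {start: 0.0}
        while open_list:
            f, _, g, node, path = heapq.heappop(open_list)
            if node == goal:
                return path, g                   # cost is within a factor w of optimal
            for succ, cost in neighbors(node):
                g2 = g + cost
                if g2 < best_g.get(succ, float("inf")):
                    best_g[succ] = g2
                    heapq.heappush(open_list,
                                   (g2 + w * h(succ), next(tie), g2, succ, path + [succ]))
        return None, float("inf")

Setting w = 1 recovers plain A*, while larger w trades solution quality for speed, which is exactly the relative-bound behavior the snippet attributes to bounded suboptimal search.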
... While in HIL systems, ablation studies are useful, in AI²L systems, they are essential. Typically, some other important aspects of the evaluation of these AI²L systems are the interpretability, explainability (Sreedharan, Kulkarni, and Kambhampati 2022), interactive capabilities (Zahedi et al. 2023), and generalizability of these systems (Wüst et al. 2024). ...
March 2023
... 35). LLMs are additionally indicted for several other incapacities, including planning (Valmeekam et al., 2023), natural language understanding, folk physics, information retrieval, pragmatics, theory of mind, spatial inference, simple logical reasoning (Dziri et al., 2023), and mathematical reasoning. ...
February 2023
... A popular tool such works introduce for eliciting and estimating trust is the self-report scale [19,25,6]. Works have also looked at developing methods for estimating trust levels through eye-tracking [12], social [15], and other behavioral [26,28] cues. Unfortunately, directly using these measures to drive agent behavior remains quite challenging. ...
Reference:
A Mental Model Based Theory of Trust
March 2022
... Due to the lack of a unified terminology for XAI and its goals, it is still unclear which scales should be used to investigate how explanations generated by robots impact human factors, as numerous scales have been proposed and are used for different goals [18]. While some works lay a more holistic foundation of different scales for their use in robotics [19], others focus on specific scales for evaluating human-robot interaction [20] and their application (e.g. [21], [22]), which are also applicable to evaluate human factors for robot navigation explanations [15]. ...
May 2021
Proceedings of the International Conference on Automated Planning and Scheduling
... Landmarks have an enormous history of use in speeding up the combinatorial search process for planning (Hoffmann, Porteous, and Sebastia 2004) as well as in planning-adjacent tasks like plan recognition (Pereira, Oren, and Meneguzzi 2020). In the past, landmarks have also been used to summarize plans (Chen and Mooney 2011; Grover et al. 2020; Sreedharan et al. 2020b) to the end-user and debug plans (Sreedharan et al. 2020a) for the developer in complex real-world domains such as in the authoring of goal-oriented conversational agents (Muise et al. 2019), as well as for localization in path planning settings (Mataric 1992). To the best of our knowledge, this is the first attempt at using landmarks for plan disambiguation with end users. ...
June 2020
Proceedings of the International Conference on Automated Planning and Scheduling