Article

Evaluating Large Language Models with RAG Capability: A Perspective from Robot Behavior Planning and Execution


Abstract

Since the remarkable performance of Large Language Models (LLMs) became apparent, their capabilities have been rapidly extended with techniques such as Retrieval-Augmented Generation (RAG). Given their broad applicability and rapid development, it is crucial to consider their impact on social systems. At the same time, assessing these advanced LLMs is challenging because of their extensive capabilities and the complex nature of social systems. In this study, we draw on the similarity between LLMs operating in social systems and humanoid robots operating in open environments. We enumerate the components essential for controlling humanoids in problem solving, which helps us identify the core capabilities of LLMs and assess the effects of any deficiencies within these components. This approach is justified because the effectiveness of humanoid systems has been thoroughly demonstrated and acknowledged. To identify the components humanoids need for problem-solving tasks, we construct an extensive component framework for planning and controlling humanoid robots in an open environment. We then assess the impacts and risks of LLMs for each component, referencing the latest benchmarks to evaluate their current strengths and weaknesses. Following the assessment guided by our framework, we identify capabilities that LLMs lack and concerns they raise for social systems.


... In integrating Elasticsearch (ES) into the Retrieval-Augmented Generation (RAG) framework, the core of the algorithm lies in efficiently retrieving the documents most relevant to a query through ES and passing those documents as context to the Large Language Model (LLM) so that it can generate high-quality answers. This process involves text similarity calculation, a document scoring mechanism, and a final document selection strategy [23]. Below we introduce the key formulas and reasoning involved in this process in detail. ...
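As a concrete illustration of that retrieve-then-generate pipeline, the sketch below shows the two steps in Python. It is a minimal sketch under stated assumptions, not the cited paper's implementation: the index name "documents", the "text" field, the top-k value of 5, and the generate() helper standing in for an actual LLM call are all illustrative, and the client call follows the elasticsearch-py 8.x API.

```python
# Minimal ES-RAG sketch. Assumptions: a local Elasticsearch node, an index
# named "documents" with a "text" field, and elasticsearch-py 8.x.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def retrieve(query: str, k: int = 5) -> list[str]:
    """Score documents against the query (BM25 is the ES default) and keep top-k."""
    resp = es.search(index="documents", query={"match": {"text": query}}, size=k)
    return [hit["_source"]["text"] for hit in resp["hits"]["hits"]]

def answer(query: str) -> str:
    """Concatenate retrieved passages into a context block and prompt the LLM."""
    context = "\n\n".join(retrieve(query))
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)  # generate() is a hypothetical stand-in for an LLM client
```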
Preprint
This study aims to improve the accuracy and quality of large language model (LLM) answers by integrating Elasticsearch into the Retrieval-Augmented Generation (RAG) framework. The experiments use the Stanford Question Answering Dataset (SQuAD) version 2.0 as the test set and compare the performance of different retrieval methods: traditional approaches based on keyword matching or semantic similarity calculation, BM25-RAG and TF-IDF-RAG, and the newly proposed ES-RAG scheme. The results show that ES-RAG not only has clear advantages in retrieval efficiency but also performs well on key metrics such as accuracy, where it is 0.51 percentage points higher than TF-IDF-RAG. In addition, Elasticsearch's powerful search capabilities and rich configuration options enable the question-answering system to better handle complex queries and to respond more flexibly and efficiently to diverse user needs. Future research could explore how to optimize the interaction between Elasticsearch and the LLM, for example by introducing higher-level semantic understanding and context awareness, to achieve a more intelligent and natural question-answering experience.
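For readers unfamiliar with the two baselines named above, the following sketch shows how TF-IDF-RAG and BM25-RAG rank candidate documents before the generation step. It assumes scikit-learn and the rank_bm25 package; the three-document corpus and the query are illustrative toys, not the SQuAD 2.0 setup used in the study.

```python
# Hedged sketch of the two baseline retrievers (TF-IDF cosine similarity and
# BM25); corpus and query are toy data for illustration only.
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the cat sat on the mat", "dogs chase cats", "SQuAD is a QA dataset"]
query = "question answering dataset"

# TF-IDF-RAG baseline: rank documents by cosine similarity in TF-IDF space.
vec = TfidfVectorizer()
doc_vecs = vec.fit_transform(corpus)
tfidf_scores = cosine_similarity(vec.transform([query]), doc_vecs)[0]

# BM25-RAG baseline: rank documents by BM25 score over whitespace tokens.
bm25 = BM25Okapi([doc.split() for doc in corpus])
bm25_scores = bm25.get_scores(query.split())

# Index of the top-ranked document under each method.
print(tfidf_scores.argmax(), bm25_scores.argmax())
```

Both baselines score documents purely from term statistics; the ES-RAG scheme layers Elasticsearch's inverted-index machinery and configurable analyzers on top of the same BM25-style scoring, which is where the reported efficiency gains come from.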
References

Achiam et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Boiko, D.; MacKnight, R.; and Gomes, G. 2023. Emergent autonomous scientific research capabilities of large language models. arXiv preprint arXiv:2304.05332.
Bowman, S. 2023. Eight things to know about large language models. arXiv preprint arXiv:2304.00612.
Bubeck et al. 2023. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.
Casper et al. 2023. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217.
Clark et al. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
Fu et al. 2023. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394.
Ge et al. 2023. OpenAGI: When LLM meets domain experts. arXiv preprint arXiv:2304.04370.
Ghandeharioun et al. 2024. Patchscopes: A unifying framework for inspecting hidden representations of language models. arXiv preprint arXiv:2401.06102.
Gu et al. 2023. A systematic survey of prompt engineering on vision-language foundation models. arXiv preprint arXiv:2307.12980.
Huang et al. 2023. Benchmarking large language models as AI research agents. arXiv preprint arXiv:2310.03302.
Hubinger et al. 2024. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566.
Kanehiro, F.; Hirukawa, H.; and Kajita, S. 2004. OpenHRP: Open architecture humanoid robotics platform. The International Journal of Robotics Research 23(2): 155-165.
Lewis et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems 33: 9459-9474.
Mialon et al. 2023. GAIA: A benchmark for General AI Assistants. arXiv preprint arXiv:2311.12983.
Wang et al. 2023. NaviSTAR: Socially aware robot navigation with hybrid spatio-temporal graph transformer and preference learning. arXiv preprint arXiv:2304.05979.
Wei et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: 24824-24837.
Zheng et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.
Zhong et al. 2023. AGIEval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364.