Unleashing the potential of prompt engineering in
Large Language Models: a comprehensive review
Banghao Chen1, Zhaofeng Zhang1, Nicolas Langrené1*,
Shengxin Zhu2,1*
1Guangdong Provincial Key Laboratory of Interdisciplinary Research and
Application for Data Science, BNU-HKBU United International College,
Zhuhai 519087, Guangdong, China.
2Research Center for Mathematics, Beijing Normal University, No.18,
Jingfeng Road, Zhuhai 519087, Guangdong, China.
*Corresponding author(s). E-mail(s): nicolaslangrene@uic.edu.cn;
shengxin.zhu@bnu.edu.cn;
Contributing authors: chenbanghao@u.nus.edu; zhangzf@umich.edu;
Abstract
This comprehensive review delves into the pivotal role of prompt engineering in
unleashing the capabilities of Large Language Models (LLMs). The development
of Artificial Intelligence (AI), from its inception in the 1950s to the emer-
gence of advanced neural networks and deep learning architectures, has made a
breakthrough in LLMs, with models such as GPT-4o and Claude-3, and in Vision-
Language Models (VLMs), with models such as CLIP and ALIGN. Prompt
engineering is the process of structuring inputs, which has emerged as a cru-
cial technique to maximize the utility and accuracy of these models. This paper
explores both foundational and advanced methodologies of prompt engineering,
including techniques such as self-consistency, chain-of-thought, and generated
knowledge, which significantly enhance model performance. Additionally, it
examines the prompt method of VLMs through innovative approaches such as
Context Optimization (CoOp), Conditional Context Optimization (CoCoOp),
and Multimodal Prompt Learning (MaPLe). Critical to this discussion is the
aspect of AI security, particularly adversarial attacks that exploit vulnerabili-
ties in prompt engineering. Strategies to mitigate these risks and enhance model
robustness are thoroughly reviewed. The evaluation of prompt methods is also
addressed, through both subjective and objective metrics, ensuring a robust
analysis of their efficacy. This review also reflects the essential role of prompt
engineering in advancing AI capabilities, providing a structured framework for
future research and application.
Keywords: prompt engineering, Large Language Models, AI-generated content,
adversarial attacks, evaluation, AI agent, GPT-4, Vision-Language Models
1 Introduction
In recent years, a significant milestone in artificial intelligence research has been
the progression of natural language processing capabilities, primarily attributed to
Large Language Models (LLMs). Many popular models, rooted in the transformer
architecture [1], undergo training on extensive datasets derived from web-based text.
Central to their design is a self-supervised learning objective, which focuses on pre-
dicting subsequent words in incomplete sentences. Such models power Artificial Intelligence-Generated Content (AIGC), and their ability to generate coherent and
contextually relevant responses is a result of this training process, where they learn
to associate words and phrases with their typical contexts.
LLMs operate by encoding the input text into a high-dimensional vector space,
where semantic relationships between words and phrases are preserved. The model
then decodes this representation to generate a response, guided by the learned sta-
tistical patterns [2]. The quality of the response can be influenced by various factors,
including the prompt provided to the model, the model’s hyperparameters, and the
diversity of the training data.
These models, including LLMs such as the GPT series [3,4] by OpenAI, along
with many others (e.g. Gemini [5,6] and Gemini (BARD) [7] by Google, Claude series
by Anthropic [8,9], and Llama series open-source model from Meta [10,11]), have
revolutionized tasks ranging from information extraction to the creation of engaging
content [12]. In parallel, the development of multimodal large models (MMLMs) has
introduced the ability to process and generate not just text, but also images, audio,
and other forms of data, showcasing their flexibility and effectiveness. These mod-
els integrate multiple data modalities into a single framework, demonstrating strong
capabilities in tasks such as image description and visual question answering (VQA).
Early MMLMs include the DALL-E series [13–15], which can generate images from
textual descriptions, and CLIP, which can understand and relate text and image data
in a unified manner [16,17]. More powerful models such as GPT-4o by OpenAI [18]
and Claude 3.5 Sonnet by Anthropic [8,9] excel in multimodal tasks involving text
generation and understanding, integrating natural language processing with various
forms of data to perform diverse and complex tasks. While numerous advanced mod-
els are currently capable of processing audio, the majority of accessible Application
Programming Interfaces (APIs) remain focused on text and vision modalities. With
the gradual introduction of audio APIs, a broad expansion of research in this modal-
ity can be expected [19]. The evolution of LLMs reflects significant strides in AI
research, characterized by increasing model complexity, enhanced training method-
ologies, and broader application potentials. These advancements underline the critical
role of prompt engineering in maximizing the utility and accuracy of these models,
ensuring that they can effectively cater to diverse and dynamic user needs. While this
survey is mainly focused on prompt engineering for LLMs, the inclusion of vision-
language models (VLMs) offers a broader perspective, revealing the potential and
challenges of prompt engineering in handling multimodal data. By integrating research
from both types of models, we can gain a deeper understanding of the applications of
prompt engineering and provide valuable insights for future research and practice.
In real applications, the prompt is the input of the model, and prompt engineering
can result in significant output differences [20]. Modifying both the structure (e.g.,
altering length, arrangement of instances) and the content (e.g., phrasing, choice of
illustrations, directives) of the prompt can exert a notable influence on the model’s
behavior [21,22].
Prompt engineering refers to the systematic design and optimization of input
prompts to guide the responses of LLMs, ensuring accuracy, relevance, and coher-
ence in the generated output. This process is crucial in harnessing the full potential
of these models, making them more accessible and applicable across diverse domains.
Over time, prompt engineering has evolved from an empirical practice into a well-
structured research domain. As illustrated in Figure 1, the historical progression of
prompt engineering showcases significant milestones from the early days of structured
inputs in the 1950s to advanced methodologies such as chain-of-thought prompting
[23] and self-consistency prompting [24] developed in recent years. This review will
primarily focus on techniques emerging from the period of rapid development after
2017.
Early Days of Structured Input (1950s-1980s)
Foundations of AI: Initial developments in AI depended on structured, rule-based inputs, wherein the accuracy and pertinence of these inputs directly impacted system performance. While this did not constitute prompt engineering in the contemporary sense, it underscored the critical importance of formulating well-defined queries for AI systems.
The Emergence of Machine Learning (1980s-1990s)
Evolution of Feature Engineering: Concurrent with the advancement of statistical machine learning, emphasis increasingly shifted towards how data was presented to models. Effective feature engineering became paramount, as it significantly influenced a model's ability to learn and extract meaningful patterns from the training data.
Recurrent Neural Networks (RNNs) and Their Significance in Sequential Data Processing (Late 1990s-2000s)
During the late 1990s, the adoption of RNNs underscored the critical importance of sequential data structures in processing inputs such as text and speech. This era initiated a paradigm shift towards conceptualizing prompts as strategic guides to shape the responses of models over data sequences.
Deep Learning and Complex Inputs (2006-2010)
2006: The introduction of deep learning concepts marked a significant advancement in AI. The realization that networks with greater depth could extract intricate patterns directly from raw data led to a renewed focus on optimizing how data is structured for input, thereby enhancing the networks' learning capabilities.
2010: The deployment of deep neural networks in handling more sophisticated tasks involving unstructured text and image data highlighted the importance of intelligent input configuration. This period saw the nascent development of what would later be recognized as prompt engineering, aiming to refine how data inputs could more effectively guide neural network responses.
Attention Mechanisms and Contextual Inputs (2015-2017)
2015: The development of attention mechanisms, which later became fundamental in models such as Transformer, marked a pivotal advance in model architecture. These mechanisms enabled models to selectively concentrate on various segments of the input data, thereby enhancing their ability to understand context. This innovation underscored the increased importance of carefully designing input structures to maximize the effectiveness of attention-driven processing capabilities.
Rise of Transformers and Explicit Prompt Engineering (2017-Present)
2017: The Transformer model's debut revolutionized machine learning input handling. This architecture demonstrated that prompts could effectively condition models, directly influencing their outputs, thereby highlighting the strategic use of input design.
2018: The emergence of models like BERT and GPT extended the use of prompts beyond specific tasks to a broad range of general applications. This shift turned prompt engineering into an essential competency for leveraging the full potential of these advanced models.
2020: With the release of GPT-3, the capacity for generating contextually appropriate and nuanced responses based solely on prompts, without requiring additional training, emphasized the critical importance of meticulous prompt design in achieving desired outcomes.
Advanced Prompt Engineering Techniques (2020-Present)
2020 onwards: Development of techniques such as prompt programming, chain-of-thought prompting, and systematic prompt design, which are seen as ways to control and guide AI behavior more effectively.
Fig. 1 History of the development in prompt engineering.
Contemporary prompt engineering encompasses a spectrum of techniques, rang-
ing from foundational approaches such as role-prompting [25] to more sophisticated
methods such as chain-of-thought prompting [23]. The domain remains dynamic, with
emergent research continually unveiling novel techniques and applications in prompt
engineering. The importance of prompt engineering is accentuated by its ability to
guide model responses, thereby amplifying the versatility and relevance of LLMs in
various sectors. Importantly, a well-constructed prompt can counteract challenges
such as machine hallucinations [26,27]. The influence of prompt engineering extends
to numerous disciplines. For instance, it has facilitated the creation of robust fea-
ture extractors using LLMs, thereby improving their efficacy in tasks such as defect
detection and classification [28].
This paper aims to provide a comprehensive review of the prompt engineering
techniques proposed so far within the realm of LLMs. The structure of the paper is
organized as follows: Section 2 explores the foundational methods of prompt engineering, emphasizing the importance of clear and precise instructions, role-prompting, and iterative attempts to optimize outputs. In Section 3, advanced methodologies such as chain-of-thought, self-consistency, and generated knowledge are introduced to guide models in generating high-quality content. Section 4 discusses methodologies specific to VLMs, including Context Optimization (CoOp), Conditional Context Optimization (CoCoOp), and Multimodal Prompt Learning (MaPLe), which enhance the performance of VLMs [29]. Section 5 assesses the efficacy of various prompt methods through both subjective and objective evaluations, ensuring a robust analysis of their effectiveness. Section 6 briefly explores the applications of prompt engineering across diverse fields such as education, content creation, computer programming, and reasoning tasks, highlighting its broad impact. Section 7 addresses the security implications of prompt engineering, identifying common vulnerabilities in LLMs and reviewing strategies to enhance security such as adversarial training. Finally, Section 8 explores prospective methodologies, emphasizing the importance of understanding AI model structures and the potential of AI agents in advancing AI-generated content tools. This structured framework provides a complete overview of the pivotal role of prompt engineering in advancing AI capabilities and guiding future research and applications.
2 Basics of prompt engineering
By incorporating just a few key elements, one can craft a basic prompt that enables
LLMs to produce high-quality answers. In this section, some essential components of
a well-made prompt will be discussed and examples of these methods will be shown.
2.1 Model introduction: GPT-4
All of the examples in the following sections are generated by GPT-4, developed by
OpenAI [4]. Vast amounts of text data have been used to train GPT-4, whose number of parameters has been estimated to be roughly an order of magnitude larger than the 175 billion parameters used for the earlier GPT-3 [3]. The architectural foundation of the model rests on the transformer architecture [1], whose core component is the attention mechanism, which assigns varying weights to input data based on the context. Similar
to GPT-3, GPT-4 was also fine-tuned to follow a broad class of written instructions
by reinforcement learning from human feedback (RLHF) [30,31], which is a technique
that uses human preferences as a reward signal to fine-tune models.
When GPT-4 receives an input prompt, the input text is first converted into
tokens that the model can interpret and process. These tokens are then managed by
transformer layers [32], which capture their relationships and context. Within these
layers, attention mechanisms distribute different weights to tokens based on their rele-
vance and context. After attention processing, the model forms its internal renditions
of the input data, known as intermediate representations. These representations are
then decoded back into human-readable text [33].
A significant aspect of this process is the randomness function [32]. This function
is influenced by two primary parameters, temperature and top-k sampling. The first one, temperature [34], balances the randomness and determinism in the output. A higher temperature value results in more random outputs, while a lower value makes the output more deterministic. The second one, top-k sampling [35], limits the model’s choices to the top k most probable tokens during each step of output generation. The
final stage of this process is the output generation, where the model crafts the final
text.
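To make these two parameters concrete, the following minimal sketch (a toy illustration, not the actual GPT-4 decoding code, whose internals are not public) applies temperature scaling and top-k truncation to a small vector of logits before sampling the next token:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=3, rng=None):
    """Toy next-token sampler: temperature scaling followed by top-k truncation."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    # Temperature: <1.0 sharpens the distribution (more deterministic),
    # >1.0 flattens it (more random).
    scaled = logits / max(temperature, 1e-8)
    # Top-k: keep only the k most probable tokens, mask out the rest.
    top_indices = np.argsort(scaled)[-top_k:]
    masked = np.full_like(scaled, -np.inf)
    masked[top_indices] = scaled[top_indices]
    # Softmax over the surviving logits, then sample one token index.
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

# Example: 5-token vocabulary; a lower temperature concentrates samples on token 2.
print(sample_next_token([1.0, 2.0, 3.5, 0.5, 2.5], temperature=0.7, top_k=3))
```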
2.2 Giving instructions
The method of giving instructions, also known as re-reading [36], refers to the heuris-
tics of human reading strategy. It has been observed that the output generated
by GPT-4, introduced in Section 2.1, tends to be excessively general when pro-
vided with fundamental instructions devoid of any supplementary description [37,38].
An example prompt is shown in Figure 2. When the model is prompted with a
basic instruction, it faces a plethora of options, which makes the result quite broad.
Hence, a comprehensive description is imperative to elicit more precise and relevant
outputs [39].
Model input:
I want to understand some cutting-edge aspects of technology.
Model output:
Here are some of the key areas:
1. Artificial intelligence and machine learning: These technologies are becoming increasingly sophisticated, with advancements in areas such as natural language processing (like the technology you’re interacting with right now), computer vision, and reinforcement learning. Techniques such as transfer learning and federated learning are being used to make machine learning models more efficient and less data-hungry.
2. Quantum computing: Quantum computers use the principles of quantum mechanics to process information. They hold the potential to solve certain types of problems much more efficiently than classical computers. Companies like Google, IBM, and many startups are investing heavily in this area.
3. 5G and beyond: 5G wireless technology began widespread deployment in 2020 and 2021, bringing faster speeds, lower latency, and the ability to connect more devices simultaneously. Research is also being done on 6G, which is expected to integrate even more deeply with AI and potentially bring about more transformative changes.
4. Biotechnology: ...
Fig. 2 Giving instructions without extra description.
2.3 Be clear and precise
The second basic prompt method is “to be clear and precise”. This involves formulat-
ing prompts that are unambiguous and specific, which can guide the model toward
generating the desired output.
Most LLMs are trained on an extensive array of textual data, and their output can be conceptualized as a combination of insights from a myriad of authors. When
presented with a broad or undetailed prompt, its output predominantly exhibits a
generic nature, which, while being applicable across a range of contexts, may not be
optimal for any specific application. In contrast, a detailed and precise prompt enables
the model to generate content that is more aligned with the unique requirements of
the given scenario, as it reduces the model’s uncertainty and guides it toward the
correct response.
For instance, as shown in Figure 3, instead of asking a vague requirement such
as “I want to understand the cutting edge of technology.”, a more precise prompt
would be “I want to understand the cutting edge of technology, specifically related to
artificial intelligence and machine learning...”.
2.4 Role-prompting
Role-prompting is another fundamental method in prompt engineering. It involves
giving the model a specific role to play, such as a helpful assistant or a knowledgeable
expert [40]. This method can be particularly effective in guiding the model’s responses
and ensuring that they align with the desired output. For instance, if the model is
prompted to act as a historian, it is more likely to provide a detailed and contextually
accurate response when asked about a historical event [41]. Another brief example is
shown in Figure 4.
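In practice, chat-oriented APIs typically implement role-prompting by placing the role description in a system message that precedes the user query. The sketch below uses the OpenAI Python SDK purely for illustration; the client interface and the model name are assumptions that may differ across SDK versions and providers:

```python
from openai import OpenAI  # assumes the openai>=1.x Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model name
    messages=[
        # The system message assigns the role the model should play.
        {"role": "system",
         "content": "You are an expert in artificial intelligence "
                    "specializing in Large Language Models."},
        # The user message carries the actual question.
        {"role": "user",
         "content": "Give me five basic methods to quickly get started "
                    "with Large Language Models."},
    ],
)
print(response.choices[0].message.content)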
2.5 Use of triple quotes to separate
In prompt engineering, the use of triple quotes is a technique used to separate different
parts of a prompt or to encapsulate multi-line strings. This technique is particu-
larly useful when dealing with complex prompts that include multiple components
or when the prompt itself contains quotes, which makes the model understand one’s
instructions better [42].
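A minimal sketch of this technique is shown below; the reference text and question are illustrative placeholders, and the only point is that the triple quotes delimit the supplied context from the instruction itself:

```python
# Hypothetical helper: wrap reference material in triple quotes so the model
# can distinguish the supplied context from the actual instruction.
reference_text = (
    "Detailed size analysis: military tanks vs. cars\n"
    "1. Dimensions and weight: ...\n"
    "2. Maneuverability & design: ..."
)

question = ("Imagine an infinitely wide entrance, which is more likely to "
            "pass through it, a military tank or a car?")

prompt = f'"""\n{reference_text}\n"""\n\nUse the text above to answer: {question}'
print(prompt)
```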
Model input:
I want to understand the cutting edge of technology, specifically related to artificial intelligence and machine learning. Recently, artificial intelligence has become extremely popular, especially the Large Language Models which have amazed me. Please analyze the possible reasons behind the rapid advancement of this technology from at least three perspectives.
Model output:
Absolutely, the rise of AI and machine learning, especially with regard to Large Language Models like GPT-4, can be seen as a convergence of several key factors. I’ll analyze this from three perspectives: technological advancements, availability of data, and socio-economic demand.
1. Technological advancements: The growth of AI and machine learning models has been facilitated by several technological innovations. For starters, we’ve seen incredible improvements in computational power, particularly in Graphics Processing Units (GPUs) and now even more specialized AI chips. These advances have made it ...
2. Availability of data: ...
3. Socio-economic demand: ...
Fig. 3 A clearer and more precise prompt.
A role prompt:
You are an expert in artificial intelligence specializing in Large Language Models...
Model output:
Here are five basic methods to quickly get started with Large Language Models:
1. Experiment with web-based interfaces:
...
Fig. 4 Role prompting example.
2.6 Try several times
Due to the non-deterministic nature of LLMs, it is often beneficial to try several
times when generating responses. This technique, often referred to as “resampling”,
involves running the model multiple times with the same prompt and selecting the
best output. This approach can help overcome the inherent variability in the model’s
responses and increase the chances of obtaining a high-quality output [34].
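A minimal sketch of resampling is given below; generate and score are placeholders for an actual LLM call and a task-specific quality metric, neither of which is prescribed by the method itself:

```python
# Minimal resampling sketch: `generate` and `score` are placeholders for an
# actual LLM call and a task-specific quality metric (e.g., a human rating
# or an automatic evaluator).
def resample(generate, score, prompt, n_tries=5):
    candidates = [generate(prompt) for _ in range(n_tries)]
    return max(candidates, key=score)

# Example with toy stand-ins:
import random
toy_generate = lambda p: f"answer-{random.randint(0, 9)}"
toy_score = lambda answer: int(answer.split("-")[1])  # prefer higher number
print(resample(toy_generate, toy_score, "Explain transformers briefly."))
```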
2.7 One-shot or few-shot prompting
One-shot and few-shot prompting are two important techniques in prompt engineer-
ing. One-shot prompting refers to the method where the model is given a single
example to learn from, while few-shot [43] prompting provides the model with multi-
ple examples [44]. The choice between one-shot and few-shot prompting often depends
on the complexity of the task and the capability of the model. For instance, for simple
tasks or highly capable models, one-shot prompting might be sufficient. An example
is shown in Figure 5. However, for more complex tasks or less capable models, few-
shot prompting can provide additional context and guidance, thereby improving the
model’s performance.
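The sketch below illustrates how such prompts can be assembled programmatically; with a single example pair it produces a one-shot prompt, and appending further pairs turns it into a few-shot prompt (the example question is illustrative):

```python
# Sketch of one-shot/few-shot prompt construction; the example pairs are
# illustrative, not taken from a specific benchmark.
def build_few_shot_prompt(examples, question):
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

examples = [
    ("Yes or no: is it common to see frost during some college commencements?",
     "The answer is yes."),
]
print(build_few_shot_prompt(examples,
                            "Yes or no: would a pear sink in water?"))
```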
However, as stated in [45], “examples don’t always help”, meaning that zero-shot
prompting may have better output in some scenarios. Zero-shot prompting [46,47],
in the context of prompt-based learning, involves using a pre-trained LLM to per-
form tasks without any specific training for those tasks. The model relies on its
general knowledge, acquired during pre-training, to generate predictions based on clev-
erly crafted prompts. This allows LLMs to handle new tasks with no additional task-specific data, making them adaptable to scenarios with minimal labeled data.
Standard Prompt
Model input:
Q: Yes or no: would a pear sink in water?
Model output:
A: Yes, a pear would typically float in water, not sink. The reason for this is that pears, like many fruits, are composed primarily of water and have a density that is less than that of water. Objects that are less dense than water will float when placed in water. ×
One-shot Prompt
Model input:
Q: Yes or no: is it common to see frost during some college commencements?
A: The answer is yes.
Q: Yes or no: would a pear sink in water?
Model output:
A: The answer is No.
Fig. 5 Comparison of standard prompt and one-shot prompt.
[45] investigated the intricacies of how large generative language models, such as GPT-3,
respond to prompts. One of the significant findings from this paper is that zero-shot
prompts can, in certain scenarios, outperform few-shot prompts. This suggests that
the role of few-shot examples might not be as much about teaching the model a new
task (meta-learning) but rather guiding it to recall a task it has already learned. This
insight is crucial as it challenges the conventional wisdom that more examples always
lead to better performance [3]. In the context of one-shot or few-shot prompting, it is
essential to understand that while examples can guide the model, they do not always
enhance its performance. Sometimes, a well-crafted zero-shot prompt can be more
effective than providing multiple examples [48].
2.8 LLMs settings: temperature and top-p
The settings of LLMs, such as the temperature and top-p, play a crucial role in the
generation of responses. The temperature parameter controls the randomness of the
generated output: a lower temperature leads to more deterministic outputs [49,50].
The top-p parameter, on the other hand, controls nucleus sampling [34], which restricts token selection to the smallest set of tokens whose cumulative probability exceeds p, thereby modulating the diversity of the model’s output [51]. Adjusting these parameters
can significantly affect the quality and diversity of the model’s responses, making them
essential tools in prompt engineering. However, it has been noted that certain models,
exemplified by ChatGPT, do not permit the configuration of these hyperparameters,
barring instances where the Application Programming Interface (API) is employed.
[52] ranks several AI text generators and text-to-image systems in terms of various
openness metrics, including the accessibility of their API and model parameters.
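Where an API does expose these settings, they are usually passed alongside the prompt, as in the following sketch based on the OpenAI Python SDK (the client interface and model name are assumptions and may differ across SDK versions and providers):

```python
from openai import OpenAI  # illustrative; any API exposing these settings works

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",    # illustrative model name
    temperature=0.2,  # low temperature: more deterministic output
    top_p=0.9,        # nucleus sampling: keep tokens covering 90% of probability mass
    messages=[{"role": "user",
               "content": "Summarize prompt engineering in one sentence."}],
)
print(response.choices[0].message.content)
```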
3 Advanced methodologies
The foundational methods from the previous section can help us produce satisfactory
outputs. However, experiments indicate that when using LLMs for complex tasks
such as analysis or reasoning, the accuracy of the model’s outputs still has room
for improvement. In this section, advanced techniques of prompt engineering will be
introduced to guide the model in generating more specific, accurate, and high-quality
content.
3.1 Chain-of-thought
The concept of “Chain-of-Thought” (CoT) prompting [23] in LLMs is a relatively new
development, which has been shown to significantly improve the accuracy of LLMs on
various logical reasoning tasks [53–55]. CoT prompting involves providing intermedi-
ate reasoning steps to guide the model’s responses, which can be facilitated through
simple prompts such as “Let’s think step by step” or through a series of manual
demonstrations, each composed of a question and a reasoning chain that leads to an
answer [56,57]. It also provides a clear structure for the model’s reasoning process,
making it easier for users to understand how the model arrived at its conclusions.
[58] illustrates the application of CoT prompting to medical reasoning, show-
ing that it can effectively elicit valid intermediate reasoning steps from LLMs. [59]
introduces the concept of Self-Education via Chain-of-Thought Reasoning (SECToR),
and argues that, in the spirit of reinforcement learning, LLMs can successfully teach
themselves new skills by chain-of-thought reasoning. In another study, [60] used CoT
prompting to train verifiers to solve math word problems, demonstrating the tech-
nique’s potential in educational applications. [61] proposed a multimodal version of
CoT, called Multimodal-CoT, to handle more complex, multimodal tasks beyond sim-
ple text-based tasks, such as visual tasks, further expanding the potential applications
of CoT. Furthermore, many works are building upon the CoT framework, for instance,
[62] introduces Automate-CoT, an automated approach for augmenting and selecting
rationale chains to enhance the reasoning capabilities of LLMs, thereby reducing the
reliance on manually crafted CoT prompts.
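As a minimal sketch, a manual CoT prompt can be assembled by prepending one or more worked demonstrations, each containing a question and its reasoning chain, to the new question; the demonstration below follows the style popularized by [23], and the new question is illustrative:

```python
# Sketch of a manually written chain-of-thought demonstration; the worked
# example is illustrative and any arithmetic word problem could be used.
cot_demonstration = (
    "Q: A cafeteria had 23 apples. It used 20 and bought 6 more. "
    "How many apples does it have?\n"
    "A: The cafeteria started with 23 apples. After using 20, it had "
    "23 - 20 = 3. After buying 6 more, it had 3 + 6 = 9. The answer is 9.\n\n"
)

new_question = ("Q: A library had 40 books, lent out 15, and received 12 "
                "donations. How many books does it have now?\nA:")

cot_prompt = cot_demonstration + new_question
print(cot_prompt)
```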
3.1.1 Zero-shot chain-of-thought
The concept of “Zero-Shot chain-of-thought” (Zero-shot-CoT) prompting is an
advanced iteration of the CoT prompting mechanism, where the “zero-shot” aspect
implies that the model is capable of performing some reasoning without having seen
any examples of the task during training.
[47] discovered that the augmentation of queries with the phrase “Let’s think step
by step” facilitated the generation of a sequential reasoning chain by LLMs. This
reasoning chain subsequently proved instrumental in deriving more precise answers.
This technique is based on the idea that the model, much like a human, can benefit
from having more detailed and logical steps to process the prompt and generate a
response.
For instance, the standard prompt is illustrated in Figure 6, while the appended
phrase, “Let’s think step by step”, is depicted in Figure 7. Observations indicate that
the incorporation of “Let’s think step by step” enhances the logical coherence and
comprehensiveness of the model’s response.
Model input:
Imagine an infinitely wide entrance, which is more likely to pass through it, a military tank or a car?
Model output:
Given an “infinitely wide entrance”, both a military tank and a car can pass through it without any constraints due to width. Therefore, neither is more likely than the other based on width alone. Both can pass through with equal likelihood.
However, if other factors such as ground conditions, accessibility, purpose, or authorization come into play, the likelihood might change. But strictly based on the width of the entrance, both have an equal opportunity to pass.
Fig. 6 Standard prompt.
Model input:
Imagine an infinitely wide entrance, which is more likely to pass through it, a military tank or a car? Let’s think step by step.
Model output:
Given an infinitely wide entrance, both a military tank and a car could pass through without any issues related to the width. However, there are still several factors that could influence which is more likely to pass through:
1. Ground surface: tanks are designed to handle rough terrains that cars may struggle with. If the ground leading to and from the entrance is too rugged or steep for a car, the tank would be more likely to pass through.
2. Weight restrictions: if the entrance has a load limit due to the materials used in its construction, the car, being lighter, could be more likely to pass through. The average car weighs around 1.5 tons, while a military tank can weigh anywhere from 60 to 70 tons.
3. Height clearance: if the entrance is infinitely wide but not infinitely tall, the height of the vehicle matters. The average height of a car is around 1.5 meters, while tanks can be around 2.4 meters. If the entrance is shorter than around 2.4 meters, a car would be more likely to pass through.
4. Regulations and permissions:
...
In conclusion, without specific information about the conditions at the entrance, we would assume that a car is more likely to pass through, simply because cars are more common, more likely to meet potential height and weight restrictions and face fewer regulatory barriers.
Fig. 7 Adding “Let’s think step by step”.
3.1.2 Golden chain-of-thought
[63] introduced the “golden chain-of-thought”, providing an innovative approach to
generating responses to instruction-based queries. This methodology leverages a set of
“ground-truth chain-of-thought” solutions incorporated within the prompt, consider-
ably simplifying the task for the model as it circumvents the necessity for independent
CoT generation. Concurrently, a novel benchmark comprising detective puzzles has
been designed, to assess the abductive reasoning capacities of LLMs, which is also
considered an evaluation of the golden CoT. Finally, according to the experiment by
[63], in the context of the golden CoT, GPT-4 exhibits commendable performance,
boasting an 83% solve rate of puzzles in contrast to the 38% solve rate of the standard
CoT.
However, the fact that the golden CoT requires the “ground-truth chain-of-thought solutions” as an integral part of the prompt also means that its contribution to solving such problems independently is limited, despite its high solve rate of 83%.
3.2 Self-consistency
In the assessment of InstructGPT [64] and GPT-3 [3] on a new synthetic QA dataset
called Proof and Ontology-Generated Question-Answering (PrOntoQA) [65,66], it
was observed that although the most extensive model exhibited capability in rea-
soning tasks, it encountered challenges in proof planning and the selection of the
appropriate proof step amidst multiple options, which caused accuracy uncertainties
[65]. Self-consistency is one method that helps LLMs address this situation: it is an advanced prompting technique that aims to ensure that the model’s responses are consistent with each other [23,24]. This method greatly increases the odds of obtaining
highly accurate results. The principle of self-consistency in language models posits
that for a complex reasoning problem, there can be multiple reasoning paths leading
to the correct answer. In this approach, a language model generates a diverse set of
reasoning paths for the same problem. The most accurate and consistent answer is
then determined by evaluating and marginalizing across these varied paths, ensuring
that the final answer reflects the convergence of multiple lines of thought.
The self-consistency method contains three steps. Firstly, prompt a language model
using CoT prompting, then replace the “greedy decode” (1-Best) [32,67] in CoT
prompting by sampling from the language model’s decoder to generate a diverse set
of reasoning paths, and finally, marginalize out the reasoning paths and aggregate by
choosing the most consistent answer in the final answer set.
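A minimal sketch of these three steps is given below; sample_cot stands in for a single stochastic CoT-prompted model call returning a reasoning path and a final answer, and the majority vote plays the role of marginalizing over paths:

```python
from collections import Counter
import random

# Sketch of self-consistency: `sample_cot` stands in for one stochastic
# CoT-prompted model call that returns (reasoning_path, final_answer).
def self_consistency(sample_cot, prompt, n_paths=10):
    answers = [sample_cot(prompt)[1] for _ in range(n_paths)]  # step 2: diverse paths
    return Counter(answers).most_common(1)[0][0]               # step 3: majority vote

# Toy stand-in that answers "9" most of the time and "8" occasionally.
toy_sample = lambda p: ("...reasoning...", random.choice(["9"] * 7 + ["8"] * 3))
print(self_consistency(toy_sample, "How many apples are left?"))
```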
It is noteworthy that self-consistency can be harmoniously integrated with most
sampling algorithms, including but not limited to, temperature sampling [49,50],
top-k sampling [32,68,69], and nucleus sampling [34]. Nevertheless, such an operation
may necessitate the invocation of the model’s API to fine-tune these hyperparameters.
In light of this, an alternative approach could be to directly prompt the model to generate a diverse set of candidate reasoning paths. The response demonstrating the highest degree of consistency across
the various reasoning trajectories is then more inclined to represent the accurate
solution [70].
[2,71] have shown that self-consistency enhances outcomes in arithmetic, com-
monsense, and symbolic reasoning tasks. Furthermore, in practice, self-consistency
can be combined with other techniques to further enhance the model’s performance.
[72] found that combining self-consistency with a discriminator-guided multi-step
reasoning approach significantly improved the model’s reasoning capabilities.
3.3 Generated knowledge
The “generated knowledge” [73] approach in prompt engineering is a technique that
leverages the ability of LLMs to generate potentially useful information about a given
question or prompt before generating a final response. This method is particularly
effective in tasks that require commonsense reasoning, as it allows the model to gen-
erate and utilize additional context that may not be explicitly present in the initial
prompt.
As exemplified in Figure 6, when posing the query to the model, “Imagine an
infinitely wide entrance, which is more likely to pass through it, a military tank or a
car?”, standard prompts predominantly yield responses that neglect to factor in the
“entrance height”. Conversely, as delineated in Figure 8 and Figure 9, prompting the
model to first generate pertinent information and subsequently utilizing generated
information in the query leads to outputs with augmented logical coherence and com-
prehensiveness. Notably, this approach stimulates the model to account for salient
factors such as “entrance height”.
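A minimal two-stage sketch of this approach is given below; ask_llm is a placeholder for a single LLM call returning text, and the prompts are illustrative rather than prescribed by [73]:

```python
# Two-stage sketch of generated-knowledge prompting; `ask_llm` is a placeholder
# for a single LLM call returning text.
def generated_knowledge_answer(ask_llm, question):
    # Stage 1: elicit background knowledge relevant to the question.
    knowledge = ask_llm(
        "Generate key facts and influencing factors relevant to the "
        f"following question:\n{question}"
    )
    # Stage 2: answer the question with the generated knowledge as context.
    final_prompt = (f'"""\n{knowledge}\n"""\n\n'
                    f"Use the information above to answer: {question}")
    return ask_llm(final_prompt)

# Usage with a toy stand-in for the model:
toy_llm = lambda prompt: f"[model response to {len(prompt)} characters of prompt]"
print(generated_knowledge_answer(
    toy_llm,
    "Imagine an infinitely wide entrance, which is more likely to pass "
    "through it, a military tank or a car?"))
```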
3.4 Least-to-most prompting
The concept of “least-to-most prompting” [74] is an advanced method that involves
decomposing a complex problem into a series of simpler subproblems, which are then
addressed sequentially. The foundational premise of this approach is to systematically
break down intricate tasks into manageable components. Each subproblem is solved
in turn, with the solution to each one serving as a building block for the next. This
step-by-step process not only simplifies the problem-solving task for the language
model but also enables it to handle more complex challenges than those typically
managed by traditional prompting techniques. Figure 10 is an illustration of least-to-most prompting applied to a mathematical word problem. This figure illustrates the least-to-
most prompting approach by systematically breaking down a complex problem into a
series of simpler subproblems. The process begins with the decomposition of the main
problem—calculating the distance a train travels in 2.5 hours—into two sequential
subproblems. First, the model is prompted to determine the train’s speed, and then
it uses this information to calculate the distance traveled. Each subproblem is solved
in sequence, with the solution to the first subproblem feeding into the second. The
solutions are then aggregated to arrive at the final answer. This method emphasizes
Model input:
Generate two key analyses related to detailed size data on military tanks and cars, and then generate three key influencing factors regarding whether an object can pass through an infinitely wide entrance.
Model output:
Detailed size analysis: military tanks vs. cars
1. Dimensions and weight: - Military tanks: typically, main battle tanks (like the M1 Abrams or T-90) have widths ranging from 10 to 13 feet, lengths from 25 to 35 feet (including the main gun), and weights from 60 to 70 tons. - Cars: most consumer cars have a width of about 6 to 8 feet, a length of 12 to 20 feet, and weigh between 1 to 3 tons. Some large SUVs and trucks can exceed these figures but are still substantially smaller and lighter than tanks.
2. Maneuverability & design: - Military tanks: tanks are designed to traverse rough terrains and have tracks that distribute weight and prevent the vehicle from getting easily stuck in muddy or uneven ground. However, their sheer weight and size limit their maneuverability in urban or congested settings. - Cars: designed for paved roads, cars are generally more agile and can navigate tight spaces more efficiently than tanks. Their design emphasizes aerodynamics and comfort, unlike tanks that prioritize armor and firepower.
Three key influencing factors on whether an object can pass through an infinitely wide entrance:
1. Height of the object: the height is critical. An infinitely wide entrance does not give information about its height. If the object is taller than the entrance, it won’t pass through.
2. Structural integrity: depending on the material and design of an object, squeezing or forcing it through an entrance could compromise its structural integrity. For instance, certain tanks have reactive armor or sensitive equipment that may be damaged if scraped or bumped forcefully.
3. Ground clearance and maneuverability: ...
Fig. 8 Generating knowledge (Step 1).
the key principles of problem decomposition and sequential problem solving, enabling
the model to manage and solve complex tasks more effectively.
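A minimal sketch of this decompose-then-solve loop is given below; ask_llm is a placeholder for an LLM call, and the decomposition prompt and parsing are simplifying assumptions rather than the exact procedure of [74]:

```python
# Sketch of least-to-most prompting; `ask_llm` is a placeholder for an LLM call.
def least_to_most(ask_llm, problem):
    # Stage 1: ask the model to decompose the problem into ordered subproblems.
    decomposition = ask_llm(
        "Break the following problem into a numbered list of simpler "
        f"subproblems, easiest first:\n{problem}"
    )
    subproblems = [line.strip() for line in decomposition.splitlines() if line.strip()]
    # Stage 2: solve the subproblems sequentially, feeding earlier answers forward.
    context, answer = problem, ""
    for sub in subproblems:
        answer = ask_llm(f"{context}\n\nSubproblem: {sub}\nAnswer:")
        context += f"\n{sub}\nAnswer: {answer}"
    return answer  # the last subproblem's answer serves as the final answer
```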
Upon rigorous experimentation in domains including symbolic manipulation, com-
positional generalization, and mathematical reasoning, [74] substantiate that the
least-to-most prompting paradigm exhibits the capacity to generalize across challenges
of greater complexity than those initially presented in the prompts. [75] introduced
Program Aided Language models (PAL), using the LLMs to read natural language
problems and generate programs as the intermediate reasoning steps. By using least-to-most prompting, PAL shows enhancement on GSM8K [60] and SVAMP [76], which
are benchmarks about complex mathematical problems for LLMs.
3.5 Tree of thoughts
The “tree of thoughts” (ToT) prompting technique in LLMs is an advanced method
that employs a structured approach to guide LLMs in their reasoning and response
generation processes. It enhances problem-solving by exploring multiple reasoning
paths, termed ‘thoughts’. Unlike traditional linear prompts, ToT allows LLMs to con-
sider various possible solutions and strategies, including looking ahead, backtracking,
and self-evaluation, making it more interactive and adaptable to the complexity of
the task at hand. This approach fosters more dynamic and deliberate decision-making
in complex problem-solving tasks, moving beyond a rigid, hierarchical prompt struc-
ture to a more flexible and adaptive reasoning process [77]. The ToT approach, for
instance, when applied to complex mathematical problem-solving, prompts the model
to generate various potential solutions and evaluate them, rather than simply asking
for a solution.
Model input:
"""
Detailed size analysis: military tanks vs. cars
1. Dimensions and weight:
- Military tanks: typically, main battle tanks (like the M1 Abrams or T-90) have widths ranging from 10 to 13 feet, lengths from 25 to 35 feet (including the main gun), and weights from 60 to 70 tons.
- Cars: most consumer cars have a width of about 6 to 8 feet, a length of 12 to 20 feet, and weigh between 1 to 3 tons. Some large SUVs and trucks can exceed these figures but are still substantially smaller and lighter than tanks.
2. Maneuverability & design...
"""
Use the above information to answer the following question: Imagine an infinitely wide entrance, which is more likely to pass through it, a military tank or a car?
Model output:
Given the information provided about the dimensions, design, and the factors influencing whether an object can pass through an infinitely wide entrance, we can infer the following:
1. Height of the object: the height is not specified for either tanks or cars. However, given that tanks are built for rough terrains and often have a turret and other equipment on top, they generally have a taller profile compared to most cars. If the entrance had a restrictive height, cars would generally have an advantage here.
2. Structural integrity: ...
3. Ground clearance and maneuverability:
...
Conclusion: If the entrance height is not restrictive and the surface is rough, a military tank would be more likely to pass through it without issues. On the other hand, if the entrance is low-height or has tight squeezes that could compromise the structural integrity of larger, robust vehicles, a car might be at an advantage.
However, given no height constraints and only the factor of width (since it is “infinitely wide”), both a military tank and a car would easily pass through it. In such a scenario, the distinguishing factor would be the terrain of the entrance. If it is rough or uneven, the tank has the advantage. If it is smooth, both would pass with relative ease.
Fig. 9 Combining the generated knowledge with the question (Step 2).
[77] demonstrates that this formulation is more versatile and can handle challeng-
ing tasks where standard prompts might fall short. Another research by [78] further
emphasizes the potential of this technique in enhancing the performance of LLMs by
structuring their thought processes.
[7] introduces the “tree-of-thought prompting”, an approach that assimilates the
foundational principles of the ToT frameworks and transforms them into a stream-
lined prompting methodology. This technique enables LLMs to assess intermediate
cognitive constructs within a singular prompt. An exemplar ToT prompt is delineated
in Figure 11.
3.6 Graph of thoughts
Unlike the “chain-of-thoughts” or “tree of thoughts” paradigms, the “graph of
thoughts” (GoT) framework [79] offers a more intricate method of representing the
information generated by LLMs. The core concept behind GoT is to model this infor-
mation as an arbitrary graph. In this graph, individual units of information, termed
“LLM thoughts”, are represented as vertices. The edges of the graph, on the other
hand, depict the dependencies between these vertices. This unique representation
allows for the combination of arbitrary LLM thoughts, thereby creating a synergistic
effect in the model’s outputs.
In the context of addressing intricate challenges, LLMs utilizing the GoT frame-
work might initially produce several autonomous thoughts or solutions. These
Problem: If a train travels 60 kilometers in 1 hour, how far will it travel in 2.5 hours?
Step 1 (decompose the problem):
Subproblem 1: Determine the speed of the train. Model input: “The train travels 60 kilometers in 1 hour. What is the speed of the train in kilometers per hour?”
Subproblem 2: Calculate the distance traveled in 2.5 hours using the speed. Model input: “If the train travels at 60 kilometers per hour, how far will it travel in 2.5 hours?”
Step 2 (solve sequentially):
Solution 1: 60 kilometers per hour.
Solution 2: Using the speed from Subproblem 1, the model calculates that 60 km/h multiplied by 2.5 hours equals 150 kilometers.
Step 3 (aggregate the solutions):
Final answer: The train will travel 150 kilometers in 2.5 hours.
Fig. 10 Illustration of Least-to-Most Prompting Applied to a Mathematical Word Problem.
Tree of thoughts prompting:
Imagine three different experts answering this question.
All experts will write down 1 step of their thinking,
then share it with the group.
Then all experts will go on to the next step, etc.
If any expert realizes they’re wrong at any point then they leave.
The question is...
Fig. 11 A sample ToT prompt [7].
individual insights can subsequently be interlinked based on their pertinence and inter-
dependencies, culminating in a detailed graph. This constructed graph permits diverse
traversal methods, ensuring the final solution is both precise and comprehensive,
encompassing various dimensions of the challenge.
The efficacy of the GoT framework is anchored in its adaptability and the pro-
found insights it can yield, particularly for intricate issues necessitating multifaceted
resolutions. Nonetheless, it is imperative to recognize that while GoT facilitates a sys-
tematic approach to problem-solving, it also necessitates a profound comprehension
of the subject matter and meticulous prompt design to realize optimal outcomes [80].
3.7 Decomposed prompting
Decomposed Prompting (DECOMP) [81] is a modular approach designed to tackle
complex tasks by breaking them down into simpler, manageable sub-tasks. This
methodology leverages the capabilities of LLMs by creating a systematic process where
each sub-task is handled by specialized handlers. The approach not only simplifies
the problem-solving process but also enhances the flexibility and efficiency of task
handling.
Four key components of this method are shown in Figure 12. The core of DECOMP
involves a decomposer LLM that generates a prompting program Pfor a complex
task Q. The program Pis a sequence of steps, each step directing a simpler sub-query
to a function within an auxiliary set of sub-task functions F. The program can be
Key Components
Decomposer LLM
1. Generates a structured sequence of sub-tasks (a prompting program) to solve a complex task.
Prompting Program
1. Created by the Decomposer LLM.
2. Consists of a series of sub-queries and associated sub-task functions.
Sub-task Handlers
1. Specialized modules or functions that handle specific types of sub-tasks.
2. Examples include functions for text splitting, extracting specific information from text, or concatenating strings.
Controller
1. Task scheduling: Determines the execution sequence of tasks.
2. Data transfer and coordination: Transfers data between sub-task processors and coordinates the flow of data between the Decomposer and the sub-task handlers.
3. Status management: Tracks the execution status of a task and determines when to stop or continue the execution.
Fig. 12 Key components of DECOMP.
represented as:
$$P = \{(f_1, Q_1, A_1), \ldots, (f_k, Q_k, A_k)\}$$
where $A_k$ is the final answer predicted by $P$, and $Q_i$ is a sub-query directed to the sub-task function $f_i \in F$. A high-level imperative controller manages the execution of $P$, passing inputs and outputs between the decomposer and sub-task handlers until the final output is obtained.
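The controller loop can be sketched as follows under simplifying assumptions: the decomposer is modeled as a plain Python function that, given the task and the answers produced so far, returns the next (handler name, sub-query) step or None when finished, whereas the actual method drives a decomposer LLM with in-context examples [81]:

```python
# Sketch of a DECOMP-style controller; the decomposer and handlers here are
# ordinary Python callables standing in for LLM prompts or symbolic functions.
def run_decomp(decomposer, handlers, task):
    answers = []
    while True:
        step = decomposer(task, answers)
        if step is None:                      # stopping condition reached
            return answers[-1] if answers else None
        handler_name, sub_query = step
        answers.append(handlers[handler_name](sub_query))

# Toy instance: concatenate the k-th letter of each word.
handlers = {
    "kth_letter": lambda q: q["word"][q["k"]],
    "concat": lambda parts: "".join(parts),
}

def toy_decomposer(task, answers):
    words, k = task
    if len(answers) < len(words):             # one kth_letter step per word
        return "kth_letter", {"word": words[len(answers)], "k": k}
    if len(answers) == len(words):            # final aggregation step
        return "concat", answers[:]
    return None                               # done

print(run_decomp(toy_decomposer, handlers, (["prompt", "design"], 0)))  # -> "pd"
```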
To teach the decomposer LLM, in-context examples are used. These examples
demonstrate the decomposition of complex queries into simpler sub-queries. Each
example $E_j$ takes the form:
$$E_j = \big( Q_j, (f_{j,1}, Q_{j,1}, A_{j,1}), \ldots, (f_{j,k_j}, Q_{j,k_j}, A_{j,k_j}) \big)$$
where $A_{j,k_j} = A_j$ is the final answer for $Q_j$, and $(Q_{j,1}, \ldots, Q_{j,k_j})$ represents the decomposition of $Q_j$. Each sub-task function $f$ is operationalized through sub-task handlers, which can be additional LLM prompts or symbolic or learned functions [81].
An illustration of the process flow is shown in Figure 13.
Process Flow
Initial Query
A complex task is submitted as an initial query $Q$.
Decomposition
The Decomposer LLM generates the first sub-query $Q_1$ and determines the appropriate sub-task handler $f_1$.
Sub-task Execution
1. The Controller passes $Q_1$ to the sub-task handler $f_1$.
2. The sub-task handler processes $Q_1$ and returns the result $A_1$.
Iterative Processing
1. The Controller feeds $A_1$ back to the Decomposer LLM, which generates the next sub-query $Q_2$ and the corresponding sub-task handler $f_2$.
2. This process continues iteratively until all sub-queries are processed and the final output is produced.
Final Output
1. The Controller checks for a stopping condition in the prompting program.
2. Once the stopping condition is met, the Controller returns the final result.
Fig. 13 An example of the process flow of DECOMP.
The DECOMP approach has several advantages. First, its modularity allows each
sub-task handler to be independently optimized, debugged, and upgraded, which
facilitates systematic performance improvements and easier integration of new meth-
ods or models. Second, DECOMP can incorporate error-correcting sub-task handlers,
improving the overall accuracy and reliability of the system. Third, the approach
allows for diverse decomposition structures, including hierarchical and recursive
decompositions, which are particularly useful for handling complex and large-scale
problems. Finally, sub-task handlers can be shared across different tasks, enhancing
the efficiency of the problem-solving process.
DECOMP and Least-to-Most Prompting [74] both decompose complex tasks
to enhance large language models’ reasoning abilities, but DECOMP distinguishes
itself through its flexible, modular approach. Unlike Least-to-Most Prompting’s lin-
ear progression from easy to hard sub-questions, DECOMP allows for non-linear and
recursive decomposition, with dedicated sub-task handlers that can be independently
optimized and replaced. This modularity not only enhances flexibility and reusabil-
ity across tasks but also introduces potential error-correcting mechanisms, making
DECOMP more robust and adaptable to complex, multi-step reasoning tasks. While
DECOMP has demonstrated superior performance in specific domains, such as sym-
bolic reasoning and multi-step question answering, its advantages over Least-to-Most
Prompting may vary depending on the nature of the task [81].
In case studies, DECOMP demonstrated superior performance in various sce-
narios. For instance, in the k-th letter concatenation task, DECOMP outperformed
CoT prompting by effectively teaching the sub-task of extracting the k-th letter
through further decomposition. In list reversal, DECOMP showed better length gen-
eralization compared to CoT by recursively decomposing the task into reversing
smaller sub-lists, achieving higher accuracy for longer input sequences. In long-context
question answering (QA), DECOMP allowed for handling more examples than fea-
sible with CoT prompting, leading to improved performance. In open-domain QA,
incorporating symbolic retrieval APIs within the DECOMP framework enhanced per-
formance on multi-hop QA datasets compared to CoT prompting. Additionally, in
Math QA, DECOMP improved accuracy by post-processing CoT prompts to fix
frequent formatting errors, resulting in significant performance gains [81].
By leveraging the modular, flexible, and systematic approach of DECOMP, com-
plex tasks can be effectively decomposed and solved, showcasing its superiority over
traditional CoT prompting and other contemporary methods.
3.8 Active prompt
The adoption of the active prompt [82] method marks a significant advancement in
the utilization of LLMs for complex reasoning tasks. The active prompt method does
not involve the traditional process of prefix-tuning [83]. Instead, it focuses on improv-
ing the reasoning capabilities of LLMs through strategic selection and annotation of
task-specific examples. By systematically selecting and annotating the most uncertain
questions, this method not only refines the model’s understanding but also lever-
ages human expertise more effectively [84]. The process begins with the generation
of multiple predictions for each question, followed by the calculation of uncertainty
(uncertainty estimation) [85,86] using various metrics such as disagreement, entropy,
and variance. This strategic selection process ensures that the most informative ques-
tions are prioritized for annotation. The human annotation phase is crucial, as it
involves providing detailed chain-of-thought reasoning and answers, which are then
used to prompt the LLM during inference. This annotated data serves as exemplars,
guiding the model through complex reasoning pathways and enhancing its predic-
tive accuracy. The application of self-consistency [24] techniques further solidifies the
model’s reliability by selecting the most consistent answers from multiple reasoning
paths. The key innovation of this method is to identify the most effective one-shot or few-shot [43] examples, thereby improving the model’s inference ability in specific fields. A concrete process illustration is shown in Figure 14.
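The disagreement-based variant of the uncertainty estimation and selection steps can be sketched as follows; sample_answer is a placeholder for one stochastic model call per question, and the questions and stand-in model are illustrative:

```python
import random

# Sketch of the uncertainty-estimation and selection steps of active prompting;
# `sample_answer` is a placeholder for one stochastic model call per question.
def disagreement_uncertainty(sample_answer, question, k=5):
    answers = [sample_answer(question) for _ in range(k)]
    return len(set(answers)) / k          # e.g. 4 distinct answers out of 5 -> u = 0.8

def select_most_uncertain(sample_answer, questions, n_select=2, k=5):
    scored = [(disagreement_uncertainty(sample_answer, q, k), q) for q in questions]
    scored.sort(reverse=True)             # highest uncertainty first
    return [q for _, q in scored[:n_select]]   # these go to human annotation

# Toy usage with a stand-in model:
toy_model = lambda q: random.choice(["A", "B", "C"]) if "hard" in q else "A"
print(select_most_uncertain(toy_model, ["easy question", "hard question 1",
                                        "hard question 2"]))
```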
=1
=1
=1
=1
=1
1
u = 1/5 = 0.2
=1
=2
=2
=4
=5
4
u = 4/5 = 0.8
=1
=2
=2
=4
=4
3
u = 3/5 = 0.6
=1
=3
=2
=4
=5
5
u = 5/5 = 1.0
: 1.0 : 0.4 : 0.4 : 0.8
: 0.2 : 0.8 : 1.0 : 0.6
: 0.6 : 1.0
, , … 
,
,
,
Collect
Uncertainty
Ranking
Question
(Query k times)
(k = 5)
Uncertainty
Select Most
Uncertain Questions , , …
Level 1
Level 2
Level 3
Level 4
Level 5
Level 1
Answer
+ Answer
(with rationale chains) + Answer
(with rationale chains) + Answer
(with rationale chains)
Test Question
1. Selected quantitatively
2. Annotated by human
Model Input
Exemplar set E
Annotation
1
2
4
3
Fig. 14 Illustration of the whole process. (1) Uncertainty Estimation. (2) Collection, Ranking and
Selection. (3) Annotation (by human). (4) Inference.
The active prompt method offers several key benefits, including efficient task adap-
tation and significant performance improvements across various reasoning domains.
This approach aligns with the broader trend towards more interactive and adaptive
AI systems, emphasizing the importance of responsive design in prompt engineering.
Its ability to reduce human engineering efforts by focusing on the most uncertain and
informative questions makes it an important tool for advancing LLM capabilities. This
method not only enhances the quality of task-specific prompts but also maximizes
the use of human expertise, paving the way for more sophisticated and accurate AI
systems [82].
3.9 Prompt pattern catalog
A Prompt Pattern Catalog [87] is an organized collection of prompt templates and
patterns designed to enhance the effectiveness of prompt engineering, particularly for
LLMs such as ChatGPT. This methodology involves creating a standardized set of
prompt patterns that can be applied across various tasks, ensuring consistency and
optimizing the performance of models through systematic prompt design. By develop-
ing a catalog of prompt patterns, researchers and practitioners can ensure a consistent
approach to prompt engineering, reducing variability and errors from ad hoc prompt
creation [87,88]. Predefined prompt patterns streamline the process of prompt engi-
neering, saving time and resources by allowing practitioners to select and adapt
patterns rather than crafting new prompts from scratch. A well-designed prompt pat-
tern catalog includes patterns for various contexts and applications, enabling models
to be quickly adapted to new tasks and domains by selecting the most appropriate
patterns. Systematic use of optimized prompt patterns enhances model performance
by providing more effective and contextually appropriate prompts, leading to better
task-specific results [89].
The central methodology of this research involves the conceptualization and
application of prompt patterns, which are reusable solutions to common problems
encountered when interacting with LLMs. These prompt patterns are analogous
to design patterns in software engineering, providing structured and documented
approaches to enhance the output and interaction quality of LLMs. The framework for
documenting these prompt patterns includes a detailed structure that ensures their
adaptability across different domains [87].
To systematically categorize these prompt patterns, the authors have divided them
into five primary categories: Input Semantics, Output Customization, Error Identifica-
tion, Prompt Improvement, and Interaction. This classification helps in organizing the
patterns based on their functional roles and the specific problems they address. Within
this framework, the research introduces a comprehensive catalog of 16 distinct prompt
patterns. Each pattern is meticulously documented with the following components:
name and classification, intent and context, motivation, structure and key ideas, exam-
ple implementation, and practical consequences. The prompt patterns cover a wide
range of functionalities. For instance, the Input Semantics category includes patterns
such as Meta Language Creation, which helps in defining custom input languages for
LLMs. The Output Customization category features patterns such as Output Auto-
mater and Visualization Generator, which tailor the generated outputs to specific
formats or visualizations. Error Identification patterns such as Fact Check List ensure
the accuracy of generated content by highlighting critical facts for verification. Prompt
Improvement patterns, including Question Refinement and Alternative Approaches,
enhance the quality of interactions by refining questions and suggesting multiple ways
to achieve a goal. Lastly, Interaction patterns such as Flipped Interaction and Game
Play facilitate dynamic and engaging user-LLM interactions [87].
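The catalog's documentation structure lends itself to a lightweight programmatic representation. The sketch below is illustrative only; the field names and the example pattern wording are our own rather than taken verbatim from [87].

```python
from dataclasses import dataclass

@dataclass
class PromptPattern:
    """Illustrative container for one catalog entry, loosely following the documentation
    structure described above (name/classification, intent, structure, example)."""
    name: str
    category: str          # e.g. "Output Customization", "Error Identification"
    intent: str
    template: str          # reusable wording with {placeholders}

    def render(self, **kwargs) -> str:
        return self.template.format(**kwargs)

# Hypothetical instantiation inspired by the "Fact Check List" pattern:
fact_check_list = PromptPattern(
    name="Fact Check List",
    category="Error Identification",
    intent="Ask the model to list the facts its answer depends on, for later verification.",
    template=("{task}\n\nAfter answering, list the key facts your answer depends on "
              "so that they can be independently verified."),
)

print(fact_check_list.render(task="Summarize the causes of the 2008 financial crisis."))
```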
The methodology also emphasizes the combinatory use of these patterns to tackle
more complex prompt engineering tasks. By providing detailed examples and practical
implementations, the research demonstrates how multiple prompt patterns can be
integrated to create sophisticated and efficient prompting strategies. This structured
approach not only improves the effectiveness of LLMs in various applications but also
contributes to the broader understanding and advancement of prompt engineering as
a field [87].
Research supports the effectiveness of prompt pattern catalogs. [87] outlines that
the development and use of a prompt pattern catalog can improve the effectiveness
and efficiency of prompt engineering with LLMs. [89] explores how predefined struc-
tured prompt patterns can enhance user interaction and improve model outputs in
conversational AI. [88] investigates the application of prompt engineering patterns in
enterprise settings, demonstrating their utility in optimizing model performance across
various tasks. Additionally, [90] highlights the benefits of using predefined structured
prompt patterns in software development, demonstrating significant improvements in
code quality, requirements elicitation, and refactoring efficiency.
3.10 Prompt optimization
In the domain of prompt engineering for LLMs, the challenge of crafting effective
prompts remains a significant barrier due to the extensive manual effort and expertise
required. Prompt optimization is a critical technique for improving the performance
of LLMs by refining the input prompts that guide their responses. The process of
prompt optimization systematically adjusts these prompts to enhance accuracy and
relevance, reducing the need for manual trial and error.
Several methods have been developed to automate prompt optimization, includ-
ing gradient-based approaches such as Prompt Optimization with Textual Gradients
(ProTeGi) [91], which uses text-based gradients to iteratively refine prompts, and
black-box methods that optimize prompts based solely on output performance without
requiring model internals. Additionally, model-adaptive techniques, such as Model-
Adaptive Prompt Optimization (MAPO) [92], tailor the optimization to the specific
characteristics of the LLM, potentially offering superior results. Each method has its
advantages: gradient-based techniques are efficient and directed, black-box approaches
are broadly applicable and easy to implement, and model-adaptive methods provide
customized optimization for specific models. The choice of method depends on task
requirements, model complexity, and available resources.
3.10.1 Prompt optimization with textual gradients
Prompt Optimization with Textual Gradients (ProTeGi) [91] is inspired by gradient
descent, a fundamental technique in optimization, but adapts this concept to the
discrete and non-parametric nature of natural language processing. Instead of relying
on numerical gradients, ProTeGi generates “textual gradients”, which are natural
language descriptions of the flaws in a given prompt based on its performance on
a small batch of data. These gradients indicate the semantic direction in which the
prompt needs to be improved.
ProTeGi further enhances this optimization process by applying these textual gra-
dients to modify the prompt in the opposite semantic direction, akin to a reverse
gradient descent in the language space. This iterative process is guided by a beam
search algorithm combined with a bandit selection strategy, which efficiently explores
the space of possible prompts and selects the most promising candidates for further
refinement [91].
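The following simplified sketch illustrates one ProTeGi-style iteration, assuming hypothetical `predict` (runs the downstream task with a given prompt) and `llm` (free-form text generation) helpers; it conveys the textual-gradient idea rather than reproducing the authors' implementation.

```python
def protegi_step(prompt, minibatch, predict, llm, n_candidates=4):
    """One simplified ProTeGi-style iteration:
    1) find minibatch examples the current prompt gets wrong,
    2) ask an LLM to describe the prompt's flaw in words (the "textual gradient"),
    3) ask the LLM to edit the prompt in the opposite semantic direction."""
    predictions = [(x, y, predict(prompt, x)) for x, y in minibatch]
    errors = [(x, y, p) for x, y, p in predictions if p != y]
    if not errors:
        return [prompt]  # nothing to fix on this minibatch
    gradient = llm(
        "The following prompt makes mistakes on the listed examples.\n"
        f"Prompt: {prompt}\nMistakes (input, expected, got): {errors}\n"
        "In one sentence, describe what is wrong with the prompt."
    )
    return [
        llm(f"Prompt: {prompt}\nDiagnosed problem: {gradient}\n"
            f"Rewrite the prompt so that it fixes this problem (variant {i + 1}).")
        for i in range(n_candidates)
    ]

# In the full method, candidates from successive steps are kept in a beam and pruned
# with a bandit-style selection procedure on held-out minibatches.
```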
The effectiveness of ProTeGi has been demonstrated across multiple NLP tasks,
including sentiment analysis, fake news detection, and the novel problem of LLM jail-
break detection. Experimental results indicate that ProTeGi can significantly improve
prompt performance, with reported gains of up to 31% over initial prompts, while
also surpassing existing prompt optimization methods in efficiency and accuracy. This
method provides a robust, data-driven approach to prompt engineering, offering a scal-
able solution that can adapt to various tasks without requiring access to the internal
states of LLMs [91].
3.10.2 Black-box prompt optimization
In recent prompt engineering research, the challenge of aligning LLMs with human
intent without model retraining has garnered significant attention. Traditional align-
ment techniques, such as Reinforcement Learning from Human Feedback (RLHF) and
Direct Preference Optimization (DPO), typically require substantial computational
resources and direct access to model parameters, which are not always feasible or effi-
cient, particularly with closed-source models such as GPT-4 or Claude-2. In response
to these limitations, a novel method called Black-box Prompt Optimization (BPO) [93]
has been introduced, providing a promising alternative for enhancing model alignment
through prompt optimization alone.
BPO shifts the focus from model-centric to input-centric optimization, where the
key idea is to refine the user’s prompts rather than altering the model’s internal
parameters. This approach leverages feedback from pre-existing datasets that contain
human preferences, creating pairs of original and optimized prompts. These pairs are
then used to train a sequence-to-sequence model designed to rewrite prompts in a way
that improves the alignment of LLM outputs with human expectations [93].
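A minimal sketch of this training setup is given below, using the Hugging Face transformers library with a T5 backbone as an illustrative stand-in for the BPO rewriter and a toy preference pair; the checkpoint choice and data are assumptions, not the configuration used in [93].

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")       # stand-in backbone for the rewriter
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

pairs = [  # (original prompt, preference-aligned rewrite) -- illustrative data only
    ("tell me about black holes",
     "Explain what a black hole is in three short paragraphs aimed at a curious teenager."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
for original, optimized in pairs:
    inputs = tokenizer("rewrite prompt: " + original, return_tensors="pt")
    labels = tokenizer(optimized, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss               # standard seq2seq cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# At inference time, the trained rewriter optimizes user prompts before they reach the target LLM.
out = model.generate(**tokenizer("rewrite prompt: tell me about black holes",
                                 return_tensors="pt"), max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```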
The BPO method offers several advantages. First, it is model-agnostic, allowing
it to be applied across various LLMs, whether open-source or API-based, without
requiring access to the model’s internals. Second, it enhances interpretability, as the
changes made to prompts are transparent and directly observable, providing clear
insights into how and why a particular prompt leads to better alignment. Third,
empirical results demonstrate that BPO not only improves the alignment of models
such as GPT-3.5 and LLaMA-2 but also outperforms RLHF and DPO when used
independently or in conjunction with these methods [93].
3.10.3 Model-adaptive prompt optimization
Traditionally, prompt optimization has focused on tailoring prompts to specific tasks
to enhance model performance. However, [92] highlighted the necessity of adapting
prompts not just to tasks but also to the specific characteristics of different LLMs.
This shift in perspective has led to the development of Model-Adaptive Prompt
Optimization (MAPO), a novel approach designed to fine-tune prompts for individ-
ual LLMs, thereby maximizing their effectiveness across various downstream tasks.
MAPO addresses the inherent variability in how different LLMs respond to the same
prompt by introducing a two-phase optimization process. The first phase involves
establishing a warm-up dataset, where candidate prompts are generated and evaluated
for their suitability to each LLM. This is followed by a combination of Supervised
Fine-Tuning (SFT) and Reinforcement Learning (RL), particularly employing tech-
niques such as Proximal Policy Optimization (PPO) and Ranking Responses from
Model Feedback (RRMF). This joint learning approach refines the prompts, ensuring
they align with the specific preferences of each LLM.
Empirical studies demonstrate that MAPO significantly improves performance in
tasks such as question-answering, classification, and text generation when compared to
conventional task-specific prompt optimization methods. By shifting the focus from a
one-size-fits-all approach to a more nuanced, model-specific strategy, MAPO enhances
the robustness and generalization of LLMs, making it a powerful tool in the prompt
engineering toolkit [92].
3.10.4 PromptAgent
The PromptAgent method suggests framing prompt optimization as a strategic plan-
ning problem. A core component of this method is the use of Monte Carlo Tree Search
(MCTS), a principled planning algorithm that strategically navigates the vast space
of expert-level prompts. Unlike conventional methods that generate prompts through
local variations, PromptAgent employs a trial-and-error mechanism, inspired by
human problem-solving strategies. This approach allows the model to iteratively
refine prompts based on error feedback, simulating future rewards and prioritizing
high-reward paths [94]. Another resource supporting this line of work is PromptSource [95], which collects over 2,000 open-source prompts for roughly 170 datasets through dataset exploration, prompt writing, and documentation, providing enhanced starting prompts.
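As a drastically simplified stand-in for the MCTS planner, the sketch below performs a greedy best-first search over prompts, where `expand` (e.g., LLM-generated revisions conditioned on error feedback) and `evaluate` (returning a reward and feedback on a validation set) are assumed callables.

```python
import heapq
import itertools

def prompt_search(seed_prompt, expand, evaluate, budget=20, beam_width=3):
    """Greedy best-first search over the prompt space: a drastically simplified stand-in
    for PromptAgent's MCTS planner."""
    counter = itertools.count()              # tie-breaker so the heap never compares prompts
    reward, feedback = evaluate(seed_prompt)
    frontier = [(-reward, next(counter), seed_prompt, feedback)]
    best = (reward, seed_prompt)
    for _ in range(budget):
        if not frontier:
            break
        _, _, prompt, fb = heapq.heappop(frontier)
        for child in expand(prompt, fb):     # revised prompts addressing the error feedback
            child_reward, child_fb = evaluate(child)
            if child_reward > best[0]:
                best = (child_reward, child)
            heapq.heappush(frontier, (-child_reward, next(counter), child, child_fb))
        frontier = heapq.nsmallest(beam_width, frontier)   # keep only the most promising nodes
        heapq.heapify(frontier)
    return best
```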
PromptAgent’s effectiveness has been demonstrated across a diverse set of tasks,
spanning general NLP challenges and domain-specific applications such as biomedical
text processing. By autonomously generating prompts that incorporate domain-
specific knowledge and detailed task instructions, PromptAgent consistently outper-
forms both human-designed prompts and other automated optimization methods
[94], highlighting the importance of integrating strategic planning and self-reflection
capabilities into prompt optimization frameworks.
3.10.5 Reinforcement learning
Reinforcement Learning (RL) for prompt optimization is an advanced technique
designed to enhance the performance of LLMs by iteratively refining the prompts
used during training and inference. This method utilizes the principles of reinforce-
ment learning to navigate the complex parameter space of large models, optimizing
the prompts for improved task-specific performance. In RL for prompt optimization,
a reward function is defined to evaluate the effectiveness of different prompts based
on the model’s output. The model then uses this feedback to adjust and optimize the
prompts through a series of iterations, ensuring that the prompts evolve to maximize
performance on the target task by leveraging the model’s ability to learn from its
interactions with the environment [96].
Consider the task of VQA, where the goal is to generate accurate answers to
questions based on visual input. Using RL for prompt optimization, the model can
start with a set of initial prompts and iteratively refine them based on the accuracy
of the generated answers. For instance, if the model is asked, “What is the color
of the car in the image?” the initial prompts might produce varied responses. The
reward function will assess these responses, favoring prompts that lead to correct
answers. Over multiple iterations, the model learns to generate more precise prompts,
improving its ability to accurately answer similar questions in the future [97].
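The toy loop below conveys this reward-driven search idea in a simplified form; a full RL formulation would learn a policy with, for example, PPO-style updates, whereas here `reward_fn` (e.g., VQA answer accuracy on a validation split) and `mutate` (a prompt-editing operator) are assumed callables.

```python
import random

def reward_driven_prompt_search(candidate_prompts, mutate, reward_fn, iterations=50, epsilon=0.2):
    """Toy reward-driven prompt optimization: keep the best-scoring prompt found so far and,
    with probability epsilon, explore a mutation of a random candidate instead."""
    best_prompt = max(candidate_prompts, key=reward_fn)
    best_reward = reward_fn(best_prompt)
    for _ in range(iterations):
        base = random.choice(candidate_prompts) if random.random() < epsilon else best_prompt
        candidate = mutate(base)
        r = reward_fn(candidate)
        if r > best_reward:
            best_prompt, best_reward = candidate, r
    return best_prompt, best_reward
```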
3.10.6 GPTs (plugins)
Before ending this discussion on prompt optimization techniques, we need to mention
the use of external prompt engineering assistants that have been developed recently
and exhibit promising potential. Unlike the methods introduced previously, these tools can help polish a prompt directly: they analyze user inputs and produce pertinent outputs within a self-defined context, thereby amplifying the efficacy of prompts. Some of the plugins pro-
vided by the OpenAI GPT store are good examples of such tools [98]. Some popular
GPT store apps that specialize in generating or optimizing prompts are shown in
Figure 15.
Fig. 15 Examples of GPT apps that specialize in generating or optimizing prompts [99].
In certain implementations, the definition of a plugin is incorporated into the
prompt, altering the output [100]. Such integration may impact the manner in which
LLMs interpret and react to the prompts, illustrating a connection between prompt
engineering and plugins. Plugins mitigate the laborious nature of intricate prompt
engineering, enabling the model to more proficiently comprehend or address user
inquiries without necessitating excessively detailed prompts. Consequently, plugins
can bolster the efficacy of prompt engineering while promoting enhanced user-centric
efficiency. These tools, akin to packages, can be seamlessly integrated into Python
and invoked directly [101,102]. For instance, the “Prompt Enhancer” plugin [103],
developed by AISEO [104], can be invoked by starting the prompt with the word
“AISEO” to let the AISEO prompt generator automatically enhance the LLM prompt
provided. Similarly, another plugin called “Prompt Perfect” can be used by starting
the prompt with ‘perfect’ to automatically enhance the prompt, aiming for the “per-
fect” prompt for the task at hand [105,106]. Nevertheless, while the use of plugins to
enhance prompts is simple and handy, it is not always clear which prompt engineer-
ing technique, or combination of techniques, is implemented by a given plugin, given
the closed-source nature of most plugins.
3.11 Retrieval augmentation
Another direction of prompt engineering research aims to reduce hallucinations.
When using AIGC tools such as GPT-4, it is common to face a problem called “hal-
lucinations”, which refer to the presence of unreal or inaccurate information in the
model’s generated output [26,107]. While these outputs may be grammatically cor-
rect, they can be inconsistent with facts or lack real-world data support. Hallucinations
arise because the model may not have found sufficient evidence in its training data to
support its responses, or it may overly generalize certain patterns when attempting
to generate fluent and coherent output [108].
An approach to reduce hallucinations and enhance the effectiveness of prompts is
the so-called retrieval augmentation technique, which aims at incorporating up-to-date
external knowledge into the model’s input [109,110]. It is emerging as an AI frame-
work for retrieving facts from external sources. [111] examines the augmentation of
context retrieval through the incorporation of external information. It proposes a
sophisticated operation: the direct concatenation of pertinent information obtained
from an external source to the prompt, which is subsequently treated as founda-
tional knowledge for input into the expansive language model. Additionally, the
paper introduces auto-regressive techniques for both retrieval and decoding, facilitat-
ing a more nuanced approach to information retrieval and fusion. [111] demonstrates
that in-context retrieval-augmented language models, when constructed upon read-
ily available general-purpose retrievers, yield significant LLM enhancements across a
variety of model dimensions and diverse corpora. In another research, [112] showed
that GPT-3 can reduce hallucinations by studying various implementations of the
retrieval augmentation concept, such as Retrieval Augmented Generation (RAG)
[113], Fusion-in-Decoder (FiD) [114], Seq2seq [115–117] and others. [118] developed
the Chain-of-Verification (CoVe) approach to reduce hallucinations, based on letting
the LLM deliberate on its own responses before self-correcting them. They suspect
that extending this approach with retrieval augmentation would likely bring further
gains. UNIWEB [119] converts knowledge-intensive tasks into a unified text-to-text framework and treats the web as a general source of knowledge.
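A minimal sketch of the concatenation-based retrieval augmentation described above is shown below; the toy word-overlap retriever stands in for a real dense or BM25 retriever, and the prompt wording is illustrative.

```python
def retrieve(query, corpus, top_k=3):
    """Toy retriever: rank passages by word overlap with the query (a stand-in for a
    dense or BM25 retriever in a real retrieval-augmented pipeline)."""
    q_words = set(query.lower().split())
    scored = sorted(corpus, key=lambda p: len(q_words & set(p.lower().split())), reverse=True)
    return scored[:top_k]

def build_augmented_prompt(question, corpus):
    """Concatenate retrieved passages as grounding context ahead of the question."""
    context = "\n".join(f"- {p}" for p in retrieve(question, corpus))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

corpus = ["The Eiffel Tower was completed in 1889.", "Mount Everest is 8,849 m tall."]
print(build_augmented_prompt("When was the Eiffel Tower completed?", corpus))
```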
3.12 Reasoning and active interaction
This subsection explores two advanced techniques that enhance the capabilities of
LLMs by integrating reasoning with interaction through external tools or other action
abilities. Automatic Reasoning and Tool Usage (ART) combines CoT prompting
with the use of specialized tools. By guiding LLMs through multi-step reasoning and
incorporating resources such as calculators and databases, ART improves the log-
ical coherence and accuracy of model outputs. The ReAct Framework (Reasoning
and Acting) synergizes reasoning with actionable steps. It prompts LLMs to devise
logical sequences and interact dynamically with external tools, enabling them to han-
dle complex, multi-step tasks efficiently. Both ART and ReAct represent significant
advancements in prompt engineering, enhancing the range and reliability of tasks that
LLMs can perform through the integration of reasoning and interaction.
3.12.1 Automatic reasoning and tool usage
ART is an advanced prompting technique that combines the principles of automatic
CoT prompting with the strategic utilization of external tools. This method aims
to enhance the reasoning capabilities of LLMs by guiding them through multi-step
reasoning processes and leveraging specialized tools to achieve more accurate and
relevant outputs [120].
ART builds on the CoT prompting technique, which encourages models to generate
intermediate reasoning steps before arriving at a final answer. In ART, these reasoning
steps are augmented by incorporating external tools such as calculators, databases,
or other software applications. The integration of tools helps LLMs to perform tasks
that require precise calculations, access to updated information, or specialized data
processing that the model alone may not handle effectively.
For example, a prompt designed using ART might guide an LLM to first outline the
steps required to solve a complex mathematical problem and then use a calculator tool
to perform the necessary calculations. This combination of reasoning and tool usage
ensures that the model’s outputs are both logically coherent and computationally
accurate.
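The sketch below illustrates this reasoning-plus-tool pattern with a calculator: the model is asked to externalize arithmetic as `CALC:` lines, which are evaluated and substituted before a final answer is requested. The marker format and the `llm` helper are assumptions for illustration, not the protocol of the original ART work.

```python
import re

def run_with_calculator(llm, question):
    """Ask the model for step-by-step reasoning in which arithmetic is delegated to a
    calculator via lines of the form `CALC: <expression>`; evaluate those expressions
    and substitute the results before asking for the final answer."""
    draft = llm(
        "Solve the problem step by step. Whenever you need arithmetic, write a line "
        "`CALC: <expression>` instead of computing it yourself.\n\n" + question
    )

    def evaluate(match):
        expression = match.group(1)
        # eval() is used here only on trusted, numeric demo input.
        return f"CALC: {expression} = {eval(expression, {'__builtins__': {}})}"

    resolved = re.sub(r"CALC:\s*([0-9+\-*/(). ]+)", evaluate, draft)
    # A second call lets the model state the final answer with the tool results filled in.
    return llm("Using these worked steps, give the final answer:\n" + resolved)
```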
[121] have demonstrated that ART can help models navigate complex problem
spaces more effectively by breaking down tasks into manageable steps and utiliz-
ing appropriate tools at each stage. For instance, the integration of ART in natural
language processing tasks has shown promising results in areas such as automated
customer service, where models need to access and process information dynamically
[122].
Moreover, ART’s approach aligns with ongoing efforts to develop more robust and
versatile AI systems capable of handling real-world tasks that demand a combina-
tion of cognitive and computational skills. [123] explores advanced ART techniques to
achieve better accuracy and reliability in AI applications. These findings underscore
the importance of ART in enhancing the functionality and performance of LLMs,
making them more adept at handling a broader range of tasks, in particular techni-
cal problem-solving tasks that require specific and precise outputs such as financial
calculations or data analysis.
3.12.2 ReAct framework
The ReAct Framework, which stands for Reasoning and Acting, synergizes the pro-
cesses of reasoning and action to enable LLMs to not only think through problems
but also interact with external tools and environments to achieve more accurate and
contextually appropriate outcomes.
The ReAct Framework operates by prompting LLMs to generate both reason-
ing traces and task-specific actions. This dual approach ensures that the model first
contemplates the problem, devises a logical sequence of thoughts, and then executes
actions that may involve querying external databases, using calculators, or interacting
with other software tools. This method is particularly effective in scenarios requir-
ing detailed reasoning followed by specific actions, thus ensuring the LLM can handle
complex, multi-step tasks efficiently [124].
For example, in a task involving financial analysis, the ReAct framework would
first prompt the LLM to outline the necessary steps to evaluate a portfolio. Subse-
quently, the model could use financial analysis tools to gather current market data
and perform calculations, integrating these results into the final analysis. This combi-
nation of reasoning and action leads to more robust and reliable outcomes compared
to using static prompts alone. Another concrete example is shown in Figure 16.
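A compact sketch of such a thought/action/observation loop is given below; the transcript format, the `tool_name[argument]` action syntax, and the `llm` and `tools` callables are illustrative assumptions rather than the original ReAct implementation.

```python
def react_loop(llm, tools, task, max_steps=5):
    """Minimal ReAct-style controller: the model alternates 'Thought:' and 'Action:' lines;
    actions of the form tool_name[argument] are executed and their observations appended."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")        # model continues with a thought and an action
        transcript += "Thought:" + step + "\n"
        if "Action:" not in step:
            break                                  # the model answered directly
        action = step.split("Action:", 1)[1].strip()
        if action.lower().startswith("finish"):
            break
        name, _, arg = action.partition("[")
        observation = tools[name.strip()](arg.rstrip("]"))
        transcript += f"Observation: {observation}\n"
    return transcript

# Example tool registry (hypothetical): a search function and a calculator.
# tools = {"search": wiki_search, "calc": lambda expr: str(eval(expr))}
```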
By integrating reasoning and action, ReAct enables LLMs to make more informed
and accurate decisions, in particular in fields such as finance, healthcare, and legal
analysis, where decisions must be based on comprehensive data and logical reasoning
[125]. Additionally, the framework’s ability to interact with external tools ensures
that the information used in decision-making is up-to-date and relevant, reducing the
likelihood of errors due to outdated or incomplete data [126].
Implementing the ReAct framework is not a trivial task, as it involves developing
prompts that guide LLMs through both thought processes and actions. This requires
a detailed understanding of the task at hand and the tools available, ensuring that
the model can seamlessly transition from reasoning to action.
4 Methodologies for multimodal large models
In recent years, vision-language models (VLMs) have made significant advancements
in multimodal learning by combining visual and linguistic information. These mod-
els have demonstrated strong capabilities in tasks such as image description and
visual question answering (VQA) [127–130]. While this review primarily focuses on
the potential of prompt engineering in LLMs, it is also pertinent to briefly introduce
the importance of VLMs and their applications in multimodal tasks to provide a more comprehensive perspective.
Fig. 16 An example of the ReAct method on a lost-key search task. The standard method guesses an answer directly (low repeatability, relies on luck); chain-of-thought reasoning alone produces a plausible plan but takes no actions, remaining limited to internal information; an act-only approach checks rooms blindly and finds the key only by luck; ReAct interleaves thoughts, actions, and observations (check the entryway, observe the key is not there, then check the kitchen and find it on the table), maximizing the probability and efficiency of finding the key.
VLMs are based on the transformer architecture, and are trained on extensive
datasets to learn complex semantic relationships. However, unlike early unimodal
models, VLMs process both textual and visual information, enabling them to establish
connections between image understanding and text generation. As can be expected,
this multimodal integration makes VLMs particularly effective at handling complex
tasks that involve both images and text.
To seamlessly integrate and interpret these diverse data types, VLMs require
sophisticated prompt designs that ensure contextual coherence and accuracy [131,
132]. Challenges such as data alignment, modality integration, and context preserva-
tion are addressed through advanced techniques such as Context Optimization (CoOp,
subsection 4.3) and Multimodal Prompt Learning (MaPLe, subsection 4.5). These
advanced prompt engineering techniques enhance the ability of VLMs to generate
nuanced and contextually rich outputs, thereby facilitating their effective utilization
in various applications and enabling them to tackle more complex tasks [131].
4.1 Zero-shot and few-shot prompting
Zero-shot and few-shot prompting, which have already been discussed in subsection
2.7 in the context of LLMs, are also pivotal techniques in the realm of VLMs, enabling
these models to handle tasks with minimal or no task-specific training data. Zero-shot
prompting allows models to perform tasks without any specific examples provided
during training, relying entirely on their pre-trained knowledge to generalize across
new tasks and domains. For example, a model such as CLIP can be prompted with
a textual description to classify images into categories it has never explicitly been
trained on [3]. On the other hand, few-shot prompting involves providing the model
with a small number of examples during inference, significantly enhancing the model’s
ability to generalize with limited data [16].
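As an illustration, zero-shot classification with CLIP can be performed by scoring an image against a handful of textual prompts; the sketch below uses the Hugging Face transformers API, with the checkpoint name, image path, and label prompts chosen purely for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("beach.jpg")                      # any local image
labels = ["a photo of a beach", "a photo of a forest", "a photo of a city street"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)     # similarity of the image to each text prompt
print(dict(zip(labels, probs[0].tolist())))
```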
In relation to these methods, [97] systematically explored a range of prompt-
ing techniques for zero-shot and few-shot visual question answering (VQA) in
vision-language models (VLMs), highlighting the impact of question templates, the
integration of image captions, and the application of chain-of-thought reasoning
on model performance. [16] showed the application of these techniques in CLIP,
highlighting the model’s ability to generalize across different domains. Additionally,
[133] presented a method for adapting CLIP to few-shot classification tasks without
additional training, emphasizing practical benefits in real-world applications.
4.2 Continuous prompt vectors
Advancements in prompt engineering have enabled more effective adaptation of pre-
trained VLMs to a wide range of downstream tasks. A promising approach in this
domain is the use of continuous prompt vectors to fine-tune models such as CLIP for
complex video understanding tasks. Unlike traditional handcrafted prompts, which
require expert knowledge and manual effort, continuous prompt vectors [134] are
learned during the training process, allowing for more flexible and efficient model
adaptation. This method involves appending or prepending sequences of random
vectors to the input text, which the model then interprets as part of its textual
input. These vectors are optimized to effectively bridge the gap between the static
image-based pre-training objectives and the dynamic requirements of video tasks,
such as action recognition, action localization, and text-video retrieval. Additionally,
lightweight temporal modeling using Transformers is applied to capture the temporal
dependencies inherent in video data.
The efficiency of this approach lies in its minimal computational requirements;
only a few parameters are trained, while the core model remains frozen. Despite this,
the method has demonstrated competitive performance across various benchmarks,
highlighting its potential in extending the capabilities of VLMs to handle resource-
intensive video tasks with greater flexibility and accuracy. This continuous prompt-
based adaptation represents a significant step forward in the evolution of prompt
engineering, offering a scalable and effective solution for leveraging pre-trained models
in more complex and diverse applications [134].
4.3 Context optimization
Context Optimization (CoOp) [135] is an innovative prompt learning approach
specifically designed for VLMs. CoOp focuses on enhancing the adaptability and per-
formance of these models by optimizing context-specific prompts. This methodology
involves the creation of learnable context vectors that are embedded within the model’s
architecture, enabling it to dynamically adjust to different downstream tasks.
CoOp leverages the dual-stream architecture of VLMs, such as CLIP [16] and
ALIGN [136], by performing context optimization on top of these pre-trained models.
CoOp introduces learnable context vectors that are fine-tuned to minimize classifica-
tion loss, thus avoiding extensive manual prompt engineering. By utilizing learnable
context vectors, CoOp fine-tunes the prompts to align with the specific characteristics
of the complex input data. This results in improved performance and better generaliza-
tion across various scenarios [137]. This method is particularly valuable in applications
such as image recognition and VQA, where the context can vary significantly [138].
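The sketch below outlines the CoOp idea in PyTorch: a small set of learnable context vectors is prepended to frozen class-name token embeddings, only these vectors receive gradients, and classification logits come from image-text feature similarity. The encoder interfaces, shapes, and logit scale are simplifying assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoOpHead(nn.Module):
    """Learnable context vectors prepended to each class-name embedding; only these
    vectors are trained, while the CLIP-style encoders stay frozen."""
    def __init__(self, text_encoder, class_token_embeddings, n_ctx=16, dim=512):
        super().__init__()
        self.text_encoder = text_encoder                      # frozen callable: tokens -> text features
        self.class_tokens = class_token_embeddings            # (n_classes, n_name_tokens, dim), frozen
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # the only trainable parameters

    def class_features(self):
        n_cls = self.class_tokens.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)     # share the context across classes
        prompts = torch.cat([ctx, self.class_tokens], dim=1)  # "[V]_1 ... [V]_M <class name>"
        return F.normalize(self.text_encoder(prompts), dim=-1)

    def forward(self, image_features):                        # features from the frozen image encoder
        text_features = self.class_features()
        return 100.0 * F.normalize(image_features, dim=-1) @ text_features.t()  # logits

# Training (sketch): cross-entropy on these logits w.r.t. ground-truth labels,
# optimizing only `head.ctx` while both encoders remain frozen.
```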
To illustrate the practical application of CoOp, consider a VQA task [127–130]. In
a VQA scenario, the model is presented with an image and a corresponding question,
and it must generate an accurate answer based on the visual and textual information.
By leveraging CoOp, the model utilizes learnable context vectors to optimize the
prompts specific to the context of the input image and question. This process enhances
the model’s ability to interpret the visual elements and comprehend the textual query,
leading to more precise and contextually relevant answers. For instance, if the model
is shown an image of a beach scene with the question “What activity are the people
engaged in?”, CoOp would utilize learnable context vectors to optimize the textual
prompts. These context vectors help the text encoder generate features that focus
on relevant aspects of the image, such as identifying people, recognizing activities,
and understanding the overall context of the scene. By aligning these optimized text
features with the image features extracted by the image encoder, CoOp enables the
model to generate a precise and contextually relevant answer, such as “The people
are playing volleyball on the beach.”
Regarding CoOp’s effectiveness, [135] showed that models using CoOp signifi-
cantly outperform traditional models in tasks such as image recognition and VQA.
Additionally, [138] highlighted the benefits of ensembling context optimization, which
further enhances the model’s performance by combining multiple context vectors. This
approach has been shown to improve the robustness and generalization of VLMs in
real-world applications [139].
4.4 Conditional prompt learning
Conditional Context Optimization (CoCoOp) [140] is a methodology that dynami-
cally tailors prompts based on specific conditions or contexts. Specifically, CoCoOp
employs a lightweight neural network to generate input-conditional prompt vectors
for each image, ensuring that the pre-trained model parameters remain unchanged.
By leveraging contextual information, CoCoOp can provide more precise and relevant
guidance to the model, which is particularly useful in complex, multimodal scenarios
where the interplay between different types of data must be carefully managed.
One significant advantage of CoCoOp is its ability to adapt to new and unseen
data without the need for fine-tuning the pre-trained model, thanks to the context-
specific prompts generated by the lightweight neural network. In other words, a VLM
enhanced with conditional prompts can more accurately interpret and respond to
images and questions it has not encountered during training [140,141]. This capability
is critical for applications such as image captioning, VQA, and scene understanding,
where the context can vary widely.
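A minimal sketch of this input-conditional mechanism is shown below: a lightweight meta-network maps image features to a per-image shift that is added to the shared context vectors. The layer sizes and interfaces are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MetaNet(nn.Module):
    """Lightweight network that maps image features to a per-image token, which is added
    to every learnable context vector (the input-conditional prompt idea of CoCoOp)."""
    def __init__(self, feat_dim=512, ctx_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 16),
            nn.ReLU(),
            nn.Linear(feat_dim // 16, ctx_dim),
        )

    def conditional_context(self, ctx, image_features):
        # ctx: (n_ctx, dim) shared learnable vectors; image_features: (batch, feat_dim)
        shift = self.net(image_features)               # (batch, dim)
        return ctx.unsqueeze(0) + shift.unsqueeze(1)   # (batch, n_ctx, dim)
```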
Consider an image captioning task where the goal is to generate descriptive
captions for images. Using CoCoOp, the model enhances its performance with dynam-
ically generated prompts tailored for different types of scenes. Specifically, CoCoOp
extends the CoOp method by training a lightweight neural network to generate input-
conditional tokens for each image. As detailed in [140], this allows the model to adapt
to various contexts without extensive retraining, resulting in more accurate and con-
textually relevant captions. For example, a prompt for an outdoor scene might include
contextual cues related to nature, weather, and activities, while a prompt for an indoor
scene might focus on objects, people, and interactions. For an image of a bustling
market, the conditional prompt could include cues such as “Identify the types of
products being sold” or “Describe the interactions between vendors and customers”.
This enables the model to produce a caption such as “Vendors selling fresh fruits and
vegetables in a crowded market, with customers browsing and purchasing items” [140].
This dynamic adaptation improves caption accuracy and enhances the model’s
ability to generalize to novel scenes, addressing the limitations of static prompt
methods such as CoOp. Besides image captioning, the improved generalization capa-
bilities of this technique make the model more robust in tasks such as VQA, image
classification, and other real-world applications [142].
4.5 Multimodal prompt learning
The core idea of Multimodal Prompt Learning (MaPLe) is to introduce and optimize
prompts for both the vision and language components simultaneously. By embedding
prompts at various stages within the transformer architecture, MaPLe ensures that
the model can adaptively learn contextual information pertinent to the specific task
at hand [143]. This hierarchical approach allows the model to progressively refine its
understanding and integration of multimodal inputs, leading to improved performance
across a range of applications.
One of the critical innovations of MaPLe is its ability to enhance task relevance.
Traditional prompt engineering often focuses on either vision or language prompts
in isolation, which can limit the model’s ability to fully leverage the complementary
information available in multimodal data. MaPLe overcomes this limitation by jointly
optimizing prompts for both modalities, thereby facilitating a more integrated and
coherent representation of the input data [141,143]. A detailed comparison between MaPLe and the traditional methods (CoOp and CoCoOp) is shown in Figure 17.
Fig. 17 Comparison between MaPLe and the earlier methods CoOp and CoCoOp. CoOp (September 2021) learns static context vectors with no coupling between vision and language prompts; it simplifies prompt engineering and performs well on seen classes at moderate computational cost, but generalizes poorly to unseen classes, tasks, and datasets. CoCoOp (March 2022) generates input-conditional prompts, improving adaptability to unseen classes and various tasks at the price of increased computational complexity from dynamic prompt generation. MaPLe (October 2022) couples vision and language prompts across multiple levels, yielding high adaptability, excellent generalization, and highly effective multimodal integration, though with the highest computational and implementation cost due to multi-modal, multi-level prompt learning.
Another important mechanism of MaPLe, the hierarchical learning mechanism,
allows the model to process and integrate information at multiple levels of abstraction.
This is particularly beneficial for complex tasks that require a deep understanding of
both visual and textual elements. By optimizing prompts at different layers within
the transformer, MaPLe can better capture the intricate dependencies between vision
and language inputs [143,144].
[143] showed that MaPLe significantly outperforms baseline models in tasks such as
image recognition and VQA. Similarly, [141] highlighted the importance of Multimodal
prompt learning in enhancing the adaptability and generalization of VLMs.
To illustrate the practical application of MaPLe, consider the task of VQA [127–130]. In a typical VQA scenario, a model is provided with an image and a related
question, and it must generate a correct and contextually relevant answer. Using
MaPLe, the model can be fine-tuned with multimodal prompts that simultaneously
address both the visual content and the textual question. For instance, given an
image of a bustling market and the question “What fruit is the vendor selling?”,
MaPLe would embed prompts at various levels of the transformer’s vision and lan-
guage branches. These prompts might include visual prompts that focus on identifying
objects and text prompts that guide the model to look for specific answer-relevant
details. By processing these prompts hierarchically, the model can effectively inte-
grate visual cues (like recognizing apples and oranges in the image) with the textual
context (understanding the question) to generate an accurate answer (e.g., “The ven-
dor is selling apples and oranges”). This multimodal approach ensures that the model
leverages both the visual and textual information in a coherent and integrated man-
ner, resulting in improved performance on VQA tasks compared to models that do
not utilize such comprehensive prompt learning strategies.
5 Assessing the efficacy of prompt methods
There exist several ways to evaluate the quality of the output of an LLM. Evaluation
methods can generally be divided into subjective and objective categories to assess
the efficacy of current prompt methods in AIGC tools.
5.1 Subjective and objective evaluations
The task of prompt engineering can be challenging because it is difficult to determine
how a prompt is more effective solely based on its raw text form [145]. Therefore,
evaluating prompts requires a combination of subjective and objective methods. Sub-
jective evaluations primarily rely on human evaluators to assess the quality of the
generated content. Objective evaluations, also known as automatic evaluation meth-
ods, use algorithms to score the quality of text generated by LLMs or test on various
benchmarks to quantitatively measure the efficacy of prompt methods.
Subjective evaluation and objective evaluation methods each have their advantages
and disadvantages. Subjective evaluation is more in line with human intuition, but it
is also more expensive and time-consuming [146]. Objective evaluation is less expen-
sive and quicker than subjective evaluation. For instance, despite numerous studies highlighting the limited correlation between BLEU (and similar automatic metrics) and human assessments, the popularity of these metrics has remained unaltered [147,148]. The
best way to evaluate the quality of LLM output depends on the specific application
[149].
5.1.1 Subjective evaluations
Subjective evaluations depend on human evaluators to judge the quality of the gen-
erated content. Human evaluators can read the text generated by LLMs and score
it for quality. Subjective evaluations typically include aspects such as fluency, accu-
racy, novelty, and relevance [34]. [150] builds a human evaluation for their “Chain of
Density” (CoD) method based on a “good summary” standard [151]. The four authors of the paper scored 100 summaries, including randomly shuffled CoD summaries, to evaluate performance. [77] used human judgments to compare the outputs of the “tree-of-thought” method with those of other methods by asking the model to complete creative writing tasks. They averaged the scores for each output and found that the human judgments were consistent, indicating that the results from human judges are credible.
[152] invited three human annotators to create an evaluation set for exploring the alignment between human and automatic evaluation. [146] assessed quality with three human judges who indicated whether the generated norms and moral actions were relevant to the given moral story. Beyond these examples, subjective evaluations are increasingly used to assess content generated by models in areas that are difficult to represent with datasets and are more abstract, such as writing and summarization.
5.1.2 Objective evaluations
Objective evaluations, also known as automatic evaluation methods, use algorithms
to assess the quality of content generated by LLMs or to conduct tests on various
benchmarks, quantitatively measuring the effectiveness of different prompt meth-
ods. Human-AI Language-based Interaction Evaluation (HALIE) [153], components
of human-LM interactive systems and evaluation metrics, putting interaction at the
center of LM evaluation. One kind of objective evaluation employs automated met-
rics, such as BiLingual Evaluation Understudy (BLEU) [154], which assigns a score
to system-generated outputs, offering a convenient and rapid way to compare various
systems and monitor their advancements. Other evaluations such as Recall-Oriented
Understudy for Gisting Evaluation (ROUGE) [155], and Metric for Evaluation of
Translation with Explicit ORdering (METEOR) [156], assess the similarity between
the generated text and reference text. More recent evaluation methods, such as
BERTScore [157], aim to assess at a higher semantic level.
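For illustration, the snippet below computes BLEU and ROUGE for a candidate sentence against a reference using the commonly used nltk and rouge_score packages (assumed to be installed); it is meant only to show how such automatic metrics are invoked in practice.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
candidate = "the cat is sitting on the mat"

# Sentence-level BLEU with smoothing (short sentences otherwise score zero on higher n-grams).
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)

# ROUGE-1 and ROUGE-L F-measures between reference and candidate.
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True).score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print({name: round(score.fmeasure, 3) for name, score in rouge.items()})
```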
However, these automated metrics often fail to capture the assessment results
of human evaluators fully and therefore must be used with caution [158]. Consequently, many researchers evaluate their methods by quantifying model performance on specific tasks. Some of these tasks are traditional games, such as the Game of 24
and 5x5 Crosswords [77]. The other tasks, known as benchmarks, are datasets that contain instructions for models to complete. Apart from comprehensive benchmark suites such as the Beyond the Imitation Game benchmark (BIG-bench) [159] and Big-Bench Hard (BBH) [160], which evaluate the logical soundness of arguments, four kinds of benchmarks are summarized below. These benchmarks provide standardized tasks and datasets that facilitate consistent and comparable assessments of different approaches. When testing prompt engineering methods, the goal is not to pursue the “best” benchmark but to choose the one most suitable for evaluating the model's abilities, because no single model can perform best in all kinds of tasks [161].
Math Word Problems (MWPs)
Objective evaluations based on MWPs test a model's ability to understand numerical-
related questions. The task is challenging because the model needs to understand
relevant information from natural language text as well as perform mathematical rea-
soning to solve it. The complexity of MWPs can be measured along multiple axes, e.g.,
reasoning and linguistic complexity, and world and domain knowledge. Similar to the earlier benchmarks MATH23K [162] and the Hybrid Math Word Problems dataset (HMWP) [163], Simple Variations on Arithmetic Math word Problems (SVAMP) [76] is an MWP benchmark of elementary-level math word problems, which evaluates models by asking them to produce equations and answers to elementary-school questions. Dolphin1878 [164] is a number-word problem dataset with over 1,500 problems. ARIS [165] and AllArith [166] contain arithmetic word problems, and MAth Word ProblemS (MAWPS) [167] presents algebraic word problems to test problem-solving skills. Unlike these benchmarks, which each cover a single category, the Academia Sinica Diverse MWP Dataset (ASDiv) [168], Algebra Question Answering (AQuA) [169], and MathQA [170] span more domains, including arithmetic, algebraic, and domain-knowledge problems. SingleEQ [171] is constructed with both single-step and multi-step math problems from mixed sources.
MultiArith [172] includes elementary math problems with multiple steps. MATH [173]
and GSM8K [60] require models to solve complex mathematical problems, empha-
sizing the need for a deep understanding of mathematical concepts and reasoning.
Process-supervised Reward Models (PRM) 800K [174] includes 4.5K MATH test
problems, and contains about 800,000 step-level labels over 75,000 solutions.
Question Answering (QA) Tasks
QA tasks require models to return feedback due to the given question. Massive Multi-
task Language Understanding (MMLU) [175] is a QA benchmark designed to measure
knowledge acquired during pretraining by evaluating models exclusively in zero-shot
and few-shot settings. Many QA benchmarks are also related to knowledge-based
tasks. Fact Extraction and VERification (FEVER) [176] focuses on fact verification,
requiring models to verify claims generated by altering sentences extracted from
Wikipedia. MIDTERMQA [177] focuses on the 2022 U.S. midterm elections since the
knowledge cutoff of black-box LLMs is often 2021 or earlier. These benchmarks play a
critical role in assessing the models’ abilities to comprehend, analyze, and synthesize
information from diverse sources. NarrativeQA [178] is built from materials such as movies and books, with nearly 63k tokens of input per question. The Question Answering
with Long Input Text, Yes (QuALITY) [179] is a multiple-choice QA dataset contain-
ing 2k–8k tokens from English source articles. CommonsenseQA [180,181] focuses on
commonsense question answering based on ConceptNet 5.5 [182], an open multilin-
gual graph of general knowledge. HotPotQA [183] is collected by crowdsourcing over sources such as Wikipedia articles, and the AI2 Reasoning Challenge (ARC) [184] includes a corpus of 14M science sentences and 7,787 non-diagram, multiple-choice science questions. The GovReport
[185] dataset focuses on summarizing complex government reports, testing the mod-
els’ ability to distill and synthesize critical information. QA benchmarks thus challenge models’ reasoning and their ability to use commonsense knowledge.
Language Understanding Tasks
In early efforts for language understanding and inductive tasks, Text REtrieval Confer-
ence (TREC) [186] focuses on the problem of retrieving answers rather than document
lists. Stanford Sentiment Treebank (SST) [187] is constructed with fully labeled parse
trees, enabling a comprehensive analysis of the compositional effects of sentiment in
language; its variants SST-2 and SST-5 are named after their number of sentiment labels. Summarization tasks, as tested by datasets such as SummScreenFD [188], measure the effectiveness of the