arXiv:2502.18836v1 [cs.AI] 26 Feb 2025
REALM-Bench: A Real-World Planning Benchmark
for LLMs and Multi-Agent Systems
Longling Geng
gll2027@stanford.edu
Stanford University
Stanford, CA, USA
Edward Y. Chang
echang@cs.stanford.edu
Stanford University
Stanford, CA, USA
Abstract
This benchmark suite provides a comprehensive evaluation frame-
work for assessing both individual LLMs and multi-agent systems
in real-world planning scenarios. The suite encompasses eleven de-
signed problems that progress from basic to highly complex, incor-
porating key aspects such as multi-agent coordination, inter-agent
dependencies, and dynamic environmental disruptions. Each prob-
lem can be scaled along three dimensions: the number of parallel
planning threads, the complexity of inter-dependencies, and the
frequency of unexpected disruptions requiring real-time adapta-
tion. The benchmark includes detailed specifications, evaluation
metrics, and baseline implementations using contemporary frame-
works like LangGraph, enabling rigorous testing of both single-
agent and multi-agent planning capabilities. Through standardized
evaluation criteria and scalable complexity, this benchmark aims
to drive progress in developing more robust and adaptable AI plan-
ning systems for real-world applications.
Keywords
LLMs, Multi-Agent Systems, Planning Benchmark
1 Introduction
As large language models (LLMs) continue to advance in reason-
ing and planning, as demonstrated by OpenAI’s GPT-4o-Task [19],
DeepSeek’s R1 [9], Anthropic’s Claude 3.5 Sonnet [1], and Gem-
ini [24], the research community is increasingly focusing on devel-
oping multi-agent systems (MAS) powered by these models. Re-
cent innovations include AutoGen [25], CAMEL [16], CrewAI [18],
LangGraph [15], Dspy [14], and XAgent [26], among others. Al-
though individual LLMs demonstrate significant capabilities, their
true potential is realized when they collaborate to tackle complex
real-world problems [5].
Most AI benchmarks emphasize perception, language understand-
ing, or basic reasoning. However, real-world challenges, such as
supply chain management, disaster response, healthcare logistics,
and investment strategies, demand coordinated planning and decision-making
among specialized agents. There is a pressing need for robust benchmarks
that can evaluate the performance of both single-agent systems and MAS
in these complex, high-stakes domains.
1.1 The REALM Benchmark Suite
REALM-Bench (Real-world Planning Benchmark for LLMs and Multi-
Agent Systems) addresses the need for rigorous evaluation with
carefully curated planning challenges. These scenarios are designed
to be both tractable, enabling human validation and debugging,
and sufficiently complex to push the boundaries of current AI systems.
Each challenge requires reasoning and validation over sequential
actions, parallel processes, resource constraints, and unexpected
disruptions [3, 4, 6, 7, 13].
The suite consists of eleven scenarios that progressively increase
in complexity across three key dimensions:
1. Parallel Planning Threads: The number of concurrent plan-
ning processes that must be coordinated.
2. Inter-Dependencies: The complexity of relationships and con-
straints between these planning threads.
3. Disruption Frequency and Impact: The rate and severity of
unexpected events that require the adaptation of the plan.
Next, we describe how each scenario can be scaled along these
three dimensions.
1.2 Benchmark Scalability
While the base versions of each scenario enable detailed analysis
and debugging, they can be scaled along the three dimensions de-
fined above: parallel planning threads, inter-dependencies, and dis-
ruption frequency and impact.
For example, an urban ride-sharing scenario becomes increas-
ingly complex as the number of vehicles and passengers grows,
with interdependent carpooling routes and frequent traffic disrup-
tions necessitating real-time plan adjustments.
This scalability allows AI planning systems to be evaluated un-
der progressively challenging conditions, while still allowing de-
tailed analysis of failure modes in simpler scenarios.
1.3 Availability and Access
The REALM-Bench Suite V1.0 is available on GitHub [11]. In ad-
dition, we plan to host competitions and workshops at major AI
conferences in 2025 to foster community engagement and further
development.
2 Related Benchmark
Planning benchmarks have evolved from testing basic STRIPS-style
planning to evaluating increasingly sophisticated planning capa-
bilities. The International Planning Competition (IPC) has been a
primary driver of planning benchmarks since 1998, using PDDL
to specify domains like BlocksWorld, Logistics, and Rovers [23].
While valuable for testing classical planning algorithms, these bench-
marks focus on deterministic environments with complete infor-
mation and lack the dynamic disruptions common in real-world
scenarios.
More recent benchmarks, such as the Process Planning Compe-
tition (PPC), have shifted toward continuous processes and tempo-
ral constraints [22]. Their manufacturing scenarios include parallel
ACM, February, 2025 Longling Geng and Edward Y. Chang
activities and resource dependencies, but the disruptions remain
limited to machine breakdowns with known repair distributions.
Similarly, the Dynamic Planning Competition introduces environ-
mental changes during plan execution, yet it focuses primarily on
path planning and navigation scenarios [10].
The annual Automated Negotiation Agents Competition (ANAC),
established in 2010, has evolved to incorporate planning elements
within its supply chain scenarios [17]. However, its scope remains
primarily focused on bilateral negotiations rather than comprehensive
planning under uncertainty. For example, the 2024–25 competition
featured a main challenge titled “Split the Pie,” an artificial and
simplified negotiation scenario where agents divide resources
between parties. The supply chain problems in ANAC do not in-
volve contingency planning, resource reallocation, or adaptation
to unexpected disruptions.
Specifically for testing LLMs’ planning capabilities, TimeBench
[8] and TaskBench [21] represent two approaches to evaluating
AI planning. TimeBench focuses on temporal reasoning by testing
systems’ ability to understand time dependencies and scheduling
constraints, though it often relies on synthetic scenarios that fail to
capture the dynamic nature of real-world temporal relationships,
where deadlines shift and durations remain uncertain. TaskBench,
on the other hand, evaluates practical task automation and step-
by-step planning; it provides valuable insights into an AI system’s
ability to decompose complex goals into manageable steps, but its
scenarios may oversimplify the challenges of real-world automa-
tion, where outcomes are uncertain and processes are deeply in-
terconnected.
This landscape reveals several gaps in existing benchmarks:
1. Limited Disruption Modeling: Most benchmarks treat un-
certainties as static probability distributions rather than dy-
namic, interdependent events that can cascade through sys-
tems.
2. Simplified Dependencies: Real-world planning problems in-
volve rich networks of temporal, resource, and causal depen-
dencies that exceed the complexity found in current bench-
marks.
3. Restricted Scope: Benchmarks tend to focus on specific sub-
problems (path planning, task allocation, etc.) rather than end-
to-end planning scenarios that combine multiple challenges.
4. Artificial Constraints: Many benchmarks use simplified rep-
resentations (like PDDL) that cannot capture the nuanced con-
straints and objectives found in real-world planning problems.
5. Limited Scalability: Few benchmarks allow systematic scal-
ing of complexity along multiple dimensions while maintain-
ing problem tractability for analysis.
6. LLM-Specific Challenges: Although LLMs have achieved remarkable
successes, the transformer architecture exhibits certain
limitations. For example, an attention sink phenomenon
can cause certain tokens to be neglected, potentially skewing
model predictions [27]. Additionally, maximum likelihood train-
ing can introduce biases that limit output diversity and quality
[4, 12]. Finally, chain-of-thought approaches may suffer from
pitfalls such as error propagation and inconsistent reasoning [2,
20]. A test suite should specifically examine these LLM-related
issues.
Our proposed REALM-Bench addresses these limitations by pro-
viding scenarios that combine dependencies, dynamic disruptions,
and scalability while remaining tractable for systematic evaluation.
This allows testing of planning systems under conditions that bet-
ter reflect the challenges of real-world applications.
3 Benchmark Structure
Problems are categorized into three difficulty levels based on the
number of parallel execution threads, the complexity of dependen-
cies, and real-time disruptions.
3.1 Entry Level (1-2 threads)
Problems focusing on basic coordination with limited dependen-
cies:
- Single or dual thread execution
- Basic timing and resource constraints
- Simple disruption scenarios
- Example: Campus tour coordination with one and two groups
3.2 Intermediate (3-4 threads)
Problems requiring significant coordination across multiple execu-
tion paths:
- Three to four parallel threads
- Complex timing dependencies
- Resource sharing constraints
- Example: Wedding logistics with multiple vehicles and tasks
3.3 Advanced (5+ threads)
Problems with real-world complexity:
- Five or more parallel threads
- Complex inter-dependencies
- Multiple resource conflicts
- Dynamic disruption scenarios
- Example: Thanksgiving dinner coordination, natural disaster re-
lief, and supply chain management
3.4 Evaluation Metrics
Each problem is evaluated across five key dimensions:
-Planning Quality: Effectiveness of initial plan generation
-Coordination: Management of parallel thread execution
-Adaptation: Response to disruptions and changes
-Resource Management: Resolution of resource conflicts
-Constraint Satisfaction: Maintenance of problem constraints
4 Benchmark Problem Specifications
REALM-Bench comprises eleven foundational problem frameworks
that systematically evaluate both sequential and reactive planning.
Building on the key dimensions introduced earlier (parallel threads,
inter-dependencies, and dynamic disruptions), these frameworks
progress from straightforward single-thread execution to complex
multi-agent scenarios with real-time challenges.
Considerations #1 Problem Complexity: Each framework can
be further scaled to create more challenging variants:
* Expanding the scale of agents and resources (e.g., from dozens
to thousands)
Table 1: Single Tour Campus Navigation Problem
Metrics:
-Total tour time: Minimize while meeting all constraints
-Visit coverage: All locations must be visited
Locations: Five campus buildings 𝐿={𝑆, 𝐿, 𝐵, 𝐴, 𝐷}
-𝑆: Student Center
-𝐿: Library
-𝐵: Lab Building
-𝐴: Athletics Center
-𝐷: Dormitory
Travel Times (minutes):
- 𝑆-𝐿: 10, 𝑆-𝐵: 15, 𝑆-𝐴: 20, 𝑆-𝐷: 15
- 𝐿-𝐵: 10, 𝐿-𝐴: 25, 𝐿-𝐷: 20
- 𝐵-𝐴: 15, 𝐵-𝐷: 25
- 𝐴-𝐷: 20
Visit Requirements:
- Start time given
- Each location requires 30-minute visit
- Tour starts/ends at Student Center (𝑆)
- Group size: 20 people
Time Constraints:
- Lab Building: Only 9 AM - 4 PM
- Library: After 10 AM
- Total tour must complete by 5 PM
Table 2: Multi-Group Campus Tour Problem
Groups:
-𝐺1: 15 people (domestic students)
-𝐺2: 20 people (international students)
Locations: Ten campus buildings $L = \{S, L, B, A, D, C, M, R, H, P\}$
with capacities $cap_l$:
$$cap_l = \begin{cases}
40 & l \in \{S, A\} \\
30 & l \in \{L, D, C\} \\
25 & l \in \{B, M\} \\
20 & l \in \{R, H, P\}
\end{cases}$$
Constraints:
- Tour start time for each group: between 9 AM and 10 AM
- Total visitors must not exceed location capacity
- Each location requires 30-minute visit
- Both tours start/end at Student Center
- Complete all tours by 5 PM
Additional Requirements:
- Lab tours (𝐵) only 9 AM - 4 PM
- Dining (𝐶) must be visited between 11 AM - 2 PM
- Library (𝐿) after 10 AM
* Widening geographic distribution (e.g., local to global)
* Increasing disruption frequency and severity (e.g., isolated events
to cascading failures)
* Introducing uncertainties in execution times and outcomes (e.g.,
probabilistic durations)
* Adding hierarchical dependency networks (e.g., sub-networks
with internal dependencies)
* Accounting for agent properties (e.g., atomicity, idempotency)
This extensible design enables researchers to incrementally assess
their systems’ capabilities while preserving each scenario’s funda-
mental planning challenge.
Considerations #2 Implementation: Three main approaches
can be used to tackle these problems:
Table 3: Urban Ride Sharing Problem
Metrics:
-On-time performance: No penalty for early arrivals.
-Total distance traveled.
Locations: Seven locations 𝑉={𝐴, 𝐵, 𝐶, 𝐷, 𝐸, 𝐹, 𝐺}, where 𝐺 is
Boston Logan Airport (BOS). Urban locations 𝐴–𝐹 are all 10 km from
each other, while distances to BOS are 30+ km.
Distances (km):
From:              𝐴   𝐵   𝐶   𝐷   𝐸   𝐹
To other 𝐴–𝐹:     10  10  10  10  10  10
To 𝐺 (BOS):       35  33  36  34  32  31
Travel speed: 60 km/h within 𝐴–𝐹, and 100 km/h from 𝐴–𝐹 to 𝐺.
Passengers: Each passenger specifies an arrival time at BOS (𝐺). The
dispatcher will instruct drivers when to pick up passengers to ensure
on-time arrival at BOS.
Ride Requests (Desired BOS arrival time given):
-𝑟1: Pickup at 𝐴, to 𝐺 by 08:45
-𝑟2: Pickup at 𝐵, to 𝐺 by 08:50
-𝑟3: Pickup at 𝐶, to 𝐺 by 08:55
-𝑟4: Pickup at 𝐷, to 𝐺 by 09:00
Available Vehicles (Capacity 2 passengers):
-𝑘1: at 𝐴, 𝑘2: at 𝐶, and 𝑘3: at 𝐸
Scheduling Constraints:
- The dispatcher determines the pickup times based on a feasible
schedule. Pickup times must allow the driver to first reach the
passenger location (𝐴–𝐹) and then drive to 𝐺 in time.
•Rudimentary Manual Approach: Hand-craft a directed graph,
choose a solver, and input parameter values to obtain a solution.
Although direct, this method demands considerable human ef-
fort and domain expertise.
•LLM-Assisted Semi-automation: Use an LLM to suggest algo-
rithms, extract parameters from specifications, and assist with
coding. This approach can handle moderate complexity but be-
comes less feasible at scale.
•Full Automation: Provide the problem statement to a multi-agent
framework (e.g., MACI [6]) that executes end-to-end, including
validation and replanning. This approach, combined with
human oversight, scales best for complex scenarios with large
numbers of nodes and frequent environmental changes.
4.1 P1: Campus Single-Tour Navigation
Problem Statement: A single autonomous agent must navigate
a predefined campus environment to complete a sequence of way-
points while minimizing travel time. The scenario assumes a static
environment without disruptions.
Problem Specification:
-Environment: A known map with a finite set of locations.
-Goal: Visit all designated waypoints within a given timeframe.
-Constraints: Each location's opening hours must be respected, each
visit lasts at least 30 minutes, and the tour must be completed by 5 PM.
-Optimization Metric: Shortest path (time or distance).
A meta-plan, shown in Table 1, provides the high-level struc-
ture and constraints for the problem. This meta-plan serves as in-
put to specialized solvers, such as dynamic programming or Monte
Carlo algorithms, which then generate detailed, executable work-
flows. The process transforms abstract planning requirements into
concrete, implementable sequences of actions while respecting all
specified constraints and optimization objectives.
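As a hedged illustration of how the Table 1 meta-plan could be handed to a solver, the sketch below enumerates all visit orders by brute force rather than using dynamic programming or Monte Carlo search. It assumes a 9:00 AM start (the table only says the start time is given), treats the Student Center as the start and end point without its own 30-minute visit, and lets the group wait outside a location that has not yet opened. The names `finish_time` and `best_tour` are illustrative and not part of the benchmark.

```python
from itertools import permutations

# Travel times in minutes from Table 1, made symmetric.
_PAIRS = {
    ("S", "L"): 10, ("S", "B"): 15, ("S", "A"): 20, ("S", "D"): 15,
    ("L", "B"): 10, ("L", "A"): 25, ("L", "D"): 20,
    ("B", "A"): 15, ("B", "D"): 25, ("A", "D"): 20,
}
TRAVEL = {**_PAIRS, **{(b, a): t for (a, b), t in _PAIRS.items()}}

VISIT = 30            # minutes spent at each location
DEADLINE = 17 * 60    # tour must be back at S by 5 PM
# (earliest visit start, latest visit end), in minutes after midnight.
WINDOWS = {"B": (9 * 60, 16 * 60), "L": (10 * 60, DEADLINE)}


def finish_time(order, start):
    """Return the time the tour is back at S, or None if a constraint fails."""
    t, here = start, "S"
    for loc in order:
        t += TRAVEL[(here, loc)]
        opens, closes = WINDOWS.get(loc, (0, DEADLINE))
        t = max(t, opens) + VISIT   # wait if early, then visit
        if t > closes:
            return None
        here = loc
    t += TRAVEL[(here, "S")]
    return t if t <= DEADLINE else None


def best_tour(start=9 * 60):
    """Try every visit order and keep the earliest feasible finish."""
    best = None
    for order in permutations(["L", "B", "A", "D"]):
        end = finish_time(order, start)
        if end is not None and (best is None or end < best[1]):
            best = (order, end)
    return best


order, end = best_tour()
print(" -> ".join(("S",) + order + ("S",)), f"ends at {end // 60}:{end % 60:02d}")
```

Under these assumptions the library's 10 AM opening makes the order matter: visiting the library first forces a wait, whereas visiting it last does not. Brute force is only practical here because there are 24 orders; scaled variants would need a real solver.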
4.2 P2: Multi-Group Campus Tour
Problem Statement: Multiple groups of visitors require guided
tours in different locations on a university campus, with optimized
scheduling of multiple tour guides. This problem shares the same
metrics as P1.
Problem Specification:
- Multiple agents (tour guides) must coordinate to serve different
groups of visitors.
- Each group has predefined preferences and time constraints.
- Agents must follow non-overlapping paths while minimizing
idle time.
- Location visiting hours must be observed.
Details are provided in Table 2.
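The capacity constraint in Table 2 can be checked mechanically. The following is a minimal sketch, assuming a schedule is given as a list of (location, visit start in minutes after midnight) pairs per group and that every visit lasts exactly 30 minutes; the function name `capacity_conflicts` and the schedule format are hypothetical, not part of the benchmark.

```python
# Group sizes and location capacities from Table 2.
GROUP_SIZE = {"G1": 15, "G2": 20}
CAPACITY = {
    **dict.fromkeys("SA", 40),
    **dict.fromkeys("LDC", 30),
    **dict.fromkeys("BM", 25),
    **dict.fromkeys("RHP", 20),
}
VISIT = 30  # minutes per visit


def capacity_conflicts(schedules):
    """Return (location, time, groups) triples where occupancy exceeds capacity.

    Occupancy can only increase when a visit starts, so it is enough to
    check each location at every visit start time.
    """
    visits = [(loc, start, g) for g, plan in schedules.items() for loc, start in plan]
    conflicts = []
    for loc, t, _ in visits:
        present = sorted({g for l, s, g in visits if l == loc and s <= t < s + VISIT})
        load = sum(GROUP_SIZE[g] for g in present)
        if load > CAPACITY[loc] and (loc, t, present) not in conflicts:
            conflicts.append((loc, t, present))
    return conflicts


# Both groups overlap in the library (35 > 30) but fit together in athletics (35 <= 40).
schedules = {
    "G1": [("L", 600), ("A", 660)],
    "G2": [("L", 615), ("A", 660)],
}
print(capacity_conflicts(schedules))
```

With 15 and 20 visitors, the two groups can share only the Student Center and the Athletics Center at the same time, which is what forces the two tours apart everywhere else.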
4.3 P3: Urban Ride-Sharing (URS)
Problem Statement: Optimize real-time ride assignments for mul-
tiple vehicles and passengers in an urban environment, balancing
efficiency, fuel use, and service quality.
Problem Specification:
-City Map: A graph 𝐺=(𝑉, 𝐸), where locations 𝑉 and roads 𝐸
have distances and travel times.
-Ride Requests: A set of requests 𝑅, each defined by:
– Passenger ID, pickup/drop-off locations, time windows.
-Vehicles: A set of available vehicles 𝐾, each with:
– Location, battery/fuel level, passenger capacity and speed.
See Table 3 for detailed specifications.
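To make the timing arithmetic in Table 3 concrete, the sketch below computes, for each request, the latest moment a passenger can be picked up for a direct run to BOS and the latest moment a vehicle parked elsewhere must leave. It assumes no pooling, no boarding time, and a single 10 km urban hop to reach the passenger; the helper names are illustrative only.

```python
# Distances (km) and speeds (km/h) from Table 3.
DIST_TO_BOS_KM = {"A": 35, "B": 33, "C": 36, "D": 34, "E": 32, "F": 31}
URBAN_HOP_KM = 10
URBAN_KMH = 60
AIRPORT_KMH = 100

# request id -> (pickup location, required BOS arrival in minutes after midnight)
REQUESTS = {
    "r1": ("A", 8 * 60 + 45),
    "r2": ("B", 8 * 60 + 50),
    "r3": ("C", 8 * 60 + 55),
    "r4": ("D", 9 * 60),
}


def minutes(km, kmh):
    """Driving time in minutes for a distance at a constant speed."""
    return km * 60 / kmh


def latest_pickup(origin, deadline):
    """Latest pickup time for a direct, unshared trip to BOS."""
    return deadline - minutes(DIST_TO_BOS_KM[origin], AIRPORT_KMH)


def latest_dispatch(vehicle_at, origin, deadline):
    """Latest time a vehicle must start moving toward the passenger."""
    approach = 0 if vehicle_at == origin else minutes(URBAN_HOP_KM, URBAN_KMH)
    return latest_pickup(origin, deadline) - approach


for rid, (origin, deadline) in REQUESTS.items():
    t = latest_pickup(origin, deadline)
    print(rid, origin, f"latest pickup {int(t // 60):02d}:{t % 60:04.1f}")
```

A dispatcher that pools two passengers in one vehicle would add the extra urban hop before the airport leg, so these values are upper bounds on how late each pickup can be.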
P4: Urban Ride Sharing (URS) with Disruptions
Problem Specification: Details are provided in Table 4, including
two disruption scenarios: airport route traffic delay and local
road closure. Other possible disruptions could involve passenger
cancellations or late arrivals.
Table 4: Urban Ride-Sharing Reactive Planning Problem
Vehicles: Three vehicles 𝑉={𝑣1, 𝑣2, 𝑣3}
- Capacity: 2 passengers each
- Initial vehicle location: City center
- Operating hours: all day
Passengers: Five passengers 𝑃={𝑝1, 𝑝2, 𝑝3, 𝑝4, 𝑝5}
-𝑝1: Airport by 8:30 AM
-𝑝2: Airport by 9:00 AM
-𝑝3: Airport by 9:30 AM
-𝑝4: Airport by 9:45 AM
-𝑝5: Airport by 9:45 AM
Travel Times:
- City center to pickup locations: 15-30 minutes
- Pickup locations to airport: 45-60 minutes
- Between pickup locations: 20-30 minutes
Disruptions:
- Airport route traffic delay
- Certain local road closure
Objectives:
- Minimize total vehicle travel time
- Meet all passenger deadlines
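One way to reason about the disruptions in Table 4 is to compute a worst-case latest departure for each vehicle and recompute it when a delay is announced. The sketch below is an assumption-laden illustration: it uses the upper end of every travel-time range, treats the airport delay as a fixed number of extra minutes (Table 4 does not give a value), and assumes pooled passengers are dropped off together. The function name is hypothetical.

```python
# Travel-time ranges in minutes (min, max) from Table 4.
CENTER_TO_PICKUP = (15, 30)
PICKUP_TO_AIRPORT = (45, 60)
BETWEEN_PICKUPS = (20, 30)

# Airport deadlines in minutes after midnight.
DEADLINES = {
    "p1": 8 * 60 + 30,
    "p2": 9 * 60,
    "p3": 9 * 60 + 30,
    "p4": 9 * 60 + 45,
    "p5": 9 * 60 + 45,
}
CAPACITY = 2


def latest_safe_departure(passengers, airport_delay=0):
    """Latest time a vehicle can leave the city center and still meet
    every deadline under worst-case travel times.

    airport_delay is a hypothetical number of extra minutes on the
    airport route; the benchmark does not specify its size.
    """
    if not 1 <= len(passengers) <= CAPACITY:
        raise ValueError("a vehicle carries one or two passengers")
    worst_case = (
        CENTER_TO_PICKUP[1]
        + BETWEEN_PICKUPS[1] * (len(passengers) - 1)
        + PICKUP_TO_AIRPORT[1]
        + airport_delay
    )
    # Pooled passengers arrive together, so the earliest deadline binds.
    return min(DEADLINES[p] for p in passengers) - worst_case


before = latest_safe_departure(["p4", "p5"])
after = latest_safe_departure(["p4", "p5"], airport_delay=20)
print(f"pooled p4+p5: leave by {before // 60}:{before % 60:02d}, "
      f"or {after // 60}:{after % 60:02d} with a 20-minute airport delay")
```

A reactive planner would compare the recomputed value with each vehicle's actual departure time and reassign passengers when the new bound has already passed.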
P5: Wedding Reunion
Problem Specification: Table 5 presents a coordinated wedding
event travel problem. Several friends arrive at different times and
locations before a 3:00 PM wedding photo session. The challenge
includes managing two vehicles for airport pick-ups (aimed at those
who cannot drive or wish to cut costs) and completing critical er-
rands, such as collecting the wedding gift and retrieving formal
attire from the tailor. All activities must be scheduled to ensure
that everyone arrives at the wedding venue before the photo time.
This problem introduces more constraints than the URS prob-
lems in P3 and P4, and it also lays the groundwork for a more chal-
lenging disruption case discussed in P8.
Table 5: Wedding Reunion Logistics Problem
Metrics:
-On-time performance: Must arrive at the venue for 3:00 PM pho-
tos.
Locations: Four locations 𝑉={𝐵, 𝐺, 𝑇, 𝑊}, where 𝐵 is Boston Airport,
𝐺 is Gift shop, 𝑇 is Tailor shop, and 𝑊 is Wedding venue.
Travel time (minutes):
𝐵-𝐺: 45, 𝐵-𝑇: 30, 𝐵-𝑊: 40, 𝐺-𝑇: 20, 𝐺-𝑊: 25, 𝑇-𝑊: 15.
Arrival Times:
- Alex: At 𝐵 at 11:00 AM from Chicago (needs a ride)
- Jamie: At 𝐵 at 12:30 PM from Atlanta (needs a ride)
- Pat: At 𝑊 at 12:00 PM driving from NYC (has 5-seater car)
Required Tasks:
- Gift collection from 𝐺 (after 12:00 PM)
- Clothes pickup from 𝑇 (by 2:00 PM)
- Photos at 𝑊 (3:00 PM sharp)
Available Resources:
- One car (5-seater) with Pat, available after he arrives in Boston
- Local friend Chris (5-seater) available after 1:30 PM at 𝑊
Scheduling Constraints:
- All tasks must complete before 3:00 PM photo time
- Gift store opens at 12:00 PM
- Tailor closes at 2:00 PM
- Two cars must accommodate all transport needs
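A candidate plan for Table 5 can be validated by simulating one car's route against the time constraints. The sketch below is one possible check, not the authors' solution: it assumes Pat's car leaves the venue at 12:00 PM, that stops take no service time, and that the car may wait at a stop until it becomes available. The `simulate` helper and the constraint dictionaries are illustrative.

```python
# Travel times in minutes from Table 5, made symmetric.
_PAIRS = {
    ("B", "G"): 45, ("B", "T"): 30, ("B", "W"): 40,
    ("G", "T"): 20, ("G", "W"): 25, ("T", "W"): 15,
}
TRAVEL = {**_PAIRS, **{(b, a): t for (a, b), t in _PAIRS.items()}}

# Earliest useful arrival per stop (minutes after midnight):
# both airport passengers have landed by 12:30 PM; gift shop opens at 12:00 PM.
EARLIEST = {"B": 12 * 60 + 30, "G": 12 * 60}
# Latest allowed arrival per stop: tailor closes at 2:00 PM; photos at 3:00 PM.
LATEST = {"T": 14 * 60, "W": 15 * 60}


def simulate(route, start, origin="W"):
    """Return [(stop, arrival)] for a feasible route, or None otherwise."""
    t, here, log = start, origin, []
    for stop in route:
        t = max(t + TRAVEL[(here, stop)], EARLIEST.get(stop, 0))  # wait if early
        if t > LATEST.get(stop, float("inf")):
            return None
        log.append((stop, t))
        here = stop
    return log


# Pat: venue -> airport (Alex, Jamie) -> tailor -> gift shop -> venue.
plan = simulate(["B", "T", "G", "W"], start=12 * 60)
print(plan)
```

The same route started at 1:30 PM, when Chris's car becomes available, reaches the tailor after closing, which illustrates why the assignment of errands to vehicles matters.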
P6: Thanksgiving Dinner
Consider a Thanksgiving dinner scenario in which a family of five
must return home in a Boston suburb for dinner. The problem in-
volves coordinating departure times, managing travel logistics (in-
cluding possible traffic delays), and ensuring timely arrival. Table 6
formalizes these challenges as a sequential planning problem.
This scenario also lays the groundwork for a more advanced
disruption case, which has proven difficult for standalone LLMs,
as discussed in P9.
Problem Specification:
Setup:
* Mom (Sarah) hosts dinner at 6:00 PM in Boston.
* Family arrivals:
–Dad (James) from San Francisco, lands at 1:00 PM ET.
–Sister (Emily) from Chicago lands at 2:30 PM.
–Brother (Michael) driving from NY arrives at 3:00 PM.
* Grandma, who is healthy, needs to be picked up nearby.
Constraints:
* James must rent a car post-landing.
* Emily needs an airport pickup (no alternatives).
* Turkey requires 4 hours to cook; someone must be home once
it’s in the oven for safety.
* Side dishes need 2 hours of preparation.
* Travel times:
–Home to BOS Airport: 1 hour.
–BOS Airport to Grandma’s: 1 hour.
–Home to Grandma’s: 30 minutes.
Key Planning Questions:
1. When should cooking start?
2. Who picks up Emily, and when?
3. When and by whom should Grandma be picked up?
Table 6: Thanksgiving Dinner Coordination Problem
Objective: Coordinate family arrivals and dinner preparation for 6:00
PM dinner in Boston
Family Members and Arrivals:
- Sarah (Mom): Host, at home
- James (Dad): Lands at BOS 1:00 PM from SF
- Emily (Sister): Lands at BOS 2:30 PM from Chicago
- Michael (Brother): Driving, arrives 3:00 PM from NY
- Grandma: Needs pickup from suburban Boston
Cooking Requirements:
- Turkey: 4 hours cooking time
- Side dishes: 2 hours preparation
- Someone must stay home during cooking
Transportation Constraints:
- James must rent car after landing
- Emily requires airport pickup
- Travel times:
–Home to BOS Airport: 60 min
–BOS Airport to Grandma’s: 60 min
–Home to Grandma’s: 30 min
Key Requirements:
- All family members at home for 6:00 PM dinner
- Turkey and sides ready by dinner time
- All pickups completed with available drivers
- Cooking supervision maintained
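The key planning questions can be approached by scheduling backward from the 6:00 PM dinner. The sketch below checks one candidate plan under stated assumptions: Sarah stays home to supervise the oven, James finishes renting a car before Emily lands and drives her home, Michael's 3:00 PM arrival is at the house, and he then makes the round trip to Grandma's. Pickup and rental handling times are taken as zero. This is an illustrative plan, not the benchmark's reference answer.

```python
# Times in minutes after midnight; durations in minutes (Table 6).
DINNER = 18 * 60
TURKEY_COOK = 4 * 60
SIDES_PREP = 2 * 60
HOME_BOS = 60
HOME_GRANDMA = 30
JAMES_LANDS = 13 * 60
EMILY_LANDS = 14 * 60 + 30
MICHAEL_HOME = 15 * 60


def latest_start(deadline, duration):
    """Backward scheduling: the latest start that still meets the deadline."""
    return deadline - duration


def candidate_plan():
    return {
        "turkey_in_oven": latest_start(DINNER, TURKEY_COOK),
        "sides_started": latest_start(DINNER, SIDES_PREP),
        # James waits at BOS for Emily, then both drive home.
        "james_and_emily_home": max(JAMES_LANDS, EMILY_LANDS) + HOME_BOS,
        # Michael drives to Grandma's and back after reaching the house.
        "michael_and_grandma_home": MICHAEL_HOME + 2 * HOME_GRANDMA,
    }


def is_feasible(plan):
    """Everyone must be home by dinner; Sarah never leaves, so the oven is supervised."""
    arrivals = ("james_and_emily_home", "michael_and_grandma_home")
    return all(plan[k] <= DINNER for k in arrivals)


plan = candidate_plan()
for event, t in plan.items():
    print(f"{event}: {t // 60}:{t % 60:02d}")
print("feasible:", is_feasible(plan))
```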
P7: Disaster Relief Logistics Problem
Problem Specification: Table 7 summarizes the problem.
P8: Wedding Reunion with Disruptions
Problem Extension: This problem extends P5 with road closures
and dynamic rerouting.
The disruption scenario becomes more challenging because, when
a road closure is announced, the planner must know each vehicle’s
current location to determine whether it is affected. Since LLMs are
inherently stateless, they cannot keep track of previous scheduling
events and thus struggle to adapt the plan in real-time.
P9: Thanksgiving Dinner with Disruptions
This problem extends P6 by introducing flight delays. Specifically,
when a flight from SFO to BOS is delayed by 𝑡 hours, the new
arrival time is confirmed at the originally scheduled departure time,
that is, the originally expected arrival time minus the flight’s
scheduled duration. Although this early notice
provides an opportunity to adjust travel and dinner plans, current
LLM-based systems fail to leverage this information in a timely
manner, only beginning to react at the original arrival time and
missing the window for earlier intervention.
Table 7: Disaster Relief Logistics Problem
Network: 𝐺=(𝑉, 𝐸), where 𝑉 represents locations and 𝐸 represents
routes
Locations (𝑉):
- Supply nodes: CW (central warehouse), AP (airport, capacity: 5 tons)
- Demand nodes: DC1, DC2, DC3 (distribution centers)
- Critical nodes: H1, H2 (hospitals), FS (fuel station)
Resources:
$cap_{truck} = 4$ tons, $\tau_{truck} = 120$ min
$cap_{heli} = 1$ ton, $\tau_{heli} = 30$ min
Demand Requirements:
- DCs: food/water delivery by $T_{dc}$ = 20:00
- Hospitals: medicine within $\Delta T_h$ = 6 hours
- Fuel Station: refuel by $T_{fs}$ = 12:00
Critical Deadlines:
- Food/Water: All DCs by 8:00 PM
- Medicine: Hospitals within 6 hours
- Fuel: FS by 12:00 PM
- Airport: Clear excess beyond 5 tons immediately
Dynamic Disruptions:
- Unpredictable donation arrivals
- Road blockages requiring rerouting
- Emergency hospital demands
- Fuel shortage delays
Key Planning Requirements:
- Resource distribution scheduling
- Transportation mode optimization
- Delivery prioritization
- Airport overflow management
- Real-time disruption handling
For example, consider James’s flight, which was originally sched-
uled to arrive in Boston at 1 PM but has been delayed to 4 PM. He
learns of this new arrival time at 10 AM EST (the flight’s intended
departure from SFO), offering a three-hour lead for adjustments.
However, existing LLM-based solutions do not adapt the plan until
1 PM, thus squandering the opportunity to re-optimize the sched-
ule.
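The timing in this example can be written out explicitly. The sketch below assumes the three-hour flight duration implied by the example's own times (a 10 AM notice and a 1 PM original arrival); the function and field names are illustrative.

```python
def delay_notice(original_arrival, flight_duration, delay):
    """All arguments and results are in minutes (times are after midnight, ET).

    The delay becomes known at the originally scheduled departure,
    i.e. original arrival minus scheduled flight duration.
    """
    notice = original_arrival - flight_duration
    new_arrival = original_arrival + delay
    return {
        "notice": notice,
        "new_arrival": new_arrival,
        # Time wasted by a planner that reacts only at the original arrival time.
        "lead_before_original_arrival": original_arrival - notice,
        # Total replanning window available to a planner that reacts at once.
        "lead_before_new_arrival": new_arrival - notice,
    }


# James: originally due 1 PM, now 4 PM; duration of 3 hours is implied by the example.
info = delay_notice(original_arrival=13 * 60, flight_duration=3 * 60, delay=3 * 60)
print(info)
```

A planner that acts on the notice has the full window between 10 AM and 4 PM to reassign pickups and cooking tasks, while one that waits until 1 PM has already lost three hours of it.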
P10: Global Supply Chain
Table 8 presents a comprehensive problem in data center GPU de-
ployment that captures the complexity of large-scale infrastruc-
ture projects. The objective is to complete a 1 million GPU data
center in 15 months while minimizing total costs. The problem
encompasses procurement decisions between NVIDIA ($15k/unit)
and AMD ($10k/unit) GPUs, where each vendor has different
maintenance risks (20% vs 50% of unit price over one year) and quarterly
shipment capacities.
The construction process is organized around clusters of 50,000 GPUs,
which require coordinated deployment of power, cooling, and
networking infrastructure. Each cluster demands significant resources:
150 MW of power capacity and 1 million gallons per day of cool-
ing water. Infrastructure development follows strict dependencies:
Power and cooling systems must be operational before network-
ing installation can begin, and each cluster must complete testing
before becoming operational.
Table 8: Summary of the Data Center Construction Problem Statement
Component Details
GPU Procurement & Shipments
Total GPU Target 1,000,000 units
Vendors NVIDIA: $15k/unit, 200k units/quarter capacity, 20% maintenance risk (of unit price over 1 year)
AMD: $10k/unit, 150k units/quarter capacity, 50% maintenance risk (of unit price over 1 year)
Order Timing All orders placed on Day 1
Shipment Schedule Quarterly shipments; e.g., a 600,000 unit order results in 3 shipments of 200,000 units at the end of Q1, Q2, and Q3
Payment Terms 50% deposit at order (day 1), 50% at delivery (each quarter)
Natural Disaster Impacts
Risk Probabilities Earthquake: 10% per quarter; Typhoon: 10% per quarter (assessed over entire project)
Delay Impact Cascading 1-month delay on all subsequent shipments if a disaster occurs in a quarter
Cost Impact 30% price increase applied only to the affected quarter’s shipment
Shipment Expedition Impact Expediting one shipment by 1 month moves all subsequent shipments up by 1 month as well (in lockstep)
Cluster Construction & Infrastructure
Cluster Definition Each cluster requires 50,000 GPUs; clusters can be built concurrently once required GPUs are delivered
Networking Infrastructure (per cluster)
Duration 1 month
Cost $25 million
Dependencies Can begin only after GPUs have arrived, power is ready, and cooling is operational
Power & Cooling (per cluster)
Requirements Power: 150 MW; Cooling: 1M gallons/day
Lead Time Options Regular: 2 months at $30M per cluster
Expedited: 1 month at $75M per cluster (150% cost premium)
Testing & Certification (per cluster)
Normal Process 2 months, $15 million
Expedited Process 1 month, $45 million (150% cost premium)
Expedition Options Expedite power/cooling, networking (50% time reduction, 150% cost premium), and testing independently per cluster
Business Impact of Delays
Delay Costs $50 million lost revenue + $10 million additional operating costs per month of delay
Overall Objective
Objective Minimize the end-to-end construction cost (procurement, construction, maintenance, delay costs, expedition premiums)
while meeting the 15-month deadline. If the 15-month deadline is not met, delay costs are added.
The problem incorporates real-world complexities such as risks
of natural disasters (10% probability per quarter for both earth-
quakes and typhoons), which can cause cascading delays and cost
increases. Management can expedite various components: power/cooling
installation can be accelerated from 4 to 2.5 months for
an 80% cost premium, while testing can be shortened from 2 to 1
month by doubling the cost.
The financial implications are significant, as each month of delay
incurs $60M in combined revenue loss and additional operating
costs. This creates a complex optimization challenge: balancing
procurement costs, expedition premiums, and risk mitigation
strategies while adhering to physical and temporal constraints in
the construction sequence.
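The procurement side of this trade-off can be sketched with a few lines of arithmetic. The example below uses the vendor prices, quarterly capacities, and delay costs from Table 8, and makes two simplifying assumptions that are not in the problem statement: the maintenance risk is counted as a full expected cost on top of the purchase price, and disaster-related delays and price increases are ignored. Function names are illustrative.

```python
# Vendor data from Table 8.
VENDORS = {
    "NVIDIA": {"price": 15_000, "per_quarter": 200_000, "maintenance_pct": 20},
    "AMD": {"price": 10_000, "per_quarter": 150_000, "maintenance_pct": 50},
}
# $50M lost revenue + $10M extra operating cost per month of delay.
DELAY_COST_PER_MONTH = 50_000_000 + 10_000_000


def shipments(vendor, units):
    """Split an order into quarterly shipments at the vendor's capacity."""
    cap = VENDORS[vendor]["per_quarter"]
    full, rest = divmod(units, cap)
    return [cap] * full + ([rest] if rest else [])


def last_delivery_month(order):
    """Month in which the final shipment lands (shipments arrive at quarter ends)."""
    return max((3 * len(shipments(v, u)) for v, u in order.items()), default=0)


def procurement_cost(order):
    """Purchase price plus maintenance exposure, in dollars (integer arithmetic)."""
    total = 0
    for vendor, units in order.items():
        spec = VENDORS[vendor]
        total += units * spec["price"] * (100 + spec["maintenance_pct"]) // 100
    return total


def delay_cost(months_late):
    return max(0, months_late) * DELAY_COST_PER_MONTH


order = {"NVIDIA": 600_000, "AMD": 400_000}
print(shipments("NVIDIA", 600_000))
print("last delivery month:", last_delivery_month(order))
print("procurement cost: $", procurement_cost(order))
```

Splitting the order across vendors shortens the delivery schedule because the two shipment streams run in parallel, at the price of higher maintenance exposure on the AMD units. A full solver would add the per-cluster construction timeline and the disaster probabilities on top of this.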
P11: Stock Prediction/Forecasting
Consider a stock market prediction scenario where an automated
system must forecast future stock prices while integrating multi-
ple data streams and accounting for market dynamics. The prob-
lem involves processing real-time data, managing prediction up-
dates, and responding to market events. Table 9 formalizes this as
a sequential planning problem, presenting a comprehensive framework
for building an adaptive prediction system. The problem statement
details the requirements for creating a robust prediction pipeline
that can handle real-time market data, maintain high accuracy, and
adapt to changing market conditions. A sample workflow demon-
strating the system’s architecture and data flow is provided in Ta-
ble 9.
Problem Specification:
Setup:
*Objective: Predict a portfolio of stock prices for time 𝜏, enabling
actionable insights for risk management, investment decisions,
and automated trading.
*Historical Data:
–Training period: 2010–2020.
–Validation period: 2021–2023.
–Real-time data feed required during operation.
*Data Sources: Yahoo Finance, SEC EDGAR, news feeds, macroe-
conomic indicators.
*Required Features:
–Technical Indicators: MA, MACD, RSI, Bollinger Bands.
–Market Context: Sector indices, competitor stock performance,
correlation metrics.
–External Events: Economic reports, policy changes, earnings
announcements.
–Sentiment Analysis: Real-time news, social media, financial
disclosures.
Constraints:
*Processing latency: Under 5 minutes for real-time predictions.
*Prediction confidence interval: 95% confidence bounds required.
[Figure 1 (diagram not reproduced). Inputs: Market Data, News Feed, Economic Data. Stages: Data Collection, Feature Extraction, Model Training, Prediction, Trading Alert. Timing labels: t_refresh = 5 min, t_process = 5 min. Decision points: "Train Valid?" and "Prediction Accurate?", each with Yes/No branches.]
Figure 1: Stock Market Prediction Workflow with Validation Loops. Two decision points validate: 1) model accuracy on testing
data, and 2) real-time prediction accuracy for trading notifications.
*Model Dependencies:
–Data collection before feature extraction.
–Feature extraction before model training.
–Model training before prediction generation.
*Performance Requirements:
–Mean Absolute Percentage Error (MAPE) <5% on validation.
–False alert rate <1%.
–System uptime >99.9% during market hours.
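The accuracy targets above can be checked with a short helper. The sketch below uses the standard definition of MAPE and assumes, as one reading of the requirement, that the false alert rate is the number of false alerts divided by the number of alerts fired. The function names are illustrative.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def meets_targets(actual, predicted, alerts_fired, false_alerts):
    """Check MAPE < 5% and false alert rate < 1% (assumed ratio definition)."""
    false_rate = 100 * false_alerts / alerts_fired if alerts_fired else 0.0
    return mape(actual, predicted) < 5 and false_rate < 1


prices = [100.0, 200.0]
print(mape(prices, [110.0, 190.0]))                # errors of 10% and 5%
print(meets_targets(prices, [102.0, 196.0], 500, 2))
```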
State Space:
*Processing Nodes (𝑉):
–Data Nodes: DC (data collection), FE (feature extraction).
–Model Nodes: MT (model training), PG (prediction genera-
tion).
–Integration Nodes: IG (integration), AG (alert generation).
*Dependencies (𝐸):
–Pipeline: 𝐷𝐶 → 𝐹𝐸 → 𝑀𝑇 → 𝑃𝐺 → 𝐴𝐺.
–Feedback: 𝐴𝐺 → 𝐷𝐶 for adaptive updates.
–External: Market events affect all nodes.
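The pipeline dependencies form a chain that can be ordered with a topological sort, while the feedback edge closes a cycle and therefore has to be handled between runs rather than within one. The sketch below uses Python's standard `graphlib` module (Python 3.9+); leaving the IG node out is an assumption made here because the dependency list gives it no edges.

```python
from graphlib import CycleError, TopologicalSorter

# Forward pipeline: each node maps to the set of nodes it depends on.
FORWARD = {"FE": {"DC"}, "MT": {"FE"}, "PG": {"MT"}, "AG": {"PG"}}


def execution_order(graph):
    """One valid execution order for an acyclic dependency graph."""
    return list(TopologicalSorter(graph).static_order())


def has_cycle(graph):
    """True if the dependency graph cannot be ordered."""
    try:
        execution_order(graph)
    except CycleError:
        return True
    return False


print(execution_order(FORWARD))

# Adding the feedback edge AG -> DC makes DC depend on AG and closes a loop.
with_feedback = {**FORWARD, "DC": {"AG"}}
print("cycle with feedback edge:", has_cycle(with_feedback))
```

One common design is to run the forward chain as a single pass and treat the feedback edge as the trigger for the next pass, which keeps each pass acyclic.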
Market Regime Adaptation:
* Detect shifts in volatility, liquidity, and trend strength.
* Adjust prediction models based on regime detection.
* Implement fallback strategies for extreme market shifts.
Adaptive Refresh Rates:
* $t^{D}_{refresh}$ (data collection) adjusts based on market volatility.
* $t^{N}_{refresh}$ (news updates) increases during high-impact events.
Decision Module:
*Trading Strategy: Buy/sell signals, stop losses, risk thresholds.
*Risk Assessment: Hedging, diversification, leverage constraints.
*Dynamic Adjustment: Rebalancing frequency, asset weighting,
market response.
Backtesting and Iterative Validation:
* Run models on historical data to evaluate predictive accuracy.
* Compare performance on different market regimes/conditions.
* Adjust hyperparameters based on validation results.
Table 9: Stock Market Prediction Problem Framework
Network Structure: $G = (V, E)$, where $V$ represents processing nodes
and $E$ represents data flows
Processing Nodes ($V$):
- Data nodes: DC (data collection), FE (feature extraction)
- Model nodes: MT (model training), PG (prediction generation)
- Integration nodes: IG (integration), AG (alert generation)
Dependencies ($E$):
- Technical Analysis: $DC \to FE \to MT$
- Market Context: External events affect all nodes
- Model Chain: $MT \to PG \to AG$
- Feedback Loop: $AG \to DC$ for adaptive updates
Data Requirements:
- Historical: Training (2010–2020), Validation (2021–2023)
- Technical Indicators: MA, BB, MACD, RSI, ATR, OBV
- Price Data: OHLCV with adjustments
- External: Policy, economic indices, news sentiment
System Parameters:
$t_{refresh} = 5$ min, $conf_{interval} = 95\%$
MAPE $< 5\%$, uptime $= 99.9\%$
Workflow Stages:
- Stage 1: Data Collection & Feature Extraction
- Stage 2: Model Training & Validation
- Stage 3: Prediction Generation
- Stage 4: Alert & Decision Support
- Stage 5: Iterative Refinement
Dynamic Adaptations:
- Market regime shifts trigger model updates
- News events modify prediction weights
- Volatility affects refresh rates
- Anomalies initiate retraining cycles
Performance Metrics:
- Prediction accuracy (RMSE, MAPE)
- System responsiveness (latency)
- Alert precision (false positive rate)
- Model adaptability (regime changes)
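The dependency structure in Table 9 can be turned into an execution order with a topological sort. The sketch below is illustrative only: it uses Python's standard `graphlib` module, the helper names `execution_order` and `has_cycle` are assumptions, and IG is left without edges because the table lists it as a node but gives no edge for it. The feedback edge $AG \to DC$ closes a cycle, so it is treated as a trigger for the next iteration rather than as a scheduling constraint.

```python
from graphlib import CycleError, TopologicalSorter

NODES = ["DC", "FE", "MT", "PG", "IG", "AG"]
# Technical Analysis (DC -> FE -> MT) followed by Model Chain (MT -> PG -> AG).
PIPELINE_EDGES = [("DC", "FE"), ("FE", "MT"), ("MT", "PG"), ("PG", "AG")]
# Feedback loop for adaptive updates; not part of the acyclic schedule.
FEEDBACK_EDGES = [("AG", "DC")]


def execution_order(nodes, edges):
    """Return one valid execution order; raises CycleError on a cycle."""
    sorter = TopologicalSorter()
    for node in nodes:
        sorter.add(node)
    for source, target in edges:
        sorter.add(target, source)  # target depends on source
    return list(sorter.static_order())


def has_cycle(nodes, edges):
    """True if the edges cannot be scheduled as a DAG."""
    try:
        execution_order(nodes, edges)
    except CycleError:
        return True
    return False


order = execution_order(NODES, PIPELINE_EDGES)
```

Keeping the feedback edge in a separate list lets a scheduler run one acyclic pass per iteration and decide after alert generation whether to start another pass.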
4.4 Execution Example: Urban Ride Sharing
For a benchmark suite, we encourage users to devise novel methodologies
to solve the problems. However, for this illustrative example
of Urban Ride Sharing (URS) (P3), we employ a customized
specification language that extends PDDL (Planning Domain Definition
Language) and workflow networks to support:
- Dynamic constraints and real-time updates.
- Integration with streaming data sources.
- Explicit representation of uncertainty.
- Temporal and spatial dependencies.
The solver, whether manual or automatic, approaches the planning
problem in three steps:
1. Convert the problem statement into a formal specification:
- Key objectives and constraints.
- Required resources and their capabilities.
- Performance metrics and success criteria.
- Temporal and spatial dependencies.
2. Transform the specification into a workflow graph:
- Nodes represent processing stages, decision points, or actions.
- Edges capture dependencies, data flow, and execution sequence.
- Agents are assigned to both nodes and edges.
- Concrete parameters and thresholds are specified.
3. Select and apply solving algorithms:
- Test multiple solution approaches (e.g., dynamic programming,
Monte Carlo).
- Evaluate solutions against specified metrics.
- Select and validate the best-performing solution.
- Present results with performance analyses.
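The three steps above can be sketched as a small data model plus a selection routine. This is a hedged illustration, not the specification language used in the paper: the classes `Specification` and `WorkflowGraph`, the function `select_solution`, and the dictionary layout of a candidate solution are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Specification:
    """Step 1: objectives, constraints, and a metric (lower is better)."""
    objectives: list[str]
    constraints: list[Callable[[dict], bool]]
    metric: Callable[[dict], float]


@dataclass
class WorkflowGraph:
    """Step 2: nodes with their assigned agents, edges as dependencies."""
    node_agents: dict[str, str] = field(default_factory=dict)
    edges: list[tuple[str, str]] = field(default_factory=list)


def select_solution(spec, candidates):
    """Step 3: drop infeasible candidates, then return the best by metric."""
    feasible = [c for c in candidates
                if all(check(c) for check in spec.constraints)]
    if not feasible:
        raise ValueError("no candidate satisfies all constraints")
    return min(feasible, key=spec.metric)


spec = Specification(
    objectives=["minimize total travel distance"],
    constraints=[lambda c: c["all_on_time"]],
    metric=lambda c: c["distance_km"],
)
graph = WorkflowGraph(
    node_agents={"plan": "Route Planning Agent",
                 "dispatch": "Vehicle Dispatch Agent"},
    edges=[("plan", "dispatch")],
)
best = select_solution(spec, [
    {"name": "three-vehicle plan", "distance_km": 123, "all_on_time": True},
    {"name": "two-vehicle plan", "distance_km": 87, "all_on_time": True},
])
```

Separating feasibility checks from the metric keeps a short but late plan from being selected over a longer plan that meets every deadline.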
We walk through these three steps to solve the URS problem.
4.4.1 URS Problem Specification. Already presented in Table 3.
4.4.2 URS Workflow. Figure 2 shows the workflow representation
of the URS problem.
[Figure 2 diagram: nodes A, B, C, D, E, F, and G (BOS); passengers $r_1$ (8:45), $r_2$ (8:50), $r_3$ (8:55), $r_4$ (9:00); vehicles $k_1$[2], $k_2$[2], $k_3$[2].]
Figure 2: Consider a network $G = (V, E)$ with urban travel
times $\tau_{ij} = 10$ minutes and airport routes $\tau_{iG} = \{19, \ldots, 22\}$
minutes. Schedule three vehicles $k_i$ (capacity $c_k = 2$) to deliver
four passengers $r_i$ to the airport during $[8{:}45, 9{:}00]$.
4.4.3 URS Results. Figure 3 presents an optimal solution with a
total travel distance of 87 km, outperforming both GPT-4o-Task
and DeepSeek R1 (Figure 4), which require 123 km. The LLM-generated
plans are therefore about 41.4% longer than the optimal plan
(equivalently, the optimal plan reduces travel distance by about 29.3%).
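The size of the gap depends on which total is used as the baseline. A quick check of the arithmetic, using only the 87 km and 123 km figures reported above (the helper name `relative_gap` is illustrative):

```python
def relative_gap(optimal_km, other_km):
    """Return (extra distance vs. optimal, saving vs. other), in percent."""
    diff = other_km - optimal_km
    return 100.0 * diff / optimal_km, 100.0 * diff / other_km


extra_pct, saving_pct = relative_gap(87.0, 123.0)
print(f"{extra_pct:.2f}% longer than optimal; {saving_pct:.2f}% saved")
# prints "41.38% longer than optimal; 29.27% saved"
```

The first number measures how much longer the 123 km plans are than the 87 km plan; the second measures how much distance the 87 km plan saves relative to 123 km.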
Extended Materials. We also used LangGraph to implement both
P3 and P4, as presented in Appendix A. Furthermore, we provide
a solution to P11 for the prediction of the stock value. These
sample implementations validate the completeness of the problem
statements in this benchmark.
[Figure 3 diagram: vehicle $k_1$ visits A and B and vehicle $k_2$ visits C and D, both ending at G (BOS); route legs are labeled 8:00-8:10 and 8:10-8:30; deadlines $r_1$ (8:45), $r_2$ (8:50), $r_3$ (8:55), $r_4$ (9:00).]
Figure 3: Solution routes showing optimal vehicle assignments.
Vehicle k1 (blue) starts at A and collects passengers
r1 and r2, while k2 (orange) starts at C and serves r3 and
r4. Both vehicles arrive at BOS at 8:30, meeting all passenger
arrival deadlines. Total distance traveled = 87 km (optimal).
[Figure 4 diagram: vehicles $k_1$ (A, B), $k_2$ (C), and $k_3$ (E, D), all ending at G (BOS); route legs are labeled 7:40-7:50, 7:50-8:10, 7:55-8:17, 7:50-8:00, and 8:00-8:19; passengers $r_1$ (8:45), $r_2$ (8:50), $r_3$, $r_4$ (9:00).]
Figure 4: GPT-4o-Task and DeepSeek R1 both schedule three
routes. Vehicle k1 picks up r1 from A at 7:40, then r2 from
B at 7:50. Vehicle k2 picks up r3 from C at 7:55. Vehicle k3
must first drive from E to D to pick up r4 at 8:00. All meet
deadlines. Total travel distance = 123 km.
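The deadline and capacity claims in Figures 3 and 4 can be checked mechanically. The sketch below is an assumption-laden illustration: the passenger deadlines come from Figure 2, the 8:30 arrivals from the Figure 3 caption, and the Figure 4 arrival times (8:10, 8:17, 8:19) are read from the route labels in that figure as the end times of the legs into BOS. The names `hhmm` and `late_passengers` are hypothetical.

```python
from datetime import time


def hhmm(text):
    """Parse an 'H:MM' string into a datetime.time."""
    hours, minutes = text.split(":")
    return time(int(hours), int(minutes))


DEADLINES = {"r1": hhmm("8:45"), "r2": hhmm("8:50"),
             "r3": hhmm("8:55"), "r4": hhmm("9:00")}

# vehicle -> (arrival time at BOS, passengers on board)
FIGURE_3_PLAN = {"k1": (hhmm("8:30"), ["r1", "r2"]),
                 "k2": (hhmm("8:30"), ["r3", "r4"])}
# Arrival times below are assumed from the figure's route labels.
FIGURE_4_PLAN = {"k1": (hhmm("8:10"), ["r1", "r2"]),
                 "k2": (hhmm("8:17"), ["r3"]),
                 "k3": (hhmm("8:19"), ["r4"])}


def late_passengers(plan, deadlines, capacity=2):
    """Return passengers who miss their deadline; validate capacity and coverage."""
    served, late = set(), []
    for vehicle, (arrival, passengers) in plan.items():
        if len(passengers) > capacity:
            raise ValueError(f"{vehicle} exceeds capacity {capacity}")
        served.update(passengers)
        late.extend(p for p in passengers if arrival > deadlines[p])
    if served != set(deadlines):
        raise ValueError("plan does not serve every passenger exactly")
    return late
```

Both plans pass this check, so the difference between them lies in total distance and the number of vehicles used, not in feasibility.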
5 Conclusion
REALM-Bench represents a significant step toward systematically
evaluating AI systems' capabilities in real-world planning scenarios.
By providing 11 carefully designed problems that progress in
complexity, the benchmark enables researchers to:
- Assess planning capabilities in multiple dimensions of difficulty.
- Test system performance on real-world planning challenges.
- Evaluate handling of unexpected interruptions and adaptations.
- Compare different approaches using standardized metrics.
The benchmarks are designed to be both tractable for systematic
evaluation and challenging for current systems. Each problem can
be scaled along multiple dimensions, including the number of parallel
threads, complexity of dependencies, and frequency of disruptions,
allowing researchers to progressively stress-test their systems.
Inclusion of validation metrics and baseline implementations
facilitates meaningful comparisons between different approaches.
Looking ahead, we envision this benchmark suite evolving with
community contributions and feedback. Future extensions might
include more complex scenarios, additional evaluation metrics, and
expanded validation tools. For instance, in many workflows,
transaction properties must be preserved:
- Atomicity: An operation either completes entirely or not at all,
with no partial execution state (e.g., a ride-sharing trip must
either complete fully or be canceled entirely).
- Idempotency: Multiple identical requests produce the same outcome
as a single request, preventing duplicate actions (e.g., multiple
identical order submissions should not result in multiple
orders).
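Both properties can be illustrated with a few lines of in-memory code. This is a minimal sketch under stated assumptions, not a benchmark component: `OrderService`, its idempotency-key scheme, and `apply_atomically` are hypothetical names, and a real system would persist state and handle concurrent access.

```python
import copy


class OrderService:
    """Idempotent submission: one order per idempotency key."""

    def __init__(self):
        self._orders = {}  # idempotency key -> order id
        self._next_id = 1

    def submit(self, idempotency_key, payload):
        # A repeated key returns the existing order instead of creating one.
        if idempotency_key in self._orders:
            return self._orders[idempotency_key]
        order_id = self._next_id
        self._next_id += 1
        self._orders[idempotency_key] = order_id
        return order_id

    def order_count(self):
        return len(self._orders)


def apply_atomically(state, steps):
    """Apply every step to a copy; commit to `state` only if all succeed."""
    working = copy.deepcopy(state)
    for step in steps:
        step(working)  # any exception leaves `state` untouched
    state.clear()
    state.update(working)
```

For a ride-sharing trip, the steps would be pickup and drop-off updates: if drop-off fails, the trip record stays in its original state instead of being left half-completed.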
Most importantly, by providing a common framework for evaluating
planning capabilities of both individual LLMs and multi-agent
systems, we hope to accelerate progress toward more robust
and capable AI planning systems that can handle real-world
complexity and uncertainty. The REALM benchmark suite, along
with detailed documentation and baseline implementations, will be
available as an open-source resource after the peer review process.
References
[1] Anthropic. 2024. Claude Technical Report. https://www.anthropic.com
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, et al. 2020. Language Models are Few-Shot Learners. arXiv preprint
arXiv:2005.14165 (2020).
[3] Edward Y. Chang. 2023. CRIT: Prompting Large Language Models With the
Socratic Method. IEEE 13th Annual Computing and Communication Workshop
and Conference (March 2023).
[4] Edward Y. Chang. 2023. Examining GPT-4’s Capabilities and Enhancement with
SocraSynth. In The 10th International Conf. on Computational Science and
Computational Intelligence.
[5] Edward Y. Chang. 2024. Multi-LLM Agent Collaborative Intelligence:
The Path to Artificial General Intelligence. SocraSynth.com.
https://www.amazon.com/dp/1962463079
[6] Edward Y. Chang. 2025. MACI: Multi-Agent Collaborative Intelligence
for Adaptive Reasoning and Temporal Planning. arXiv:2501.16689 [cs.AI]
https://arxiv.org/abs/2501.16689
[7] Junzhe Chen, Xuming Hu, Shuodi Liu, Shiyu Huang, Wei-Wei Tu, Zhaofeng He,
and Lijie Wen. 2024. LLMArena: Assessing Capabilities of Large Language Models
in Dynamic Multi-Agent Environments. In Proceedings of ACL. Association
for Computational Linguistics, 13055–13077. doi:10.18653/v1/2024.acl-long.705
[8] Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Haotian Wang,
Ming Liu, and Bing Qin. 2023. Timebench: A Comprehensive Evaluation of
Temporal Reasoning Abilities in Large Language Models. ArXiv preprint (2023).
arXiv:2311.17667
[9] DeepSeek-AI, Daya Guo, Dejian Yang, and more. 2025. DeepSeek-R1:
Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
arXiv:2501.12948 [cs.CL] https://arxiv.org/abs/2501.12948
[10] J. Doe and R. Roe. 2022. The Dynamic Planning Competition 2022: Challenges
and Results. In Proceedings of the International Conference on Automated Planning
and Scheduling (ICAPS). ICAPS, Virtual.
[11] Longling Geng and Edward Y. Chang. 2025. REALM-Bench Github Repository:
A Real-World Planning Benchmark for LLMs and Multi-Agent Systems.
https://github.com/genglongling/REALM-Bench
[12] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020.
The Curious Case of Neural Text Degeneration. arXiv:1904.09751 [cs.CL]
https://arxiv.org/abs/1904.09751
[13] Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu
Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. 2024. Understanding
the planning of LLM agents: A survey. arXiv:2402.02716 [cs.AI]
https://arxiv.org/abs/2402.02716
[14] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, and more.
2023. DSPy: Compiling Declarative Language Model Calls into Self-Improving
Pipelines. arXiv:2310.03714 [cs.CL] https://arxiv.org/abs/2310.03714
[15] LangChain AI. 2024. LangGraph: Building Structured Applications with LLMs.
https://github.com/langchain-ai/langgraph.
[16] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin,
and Bernard Ghanem. 2023. CAMEL: Communicative Agents for "Mind"
Exploration of Large Language Model Society. arXiv:2303.17760 [cs.AI]
https://arxiv.org/abs/2303.17760
[17] E. Miller and F. Davis. 2023. Automated Negotiation Agents Competition 2023:
Benchmarking Negotiation Strategies. In Proceedings of the International Conference
on Autonomous Agents and Multiagent Systems (AAMAS). IFAAMAS, Virtual.
[18] Joao Moura. 2024. CrewAI Framework.
https://github.com/joaomdmoura/crewAI.
[19] OpenAI. 2024. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/ Accessed:
Jan. 30, 2025.
[20] Bart Prystawski, M. Y. Li, and Noah Goodman. 2023. Why think step by step?
Reasoning emerges from the locality of experience. In NeurIPS.
[21] Yongliang Shen et al. 2023. Taskbench: Benchmarking large language models
for task automation. arXiv preprint arXiv:2311.18760 (2023).
[22] A. Smith and B. Johnson. 2020. The Process Planning Competition 2020: An
Overview. In Proceedings of the International Conference on Automated Planning
and Scheduling (ICAPS). ICAPS, Virtual.
[23] Ayal Taitler, Ron Alford, Joan Espasa, Gregor Behnke, Daniel Fišer, Michael
Gimelfarb, Florian Pommerening, Scott Sanner, Enrico Scala, Dominik Schreiber,
Javier Segovia-Aguas, and Jendrik Seipp. 2024. The 2023 International Planning
Competition. AI Magazine 45, 2 (2024), 280–296. doi:10.1002/aaai.12169
https://onlinelibrary.wiley.com/doi/pdf/10.1002/aaai.12169
[24] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste
Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, et al. 2023. Gemini: A Family
of Highly Capable Multimodal Models. arXiv:2312.11805 [cs.CL]
[25] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, and Chi Wang. 2024.
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. In
COLM 2024.
[26] XAgent Team. 2023. XAgent: An Autonomous Agent for Complex Task Solving.
https://github.com/OpenBMB/XAgent.
[27] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike
Lewis. 2024. Efficient Streaming Language Models with Attention Sinks.
arXiv:2309.17453 [cs.CL] https://arxiv.org/abs/2309.17453
Appendices
In these Appendices, we provide sample implementations of selected
problems, illustrating that each problem specification is implementable
and can produce feasible solutions. This serves as a
verification of completeness for the problem definitions. We encourage
readers to review the code and use it as a reference for
designing improved solutions to these challenges.
The three problems selected are P3, P4, and P11, representing
sequential planning, reactive planning, and the most complex planning
scenario, respectively.
A P3 & P4: Urban Ride Sharing Sample
Implementation
This appendix presents an implementation of P3 and P4: Urban
Ride Sharing without and with interrupts, using LangGraph [15].
A.1 Agentic Workflow Formulation
In this first stage, we define agents to manage the nodes of the
workflow, including data collection, route planning, vehicle dispatch,
traffic adjustment, monitoring and alert, and logging agents.
At the end, we use the >> syntax to specify dependencies among
agents. The transition from problem specifications to workflow
formulation is handled automatically by MACI [6] in LangGraph.
Listing 1: Collaborative Agents and Prompts
# Agent and write_str_to_txt are assumed to come from the project's
# multi-agent framework; their imports are not shown in this listing.

# ---- Data Collection Agent ---- #
DC_Agent = Agent(
    name="Data Collection Agent",
    backstory="You collect basic traffic data, road closure updates, "
              "and estimated travel times between locations in Bay "
              "Area suburb.",
    task_description="Retrieve traffic conditions, road closures, and "
                     "estimated travel durations for all routes "
                     "involved in passenger transport.",
    task_expected_output="Structured travel time data, including:\n"
        "1) CityMap: A graph G = (V, E), where the locations V and "
        "roads E have distances and travel times.\n"
        "2) Ride Requests: A set of requests R, each defined by: "
        "PassengerID, pickup/drop-off locations, and time windows.\n"
        "3) Vehicles: A set of available vehicles K, each with "
        "location, battery/fuel level, passenger capacity, and speed.",
)

# ---- Route Planning Agent ---- #
RP_Agent = Agent(
    name="Route Planning Agent",
    backstory="You determine optimal routes for vehicles to minimize "
              "total travel time while ensuring all passengers arrive "
              "on time.",
    task_description="Use traffic data and constraints to compute the "
                     "best routes for each vehicle, ensuring on-time "
                     "airport arrivals.",
    task_expected_output="Optimized vehicle assignments and travel "
                         "routes.",
)

# ---- Vehicle Dispatch Agent ---- #
VD_Agent = Agent(
    name="Vehicle Dispatch Agent",
    backstory="You assign passengers to vehicles and ensure each "
              "vehicle follows the optimal planned route.",
    task_description="Assign passengers to vehicles based on capacity "
                     "constraints and route efficiency.",
    task_expected_output="Vehicle assignment list and dispatch "
                         "schedule.",
)

# ---- Traffic Adjustment Agent (e.g. disruptions, special cases) ---- #
TA_Agent = Agent(
    name="Traffic Adjustment Agent",
    backstory="You monitor live traffic updates and adjust vehicle "
              "routes dynamically in case of delays in Bay Area "
              "suburb.",
    task_description="Recompute routes in real time when disruptions "
                     "occur (traffic, road closures), ensuring minimal "
                     "delays.",
    task_expected_output="Updated travel plans for affected vehicles.",
)

# ---- Monitoring & Alert Agent ---- #
MA_Agent = Agent(
    name="Monitoring & Alert Agent",
    backstory="You track vehicle movements and notify if there are "
              "risks of missing passenger deadlines in Bay Area "
              "suburb.",
    task_description="Send alerts for potential delays and recommend "
                     "contingency plans.",
    task_expected_output="Timely notifications for alternative route "
                         "adjustments or emergency responses.",
)

# ---- Writer Agent ---- #
Writer_agent = Agent(
    name="Writer Agent",
    backstory="You are a language model specialized in writing text "
              "into .json files",
    task_description="Write the json response into ./p3_output.json",
    task_expected_output="A .json file containing the given string",
    tools=write_str_to_txt,
)

# ---- Define Dependencies (P4: With Disruption) ---- #
DC_Agent >> RP_Agent >> VD_Agent >> TA_Agent >> MA_Agent >> Writer_agent

# ---- Define Dependencies (P3: Without Disruption) ---- #
DC_Agent >> RP_Agent >> VD_Agent >> MA_Agent >> Writer_agent
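The >> chaining used in Listing 1 relies on operator overloading. The following sketch shows one way such an operator could be implemented; it uses a hypothetical `Node` class and is not the benchmark's actual `Agent` implementation, whose internals are not shown in this paper.

```python
class Node:
    """Hypothetical stand-in for an agent that records its dependencies."""

    def __init__(self, name):
        self.name = name
        self.dependencies = []  # nodes that must run before this one
        self.dependents = []    # nodes that run after this one

    def __rshift__(self, other):
        # `a >> b` records that b depends on a.
        self.dependents.append(other)
        other.dependencies.append(self)
        # Returning the right operand lets `a >> b >> c` form a chain.
        return other


collect = Node("Data Collection")
plan = Node("Route Planning")
dispatch = Node("Vehicle Dispatch")
collect >> plan >> dispatch
```

Because the expression `collect >> plan >> dispatch` is evaluated left to right, each call links one adjacent pair, producing the chain of dependencies rather than linking every node to the first.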
A.2 P3: Execute Meta Plan without Disruption
Now that the meta-plan workflow has been constructed, the second
step involves providing real data for workflow execution. The
following code snippet illustrates how vehicle and passenger locations
are specified, followed by the corresponding agent executions.
Listing 2: Sample JSON Output from Agentic Workflow
==================================================
RUNNING AGENT: Data Collection Agent
==================================================
1) CityMap: A graph G=(V,E)
- Locations V:
- V1: Downtown
- V2: Golden Gate Park
- V3: Fisherman Wharf
- V4: Union Square
- V5: San Francisco Airport (SFO)
- Roads E with distances and travel times:
- E1: Downtown to Golden Gate Park, Distance: 5
miles, Travel Time: 20 minutes
- E2: Golden Gate Park to Fisherman Wharf, Distance:
6 miles, Travel Time: 25 minutes
- E3: Fisherman Wharf to Union Square, Distance: 2
miles, Travel Time: 10 minutes
- E4: Union Square to SFO, Distance: 15 miles,
Travel Time: 30 minutes
- E5: Downtown to SFO, Distance: 13 miles, Travel
Time: 35 minutes
2) Ride Requests: A set of requests R
- Request 1:
- PassengerID: P001
- Pickup: Downtown
- Drop-off: Union Square
- Time Window: 9:00 AM - 9:30 AM
- Request 2:
- PassengerID: P002
- Pickup: Golden Gate Park
- Drop-off: SFO
- Time Window: 10:00 AM - 11:00 AM
- Request 3:
- PassengerID: P003
- Pickup: Fisherman Wharf
- Drop-off: Downtown
- Time Window: 1:00 PM - 1:30 PM
3) Vehicles: A set of available vehicles K
- Vehicle 1:
- Location: Downtown
- Battery/Fuel Level: 80%
- Passenger Capacity: 4
- Speed: 30 mph
- Vehicle 2:
- Location: Union Square
- Battery/Fuel Level: 60%
- Passenger Capacity: 6
- Speed: 25 mph
- Vehicle 3:
- Location: SFO
- Battery/Fuel Level: 90%
- Passenger Capacity: 3
- Speed: 35 mph
This structured data provides an overview of the current
traffic conditions, road closures, and estimated
travel times for passenger transport in SF.
==================================================
RUNNING AGENT: Route Planning Agent
==================================================
Optimized vehicle assignments and travel routes:
1) Vehicle Assignments:
- Vehicle 1:
- Assigned to Request 1
- Route: Downtown (V1) -> Union Square (V4)
- Estimated Travel Time: 20 minutes (E1 + E3)
- Departure Time: 9:00 AM
- Arrival Time: 9:20 AM
- Vehicle 2:
- Assigned to Request 2
- Route: Union Square (V4) -> Golden Gate Park (V2)
-> SFO (V5)
- Estimated Travel Time: 55 minutes (E3 + E2 + E4)
- Departure Time: 10:00 AM
- Arrival Time: 10:55 AM
- Vehicle 3:
- Assigned to Request 3
- Route: Fisherman Wharf (V3) -> Downtown (V1)
- Estimated Travel Time: 10 minutes (E3)
- Departure Time: 1:00 PM
- Arrival Time: 1:10 PM
2) Travel Routes:
- Request 1:
- Vehicle 1 will pick up Passenger P001 at Downtown
and drop off at Union Square.
- Route: V1 -> V4
- Request 2:
- Vehicle 2 will pick up Passenger P002 at Golden
Gate Park and drop off at SFO.
- Route: V4 -> V2 -> V5
- Request 3:
- Vehicle 3 will pick up Passenger P003 at Fisherman
Wharf and drop off at Downtown.
==================================================
RUNNING AGENT: Vehicle Dispatch Agent
==================================================
Vehicle Assignment List and Dispatch Schedule:
1) Vehicle Assignments:
- Vehicle 1:
- Assigned to Request 1
- Passengers: P001
- Route: Downtown (V1) -> Union Square (V4)
- Estimated Travel Time: 20 minutes
- Departure Time: "9:00 AM"
- Arrival Time: "9:20 AM"
- Vehicle 2:
- Assigned to Request 2
- Passengers: P002
- Route: Union Square (V4) -> Golden Gate Park (V2)
-> SFO (V5)
- Estimated Travel Time: 55 minutes
- Departure Time: "10:00 AM"
- Arrival Time: "10:55 AM"
- Vehicle 3:
- Assigned to Request 3
- Passengers: P003
- Route: Fisherman Wharf (V3) -> Downtown (V1)
- Estimated Travel Time: 10 minutes
- Departure Time: "1:00 PM"
- Arrival Time: "1:10 PM"
2) Dispatch Schedule:
- 9:00 AM: Vehicle 1 departs from Downtown with
Passenger P001.
- 9:20 AM: Vehicle 1 arrives at Union Square.
- 10:00 AM: Vehicle 2 departs from Union Square with
Passenger P002.
- 10:55 AM: Vehicle 2 arrives at SFO.
- 1:00 PM: Vehicle 3 departs from Fisherman Wharf with
Passenger P003.
- 1:10 PM: Vehicle 3 arrives at Downtown.
These assignments and schedules ensure efficient use of
vehicles and timely arrival of all passengers.
==================================================
RUNNING AGENT: Monitoring & Alert Agent
==================================================
Alert Notification:
1) Vehicle 1:
- Current Status: On schedule.
- Recommendation: No action needed as the vehicle is
expected to arrive on time.
2) Vehicle 2:
- Current Status: Potential delay risk due to high
traffic expected around Golden Gate Park.
- Recommendation: Consider adjusting the route to
avoid traffic congestion. Possible alternative:
Take the route via Sunset Blvd to bypass heavy
traffic areas. Notify the passenger of the
potential delay and the alternative route.
3) Vehicle 3:
- Current Status: On schedule.
- Recommendation: No action needed as the vehicle is
expected to arrive on time.
Contingency Plan:
Emergency Response: If Vehicle 2 faces unexpected delays
despite the alternative route, prepare a standby
vehicle for immediate dispatch from a nearby
location to ensure Passenger P002 reaches SFO on
time.
==================================================
RUNNING AGENT: Writer Agent
==================================================
Thought: I need to write the given context string into a
JSON file named "./p3_output.json".
Using Tool: write_str_to_txt
Tool call dict: {omitted}
Data successfully written to "./p3_output.json"
Tool result: None
Observations: {0: None}
The context string has been successfully written into
"./p3_output.json" as a JSON file.
A.3 P4: Execute Meta Plan with Disruption
This section depicts the execution of the meta-plan (workflow) under
traffic disruption. In this scenario, the system dynamically identifies
traffic delays and integrates real-time updates into the planning
process. Each agent plays a crucial role: the data collection
agent continuously monitors traffic conditions; the route planning
agent adjusts travel routes based on congestion data; and the dispatch
agent ensures that vehicle assignments are optimized despite
delays. The workflow demonstrates how coordinated agent interventions,
including dynamic rerouting and schedule adjustments,
effectively mitigate the impact of heavy traffic, maintaining timely
transportation and enhancing overall operational efficiency.
Listing 3: Sample JSON Output from Agent Pipeline
==================================================
RUNNING AGENT: Traffic Adjustment Agent
==================================================
1) Vehicle V1 (Passenger 101):
- Current Disruption: Heavy traffic reported
on the route from Downtown SF to SF Airport.
- Updated Route:
- Start at Union Square
- Travel to Downtown SF (2 miles, 15 minutes)
- Detour via I-280 South instead of the usual route
to SF Airport
- Estimated travel time via detour: 30 minutes
- Total Updated Estimated Travel Time: 45
minutes
- Revised Departure Time: "08:15" to ensure
timely arrival within the window.
- Special Instructions: Maintain
communication with the passenger about the
updated route and expected arrival time.
2) Vehicle V2 (Passenger 102):
- Current Disruption: Road closure near
Fisherman Wharf due to an event.
- Updated Route:
- Start at Fisherman Wharf
- Detour via Bay Street to Van Ness Avenue, then
proceed to Golden Gate Bridge
- Estimated travel time via detour: 15 minutes
- Total Updated Estimated Travel Time: 15
minutes
- Revised Departure Time: "09:15" to ensure
timely arrival within the window.
- Special Instructions: Ensure fuel levels
are adequate for the detour and communicate any
changes to the passenger.
B P11: Stock Prediction Sample
Implementation on LangGraph
This appendix presents an implementation of the P11 Stock Prediction
problem using LangGraph [15]. The implementation demonstrates
how AI-driven planning can be applied to financial forecasting
by leveraging historical data, technical indicators, and real-time
market updates.
B.1 Agentic Workflow Formulation
In this first stage, we define agents to manage the nodes of the
workflow, including data collection, feature extraction, model training,
integration, and alert. At the end, we use the >> syntax to specify
dependencies among agents.
Listing 4: Prompt and Collaborative Agent Pipeline
from src.multi_agent.crew import Crew
# Agent and write_str_to_txt are assumed to come from the same project
# package; their imports are not shown in this listing.

with Crew() as crew:
    # ---- Data Collection Agent ---- #
    DC_Agent = Agent(
        name="Data Collection Agent",
        backstory="You collect 5 stocks of historical data OHLCV from "
                  "S&P 500 Kaggle Dataset at "
                  "https://www.kaggle.com/datasets/andrewmvd/sp-500-stocks, "
                  "and real-time stock data including external "
                  "economic indicators.",
        task_description="Retrieve stock data from 2023-10-01 to "
                         "2023-12-31.",
        task_expected_output="Formatted OHLCV data and external "
                             "indicators, of 5 stocks.",
    )

    # ---- Writer Agent ---- #
    Writer_agent = Agent(
        name="Writer Agent",
        backstory="You are a language model specialised in writing "
                  "text into .json files",
        task_description="Write the json response into "
                         "./tool_agent_example.json",
        task_expected_output="A .json file containing the given string",
        tools=write_str_to_txt,
    )

    # ---- Feature Extraction Agent ---- #
    FE_Agent = Agent(
        name="Feature Extraction Agent",
        backstory="You compute technical indicators from the collected "
                  "5 stock data.",
        task_description="Calculate MA, MACD, RSI, Bollinger Bands, and "
                         "other technical indicators of 5 stocks.",
        task_expected_output="Feature matrix of OHLCV, external "
                             "indicators, MA, MACD, RSI, Bollinger "
                             "Bands, of 5 stocks.",
    )

    # ---- Model Training Agent ---- #
    MT_Agent = Agent(
        name="Model Training Agent",
        backstory="You train stock prediction models using historical "
                  "data (from 2023-10-01 to 2023-11-30) and validate "
                  "with (from 2023-12-1 to 2024-12-31).",
        task_description="use financial math and machine learning "
                         "models and optimize hyperparameters.",
        task_expected_output="Prepare the input Feature matrix, with 3 "
                             "financial math models and 3 machine "
                             "learning models ready for inference, of "
                             "5 stocks.",
    )

    # ---- Prediction Generation Agent ---- #
    PG_Agent = Agent(
        name="Prediction Generation Agent",
        backstory="You generate stock price predictions using the "
                  "trained model.",
        task_description="Predict next-day and intra-day stock prices.",
        task_expected_output="Based on input and the 6 models, predict "
                             "Stock price from 2024-1-1 to 2024-1-7, "
                             "with confidence intervals, of 5 stocks.",
    )

    # ---- Integration Agent ---- #
    IG_Agent = Agent(
        name="Integration Agent",
        backstory="You combine model predictions with external news "
                  "sentiment and macroeconomic indicators.",
        task_description="Adjust predictions based on news impact and "
                         "market conditions on year of 2023.",
        task_expected_output="Refined Stock price from 2024-1-1 to "
                             "2024-1-7, with adjusted confidence "
                             "intervals, of 5 stocks.",
    )

    # ---- Alert Generation Agent ---- #
    AG_Agent = Agent(
        name="Alert Generation Agent",
        backstory="You issue buy/sell/hold signals based on predictions "
                  "and risk assessment of the model",
        task_description="Generate trading alerts and confidence scores",
        task_expected_output="Based on the 6 models, provide actionable "
                             "trade signals from 2024-1-1 to 2024-1-7, "
                             "of 5 stocks.",
    )

    # ---- Define Dependencies ---- #
    # Technical Analysis Flow
    DC_Agent >> Writer_agent >> FE_Agent >> MT_Agent
    # Model Chain
    MT_Agent >> PG_Agent >> AG_Agent
    # Feedback Loop (adaptive updates), disabled in this run
    # AG_Agent >> DC_Agent

crew.run()
B.2 Workflow Execution With Historical Data
for Prediction
Following the dependency graph presented above (also see Figure 1),
the agents are executed in sequence: starting with data collection,
then feature extraction, followed by model training, and finally
prediction. For brevity, while data for five stocks is provided,
we illustrate the execution flow using only AAPL.
Listing 5: Sample JSON Output from Data Collection
==================================================
RUNNING AGENT: Data Collection Agent
==================================================
json
{
"stocks": [
{
"symbol": "AAPL",
"data": [
{
date: 2023-10-01,
open: 150.00,
high: 152.00,
low: 149.50,
close: 151.00,
volume: 75000000
},
// Additional daily data up to 2023-12-31
],
"external_indicators": {
interest_rate: 5.25,
inflation_rate: 3.0,
gdp_growth: 2.5
}
},
{
"symbol": "MSFT",
"data": [
{
date: 2023-10-01,
open: 280.00,
high: 285.00,
low: 279.00,
close: 282.00,
volume: 45000000
},
// Additional daily data up to 2023-12-31
],
"external_indicators": {
interest_rate: 5.25,
inflation_rate: 3.0,
gdp_growth: 2.5
}
},
{
"symbol": "GOOGL",
"data": [
{
date: 2023-10-01,
open: 135.00,
high: 137.00,
low: 134.50,
close: 136.00,
volume: 32000000
},
// Additional daily data up to 2023-12-31
],
"external_indicators": {
interest_rate: 5.25,
inflation_rate: 3.0,
gdp_growth: 2.5
}
},
{
"symbol": "AMZN",
"data": [
{
date: 2023-10-01,
open: 120.00,
high: 122.50,
low: 119.00,
close: 121.50,
volume: 60000000
},
// Additional daily data up to 2023-12-31
],
"external_indicators": {
interest_rate: 5.25,
inflation_rate: 3.0,
gdp_growth: 2.5
}
},
{
"symbol": "TSLA",
"data": [
{
date: 2023-10-01,
open: 250.00,
high: 255.00,
low: 248.00,
close: 252.00,
volume: 50000000
},
// Additional daily data up to 2023-12-31
],
"external_indicators": {
interest_rate: 5.25,
inflation_rate: 3.0,
gdp_growth: 2.5
}
}
]
}
==================================================
RUNNING AGENT: Feature Extraction Agent
==================================================
To complete the task of calculating technical indicators
for five stocks, perform the following computations
for each stock:
Moving Average (MA): Compute the simple moving
average over a specified period (e.g., 20 days).
Moving Average Convergence Divergence (MACD): Compute
the MACD line, signal line, and histogram.
Relative Strength Index (RSI): Compute the RSI over
a specified period (e.g., 14 days).
Bollinger Bands: Compute the upper band, lower band,
and the middle band (the moving average).
Other Technical Indicators: Optionally, compute
additional indicators such as Average True Range
(ATR), Stochastic Oscillator, etc., based on
the available data.
==================================================
RUNNING AGENT: Model Training Agent
==================================================
To prepare the input feature matrix with financial math
models and machine learning models ready for
inference, we will follow these steps:
1. Feature Matrix Preparation:
- Collect historical stock data for the 5 selected
stocks, including the following columns: Date,
Open, High, Low, Close, and Volume.
- Calculate the necessary financial math models
(technical indicators) for each stock:
- Moving Average (MA): Compute the simple moving
average over a 20-day period.
- Moving Average Convergence Divergence (MACD):
Calculate the MACD line, signal line, and
histogram.
- Relative Strength Index (RSI): Calculate the RSI
over a 14-day period.
- Bollinger Bands: Calculate the upper band, middle
band (moving average), and lower band.
2. Machine Learning Models:
- Train three machine learning models using the
historical data and calculated features. The
models could be:
- Linear Regression: For predicting future stock
prices based on the features.
- Random Forest Regressor: To capture non-linear
relationships and interactions between features
.
- Support Vector Machine (SVM): For robust
predictions with a focus on margin maximization
.
- Optimize hyperparameters for each model using
techniques such as Grid Search or Random Search
to enhance model performance.
3. Inference Preparation:
- Prepare the models for inference by saving the
trained models and their hyperparameters.
- Ensure the feature matrix is structured and aligned
by date, ready for input into the models for
prediction.
4. Output Format:
- Compile the results into a structured data format
such as a DataFrame or JSON, including all
calculated features and ready-to-use models for
each of the 5 stocks.
By following these steps, we ensure that the feature
matrix is comprehensive and that the machine
learning models are optimized and ready for
inference on the selected stocks.
==================================================
RUNNING AGENT: Prediction Generation Agent
==================================================
To predict the stock prices from January 1, 2024, to
January 7, 2024, for the five selected stocks, we
utilize the prepared feature matrix and the trained
machine learning models as described in the context.
Here is the structured prediction output of one of
the five stocks, including confidence intervals:
json
{
"predictions": [
{
"stock": "APPL",
"predictions": [
{date: 2024-01-01, predicted_price: 150.25,
confidence_interval: [148.00, 152.50],
confidence: 0.80},
{date: 2024-01-02, predicted_price: 151.00,
confidence_interval: [148.75, 153.25],
confidence: 0.80},
{date: 2024-01-03, predicted_price: 152.50,
confidence_interval: [150.00, 155.00],
confidence: 0.88},
{date: 2024-01-04, predicted_price: 153.75,
confidence_interval: [151.50, 156.00],
confidence: 0.82},
{date: 2024-01-05, predicted_price: 151.50,
confidence_interval: [150.25, 153.75],
confidence: 0.90},
{date: 2024-01-06, predicted_price: 155.00,
confidence_interval: [152.75, 157.25],
confidence: 0.83},
{date: 2024-01-07, predicted_price: 156.25,
confidence_interval: [154.00, 158.50],
confidence: 0.87}
]
},
]
}
This output provides the predicted stock prices for each
day from January 1 to January 7, 2024, for stock
APPL, along with their respective confidence
intervals. These predictions are based on the
trained models and the prepared feature matrix.
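The indicators named in the Feature Extraction Agent output above (MA, MACD, RSI, Bollinger Bands) can be sketched in a few lines. This is a minimal illustration on plain lists of closing prices, not the benchmark's implementation. The 20-day and 14-day periods come from the log; the function names, the 12/26/9 MACD periods, the band width k = 2, seeding the EMA with the first price, and the simple-average form of RSI are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def sma(values, period):
    """Simple moving average; None until `period` values are available."""
    return [
        None if i + 1 < period else mean(values[i + 1 - period : i + 1])
        for i in range(len(values))
    ]


def ema(values, period):
    """Exponential moving average, seeded with the first value (assumption)."""
    alpha = 2 / (period + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


def macd(values, fast=12, slow=26, signal=9):
    """Return (MACD line, signal line, histogram) as three lists."""
    line = [f - s for f, s in zip(ema(values, fast), ema(values, slow))]
    signal_line = ema(line, signal)
    hist = [m - s for m, s in zip(line, signal_line)]
    return line, signal_line, hist


def rsi(values, period=14):
    """RSI from simple averages of gains and losses over `period` changes."""
    out = [None] * len(values)
    changes = [b - a for a, b in zip(values, values[1:])]
    for i in range(period, len(values)):
        window = changes[i - period : i]
        gain = sum(c for c in window if c > 0) / period
        loss = sum(-c for c in window if c < 0) / period
        # With no losses in the window the ratio is undefined; report 100.
        out[i] = 100.0 if loss == 0 else 100 - 100 / (1 + gain / loss)
    return out


def bollinger(values, period=20, k=2.0):
    """Return (lower, middle, upper) per day; None until enough data."""
    out = []
    for i in range(len(values)):
        if i + 1 < period:
            out.append(None)
            continue
        window = values[i + 1 - period : i + 1]
        mid = mean(window)
        sd = pstdev(window)  # population standard deviation of the window
        out.append((mid - k * sd, mid, mid + k * sd))
    return out
```

Each function returns one value per input day, so the results can be aligned by date into the feature matrix described by the Model Training Agent.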
B.3 Output: Alert Generation
Per the problem specification in Section 4.3, when the prediction
confidence is ≥85%, a recommended trading action is sent to the
subscriber as an alert.
In the prediction above, the confidence exceeds the threshold on
1/3, 1/5, and 1/7, and the stock's predicted direction on those days
is up, down, and up, respectively. Therefore, the alerts issued before
the market opens on those three days are BUY, SELL, and STRONG BUY,
respectively.
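The thresholding rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's alerting code: the function name `generate_alerts` is hypothetical, and "direction" is taken to mean the predicted price relative to the previous day's predicted price. The sketch emits only BUY or SELL, since the rule that grades an alert as STRONG BUY is not reproduced in this appendix.

```python
CONFIDENCE_THRESHOLD = 0.85  # from the text: alert when confidence >= 85%

# Predictions copied from the listing in B.2 (stock APPL).
predictions = [
    {"date": "2024-01-01", "predicted_price": 150.25, "confidence": 0.80},
    {"date": "2024-01-02", "predicted_price": 151.00, "confidence": 0.80},
    {"date": "2024-01-03", "predicted_price": 152.50, "confidence": 0.88},
    {"date": "2024-01-04", "predicted_price": 153.75, "confidence": 0.82},
    {"date": "2024-01-05", "predicted_price": 151.50, "confidence": 0.90},
    {"date": "2024-01-06", "predicted_price": 155.00, "confidence": 0.83},
    {"date": "2024-01-07", "predicted_price": 156.25, "confidence": 0.87},
]


def generate_alerts(preds, threshold=CONFIDENCE_THRESHOLD):
    """Return (date, action) pairs for days whose confidence meets the threshold.

    Assumption: direction is the sign of the change from the previous
    day's predicted price; the first day has no reference and is skipped.
    """
    alerts = []
    for prev, cur in zip(preds, preds[1:]):
        if cur["confidence"] < threshold:
            continue
        if cur["predicted_price"] > prev["predicted_price"]:
            alerts.append((cur["date"], "BUY"))
        elif cur["predicted_price"] < prev["predicted_price"]:
            alerts.append((cur["date"], "SELL"))
    return alerts


alerts = generate_alerts(predictions)
```

On the listing's data this selects 1/3, 1/5, and 1/7 with directions up, down, and up, matching the days and directions stated above.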