Enhancing safety and efficiency in automated container terminals:
Route planning for hazardous material AGV using LSTM neural
network and Deep Q-Network
Fei Li1,†, Junchi Cheng2,†, Zhiqi Mao3,✉, Yuhao Wang3, Pingfa Feng1
1 Department of Mechanical Engineering, Tsinghua University, Beijing 100084, China
2 Information Science and Technology College, Dalian Maritime University, Dalian 116000, China
3 School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
† Fei Li and Junchi Cheng contributed equally to this work.
✉ Corresponding author. E-mail: mzq16@tsinghua.org.cn
Journal of Intelligent and Connected Vehicles 2024, 7(1): 64−77. Research Article. https://doi.org/10.26599/JICV.2023.9210041
Received: January 17, 2024; Revised: March 6, 2024; Accepted: March 18, 2024
©The Author(s) 2024. This is an open access article under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/).
ABSTRACT: As the proliferation and development of automated container terminals continue, the issues of efficiency and safety become increasingly significant. The container yard is one of the most crucial cargo distribution centers in a terminal. Routing Automated Guided Vehicles (AGVs) that carry materials of varying hazard levels through these yards without compromising the safe transportation of hazardous materials, while also maximizing efficiency, is a complex challenge. This research introduces an algorithm that integrates a Long Short-Term Memory (LSTM) neural network with reinforcement learning techniques, specifically the Deep Q-Network (DQN), for routing an AGV carrying hazardous materials within a container yard. The objective is to ensure that the AGV carrying hazardous materials efficiently reaches its destination while effectively avoiding AGVs carrying non-hazardous materials. Utilizing real data from the Meishan Port in Ningbo, Zhejiang, China, the actual yard is first abstracted into an undirected graph. Since LSTM neural networks efficiently convey and represent information over long time sequences without discarding useful information from the distant past, a two-layer LSTM neural network with 64 neurons per layer was constructed to predict the motion trajectories of AGVs carrying non-hazardous materials, which are incorporated into the map as background AGVs. Subsequently, DQN is employed to plan the route for an AGV transporting hazardous materials, aiming to reach its destination swiftly while avoiding encounters with other AGVs. Experimental tests have shown that the route planning algorithm proposed in this study improves the level of avoidance of the hazardous material AGV with respect to non-hazardous material AGVs. Compared to the method in which the hazardous material AGV follows the shortest path to its destination, the avoidance efficiency was enhanced by 3.11%. This improvement demonstrates potential strategies for balancing efficiency and safety in automated terminals. Additionally, it provides insights for designing avoidance schemes for autonomous driving AGVs, offering solutions for complex operational environments where safety and efficient navigation are paramount.
KEYWORDS: container yard, route planning, hazardous material Automated Guided Vehicle (AGV), Long Short-Term Memory (LSTM), Deep Q-Network (DQN)
1 Introduction
The container yard within a terminal serves a critical function as a
transitional storage and coordination area. Here, goods destined
for both unloading from and loading onto ships are temporarily
stored, forming a vital buffer in the logistical chain. The
operational dynamics of this yard, where Automated Guided
Vehicles (AGVs) and cranes diligently operate, are integral to the
overall efficiency of the terminal’s functions. The correlation
between the yard’s operational efficiency and the terminal’s overall
performance has been underscored in numerous studies
(Gunawardhana et al., 2021; Li et al., 2021, 2023; Skaf et al., 2021;
Tan et al., 2024; Xiang et al., 2022). Therefore, a substantial
number of scholars have dedicated their efforts to enhancing work
efficiency and striving for practicality (Wang et al., 2018; Zhao et
al., 2020; Zhen et al., 2012; Zhu et al., 2012).
Initial studies primarily focused on optimizing the routing of
AGVs within the yard and enhancing the coordination between
AGVs and cranes for seamless operation (Chen et al., 2004; Zhen,
2016; Zhu, 2012). With the advent and continuous evolution of
artificial intelligence and autonomous AGV technology, the
concept of unmanned and intelligent AGVs has become a pivotal
area of research. Chen et al. (2022) posited that the future of
terminal operations lies in automation, particularly emphasizing
the role of driverless AGVs not only within the terminal but also
in transporting cargo beyond terminals. Filom et al. (2022)
highlighted the growing utilization of machine learning in AGV
route planning within terminals, indicating a significant trend
towards technological integration in logistics management.
Despite these advancements, the diverse nature of cargo
transported by AGVs, particularly with regard to hazardous
materials, is often overlooked in existing research. Different
cargoes, especially hazardous chemicals, necessitate distinct
logistical considerations. For instance, AGVs transporting
hazardous materials must maintain a safe distance from those
carrying general cargo, even if they belong to the same type of
AGV. This differentiation in route planning is crucial not only for
operational efficiency but also for ensuring safety and security
within the terminal environment.
However, most existing research does not adequately address
the separate route planning required for AGVs carrying hazardous
materials. There remains a gap in ensuring that these AGVs
carrying hazardous materials are kept at a safe distance from
AGVs carrying non-hazardous materials, while still fulfilling both
their transport objectives. This aspect is critical for maintaining
high security standards in terminal operations and warrants
further investigation and development in the field of terminal
logistics and management.
In this study, we aim to develop an innovative model for the
strategic path planning of AGV transporting hazardous materials
within container yards. The primary objective of this model is to
ensure the expedient and safe routing of an AGV carrying
hazardous materials (hereafter referred to as H-AGV) within
terminal environments, particularly focusing on their navigation
in proximity to AGVs transporting non-hazardous materials
(referred to as N-AGVs). This model is designed to facilitate the
H-AGV, ensuring their timely arrival at designated locations while
maintaining a safe distance from operating N-AGVs, thereby
enhancing safety and efficiency in terminal operations. The
proposed model is based on real data from Meishan Port, Ningbo
Zhoushan Port, Zhejiang, China. The logic block diagram of the
proposed model is shown in Fig. 1, and the methodology of this
study is outlined as follows:
Firstly, this work involves precise mapping of the yard
environment at Meishan Port, which is part of Ningbo Zhoushan
Port in Zhejiang, China. This mapping is achieved by using the
specific longitude and latitude coordinates of the container yard,
which provides a detailed and accurate representation of the
operational area. At the same time, the map used in this work is an abstracted and discretized representation of the real yard.
Secondly, we propose an AGV trajectory rolling prediction
model based on Long Short-Term Memory (LSTM) network.
This network is employed to analyze the historical trajectory data,
specifically the latitude and longitude coordinates of N-AGVs
within the Meishan port. The output from this analysis is used to
predict the future paths of these N-AGVs. In the context of our
model, these predicted paths are then integrated into the yard
map, playing a role similar to background elements in
cinematography, providing a dynamic context for the movement
of H-AGV.
Thirdly, we apply reinforcement learning algorithms to determine the optimal route for the H-AGV. The H-AGV is assigned
random start and endpoint locations on the map. The
reinforcement learning framework is tasked with generating a
trajectory that not only ensures the quickest possible route to the
destination but also adheres to the safety criterion of maintaining a
safe distance from N-AGVs throughout the journey.
This study is organized as follows. In Section 2, related literature
is reviewed. Section 3 provides a comprehensive overview of the
map drawing process, along with the foundational assumptions of
the proposed model and definitions of key symbols employed
throughout the study. Section 4 focuses on elucidating how the
LSTM network predicts the trajectories of N-AGVs within
terminal environment. Section 5 is dedicated to exploring the
application of reinforcement learning in training the paths of
H-AGV. Section 6 presents the outcomes of this project. It
includes a thorough analysis and comparison of experimental
results, providing insights into the efficacy of the proposed model
in practical scenarios. Section 7 discusses the limitations of the
experiment and offers suggestions for future experiments. Section 8
synthesizes the findings of the study, offering a concise summary
of the key conclusions drawn from the research. It also discusses
the implications of these findings and potential avenues for future
research in this field.
Fig. 1 Logic block diagram of the model for planning the trajectory of the AGV carrying hazardous materials: the historical latitude and longitude coordinates of N-AGVs and the coordinates of the port yard feed the LSTM neural network, which sets the background for the training of the H-AGV; reinforcement learning is then used to plan a trajectory so that the H-AGV reaches its destination while avoiding N-AGVs.
2 Literature review
This work primarily focuses on two crucial aspects. Firstly, it
addresses the configuration of background AGV movement,
specifically emphasizing the development of a motion design for
N-AGVs that strikes a balance between realism and
computational simplicity. Secondly, it involves the design of the
H-AGV’s movement strategy to ensure that it can reach its
endpoint while avoiding the N-AGVs as efficiently as possible. In
this section, we have reviewed some highly relevant and state-of-
the-art studies.
2.1 AGV trajectory prediction
For the first part, concerning the design of N-AGVs’ motion
trajectories, the dataset utilized in this study consists of historical
trajectory latitude and longitude coordinates of numerous N-
AGVs. Regarding AGV trajectory prediction, current research
predominantly utilizes operational research methods (Pei et al.,
2023; Qu et al., 2022) and deep learning (Zhu et al., 2022).
Notably, deep learning neural networks, initially developed for processing language and text, have demonstrated considerable effectiveness in AGV trajectory prediction and path planning (Qu et al., 2023).
Liu et al. (2023) put forward their concept based on the
mechanism that models developed for processing natural language
treat human language as sequences of words constrained by
specific grammatical rules. Following their concept, they proposed
that any entity which can be mapped into a learnable sequence,
not just natural language, could be interpreted through a language
model. For instance, in the context of delivery problems, they
treated delivery routes as “sentences” and delivery points as
“words”. By applying language models to optimize delivery paths,
they achieved superior results. This approach exemplifies the
innovative application of natural language processing techniques
in non-linguistic domains like logistics and route planning. Fang
et al. (2022) developed a deep learning-based LSTM model for
intelligent AGV collision avoidance trajectory planning, using the
coordinates of AGVs at the time of collision as inputs. After
experimentation, it was found that this LSTM model exhibited
superior fitting accuracy compared to traditional planning models,
demonstrating greater potential for broader application. This
advancement underscores the effectiveness of LSTM network in
complex, dynamic scenarios like real-time AGV trajectory
planning and collision avoidance. Wang et al. (2022) employed a Seq2Seq model to predict AGV trajectories and similarly trained an LSTM model with analogous inputs. They compared the
performance of these two models in AGV trajectory prediction
and found that the LSTM model exhibited superior predictive
accuracy. However, the results show that the model’s performance
deteriorated over time, with an increase of Root Mean Squared
Error (RMSE), especially as the prediction horizon extended. This
observation highlights a common challenge in time-series
predictions using LSTM models, where longer-term predictions
tend to lose accuracy due to the accumulation of errors over time.
Yang et al. (2022) experimented with different improved LSTM models, e.g., Social Long Short-Term Memory (S-LSTM), training them using the same historical latitude and longitude coordinates
of AGVs. They predicted the coordinates for the next 5 s for 250
sets of data, calculating the RMSE value of the predictions every
second. According to their findings, while these models produced
similar prediction results, all of them exhibited an increase in
prediction error over time. This trend underscores a common
limitation in time-series forecasting models: As the prediction
horizon extends, the accuracy tends to decrease, a challenge
particularly pertinent in dynamic systems like AGV trajectory
prediction.
These research findings align with our objectives and
demonstrate through existing experiments that models like LSTM,
originally designed for natural language processing, are also
effective in AGV path planning. Moreover, they also highlight a
crucial consideration: when LSTM models are used for prediction
and output multiple sets of data simultaneously, the earlier
outputs tend to have smaller errors, while the errors in later
outputs increase progressively. This insight is valuable for our
work, as it underscores the need for careful management and
interpretation of LSTM model predictions, particularly when
planning over longer time horizons or for complex, dynamic
environments like traffic systems.
2.2 AGV route planning
An H-AGV needs to avoid continuously moving N-AGVs, all
within the confines of a structured map with strong regularity, and
reinforcement learning methods are extensively utilized in this
domain. Reinforcement learning, with its ability to learn optimal
strategies through trial and error in dynamic environments, is
particularly well-suited for scenarios like this (Han et al., 2023; Sun
et al., 2022), where the H-AGV must navigate in real-time while
adapting to the movements of N-AGVs. This approach allows the
model to learn and update its strategies based on the immediate
outcomes of its actions, making it ideal for complex situations
such as traffic and obstacle avoidance in autonomous AGV
navigation.
Zhang and Wu (2023) proposed an autonomous AGV
behavior decision model based on an ensemble of deep
reinforcement learning approaches. This model integrates three
foundational network models: Deep Q-Learning Network (DQN),
Double DQN (DDQN), and Dueling Double DQN (Dueling
DDQN). The model was tested and its generalizability was verified
in a highway simulation environment across scenarios with one-
way three, four, and five lanes. The interactions of AGVs during
five driving behaviors (i.e., lane changing to the left, lane keeping,
lane changing to the right, accelerating in the same lane, and
decelerating) were examined. It was found that deep learning can
effectively control AGV interactions over short time periods,
highlighting its potential in managing complex, real-time
vehicular behaviors and interactions. Li et al. (2022) investigated
the application of Deep Reinforcement Learning in planning AGV
trajectories during continuous lane changes amidst multiple-AGV
interactions on highways. They found that the use of the DQN
model led to enhanced outcomes in aspects such as collision
avoidance and maintaining AGV speed. This research highlights
the efficacy of DQN in managing complex driving scenarios,
demonstrating its capability to optimize vehicular movements in
dynamic, multi-agent environments where quick adaptation and
decision-making are crucial. Beyond its effectiveness in fine-
tuning the positional relationships among multiple targets within a
limited scope, reinforcement learning also plays a significant role
in scenarios involving the large-scale movement of multiple
AGVs. In such contexts, reinforcement learning is adept at
handling the complexities of larger, more dynamic environments
where AGVs interact extensively. Instead of considering their concrete shapes, AGVs are abstracted into points. This capability
is crucial for tasks like coordinating the movements of a fleet,
managing traffic flow, or optimizing routes in real-time,
demonstrating the versatility and robustness of reinforcement
learning in diverse and expansive vehicular systems (Dai et al.,
2023; Ding et al., 2022). Cai et al. (2023) conducted research on
the interactions between AGVs and pedestrians at major traffic
intersections. They studied, on a relatively large scale, how AGVs
navigate around pedestrians and other AGVs. Their focus was on
enabling AGVs to choose the shortest possible path through
intersections while maintaining as high a speed as is safely
possible. This research is significant in understanding and
optimizing the complex dynamics at busy intersections, with the
goal of enhancing traffic flow efficiency while ensuring the safety
of all road users, including pedestrians.
These large-scale studies involving multi-objective and multi-
action reinforcement learning scenarios are similar to our work
and provide valuable insights for our project. Our work will also
take place in a larger area with numerous AGVs having extensive
mobility. Each AGV can be abstracted as a point. From the
existing literature, it is evident that the assumptions and
constraints on the movement patterns of these points significantly
impact the computational demands of reinforcement learning.
This aspect should be a focal point in our subsequent work. It
suggests that careful consideration of how movement patterns are
modeled and constrained can greatly influence the efficiency and
effectiveness of the reinforcement learning algorithms employed
in this study.
2.3 Contributions
Drawing from the aforementioned literature, this study adapts and
innovates upon the methods of deep learning, particularly LSTM,
and reinforcement learning for predicting AGV trajectories and
managing inter-AGV relationships. Our approach integrates
insights from these fields, leveraging the strengths of LSTM for
accurate short-term movement forecasting and the dynamic
decision-making capabilities of reinforcement learning. By doing
so, our work not only builds upon existing methodologies but also introduces novel improvements and adaptations suitable for complex, large-scale vehicular environments. The main
contributions of this paper are listed as follows:
1) The trajectories of N-AGVs are not arbitrarily created;
rather, they are derived from an LSTM neural network training
process. This method ensures a high degree of realism in the
generated paths, as they closely mimic the actual movement
patterns observed in terminal operations. In response to the
aforementioned issue where LSTM models generate larger errors
for later data points when predicting AGV trajectories with a
single output of multiple data sets, we propose a rolling prediction
method to address this problem. The rolling prediction approach
involves continuously updating the predictions as new data
becomes available, rather than relying on a single, long-term
forecast. This method helps to mitigate the error accumulation in
LSTM predictions over extended periods.
2) The project is grounded in authentic data from real-world
terminal operations. This reliance on genuine terminal data
imbues our approach with a level of realism and applicability that
is often lacking in theoretical models. By utilizing authentic data,
our model is not only more accurate but also possesses a universal
applicability, making it a potentially valuable tool for a wide range
of terminal logistics scenarios. The discretization of coordinates
through graph theory, on the other hand, allows for a more
structured and analyzable representation of the data, aiding in the
identification of patterns and relationships between AGVs more
effectively. This combination of techniques aims to enhance the
accuracy and usability of AGV trajectory predictions in complex,
dynamic environments.
3) Our primary focus is on delineating and contrasting the path
planning methodologies for H-AGV and N-AGVs. In this study,
we introduce an innovative solution specifically tailored for the
path planning of H-AGV. This solution is designed to address the
unique challenge posed by the transportation of hazardous
materials within the terminal environment, ensuring both safety
and efficiency.
3 Mapping and problem formulation
3.1 Cartographic abstractions
Prior to delving into the modeling process, it is essential to provide
a geographical context for our data source, Meishan Port. This
port is geographically located at approximately 121.9842°E
longitude and 29.7726°N latitude. Its layout is captured in Fig. 2,
which presents a satellite image from Google Earth software. Fig. 2
reveals that the container yard is predominantly rectangular,
stretching in a northeast–southwest direction. The yard’s
dimensions are substantial, measuring approximately 3,000 m in
length and 600 m in width.
The internal roadway network of the Meishan port exhibits a
complex grid structure, consisting of 12 roadways aligned in a
northwest–southeast direction and 9 roadways in a southwest–
northeast direction. These roadways intersect orthogonally,
forming a total of 108 junctions. For the purposes of simplifying
our model, each roadway is represented as a linear segment devoid
of width, and each intersection is approximated as a point. The
latitude and longitude coordinates for these intersection points
have been precisely determined.
To facilitate the development of the LSTM neural network and
the reinforcement learning model, a simplified map of the harbor
yard is constructed through the following two-step process.
Initially, a Cartesian coordinate system is introduced to facilitate
the map’s construction. This system is employed instead of
directly utilizing the latitude and longitude coordinates of the
intersections, allowing for a more streamlined representation.
Subsequently, the map is rotated around the origin of this newly
established Cartesian coordinate system. This rotation is
strategically executed to make roadways either horizontally or
vertically on the new map, thereby simplifying the representation
for modeling purposes.
Through meticulous observation, it has been ascertained that
the latitude of all intersection points in the Meishan port exceeds
29.75°N, and their longitudes surpass 121.98°E. Considering the
yard’s dimensions, approximately 3,000 m in length and 600 m in
width, and its roughly rectangular shape, the impact of the Earth’s
curvature on this area is deemed negligible for the purposes of our
study. Thus, for the sake of simplification, this assumption is
adopted in our modeling approach. In this study, we establish a
Cartesian coordinate system to represent the yard’s layout. The
intersection point at 29.75°N and 121.98°E is designated as the
origin of this system, with the 121.98°E meridian forming
the Y-axis and the 29.75°N parallel constituting the X-axis. The
longitude and latitude of each intersection point are initially denoted as (x''_m, y''_n), where m and n represent the sequential numbering of the roadways from left to right and from top to bottom, respectively. To adapt these geographical coordinates to our coordinate system, we subtract 121.98°E from each longitude and 29.75°N from each latitude (i.e., x'_m = x''_m − 121.98° and y'_n = y''_n − 29.75°). The resulting differences are then assigned as the new coordinates (x'_m, y'_n) in the Cartesian frame.
Fig. 2 Meishan Port satellite photo from Google Earth software (scale bar: 700 m; the marked location is 29.7726°N, 121.9842°E).
According to the two-dimensional rotation method, Eq. (1) is applied:

x_m = x'_m·cos θ − y'_n·sin θ
y_n = x'_m·sin θ + y'_n·cos θ        (1)

where θ is the angle of rotation; if the rotation is clockwise, θ will be negative. Here a clockwise rotation was chosen. It is important to note that the selection of this angle is not governed by a stringent rule; rather, it is chosen for its practical utility.
The resultant map, post-rotation, is depicted in Fig. 3a. An
observation of this map reveals that the distances between each
intersection point are not uniform. To facilitate the deployment
and movement simulation of N-AGVs and H-AGV on this map,
and to streamline training and analysis processes, we have opted
for a discrete movement model. In this model, the movement of
AGVs is analogous to checkers, with jumps from one
predetermined point to another on the mapped roads. These fixed
points are essential for creating a realistic and manageable
simulation environment.
To be more realistic, the distance between these points should
be similar and reasonable. It is therefore necessary to artificially
place such points between the intersections where AGVs are likely
to pass. In order to distinguish between these points, they will be referred to as jump points, while the points at road intersections will be called intersection points as before. The intersection points refer to the actual road intersections, while jump points are the artificially created points that facilitate the simulation of AGV movement.
The density and spacing of jump points have a direct impact on
the simulation’s realism and computational load. Denser jump
points create a more continuous and realistic AGV movement but
increase computational demands. Conversely, fewer and
more widely spaced jump points result in a more discrete
movement pattern but reduce computational requirements.
Balancing these factors, we determined an optimal total of 936
points, which corresponds to an estimated real-world distance
ranging from 40 to 50 m between each point in the physical yard.
Fig. 3b illustrates the new map, clearly marked with both
intersection points and the strategically placed jump points. This
setup ensures that the deployment and movement of H-AGV and
N-AGVs are confined to these defined points, facilitating a
controlled and realistic simulation of vehicular movement within
the container yard.
3.2 Utilization of the map
After the map is created, for ease of management, each point on
the map is assigned a unique ID, starting from 0. This process
gives each of the 936 points in the graph a distinct identifier.
Additionally, the coordinates of each point on the map are
recorded and directly correlated with their respective IDs. This
system of assigning unique IDs and corresponding coordinates to
each point allows for precise tracking and efficient management of
spatial data within the map. Points that are directly connected to
any given point by dashed lines are referred to as its neighbor
points (hereafter referred to as n-points). For instance, an
intersection point at a T-junction would have three neighbor
points, reflecting the three directions of possible movement from
that junction. Similarly, an intersection point at a crossroads
would have four neighbor points, corresponding to the four
connected directions. However, for a jump point, which
presumably represents a simpler connection like a straight line,
there would only be two neighbor points.
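As a minimal sketch of how such an undirected graph and its neighbor relations can be held in NetworkX (the node IDs, coordinates, and edges below are illustrative placeholders rather than the actual 936-point yard graph):

import networkx as nx

# Small illustrative undirected graph: node ID -> (x, y) coordinate in the yard frame.
yard = nx.Graph()
coords = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (1.0, 1.0)}   # placeholder points
for node_id, xy in coords.items():
    yard.add_node(node_id, pos=xy)
yard.add_edges_from([(0, 1), (1, 2), (1, 3)])   # node 1 behaves like a T-junction

# Neighbor points (n-points) of a point are simply its graph neighbors.
print(list(yard.neighbors(1)))          # -> [0, 2, 3]
# Shortest path between two points, used later for the path and distance reward.
print(nx.shortest_path(yard, 0, 3))     # -> [0, 1, 3]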
For the subsequent movement of AGVs within the graph, there
are specific rules and requirements that need to be followed:
1) AGVs are abstracted as points on the map. The key rule is
that they can only appear at the coordinates corresponding to one
of the 936 predefined points on the map.
2) The starting points of AGVs are defined as s-points. Besides, we define the points where the AGVs are currently located as current points (i.e., c-points) and the points the AGVs have just left as history points (i.e., h-points). It is noted that an h-point is not every point that an AGV has passed through, but only the single point it left in its most recent jump.
3) Any N-AGV initially located at an s-point (which can also be
referred to as a c-point) has options to either jump to any of the n-
points connected to its current s-point or remain stationary at its
current s-point/c-point. However, once a N-AGV makes its first
jump, its options for subsequent moves, regardless of its current
position (including if it returns to its original s-point), are limited
to only two choices: It can either stay stationary at a c-point or
jump to any n-point that is not an h-point.
4) In the map with 936 points, it is permitted for more than one
AGV to occupy or stay at any given point simultaneously.
5) The process of an AGV jumping from one point to another
is instantaneous, but not all AGVs jump synchronously. This
reflects real-world conditions where AGVs with higher speeds in
the simulation would jump between points more quickly, whereas
slower AGVs would make their jumps more slowly.
6) When the simulation starts, N-AGVs are theoretically
designed to operate continuously for an indefinite period. This
design ensures that the H-AGV has sufficient time to navigate
towards its randomly assigned destination.
Fig. 3 Newly built map: (a) with only intersection points marked; (b) with intersection points and jump points marked.
7) In the model, there are multiple N-AGVs and only one
H-AGV. This assumption is made to simplify the scenario to a
certain extent. Once the AGVs start operating in simulation, the
number of H-AGV and N-AGVs are fixed—they cannot increase
or decrease in quantity, nor can they disappear and reappear.
The design and assumptions outlined above aim to closely
mimic real-world AGV motions while ensuring the feasibility of
subsequent calculations.
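A minimal sketch of rule 3) above, expressed as a function over the yard graph (the names are illustrative, and None encodes the option of remaining stationary at the current point):

import networkx as nx

def legal_moves(graph, c_point, h_point=None):
    """Options for an N-AGV: stay at the c-point, or jump to any n-point that is not the h-point.

    Before the first jump h_point is None, so every neighbor of the s-point is allowed.
    """
    options = [None]   # None = remain stationary at the current point
    options += [n for n in graph.neighbors(c_point) if n != h_point]
    return options

# Usage on a tiny 3-point chain 0-1-2: after jumping 0 -> 1, the AGV may stay at 1
# or continue to 2, but may not jump straight back to 0.
chain = nx.path_graph(3)
print(legal_moves(chain, c_point=1, h_point=0))   # -> [None, 2]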
4 N-AGV setting and LSTM training
4.1 General idea and goal of LSTM training
Based on the existing literature presented in Section 2 and the
rules for AGV operation outlined in Section 3.2, the decision was
made to first use an LSTM neural network to predict the
trajectories of N-AGVs based on their historical latitude and
longitude coordinates. These predicted trajectories are then
discretized onto the created map. To address the issue of LSTM
producing larger errors in later predictions when outputting
multiple sets simultaneously, and to handle the need to predict an
indefinite number of N-AGV trajectories, a rolling prediction
method is proposed in this study. Additionally, it is important to
clarify that the prediction of N-AGVs’ trajectories and their
deployment on the map is primarily to provide a dynamic
background for subsequent H-AGV path planning. Therefore, the
primary focus for the prediction and mapping of N-AGVs’
trajectories should be on their reasonableness and realism, rather
than achieving extremely high prediction accuracy. This approach
recognizes the practical need for creating a realistic and dynamic
simulation environment that effectively supports the main
objective of optimizing H-AGV’s path planning.
4.2 Data preprocessing and network frame
The dataset used in this work consists of historical coordinate data
for N-AGVs provided by Meishan Port. It contains a total of
207,165 rows and four columns. The first column contains the names of the different N-AGVs (each AGV's name spans multiple consecutive rows, one per record, and all rows of the same vehicle are grouped together rather than interleaved), the second column contains the timestamp of each record (the timestamps for each vehicle increase monotonically), the third column lists the longitude coordinate of the N-AGV at that time, and the fourth column lists the latitude coordinate. The sampling frequency of the data is
1 Hz and each N-AGV has several hundred to several thousand
sets of latitude and longitude coordinates over continuous time
periods. The dataset covers a total of 556 N-AGVs. For the
purposes of model testing and validation, 38 sets of data with
fewer coordinate points were selected as the test set, 104 sets as the
validation set, and the remaining 414 sets as the training set. Given
that all N-AGVs exhibit similar movement patterns, and with the
aim for the model to learn common features across all N-AGVs,
the latitude and longitude coordinates from the 414 N-AGV
groups are used as input to train a single model. This approach is
designed to help the model generalize across different AGVs and
conditions.
Combined with previous relevant studies (Fang, 2022; Yang,
2022), the structural design of our model is shown in Fig. 4. The
number of layers is set to be 2 and the number of hidden units in
each LSTM layer is set to 64. The outputs of the model are the
longitude and latitude coordinates.
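A minimal PyTorch sketch of the described structure, with two stacked LSTM layers of 64 hidden units followed by a linear head that outputs the predicted longitude and latitude; the tensor shapes and the final linear layer are assumptions about implementation details the text does not spell out.

import torch
import torch.nn as nn

class TrajectoryLSTM(nn.Module):
    """Two-layer LSTM (64 hidden units per layer) mapping a window of (lon, lat) pairs
    to the next predicted (lon, lat) pair."""
    def __init__(self, input_size=2, hidden_size=64, num_layers=2, output_size=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, output_size)   # assumed linear output layer

    def forward(self, x):
        # x: (batch, window_length, 2) -> predicted next coordinate: (batch, 2)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # use the hidden state of the last time step

model = TrajectoryLSTM()
dummy_window = torch.randn(8, 50, 2)     # a batch of 8 windows of 50 (lon, lat) pairs
print(model(dummy_window).shape)         # -> torch.Size([8, 2])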
4.3 Network training and results
The LSTM model employs a toolkit provided by PyTorch.
Fig. 4 Structure of the LSTM model: the historical latitude and longitude coordinates feed two stacked LSTM layers with 64 hidden units each, which output the predicted longitude and latitude coordinates.
For the
training set, a batch size of 64 is set, while for the validation set, a
smaller batch size of 32 is used. Learning rate is set to 0.001, and
an Adam optimizer is used to adjust parameters. The loss function
is Mean Squared Error (MSE). Verification frequency is once every 10 epochs. The convergence process of LSTM model training, shown in Fig. 5, indicates that the number of training iterations can be set to 250. Because the LSTM network tends to produce larger errors for later outputs, we focus on the MSE of the first 10, 20, 30, 40, and 50 sets of predictions, respectively, which has been calculated and is shown in Table 1.
Fig. 5 LSTM network model training convergence process (train vs. test loss over 250 epochs).
Table 1 Results of LSTM network
Number of prediction sets    10       20       30       40       50
MSE                          0.339    0.346    0.401    0.456    0.498
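Under the stated settings (Adam, learning rate 0.001, MSE loss, batch size 64 for training and 32 for validation, validation once every 10 epochs, 250 epochs), a condensed training-loop sketch could look as follows; the random tensors stand in for the windowed coordinate data, and the model follows the structure sketched in Section 4.2.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class TrajectoryLSTM(nn.Module):   # same structure as the sketch in Section 4.2
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(2, 64, 2, batch_first=True)
        self.head = nn.Linear(64, 2)
    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])

# Placeholder tensors standing in for the windowed (input, target) coordinate pairs.
train_loader = DataLoader(TensorDataset(torch.randn(1024, 50, 2), torch.randn(1024, 2)),
                          batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(torch.randn(256, 50, 2), torch.randn(256, 2)),
                        batch_size=32)

model = TrajectoryLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for epoch in range(250):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    if (epoch + 1) % 10 == 0:            # verify once every 10 epochs
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader) / len(val_loader)
        print(f"epoch {epoch + 1}: validation MSE = {val_loss:.6f}")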
4.4 Rolling prediction
To address the issue of increasing errors in later predictions when LSTM outputs multiple sets of data at once, and to meet the requirement for the model to make indefinite predictions, a rolling prediction method is proposed in this study. Let N represent the historical trajectory points of any N-AGV, and let i denote the last historical trajectory point (i.e., the longitude and latitude coordinates with the latest timestamp). We use the 50 sets of data in [i−49, i] as the input to give the prediction a warm start. Then, after the model predicts point i+1 as a result, we input [i−48, i+1] as the new set of data and obtain i+2, and so on for the rest of the process. This rolling prediction approach effectively creates a moving window that shifts forward with each new prediction. By always incorporating the most recent prediction into the input window for the next prediction, the model continually adapts to the latest data and minimizes the accumulation of errors.
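A minimal sketch of this moving window (the 50-point window length follows the text; the model is assumed to take a (1, 50, 2) tensor of coordinates and return a (1, 2) next-coordinate prediction, as in the earlier sketches):

import torch

def rolling_predict(model, history, num_steps):
    """Predict num_steps future (lon, lat) points by repeatedly feeding the latest
    50-point window and appending each new prediction to the end of the window."""
    window = list(history[-50:])        # warm start: the last 50 historical points
    predictions = []
    model.eval()
    with torch.no_grad():
        for _ in range(num_steps):
            x = torch.tensor([window], dtype=torch.float32)   # shape (1, 50, 2)
            next_point = model(x)[0].tolist()                 # predicted (lon, lat)
            predictions.append(next_point)
            window = window[1:] + [next_point]                # slide the window forward by one
    return predictions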
4.5 Real-world coordinate mapping to the map
The data predicted by the LSTM are real-world coordinates, which need to be transformed by Eq. (1) to enter the coordinate system we set up. Next, we need to discretize these predicted coordinates and "drag" them to the 936 points we set up beforehand. First of all, the discretization and prediction are done at the same time, i.e., every time a coordinate is predicted, it is dragged to one of the 936 points. When the first point i+1 of an AGV is predicted, the Euclidean distances between i+1 and all 936 possible points in the graph are computed. Once the nearest jump point or intersection point is found, the ID of this point is recorded and that point is regarded as the AGV's s-point (as well as its c-point now). Next, when the model predicts i+2, it calculates the Euclidean distance from i+2 to the c-point and all its n-points, selects the point with the smallest Euclidean distance as the new c-point, and records the ID of this point. There are two cases. In the first case there is no movement; when the model predicts i+3, it simply repeats the operation performed for i+2. In the other case there is a movement, i.e., there is a new c-point and the previous c-point becomes the h-point; then, after the model predicts i+3, it computes the Euclidean distances from i+3 to the c-point and to all n-points except the h-point, selects the nearest point as the new c-point, and records the ID of this point. The subsequent process is carried out in an analogous manner. This process is implemented with the NetworkX library in Python, which allows the predicted coordinates to be mapped onto the map in a timely manner and keeps a record of the IDs of the points that each N-AGV passes through. When the LSTM model predicts the coordinates of multiple AGVs at the same time, since it uses the same prediction method and inputs the same amount of data, we can assume that the model generates the predicted data at the same rate for each AGV trajectory, so that the number of IDs recorded for each AGV in the same time period after mapping is also the same. If the predicted sets of coordinates are close to each other, the AGV intuitively appears to stagnate on the map. Conversely, if the predicted sets of coordinates are farther apart, the AGV visually appears to be in continuous motion. This reflects the speed of the AGV well and is more realistic.
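A minimal sketch of this snapping step (the graph is assumed to store each point's Cartesian coordinates in a 'pos' node attribute, as in the graph sketch of Section 3.2; the first prediction is snapped against all 936 points, later ones only against the current c-point and its admissible n-points):

import math
import networkx as nx   # the yard graph is assumed to be a networkx.Graph

def nearest_node(graph, xy, candidates):
    """Return the candidate node whose stored position is closest (Euclidean) to xy."""
    return min(candidates, key=lambda n: math.dist(xy, graph.nodes[n]["pos"]))

def snap_prediction(graph, xy, c_point=None, h_point=None):
    """Map one predicted coordinate onto the discrete yard graph.

    For the first prediction (c_point is None) every node is a candidate; afterwards
    only the current c-point and its n-points, excluding the h-point, are allowed.
    """
    if c_point is None:
        candidates = list(graph.nodes)
    else:
        candidates = [c_point] + [n for n in graph.neighbors(c_point) if n != h_point]
    return nearest_node(graph, xy, candidates)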
5 Methodology of the reinforcement learning
5.1 Introduction of DQN
Reinforcement learning (RL) has a very wide range of applications
in AGV trajectory planning (Han, 2023; Zhang and Wu, 2023).
Moreover, since the map for AGV operations contains 936 points,
using traditional Q-Learning would result in a substantial
computational load during training. Therefore, in this context,
DQN will be employed to plan the trajectory of the H-AGV.
Deep Q-Learning method interacts with the environment
through a policy-target structure, selecting appropriate actions as
policy outputs. The process of interaction with the environment is
illustrated in Fig. 6.
The policy–target structure of DQN is composed of two neural networks, namely the policy network and the target network. The process of DQN interacting with the environment can be described as follows. At time t, the policy network estimates the Q values of all executable actions based on the input state s_t from the environment and executes the action a_t with the highest Q value. The environment then provides feedback in the form of a reward r_t based on the executed action. Subsequently, the state of the environment updates to the state s_{t+1} at time t+1. The target network extracts information about the state s_{t+1} after the action is executed, as well as the obtained feedback reward r_t, and recalculates the Q value of the optimal action chosen by the policy network. The loss function compares the Q value calculated by the target network with the Q value estimated by the policy network. It updates the policy network so that the Q values estimated by the policy network during training progressively approximate the true Q values. This enables the selection of the optimal action based on the state information of the environment.
To define the Q value of an action, let us first define the reward r_t fed back by the environment after executing an action at step t. When selecting actions, the DQN strategy considers not only the immediate reward obtained but also its future impact. Therefore, the reward fed back by the environment after executing an action is defined as the cumulative reward. The cumulative reward obtained after executing an action at step t is the sum of the current reward at that moment and the discounted future rewards, which is represented as

R_t = Σ_{i=0}^{∞} β^i · r_{t+i}        (2)

where r_{t+i} represents the reward obtained at step t+i; β is the discount factor, indicating the impact of future moments in the cumulative reward. β belongs to the interval (0, 1].
DQN, through interaction with the environment, trains to select the optimal action based on different environmental states. The executable actions are predefined as (a_1, a_2, a_3, ..., a_n). Let R_t represent the cumulative reward obtained after executing an action. Thus, at step t, under the state s_t, the Q value of action a_n is

Q(s_t, a_n) = E(R_t | s = s_t, a = a_n)        (3)

The expected value of the cumulative reward of the action under state s_t is the Q value of the action.
The policy network and target network feature implicit fitting functions. Therefore, it is only necessary to define their inputs and outputs. The policy network estimates the Q values of all actions based on the current state and selects the optimal action. The policy network is expressed as

V(ω_a) = E(Q(s_t, a_1), Q(s_t, a_2), ..., Q(s_t, a_n))        (4)

where ω_a represents the parameters of the policy network. Equation (4) indicates that the policy network, under state s_t, estimates the Q values of all actions, and outputs the action with the highest Q value. By substituting Eq. (2) into Eq. (3), we obtain

Q(s_t, a_x) = E(r_t + β·Σ_{i=0}^{∞} β^i · r_{t+1+i} | s = s_t, a = a_x)
            = E[r_t + β·(Q(s_{t+1}, a_1), Q(s_{t+1}, a_2), ..., Q(s_{t+1}, a_n)) | s = s_t, a = a_x]        (5)
Equation (5) indicates that the Q value of an action can be
composed of the reward value obtained at the current moment
and the Q value of the optimal action in the next moment.
The target network calculates the Q value of an action based on the generated reward; therefore, the Q value calculated by the target network is closer to the real value than that estimated by the policy network. The output of the target network is defined as

V(ω_c) = r_t + β·E(Q(s_{t+1}, a_1), Q(s_{t+1}, a_2), ..., Q(s_{t+1}, a_n))        (6)

where ω_c represents the parameters of the target network.
Equation (6) suggests that the target network calculates the Q
value of the current action by considering the Q value of the
optimal action in the next moment and the reward obtained after
executing the current action.
By inputting the Q values calculated by the target network and
estimated by the policy network into the loss function, and based
on the output of the loss function, the policy network is updated
through gradient descent. This process ensures that the Q values
output by the policy network continually approximate the true Q
values (i.e., target Q value), thereby achieving the goal of selecting
the optimal action under different states.
The loss function is defined as

L(ω) = E[(V(ω_c) − V(ω_a))²]
     = E{[r_t + β·E(Q(s_{t+1}, a_1), Q(s_{t+1}, a_2), ..., Q(s_{t+1}, a_n))] − [E(Q(s_t, a_1), Q(s_t, a_2), ..., Q(s_t, a_n))]}²        (7)
Through its unique policy–target structure interacting with the environment, DQN needs no accurate model of the environment, endowing it with the ability to autonomously explore optimal strategies. Furthermore, during the training of the DQN neural network, it is necessary to use experience replay to enhance the training efficiency of the model. Experience replay involves storing data; for example, if an agent in state s performs action a and receives a reward r, subsequently transitioning to state s′, then the experience replay pool will save the tuple (s, a, r, s′). During training, a certain number of data samples (represented as D) are randomly selected from the experience replay pool for training. This method continuously optimizes the network model.
Fig. 6 Interaction of DQN with the environment: the policy network selects an action from the state and reward provided by the environment; transitions are stored in the experience replay pool and randomly sampled; the loss function compares the estimated Q value from the policy network with the target Q value from the target network, the policy network is updated by gradient descent, and its parameters are periodically replicated to the target network.
5.2 Experiment settings
The experimental design mainly consists of two parts. Firstly,
some detailed parameters of DQN will be introduced during the
training. Secondly, the design of the four key elements of
reinforcement learning will be individually discussed in this
experiment: action space, state space, reward design, and
termination conditions.
The policy network and the target network are initialized at the
beginning. Subsequently, we collect data over a certain number of
steps. The position ID of the H-AGV (as the current state), the next action, the reward, and the next position ID are stored in the experience replay buffer. Next, a training module is used for iterative training of
the policy network and target network, utilizing random sampling
from the experience replay buffer. The training continues until all
networks progressively converge. In this experiment, the policy
network and the target network have identical structures; however,
there is a distinction in their model parameters during the training
process. The main parameter configurations for training are
presented in Table 2.
Table 2 Parameters for DQN training
Parameter                   Symbol   Value
Buffer size                 N        2,000
Learning rate               η        10⁻⁴
Batch size                  B        256
Discount factor             β        0.99
Exploration initial rate    ε        1
Exploration final rate      ε        0.05
Target update interval      C        200
Max grad norm               —        10
Here, the meaning of each parameter is explained. The buffer size is the size of the replay buffer, which controls the amount of memory used to store and resample past experiences. The learning rate is the rate at which the model's weights are updated; it affects the step size in the optimization algorithm. The batch size is the number of samples used in each training iteration.
The discount factor determines the present value of future rewards. The exploration initial rate and exploration final rate are the values of ε at the beginning and the end of training. ε is a probability parameter: it is the probability that, during DQN training, the H-AGV chooses a random action in a given state, and it decreases as training proceeds. The target update interval is the interval between updates of the target network. The max grad norm is the maximum norm for gradient clipping.
In order to better understand the training process of the model, pseudo-code for Algorithm 1 (DQN with experience replay) is shown below.
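Algorithm 1 follows the standard pattern of DQN training with an experience replay pool and a periodically synchronized target network. The Python-style sketch below restates it with the parameters of Table 2; the environment interface (reset, step, valid_actions), the networks' best_action method, and the compute_td_loss helper implementing Eq. (7) are hypothetical names used only for illustration.

import random
from collections import deque

def train_dqn(env, policy_net, target_net, optimizer, compute_td_loss, episodes,
              buffer_size=2000, batch_size=256, gamma=0.99,
              eps_start=1.0, eps_end=0.05, target_update=200):
    """Sketch of Algorithm 1: DQN with experience replay."""
    replay = deque(maxlen=buffer_size)                      # experience replay pool
    target_net.load_state_dict(policy_net.state_dict())     # initialize target = policy
    eps, step_count = eps_start, 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy selection over the currently valid (unmasked) actions
            if random.random() < eps:
                action = random.choice(env.valid_actions(state))
            else:
                action = policy_net.best_action(state)
            next_state, reward, done = env.step(action)
            replay.append((state, action, reward, next_state, done))
            state = next_state
            # sample a random minibatch and update the policy network by gradient descent
            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)
                loss = compute_td_loss(policy_net, target_net, batch, gamma)   # Eq. (7)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # periodically replicate policy-network parameters to the target network
            step_count += 1
            if step_count % target_update == 0:
                target_net.load_state_dict(policy_net.state_dict())
        eps = max(eps_end, eps * 0.995)    # assumed exploration-decay schedule
    return policy_net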
The action space design, state space design, reward design, and termination conditions design of the reinforcement learning used in this work are introduced as follows; a consolidated code sketch covering these four elements is given at the end of this subsection.
1) Action space design
For the H-AGV, its movement is discrete, capable of moving
only by jumping among the 936 points on the map. It can move
from any c-point to its any n-point. An additional note is that,
since the H-AGV carries hazardous materials, reflecting real-world
scenarios, we allow it to perform reversing actions. This means
that the H-AGV can jump back from a c-point to an h-point,
whereas the N-AGVs cannot. Therefore, summarizing above, the
action space is jumping towards the up, down, left, and right
directions. However, for certain c-points, there may be no n-
points in the up, down, left, or right directions. Therefore, these
directions, which do not offer a viable path for jumping, are
effectively masked in the system. To provide more details about
this process, if the H-AGV is currently at a jump point, it can only
jump to one of the two n-points up and down, or to one of the
two n-points to the left and right. If it is at a T-intersection point,
it can jump to any of the three n-points. If it is at a crossroad
intersection point, then it can jump to the four n-points in the up,
down, left, and right directions. Additionally, it is worth noting
that to simplify the design of subsequent rewards, the H-AGV will
be set to be unable to remain stationary at a point. This means that
it must move continuously.
2) State space design
The state space refers to the collection of all possible states in an
environment. For this project, since the movement of both N-
AGV and H-AGV are discrete, confined to 936 points on the
map, the state space is mainly a collection of different states of
points on the map, including five states. The first state is the
historical trajectory of the H-AGV. When the H-AGV is in
operation, the last four points that it has passed are marked. The
second state is the directional prediction point for the H-AGV, i.e., when the H-AGV is at a certain c-point, the n-point (excluding the h-point) with the smallest Euclidean distance to the destination is labeled, which represents the direction of the destination. The third state is the
current location of all N-AGVs on the map, specifically the ID of
the points where these AGVs are situated. These points are
recorded. The fourth state is the destination of the H-AGV. This
point is unique on the map, and the location of the destination
point is specially recorded. The fifth state is all the points and the
connecting dotted lines on the map, which make up the entire
undirected graph.
3) Reward design
The design of the environment’s reward incorporates two
components: positive rewards and negative rewards, the latter
commonly known as penalties. For simplicity, both will be
collectively referred to as “rewards”. Their values can be both
positive and negative. These rewards account for various factors
that influence the scenario.
Path and distance reward: For the H-AGV, one of its most important tasks is to move from the starting point to the destination via the shortest path. Due to the relatively simple structure of the map, the shortest path between any two points on the graph can be easily found using Python's NetworkX library. The reward received by the AGV is denoted by the symbol r_d. In the first scenario, the H-AGV predicts the next point it will move to from the c-point and simultaneously calculates the shortest path to the destination with the c-point as the starting point using the NetworkX library. If the predicted next point coincides with the second point of the calculated shortest path, a fixed positive reward r_d is given. Considering that the H-AGV needs to avoid N-AGVs during its movement, which may lead to deviations from the shortest path, and that, at the same time, we do not want the H-AGV to engage in meaningless movements that take it further away from the destination, a balance is introduced using the variable "distance". "Distance" refers to the Euclidean distance from the current c-point to the destination. "Path" is the shortest route calculated from the c-point to the destination. The second scenario, which is the worst case, occurs when the next point the H-AGV is about to jump to from the c-point has a greater "distance" to the destination than the c-point, and the point about to be jumped to is also not on the shortest path corresponding to the c-point. In this case, a fixed negative reward r_d is incurred. Apart from the two scenarios mentioned above, all other situations are classified as the third scenario. In these cases, no reward is given, i.e., r_d = 0. Therefore, based on the above discussion, r_d takes a fixed positive value in the first scenario, a fixed negative value in the second scenario, and zero otherwise (Eq. (8)).
This design of the path and distance reward is a simplification of the real-world scenario, intended to ensure that the H-AGV, while avoiding N-AGVs, aims to reach its destination via the shortest path and receives a reward as close to zero as possible. This approach balances the need for efficient routing with the practical necessity of avoiding obstacles.
Cover reward: The cover reward is designed to encourage the H-AGV to avoid N-AGVs. Its mechanism is as follows. At any given step, the current position of the H-AGV is denoted as the c-point; the points surrounding it are its n-points, and each of these n-points has its own surrounding n-points. Thus, for a particular step, we record the n-points of the H-AGV's c-point as well as the n-points of these n-points. These points' IDs are then compared with the IDs of the c-points of all N-AGVs on the map at that moment. If there is a match (the total number of such matches occurring in the experiment will later be referred to as the total number of overlaps), it indicates that at least one N-AGV is near the H-AGV. If, in a particular step, n such matches occur (indicating the presence of N-AGVs near the H-AGV), a negative reward proportional to n is given. This reward system effectively incentivizes the H-AGV to maintain a distance from N-AGVs, enhancing its navigational efficiency and safety.
Duplication reward: The H-AGV might encounter a situation where, especially near the edges and corners of the map, it repeatedly jumps between two or several points, leading to an inability to escape this loop. To monitor the H-AGV's movement from the starting point (s-point) to any c-point, all passed point IDs are tracked. If the ID of any point appears three times, a significant negative reward is assigned. This mechanism effectively prevents the H-AGV from repeatedly jumping back and forth between a few points. However, there is a degree of randomness involved: the AGV might coincidentally pass the same point three times on its way to the destination without being stuck in a loop. Although possible, the probability of this occurring is relatively low, so we ignore this situation.
Step reward: To encourage the H-AGV to reach its destination from the starting point as quickly as possible, with the fewest number of steps, a small fixed negative reward is given for each jump the H-AGV makes. This incentivizes minimizing the number of steps taken to complete the route.
Destination reward: The destination reward represents the final reward for the H-AGV: upon reaching its destination, it receives a large positive reward. This large reward is intended to strongly incentivize the H-AGV to efficiently navigate to its destination.
4) Termination conditions design
There are three termination conditions for the system, all of
which are about H-AGV. Firstly, when the ID of the H-AGV’s c-
point equals the ID of the destination point that we initially
designed, i.e., when the H-AGV reaches the destination, the
process will terminate. Secondly, if the H-AGV has been operating for a long time and the number of actions it takes exceeds a certain threshold but it still has not reached the destination, termination occurs. The threshold is set to 200 jumps for the H-AGV; this second condition covers special situations. Thirdly, termination occurs when the H-AGV moves outside of the 936 points, meaning that it exits the map following a certain action. Since the H-AGV is generated directly at a specific point on the map, unlike the N-AGVs, whose actual paths are predicted and then mapped, this out-of-map situation can arise only for the H-AGV.
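The following sketch pulls the four elements above into a single illustrative step computation. The reward magnitudes are placeholder constants chosen only to show the sign and proportionality of each component, not the values used in the experiments, and the yard graph is assumed to carry 'pos' node attributes as in the earlier sketches.

import math
import networkx as nx

# Placeholder reward magnitudes for illustration only (not the values used in the paper).
R_PATH, R_OFF_PATH, R_COVER, R_DUP, R_STEP, R_ARRIVE = 0.3, -0.3, -0.1, -1.0, -0.1, 1.0

def euclid(graph, a, b):
    return math.dist(graph.nodes[a]["pos"], graph.nodes[b]["pos"])

def step_reward(graph, c_point, next_point, dest, n_agv_points, visit_counts):
    """Sum of the path/distance, cover, duplication, step, and destination rewards for one jump."""
    reward = R_STEP                                   # fixed penalty for every jump
    shortest = nx.shortest_path(graph, c_point, dest)
    if len(shortest) > 1 and next_point == shortest[1]:
        reward += R_PATH                              # next point lies on the shortest path
    elif euclid(graph, next_point, dest) > euclid(graph, c_point, dest) and next_point not in shortest:
        reward += R_OFF_PATH                          # moving farther away and off the path
    # Cover reward: penalize every N-AGV found within two hops of the H-AGV's new position.
    two_hop = set(graph.neighbors(next_point))
    for n in list(two_hop):
        two_hop |= set(graph.neighbors(n))
    reward += R_COVER * sum(1 for p in n_agv_points if p in two_hop)
    # Duplication reward: punish visiting the same point a third time.
    if visit_counts.get(next_point, 0) + 1 >= 3:
        reward += R_DUP
    if next_point == dest:
        reward += R_ARRIVE                            # destination reward
    return reward

def is_done(next_point, dest, steps_taken, on_map, max_steps=200):
    """Termination: destination reached, step budget exceeded, or the H-AGV left the map."""
    return next_point == dest or steps_taken >= max_steps or not on_map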
6 Results of the experiment
6.1 DQN training results
To establish clear terminology for the study, an AGV movement is referred to as a "step" (i.e., when an AGV jumps from one point to another, that is one step), whereas each iteration during the training process is termed a "time".
Fig. 7 illustrates the changes in the loss of the policy network.
As observed from the graph, there is an initial fluctuation in the
loss values. However, after approximately 380,000 training
iterations, the loss gradually approaches zero. This trend indicates
a stabilization and improvement in the policy network’s
performance as the training progresses. Fig. 8 represents the
number of steps required for the H-AGV to travel from the
starting point to the destination in each episode. The graph shows
that, as the number of training iterations increases, the required
number of steps gradually decreases. This trend indicates that the
H-AGV is becoming more and more efficient, able to reach its
destination faster as the training progresses. Fig. 9 depicts the
curve of changes in average rewards. The average reward value is
calculated as the ratio of the total reward obtained at the end of
each episode to the total number of steps taken by the H-AGV. It
is evident from the graph that the average reward curve gradually
increases as training progresses. Correspondingly, Fig. 8 indicates
a downward trend in the total number of steps required by the H-
AGV in each episode. This implies that, following training, the
total reward obtained in each episode increases, signifying
enhanced efficiency and performance of the H-AGV.
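As a brief sketch, the two per-episode quantities plotted in Figs. 8 and 9 can be computed directly from the episode trace; the variable names below are illustrative only.

def episode_metrics(rewards_per_step):
    """rewards_per_step: list of rewards received at each H-AGV jump."""
    total_steps = len(rewards_per_step)                    # quantity plotted in Fig. 8
    average_reward = sum(rewards_per_step) / total_steps   # quantity plotted in Fig. 9
    return total_steps, average_reward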
6.2 Simulation results
The entire process is simulated using a Python-based platform, as
illustrated in Fig. 10. In this simulation, the blue star represents the
H-AGV, and its four most recent points are connected by green lines
and marked on the map. The black dots represent the N-AGVs,
and the green solid square indicates the destination of the H-
AGV. On the right side of the simulation, various parameters,
including the rewards during operation, are monitored in real-
time.
During experimentation, the system is first initialized by setting
up 10 N-AGVs with randomly generated starting points. Then,
one H-AGV is set up with randomly determined starting and
destination points. The simulation runs, tracking the number of
steps the H-AGV takes. If the H-AGV reaches its destination in
no more than 200 steps, the total number of steps taken, the total
reward received at the end, and the total number of overlaps are
recorded. If the H-AGV does not reach its destination within 200
steps, the simulation is forcibly terminated, and the total reward
and total number of overlaps at that point are recorded.
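The experimental protocol above corresponds to the following evaluation loop. This is only a sketch: YardEnv and the trained policy object are hypothetical placeholders for the environment and the DQN agent described earlier, and the info dictionary fields are assumed names.

import random

MAX_STEPS = 200
results = []

for trial in range(1500):
    env = YardEnv(num_n_agv=10)                    # hypothetical environment wrapper
    state = env.reset(start=random.randrange(936),
                      destination=random.randrange(936))
    steps, total_reward, overlaps, arrived, done = 0, 0.0, 0, False, False
    while not done and steps < MAX_STEPS:
        action = policy.select(state)              # trained DQN policy (placeholder)
        state, reward, done, info = env.step(action)
        total_reward += reward
        overlaps += info.get("overlap", 0)         # overlap with an N-AGV at this step
        arrived = arrived or info.get("arrived", False)
        steps += 1
    results.append({"reached": arrived, "steps": steps,
                    "reward": total_reward, "overlaps": overlaps})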
After conducting a total of 1,500 experiments, the H-AGV
successfully reached its destination within 200 steps on 1,237
occasions, accounting for 82.47% of all trials. Among these
successful instances, 63.21% had a total overlap count of zero. The
results indicate favorable outcomes from the experiment. We also
plot the step vs. total overlap number scatter diagrams as shown in
Fig. 11. The data used for the illustration is derived from 400
randomly sampled instances where the H-AGV successfully
reached its destination. The graph reveals that in nearly half of
these experiments, the H-AGV was able to reach the destination
with a total number of overlaps of zero. Moreover, it generally
reaches the destination within 100 steps. This observation
underscores the effectiveness of the routing strategy in efficiently
guiding the H-AGV to its destination while minimizing
interactions with other AGVs.
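Continuing the sketch above, the reported statistics can be aggregated from the recorded trials (the results list and its field names are the same illustrative placeholders).

reached = [r for r in results if r["reached"]]
success_rate = len(reached) / len(results)      # reported: 1,237/1,500 = 82.47%
zero_overlap_rate = sum(r["overlaps"] == 0 for r in reached) / len(reached)  # reported: 63.21%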
Fig. 7 Loss of policy network during training.
Fig. 8 Number of steps for each episode.
Fig. 9 Average reward for each episode.
[Fig. 10 screenshot: the monitor panel lists the reward components (r_t, r_dir, r_dis, r_path, r_arr, r_exc, r_timeout, r_total), the done flag and current action, the start and destination point IDs, the local topology vector, the previous action probabilities, the recent actions with their rewards, and the last five visited point IDs.]
Fig. 10 Python-based simulation platform.
6.3 Comparison
To validate the effectiveness of the experiment, a control group
was established. The initialization of the control group is identical
to that of the experimental group previously mentioned. However,
in the control group, the H-AGV navigates solely based on the
shortest path to the destination calculated using the Networkx
library. The total number of overlaps in this scenario is recorded.
Since the control-group H-AGV follows the shortest path and therefore cannot exceed 200 steps in its operation, 400 randomly conducted control-group runs are compared against the 400 experimental-group results mentioned above. The results of the control group are shown in Fig. 12, while the results of both the experimental group and the control group are shown in Fig. 13.
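Since the control group routes the H-AGV purely along the shortest path, its routing step reduces to a single graph query. A minimal sketch, assuming the yard has already been abstracted into an undirected NetworkX graph G whose nodes are the 936 points:

import networkx as nx

def control_route(G, start_id, destination_id):
    """Shortest path for the control-group H-AGV (unweighted edges assumed)."""
    return nx.shortest_path(G, source=start_id, target=destination_id)

The overlaps for the control group are then counted by stepping along this fixed path while the N-AGVs follow their predicted trajectories.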
It is important to note a key consideration: Since the starting
points and destinations of the H-AGV are randomly generated in
each experiment, they may be relatively close or far apart.
Therefore, for a more accurate assessment of the experiment’s
effectiveness, we utilize the ratio of the total number of overlaps to
the total number of steps taken. The experimental group had an average step count of 41.81 and an average total number of overlaps of 2.45, whereas the control group had an average step count of 37.78 and an average total number of overlaps of 3.39. The average overlap-to-step ratio is 5.860% for the experimental group and 8.973% for the control group, so the control group's ratio is about 3.11 percentage points higher. This difference suggests that, through training, the H-AGV is able to effectively avoid N-AGVs.
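The reported ratios follow directly from these averages:

\frac{2.45}{41.81} \approx 5.86\%, \qquad \frac{3.39}{37.78} \approx 8.97\%, \qquad 8.97\% - 5.86\% \approx 3.11\ \text{percentage points}.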
7 Discussion
This study proposes an algorithm that combines LSTM-based trajectory prediction with reinforcement learning to accomplish route planning for AGVs carrying hazardous materials within a container yard. The reinforcement learning method introduced
has been successful in enabling AGVs carrying hazardous
materials to avoid those transporting regular materials. While this
approach can provide guidance for the trajectories of AGVs
carrying hazardous materials, there are still areas needing
improvement, particularly in model assumptions and network
design. These aspects represent potential avenues for further
refinement to enhance the algorithm’s effectiveness and
applicability in real-world scenarios.
Regarding model assumptions, the abstraction of the real-world
map used in our study is relatively simplistic. Additionally, the
design decision that AGVs carrying hazardous materials are
unable to halt has certain drawbacks. In terms of algorithm
application, the LSTM neural network exhibits suboptimal
prediction performance, and the reinforcement learning approach
involves cumbersome and challenging parameter tuning.
Therefore, future research should focus on in-depth studies into
more accurate modeling of real-world container yards, and on
how to effectively design reward systems and network structures.
This would involve refining the map abstraction to better reflect
real-world complexities, exploring alternative neural network
models or architectures that could yield better predictive accuracy,
and devising more efficient and robust methods for parameter
tuning in reinforcement learning. These advancements could
significantly enhance the practicality and effectiveness of the
algorithm in real-world container yard management scenarios.
8 Conclusions
The method developed in this research, which employs LSTM to
construct the environment of background AGVs and DQN to
plan the route of AGVs carrying hazardous materials, has
successfully achieved avoidance of regular cargo AGVs in the
container yard area to a certain extent. Compared to the
straightforward approach of using the shortest path, the method
proposed in this study has improved avoidance efficiency by
3.11%. The method provides valuable insights for the operation
efficiency and safety of automated container terminals. Moreover,
it offers a novel perspective for path planning involving mutual
avoidance among autonomous AGVs within a certain area. This
approach could be particularly beneficial in settings where safety
and efficiency are paramount, requiring dynamic real-time
decision-making for AGV navigation. It should also be noted that, because DQN no longer relies on a traditional Q-value table, it can handle reinforcement learning tasks with larger and more complex state and action spaces, and thus has a wide range of applications in related motion planning problems.
Fig. 11 Step vs. total number of overlaps for experimental group.
Fig. 12 Step vs. total number of overlaps for control group.
Fig. 13 Total number of overlaps for experimental group and control group.
Replication and data sharing
The data and codes that support the findings of this study are
available at https://cloud.tsinghua.edu.cn/d/fcd08e5336704c7fafe8.
Acknowledgements
We sincerely thank Meishan Port in Ningbo, Zhejiang for
providing the data essential for our research. Our gratitude also
extends to the Emerging Transportation Solutions (ETS) Lab at
Tsinghua University for their invaluable technical support and
assistance.
Declaration of competing interest
The authors have no competing interests to declare that are
relevant to the content of this article.
Fei Li is pursuing the Ph.D. degree in engineering at Tsinghua University. He is a senior engineer. His research interests include cloud computing, big data, AI, digital twins, and their technical application in transportation, logistics, and equipment manufacturing. He has led and participated in the Ningbo Meishan Port project as chief solution architect.
Junchi Cheng is now pursuing the M.S. degree in Electronic Engineering at Dalian Maritime University, China. He is interested in machine learning and is studying at the Emerging Transportation Solutions Lab in the School of Vehicle and Mobility at Tsinghua University.
Zhiqi Mao received the B.S. degree in Electronic Information Science and Technology from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2020, and the M.S. degree from the Department of Automation, Tsinghua University, Beijing, China, in 2023. His current research interests include reinforcement learning, autonomous driving, and operations research.
Yuhao Wang received the B.S. degree in Traffic Engineering from Southeast University, China, in 2018, and the M.S. degree in Transport from Imperial College London in 2019. He obtained the Ph.D. degree from The Hong Kong Polytechnic University in 2023, where he conducted research on intelligent transport systems and smart pavements. He is currently a postdoctoral fellow at the Emerging Transportation Solutions Lab in the School of Vehicle and Mobility at Tsinghua University.
Pingfa Feng is now a Professor in the Department of Mechanical Engineering at Tsinghua University. His research interests include intelligent manufacturing, ultrasonic machining, high-speed cutting, and performance analysis and optimization of machine tools.