Academic Editor: Pedro Couto
Received: 16 December 2024 / Revised: 21 January 2025 / Accepted: 28 January 2025 / Published: 30 January 2025
Citation: Chen, Y.; Lam, C.T.; Pau, G.; Ke, W. From Virtual to Reality: A Deep Reinforcement Learning Solution to Implement Autonomous Driving with 3D-LiDAR. Appl. Sci. 2025, 15, 1423. https://doi.org/10.3390/app15031423
Copyright: © 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Article
From Virtual to Reality: A Deep Reinforcement Learning
Solution to Implement Autonomous Driving with 3D-LiDAR
Yuhan Chen 1,2, Chan Tong Lam 1, Giovanni Pau 1,2,3,4,* and Wei Ke 1
1 Faculty of Applied Sciences, Macao Polytechnic University, Macao SAR 999078, China; yuhan.chen@mpu.edu.mo (Y.C.); ctlam@mpu.edu.mo (C.T.L.); wke@mpu.edu.mo (W.K.)
2 Department of Computer Science and Engineering—DISI, University of Bologna, 47521 Cesena, Italy
3 Autonomous Robotics Research Center, Technology Innovation Institute (TII), Abu Dhabi P.O. Box 9639, United Arab Emirates
4 Department of Computer Science, University of California, Los Angeles, CA 90095, USA
* Correspondence: giovanni.pau@unibo.it
Abstract: Autonomous driving technology faces significant challenges in processing com-
plex environmental data and making real-time decisions. Traditional supervised learning
approaches heavily rely on extensive data labeling, which incurs substantial costs. This
study presents a complete implementation framework combining Deep Deterministic Pol-
icy Gradient (DDPG) reinforcement learning with 3D-LiDAR perception techniques for
practical application in autonomous driving. DDPG meets the continuous action space re-
quirements of driving, and the point cloud processing module uses a traditional algorithm
combined with attention mechanisms to provide high awareness of the environment. The
solution is first validated in a simulation environment and then successfully migrated to a
real environment based on a 1/10-scale F1tenth experimental vehicle. The experimental
results show that the method proposed in this study is able to complete the autonomous
driving task in the real environment, providing a feasible technical path for the engineering
application of advanced sensor technology combined with complex learning algorithms in
the field of autonomous driving.
Keywords: autonomous driving; deep reinforcement learning; Deep Deterministic Policy
Gradient (DDPG); 3D-LiDAR; point cloud; attention mechanism
1. Introduction
Autonomous driving technology represents a transformative advancement in trans-
portation, significantly improving traffic safety by reducing human driving errors while
enhancing overall transportation efficiency. As a cutting-edge technology at the forefront
of artificial intelligence applications, it continues to have profound impacts on society’s
economy, environmental sustainability, and quality of life. Studies have shown that au-
tonomous vehicles could potentially reduce carbon emissions through optimized routing
and more efficient driving patterns [1] while also contributing to economic benefits by
reducing traffic congestion and accidents [2]. Self-driving vehicles need to demonstrate
remarkable capabilities in making real-time decisions within dynamic and complex en-
vironments, operating safely and independently without human intervention in various
weather conditions and traffic scenarios [3].
Supervised learning, traditionally serving as a fundamental autonomous driving
solution, trains vehicles to make decisions through carefully curated datasets and corre-
sponding labels. However, this conventional approach often reveals its limitations when
facing diverse and unpredictable road scenarios, struggling to handle uncertainties that
were not present in the training data. Moreover, supervised learning’s dependency on ex-
tensive labeled data to improve model performance demands substantial human resources
and time [4]. Deep reinforcement learning addresses these challenges, drawing inspiration
from human learning patterns. Agents learn and make decisions through environmental
interaction and feedback without explicit human supervision for data labeling [5]. This
unsupervised learning approach can accomplish autonomous driving tasks while reducing
technology costs. This study implements autonomous driving tasks using reinforcement
learning with 3D-LiDAR sensors in both simulated and real environments.
While Deep Q-Network (DQN) serves as a foundational reinforcement learning al-
gorithm with proven success in many applications, it exhibits inherent limitations when
handling continuous action spaces, which are crucial for fine vehicle control in real-world
scenarios [6]. Vehicles typically require precise, continuous adjustments in acceleration,
braking, and steering to ensure both safety and passenger comfort. DQN is designed for
discrete action selection, making it difficult for it to perform well in such situations. To
address these issues, this study applies the Deep Deterministic Policy Gradient (DDPG)
algorithm, which directly maps sensory inputs to continuous action outputs, making it
particularly well suited for continuous control problems [7]. DDPG effectively combines
deep reinforcement learning benefits with an actor–critic architecture, learning optimal
driving policies through systematic trial and error. The actor network determines optimal
actions based on current states, while the critic network evaluates action quality and pro-
vides detailed feedback to refine the actor’s behavior. This combination enhances learning
efficiency, ensuring vehicles can adapt to various driving scenarios while maintaining
consistent performance.
Compared to traditional RGB camera sensors that rely primarily on vision, 3D-LiDAR
technology provides significantly more precise depth measurements, clearly detecting
object contours and positions regardless of challenging lighting conditions, including night-
time operations, strong direct sunlight, or heavily shadowed areas. It constructs detailed
3D environmental maps by emitting laser pulses and measuring their return times, gener-
ating high-precision distance information with millimeter-level accuracy. This advanced
capability helps systems accurately identify and locate surrounding vehicles, pedestrians,
and obstacles in real time [8]. Three-dimensional LiDAR also demonstrates exceptional
performance in detection range, typically identifying objects hundreds of meters away,
which is a crucial capability for high-speed vehicles operating on highways or in complex urban
environments. In autonomous driving scenarios, the early detection of distant obstacles
and traffic conditions helps vehicles better adjust speed and route, significantly improving
overall safety margins. Therefore, this study develops an autonomous driving solution
adapted for 3D-LiDAR sensors. Three-dimensional LiDAR outputs point cloud data, a
three-dimensional format fundamentally different from conventional visual images. In this
study, an algorithm based on PointNet combined with an attention mechanism is used in
the processing stage of the point cloud data to perceive the environment and feed refined
environmental data into the reinforcement learning framework.
The scientific novelty and contributions of this research in autonomous driving tech-
nology manifest in several significant aspects:
1. This study demonstrates that DDPG reinforcement learning networks can be effectively trained and applied to autonomous driving technology, achieving unsupervised learning tasks with 3D-LiDAR sensors.
2. In the processing stage of point cloud data, this study adopts the classical PointNet-based algorithm to directly process the point cloud data instead of voxelization and combines it with self-attention to optimize the data, which are input into the reinforcement learning network to cope with the challenge of 3D-LiDAR data in a realistic environment.
3. After the implementation of unsupervised autonomous driving in a simulator, this study successfully built and deployed a realistic 1/10-scale experimental vehicle, named F1tenth, to complete autonomous driving tasks in real environments, advancing real-road condition applications and bridging the gap between simulation and reality.
The remainder of this article is structured as follows. Section 2 provides a review
of related research on applying reinforcement learning for autonomous driving and the
implementation of 3D-LiDAR in autonomous driving. Section 3 presents the detailed
methodology of the DDPG reinforcement learning network implementation and point
cloud data processing solutions, including architectural decisions and optimization strate-
gies. Section 4 describes the experimental processes and results in both simulated and real
environments. Section 5 summarizes the research and proposes future directions for related
work, including potential improvements and extensions. This study aims to advance the
practical application of unsupervised learning solutions in autonomous driving technology,
contributing to the goal of safer and more efficient transportation systems.
2. Related Work
Reinforcement learning has many applications in the field of autonomous driving.
These applications focus on different aspects. In terms of training models, the study by
Liu and Diao in 2024 dealt with complex traffic road conditions such as intersections by
extracting drivers' style features [9]. However, the study covers only a limited set of road
conditions and is not generalizable. The Deep Recurrent Q-learning from Demonstration
(DRQfD) algorithm proposed by Yuan et al. also carries out efficient training by learning to predict
the future state of surrounding vehicles [10]. This study was demonstrated in a simulated
environment and did not discuss in depth the effectiveness of implementation in a real
environment. Wang et al. proposed deterministic experience tracing (DET), which can
strengthen the experience replay buffer to provide a more stable control performance [11].
However, the sample of scenarios in this study was small and not diverse enough. Bosello
et al. focused on holistic reinforcement learning strategies in their study in 2022, where they
implemented an effective application of DQN on racing cars [12]. The study used 2D-LiDAR
instead of more advanced high-dimensional LiDAR sensors. Li, Z. et al. in 2023 effectively
improved vehicle following and obstacle avoidance through reinforcement learning to
determine external obstacles [13]. Li, W. et al. also contributed to car-following [14].
The difference is that they used DDPG as a reinforcement learning algorithm to propose
a car-following decision-making strategy. However, these two studies rely on specific
scenarios and may not be able to perform well in untrained scenarios. Yang et al. in 2023
proposed a DDPG-based reinforcement learning algorithm to improve decision-making
on the highway [15]. Bellotti et al. also focused on the highway as a road environment
to increase the predictive ability of the agent by introducing an attention mechanism [16].
The road environment of these two studies is limited to highways and lacks generalization.
There are also many studies on the hierarchical reinforcement learning approach. Al-
Sharman et al.'s study focuses on left-turn strategies at unsignalized intersections [17]. The
hierarchical framework proposed by Wang et al. improves learning efficiency and decision-
making in complex environments [18]. The novel hierarchical reinforcement learning
method proposed by Mao et al. can train sub-networks without designing artificial rewards.
These studies focus on the algorithmic aspects of the agents, and the part involving the
sensor data of the agents is relatively lacking [19]. Pérez-Gil et al. in their 2022 research used
a camera as the sensor and utilized deep reinforcement learning in the simulator to complete
autonomous driving navigation tasks [20]. The environmental information was input in
RGB data format without employing a higher-precision LiDAR sensor. The work primarily
focused on using reinforcement learning to complete navigation tasks between two given
points. In 2024, Petryshyn et al. used two different sensor configurations, a single camera
combined with 1D-LiDAR and a stereo camera, to achieve multiple autonomous driving
schemes on a track by different deep reinforcement learning algorithms [21]. Previous
studies had different focuses in applying deep reinforcement learning to autonomous
driving. This study fills the gap by implementing the autonomous driving task using the
DDPG algorithm through a single 3D-LiDAR as the high-precision sensor and effectively
applies the system in both simulated and real-world environments.
Autonomous driving algorithms that can work with 3D-LiDAR are more advantageous
at the practical application level due to the high performance of 3D-LiDAR. Wang et al. in
2020 used a reinforcement learning scheme to improve the performance of 3D-LiDAR in
vehicle detection and attitude estimation [22]. Lee and Park's research in 2020 enhanced
the capabilities of perception modules in autonomous driving tasks by segmenting and
detecting drivable areas and vehicles through point cloud data [23]. These two studies
were on specific modules rather than overall decision-making capability. Chen et al.’s
study in 2022 utilized DQN in conjunction with 3D-LiDAR for an autonomous driving
task [24]. However, the application was only at the simulator stage and not in the real-
world environment. Zhai et al. used 3D-LiDAR as the sensor to achieve end-to-end
autonomous navigation [25]. Sánchez et al.'s study in 2023 also improved the performance
of autonomous navigation using 3D-LiDAR [26]. Both studies demonstrate the benefits of
3D-LiDAR for environmental awareness.
3. Methodology
This section describes the technical approach that this article uses to enable a deep
reinforcement learning solution to implement autonomous driving with 3D-LiDAR in both
virtual and real environments. The overall framework used in this article to implement the
autonomous driving task is the DDPG network. Section 3.1 will explain in detail how the
reinforcement learning network of this research is built using the DDPG algorithm. A 3D-
LiDAR sensor passes the environmental information into the overall network architecture
of this research in the point cloud data format. Section 3.2 will elaborate on how to process
the input sensor data from 3D-LiDAR to pass into the DDPG network in the point cloud
data processing stage.
3.1. Reinforcement Learning Network
This study uses reinforcement learning as an overall framework for implementing
autonomous driving. It is a machine learning scheme that learns in interaction, similar to
the process of human learning knowledge. The agent continuously learns knowledge based
on the rewards or penalties it receives in its interaction with the environment in order to
explore the correct behaviors adapted to the environment on its own. In this research, the
agent is the experimental vehicle that explores unfamiliar environments through trial and
error based on the established reward–penalty mechanism. Through continuous feedback,
it self-corrects its behavior and ultimately accomplishes the intended autonomous driving
tasks in the environment. Reinforcement learning is based on the Markov Decision Process
(MDP) [27]. This process mainly contains the following elements: S, A, P, R, γ. "S" is the
state of the vehicle in the environment. "A" represents all the actions of the vehicle. "P"
represents the probability of the vehicle transitioning from state s at time t to state s' at
time t + 1 after executing action a. A state corresponds to an action or the probability of an
action. And with an action, the next state is determined. This means that each state can
be described by a definite value. From this, it can be determined whether a state is good
or bad. For example, if the self-driving car to the left hits an obstacle, then the state to the
left is a bad state. Thus, the goodness of a state is equivalent to the expectation of a future
return, and “R” represents the return that the state at a given time will have. “R” (reward)
is a real value that represents reward or punishment. A positive number is returned when
the vehicle performs the expected action, while a negative number is returned when the
vehicle performs the wrong action. "γ" is the discount factor that determines the importance
of future rewards relative to immediate rewards.
The purpose of reinforcement learning is to maximize long-term future rewards, which
can also be expressed as finding the largest return U. “U” represents the sum of rewards
accumulated in all states after executing a set of actions. However, directly summing rewards
over an infinite time sequence can produce unbounded values and infinite loops between states. Therefore,
the concept of the discount factor γ, with a value range of zero to one, is introduced into this function,
and the reward returned from the subsequent state is multiplied by the discount coefficient.
This means that the current reward is more important than the reward of future feedback.
The definition formula is as follows [28]:
$$U(s_0 s_1 s_2 \cdots) = \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \le \sum_{t=0}^{\infty} \gamma^{t} R_{\max} = \frac{R_{\max}}{1-\gamma}, \quad 0 \le \gamma < 1 \qquad (1)$$
The MDP is a cyclic process in which the agent takes actions to change its state in order
to obtain rewards and interact with the environment. The strategy of the MDP depends
entirely on the current state, which is a reflection of its Markovian nature.
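To make Equation (1) concrete, the following minimal Python sketch (illustrative only, not taken from the authors' code) computes the discounted return of a finite reward sequence and compares it with the geometric-series bound R_max / (1 − γ); the constant per-step reward of 0.01 is borrowed from the reward function defined later in Section 4.

# Illustrative sketch of Equation (1): discounted return and its upper bound.
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over a finite trajectory (Equation (1))."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0.01] * 200                  # e.g., 200 steps of a constant reward
print(discounted_return(rewards))       # finite, discounted sum
print(0.01 / (1 - 0.99))                # bound R_max / (1 - gamma) = 1.0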
Based on the above MDP, this research implements a DDPG network to achieve
autonomous driving tasks. This choice is motivated by the continuous nature of steering
and throttle actions during vehicle operation. DDPG combines the advantages of deep
learning and Q-learning, making it particularly suitable for handling continuous action
spaces. DDPG, proposed by Lillicrap et al. in 2015, is an actor–critic algorithm that
operates through the collaboration of two networks [7]. The actor network determines the
optimal policy, while the critic network evaluates the value of the current policy. Its key
characteristics include off-policy training and the use of deep neural networks to handle
complex, high-dimensional state and action spaces.
The construction of the actor network is first addressed, which is theoretically
grounded in the Deterministic Policy Gradient (DPG) algorithm proposed by Silver et al.
in 2014 [29]. A stochastic policy is denoted as $a \sim \pi_\theta(\cdot \mid s)$, whereas if the policy is
deterministic, it can be denoted as $a = \mu_\theta(s)$.
The Deterministic Policy Gradient Theorem is:
$$\nabla_{\theta^{\mu}} J = \mathbb{E}_{s_t \sim \rho^{\beta}}\!\left[ \nabla_{a} Q(s,a \mid \theta^{Q})\big|_{s=s_t,\, a=\mu(s_t)} \cdot \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_t} \right] \qquad (2)$$
It has the same structure as the strategy gradient theorem. However, the deterministic
gradient theorem is based on the fact that the value function for each state and action
is already available. For a given state in the continuous action space, this research aims
to construct an actor network that outputs the action that maximizes the value function
through a Deterministic Policy Gradient:
$$\mu(s) = \arg\max_{a} Q(s,a) \qquad (3)$$
The actor network is the strategy $\mu$. Next, the research creates the critic network, which
is the value function Q, to satisfy the generalized actor–critic framework.
The Bellman optimality equation is [27]
$$Q^{*}(s,a) = \mathbb{E}_{s' \sim P}\!\left[ r(s,a) + \gamma \max_{a'} Q^{*}(s', a') \right] \qquad (4)$$
It states that the value of each state under the optimal policy must be equal to the
expected return of the optimal action in this state. The Mean Square Bellman Error (MSBE)
was used to measure the difference between the value of $Q_\phi$ fitted by the neural network
and the value calculated by Bellman's equation:
$$L(\phi, \mathcal{D}) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}\!\left[ \left( Q_\phi(s,a) - \left( r + \gamma (1-d) \max_{a'} Q_\phi(s', a') \right) \right)^{2} \right] \qquad (5)$$
When $s'$ is a terminal state, $d = 1$; otherwise, $d = 0$. The loss function shown in
Equation (5) is derived from the Bellman equation and implemented in Deep Q-Learning,
as introduced by Mnih et al. in their work on Atari learning [6,30].
The actor network takes the environmental state s as input and outputs the action
a that maximizes the value function Q for that state, thereby forming the deterministic
policy µ. This network performs gradient ascent directly on the value function Q, aiming
to find the action a that yields the highest Q-value. The Q-value used here is obtained
from the critic network’s output in the previous iteration. The critic network takes both the
environmental state s and the action a as inputs, where the action is generated by the actor network,
and outputs the estimated Q-value. This network is trained using target values computed
through the Bellman optimality equation.
To stabilize training, DDPG introduces the target network that mirrors the structure of
the main networks, which include actor and critic networks. This target network updates
more slowly than the main network to prevent training instability caused by frequent
updates. The target network parameters are initially copied from the main network. This
research employs soft updates, where target network parameters gradually move toward
the main network parameters with a small step size, allowing for incremental updates
during each episode. The target actor network generates the next actions, while the target
critic network estimates target Q-values used to update the main network.
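A compact PyTorch-style sketch of one DDPG update step, as described above, is given below. It is an illustrative reconstruction rather than the authors' implementation: the actor, critic, target networks, optimizers, and the mini batch sampled from the replay buffer (introduced next) are assumed to be defined elsewhere, and the target in Equation (5) is specialized to the DDPG form in which the maximization over a' is replaced by the target actor's action.

import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.001):
    """One DDPG update step (illustrative sketch, hypothetical helper)."""
    s, a, r, s_next, d = batch  # tensors sampled from the replay buffer

    # Critic target: r + gamma * (1 - d) * Q'(s', mu'(s')), cf. Equation (5)
    with torch.no_grad():
        a_next = target_actor(s_next)
        y = r + gamma * (1.0 - d) * target_critic(s_next, a_next)

    # Critic loss: mean squared Bellman error
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor loss: gradient ascent on Q(s, mu(s)), i.e., descent on -Q
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft (Polyak) update of the target networks
    for target, main in ((target_actor, actor), (target_critic, critic)):
        for tp, p in zip(target.parameters(), main.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)

    return critic_loss.item(), actor_loss.item()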
In reinforcement learning, the replay buffer is used to store the experience data (states,
actions, rewards, etc.) of the agent’s interaction with the environment. It is another impor-
tant component of the DDPG algorithm. Training stability and efficiency are improved by
randomly sampling mini batches from this buffer, breaking temporal correlations in the
data. In this research, priority is given to samples with higher temporal difference (TD)
errors, focusing on more valuable learning experiences. The assumption of independent
and identically distributed samples is no longer valid when samples are generated by successive
explorations in the environment. Therefore, the replay buffer stores trajectories, saving the
$(s, a, r, s', d)$ tuple derived from each sampling interaction with the environment.
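The prioritized replay buffer described above can be sketched as follows. This is a simplified illustration (proportional prioritization by absolute TD error, without importance-sampling corrections), not the authors' exact data structure; the 10,000-transition capacity matches the setting reported in Section 4.

import random
from collections import deque

class PrioritizedReplayBuffer:
    """Illustrative replay buffer that samples transitions in proportion
    to their TD error (plus a small constant to keep probabilities nonzero)."""

    def __init__(self, capacity=10_000, eps=1e-5):
        self.buffer = deque(maxlen=capacity)      # stores (s, a, r, s_next, d)
        self.priorities = deque(maxlen=capacity)  # one priority per transition
        self.eps = eps

    def add(self, transition, td_error=1.0):
        self.buffer.append(transition)
        self.priorities.append(abs(td_error) + self.eps)

    def sample(self, batch_size):
        total = sum(self.priorities)
        probs = [p / total for p in self.priorities]
        idx = random.choices(range(len(self.buffer)), weights=probs, k=batch_size)
        return [self.buffer[i] for i in idx], idx

    def update_priorities(self, indices, td_errors):
        for i, e in zip(indices, td_errors):
            self.priorities[i] = abs(e) + self.eps

    def __len__(self):
        return len(self.buffer)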
Unlike the basic DQN algorithm, DDPG incorporates noise to encourage broader
action space exploration during training to improve learning effectiveness. Since DDPG is a
deterministic policy algorithm where the actor network outputs specific actions, exploration
would be limited without noise as the agent would consistently output identical actions.
To address this, DDPG adds noise to the actor’s output of the action, thus increasing the
exploration. This research implements the noise component by adding Gaussian noise
directly to the actor’s action output. As training proceeds, the noise is gradually reduced
through a decay mechanism so that the agent can gradually focus on using the learned
strategy. The decay mechanism is implemented by exponential decay of the standard
deviation of the noise.
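A minimal sketch of the decaying Gaussian exploration noise is shown below; the initial standard deviation, floor, and decay rate are illustrative placeholders rather than values reported in the paper.

import numpy as np

class DecayingGaussianNoise:
    """Gaussian exploration noise whose standard deviation decays exponentially
    per episode (illustrative parameter values)."""

    def __init__(self, action_dim, sigma_init=0.2, sigma_min=0.01, decay=0.995):
        self.action_dim = action_dim
        self.sigma = sigma_init
        self.sigma_min = sigma_min
        self.decay = decay

    def sample(self):
        return np.random.normal(0.0, self.sigma, size=self.action_dim)

    def decay_step(self):
        # Called once per episode: sigma_{k+1} = max(sigma_min, decay * sigma_k)
        self.sigma = max(self.sigma_min, self.sigma * self.decay)

# Usage per step: noisy_action = actor_output + noise.sample(), then clip to the valid action range.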
Based on the DDPG network constructed above, Figure 1 illustrates the workflow of
this research implementation.
Figure 1. DDPG workflow.
3.2. Point Cloud Processing Algorithm
As the first deep learning architecture to directly process 3D point cloud data, PointNet
has a simple and efficient architecture, an end-to-end feature learning capability, and
adaptability to irregular point cloud data [
31
]. However, while it performs well in global
feature extraction, it has limitations in capturing local and global dependencies within the
point cloud. PointNet processes each point independently and then obtains global features
through max pooling. This design is unable to adequately account for the relationships
between points when faced with complex 3D structures, especially in capturing long-range
dependencies and local details.
In future autonomous driving tasks, agents may encounter highly complex envi-
ronments and diverse obstacles; the proposed technological approach should enhance
scalability and generalizability to address these challenges. The system not only needs to
understand the overall structure of the surrounding environment but also needs to capture
the intricate details and relationships between objects at different distances to ensure that
the task can be performed safely and effectively. Therefore, data preprocessing algorithms
for sensor inputs must be capable of addressing this challenge. The attention mechanism
provides a flexible and effective way to improve this aspect [
32
]. By strengthening local
feature capture between point clouds and improving fine-grained information modeling,
the approach enhances the model’s understanding of global structures and complex de-
pendencies within point clouds. This improvement draws inspiration from the successful
application of Transformer architectures in natural language processing. To boost point
cloud data processing capabilities, the research introduces self-attention mechanisms to
enhance the model’s local feature capture, rendering it more accurate and robust when
processing irregular and complex three-dimensional point cloud scenarios.
The workflow of the point cloud processing stage used in this study is shown in
Figure 2, summarized in the numbered steps below, and illustrated by a code sketch that follows those steps.
Figure 2. The point cloud processing workflow.
1. Point cloud input
Three-dimensional LiDAR acquires the environment information and inputs the point
cloud. The point cloud data used in this study only take the 3D coordinates of each point
without other information such as color. The data format is N × 3, i.e., the 3D coordinates
(x, y, z) of N points.
2. Initial feature extraction
This module applies Multi-Layer Perceptron (MLP) to each point independently. It
preserves the original design of PointNet. The 3D point coordinates are embedded into the
high-dimensional feature space to provide a richer representation for subsequent feature
extraction. The difference is that the dimension of the PointNet output is 64, and this study
extends the dimensions to 128. It can provide more information for subsequent local and
global feature modeling. The main task of the original PointNet is point cloud classification
and simple semantic segmentation. Sixty-four-dimensional features are sufficient for
capturing local features. The design of PointNet focuses on global max-pooling, which
integrates point cloud features at the global feature level. However, this study introduces
multi-head self-attention mechanisms in the subsequent steps, which usually require high
input feature dimensions. A higher initial feature dimension allows the attention module
to establish point-to-point relationships more efficiently. In the feature extraction module,
multi-head attention usually requires high input feature dimensions in order to compute
the matrix multiplication of query, key, and value. If the initial feature dimensions are too
low, subsequent modules will not have enough information to compute the attention. The
output dimensions of the initial MLP are consistent with the local and global attention
modules, avoiding additional dimension expansion or contraction operations.
3. Local Feature Extraction (KNN and Local Attention)
In 2019, the DGCNN research proposed to dynamically construct a local neighborhood
graph by using the k-nearest neighbor (KNN) algorithm to extract local geometric features of
points on the graph structure [33]. Firstly, for each point, the set of KNN is calculated based
on the Euclidean distance of the point cloud to construct the KNN neighborhood, and the
obtained neighborhood point features are fed into the next local feature processing section.
Point Cloud Transformer (PCT) proposes a transformer-based local attention mech-
anism for local feature modeling of point cloud data [34]. In this study, the single-head
attention mechanism is used in the local feature extraction module, while the multi-head
attention mechanism is used only in the global feature extraction. This design reduces
the computational complexity. Local multi-head attention requires computing multiple
query–key pairs for the neighborhood of each point, which is more computationally inten-
sive than single-head attention. For large-scale point clouds or embedded scenarios, it is
more efficient to use single-head attention. The local module quickly extracts geometric
features through the single-head self-attention mechanism. For the local neighborhood of
each point, the input features are the feature matrices of the neighborhood points. A linear
projection is used to generate the query, key, and value. The similarity of query and key is
calculated by dot product and then normalized by softmax to obtain the attention weights
of the local neighborhood:
$$A_{\mathrm{local}} = \mathrm{softmax}\!\left( \frac{Q_{\mathrm{local}} \cdot K_{\mathrm{local}}^{T}}{\sqrt{d_k}} \right) \qquad (6)$$
$A_{\mathrm{local}}$ denotes the local attention weights. $Q_{\mathrm{local}}$ denotes the local query vectors, a set of
query vectors extracted from a local region. $K_{\mathrm{local}}^{T}$ is the transpose of the local key vectors.
$\sqrt{d_k}$ is the scaling factor, which is used to prevent the dot product values from becoming
too large. The dot product of the query and key vectors computes a similarity score between
each query vector and each key vector within the local region; a larger value indicates that
the two vectors are more related. $\mathrm{softmax}$ is the normalization function, which transforms the
raw attention scores into a valid probability distribution. The result is a weight matrix that
represents the importance of each point in the neighborhood to the other points. Using the
attention weights $A_{\mathrm{local}}$ to weight the features V of the neighborhood points, the aggregated
local features can be obtained.
4. Global feature extraction (multi-head attention at full point cloud scope)
Point-to-point relationships at the global scale are more complex, and single-head
attention is not sufficient to express the dependencies between all points. Global feature
extraction needs to model a more complex context, and using multi-head attention can
better serve the global task and provide final global features. So, this module uses the
multi-head attention mechanism. The input to this stage comes from the local features
extracted in the previous stage. Three matrices, Q, K, and V, projected through local features
are used for the multi-head self-attention mechanism. The dot product of Q and K is used
and normalized to obtain the attention weights for the full pair of points. Finally, the
features are weighted and summed to output global features.
5. Multi-scale feature fusion
The Point Transformer proposes a weighted fusion strategy of local and global features,
which is used to enhance the representation of point cloud features [35]. According to the
strategy, this study applies a weighted fusion of local and global features in this step to generate
the final point cloud feature representation.
6. Global pooling
Global max pooling is the core module of PointNet, which is used to aggregate the
point features into the global features of point clouds. The previous module outputs the
features of each point, and each point has a set of feature vectors. It reflects the semantic
representation of each point in local and global contexts, but these features are specific to
the point, not the whole point cloud. It can be aggregated into a single global feature after
pooling. It is a high-level semantic representation of the entire point cloud, which represents
the shape and structure of the integral point cloud and is no longer a point-by-point fine-
grained feature. The inputs to the point cloud are permutation-invariant, and global
pooling makes the model insensitive to the order in which the points are arranged. With
the max pooling operation, the generated global features are always consistent regardless
of the input order of the points. This eliminates the effect of point alignment. Max pooling
takes the maximum of all points for each feature dimension. This ensures robustness to
noise and sparse data while effectively capturing the most significant geometric features in
the point cloud.
7. Feature Dimension Expansion
This step extends the feature dimension from 128 to 1024 using a linear layer, which
improves global feature representation. Higher dimensionality allows for better representa-
tion of the complex geometric and semantic information of the point cloud. In complex
tasks, point clouds may contain complex shapes and structures, and the higher the feature
dimensions, the better the model can express these complex structures. More information
capacity could capture more global patterns and semantics. The original PointNet algo-
rithm has a fixed global feature dimension of 1024. The design has been experimentally
validated and performs well in the task. Furthermore, this study uses the multi-head
self-attention mechanism and multi-scale feature fusion, which could increase the richness
of features. If the final global feature dimension is too low, it may limit the effectiveness of
these modules. Extending to 1024 dimensions will allow the full potential of these modules
to be utilized.
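To make the seven steps above concrete, the following PyTorch sketch assembles them into a single feature extractor, as referenced at the beginning of this subsection. It is an illustrative reconstruction under stated assumptions: the layer widths (3 → 128 → 1024) follow the text, while the number of neighbors k, the number of attention heads, and the fusion weight alpha are placeholders rather than reported values.

import torch
import torch.nn as nn

class AttentionPointFeatureExtractor(nn.Module):
    """Illustrative sketch of the described pipeline: per-point MLP (3 -> 128),
    KNN-based single-head local attention, multi-head global self-attention,
    weighted local/global fusion, max pooling, and expansion to 1024 dims."""

    def __init__(self, k=16, dim=128, heads=4, alpha=0.5):
        super().__init__()
        self.k, self.alpha = k, alpha
        self.embed = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))
        self.q = nn.Linear(dim, dim)       # local single-head projections
        self.kproj = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.expand = nn.Linear(dim, 1024)

    def forward(self, xyz):                     # xyz: (B, N, 3)
        f = self.embed(xyz)                     # (B, N, 128) per-point features

        # KNN neighborhoods from Euclidean distances in coordinate space
        dist = torch.cdist(xyz, xyz)            # (B, N, N)
        knn_idx = dist.topk(self.k, largest=False).indices            # (B, N, k)
        neigh = torch.gather(
            f.unsqueeze(1).expand(-1, f.size(1), -1, -1), 2,
            knn_idx.unsqueeze(-1).expand(-1, -1, -1, f.size(-1)))     # (B, N, k, 128)

        # Single-head local attention over each neighborhood (Equation (6))
        q = self.q(f).unsqueeze(2)                                    # (B, N, 1, 128)
        k = self.kproj(neigh)                                         # (B, N, k, 128)
        v = self.v(neigh)
        w = torch.softmax((q * k).sum(-1) / f.size(-1) ** 0.5, dim=-1)
        local = (w.unsqueeze(-1) * v).sum(2)                          # (B, N, 128)

        # Multi-head self-attention over the full point set
        global_feat, _ = self.global_attn(local, local, local)        # (B, N, 128)

        # Weighted multi-scale fusion, global max pooling, dimension expansion
        fused = self.alpha * local + (1.0 - self.alpha) * global_feat
        pooled = fused.max(dim=1).values                              # (B, 128)
        return self.expand(pooled)                                    # (B, 1024)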
The core idea of PointNet is to consider each point as an independent input, perform
feature extraction on the point cloud using a multilayer perceptron, and then apply max
pooling techniques to obtain global features. PointNet does not explicitly capture local
relationships or spatial dependencies between points but rather obtains an overall feature
representation through global pooling. The local feature extraction module introduced
in this study is a direct improvement over PointNet: whereas PointNet treats each point
independently, this module captures the local geometric features of the point cloud through
neighborhood construction and embedding. Such neighborhood aggregation introduces local
structural information in the feature extraction process.
In PointNet, global features are extracted from the features of all points by max pooling.
In this model, the attention-based network mechanism enhances the ability of the model to
capture global features. Through the self-attention mechanism, the model can dynamically
focus on the relationships between points and assign higher weights to important features,
thus effectively creating long-range dependencies in the point cloud data.
In this study, after the extraction of global features, the final output features are
obtained through global pooling. This step maintains the design concept of PointNet,
which is to use global features for decision-making output.
These improvements bring benefits to the model. First, by adding KNN and attention-
based networks, the model is able to better understand the relationships between points,
which is especially useful when passing through complex scenarios such as crowded
intersections. In addition, the attention mechanism allows the model to dynamically adjust
its focus to highlight the most relevant parts of the point cloud. This facilitates autonomous
driving, where the importance of features changes with the environment. Even though
the tasks and scenarios performed in this research are not complex, improved point cloud
processing algorithms are used to increase the extensibility of the techniques developed in
this research. It supports the effective implementation of the architecture in more complex
road situations in the future.
4. Experiments and Results
This study explores how 3D-LiDAR can be used as the sensor to implement an au-
tonomous driving task with reinforcement learning in both virtual and real environments.
Section 4.1 presents the implementation of an autonomous driving task in the simulator. Section 4.2
describes the completion of the task in a real environment.
4.1. Implementation in Carla Simulator
For an iterative strategy algorithm, the research needs to easily measure the perfor-
mance of the algorithm. Training cars in real environments can be damaging and costly. It
is necessary to ensure the algorithm can operate safely and correctly in common scenarios
before deploying it in real-world environments in the future. Additionally, in practical
applications, it is not feasible to frequently update algorithms in real environments, so this
process needs to be first deployed and tested in simulators. The debugging and perfor-
mance evaluation of autonomous driving algorithms should be conducted in highly realistic
simulation environments before real environments. It is important to have an autonomous
driving simulator that can reliably reflect real scenarios. In this study, Carla is chosen as
the simulation environment for autonomous driving development and testing [36].
Carla has a 3D city scene that supports perception, planning, and control. Figure 3
shows the top view of the city map in the Carla simulator used in this experiment. It runs
on Unreal Engine 4 with a server–client architecture. In addition, it is open source and can
be installed on Linux and Windows systems. It allows the creation of environments, agents,
and required sensors via Python API, which meets the requirements of this study. In terms
of experimental configuration, this study uses a Tesla V100 as the GPU and runs in a Linux
environment. The system code is mainly based on the Python language.
Figure 3. Top view of the experiment map.
Carla can simulate common scenarios in the environment such as lane conditions,
road conditions, obstacle distribution, and weather. Three-dimensional LiDAR can also
be simulated in Carla, as shown in Figure 4. To maintain consistency with sensor config-
urations employed in subsequent real-world applications, in this study, the 3D-LiDAR
parameters were set in the simulator as shown in Table 1.
The 3D-LiDAR is configured with 16 channels, which is consistent with the model
in the real environment of this research. The set rotation frequency is 10 revolutions per
second, which can be adjusted to a higher frequency in the future when capturing more
rapidly changing environments, such as high-speed scenes. The number of points per
second can be calculated by multiplying the number of channels, points per revolution, and
rotation frequency. It determines the resolution of the point cloud image and the density of
the point cloud. To reduce the computational burden, the detection distance is reduced to
50 m, even though the default maximum detection distance is 100 m. The upper fov and
lower fov together define the vertical field of view range. The difference between these
must match the number of channels to ensure accurate simulation. The horizontal fov and
vertical fov range define the scanning range. The point cloud data are generated every
0.1 s. The simulated environment works in parallel with the RL environment, and Carla
independently updates and transmits data into the RL network at this fixed interval. As a
result, the agent obtains the latest environment data in real time and outputs actions, while
the simulator executes the latest actions output by the agent.
Figure 4. Three-dimensional LiDAR view in Carla.
Table 1. The parameter settings of 3D-LiDAR.
Channels: 16
rotation_frequency (r/s): 10.0
Range (m): 50.0
upper_fov (°): 15.0
lower_fov (°): −15.0
horizontal_fov (°): 360.0
sensor_tick (s): 0.1
The experiment establishes a fundamental task, where the vehicle drives on urban
roads while avoiding collisions. The objective is to maximize cumulative rewards by
maintaining street driving for as long as possible. The experimental implementation
follows these steps:
1. Network Initialization: The first step is to define the structure of the Actor and
Critic networks, including the input, hidden, and output layers. The learning rate
of the actor is 0.001. The learning rate of the critic is 0.0001. The discount factor
γ is 0.99. The range of initial weights is [−0.003, 0.003]. The weights and biases of the
network are initialized in this process. The weight matrix is initialized with random
values to control the effect of the input data on each neuron, with a core role as a
linear transformation. Bias vectors are initialized to zero, adjusting neuron outputs
to enhance model expressiveness. During subsequent training, the optimization
algorithm minimizes the loss function by calculating the gradient and constantly
adjusting both. Target networks for training stability are initialized with weights
identical to the main network. The replay buffer is initialized empty with a 10,000-
capacity storage. Each experience corresponds to a state transition tuple, containing
the current state, action, reward, next state, and terminal state Boolean.
2. Agent Generation: To prevent the model's strategy from stagnating in a fixed pattern
and to enhance the generalization ability of the agent, the vehicle is spawned at a
random map location for each episode. This also simulates the randomness of the real
world and improves the agent's exploration efficiency.
3. Environmental Information from the Sensor: The 3D-LiDAR inputs point cloud data containing three-dimensional coordinates.
4. Point Cloud Processing: Global features are extracted using the study's point cloud processing algorithm.
5. Actor-network Input: The 1024-dimensional features pass through two fully connected
layers with 64 neurons each, activated by ReLU, outputting action vectors a1 and a2.
a1 represents acceleration, ranging between −1 and +1, while a2 represents the steering
angle, ranging between −15 degrees and +15 degrees. Due to the nature of the DDPG,
the output of actions allows for smooth steering angles and throttle control instead
of the simple left or right direction in the DQN method. The state selects the action
output after adding Gaussian noise according to the current strategy. Following the
task objective, the reward function is designed as follows: the agent obtains a reward
of 0.01 if it chooses to go forward; when it collides with an obstacle, the reward is −1
and the episode is terminated (a sketch of this reward logic follows the list of steps).
Since pulling out of the lane also results in a collision with an obstacle, there is no
need to set it separately. After executing an action, the agent receives signals from the
environment about the next state, the reward, and whether to terminate or not. Next,
the transition data are stored in the replay buffer.
6. Sample Data from the Replay Buffer: When the amount of data in the buffer exceeds
the minimum batch size, a small random batch is sampled from the buffer.
7. Critic-network Implementation: This inputs the state and action to output the scalar
expected returns under the current policy. States are represented by point cloud global
feature vectors. Actions come from the actor-network output. Both pass through fully
connected layers for feature extraction, and then the state features and action features
are fused into a vector that outputs the Q-value through the last layer of the fully
connected network. The critic network uses the mean square error loss to update
the weights. The output Q-value is used to guide the actor network to improve the
driving strategy of the vehicle.
8. Actor-network Implementation: Following the critic network's performance evalua-
tion, this step calculates actor network gradients and updates the parameters.
9. Target Network Updates: Soft updates with τ = 0.001 maintain stability after the main network updates.
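As referenced in step 5, the reward logic and the hyperparameters listed in step 1 can be condensed into the short sketch below; it is an illustrative summary of the described settings, and the collision flag is assumed to be provided by the simulator's collision detection.

# Hyperparameters as described in step 1 (illustrative grouping).
CONFIG = {
    "actor_lr": 1e-3,
    "critic_lr": 1e-4,
    "gamma": 0.99,
    "init_weight_range": (-0.003, 0.003),
    "replay_capacity": 10_000,
    "tau": 0.001,
}

def compute_reward(collided):
    """Reward logic from step 5: +0.01 per forward step, -1 and terminate on collision."""
    if collided:
        return -1.0, True      # collision (including leaving the lane) ends the episode
    return 0.01, False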
The training process repeats the above steps, and the results after 100 episodes are as
follows. From visual observation, in the early stage the agent quickly collides with obstacles
and the next episode starts. After more training, the agent can gradually keep
driving normally for a long time without collision damage. Figure 5 shows the eval average
reward graph generated after training. The light orange line is the raw evaluation average
reward, which fluctuates significantly due to the stochastic nature of the environment and
exploration. The dark red line is a smoothed version of the reward, computed using a
moving average, to highlight the overall trend of improvement. It reflects the average
cumulative reward of the agent. In an episode, the cumulative reward is the total reward
obtained by the agent from the initial state to the termination state. Eval average reward
is the average of the cumulative rewards of multiple episodes. As shown in the figure,
the rewards gradually increase as the training progresses, which indicates a successful
implementation of reinforcement learning.
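The smoothed curve corresponds to a simple moving average over the raw per-episode evaluation rewards; a minimal version (with an arbitrarily chosen window size) is:

import numpy as np

def moving_average(values, window=10):
    """Simple moving average used to smooth the raw evaluation rewards."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")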
Figure 5. Eval avg reward diagram of the experiment.
Figure 6 shows the loss curves of the experimental vehicle after DDPG training. The
green curve represents the critic loss, and the red curve represents the actor loss. These
two main loss curves are commonly monitored during DDPG training. The loss curves of
the critic network and actor network reflect the optimization states of different networks.
The critic network is used to evaluate the value of an action under the current strategy.
The critic loss uses the mean square error (MSE) to calculate the difference between the
current estimated Q-value and the target Q-value. In the early stages of training, the critic
loss tends to be higher because the agent has limited understanding of the environment.
As training progresses, the critic loss should gradually decrease and stabilize. The actor
network loss is calculated differently, using feedback from the critic network to optimize
the strategy so that it selects actions that maximize the critic’s estimated Q-value. The actor
loss should be analyzed in conjunction with critic loss. If the actor loss does not converge,
the target value of the critic may always be biased. In this experiment, both the critic loss
and the actor loss curves demonstrate clear convergence. This proves that the training is
effective, and the algorithm can successfully implement the target task in the simulator.
Figure 6. Loss diagram of the experiment.
In addition, Figure 7 shows the noise curve generated during DDPG training in this
experiment. The light orange line represents the raw value, and the dark red line is a
smoothed version of the curve, showing the overall trend of decreasing noise. The down-
ward trend of the curve indicates that the noise gradually decays throughout the training
Figure 5. Eval avg reward diagram of the experiment.
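The smoothed curve in Figure 5 is obtained with a moving average over the raw evaluation rewards. The sketch below reproduces this kind of smoothing; the window size and placeholder data are illustrative, not the values used to generate the figure.

```python
import numpy as np

def moving_average(values, window=10):
    """Smooth a reward curve with a simple moving average (illustrative window size)."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    # 'valid' avoids edge artifacts; the smoothed curve is slightly shorter than the raw one.
    return np.convolve(values, kernel, mode="valid")

# Example: smooth raw per-evaluation average rewards before plotting.
raw_eval_rewards = np.random.randn(100).cumsum()  # placeholder data
smoothed = moving_average(raw_eval_rewards, window=10)
```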
Figure 6 shows the loss curves of the experimental vehicle after DDPG training. The
green curve represents the critic loss, and the red curve represents the actor loss. These
two main loss curves are commonly monitored during DDPG training. The loss curves of
the critic network and actor network reflect the optimization states of different networks.
The critic network is used to evaluate the value of an action under the current strategy.
The critic loss uses the mean square error (MSE) to calculate the difference between the
current estimated Q-value and the target Q-value. In the early stages of training, the critic
loss tends to be higher because the agent has limited understanding of the environment.
As training progresses, the critic loss should gradually decrease and stabilize. The actor
network loss is calculated differently, using feedback from the critic network to optimize
the strategy so that it selects actions that maximize the critic’s estimated Q-value. The actor
loss should be analyzed in conjunction with critic loss. If the actor loss does not converge,
the target value of the critic may remain biased. In this experiment, both the critic loss and the actor loss curves demonstrate clear convergence, indicating that the training is effective and that the algorithm can successfully accomplish the target task in the simulator.
Figure 6. Loss diagram of the experiment.
In addition, Figure 7 shows the noise curve generated during DDPG training in
this experiment. The light orange line represents the raw value, and the dark red line
is a smoothed version of the curve, showing the overall trend of decreasing noise. The
downward trend of the curve indicates that the noise gradually decays throughout the
training process, reflecting the agent’s transition from exploration-focused behavior to
exploitation of the learned optimal policy.
Figure 7. Noise diagram of the experiment.
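The decaying exploration noise shown in Figure 7 can be produced as in the sketch below. Ornstein–Uhlenbeck noise with a multiplicative decay is a common pairing with DDPG and is used here only as an illustrative assumption; the exact noise process and decay schedule of this experiment may differ.

```python
import numpy as np

class DecayingOUNoise:
    """Ornstein-Uhlenbeck exploration noise with multiplicative decay
    (an illustrative choice; the actual noise schedule may differ)."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, decay=0.999):
        self.mu, self.theta, self.sigma, self.decay = mu, theta, sigma, decay
        self.state = np.full(action_dim, mu, dtype=float)

    def sample(self):
        # Mean-reverting step plus a Gaussian perturbation.
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        # Shrink sigma over time so behavior shifts from exploration to exploitation.
        self.sigma *= self.decay
        return self.state

noise = DecayingOUNoise(action_dim=2)             # e.g., steering and throttle
actor_output = np.zeros(2)                        # placeholder for the actor's output
noisy_action = np.clip(actor_output + noise.sample(), -1.0, 1.0)
```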
4.2. Implementation in the Real Environment
After the successful implementation in the simulated environment, this study was
extended to implement the system in a real-world setting. The experimental vehicle used
in this study is the F1tenth car, which is a widely recognized platform for research in
autonomous driving due to its affordability, flexibility, and scalability [
37
]. The F1tenth car
used in this research consists of two main layers that are essential for achieving autonomous
functionality.
The lower level serves as the foundation of the vehicle, providing the structural
support for various components. This layer is equipped with the following.
− Chassis: The chassis used in this experiment is from Traxxas, a simplified chassis structure that is as close as possible to a real car; the experimental car is one-tenth the size of a road vehicle. It uses the Ackermann steering mechanism, which is commonly employed in automotive design to minimize tire slip and ensure that each wheel follows its optimum path during turns (an ideal geometric relation is sketched after this list). This setup is crucial for replicating the real dynamics of vehicle steering and control.
− Battery: A lithium-polymer battery with a capacity of 5000 mAh was used in this experiment.
− Motor: A high-performance brushless motor that provides precise control over the car's speed and maneuverability.
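To make the Ackermann condition mentioned above concrete, the following sketch computes the inner and outer front-wheel angles for an ideal Ackermann geometry; the wheelbase, track width, and turning radius are illustrative values for a 1/10-scale chassis, not the actual Traxxas specifications.

```python
import math

def ackermann_angles(wheelbase, track_width, turn_radius):
    """Ideal Ackermann steering: inner and outer front-wheel angles (radians)
    for a turn of radius `turn_radius` measured to the rear-axle center."""
    inner = math.atan(wheelbase / (turn_radius - track_width / 2.0))
    outer = math.atan(wheelbase / (turn_radius + track_width / 2.0))
    return inner, outer

# Example with plausible 1/10-scale dimensions (assumptions, not measured values):
inner, outer = ackermann_angles(wheelbase=0.33, track_width=0.20, turn_radius=1.0)
# The classic Ackermann condition cot(outer) - cot(inner) = w / L holds by construction:
assert abs((1 / math.tan(outer) - 1 / math.tan(inner)) - 0.20 / 0.33) < 1e-9
```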
The upper level is mounted on a laser-cut acrylic panel. It contains all the components
responsible for autonomous decision-making and control. This layer includes the following.
− NVIDIA Jetson TX2: The primary computing unit that runs all the deep learning models, including the DDPG algorithm and the point cloud processing module. The Jetson TX2 was selected for its computational power while maintaining a small size, which is critical for real-time processing.
− Communication: The vehicle communicates with the station through an antenna for remote input of commands, monitoring, data logging, and algorithm tuning.
− VESC (Vedder Electronic Speed Controller): A component responsible for controlling the speed and steering of the vehicle. The VESC is directly interfaced with the NVIDIA Jetson TX2, enabling precise control based on the actions determined by the DDPG model.
− Powerboard: Distributes power to all components, ensuring stable operation and proper voltage levels.
To enable the vehicle to perceive its surroundings, this study used a Velodyne VLP-16 3D-LiDAR sensor, which is well suited for capturing the detailed spatial information required by this research. The VLP-16 provides a 360-degree horizontal field of view and a vertical field of view of approximately 30 degrees. Its 16 laser channels allow for high-density point cloud generation, producing up to 300,000 points per second, with a
range of up to 100 m. Its lightweight and compact design also makes it suitable for integra-
tion into smaller autonomous platforms like the F1tenth car. To integrate the LiDAR with
the F1tenth car, a custom load-bearing platform was designed using 3D printing technology.
This custom platform was optimized to reduce the overall weight while ensuring that the
LiDAR remained stable during motion, thus enhancing the reliability of the sensor data.
The use of 3D printing also allowed for rapid prototyping and customization, which were
particularly useful for adapting the sensor setup to meet specific research needs. The car’s
system operates using ROS Melodic, a popular version of the Robot Operating System,
which provides a flexible framework for communication between different sensors and
computational units. ROS Melodic was chosen due to its stability and compatibility with
the Jetson TX2 and other components of the F1tenth platform. Figure 8 shows the agent
used in this research.
Figure 8. The experimental agent.
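As an illustration of how the VLP-16 output can be consumed on the Jetson TX2 under ROS, the sketch below converts an incoming PointCloud2 message into an N × 3 NumPy array using the standard sensor_msgs helper; it is a generic conversion, not the exact preprocessing pipeline of this study.

```python
import numpy as np
import sensor_msgs.point_cloud2 as pc2

def cloud_to_array(cloud_msg):
    """Convert a sensor_msgs/PointCloud2 from the VLP-16 into an N x 3 array
    of (x, y, z) coordinates, dropping NaN returns."""
    points = pc2.read_points(cloud_msg, field_names=("x", "y", "z"), skip_nans=True)
    return np.array(list(points), dtype=np.float32)
```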
The system architecture of the real-world experiment is illustrated in Figure 9. Ini-
tially, the 3D-LiDAR acquires environmental information and inputs the data in point
cloud format into the NVIDIA Jetson TX2 processor. The processor performs the point
cloud processing stage and subsequently transfers the processed information to the deep
reinforcement learning algorithm. The computed action commands are then transmitted
via VESC to control the motor and execute agent movements. Throughout this process,
communication between the agent and station is maintained through a WiFi antenna. The
station transmits commands (such as initiation or termination of experiments) to the agent,
while the agent returns experimental data and results back to the station.
Figure 9. The system architecture of the real-world experiment.
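The data flow of Figure 9 can be outlined as a single ROS node, sketched below. The topic names (/velodyne_points, /drive), the message types, and the perception and actor stubs are placeholders chosen for illustration and may differ from the interfaces actually running on the vehicle.

```python
#!/usr/bin/env python
import numpy as np
import rospy
from sensor_msgs.msg import PointCloud2
from ackermann_msgs.msg import AckermannDriveStamped

def process_point_cloud(cloud_msg):
    """Placeholder for the point cloud processing stage described in this paper."""
    return np.zeros(64, dtype=np.float32)  # dummy state vector

def actor(state):
    """Placeholder for the trained DDPG actor; returns (steering_angle, speed)."""
    return 0.0, 0.5

class DrivingNode(object):
    def __init__(self):
        # Topic names are common F1tenth/VESC defaults and may differ on the actual vehicle.
        self.drive_pub = rospy.Publisher("/drive", AckermannDriveStamped, queue_size=1)
        rospy.Subscriber("/velodyne_points", PointCloud2, self.on_cloud, queue_size=1)

    def on_cloud(self, cloud_msg):
        state = process_point_cloud(cloud_msg)   # perception stage on the Jetson TX2
        steering, speed = actor(state)           # DDPG actor inference
        cmd = AckermannDriveStamped()
        cmd.header.stamp = rospy.Time.now()
        cmd.drive.steering_angle = float(steering)
        cmd.drive.speed = float(speed)
        self.drive_pub.publish(cmd)              # forwarded to the VESC motor controller

if __name__ == "__main__":
    rospy.init_node("ddpg_driving_node")
    DrivingNode()
    rospy.spin()
```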
The real-world experiments were conducted in a controlled laboratory environment.
The algorithms and models used in the experiments were kept the same as in the simulator.
Since this study focuses on testing the effectiveness of the algorithms in real-world environments, and because the hardware is constrained by battery capacity, the task objective of the experiment was simplified to basic obstacle avoidance and driving behavior selection in a defined environment. Successfully completing this basic task provides the foundation for future in-depth research on more difficult tasks in complex environments.
The map used for the experiment is shown in Figure 10. The closest obstacle is in front
of the right side of the experimental vehicle, and the open area is on the left side. The
experimental vehicle freely explores the map and selects action outputs to form its own
driving strategy.
Figure 10. Map of the experiment.
The experimental process consisted of different stages, where the system’s ability to
self-learn and adapt was observed through multiple trials. In the initial stage, the agent
failed to avoid the first obstacle and collided with a chair (Figure 11a). This highlights
the need for additional training to refine its decision-making capabilities. In the subse-
quent trials, after more than 200 steps, the vehicle was able to bypass the first obstacle
but encountered difficulties with the second chair, colliding with it (Figure 11b). After
additional training iterations, around 600 steps, the vehicle successfully avoided both
obstacles. Instead of continuing toward the obstacle on the right, the vehicle adjusted its
trajectory to move toward the open space on the left (Figure 11c). This showed the system’s
ability to learn from previous experiences and improve its strategy through reinforcement
learning. Figure 12 shows the reward curve of the experimental vehicle following DDPG
training. It illustrates the reward dynamics of the agent during the training process. The
reward obtained by the agent increases as training progresses. During the initial training phase, the reward values remained relatively low. When the number of steps reached about 300, the reward value increased significantly but was not yet stable and fluctuated considerably. Beyond 600 steps, the reward values stabilized and consistently maintained a high level.
This trend demonstrates the effectiveness of the DDPG training approach for the agent.
This research has successfully transitioned from implementing autonomous driving tasks through reinforcement learning in virtual environments to effectively executing the algorithms and completing tasks in real-world environments. This provides a foundation and starting point for applying reinforcement learning to achieve complex autonomous driving in real-world scenarios. Because this experiment is intended to demonstrate the practical feasibility of the algorithm, basic experimental tasks were set and implemented in a simple environment. Future applications could use a larger battery capacity to support longer and more complex tasks, and could define more sophisticated reward mechanisms based on diverse road conditions and environmental scenarios. With richer reward mechanisms and further parameter tuning, the agent can potentially achieve superior performance and more intelligent behavior in real-world driving scenarios, bringing the system closer to real-world road conditions and, eventually, full applicability to real traffic scenarios.
Figure 11. Results of the experiment. (a) Initial stage; (b) trials after more than 200 steps; (c) trials around 600 steps.
Figure 12. Reward diagram of the experiment.
5. Conclusions and Future Work
This study demonstrates a deep reinforcement learning solution for autonomous
driving tasks using 3D-LiDAR sensors, achieving both simulation and real-world im-
plementations. By integrating the DDPG algorithm with a PointNet-based point cloud
processing approach enhanced by self-attention mechanisms, the research provides a scal-
able framework for unsupervised autonomous driving systems. The study addresses
several key challenges, including the limitations of discrete action spaces in traditional
reinforcement learning algorithms and the complexities of processing irregular and dense
3D point cloud data in dynamic environments. Moreover, by eliminating the need for
labeled datasets, the approach reduces the overall cost and labor compared to supervised
learning methods for autonomous driving. The system demonstrated consistent train-
ing performance through simulation experiments conducted in the CARLA environment.
Real-world tests using the F1tenth car further demonstrated the completion of basic autonomous driving tasks in a physical environment. The effectiveness of reinforcement learning in real-world applications was demonstrated, bridging the gap between virtual simulation
and actual implementation.
There remains considerable room for improvement in this research. Future enhance-
ments could explore more complex autonomous driving tasks and scenarios, particularly
in adapting to highly dynamic or previously unseen environments. The reliance on pre-
defined reward functions may also limit the generalization to more diverse driving tasks.
Applications in complex road environments require well-developed reward mechanisms as
well as longer exploratory training processes. They also place higher demands on hardware
such as battery capacity. In terms of sensors, the point cloud processing framework pro-
posed in this study allows for a higher density of information input in order to cope with
complex road environments. Future research can flexibly improve the feature extraction capability by adding different feature processing modules. More comprehensive
environmental information from multi-sensor fusion can also be considered.
In conclusion, this study lays a foundation for advancing unsupervised learning
techniques in autonomous driving, promoting safer, more efficient, and cost-effective
solutions for future smart transportation systems.
Author Contributions: Conceptualization, Y.C., C.T.L. and G.P.; methodology, Y.C. and G.P.; formal
analysis, W.K.; investigation, C.T.L.; writing—original draft preparation, Y.C.; writing—review and
editing, W.K. and G.P.; supervision, C.T.L. and G.P.; project administration, W.K.; funding acquisition,
W.K. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by Macao Polytechnic University (project code: RP/FCA-
14/2022).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Igliński, H.; Babiak, M. Analysis of the potential of autonomous vehicles in reducing the emissions of greenhouse gases in road transport. Procedia Eng. 2017, 192, 353–358. [CrossRef]
2. Uzzaman, A.; Muhammad, W. A Comprehensive Review of Environmental and Economic Impacts of Autonomous Vehicles. Control Syst. Optim. Lett. 2024, 2, 303–309. [CrossRef]
3. Yurtsever, E.; Lambert, J.; Carballo, A.; Takeda, K. A Survey of Autonomous Driving: Common Practices and Emerging Technologies. IEEE Access 2020, 8, 58443–58469. [CrossRef]
4. Chen, R.C.; Saravanarajan, V.S.; Chen, L.S.; Yu, H. Road Segmentation and Environment Labeling for Autonomous Vehicles. Appl. Sci. 2022, 12, 7191. [CrossRef]
5. Li, Y. Deep Reinforcement Learning: An Overview. arXiv 2017, arXiv:1701.07274.
6. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Hassabis, D. Human-Level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533. [CrossRef]
7. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous Control with Deep Reinforcement Learning. arXiv 2015, arXiv:1509.02971.
8. Zhou, J. A Review of LiDAR Sensor Technologies for Perception in Automated Driving. Acad. J. Sci. Technol. 2022, 3, 255–261. [CrossRef]
9. Liu, Y.; Diao, S. An Automatic Driving Trajectory Planning Approach in Complex Traffic Scenarios Based on Integrated Driver Style Inference and Deep Reinforcement Learning. PLoS ONE 2024, 19, e0297192. [CrossRef]
10. Yuan, M.; Shan, J.; Mi, K. From Naturalistic Traffic Data to Learning-Based Driving Policy: A Sim-to-Real Study. IEEE Trans. Veh. Technol. 2024, 73, 245–257. [CrossRef]
11. Wang, C.; Cui, X.; Zhao, S.; Zhou, X.; Song, Y.; Wang, Y.; Guo, K. A Deep Reinforcement Learning-Based Active Suspension Control Algorithm Considering Deterministic Experience Tracing for Autonomous Vehicles. Appl. Soft Comput. 2024, 153, 111259. [CrossRef]
12. Bosello, M.; Tse, R.; Pau, G. Train in Austria, Race in Montecarlo: Generalized RL for Cross-Track F1 Tenth LiDAR-Based Races. In Proceedings of the 2022 IEEE 19th Annual Consumer Communications and Networking Conference (CCNC), Las Vegas, NV, USA, 8–11 January 2022; pp. 290–298.
13. Li, Z.; Yuan, S.; Yin, X.; Li, X.; Tang, S. Research into Autonomous Vehicles Following and Obstacle Avoidance Based on Deep Reinforcement Learning Method under Map Constraints. Sensors 2023, 23, 844. [CrossRef] [PubMed]
14. Li, W.; Zhang, Y.; Shi, X.; Qiu, F. A Decision-Making Strategy for Car Following Based on Naturalist Driving Data via Deep Reinforcement Learning. Sensors 2022, 22, 8055. [CrossRef] [PubMed]
15. Yang, K.; Tang, X.; Qiu, S.; Jin, S.; Wei, Z.; Wang, H. Towards Robust Decision-Making for Autonomous Driving on Highway. IEEE Trans. Veh. Technol. 2023, 72, 11251–11263. [CrossRef]
16. Bellotti, F.; Lazzaroni, L.; Capello, A.; Cossu, M.; De Gloria, A.; Berta, R. Explaining a Deep Reinforcement Learning (DRL)-Based Automated Driving Agent in Highway Simulations. IEEE Access 2023, 11, 28522–28550. [CrossRef]
17. Al-Sharman, M.; Dempster, R.; Daoud, M.A.; Nasr, M.; Rayside, D.; Melek, W. Self-Learned Autonomous Driving at Unsignalized Intersections: A Hierarchical Reinforced Learning Approach for Feasible Decision-Making. IEEE Trans. Intell. Transp. Syst. 2023, 24, 12345–12356. [CrossRef]
18. Wang, J.; Sun, H.; Zhu, C. Vision-Based Autonomous Driving: A Hierarchical Reinforcement Learning Approach. IEEE Trans. Veh. Technol. 2023, 72, 11213–11226. [CrossRef]
19. Mao, Z.; Liu, Y.; Qu, X. Integrating Big Data Analytics in Autonomous Driving: An Unsupervised Hierarchical Reinforcement Learning Approach. Transp. Res. Part C Emerg. Technol. 2024, 162, 104606. [CrossRef]
20. Pérez-Gil, Ó.; Barea, R.; López-Guillén, E.; Bergasa, L.M.; Gómez-Huélamo, C.; Gutiérrez, R.; Díaz-Díaz, A. Deep Reinforcement Learning Based Control for Autonomous Vehicles in CARLA. Multimed. Tools Appl. 2022, 81, 3553–3576. [CrossRef]
21. Petryshyn, B.; Postupaiev, S.; Ben Bari, S.; Ostreika, A. Deep Reinforcement Learning for Autonomous Driving in Amazon Web Services DeepRacer. Information 2024, 15, 113. [CrossRef]
22. Wang, W.; Luo, H.; Zheng, Q.; Wang, C.; Guo, W. A Deep Reinforcement Learning Framework for Vehicle Detection and Pose Estimation in 3D Point Clouds. In Proceedings of the 6th International Conference on Artificial Intelligence and Security (ICAIS 2020), Hohhot, China, 17–20 July 2020; pp. 405–416.
23. Lee, Y.; Park, S. A Deep Learning-Based Perception Algorithm Using 3D LiDAR for Autonomous Driving: Simultaneous Segmentation and Detection Network (SSADNet). Appl. Sci. 2020, 10, 4486. [CrossRef]
24. Chen, Y.; Tse, R.; Bosello, M.; Aguiari, D.; Tang, S.K.; Pau, G. Enabling Deep Reinforcement Learning Autonomous Driving by 3D-LiDAR Point Clouds. In Proceedings of the Fourteenth International Conference on Digital Image Processing (ICDIP 2022), Wuhan, China, 20–23 May 2022; Volume 12342, pp. 362–371.
25. Zhai, Y.; Liu, Z.; Miao, Y.; Wang, H. Efficient Reinforcement Learning for 3D LiDAR Navigation of Mobile Robot. In Proceedings of the 41st Chinese Control Conference (CCC 2022), Hefei, China, 25–27 July 2022; pp. 3755–3760.
26. Sánchez, M.; Morales, J.; Martínez, J.L. Reinforcement and Curriculum Learning for Off-Road Navigation of a UGV with a 3D LiDAR. Sensors 2023, 23, 3239. [CrossRef] [PubMed]
27. Bellman, R.; Kalaba, R.E. Dynamic Programming and Modern Control Theory; Academic Press: New York, NY, USA, 1965; Volume 81.
28. Sutton, R.S.; Barto, A.G. The reinforcement learning problem. In Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998; pp. 51–85.
29. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic Policy Gradient Algorithms. In Proceedings of the International Conference on Machine Learning (ICML 2014), Beijing, China, 21–26 June 2014; pp. 387–395.
30. Mnih, V. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602.
31. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
32. Vaswani, A. Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS 2017); Curran Associates, Inc.: Long Beach, CA, USA, 2017.
33. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic Graph CNN for Learning on Point Clouds. ACM Trans. Graph. 2019, 38, 1–12. [CrossRef]
34. Guo, M.H.; Cai, J.X.; Liu, Z.N.; Mu, T.J.; Martin, R.R.; Hu, S.M. PCT: Point Cloud Transformer. Comput. Vis. Media 2021, 7, 187–199. [CrossRef]
35. Zhao, H.; Jiang, L.; Jia, J.; Torr, P.H.; Koltun, V. Point Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2021), Montreal, QC, Canada, 10–17 October 2021; pp. 16259–16268.
36. Dosovitskiy, A.; Ros, G.; Codevilla, F.; Lopez, A.; Koltun, V. CARLA: An Open Urban Driving Simulator. In Proceedings of the Conference on Robot Learning (CoRL 2017), Mountain View, CA, USA, 13–15 November 2017; pp. 1–16.
37. O'Kelly, M.; Sukhil, V.; Abbas, H.; Harkins, J.; Kao, C.; Pant, Y.V.; Bertogna, M. F1/10: An Open-Source Autonomous Cyber-Physical Platform. arXiv 2019, arXiv:1901.08567.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.