Article
Motion Planning of Robot Manipulators for a Smoother Path Using a Twin Delayed Deep Deterministic Policy Gradient with Hindsight Experience Replay
MyeongSeop Kim 1,†, Dong-Ki Han 1,†, Jae-Han Park 2 and Jung-Su Kim 1,*

1 Department of Electrical and Information Engineering, Seoul National University of Science and Technology, Seoul 01811, Korea; kimmyungsup57@gmail.com (M.K.); dongki.hann@gmail.com (D.-K.H.)
2 Robotics R&D Group, Korea Institute of Industrial Technology (KITECH), Ansan 15588, Korea; hans1024@kitech.re.kr
* Correspondence: jungsu@seoultech.ac.kr; Tel.: +82-2-970-6547
† These authors contributed equally to this work.
Received: 2 December 2019; Accepted: 7 January 2020; Published: 13 January 2020


Abstract: In order to enhance the performance of robot systems in the manufacturing industry, it is essential to develop motion and task planning algorithms. In particular, it is important for the motion plan to be generated automatically in order to deal with various working environments. Although PRM (Probabilistic Roadmap) provides feasible paths when the starting and goal positions of a robot manipulator are given, the path might not be smooth enough, which can lead to inefficient performance of the robot system. This paper proposes a motion planning algorithm for robot manipulators using the twin delayed deep deterministic policy gradient (TD3), a reinforcement learning algorithm tailored to MDPs (Markov Decision Processes) with continuous actions. In addition, hindsight experience replay (HER) is employed in TD3 to enhance sample efficiency: since path planning for a robot manipulator is an MDP with sparse reward, and HER can deal with such a problem, the proposed motion planning algorithm combines TD3 with HER. The proposed algorithm is applied to 2-DOF and 3-DOF manipulators, and it is shown that the designed paths are smoother and shorter than those designed by PRM.
Keywords: motion planning; Probabilistic Roadmap (PRM); reinforcement learning; policy gradient; Hindsight Experience Replay (HER)
1. Introduction
In the Industry 4.0 era, robots and related technologies are fundamental elements of assembly systems in manufacturing; for instance, efficient robot manipulators for various tasks in assembly lines, control of robots with high accuracy, and optimization methods for task scheduling [1,2].
When a task is given by a high-level task scheduler, the manipulator has to move its end-effector from the starting point to the goal point without colliding with any obstacles or other robots. To this end, motion planning algorithms tell robot manipulators how to change their joint angles so that the end-effector reaches the goal point without collision. Currently, in practice, human experts teach robot manipulators how to move in order to conduct various predefined tasks. Namely, a robot manipulator learns from human experts how to change its joint angles for a given task. However, when the tasks or the working environments change, such a manual teaching (robot learning) procedure has to be repeated. The other downside of the current approach is optimality or efficiency: even though a robot manipulator taught by human experts can perform a given task successfully, it is not clear whether it moves optimally. Therefore, it is important to teach robot manipulators an optimal path automatically when a task is given. Using policy search-based reinforcement learning, this paper presents a motion planning algorithm for robot manipulators that makes it possible for the robot manipulator to generate an optimal path automatically; the resulting path is smoother compared with existing results [3–5].
For robot path planning, sampling-based algorithms find feasible paths for the robot manipulator using a graph consisting of randomly sampled nodes and connecting edges in the given configuration space [6,7]. PRM (Probabilistic Roadmap) and RRT (Rapidly Exploring Random Trees) are two representatives of sampling-based planning algorithms. PRM consists of two phases. The learning phase samples nodes randomly from the collision-free subset of the configuration space and makes directed edges by connecting the sampled nodes; it then constructs a graph using these nodes and edges. The query phase finds the optimal path connecting the starting node and the goal node in the graph [8–13]. Note that the path produced by PRM is made by connecting the sampled nodes in the configuration space; it is usually not smooth and might be longer than the optimal path. RRT samples nodes from the neighborhood of the starting point in the configuration space, constructs a tree by finding a feasible path from the starting node, and expands the tree until the goal point is reached. It works for various environments and can generate a path quickly, but its optimality is not guaranteed in general [14,15]. More recently, Fast Marching Methods (FMMs) using level sets have been proposed for path planning. FMMs are mainly about efficiently solving the Eikonal equation, whose solution provides the optimal path. It has been shown that FMMs lead to an asymptotically optimal path and converge faster than PRM and RRT [16]. Since FMMs, PRM, and RRT are sampling-based approaches, they need a large number of sampled points in a high-dimensional configuration space in order to obtain a smoother path, which means that they are computationally demanding when computing the optimal path for arbitrary given starting and ending points. They can also suffer from memory deficiency in high-dimensional spaces. In contrast, in the proposed method, once a TD3 agent is trained, the optimal path can be computed easily (i.e., by evaluating the trained neural network).
Reinforcement learning is a deep learning approach that finds an optimal policy for an MDP (Markov Decision Process). The agent applies an action to the environment according to its policy, and the environment returns the next state and a reward. The agent seeks the optimal policy that maximizes the sum of rewards over the horizon [17,18]. In reinforcement learning, there are two typical approaches to finding the optimal policy. Value-based approaches estimate the optimal (action) state value function and derive the corresponding policy from the estimate of the value function [19–21]. On the other hand, policy gradient approaches search for the optimal policy directly from state and reward data. It is known that policy gradient approaches generally show much better performance [22–28]. Recently, deep learning-based control and operation of robot manipulators have drawn much attention. In [29,30], robot path planning methods are proposed using a deep Q-network algorithm with emphasis on learning efficiency. For path training, a stereo image is used to train DDPG (Deep Deterministic Policy Gradient) in [31]. In [32], a real robot is trained using reinforcement learning for its path planning.
This paper presents a policy gradient-based path planning algorithm. To this end, the RAMDP (Robot Arm Markov Decision Process) is defined first. In the RAMDP, the state is the vector of joint angles of the robot manipulator and the action is the variation of the joint angles. Then, DDPG (Deep Deterministic Policy Gradient) with HER (Hindsight Experience Replay) is employed to search for the optimal policy [24,33]. DDPG is applied since the action in the RAMDP is continuous and DDPG is devised for MDPs with continuous actions. The twin delayed DDPG (TD3) enhances the performance of DDPG so that it shows good convergence and avoids overestimation. HER, in turn, is well suited to robot path planning since the RAMDP is an MDP with sparse reward. Sparse reward means that, when an MDP has episodes of finite length with a specific goal state and the episodes frequently end at non-goal states (failed episodes) for whatever reason, the agent cannot collect much reward. Since all reinforcement learning methods find the optimal policy by maximizing the sum of rewards, sparse reward is critical in reinforcement learning. However, once episodes are saved in the memory, HER relabels the last state of a failed episode as a new goal. The failed episode then becomes a normal episode which ends at a goal state. Hence, HER enhances sample efficiency and fits robot path planning well. It is shown that this procedure is quite helpful in a motion planning algorithm. In the proposed algorithm, when the next state is computed after applying the action, collision with obstacles and goal arrival are checked. It turns out that many episodes end at non-goal states in the middle of learning; this is why conventional reinforcement learning methods do not work well for robot path planning. Using HER, however, those episodes can be changed into normal episodes which end at goal states. Hence, the contribution of the paper is a path planning algorithm using TD3 with HER. The proposed method is applied to 2-DOF and 3-DOF robot manipulators in simulation; experimental results are also shown for a 3-DOF manipulator. In both cases, it is quantitatively demonstrated that the resulting paths are shorter than those produced by PRM.
2. Preliminaries and Problem Setup
2.1. Configuration Space and Sampling-Based Path Planning
In sampling-based path planning, the configuration space $Q$ (also called the joint space for robot manipulators) represents the space of possible joint angles and is a subset of the $n$-dimensional Euclidean space $\mathbb{R}^n$, where $n$ denotes the number of joints of the manipulator. The values of the joint angles of a robot manipulator are denoted as a point in $Q$ [1,2,34]. The configuration space consists of two subsets: the collision-free space $Q_{free}$ and $Q_{collide}$, in which the robot manipulator collides with obstacles or with itself. For motion planning, a discrete representation of the continuous $Q_{free}$ is generated by random sampling, and a connected graph (roadmap) is obtained. The nodes in the graph denote admissible configurations of the robot manipulator, and the edges connecting any two nodes denote feasible paths (trajectories) between the corresponding configurations. Finally, when the starting and goal configurations $q_0$ and $q_{goal}$ are given, any graph search algorithm can be employed to find the shortest path connecting $q_0$ and $q_{goal}$. Such a shortest path exists since $q_0$ and $q_{goal}$ are two nodes of the connected graph.
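As an illustration of the two PRM phases described above, the following Python sketch builds a roadmap over randomly sampled collision-free configurations and answers a query with Dijkstra search. It is not the implementation used in the paper; the collision checker `is_free`, the connection radius, and the helper names are assumptions, and connecting arbitrary start and goal configurations to their nearest roadmap nodes is omitted.

```python
# Minimal PRM sketch (illustrative only): learning phase + query phase.
import heapq
import numpy as np

def build_roadmap(num_samples, dim, is_free, radius, seed=0):
    """Learning phase: sample collision-free joint configurations and connect nearby ones."""
    rng = np.random.default_rng(seed)
    nodes = []
    while len(nodes) < num_samples:
        q = rng.uniform(-np.pi, np.pi, size=dim)      # candidate configuration in joint space
        if is_free(q):                                # keep it only if it lies in Q_free
            nodes.append(q)
    edges = {i: [] for i in range(num_samples)}
    for i in range(num_samples):
        for j in range(i + 1, num_samples):
            d = float(np.linalg.norm(nodes[i] - nodes[j]))
            if d < radius:                            # connect configurations that are close
                edges[i].append((j, d))
                edges[j].append((i, d))
    return nodes, edges

def shortest_path(edges, start, goal):
    """Query phase: Dijkstra search on the roadmap (assumes goal is reachable from start)."""
    dist, prev, pq = {start: 0.0}, {}, [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        for v, w in edges[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]
```

Because the returned path simply chains sampled nodes, it is typically piecewise linear, which is exactly the lack of smoothness the proposed method aims to avoid.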
2.2. Reinforcement Learning
Reinforcement learning is an optimization-based method to solve an MDP (Markov Decision Process) [18]. An MDP is comprised of the tuple $\{S, A, P, R, \gamma\}$, where $S$ is the set of states and $A$ is the set of actions. $P$ consists of the transition probabilities $P(s'|s,a)$ that the current state $s \in S$ with action $a \in A$ becomes the next state $s' \in S$. $R$ stands for the reward function and $\gamma \in [0,1]$ is the discount factor. The agent's policy $\pi(a|s)$ is the distribution of the action $a$ for the given state $s$. In reinforcement learning, the agent takes the action $a_t$ according to the policy $\pi$ at time $t$ and state $s_t$, and the environment returns the next state $s_{t+1}$ and reward $r_{t+1}$ according to the transition $P$ and the reward function $R$. By repeating this, the agent updates its policy so as to maximize its expected return $\mathbb{E}_\pi\left[\sum_{k=0}^{\infty}\gamma^k r_{t+k+1}\right]$. The resulting optimal policy is denoted by $\pi^*$. In order to find the optimal policy, value-based methods such as DQN (Deep Q-Network) estimate the optimal value function (i.e., the maximum return) and derive the corresponding policy [19]. On the other hand, policy gradient methods compute the optimal policy directly from samples. For instance, REINFORCE, the actor-critic method, DPG (Deterministic Policy Gradient), DDPG (Deep DPG), A3C (Asynchronous Advantage Actor-Critic), and TRPO (Trust Region Policy Optimization) are representative policy gradient methods [22,35,36]. The training performance of reinforcement learning depends heavily on the samples, which are sets of the state, action, and next state. Hence, in addition to the various reinforcement learning algorithms, many research efforts have been directed at how to use episodes efficiently for better agent learning, for example, replay memory [19] and HER (Hindsight Experience Replay) [33]. In this paper, for the sake of designing a motion planning algorithm, a policy gradient method called TD3 (Twin Delayed Deep Deterministic Policy Gradient) is used for path planning.
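To make the return objective concrete, the short sketch below computes the discounted return $\sum_{k}\gamma^k r_{t+k+1}$ of one recorded episode; it is illustrative only and not taken from the paper.

```python
# Discounted return of a single episode (illustrative helper, not from the paper).
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):   # G_t = r_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

# Example with the sparse reward used later in the paper: -1 per step, 0 on goal arrival.
print(discounted_return([-1, -1, -1, -1, 0], gamma=0.99))
```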
3. TD3 Based Motion Planning for Smoother Paths
3.1. RAMDP (Robot Arm Markov Decision Process) for Path Planning
In order to develop a reinforcement learning (RL)-based path planning method, the robot arm MDP (RAMDP) needs to be defined properly first [17]. The state $q_t$ of the RAMDP is the vector of joint angles of the manipulator, where the joint angles belong to the configuration space $Q$. Hence, for collision-free operation of the robot manipulator, the state $q_t$ in the motion plan must belong to $Q_{free} \subset \mathbb{R}^n$, where $n$ is the number of joints of the manipulator. In the RAMDP, the action is the joint angle variation. Unlike MDPs with discrete states and actions such as frozen-lake [17], the RAMDP under consideration has a continuous state (the joint angles) and a continuous action (the joint angle variation). Because of this, DDPG or its variants are suitable for finding the optimal policy of the RAMDP agent. In this paper, TD3 (twin delayed DDPG) is employed to find an optimal policy for the continuous action with a deterministic policy $\mu$ in the RAMDP. Figure 1 describes how the sample $(q_t, a_t, q_{t+1}, r_{t+1})$ is generated in the RAMDP. Suppose that an arbitrary initial state $q_0$ and goal state $q_{goal}$ are given, and that the maximum length of an episode is $T$. At state $q_t$, if the action $a_t$ is applied to the RAMDP, a candidate next state $\hat{q}_{t+1}$ is produced. Then, according to the reward function in (1), the next state $q_{t+1}$ and the reward $r_{t+1}$ are determined. In view of (1), if $\hat{q}_{t+1}$ does not belong to $Q_{free}$, the next state is set to the current state. Furthermore, if the next state is the goal point, the reward is 0; otherwise the reward is $-1$. The whole procedure is then repeated until the episode ends. Note that this reward function leads to as short a path as possible, since TD3 tries to maximize the sum of rewards and a longer path implies more $-1$ rewards.
$$
q_{t+1} = \begin{cases} \hat{q}_{t+1}, & \text{if } \hat{q}_{t+1} \in Q_{free} \\ q_t, & \text{if } \hat{q}_{t+1} \notin Q_{free} \end{cases}
\qquad
r_t = \begin{cases} 0, & \text{if } q_{t+1} = q_{goal} \\ -1, & \text{if } q_{t+1} \in Q_{collide} \\ -1, & \text{if } q_{t+1} \in Q_{free} \end{cases}
\tag{1}
$$
Figure 1. Robot Arm Markov Decision Process (RAMDP).
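A minimal sketch of this transition and sparse reward, stated in code, is given below. It is not the authors' implementation; the collision checker `in_q_free`, the step size `alpha`, and the goal tolerance (taken from the $0.2\,\alpha$ threshold of Algorithm 1) are assumptions.

```python
# Sketch of the RAMDP transition and reward of Equation (1) (assumed helper names).
import numpy as np

def ramdp_step(q_t, a_t, q_goal, in_q_free, alpha=1.0, goal_tol=0.2):
    q_hat = q_t + alpha * a_t                       # candidate next joint configuration
    q_next = q_hat if in_q_free(q_hat) else q_t     # stay at q_t if the move would collide
    if np.linalg.norm(q_next - q_goal) <= goal_tol:
        return q_next, 0.0, True                    # goal reached: reward 0, episode ends
    return q_next, -1.0, False                      # otherwise every step costs -1
```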
There are two possibilities for an episode to end: (1) the next state becomes the goal state, i.e., $e = [(q_0, a_0), (q_1, r_1, a_1), \cdots, (q_{goal}, r_g)]$, or (2) the length of the episode reaches $T$, i.e., $e = [(q_0, a_0), (q_1, r_1, a_1), \cdots, (q_T, r_T)]$ with $q_T \neq q_{goal}$. In both cases, every sample of the form $(q_t, a_t, r_t, q_{t+1})$ is saved in the memory. Note that when $\hat{q}_{t+1} \notin Q_{free}$, the next state $q_{t+1}$ is set to the current state $q_t$ in order to avoid collision. In the worst case, if this happens repeatedly, the agent is trained so that the corresponding episode does not occur. Based on these samples in the memory, the optimal policy is determined in accordance with TD3, which is explained in the next subsection.
3.2. TD3 (Twin Delayed Deep Deterministic Policy Gradient)
In this section, TD3 is introduced [24]; it is used to search for the optimal policy in the proposed algorithm [17,22,35,36]. To this end, it is assumed that a sufficient number of samples $(s_t, a_t, r_t, s_{t+1})$ are stored in the memory.
Figure 2 describes the structure of TD3. The basic structure of TD3 is an actor-critic network. However, unlike the standard actor-critic network, there are two critic deep neural networks (DNNs), one actor DNN, and a target DNN for each critic and actor DNN. Hence, there are six DNNs in TD3 (two critic DNNs and their target DNNs, one actor DNN and its target DNN). As seen in Figure 3, the critic DNNs generate estimates $Q_{\theta_i}(s_t, a_t)$ of the optimal state-action value function. The input of each critic DNN is a mini-batch from the memory and its output is $Q_{\theta_i}(s_t, a_t)$, $i = 1, 2$. The mini-batch is a finite set of samples $(s_t, a_t, r_t, s_{t+1})$ from the memory. In TD3, the two critic DNNs are used in order to remove the overestimation bias in the approximation of $Q_{\theta_i}(s_t, a_t)$; overestimation can take place when bad states are overvalued due to accumulated noise. To cope with this, TD3 chooses the smaller of the two critic DNN outputs as the target value. In Figure 3, $\theta_1$ and $\theta_2$ are the parameters of the two critic DNNs, and $\theta'_1$ and $\theta'_2$ are those of the corresponding target DNNs. In order to train the critic DNNs, the following quadratic function of the temporal difference error $\delta := Q_\theta(q,a) - y_{target}$ is minimized as the cost function:

$$J_\delta(\theta) := \tfrac{1}{2}\left(Q_\theta(q,a) - y_{target}\right)^2, \tag{2}$$
where $Q_\theta(s,a)$ stands for the state-action value function $Q$ parameterized by $\theta$,

$$y_{target} = r + \gamma \min_{i=1,2} Q_{\theta'_i}(q, \tilde{a}) \tag{3}$$

is the target value of the function $Q_\theta(s,a)$, and the target action (i.e., the action used in the critic target DNNs) is defined as
$$\tilde{a} = \mu_\phi(q) + \bar{e}, \tag{4}$$

where the noise $\bar{e}$ follows the clipped normal distribution $\mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)$, $c > 0$. This implies that $\bar{e}$ is a random variable drawn from $\mathcal{N}(0, \sigma)$ and restricted to the interval $[-c, c]$.
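The clipped double-Q target of Equations (3) and (4) can be sketched as follows; `critic1_target`, `critic2_target`, and `actor_target` are placeholder callables standing in for the target networks (following the use of the target actor in Algorithm 1), not the paper's code.

```python
# Sketch of the TD3 target value of Equations (3)-(4) with clipped target-action noise.
import numpy as np

def td3_target(r, q_next, critic1_target, critic2_target, actor_target,
               gamma=0.99, sigma=0.2, c=0.5, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    a = actor_target(q_next)                                             # deterministic target action
    noise = np.clip(rng.normal(0.0, sigma, size=np.shape(a)), -c, c)     # e_bar ~ clip(N(0, sigma), -c, c)
    a_tilde = a + noise                                                  # target action, Eq. (4)
    q_min = np.minimum(critic1_target(q_next, a_tilde),
                       critic2_target(q_next, a_tilde))                  # pessimistic double-Q estimate
    return r + gamma * q_min                                             # y_target, Eq. (3)
```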
Figure 2. Structure of TD3 (Twin Delayed Deep Deterministic Policy Gradient) with RAMDP.
Figure 3. Details of critic and actor deep neural network for RAMDP.
The inputs of the actor DNN are both $Q_{\theta_1}(q_t, a_t)$ from the critic DNN and the mini-batch from the memory, and the output is the action. Precisely, the action is given by $a_t = \mu_\phi(q_t) + e$, where $\phi$ is the parameter of the actor DNN and $\mu_\phi$, the output of the actor DNN, is a deterministic and continuous value. The noise $e$ follows the normal distribution $\mathcal{N}(0, \sigma)$ and is added for exploration. In order to tune the parameter $\phi$, the following objective function is maximized:

$$J_\mu(\phi) = \sum_{q} d^\mu(q)\, Q_\theta(q, \mu_\phi(q) + e), \tag{5}$$

where $d^\mu(q)$ denotes the distribution of the state. Note that the gradient $\nabla_\phi J_\mu(\phi)$ (equivalently $\nabla Q_\theta(q,a)$) is used to update the parameter $\phi$. This is why the method is called the policy gradient.
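A compact PyTorch sketch of this actor update (gradient ascent on $Q_{\theta_1}(q, \mu_\phi(q))$, as in line 27 of Algorithm 1) is shown below; `actor`, `critic1`, and `actor_opt` are placeholder modules and optimizer, not the networks used in the paper.

```python
# Sketch of the deterministic policy-gradient actor update (illustrative PyTorch code).
import torch

def actor_update(actor, critic1, actor_opt, q_batch):
    # Ascend Q_theta1(q, mu_phi(q)) by descending its negative mean over the mini-batch.
    actor_loss = -critic1(q_batch, actor(q_batch)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return float(actor_loss.item())
```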
Between the two outputs from the two critic target DNNs, the common target value in (3) is used to update the two critic DNNs. In addition, TD3 updates the actor DNN and all three target DNNs only every $d$ steps in order to avoid premature convergence. Note that the policy $\mu_\phi(q)$ is updated proportionally to $Q_{\theta_1}(q, \mu_\phi(q))$ only [24]. The parameters of the critic target DNNs and the actor target DNN are updated according to $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$ every $d$ steps, which not only keeps the temporal difference error small but also makes the target DNN parameters change slowly.
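The delayed soft target update $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$ can be written as a few lines of Python; the parameter dictionaries and variable names below are illustrative, not the paper's code.

```python
# Sketch of the soft target update; params/target_params are dicts of NumPy weight arrays.
def soft_update(target_params, params, tau=0.005):
    for name in params:
        target_params[name] = tau * params[name] + (1.0 - tau) * target_params[name]

# Delayed update: the actor and all target networks are refreshed only every d steps, e.g.
# if step % d == 0:
#     soft_update(critic1_target, critic1, tau)
#     soft_update(critic2_target, critic2, tau)
#     soft_update(actor_target, actor, tau)
```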
3.3. Hindsight Experience Replay
Since the agent in reinforcement learning is trained from samples, it is of utmost importance to have helpful samples, i.e., state-action pairs that improve the action value function. On the other hand, many unhelpful samples are generated in the RAMDP since many episodes end without reaching the goal; in other words, the RAMDP is an MDP with sparse reward. For the purpose of enhancing sample efficiency, HER (Hindsight Experience Replay) is employed in this paper. For an episode $e = [(q_0, a_0), (q_1, r_1, a_1), \cdots, (q_T, r_T)]$ in the memory whose final state $q_T$ is not the goal state, HER resets $q_T$ as $q_{goal}$. This means that even though the episode did not end at the original goal state, after modification by HER it becomes an episode that ends at a goal state. The failed episodes can then be used to train the agent, since the modified episodes are goal-achieving episodes. Algorithm 1 summarizes the proposed algorithm introduced in this section.
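The HER relabeling step used here admits a very small sketch: the final state of a failed episode is declared to be the goal and the rewards are recomputed accordingly. The transition layout and the helper `reward_fn` are assumptions for illustration, not the authors' code.

```python
# Sketch of HER relabeling for a failed RAMDP episode (illustrative only).
def her_relabel(episode, reward_fn):
    """episode: list of (q_t, a_t, r_t, q_next) transitions that did not reach q_goal."""
    new_goal = episode[-1][3]                    # q_T of the failed episode becomes the new goal
    relabeled = []
    for q_t, a_t, _, q_next in episode:
        r_new = reward_fn(q_t, a_t, new_goal)    # recompute r(q_t, a_t, q'_goal) as in Algorithm 1
        relabeled.append((q_t, a_t, r_new, q_next, new_goal))
    return relabeled                             # stored in the replay buffer R alongside the originals
```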
Algorithm 1 Training procedure for motion planning by TD3 with HER. The red part is for the robot manipulator, and the blue part is for HER.

1:  Initialize critic networks Q_{θ1}, Q_{θ2} and actor network µ_φ with parameters θ1, θ2, φ
2:  Initialize target networks θ'1 ← θ1, θ'2 ← θ2, φ' ← φ
3:  Initialize replay buffer R
4:  for e = 1 to M do
5:      Initialize local buffer L                                ▷ memory for HER
6:      for t = 0 to T − 1 do
7:          Randomly choose goal point q_goal ∈ Q_free
8:          Select action with noise:
9:              a_t ← µ_φ(q_t || q_goal) + e,  e ∼ N(0, σ)        ▷ || denotes concatenation
10:             q_{t+1} = q_t + α a_t
11:         if q_t + α(µ_φ(q_t || q_goal) + e) ∉ Q then
12:             q_{t+1} ← q_t
13:         else if q_{t+1} ∈ Q_collide then
14:             q_{t+1} ← q_t
15:         else if |q_{t+1} − q_goal| ≤ 0.2 α then
16:             Terminate by goal arrival
17:         end if
18:         r_t := r(q_t, a_t, q_goal)
19:         Store the transition (q_t || q_goal, a_t, r_t, q_{t+1} || q_goal) in R and L
20:         Sample a mini-batch of n transitions (q_j || q_goal, a_j, r_j, q_{j+1} || q_goal) from R
21:         ã_j ← µ_{φ'}(q_{j+1} || q_goal) + e,  e ∼ clip(N(0, σ), −c, c)
22:         y_j ← r_j + γ min_{i=1,2} Q_{θ'_i}(q_{j+1}, ã_j)
23:         Update the critics θ_i with the temporal difference error:
24:             ∇_{θ_i} J(θ_i) = (1/n) Σ_{j=1}^{n} ∇_{θ_i} (y_j − Q_{θ_i}(q_j, a_j))^2
25:         if t mod d then                                       ▷ delayed update with period d
26:             Update the actor φ by the deterministic policy gradient:
27:                 ∇_φ J(φ) = (1/n) Σ_{j=1}^{n} ∇_φ Q_{θ1}(q_j, µ_φ(q_j || q_goal))
28:             Update the target networks:
29:                 θ'_i ← τ θ_i + (1 − τ) θ'_i
30:                 φ' ← τ φ + (1 − τ) φ'
31:         end if
32:     end for
33:     if q_T ≠ q_goal then
34:         Set additional goal q'_goal ← q_T
35:         for t = 0 to T − 1 do
36:             Sample a transition (q_t || q_goal, a_t, r_t, q_{t+1} || q_goal) from L
37:             r'_t := r(q_t, a_t, q'_goal)
38:             Store the transition (q_t || q'_goal, a_t, r'_t, q_{t+1} || q'_goal) in R
39:         end for
40:     end if
41: end for
4. Case Study for 2-DOF and 3-DOF Manipulators
In order to show the effectiveness of the proposed method, it is applied to robot manipulators. Table 1 shows the parameters of the 2-DOF and 3-DOF manipulators used.
Table 1. Parameters of 2-DOF and 3-DOF manipulators.

DOF   Joint Max (degree)   Joint Min (degree)    Action Step Size (α)   Goal Boundary
2     (60, 60)             (0, 0)                3.0                    1.0
3     (140, 45, 150)       (−140, −180, −45)     0.1381                 0.2
For easy visualization, the proposed algorithm is applied to the 2-DOF manipulator first. Table 2
summarizes the tuning parameters for the designed TD3 with HER.
Table 2. Tuning parameters for the designed TD3 with HER.

Network Name    Learning Rate   Optimizer   Update Delay   DNN Size
Actor           0.001           Adam        2              6 × 400 × 300 × 3
Critic          0.001           Adam        0              6 × 400 × 300 × 1
Actor target    0.005           -           2              6 × 400 × 300 × 3
Critic target   0.005           -           2              6 × 400 × 300 × 1
In order to train TD3 with HER for the 2-DOF manipulator, 8100 episodes are used. Figure 4 describes the success ratio of the episodes with HER while the network is being trained. In other words, when the network is learning with arbitrary starting and goal points, some episodes end at the given goal point and some end before reaching it. In Figure 4, the green lines denote the success ratio over every 10 episodes, and the thick lines stand for their moving average. Figure 5 shows the reward as the number of episodes increases, i.e., as training proceeds. The reward converges as the number of episodes increases. In view of the results in Figures 4 and 5, we can see that learning has finished successfully.
For the purpose of testing the trained TD3, it is verified whether optimal paths are generated when random starting and goal points are given to the trained TD3. For testing, only the actor DNN, without its target DNN, is used, with the input being the current state and the goal point. The input to the trained actor DNN is $(q_t, q_{goal})$, the output is $(a_t, q_{t+1})$, and this is repeated with $q_t := q_{t+1}$, as depicted in Figure 6.
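A sketch of this roll-out loop is given below; `actor` stands for the trained actor DNN, and the step size, goal tolerance, and maximum number of steps are illustrative assumptions rather than the values used in the paper.

```python
# Sketch of path generation with the trained actor only, as in Figure 6 (illustrative).
import numpy as np

def generate_path(actor, q_start, q_goal, alpha=0.1, goal_tol=0.2, max_steps=500):
    q_t = np.asarray(q_start, dtype=float)
    q_goal = np.asarray(q_goal, dtype=float)
    path = [q_t]
    for _ in range(max_steps):
        a_t = actor(np.concatenate([q_t, q_goal]))   # input is the concatenated (q_t, q_goal)
        q_t = q_t + alpha * a_t                      # q_{t+1} becomes the next input
        path.append(q_t)
        if np.linalg.norm(q_t - q_goal) <= goal_tol:
            break                                    # stop once the goal is reached
    return path
```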
Figure 4. 2-DOF manipulator: success ratio of reaching the goal point with HER.
Figure 5. 2-DOF manipulator: reward from learning.
Figure 6. Path generation using the trained actor DNN.
When this procedure is applied to the 2-DOF manipulator, the resulting paths are shown in Figure 7. In Figure 7, the green areas represent obstacles in the configuration space, the rhombi denote the starting points, and the circles denote the goal points. For comparison, PRM is also applied to generate paths with the same starting and goal points. As shown in Figure 7, the proposed method generates smoother paths in general; this is confirmed by many other test data as well. In Figure 7, the red lines are the paths produced by PRM and the blue lines those produced by the proposed method. On average, the path produced by the proposed method is 3.45% shorter than the path produced by PRM.
In order to test the proposed method on a real robot manipulator, the 3-DOF open manipulator (for details, see http://en.robotis.com/model/page.php?co_id=prd_openmanipulator) is considered. For ease of understanding, Figure 8 shows the workspaces in Matlab and in Gazebo in ROS (Robot Operating System), respectively, and the configuration space of the open manipulator with four arbitrary obstacles. The tuning parameters for TD3 with HER are again those shown in Table 2.
Figure 7. DDPG based path generation for arbitrary starting and goal points in C-space.
Figure 8. Workspace and configuration space. (a) Workspace in Matlab; (b) Workspace in Gazebo (in ROS); (c) Configuration space.
For training, 140,000 episodes are used. In view of the converged reward and success ratio in Figures 9 and 10, we can see that learning is again completed successfully. With this result, random starting and goal points are given to the trained network in order to obtain a feasible path between them. Figure 11 shows several examples of paths generated by the trained actor DNN when arbitrary starting and goal points are given. The red lines are the paths produced by PRM and the blue lines those produced by the proposed method. As seen in the figure, the proposed method results in smoother and shorter paths in general. For the sake of comparison, 100 arbitrary starting and goal points are used to generate paths using PRM and the proposed method. Figure 12 shows the lengths of the resulting 100 paths. In light of Figure 12, it is clear that the proposed method generates smoother and shorter paths in general. Note that, on average, the path produced by the proposed method is 2.69% shorter than the path produced by PRM. When the proposed method is implemented on the open manipulator, the same result as in simulation is obtained experimentally. The experimental result is presented at https://sites.google.com/site/cdslweb/publication/td2-demo.
Figure 9. 3-DOF manipulator: success ratio of reaching the goal point with HER.
Figure 10. 3-DOF manipulator: reward from learning.
Figure 11. Path generation using DDPG with HER.
Figure 12. Comparison of paths by PRM and the proposed method.
5. Conclusions
For the purpose of enhancing efficiency in the manufacturing industry, it is important to improve the performance of robot path planning and task scheduling. This paper presents a reinforcement learning-based motion planning method for robot manipulators with a focus on generating smoother and shorter paths, which means better operating efficiency. To this end, the motion planning problem is reformulated as an MDP (Markov Decision Process), called the RAMDP (Robot Arm MDP). Then, TD3 (twin delayed deep deterministic policy gradient, i.e., twin delayed DDPG) with HER (Hindsight Experience Replay) is designed. DDPG is used since the action in the RAMDP is continuous and DDPG is a policy gradient method tailored to MDPs with continuous actions. Moreover, since many failed episodes are generated in the RAMDP, meaning that episodes end at non-goal states mainly due to collision, HER is employed in order to enhance sample efficiency.

Future research topics include how to solve the motion planning problem for multiple robot arms whose common working space is non-empty. To solve this problem, configuration space augmentation might be a candidate solution. Since the augmented configuration space becomes high dimensional, it would be interesting to compare the performance of the proposed reinforcement learning-based approach with that of sampling-based approaches such as FMMs, PRM, and RRT. Moreover, reinforcement learning-based motion planning for dynamic environments is also a challenging problem.
Author Contributions: M.K. and D.-K.H. surveyed the background of this research, designed the data preprocessing and the deep learning network, and performed the simulations and experiments to show the benefits of the proposed method. J.-S.K. and J.-H.P. supervised and supported this study. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Acknowledgments: This work was supported by the Technology Innovation Program (or Industrial Strategic Technology Development Program) (10080636, Development of AI-based CPS technology for industrial robot applications) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea), and by the Human Resources Development program of the Korea Institute of Energy Technology Evaluation and Planning (KETEP) grant funded by the Ministry of Trade, Industry & Energy of the Korean government (No. 20154030200720).
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
MDP     Markov Decision Process
RAMDP   Robot Arm Markov Decision Process
DOF     Degrees Of Freedom
PRM     Probabilistic Roadmap
RRT     Rapidly Exploring Random Trees
FMMs    Fast Marching Methods
DNN     Deep Neural Network
DQN     Deep Q-Network
A3C     Asynchronous Advantage Actor-Critic
TRPO    Trust Region Policy Optimization
DPG     Deterministic Policy Gradient
DDPG    Deep Deterministic Policy Gradient
TD3     Twin Delayed Deep Deterministic Policy Gradient
HER     Hindsight Experience Replay
ROS     Robot Operating System
References
1. Laumond, J.P. Robot Motion Planning and Control; Springer: Berlin, Germany, 1998; Volume 229.
2. Choset, H.M.; Hutchinson, S.; Lynch, K.M.; Kantor, G.; Burgard, W.; Kavraki, L.E.; Thrun, S. Principles of Robot Motion: Theory, Algorithms, and Implementation; MIT Press: Cambridge, MA, USA, 2005.
3. Cao, B.; Doods, G.; Irwin, G.W. Time-optimal and smooth constrained path planning for robot manipulators. In Proceedings of the 1994 IEEE International Conference on Robotics and Automation, San Diego, CA, USA, 8–13 May 1994; pp. 1853–1858.
4. Kanayama, Y.J.; Hartman, B.I. Smooth local-path planning for autonomous vehicles. Int. J. Robot. Res. 1997, 16, 263–284. [CrossRef]
5. Rufli, M.; Ferguson, D.; Siegwart, R. Smooth path planning in constrained environments. In Proceedings of the 2009 IEEE International Conference on Robotics and Automation, Kobe, Japan, 12–17 May 2009; pp. 3780–3785.
6. Karaman, S.; Frazzoli, E. Sampling-based algorithms for optimal motion planning. Int. J. Robot. Res. 2011, 30, 846–894. [CrossRef]
7. Spong, M.W.; Hutchinson, S.A.; Vidyasagar, M. Robot modeling and control. IEEE Control Syst. 2006, 26, 113–115.
8. Kavraki, L.E.; Kolountzakis, M.N.; Latombe, J.C. Analysis of probabilistic roadmaps for path planning. IEEE Trans. Robot. Autom. 1998, 14, 166–171. [CrossRef]
9. Kavraki, L.E.; Svestka, P.; Latombe, J.C.; Overmars, M.H. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Trans. Robot. Autom. 1996, 12, 566–580. [CrossRef]
10. Kavraki, L.E. Random Networks in Configuration Space for Fast Path Planning. Ph.D. Thesis, Stanford University, Stanford, CA, USA, 1995.
11. Kavraki, L.E.; Latombe, J.C.; Motwani, R.; Raghavan, P. Randomized query processing in robot path planning. J. Comput. Syst. Sci. 1998, 57, 50–60. [CrossRef]
12. Bohlin, R.; Kavraki, L.E. A Randomized Approach to Robot Path Planning Based on Lazy Evaluation. Comb. Optim. 2001, 9, 221–249.
13. Hsu, D.; Latombe, J.C.; Kurniawati, H. On the probabilistic foundations of probabilistic roadmap planning. Int. J. Robot. Res. 2006, 25, 627–643. [CrossRef]
14. Kuffner, J.J.; LaValle, S.M. RRT-connect: An efficient approach to single-query path planning. In Proceedings of the 2000 IEEE International Conference on Robotics and Automation, San Francisco, CA, USA, 24–28 April 2000; Volume 2, pp. 995–1001.
15. LaValle, S.; Kuffner, J. Rapidly-exploring random trees: Progress and prospects. In Algorithmic and Computational Robotics: New Directions; CRC Press: Boca Raton, FL, USA, 2000; pp. 293–308.
16. Janson, L.; Schmerling, E.; Clark, A.; Pavone, M. Fast marching tree: A fast marching sampling-based method for optimal motion planning in many dimensions. Int. J. Robot. Res. 2015, 34, 883–921. [CrossRef] [PubMed]
17. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2011.
18. Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed.; John Wiley & Sons, Inc.: New York, NY, USA, 1994.
19. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.; Veness, J.; Bellemare, M.; Graves, A.; Riedmiller, M.; Fidjeland, A.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [CrossRef] [PubMed]
20. Langford, J. Efficient Exploration in Reinforcement Learning. In Encyclopedia of Machine Learning and Data Mining; Sammut, C., Webb, G.I., Eds.; Springer US: Boston, MA, USA, 2017; pp. 389–392.
21. Tokic, M. Adaptive ε-greedy exploration in reinforcement learning based on value differences. In Annual Conference on Artificial Intelligence; Springer: Berlin, Germany, 2010; pp. 203–210.
22. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.M.O.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
23. Hasselt, H.v.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 2094–2100.
24. Fujimoto, S.; van Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. arXiv 2018, arXiv:1802.09477.
25. Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
26. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1928–1937.
27. Degris, T.; Pilarski, P.M.; Sutton, R.S. Model-free reinforcement learning with continuous action in practice. In Proceedings of the 2012 American Control Conference (ACC), Montreal, QC, Canada, 27–29 June 2012; pp. 2177–2182.
28. Degris, T.; White, M.; Sutton, R.S. Off-policy actor-critic. arXiv 2012, arXiv:1205.4839.
29. Bae, H.; Kim, G.; Kim, J.; Qian, D.; Lee, S. Multi-Robot Path Planning Method Using Reinforcement Learning. Appl. Sci. 2019, 9, 3057. [CrossRef]
30. Lv, L.; Zhang, S.; Ding, D.; Wang, Y. Path Planning via an Improved DQN-Based Learning Policy. IEEE Access 2019, 7, 67319–67330. [CrossRef]
31. Paul, S.; Vig, L. Deterministic policy gradient based robotic path planning with continuous action spaces. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 725–733.
32. Gu, S.; Holly, E.; Lillicrap, T.; Levine, S. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 3389–3396.
33. Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Abbeel, O.P.; Zaremba, W. Hindsight experience replay. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5048–5058.
34. Lozano-Pérez, T. Spatial Planning: A Configuration Space Approach. In Autonomous Robot Vehicles; Cox, I.J., Wilfong, G.T., Eds.; Springer: New York, NY, USA, 1990; pp. 259–271.
35. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 22–24 June 2014; pp. I-387–I-395.
36. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12; MIT Press: Cambridge, MA, USA, 2000; pp. 1057–1063.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).