
DWA-RL: Dynamically Feasible Deep Reinforcement Learning Policy

for Robot Navigation among Mobile Obstacles

Utsav Patel, Nithish K Sanjeev Kumar, Adarsh Jagan Sathyamoorthy and Dinesh Manocha

Abstract— We present a novel Deep Reinforcement Learning

(DRL) based policy to compute dynamically feasible and

spatially aware velocities for a robot navigating among mobile

obstacles. Our approach combines the beneﬁts of the Dynamic

Window Approach (DWA) in terms of satisfying the robot’s

dynamics constraints with state-of-the-art DRL-based naviga-

tion methods that can handle moving obstacles and pedestrians

well. Our formulation achieves these goals by embedding the

environmental obstacles’ motions in a novel low-dimensional

observation space. It also uses a novel reward function to

positively reinforce velocities that move the robot away from

the obstacle's heading direction, leading to a significantly lower

number of collisions. We evaluate our method in realistic 3-D

simulated environments and on a real differential drive robot in

challenging dense indoor scenarios with several walking pedes-

trians. We compare our method with state-of-the-art collision

avoidance methods and observe signiﬁcant improvements in

terms of success rate (up to 33% increase), number of dynamics

constraint violations (up to 61% decrease), and smoothness. We

also conduct ablation studies to highlight the advantages of our

observation space formulation and reward structure.

I. INTRODUCTION

There has been considerable interest in using Deep Re-

inforcement Learning (DRL)-based local planners [1], [2],

[3], [4] to navigate a non-holonomic/differential drive robot

through environments with moving obstacles and pedestri-

ans. They are effective in capturing and reacting to the

obstacles’ motion over time, resulting in excellent mobile

obstacle avoidance capabilities. In addition, these methods

employ inexpensive perception sensors such as RGB-D

cameras or simple 2-D lidars and do not require accurate

sensing of the obstacles. However, it is not guaranteed that

the instantaneous robot velocities computed by DRL-based

methods will be dynamically feasible [5], [6]. That is, the

computed velocities may not obey the acceleration and non-

holonomic constraints of the robot, making them impossible for the robot to execute. This leads to highly non-smooth and jerky trajectories.

Desirable behaviors such as computing dynamically fea-

sible velocities are developed using a DRL method’s reward

function, where they are positively rewarded and undesirable

behaviors such as collisions are penalized. However, a fully

trained policy could over-prioritize the collision avoidance

behavior over dynamic feasibility if the penalty for collision

is not appropriately balanced with the reward for computing

feasible velocities [7]. Therefore, acceleration limits and the

non-holonomic constraints of the robot may not be satisﬁed.

It is crucial that the policy account for such fundamental constraints, especially when the robot navigates among pedestrians and other mobile obstacles.

This work was supported in part by ARO Grants W911NF1910069, W911NF1910315 and Intel.

Fig. 1: Our robot avoiding mobile obstacles using dynamically feasible, smooth, spatially aware velocities. The red and violet arrows indicate the obstacles' motion, and the green arrows show the robot's trajectory at two time instances for different obstacle positions. Our hybrid approach, DWA-RL, considers the motion of the moving obstacles over time in its low-dimensional observation space, which is used to compute the robot's velocities. This results in fewer collisions than DWA [6] and DRL-based methods [4]. Since our method computes the robot's velocities based on DWA's feasible velocity space, the computed velocities are guaranteed to obey the acceleration and non-holonomic constraints of the robot.

Another issue with such methods [1], [2] is that they

use high-dimensional data such as RGB or depth images

as inputs during training to detect and observe obstacles.

This greatly increases the overall training time and makes

it harder for the policy to generalize the behaviors learnt in

one environment to another.

On the other hand, the Dynamic Window Approach

(DWA) [6], is a classic navigation algorithm that accounts

for the robot’s dynamics constraints and guarantees that the

velocities in a space known as the dynamic window are

collision-free and feasible/achievable for the robot within a

time horizon ∆t. However, DWA’s formulation only consid-

ers the robot’s sensor data at the current time instant to make

decisions. As a result, avoiding mobile obstacles becomes

challenging, leading to a higher number of collisions [8].

Main Results: We present a hybrid approach, DWA-

RL, that combines the beneﬁts of DWA and DRL-based

methods for navigation in the presence of mobile obstacles.

We present a DRL-based collision avoidance policy that

utilizes a novel observation space formulation and a novel

reward function to generate spatially aware, collision-free,

dynamically feasible velocities for navigation. We show that

our approach has superior performance compared to DWA

and a DRL-based method [4] in terms of success rate,

number of dynamics constraints violations, and smoothness.

The main contributions of our work include:

•A novel formulation for the observation space, based

on the concept of dynamic window, is used to train our

DRL-based navigation policy. The observation space is

constructed by calculating the robot’s feasible velocity

set at a time instant and the costs corresponding to using

those velocities in the past n time instants. This formu-

lation embeds the time evolution of the environment’s

state and preserves the dynamic feasibility guarantees

of DWA (Section IV). This leads to a significantly lower-dimensional observation space than in other DRL methods [1], [2], which in turn results in significantly lower training times and easier sim-to-real transfer of the fully trained policy.

•A novel reward function that is shaped such that the

robot’s navigation is more spatially aware of the obsta-

cles’ motion. That is, the robot is rewarded for navi-

gating in the direction opposite to the heading direction

of obstacles. This leads the robot to maneuver

around moving obstacles. This is different from DWA,

which might navigate directly into the path of a mobile

obstacle or collide with it. Overall, our approach reduces

the collision rate by 33% in dynamic environments as

compared to DWA.

We evaluate our method and highlight its beneﬁts over

prior methods in four high-ﬁdelity 3-D simulated environ-

ments that correspond to indoor and outdoor scenes with

many static and moving obstacles. To demonstrate the sim-

to-real capabilities of our method, we use DWA-RL to

navigate a real differential drive robot using a simple 2-D

lidar in indoor scenes with randomly walking pedestrians.

II. RELATED WORK

A. Collision Avoidance in Dynamic Scenes

Global collision avoidance methods [9], [10], [11] com-

pute an optimal trajectory for the entire route, but they

generally work offline, which is not suitable for dynamic

obstacles. On the other hand, vector-based local approaches

such as DWA [6] or other multi-agent methods [12] use

limited sensory information and are computationally efﬁcient

when avoiding static obstacles.

Several works have extended DWA’s capabilities to avoid

mobile obstacles by using techniques such as D* graph

search [13], look-ahead to detect non-convex obstacles [14],

or by extending beyond the local dynamic window to com-

pute possible future paths using a tree [15]. The Curvature-

Velocity method [16] is another method similar to DWA that formulates collision avoidance as a constrained optimization problem incorporating the goal and vehicle dynamics.

B. DRL-based Collision Avoidance

There have been numerous works on DRL-based collision

avoidance in recent years. Methods such as [17] use a deep

double-Q network for depth prediction from RGB images

for collision avoidance in static environments, whereas more advanced methods [18] use Convolutional Neural Networks to model end-to-end visuomotor navigation capabilities.

TABLE I: List of symbols used in our work and their definitions.

Symbol              Definition
k                   Number of linear or angular velocities the robot can execute at a time instant
n                   Number of past time instants used in the observation space formulation
v, ω                Robot's linear and angular velocities
v̇_l, ω̇_l            Robot's linear and angular acceleration limits
dist_obs^t(v, ω)    Distance to the nearest obstacle at time t from the trajectory generated by the velocity vector (v, ω)
dist(p_1, p_2)      Euclidean distance between p_1 and p_2 (two 2-D vectors)
v_a, ω_a            Robot's current linear and angular velocities
d_t                 Distance between the pedestrian and the robot
r                   Reward for a specific robot action during training
R_rob               Robot's radius
p_rob^t             Position vector of the robot in the odometry frame at time instant t
g                   Goal location vector
EndPoint(v_i, ω_i)  Function that computes the endpoint of the trajectory generated by a (v_i, ω_i) vector

An end-to-end obstacle avoidance policy for previously

unseen scenarios filled with static obstacles and a few pedestrians

is demonstrated in [19]. A decentralized, scalable, sensor-

level collision avoidance method was proposed in [4], whose

performance was improved using a new hybrid architecture

between DRL and Proportional-Integral-Derivative (PID)

control in [20]. Assuming that pedestrians aid in collision

avoidance, a cooperative model between a robot and pedestri-

ans was proposed in [21] for sparse crowds. An extension to

this work using LSTMs [22] to capture temporal information

enabled it to operate among a larger number of pedestrians.

A few deep learning-based works have also focused on

training policies that make the robot behave in a socially

acceptable manner [23], [24] and mitigate the freezing robot

problem [2], [25]. However, such policies do not provide

any guarantees on generating dynamically feasible robot

velocities.

III. BACKGROUND

In this section we provide an overview of the different

concepts and components used in our work.

A. Symbols and Notations

A list of symbols frequently used in this work is shown

in Table I. Rarely used symbols are deﬁned where they are

used.

B. Dynamic Window Approach

The Dynamic Window Approach (DWA) [6] mainly uses

the following two stages to search for a collision-free, and

reachable [v, ω] velocity vector in a 2-dimensional velocity

space known as the dynamic window. The dynamic window

is a discrete space with k² [v, ω] velocity vectors, where k

is the number of linear and angular velocities that the robot

can execute at any time instant.

1) Search Space: The goal of the ﬁrst stage is to generate

a space of reachable velocities for the robot. This stage

involves the following steps.

Velocity Vectors Generation: In this step, according

to the maximum linear and angular velocities the robot

can attain, a set V of [v, ω] vectors is generated. Each

velocity vector in the set corresponds to an arc of a different

radius along which the robot can move. The equations

describing the trajectory of the robot for different [v, ω]

vectors can be found in [6].

Admissible Velocities: After forming set V, for each

[v, ω] ∈ V, the distance to the nearest obstacle from its cor-

responding arc is computed. The [v, ω] vector is considered

admissible only if the robot is able to stop before it collides

with the obstacle. The admissible velocity set Vad is given

by,

$$V_{ad} = \Big\{ (v, \omega) \;\Big|\; v \le \sqrt{2 \cdot dist_{obs}(v, \omega) \cdot \dot{v}_b},\;\; \omega \le \sqrt{2 \cdot dist_{obs}(v, \omega) \cdot \dot{\omega}_b} \Big\} \quad (1)$$

where dist_obs(v, ω) is the distance to the nearest obstacle on the arc, and v̇_b, ω̇_b denote the robot's braking decelerations.

Dynamic Window: The next step is to further prune the

set Vad to remove the velocities that are not achievable within

a ∆t, considering the robot's linear and angular acceleration

limits. This ﬁnal set is called the dynamic window and is

formulated as,

$$V_d = \{ (v, \omega) \mid v \in [v_a - \dot{v}_l \cdot \Delta t,\; v_a + \dot{v}_l \cdot \Delta t],\;\; \omega \in [\omega_a - \dot{\omega}_l \cdot \Delta t,\; \omega_a + \dot{\omega}_l \cdot \Delta t] \} \quad (2)$$

2) Objective Function Optimization: In the second stage,

the [v, ω] that maximizes the objective function defined in equation 3 is searched for in V_d.

$$G(v, \omega) = \sigma\big(\alpha \cdot heading(v, \omega) + \beta \cdot dist_{obs}(v, \omega) + \gamma \cdot vel(v, \omega)\big) \quad (3)$$

For a [v, ω] executed by the robot, heading() measures the robot's progress towards the goal (more progress ⟹ higher value), dist_obs() measures the robot's distance from the nearest obstacle (more distance ⟹ higher value), and vel() checks that v ≠ 0. α, β and γ denote weighting constants that can be tuned by the user.

Obstacle information embedded in the velocity space is

utilized to select the optimal velocity pair. The [v, ω] vector

computed by DWA may be a local minimum. However, this

issue can be mitigated if the connectivity of free space to

the goal is known.
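To make the two stages concrete, the following minimal Python sketch of a DWA step builds the dynamic window of equation 2, filters it with the admissibility condition of equation 1, and maximizes the objective of equation 3 (with σ omitted). The robot limits, discretization, weights and helper names are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative robot limits and parameters (assumed values, not from the paper).
V_MAX, W_MAX = 0.65, 3.14   # max linear (m/s) and angular (rad/s) velocities
ACC_V, ACC_W = 1.0, 2.0     # linear/angular acceleration (and braking) limits
DT = 0.25                   # time horizon Delta-t
K = 10                      # discretization per velocity dimension

def dist_obs(v, w, obstacles, steps=10):
    """Distance from the arc traced by (v, w) over Delta-t to the nearest obstacle."""
    x = y = th = 0.0
    d_min = np.inf
    for _ in range(steps):
        th += w * DT / steps
        x += v * np.cos(th) * DT / steps
        y += v * np.sin(th) * DT / steps
        for ox, oy in obstacles:
            d_min = min(d_min, np.hypot(ox - x, oy - y))
    return d_min

def dynamic_window(va, wa):
    """Equation 2: velocities reachable within Delta-t, clipped to the robot limits."""
    vs = np.linspace(max(0.0, va - ACC_V * DT), min(V_MAX, va + ACC_V * DT), K)
    ws = np.linspace(max(-W_MAX, wa - ACC_W * DT), min(W_MAX, wa + ACC_W * DT), K)
    return [(v, w) for v in vs for w in ws]   # k^2 velocity pairs

def dwa_step(va, wa, goal, obstacles, alpha=0.8, beta=0.1, gamma=0.1):
    best, best_g = (0.0, 0.0), -np.inf
    for v, w in dynamic_window(va, wa):
        d = dist_obs(v, w, obstacles)
        if v > np.sqrt(2.0 * d * ACC_V):   # admissibility check (equation 1)
            continue
        th = w * DT                        # approximate endpoint of the arc
        x, y = v * np.cos(th) * DT, v * np.sin(th) * DT
        heading = -np.hypot(goal[0] - x, goal[1] - y)  # more progress => higher value
        g = alpha * heading + beta * d + gamma * v     # equation 3 (sigma omitted)
        if g > best_g:
            best, best_g = (v, w), g
    return best

print(dwa_step(0.3, 0.0, goal=(4.0, 1.0), obstacles=[(1.5, 0.2)]))
```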

C. DRL Policy Training

DRL-based collision avoidance policies are usually trained

in simulated environments (similar to Fig.9) using a robot

that executes the policy being trained to perform actions based on environmental observations and earn rewards. The

robot’s observation consists of information regarding its

environment (such as the positions of obstacles), and the set

of all observations that the robot’s sensors can make is called

its observation space (ot). The robot’s actions are represented

by the velocities that it can execute, and the set of all the

robot’s velocities is called its action space (at).

The policy’s objective during training is to maximize

a reward function by performing the actions that are rewarded and avoiding actions that are penalized. This proceeds until the robot consistently achieves the maximum reward for several consecutive training iterations. Collision-

free velocities can then be computed from the fully trained

policy π as,

$$[v, \omega] \sim \pi(a_t \mid o_t) \quad (4)$$

IV. OUR APPROACH

In this section, we explain the construction of our novel

observation space, the reward function, and our network

architecture.

A. Observation Space Generation

The steps used in the observation space construction are

detailed below.

1) Dynamically Feasible Velocity Vectors: Unlike DWA,

we do not ﬁrst generate an admissible velocity set that con-

tains collision-free robot velocities. Instead, we ﬁrst compute

sets of feasible/reachable linear and angular velocities (lin = [v_a − v̇_l·∆t, v_a + v̇_l·∆t] and ang = [ω_a − ω̇_l·∆t, ω_a + ω̇_l·∆t]) using equation 2. We discretize these sets lin and ang into k intervals each, such that the total number of [v, ω] vectors obtained from the intervals is k². We then form the set of feasible velocities V_f from these discretized sets as,

$$V_f = \{ (v, \omega) \mid v \in lin_k,\; \omega \in ang_k \} \quad (5)$$

The velocity vectors in V_f do not account for the locations of the obstacles at the current time instant t_c or the past n−1 time instants. Therefore, some velocities in V_f could lead to collisions. The k² linear and angular velocities in V_f are appended n−1 times as column vectors in two matrices, each of size (k² × n); the generated linear and angular velocity matrices are shown in Fig. 2(a).
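For illustration, this step can be written in a few lines of Python (a sketch; the acceleration limits, k, n and ∆t values are assumptions):

```python
import numpy as np

K, N = 10, 4             # k velocities per dimension, n time instants (assumed)
DT = 0.25                # Delta-t (assumed)
ACC_V, ACC_W = 1.0, 2.0  # assumed acceleration limits

def feasible_velocities(va, wa):
    """Equation 5: discretize the dynamic window into k^2 (v, w) pairs."""
    lin = np.linspace(va - ACC_V * DT, va + ACC_V * DT, K)
    ang = np.linspace(wa - ACC_W * DT, wa + ACC_W * DT, K)
    v_grid, w_grid = np.meshgrid(lin, ang, indexing="ij")
    return v_grid.ravel(), w_grid.ravel()   # two vectors of length k^2

v_f, w_f = feasible_velocities(va=0.3, wa=0.0)
# Repeat each column n times to obtain the (k^2 x n) velocity matrices of Fig. 2(a).
lin_matrix = np.tile(v_f[:, None], (1, N))
ang_matrix = np.tile(w_f[:, None], (1, N))
print(lin_matrix.shape, ang_matrix.shape)   # (100, 4) (100, 4)
```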

2) Obstacle sets: We use a 2-D lidar scan to sense the

location of the obstacles around the robot. For each time

instant, the obstacle locations are obtained relative to a ﬁxed

odometry coordinate frame and stored in a set. The odometry

frame is attached to the ground at the location from where the

robot started. In Fig. 3(a), the locations of two obstacles in

the current as well as in the past n-1 time steps are shown.

We add each set of obstacle locations to a list O of length n (see Fig. 3(b)), where each row holds the set of obstacle locations for a specific time instant. We use O to incorporate information regarding the motion of the various obstacles in the environment.
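A sketch of how the list O can be maintained from successive lidar scans (the scan-to-point conversion and the robot pose handling are simplified placeholders, not the authors' code):

```python
from collections import deque
import numpy as np

N = 4                             # number of time instants kept (assumed)
obstacle_sets = deque(maxlen=N)   # the list O; the oldest set is dropped automatically

def scan_to_points(ranges, angle_min, angle_inc, robot_pose):
    """Convert one 2-D lidar scan into obstacle points in the fixed odometry frame."""
    x, y, theta = robot_pose      # robot pose in the odometry frame
    points = []
    for i, r in enumerate(ranges):
        if np.isfinite(r):
            a = theta + angle_min + i * angle_inc
            points.append((x + r * np.cos(a), y + r * np.sin(a)))
    return points

def update_obstacle_sets(ranges, angle_min, angle_inc, robot_pose):
    """Append the current obstacle set; O then holds the last n time instants."""
    obstacle_sets.append(scan_to_points(ranges, angle_min, angle_inc, robot_pose))
```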

3) Obstacle cost calculation: Next, we calculate the obstacle cost for every velocity vector in V_f using the dist_obs^t() function. Each vector in V_f is forward simulated for a time duration ∆t to check if it leads to a collision, given the obstacle positions in O. The costs are calculated as,

$$OC_i^{t_j} = \begin{cases} c_{col} & \text{if } dist_{obs}^{t_j}(v_i, \omega_i) < R_{rob}, \\ \dfrac{1}{dist_{obs}^{t_j}(v_i, \omega_i)} & \text{otherwise}, \end{cases} \quad (6)$$

where c_col = 40. Fig. 2(a) shows the obtained (k² × n) obstacle cost matrix.

Fig. 2: (a) [Top] The initial construction of our observation space. First, the linear and angular velocity matrices ([v, ω] ∈ V_f), along with their obstacle and goal-alignment cost matrices for n time instants, are constructed. [Bottom] The values in the four matrices are rearranged in descending order of the total cost value at the current time instant. In the case shown, [v14, ω14] has the least total cost. (b) [Top] The total costs TC for each velocity vector belonging to set V_f. [Bottom] The velocity vectors rearranged in descending order of the total cost at the current time instant. (c) The action space for the current time step, obtained by sorting the feasible velocity vectors (v, ω) in descending order of the total cost value at the current time instant.

Fig. 3: Obstacle set construction. (a) The change in the location of the obstacle over the past n−1 time instants with respect to the location of the robot at the current time instant. The red region (R) and the green region (G) denote the regions of high and low risk of collision, respectively. They depend on the relative positions and motions of the robot and the obstacle. (b) The list of obstacle sets obtained at various time instants; each column corresponds to the locations of the obstacles at a particular time instant.
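A minimal Python sketch of equation 6 follows (the forward simulation is simplified to a single arc rollout; c_col = 40 is from the paper, while R_rob and ∆t are assumed values):

```python
import numpy as np

C_COL = 40.0            # c_col from the paper
R_ROB, DT = 0.2, 0.25   # assumed robot radius and Delta-t

def rollout(v, w, steps=10):
    """Forward simulate (v, w) for Delta-t from the robot's current pose."""
    x = y = th = 0.0
    traj = []
    for _ in range(steps):
        th += w * DT / steps
        x += v * np.cos(th) * DT / steps
        y += v * np.sin(th) * DT / steps
        traj.append((x, y))
    return traj

def obstacle_cost(v, w, obstacle_set):
    """Equation 6 for one velocity pair against one time instant's obstacle set."""
    d = min(np.hypot(ox - x, oy - y)
            for x, y in rollout(v, w) for ox, oy in obstacle_set)
    return C_COL if d < R_ROB else 1.0 / d

def obstacle_cost_matrix(v_f, w_f, obstacle_sets):
    """The (k^2 x n) obstacle cost matrix of Fig. 2(a)."""
    return np.array([[obstacle_cost(v, w, s) for s in obstacle_sets]
                     for v, w in zip(v_f, w_f)])
```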

4) Goal alignment cost calculation: Each [v, ω] in V_f is forward simulated for a time ∆t, and the distance from the endpoint of the trajectory to the robot's goal is measured (equation 7). The velocity vectors that reduce the distance between the robot and the goal location are given a low cost.

$$GC_i^{t_c} = dist(EndPoint(v_i, \omega_i), \mathbf{g}) \cdot c_{ga} \quad (7)$$

The goal alignment cost is independent of the locations of the obstacles around the robot; therefore, the same cost for each pair is appended n times to obtain a goal-alignment cost matrix of shape (k² × n), as seen in Fig. 2(a) and in equation 8,

$$GC_i^{t_c} = GC_i^{t_{c-1}} = \cdots = GC_i^{t_{c-(n-1)}} \quad (8)$$

where c_ga = 2.5.

5) Total cost calculation: The total cost for the robot using a vector [v_i, ω_i] at the current time instant t_c is calculated as,

$$TC_i^{t_c} = OC_i^{t_c} + GC_i^{t_c} \quad (9)$$

and is shown in Fig. 2(b).

6) Sorting the Velocity Vectors: The linear, angular, ob-

stacle cost and goal alignment cost matrices obtained in

Section IV-A are now reordered to better represent which

velocities in V_f have the lowest costs given the obstacle positions over the past n time instants. The velocity vectors are sorted in ascending order according to their total cost at the current time instant, and the elements in the velocity and cost matrices are then reordered in the same order.

7) Observation Space and Action Space: Finally, our

observation space is constructed by stacking the reordered linear and angular velocity matrices along with the obstacle and goal-alignment cost matrices to obtain a matrix of size (k² × n × 4). Our action space is the reordered set of feasible velocities for the robot at the current time instant (see Fig. 2(c)). The observation space is then passed to the policy network (see Fig. 4).
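Combining the pieces above, the following sketch reorders the velocities and stacks the four matrices into the (k² × n × 4) observation. It reuses obstacle_cost_matrix from the earlier sketch; c_ga = 2.5 is from the paper, while the simplified EndPoint and the remaining constants are our assumptions:

```python
import numpy as np

C_GA, DT = 2.5, 0.25   # c_ga from the paper; Delta-t assumed

def goal_alignment_cost(v, w, goal):
    """Equation 7: distance from the trajectory endpoint to the goal, scaled by c_ga."""
    th = w * DT
    end = (v * np.cos(th) * DT, v * np.sin(th) * DT)   # simplified EndPoint(v_i, w_i)
    return np.hypot(goal[0] - end[0], goal[1] - end[1]) * C_GA

def build_observation(v_f, w_f, obstacle_sets, goal, n):
    lin = np.tile(v_f[:, None], (1, n))
    ang = np.tile(w_f[:, None], (1, n))
    oc = obstacle_cost_matrix(v_f, w_f, obstacle_sets)       # (k^2 x n), equation 6
    gc = np.tile([[goal_alignment_cost(v, w, goal)]
                  for v, w in zip(v_f, w_f)], (1, n))        # (k^2 x n), equation 8
    tc = oc[:, -1] + gc[:, -1]          # equation 9; last column = current instant
    order = np.argsort(tc)              # ascending total cost (Section IV-A.6)
    obs = np.stack([lin[order], ang[order], oc[order], gc[order]], axis=-1)
    action_space = list(zip(v_f[order], w_f[order]))   # reordered velocity pairs
    return obs, action_space            # obs has shape (k^2 x n x 4)
```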

B. DRL Navigation Framework

In this section, we detail the other components of our DRL

policy’s training, and run-time architecture.

Fig. 4: Our method's run-time architecture. The observations, such as the obstacle positions measured by the robot's sensors (a lidar in our case) and the robot's position and velocity at time t_c, along with the obstacle and goal-alignment costs, are reordered (Section IV-A.6) to generate a (k² × n × 4)-dimensional observation space (Section IV-A.7), shown in green, corresponding to time instant t_c. The fully trained DRL policy network (shown in Fig. 5) uses the observation space to compute the index of the output velocity in the action space.

Fig. 5: Architecture of the policy network used for training. The input observation space is marked in blue and the network layers are marked in orange. The initial 5 layers are convolutional layers and the remaining 3 layers are fully connected layers. The output of the network is marked in green.

1) Reward Function Shaping: Rewards for the basic navigation task of reaching the goal and avoiding collisions with obstacles are provided as high positive and negative values, respectively. In order to make the training faster, the difference between the distance from the goal at the previous and at the current time instant is utilized in the reward function. This incentivizes the policy to move the robot closer to the goal each time step, or otherwise be penalized, as,

$$(r_g)^t = \begin{cases} r_{goal} & \text{if } dist(p_{rob}^t, \mathbf{g}) < 0.3\,\text{m}, \\ -2.5\,\big(dist(p_{rob}^t, \mathbf{g}) - dist(p_{rob}^{t-1}, \mathbf{g})\big) & \text{otherwise}. \end{cases} \quad (10)$$

We define the penalty for collision as,

$$(r_c)^t = \begin{cases} r_{collision} & \text{if } dist(p_{rob}^t, p_{obs}^t) < 0.5\,\text{m}, \\ 0 & \text{otherwise}. \end{cases} \quad (11)$$

When the distance between a dynamic obstacle (located at p_obs^t = [x_obs^t, y_obs^t]) and the robot (located at p_rob^t = [x_rob^t, y_rob^t]) is less than a certain threshold, the robot receives the steering reward/penalty (equation 13). The parameters d_t and b_t which influence this reward are depicted in Fig. 3(a) and defined as follows,

$$d_t = dist(p_{rob}^t, p_{obs}^t), \qquad b_t = y_{rob}^t - y_{obs}^t \quad (12)$$

$$(r_{steer})^t = \begin{cases} -|b_t| \cdot r_{spatial} - \dfrac{r_{proximity}}{d_t} & \text{if } p_{rob}^t \in \mathcal{R}, \\ +|b_t| \cdot r_{spatial} & \text{if } p_{rob}^t \in \mathcal{G}. \end{cases} \quad (13)$$

From equation 13, it can be seen that the robot is rewarded positively when it is in the green region G (behind the obstacle) shown in Fig. 3(a) and penalized when it is in the red region R (along the obstacle's heading direction). This reinforces spatially aware velocities when handling dynamic obstacles, i.e., velocities which move the robot away from an obstacle's heading direction, thereby reducing the risk of collision.

Proposition IV.1. Region Rhas a high risk of collision.

Proof. The squared distance between the robot and the obstacle is

$$D^2 = (p_x^{rob} - p_x^{obs})^2 + (p_y^{rob} - p_y^{obs})^2 \quad (14)$$

We prove that the danger of collision in the red zone is high, since there the distance between the dynamic obstacle and the robot is decreasing. To see this, we differentiate both sides of equation 14 with respect to time,

$$2D\frac{dD}{dt} = 2(p_x^{rob} - p_x^{obs})\left(\frac{dp_x^{rob}}{dt} - \frac{dp_x^{obs}}{dt}\right) + 2(p_y^{rob} - p_y^{obs})\left(\frac{dp_y^{rob}}{dt} - \frac{dp_y^{obs}}{dt}\right) \quad (15)$$

From Fig. 3(a), we get the following conditions for the case where the obstacle moves to the left (with a positive Y component in its velocity) in the odometry coordinate frame. Note that the conditions also hold if the obstacle has any velocity that moves it into R.

$$\text{if } (p_x^{rob}, p_y^{rob}) \in \mathcal{R}: \quad (p_y^{rob} - p_y^{obs}) > 0, \;\; (p_x^{rob} - p_x^{obs}) < 0, \;\; (v_y^{rob} - v_y^{obs}) < 0, \;\; (v_x^{rob} - v_x^{obs}) > 0 \quad (16)$$

$$\text{if } (p_x^{rob}, p_y^{rob}) \in \mathcal{G}: \quad (p_y^{rob} - p_y^{obs}) < 0, \;\; (p_x^{rob} - p_x^{obs}) < 0, \;\; (v_y^{rob} - v_y^{obs}) < 0, \;\; (v_x^{rob} - v_x^{obs}) > 0 \quad (17)$$

Equation 15 implies,

$$\frac{dD}{dt} = \frac{1}{D}\Big[(p_x^{rob} - p_x^{obs})(v_x^{rob} - v_x^{obs}) + (p_y^{rob} - p_y^{obs})(v_y^{rob} - v_y^{obs})\Big] \quad (18)$$

Substituting the conditions in equation 16 and considering comparable velocities for the robot and the obstacle,

$$\frac{dD}{dt} < 0 \quad (19)$$

So, dist(p_rob^t, p_obs^t) is always a decreasing function in R. This implies a higher risk of collision.

Substituting the conditions in equation 17,

$$\frac{dD}{dt} > 0 \quad (20)$$

In G, if we have |v_y^{rob} − v_y^{obs}| ≫ |v_x^{rob} − v_x^{obs}|, then, based on the signs of these components, the right-hand side of equation 15 will be positive. This implies that dist(p_rob^t, p_obs^t) will be an increasing function in G if v_y^{rob} is highly negative along the y-axis. This is intuitive, as a high velocity towards the negative y direction (the G zone) is required to generate a spatially aware trajectory in the given scenario. Indirectly, velocities with a highly negative v_y^{rob} are positively rewarded in our formulation.

When the obstacle moves towards the right relative to the odometry coordinate frame, the proof is symmetric and still shows that dist(p_rob^t, p_obs^t) is a decreasing function in the corresponding R.

In the case of an obstacle moving head-on, the total steering reward is zero. In the presence of multiple dynamic obstacles around the robot, the union of the red zones is used to compute the total negative reward.

This is supplemented by providing negative rewards inversely proportional to the distance from all the obstacles within the robot's sensor range. This reduces the danger of collision, as negative reward accumulates as the robot approaches an obstacle.

$$(r_{dangerOfCollision})^t = -\frac{r_{dCollision}}{d_t} \quad (21)$$

We set r_goal = 2000, r_collision = −2000, r_proximity = 10, r_spatial = 25, r_dCollision = 30.
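For illustration, the reward terms above can be assembled as follows. This is a sketch under our own simplifications: membership in R versus G is reduced to a sign test on b_t and the obstacle's lateral velocity, and the gating distance for the steering reward is an assumed value.

```python
import numpy as np

R_GOAL, R_COLLISION = 2000.0, -2000.0
R_PROXIMITY, R_SPATIAL, R_DCOLLISION = 10.0, 25.0, 30.0
STEER_RANGE = 3.0   # assumed distance threshold that gates the steering reward (m)

def reward(p_rob, p_rob_prev, p_obs, v_obs_y, goal):
    d_goal = np.hypot(goal[0] - p_rob[0], goal[1] - p_rob[1])
    d_goal_prev = np.hypot(goal[0] - p_rob_prev[0], goal[1] - p_rob_prev[1])
    d_t = np.hypot(p_obs[0] - p_rob[0], p_obs[1] - p_rob[1])
    b_t = p_rob[1] - p_obs[1]   # equation 12

    # Equation 10: goal-reaching reward.
    r_g = R_GOAL if d_goal < 0.3 else -2.5 * (d_goal - d_goal_prev)
    # Equation 11: collision penalty.
    r_c = R_COLLISION if d_t < 0.5 else 0.0
    # Equation 13: steering reward/penalty. Membership in R (the obstacle's
    # heading region) is reduced here to a sign test on b_t and the obstacle's
    # lateral velocity, a simplification of Fig. 3(a).
    r_steer = 0.0
    if d_t < STEER_RANGE:
        in_red = (b_t > 0) == (v_obs_y > 0)
        r_steer = (-abs(b_t) * R_SPATIAL - R_PROXIMITY / d_t) if in_red \
                  else abs(b_t) * R_SPATIAL
    # Equation 21: danger-of-collision penalty.
    r_danger = -R_DCOLLISION / d_t
    return r_g + r_c + r_steer + r_danger
```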

2) Network Architecture: The policy network architecture

that we use is shown in Fig. 5. Five 2-D convolutional layers followed by 3 fully connected layers are used to process the observation space. ReLU activation is applied between the hidden layers. This architecture is much simpler than the deep networks required for image-based inputs and needs fewer layers to handle our low-dimensional observation space.
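A sketch of such a network in TensorFlow/Keras follows. The paper fixes only the layer counts and ReLU activations (Fig. 5); the filter counts, kernel sizes and dense widths below are our assumptions.

```python
import tensorflow as tf

K2, N = 100, 4   # k^2 feasible velocities and n time instants (assumed values)

def build_policy_network():
    """Five 2-D convolutional layers followed by three fully connected layers
    (Fig. 5). The filter counts, kernel sizes and dense widths are assumptions."""
    inp = tf.keras.Input(shape=(K2, N, 4))          # the (k^2 x n x 4) observation
    x = inp
    for filters in (32, 32, 64, 64, 128):           # five convolutional layers
        x = tf.keras.layers.Conv2D(filters, (3, 3), padding="same",
                                   activation="relu")(x)
    x = tf.keras.layers.Flatten()(x)
    for units in (256, 128):                        # two hidden dense layers
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    out = tf.keras.layers.Dense(K2)(x)              # logits over the k^2 action indices
    return tf.keras.Model(inp, out)

model = build_policy_network()
model.summary()
```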

3) Policy Training: We simulate multiple Turtlebot2

robots each with an attached lidar to train the models. The

Turtlebots are deployed in different scenarios in the same

simulation environment to ensure that the model does not

overﬁt to any one scenario. Our policy ﬁnishes training in

less than 20 hours, which is signiﬁcantly less than the 6 days

it takes to train methods such as [1], [2], which use similar

training environments.

4) Run-time Architecture: The output of a fully trained

policy network is the index i that corresponds to a velocity pair in the action space. The [v, ω] vector at the i-th location in the action space is then used by the robot for navigation at the current time instant t_c.
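At run time, velocity selection therefore reduces to an index lookup; a sketch, assuming the model and action space from the earlier snippets (PPO2's policy is stochastic during training, and the argmax here is a deterministic simplification):

```python
import numpy as np

def select_velocity(model, obs, action_space):
    """Look up the velocity pair whose index the trained policy outputs."""
    logits = model(obs[None, ...], training=False)   # batch of one observation
    i = int(np.argmax(logits.numpy()))               # index into the action space
    return action_space[i]                           # a dynamically feasible [v, w]
```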

Fig. 6: The plot shows the reward values obtained by the policy

during training. The horizontal axis shows the number of steps

executed by the policy in the training environment and the vertical

axis shows the reward value for each step. The policy converges

to a stable reward value after executing about 700k steps in the

training environment.

Proposition IV.2. The velocity chosen by our fully trained

policy will always obey the dynamics constraints of the robot.

Proof. The proof follows trivially from the fact that our

action space is a subset of our observation space (Fig. 2(c)),

which in turn is constructed using the dynamic feasibility

equations of DWA. Thus, our policy preserves the dynamic

feasibility guarantees of DWA.

Our full run-time architecture is shown in Fig.4.

V. RESULTS, COMPARISONS AND EVALUATIONS

A. Implementation

We use ROS Melodic and Gazebo 9 to create the

simulation environments for training and evaluation on a workstation with an Intel Xeon 3.6 GHz processor and an Nvidia GeForce RTX 2080 Ti GPU. We implement the policy

network using TensorFlow and use the PPO2 implementation provided by stable-baselines to train our policy.
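For reference, training with the stable-baselines PPO2 implementation follows the usual pattern. `DwaRlEnv` is a hypothetical Gym wrapper around the simulator, and the hyperparameters below are assumptions rather than the paper's values:

```python
from stable_baselines import PPO2
from stable_baselines.common.vec_env import SubprocVecEnv

# DwaRlEnv is a hypothetical gym.Env wrapper around the Gazebo simulation that
# returns the (k^2 x n x 4) observation and exposes a Discrete(k^2) action space.
from dwa_rl_env import DwaRlEnv

if __name__ == "__main__":
    # Four parallel simulated robots collect experience (Section V-B).
    env = SubprocVecEnv([lambda: DwaRlEnv() for _ in range(4)])
    # The paper uses the custom 5-conv/3-dense network of Fig. 5; a custom
    # policy class would be passed here. MlpPolicy is a stand-in.
    model = PPO2("MlpPolicy", env, n_steps=128, verbose=1)
    model.learn(total_timesteps=700_000)  # reward converges near 700k steps (Fig. 6)
    model.save("dwa_rl_policy")
```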

To test the policy’s sim-to-real transfer and generalization

capabilities, we use it to navigate a Turtlebot 2 and a Jackal

robot in challenging indoor scenes with randomly moving

pedestrians (see attached video). DWA-RL does not require

accurate sensing of the obstacles’ positions in real-world

scenes.

B. Training Scenario

The training environment used to train the DWA-RL policy

is shown in Fig.7. We use 4 robots in the environment that

collect the training data in parallel, speeding up the overall

training process. Each robot in the training environment

encounters different types of static and dynamic obstacles while navigating towards its goal. This training methodology ensures that the policy does not overfit to a particular scenario and generalizes well during the testing phase.

C. Testing Scenarios

We evaluate DWA-RL and compare it with prior methods in the following scenarios (see Fig. 9).

Zigzag-Static: This scenario contains several sharp turns

with a number of static obstacles to resemble a cluttered

indoor environment.

Fig. 7: The training environment used to train the DWA-

RL policy. We use four robots in parallel to collect the data

for policy training, each facing different static and mobile

obstacles.

Fig. 8: DWA-RL tested on a Turtlebot in a dense scenario. This

shows that DWA-RL can be easily transferred from simulations

to different real-world robot platforms. The trajectories of the

obstacles are shown in blue and red. The robot's trajectory is shown

in green.

Occluded-Ped: This scenario contains several sharp turns

and two pedestrians who could be occluded by the walls.

Sparse-Dynamic: This scenario contains 4 walking pedes-

trians in a corridor-like setting, moving at 45° or 90° angles to the line connecting the robot's start and goal locations.

Dense-Dynamic: This scenario contains 17 pedestrians in an area of 13 × 8 m², who could be static or moving, and resembles dense, dynamic outdoor environments.

D. Evaluation Metrics

We compare our approach with: (i) Dynamic Window

Approach [6] (ii) Long et al.’s method [4]. We also provide

ablation studies to demonstrate the effects of our various

design choices while formulating the observation space and

reward function. We use the following metrics to compare

the methods and the ablation study models.

•Success Rate - The number of times the robot reached

its goal without colliding with obstacles over 50 trials.

The obstacles’ initial positions are randomly assigned

in each trial.

•Average Trajectory Length - The total distance tra-

versed by the robot, until the goal is reached, averaged

over the number of successful trials.

•Average Velocity - The trajectory length divided by the time taken to reach the goal in a successful trial.

TABLE II: Relative performance of DWA-RL versus DWA [6] and Long et al.'s method [4].

Metric                Method   Zigzag-Static  Occluded-Ped  Sparse-Dynamic  Dense-Dynamic
Success Rate          DWA-RL   1.0            1.0           0.54            0.4
                      DRL      1.0            0.76          0.33            0.31
                      DWA      1.0            0.9           0.42            0.3
Avg Traj. Length (m)  DWA-RL   28.85          27.26         11.95           12.23
                      DRL      26.89          25.63         12.1            11.57
                      DWA      26.62          25.07         11.99           12.81
Avg Velocity (m/s)    DWA-RL   0.38           0.46          0.42            0.37
                      DRL      0.42           0.51          0.37            0.503
                      DWA      0.435          0.568         0.6             0.64

TABLE III: Ablation study showing the relative performance of the models trained with positive reinforcement (PR) and without positive reinforcement.

Metric                Method      Zigzag-Static  Occluded-Ped  Sparse-Dynamic  Dense-Dynamic
Success Rate          With PR     1.0            1.0           0.54            0.42
                      Without PR  0.86           0.64          0.44            0.4
Avg Traj. Length (m)  With PR     28.85          27.26         11.95           12.46
                      Without PR  28.12          27.86         11.6            12.78
Avg Velocity (m/s)    With PR     0.37           0.47          0.42            0.38
                      Without PR  0.34           0.41          0.41            0.34

E. Analysis and Comparison

The results of our comparisons and ablation studies are

shown in Tables II, III and IV.

From Table II, we observe that in terms of success rate, all approaches perform well in the Zigzag-Static scenario. However, in the environments with mobile obstacles, DWA-RL collides significantly fewer times. This is

because DWA-RL considers the obstacles' motion over time (in the observation space) and computes velocities that avoid the region in front of the obstacle (reinforced in the reward function). DWA and Long et al.'s method try to avoid the obstacles from the front and collide, especially in the Occluded-Ped scenario,

where obstacles are introduced suddenly. Even with limited

temporal information, DWA-RL always guides the robot

in the direction opposite to the obstacle’s motion, thereby

reducing the chances of a collision. DWA-RL achieves this

while maintaining comparable average trajectory lengths

and velocities for the robot.

Ablation Study for the Positive Reinforcement: We

compare two policies trained with and without the positive

reinforcement (PR) term (|b_t| · r_spatial) in equation 13 in

different test environments. From Table III, we observe that

the policy trained with PR outperforms the model trained

without it in all the test environments. The policy trained

without PR mostly tries to avoid an obstacle by navigating

in front of it, predominantly resulting in collisions.

Ablation Study for the Observation Space: Our obser-

vation space uses four matrices stacked together, as shown in Fig. 4, which include the velocities and the obstacle and goal-alignment costs. We compare this formulation with one that uses three matrices: the linear and angular velocity

matrices and a total cost matrix stacked together. The total

cost matrix is the sum of the obstacle and goal-alignment cost

matrices. The results for both policies are shown in Table IV. We observe that the 4-matrix formulation outperforms the 3-matrix formulation in all the scenarios. This is because the information about the environmental obstacles is imparted to the policy better when the obstacle cost is provided separately.

Fig. 9: Different testing scenarios used to evaluate the collision avoidance approaches. (a) Zigzag-Static scenario; (b) Occluded-Ped scenario, where the dynamic obstacles are introduced suddenly; (c) Sparse-Dynamic scenario; (d) Dense-Dynamic environment, which contains a combination of static and dynamic obstacles.

TABLE IV: Ablation study showing the relative performance of the models trained with the 3-matrix and the 4-matrix observation space formulations.

Metric                Method    Zigzag-Static  Occluded-Ped  Sparse-Dynamic  Dense-Dynamic
Success Rate          3-matrix  0.82           0.94          0.36            0.36
                      4-matrix  1.0            1.0           0.54            0.42
Avg Traj. Length (m)  3-matrix  27.93          27.04         11.56           12.34
                      4-matrix  28.85          27.26         11.95           12.46
Avg Velocity (m/s)    3-matrix  0.38           0.44          0.46            0.37
                      4-matrix  0.37           0.47          0.42            0.38

Fig. 10: Graph showing the change in the linear velocity generated by Long et al.'s approach, along with the maximum and minimum achievable velocities at each time instant. For this experiment, we use a Turtlebot 2 with a maximum angular velocity, minimum angular velocity and maximum angular acceleration limit of 3.14 rad/s, −3.14 rad/s and 2 rad/s², respectively.

Dynamics Constraints Violation: Figs. 10 and 11 show the linear and angular velocities generated by Long et al.'s method [4] in the Dense-Dynamic environment. We observe that the output angular velocities lie outside the robot's maximum and minimum attainable angular velocities 61% of the time, leading to oscillatory/jerky motion. DWA-RL, on the other hand, produces velocities that always lie within the attainable velocity range (Figs. 12 and 13). This results in considerably smoother robot trajectories.
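The violation statistics can be computed directly from a logged command sequence; a sketch, assuming the Turtlebot 2 limits quoted above and an assumed control period:

```python
import numpy as np

W_MAX, ACC_W = 3.14, 2.0   # Turtlebot 2 limits quoted above (rad/s, rad/s^2)
DT = 0.1                   # assumed control period (s)

def angular_violation_rate(omegas):
    """Fraction of commanded angular velocities outside the attainable window."""
    violations = 0
    for prev, cmd in zip(omegas[:-1], omegas[1:]):
        lo = max(-W_MAX, prev - ACC_W * DT)
        hi = min(W_MAX, prev + ACC_W * DT)
        violations += not (lo <= cmd <= hi)
    return violations / max(1, len(omegas) - 1)
```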

VI. CONCLUSIONS, LIMITATIONS AND FUTURE WORK

We present a novel formulation of a Deep Reinforcement

Learning policy that generates dynamically feasible and

spatially aware, smooth velocities. Our method addresses the

Fig. 11: Graph showing the change in the angular velocity generated by Long et al.'s approach, along with the maximum and minimum achievable velocities at each time instant.

Fig. 12: Graph showing the change in the linear velocity generated

by the DWA-RL approach along with the maximum and the

minimum achievable velocity at that time instant. The plot shows

that the output velocity of the DWA-RL policy is always within the

achievable velocity range at any time instant.

issues associated with learning-based approaches (dynamically infeasible velocities) and the classical Dynamic Window Ap-

proach (sub-optimal mobile obstacle avoidance). We validate

our approach in simulation and on real-world robots, and

compare it with the other collision avoidance techniques in

terms of collision rate, average trajectory length and velocity,

and dynamics constraints violations.

Our work has a few limitations which we wish to address

in the future. For instance, the model needs at least a few

observations to compute a velocity that is spatially aware. If

the obstacles are suddenly introduced in the ﬁeld of view of

the robot, the robot might freeze. Efﬁciency of this approach

with an integrated global planner is yet to be studied. Also,

the current model uses Convolutional Neural Network layers in the policy network, but the use of LSTMs [26] could improve the processing of the temporal data from the observation space.

Fig. 13: Graph showing the change in the angular velocity generated by the DWA-RL approach, along with the maximum and minimum achievable velocities at each time instant. The plot shows that the output velocity of the DWA-RL policy always lies within the achievable velocity range.

REFERENCES

[1] J. Liang, U. Patel, A. Sathyamoorthy, and D. Manocha, “Crowd-

steer: Realtime smooth and collision-free robot navigation in densely

crowded scenarios trained using high-ﬁdelity simulation,” 07 2020, pp.

4193–4200.

[2] A. J. Sathyamoorthy, J. Liang, U. Patel, T. Guan, R. Chandra, and

D. Manocha, “Densecavoid: Real-time navigation in dense crowds

using anticipatory behaviors,” 2020.

[3] M. Everett, Y. F. Chen, and J. P. How, “Collision avoidance in

pedestrian-rich environments with deep reinforcement learning,” 2019.

[4] P. Long, T. Fan, X. Liao, W. Liu, H. Zhang, and J. Pan, “Towards

optimally decentralized multi-robot collision avoidance via deep re-

inforcement learning,” in 2018 IEEE International Conference on

Robotics and Automation (ICRA), May 2018, pp. 6252–6259.

[5] B. d’Andrea Novel, G. Bastin, and G. Campion, “Modelling and

control of non-holonomic wheeled mobile robots,” in Proceedings.

1991 IEEE International Conference on Robotics and Automation,

1991, pp. 1130–1131.

[6] D. Fox, W. Burgard, and S. Thrun, “The dynamic window approach

to collision avoidance,” IEEE Robotics Automation Magazine, vol. 4,

no. 1, pp. 23–33, 1997.

[7] C. R. Shelton, “Balancing multiple sources of reward in reinforcement

learning,” in Proceedings of the 13th International Conference on

Neural Information Processing Systems, ser. NIPS’00. Cambridge,

MA, USA: MIT Press, 2000, p. 1038–1044.

[8] M. Missura and M. Bennewitz, “Predictive collision avoidance for

the dynamic window approach,” in 2019 International Conference on

Robotics and Automation (ICRA), 2019, pp. 8620–8626.

[9] E. W. Dijkstra, “A note on two problems in connexion with graphs,”

Numerische mathematik, vol. 1, no. 1, pp. 269–271, 1959.

[10] P. Hart, N. Nilsson, and B. Raphael, “A formal basis for the heuristic

determination of minimum cost paths,” IEEE Transactions on Systems

Science and Cybernetics, vol. 4, no. 2, pp. 100–107, 1968. [Online].

Available: https://doi.org/10.1109/tssc.1968.300136

[11] S. M. Lavalle, “Rapidly-exploring random trees: A new tool for path

planning,” Tech. Rep., 1998.

[12] D. Wolinski, S. J. Guy, A.-H. Olivier, M. Lin, D. Manocha, and

J. Pettré, “Parameter estimation and comparative evaluation of crowd

simulations,” in Computer Graphics Forum, vol. 33, no. 2. Wiley

Online Library, 2014, pp. 303–312.

[13] M. Seder and I. Petrovic, “Dynamic window based approach to mobile

robot motion control in the presence of moving obstacles,” 05 2007,

pp. 1986 – 1991.

[14] C.-C. Chou, F.-L. Lian, and C.-C. Wang, “Characterizing indoor

environment for robot navigation using velocity space approach with

region analysis and look-ahead veriﬁcation,” Instrumentation and

Measurement, IEEE Transactions on, vol. 60, pp. 442 – 451, 03 2011.

[15] E. J. Molinos, Á. Llamazares, and M. Ocaña, “Dynamic window

based approaches for avoiding obstacles in moving,” Robotics and Au-

tonomous Systems, vol. 118, pp. 112 – 130, 2019. [Online]. Available:

http://www.sciencedirect.com/science/article/pii/S0921889018309746

[16] R. Simmons, “The curvature-velocity method for local obstacle avoid-

ance,” in Proceedings of IEEE International Conference on Robotics

and Automation, vol. 4, April 1996, pp. 3375–3382 vol.4.

[17] L. Xie, S. Wang, A. Markham, and N. Trigoni, “Towards monocular

vision based obstacle avoidance through deep reinforcement

learning,” CoRR, vol. abs/1706.09829, 2017. [Online]. Available:

http://arxiv.org/abs/1706.09829

[18] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training

of deep visuomotor policies,” CoRR, vol. abs/1504.00702, 2015.

[Online]. Available: http://arxiv.org/abs/1504.00702

[19] Y. Kim, J. Jang, and S. Yun, “End-to-end deep learning for autonomous

navigation of mobile robot,” in 2018 IEEE International Conference

on Consumer Electronics (ICCE), Jan 2018, pp. 1–6.

[20] T. Fan, P. Long, W. Liu, and J. Pan, “Fully distributed multi-robot

collision avoidance via deep reinforcement learning for safe and

efﬁcient navigation in complex scenarios,” CoRR, vol. abs/1808.03841,

2018. [Online]. Available: http://arxiv.org/abs/1808.03841

[21] Y. F. Chen, M. Liu, M. Everett, and J. P. How, “Decentralized non-

communicating multiagent collision avoidance with deep reinforce-

ment learning,” in ICRA. IEEE, 2017, pp. 285–292.

[22] M. Everett, Y. F. Chen, and J. P. How, “Motion planning among

dynamic, decision-making agents with deep reinforcement learning,”

in 2018 IEEE/RSJ International Conference on Intelligent Robots and

Systems (IROS), Oct 2018, pp. 3052–3059.

[23] L. Tai, J. Zhang, M. Liu, and W. Burgard, “Socially compliant navi-

gation through raw depth inputs with generative adversarial imitation

learning,” in ICRA, May 2018, pp. 1111–1117.

[24] Y. F. Chen, M. Everett, M. Liu, and J. How, “Socially aware motion

planning with deep reinforcement learning,” 09 2017, pp. 1343–1350.

[25] A. J. Sathyamoorthy, U. Patel, T. Guan, and D. Manocha, “Frozone:

Freezing-free, pedestrian-friendly navigation in human crowds,” IEEE

Robotics and Automation Letters, vol. 5, no. 3, pp. 4352–4359, 2020.

[26] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural

Comput., vol. 9, no. 8, p. 1735–1780, Nov. 1997. [Online]. Available:

https://doi.org/10.1162/neco.1997.9.8.1735