ENHANCED DEEP REINFORCEMENT LEARNING FOR PARCEL SINGULATION IN
NON-STATIONARY ENVIRONMENTS
Jiwei Shen, Hu Lu, Hao Zhang, Shujing Lyu, Yue Lu
School of Communications and Electronic Engineering, East China Normal University, Shanghai, China
ABSTRACT
In the rapidly expanding logistics sector, parcel singulation has emerged as a significant bottleneck. To address this, we propose an automated parcel singulator utilizing a sparse actuator array, which presents an optimal balance between cost and efficiency, albeit requiring a sophisticated control policy. In this study, we frame the parcel singulation issue as a Markov Decision Process with a variable state space dimension, addressed through a deep reinforcement learning (RL) algorithm complemented by a State Space Standardization Module (S3). Distinct from previous RL approaches, our methodology considers the non-stationary environment from the problem modeling phase onward. To counter this challenge, the S3 module standardizes the dynamic input state, thereby stabilizing the RL training process. We validate our method through simulation experiments in complex environments, comparing it with several baseline algorithms. Results indicate that our algorithm excels in parcel singulation tasks, achieving a higher success rate and enhanced efficiency.
Index Terms: Reinforcement learning, Markov decision process, non-stationary environment, state space standardization, parcel singulation
1. INTRODUCTION
The burgeoning reliance on e-commerce, coupled with a drastic surge in parcel volumes, necessitates an expansion of logistics chain capacity. In the industry, bulk unloading is a prevalent approach adopted to enhance the distribution efficiency. Parcel singulation, a critical preparatory phase preceding the downstream automated sorting process, ensures the segregation and appropriate spacing of parcels at a predefined interval, denoted as d, along the x-axis. Traditionally, parcels have been manually separated into a one-dimensional flow, albeit at a relatively low unload rate. Consequently, singulation emerges as a significant bottleneck in the sortation line. Therefore, the development of high-performance methods is increasingly sought-after.
Corresponding author.
This work was supported by the Science and Technology Commission
of Shanghai Municipality under Grant 22DZ2229004.
(a) Real-world task. (b) Simulated environment.
Fig. 1. From left to right, the parcels are densely packed in the infeed area. After singulation processing, the parcels are separated and spaced at the predefined interval d.
The equipment for parcel singulation can be modeled with 1D, 2D, and even 3D structures. The 1D model [1] is the simplest, adjusting the speed of control-based conveyors according to a mathematical derivation. However, it is space-consuming and lacks efficiency, as it requires a large singulation area (800 mm width) to ensure singulation performance. A 3D model is capable of separating parcels with peristaltic waves within a limited space [2]. Nevertheless, widely deploying equipment with a 3D structure increases the overall cost of management and maintenance [3]. Considering the trade-off between limited space and cost, equipment with a 2D model is regarded as an optimal solution to parcel singulation, as shown in Figure 1. Kim [4] proposed a non-learning method with a greedy policy. As the state of the parcel singulation task under this circumstance becomes unpredictable, it would be extremely complicated to artificially design a comprehensive control policy covering the various scenarios. Thus, deep RL offers an alternative way of learning this complicated control policy to improve singulation efficiency [5-7]. An intuitive control of singulation was then proposed by learning a priority assignment function with DDPG [4]. However, the performance of the priority-learned method improves only marginally, as it ignores the uncertainty of the input parcels when formulating the problem [8-10].
In this paper, the parcel singulation task is formulated as a Markov decision process (MDP) problem with a non-stationary environment [11, 12]. In real-world circumstances, the input state space of the MDP varies owing to the uncertainty of the input parcels at each time step. To handle the dynamic input for deep RL, Huegle et al. [13] proposed the DeepSet-Q network
to alleviate the issues arising from the variable state size in autonomous driving. Inspired by these studies, two solutions are proposed to deal with the variable dimension of the state space in parcel singulation: one is DeepSet-SAC [14], the other is S3-SAC, where S3 is short for state space standardization. S3 is introduced for the first time as a special architectural component for parcel singulation. Experiments demonstrate that S3-SAC shows higher training stability and the highest singulation efficiency.
The contributions of this work can be summarized as follows: 1) A simulation environment is constructed, ensuring the feasibility of the experiments. 2) Different from regular tasks, S3-SAC is introduced to deal with the non-stationary environment caused by the uncertainty of the input parcels. 3) S3 is proposed for the first time as a special architectural component to handle non-stationary environment problems. 4) Quantitative experiments demonstrate the effectiveness of the method proposed in this paper, which is capable of singulating parcels at an impressive rate of 6000 parcels per hour (pph).
2. APPROACH
2.1. Problem definition
Fig. 2. Illustration of a model of the singulator. The upper layer represents sparse real-world belt conveyors, and the layer below represents the corresponding sparse actuator array. During singulation, parcels are manipulated by the belt conveyors from left to right. The velocity profiles of the corresponding actuators are obtained from the learned control policy.
A model of the singulator is illustrated in Figure 2. The parcel singulation problem can be formulated as the following optimization problem:

$$\max_{U(t)} E \quad \text{s.t.} \quad u_i(t) \le U_{\max},\; u_{\text{test}}(t) = U_{\max},\; d \ge PI, \tag{1}$$

where $i = 1, \ldots, N$ are the indices of the actuators, $u_i(t)$ denotes the velocity of the $i$-th actuator, and $U_{\max}$ denotes the maximum velocity of the actuators. $u_{\text{test}}$ is the velocity of the actuator of the align area. $d$ is the distance between the left boundary of the parcel that has crossed the goal line and the right boundary of the nearest parcel. $PI$ represents the predefined interval. $E$, the efficiency of parcel singulation, is defined as $E = N_p \times \eta$, where $N_p$ represents the number of parcels crossing the goal line in one hour, and $\eta$ is the pass rate, formulated as $\eta = N_p / N$, where $N$ denotes the total number of parcels in one hour.
If d is larger than PI, the parcel is regarded as a pass parcel. The Parcel Generator simulates the circumstance where parcels are densely packed on the infeed conveyor, and the randomly generated parcels are sampled from a distribution observed in actual express-industry scenarios. The Conveyor Array Simulator simulates the movement of parcels under the control of the conveyor belt array. Besides modeling each parcel as a planar rigid body and performing force analysis at the pixel level [4], collisions between parcels are also detected by our simulation environment. Meanwhile, the parameters of the environment are fine-tuned to be proportional to the scale of the equipment in reality. Therefore, the pre-trained model can be easily transferred to the real world.
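To make the quantities in Eq. (1) concrete, the following is a minimal Python sketch of the pass condition and the two performance measures; the function names and the example counts are illustrative only (the 5100/5200 figures are hypothetical, chosen merely to match the scale reported later in Table 2).

```python
def is_pass(d: float, PI: float) -> bool:
    """A parcel counts as a pass parcel when its gap d to the nearest parcel is at least PI."""
    return d >= PI


def singulation_metrics(num_passed: int, num_total: int) -> tuple[float, float]:
    """Pass rate eta = Np / N and efficiency E = Np * eta, as defined below Eq. (1).

    num_passed: parcels crossing the goal line with sufficient spacing in one hour (Np).
    num_total:  all parcels fed to the singulator in that hour (N).
    """
    eta = num_passed / num_total
    return eta, num_passed * eta


# Hypothetical example: 5100 of 5200 parcels pass in one hour.
eta, efficiency = singulation_metrics(5100, 5200)  # eta ~ 0.981, E ~ 5002 pph
```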
2.2. S3-SAC
Fig. 3. Illustration of the proposed framework. The sorted state information extracted from the simulation environment is defined as the input state with dimension k, where k is an integer not larger than n. The input state is standardized to the standard state with dimension n as the input of the soft actor-critic network.
When it comes to solving reinforcement learning tasks in real-world domains, a parameterized control policy can be learned with existing actor-critic methods such as SAC [15-18]. The parcel singulation network is built on the SAC framework, as shown in Figure 3. The dimension of the action is associated with the number of actuators. The state information is extracted from the simulation environment and sorted before entering the network as the input state. Due to the instability of the real-world environment, the number of parcels in the singulation area varies at each step. To handle this challenge, the input state is converted to the standard state by the state space standardization module (S3), which serves as the input of the actor network. The actor generates the control policy of the conveyor array for the critic and the simulation environment. The critic outputs the state-action value to update the parameters. At the next step, the simulation environment is updated according to the action generated by the actor; likewise, the state information is extracted and sorted from the updated simulation environment as the new input state. State space standardization is the proposed solution to the problem of variable input dimension, which guarantees the feasibility of the task: the k-dimensional input state is converted to a fixed n-dimensional state by the S3 module, where k is an integer not larger than n.
The S3 module is employed to deal with the non-stationary environment. In parcel singulation, the dimension of the input state varies at each step. To conquer the challenge of partial observation, the dimension of the standard state is set larger than the maximum dimension of the input states. The state information is sorted according to the distance to the align area to form the k-dimensional input state, which fills the first k slots of the standard state, as shown in Figure 4. The S3 module is then applied as presented in Figure 5: the remaining n-k empty slots are filled with a padding state resampled from the generation space. The padding state can be considered as a set of samples of the input states newly generated at subsequent steps. Under this scheme, at each step the standard state, consisting of the input state and the padding state, can be seen as approximately following the same distribution, transforming the non-stationary environment into a stationary one. Filling the empty slots with the padding state also enhances the robustness and generalization of the network and provides higher training efficiency.
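A minimal sketch of how the S3 mapping described above could be implemented, assuming each parcel is represented by a feature row whose first column is its distance to the align area; sample_generation_space stands in for resampling parcels from the (unspecified) generation-space distribution and is an assumption, not the authors' code.

```python
import numpy as np


def s3_standardize(input_state, n, sample_generation_space, dist_col=0):
    """State space standardization (S3): map a k-row input state (k <= n)
    to a fixed n-row standard state.

    input_state: (k, feat) array, one row per parcel in the singulation area.
    n: number of standard-state slots, chosen larger than the maximum expected k.
    sample_generation_space: callable(m) -> (m, feat) array of padding parcels.
    dist_col: column holding the distance to the align area, used for sorting.
    """
    k = input_state.shape[0]
    assert k <= n, "more parcels than standard-state slots"
    # Sort real parcels so the one nearest the goal line fills slot 1 (Fig. 4).
    real = input_state[np.argsort(input_state[:, dist_col])]
    # Fill the remaining n - k slots with padding parcels resampled from the
    # generation space (Fig. 5), i.e. plausible future input parcels.
    padding = sample_generation_space(n - k)
    return np.concatenate([real, padding], axis=0)
```

Because the padding rows are drawn from the same distribution that generates future input parcels, consecutive standard states are approximately identically distributed, which is the stationarity argument made above.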
Fig. 4. An example of sorting. The parcel near the goal line has a higher priority.
3. EXPERIMENTS
S3-SAC is proposed to mitigate the effects of variable state
space dimensions, and its performance is evaluated through
comparative experiments with existing methods.
Greedy method. Kim et al. [4] proposed to learn the priority of parcels from the input location information of each parcel and to use intuitive control to make the parcels line up in order of priority.
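Reference [4] is only summarized here, so the following is a speculative sketch of one possible priority-based greedy controller consistent with that summary: the parcel closest to the align area gets the highest priority and the actuators beneath it run at full speed. All names and the footprint lookup are assumptions, not the authors' implementation.

```python
import numpy as np


def greedy_velocities(parcel_dists, parcel_footprints, num_actuators, u_max, u_slow):
    """Speculative greedy policy: push the highest-priority parcel (the one
    closest to the align area) at full speed and hold the rest back.

    parcel_dists:      distance of each parcel to the align area.
    parcel_footprints: per-parcel list of actuator indices under that parcel
                       (assumed observable from the 2D actuator array).
    """
    u = np.full(num_actuators, u_slow, dtype=float)
    if len(parcel_dists) > 0:
        leader = int(np.argmin(parcel_dists))   # highest priority
        u[parcel_footprints[leader]] = u_max    # drive it toward the goal line
    return u
```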
Zero-padding-SAC. Classically, zero-padding is used to fill the remaining n-k empty standard-state slots shown in Figure 4. Unfortunately, zero-padding lacks effectiveness and stability.
Fig. 5. Interpretation of state space standardization. The padding state is randomly resampled from the generation space. The standard state (n-dimensional) is defined as the concatenation of the input state (k-dimensional) and the padding state ((n-k)-dimensional).
From the perspective of backward passes, it is unlikely for the weights to be well trained within limited training steps, because the weights connected to the zero-padded slots are seldom updated. From the perspective of forward passes, assume without loss of generality that at step t the input state has k dimensions, while at step t+1 it has k+1 dimensions. Slot k+1 is filled with 0 at step t but with actual parcel information at step t+1, so the two inputs differ greatly from each other. Consequently, the action generated at step t varies greatly from the action generated at step t+1, leading to instability.
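For comparison with S3, a minimal sketch of the zero-padding baseline discussed above; the function is illustrative, not the authors' implementation.

```python
import numpy as np


def zero_pad_state(input_state, n):
    """Zero-padding baseline: copy the k real parcel rows and fill the
    remaining n - k standard-state slots with zeros."""
    k, feat = input_state.shape
    padded = np.zeros((n, feat), dtype=input_state.dtype)
    padded[:k] = input_state
    # Note: slot k+1 is all zeros now, but may hold a real parcel at the next
    # step, which is the forward-pass discontinuity described above.
    return padded
```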
DeepSet-SAC. Huegle et al. [13] suggest using Deep Sets in deep RL as a flexible and permutation-invariant architecture to deal with non-stationary environment problems arising from a variable state space dimension. Inspired by the structure of DeepSet-Q, which handles a discrete action space, we propose DeepSet-SAC with an actor-critic framework to deal with a continuous action space. The representation of the input set is computed by
$$\Psi(X_t^{\mathrm{dyn}}) = \rho\Big(\sum_{x_t^j \in X_t^{\mathrm{dyn}}} \phi(x_t^j)\Big), \tag{2}$$
where $X_t^{\mathrm{dyn}} = [x_t^1, \ldots, x_t^n]$ is the dynamic input with a variable number of vectors, n is the number of current parcels, and each vector $x_t^j$ represents the position and speed information of the corresponding parcel.
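Eq. (2) can be realized with two small networks phi and rho and a sum pooling. The following PyTorch sketch is a generic Deep Sets encoder under assumed layer sizes, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class DeepSetEncoder(nn.Module):
    """Permutation-invariant encoder Psi(X) = rho(sum_j phi(x_j)), cf. Eq. (2)."""

    def __init__(self, parcel_dim, hidden_dim=64, out_dim=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(parcel_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))
        self.rho = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, out_dim))

    def forward(self, x_set):
        # x_set: (num_parcels, parcel_dim); num_parcels may vary from step to step.
        pooled = self.phi(x_set).sum(dim=0)  # sum pooling gives permutation invariance
        return self.rho(pooled)
```

The fixed-size output of such an encoder then feeds the actor and critic heads in place of the standard state.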
The learning curves, delineated in Figure 6, elucidate varying degrees of performance among the different algorithms. It is observed that the Zero-padding-SAC algorithm exhibits the poorest performance coupled with the highest variance, underscoring a pronounced instability during the training phase. By mitigating the oscillations introduced by dynamic input dimensions, both the S3-SAC and DeepSet-SAC algorithms manifest a noticeable reduction in training instability. Furthermore, a distinct advantage of the S3-SAC methodology is its rapid convergence and superior performance when juxtaposed with the baseline algorithms.
Algorithm 1 S3-SAC for parcel singulation
Require: parcel distribution D, actuator array parameters A, the maximum number of parcels N
Initialize critic parameters θ_1, θ_2
Initialize actor parameters ϕ
Initialize an empty replay buffer R ← ∅
for each iteration do
    for each environment step do
        Sample an action from the policy: a_t ∼ π_ϕ(a_t | s_t)
        Generate new parcels with the Parcel Generator: z_t = ParcelGenerator(D, N)
        Compute the next state and the reward with the Conveyor Array Simulator: s_{t+1}, r = ConveyorArraySimulator(s_t, a_t, z_t)
        Standardize the states with the S3 module: s^s_t ← S3(s_t), s^s_{t+1} ← S3(s_{t+1})
        Store the transition in the replay buffer: R ← R ∪ {(s^s_t, a_t, r, s^s_{t+1})}
    end for
    for each gradient step do
        Update the Q-function parameters: θ_i ← θ_i − λ_Q ∇_{θ_i} J_Q(θ_i) for i ∈ {1, 2}
        Update the policy weights: ϕ ← ϕ − λ_π ∇_ϕ J_π(ϕ)
        Adjust the temperature: α ← α − λ ∇_α J(α)
        Update the target network weights: θ̄_i ← τ θ_i + (1 − τ) θ̄_i for i ∈ {1, 2}
    end for
end for
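For concreteness, here is a minimal sketch of the environment-step loop of Algorithm 1; the Parcel Generator, Conveyor Array Simulator, SAC agent, and replay buffer are assumed interfaces, not the authors' code.

```python
def collect_step(state, agent, s3, parcel_gen, conveyor_sim, replay, n_slots):
    """One environment step of Algorithm 1: act, simulate, standardize, store."""
    s_std = s3(state, n_slots)                      # standardize the current state
    action = agent.sample_action(s_std)             # a_t ~ pi_phi(. | s_t)
    new_parcels = parcel_gen.sample()               # z_t drawn from distribution D
    next_state, reward = conveyor_sim.step(state, action, new_parcels)
    next_std = s3(next_state, n_slots)
    replay.add(s_std, action, reward, next_std)     # store the standardized transition
    return next_state
```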
This elevation in performance can be attributed to the fundamental differences in how these algorithms manage the input state dimension. Specifically, the zero-padding technique alters the state space by supplementing it with zeros to preserve the input state dimension, thereby instigating training instability. Conversely, in environments characterized by dynamic input states, the DeepSet-SAC algorithm leverages neural networks to extract high-dimensional features, thereby securing a stable state space. Notwithstanding, this approach somewhat attenuates the inter-parcel connectivity, culminating in a significant loss of information during the dimensionality reduction process. On the other hand, S3-SAC adeptly navigates dynamic input challenges by fostering a seamless connection between the input and padding states, without compromising information fidelity, thereby enhancing training performance.
Given the stochastic nature of the parcel singulation task, where prior information such as parcel size and quantity is substantially unpredictable, a comprehensive evaluation of the learned policy across a wide range of scenarios was executed to ascertain its generalization capabilities. This involved the generation of dense packages with varied dimensions, emulating realistic package distributions. The results in Table 1 and Table 2 indicate that S3-SAC not only surpasses the performance of contemporary approaches in tackling parcel singulation challenges across diverse scenarios, but also exhibits enhanced adaptability to unprecedented situations.

Fig. 6. Normalized reward of the learning-based methods (S3-SAC, DeepSet-SAC, and Zero-padding-SAC).
Table 1: Pass rate
Scenario            5        7        9        12       15       20
Greedy method       85.58%   85.28%   85.03%   87.09%   86.27%   85.31%
Zero-padding-SAC    96.82%   96.27%   97.38%   96.46%   97.97%   96.77%
DeepSet-SAC         97.12%   97.13%   97.57%   98.75%   97.50%   97.92%
S3-SAC              98.73%   99.15%   98.71%   99.24%   99.02%   99.13%

Table 2: Parcel singulation efficiency
Scenario                  5      7      9      12     15     20
Greedy method (pph)       3590   3600   3600   3691   3619   3653
Zero-padding-SAC (pph)    4136   4279   4242   4374   4222   4398
DeepSet-SAC (pph)         4363   4392   4435   4550   4483   4512
S3-SAC (pph)              4997   5045   5026   4992   5011   5040
4. CONCLUSION
We formulate the parcel singulation problem as an MDP with a variable state space dimension and propose an efficacious module, denoted S3, which standardizes the dynamic input state, thereby ensuring the stability of the deep RL training process. The constructed simulation environment serves as a robust platform to corroborate that the control policy grounded on deep RL surpasses conventional non-learning approaches. Additionally, comparative analyses elucidate that our S3 module significantly mitigates the adverse effects of the dynamic inputs inherent in the parcel singulation problem. Consequently, learning strategies predicated on our proposed S3-SAC surpass the performance of both Zero-padding-SAC and DeepSet-SAC, yielding higher pass rates and enhanced singulation efficiency.
5. REFERENCES
[1] Ki Kim, Yong Choi, and Hoon Jung, "Infeed control algorithm of sorting system using modified trapezoidal velocity profiles," ETRI Journal, vol. 37, pp. 328-337, 2015.
[2] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine, "Stabilizing off-policy Q-learning via bootstrapping error reduction," in Advances in Neural Information Processing Systems, vol. 32, pp. 11784-11794, Curran Associates, Inc., 2019.
[3] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi, "An optimistic perspective on offline reinforcement learning," in International Conference on Machine Learning. PMLR, 2020, pp. 104-114.
[4] Woojin Kim, Ki Hak Kim, and Daesub Yoon, "Learning control policy for parcel singulation," in 2016 International Conference on Information and Communication Technology Convergence (ICTC), 2016, pp. 138-140.
[5] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine, "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3389-3396.
[6] Tuomas Haarnoja, Vitchyr Pong, Aurick Zhou, Murtaza Dalal, Pieter Abbeel, and Sergey Levine, "Composable deep reinforcement learning for robotic manipulation," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 6244-6251.
[7] Jiwei Shen, Liang Yuan, Yue Lu, and Shujing Lyu, "Leveraging predictions of task-related latents for interactive visual navigation," IEEE Transactions on Neural Networks and Learning Systems, pp. 1-14, 2023.
[8] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, 2016.
[9] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, pp. 354-359, 2017.
[10] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al., "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140-1144, 2018.
[11] Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine, "Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7559-7566.
[12] Yu Wei, Minjia Mao, Xi Zhao, Jianhua Zou, and Ping An, "City metro network expansion with reinforcement learning," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 2646-2656.
[13] Maria Huegle, Gabriel Kalweit, Branka Mirchevska, Moritz Werling, and Joschka Boedecker, "Dynamic input for deep reinforcement learning in autonomous driving," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 7566-7573.
[14] Fabian Westbrink, Andreas Schwung, and Steven X. Ding, "Data-based control of peristaltic sortation machines using discrete element method," in IECON 2020 - The 46th Annual Conference of the IEEE Industrial Electronics Society, 2020, pp. 575-580.
[15] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel, "Benchmarking deep reinforcement learning for continuous control," in International Conference on Machine Learning. PMLR, 2016, pp. 1329-1338.
[16] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger, "Deep reinforcement learning that matters," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, 2018.
[17] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in International Conference on Machine Learning. PMLR, 2018, pp. 1861-1870.
[18] Jiwei Shen, Pengjie Lou, Liang Yuan, Shujing Lyu, and Yue Lu, "VME-Transformer: Enhancing visual memory encoding for navigation in interactive environments," IEEE Robotics and Automation Letters, vol. 9, no. 1, pp. 643-650, 2024.