Reduction of Trajectory Encoding Data using a
Deep Autoencoder Network: Robotic Throwing
Zvezdan Lončarević*, Rok Pahič, Mihael Simonič, Aleš Ude and Andrej Gams
Jožef Stefan Institute
Jamova cesta 39, 1000 Ljubljana, Slovenia
zvezdan.loncarevic@ijs.si
Abstract. Autonomous learning and adaptation of robotic trajectories
by complex robots in unstructured environments, for example with the
use of reinforcement learning, very quickly encounters problems where
the dimensionality of the search space is beyond the range of practical
use. Different methods of reducing the dimensionality have been pro-
posed in the literature. In this paper we explore the use of deep autoencoders, where the dimensionality of the autoencoder latent space is low.
However, a database of actions is required to train a deep autoencoder
network. The paper presents a study on the number of required database
samples in order to achieve dimensionality reduction without much loss
of information.
Keywords: deep autoencoder, reinforcement learning, robotic throwing
1 Introduction and Related Work
Learning and adaptation of robotic skills has been researched in depth over the
last decade. However, as [11] reports, the search space of learning of complete
actions/skills from scratch is too large, making learning from scratch unfeasible.
Several methods have been proposed to reduce the size of the learning space.
For example, the search space was reduced with accurate initial demonstrations
[5], or by confining it to the space between an over- and an under-attempt [9].
Policy learning has been developed for continuous actions [4]. Information has been reduced by keeping only the principal components of the data [1] and by specific methods of trajectory encoding, such as locally weighted projection regression [14], leading to latent spaces [1].
Latent spaces of deep autoencoders have also been proposed as a means of di-
mensionality reduction of robotic actions and skills [2], specifically in combina-
tion with policy representations in the form of dynamic movement primitives
[6]. In our previous paper on this topic we have shown that skill learning (us-
ing reinforcement learning) in latent space of deep autoencoders is faster than
learning in the space of the trajectory [10].
* Holder of the Ad futura Scholarship for Postgraduate Studies of Nationals of Western Balkan States for Study in the Republic of Slovenia (226th Public Call)
However, the problem of generating the database for training the autoencoder
network remains. While in [10] we used simulated results to train the autoencoder
network, in this paper we explore how much data is actually needed for faithful
representation of an action. To do this, we used the humanoid robot Talos to
perform throwing actions with one of its arms, where the trajectory of throwing
was encoded in the task space. We applied reinforcement learning (RL [8]) to
learn throwing at the desired spot. We recorded all the roll-outs and used them
to make our database. By learning to throw at different locations, we quickly
acquired a large database, and then used parts of it to see how accurately we
can perform throws at untrained locations. Testing different database sizes and deep autoencoder latent space sizes can provide insight into whether collecting a small database in the real world makes sense, or whether it is better to generate a large database in simulation and then apply RL in the latent space of an autoencoder trained on this database. This paper reports on the first steps
of researching in this direction, with the research on the required simulated
database.
The rest of the paper is organized as follows. In the next Section we first show
how the task-space trajectories of throwing are encoded and how reinforcement
learning using PoWER is applied to refine a given task. Section 3 gives details
on autoencoders. The experiments are described in Section 4, followed by results
in Section 5 and a Conclusion.
2 Trajectory and Learning
Robot motion trajectories can be written in either joint space or in Cartesian
space. Cartesian space trajectories are typically easier for humans to interpret,
especially for high degree-of-freedom robots. In this paper the trajectories of mo-
tion were encoded in task space of the end-effector, in the form of Cartesian space
dynamic movement primitives (CDMPs). The original derivation of CDMPs was
proposed in [13]. For clarity we provide a short recap of CDMPs below.
2.1 Cartesian space Dynamic Movement Primitives
CDMPs are composed of two parts: a position part and an orientation part. The position part of the trajectory is the same as in standard DMPs. The orientation part of the CDMP, on the other hand, is represented by unit quaternions. Unit quaternions require special treatment, both in the nonlinear dynamic equations and in the integration of these equations.
The following parameters compose a CDMP: weights $\mathbf{w}_k^p, \mathbf{w}_k^o \in \mathbb{R}^3$, $k = 1, \ldots, N$, which represent the position and orientation parts of the trajectory, respectively; the trajectory duration $\tau$; and the final desired (goal) position $\mathbf{g}^p$ and orientation $\mathbf{g}^o$ of the robot. The variable $N$ sets the number of radial basis functions that are used to encode the trajectory. The orientation is in a CDMP represented by a unit quaternion $\mathbf{q} = v + \mathbf{u} \in S^3$, where $S^3$ is the unit sphere in $\mathbb{R}^4$, $v \in \mathbb{R}$ is its scalar part and $\mathbf{u} \in \mathbb{R}^3$ its vector part. To encode position ($\mathbf{p}$) and orientation ($\mathbf{q}$) trajectories we use the following differential equations:
$$\tau\dot{\mathbf{z}} = \alpha_z\left(\beta_z(\mathbf{g}^p - \mathbf{p}) - \mathbf{z}\right) + \mathbf{f}^p(x), \qquad (1)$$
$$\tau\dot{\mathbf{p}} = \mathbf{z}, \qquad (2)$$
$$\tau\dot{\boldsymbol{\eta}} = \alpha_z\left(\beta_z\, 2\log\left(\mathbf{g}^o * \bar{\mathbf{q}}\right) - \boldsymbol{\eta}\right) + \mathbf{f}^o(x), \qquad (3)$$
$$\tau\dot{\mathbf{q}} = \tfrac{1}{2}\,\boldsymbol{\eta} * \mathbf{q}, \qquad (4)$$
$$\tau\dot{x} = -\alpha_x x. \qquad (5)$$
The parameters $\mathbf{z}$, $\boldsymbol{\eta}$ denote the scaled linear and angular velocity ($\mathbf{z} = \tau\dot{\mathbf{p}}$, $\boldsymbol{\eta} = \tau\boldsymbol{\omega}$). For details on the quaternion product $*$, conjugation $\bar{\mathbf{q}}$, and the quaternion logarithm $\log(\mathbf{q})$, see [13]. The nonlinear parts, also termed forcing terms, $\mathbf{f}^p$ and $\mathbf{f}^o$, are defined as
$$\mathbf{f}^p(x) = \mathbf{D}^p\,\frac{\sum_{k=1}^{N}\mathbf{w}_k^p\,\Psi_k(x)}{\sum_{k=1}^{N}\Psi_k(x)}\,x, \qquad (6)$$
$$\mathbf{f}^o(x) = \mathbf{D}^o\,\frac{\sum_{k=1}^{N}\mathbf{w}_k^o\,\Psi_k(x)}{\sum_{k=1}^{N}\Psi_k(x)}\,x. \qquad (7)$$
The forcing terms contain the parameters $\mathbf{w}_k^p, \mathbf{w}_k^o \in \mathbb{R}^3$. They have to be learned, for example directly from an input Cartesian trajectory $\{\mathbf{p}_j, \mathbf{q}_j, \dot{\mathbf{p}}_j, \boldsymbol{\omega}_j, \ddot{\mathbf{p}}_j, \dot{\boldsymbol{\omega}}_j, t_j\}_{j=1}^{T}$. The scaling matrices $\mathbf{D}^p, \mathbf{D}^o \in \mathbb{R}^{3\times 3}$ can be set to $\mathbf{D}^p = \mathbf{D}^o = \mathbf{I}$. Other possibilities are described in [13]. The nonlinear forcing terms are defined as a linear combination of radial basis functions $\Psi_k$,
$$\Psi_k(x) = \exp\left(-h_k (x - c_k)^2\right). \qquad (8)$$
Here $c_k$ are the centers and $h_k$ the widths of the radial basis functions. The basis functions can be distributed, as in [12], according to $c_k = \exp\left(-\alpha_x \frac{k-1}{N-1}\right)$, $h_k = \frac{1}{(c_{k+1} - c_k)^2}$, $h_N = h_{N-1}$, $k = 1, \ldots, N$. The time constant $\tau$ is set to the desired duration of the trajectory, i.e. $\tau = t_T - t_1$. The goal position and orientation are usually set to the final position and orientation of the desired trajectory, i.e. $\mathbf{g}^p = \mathbf{p}_{t_T}$ and $\mathbf{g}^o = \mathbf{q}_{t_T}$. A detailed CDMP description and the auxiliary math are explained in [13].
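To make the above concrete, the following minimal NumPy sketch integrates the position part of the CDMP, i.e. Eqs. (1), (2) and (5) with the forcing term (6) and basis functions (8), using Euler integration. The gains ($\alpha_z = 48$, $\beta_z = \alpha_z/4$, $\alpha_x = 2$) and the integration step are illustrative assumptions, not necessarily the values used in our implementation or in [13].

```python
import numpy as np

# Illustrative gains; the actual values used in the paper are not specified here.
ALPHA_Z, BETA_Z, ALPHA_X = 48.0, 12.0, 2.0

def basis(x, c, h):
    """Radial basis functions Psi_k(x) from Eq. (8)."""
    return np.exp(-h * (x - c) ** 2)

def forcing(x, w, c, h):
    """Position forcing term f^p(x) from Eq. (6) with D^p = I; w has shape (N, 3)."""
    psi = basis(x, c, h)                      # shape (N,)
    return (w.T @ psi) / np.sum(psi) * x      # shape (3,)

def integrate_position(p0, g, w, tau, dt=0.002):
    """Euler integration of Eqs. (1), (2) and (5) for the position part only."""
    N = w.shape[0]
    c = np.exp(-ALPHA_X * np.arange(N) / (N - 1))   # centers c_k, as in [12]
    h = 1.0 / np.diff(c) ** 2                       # widths h_k
    h = np.append(h, h[-1])                         # h_N = h_{N-1}

    p, z, x = p0.copy(), np.zeros(3), 1.0
    traj = [p.copy()]
    for _ in range(int(tau / dt)):
        dz = ALPHA_Z * (BETA_Z * (g - p) - z) + forcing(x, w, c, h)
        p = p + dt * z / tau        # Eq. (2)
        z = z + dt * dz / tau       # Eq. (1)
        x = x + dt * (-ALPHA_X * x) / tau   # Eq. (5)
        traj.append(p.copy())
    return np.array(traj)

# Example: 25 basis functions per dimension; zero weights yield a smooth
# point-to-point motion from p0 toward g.
trajectory = integrate_position(p0=np.zeros(3), g=np.array([0.3, 0.1, 0.4]),
                                w=np.zeros((25, 3)), tau=1.5)
```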
2.2 PoWER
The goal of the paper is to show that autonomous learning, in our case rein-
forcement learning, can be faster in the latent space. We use Policy Learning
by Weighting Exploration with the Returns (PoWER) [7] RL algorithm. It is
an Expectation-Maximization (EM) based RL algorithm that can be combined
with importance sampling to better exploit previous experience. It tries to max-
imize the expected return of trials using a parametrized policy, such as the
aforementioned CDMPs. We use PoWER because it is robust with respect to the choice of reward function [7]. Furthermore, it uses only the terminal reward and no intermediate rewards.
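For illustration, the sketch below implements a simplified form of the PoWER update: exploration noise is added to the policy parameters, and the new parameters are obtained as a return-weighted average of the noise of the best previous roll-outs (importance sampling). The exploration variance, the number of importance-sampled roll-outs and the `rollout`/`reward` functions are placeholders; see [7] for the exact formulation.

```python
import numpy as np

def power_update(theta, history, n_best=5):
    """One PoWER-style parameter update (a simplified sketch of [7]).

    history: list of (exploration_noise, return) pairs from previous roll-outs.
    The update adds a return-weighted average of the exploration noise of the
    best roll-outs (importance sampling) to the current parameters.
    """
    best = sorted(history, key=lambda e: e[1], reverse=True)[:n_best]
    num = sum(R * eps for eps, R in best)
    den = sum(R for _, R in best) + 1e-10
    return theta + num / den

def learn(theta0, rollout, reward, n_iter=50, sigma=0.05):
    """rollout(theta) executes a throw; reward(outcome) returns a scalar return."""
    theta, history = theta0.copy(), []
    for _ in range(n_iter):
        eps = sigma * np.random.randn(*theta.shape)   # exploration in parameter space
        outcome = rollout(theta + eps)
        history.append((eps, reward(outcome)))
        theta = power_update(theta, history)
    return theta
```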
Fig. 1. Illustration of the autoencoder structure with five hidden layers. Because the number of neurons per layer in the actual autoencoder is too high for an effective illustration, each layer is drawn with one third of the actual number of neurons. The depicted numbers of neurons per layer therefore do not match the numbers we used (164 for the input and output layers; 100, 50, 10, 50, 100 for the hidden layers).
3 Deep Autoencoder Network
An autoencoder is a neural network used for unsupervised encoding of data. It consists of two parts: an encoder and a decoder network. The encoder network encodes the data. Because the number of neurons in the hidden layers is lower than in the input layer, the data is forced through a bottleneck, the latent space, in which the most relevant features are extracted. The latent space is therefore often also called the feature space. One of the typical applications of autoencoders is dimensionality reduction [2].
The decoder network reconstructs the feature representation so that the output data $\boldsymbol{\theta}'$ matches the input data $\boldsymbol{\theta}$. Training of the autoencoder parameters $\boldsymbol{\theta}^*$ is described by
$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}^*} \frac{1}{n}\sum_{i=1}^{n} L\left(\boldsymbol{\theta}^{(i)}, \boldsymbol{\theta}'^{(i)}\right) \qquad (9)$$
$$\;\; = \arg\min_{\boldsymbol{\theta}^*} \frac{1}{n}\sum_{i=1}^{n} L\left(\boldsymbol{\theta}^{(i)}, h\left(g\left(\boldsymbol{\theta}^{(i)}\right)\right)\right), \qquad (10)$$
where $L$ is the Euclidean distance between the input and output vectors and $n$ is the number of samples. Figure 1 shows an illustration of an arbitrary autoencoder architecture.
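A minimal PyTorch sketch of such an autoencoder and of the training objective (9)–(10) is given below. The layer sizes correspond to those in Fig. 1 (164–100–50–$L$–50–100–164) and the tanh/linear activations to the description in Section 4; the optimizer, the learning rate, the number of epochs and the use of the mean squared error as a surrogate for the Euclidean loss $L$ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Encoder g and decoder h with a latent (feature) space of size L."""
    def __init__(self, n_in=164, L=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_in, 100), nn.Tanh(),
            nn.Linear(100, 50), nn.Tanh(),
            nn.Linear(50, L), nn.Tanh())
        self.decoder = nn.Sequential(
            nn.Linear(L, 50), nn.Tanh(),
            nn.Linear(50, 100), nn.Tanh(),
            nn.Linear(100, n_in))            # linear output layer

    def forward(self, theta):
        return self.decoder(self.encoder(theta))

def train(model, data, epochs=500, lr=1e-3):
    """Minimize (1/n) sum_i L(theta_i, h(g(theta_i))), cf. Eq. (10).

    data: tensor of shape (n, 164); MSE is used as a squared-Euclidean surrogate.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(data), data)
        loss.backward()
        opt.step()
    return model
```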
4 Experimental Evaluation
We experimentally tested how many database entries are needed to train the autoencoder network so that it faithfully represents the input data, with the goal of showing faster learning in the latent space.
The learning experiments were performed on a simulated humanoid robot
Talos, using the seven degrees of freedom of the robot's left arm for throwing. A passive
elastic spoon was used to extend the range of the throw of the ball. An image
sequence of a successful throw with the simulated Talos is shown in Fig. 2.
Fig. 2. Still images of the simulated throwing experiment using the Talos humanoid
robot in Gazebo dynamical simulation. A sequence of a successful throw is shown.
We first implemented learning in the space of the CDMP parameters. Start-
ing from an initial task-space trajectory demonstration, the RL using PoWER
modifies the policy so that the throw results in the ball landing in the basket.
The distance and the angle of the throw have to be learned, while the learn-
ing takes place in all 6 DOF of the CDMP trajectory encoding. In the CDMP
parameter space we learn the parameters $\boldsymbol{\theta}$ for the next, $(n+1)$-th experiment, $\boldsymbol{\theta}_{n+1}$, where
$$\boldsymbol{\theta} = \left[\boldsymbol{\theta}_x^{\mathrm{T}}, \boldsymbol{\theta}_y^{\mathrm{T}}, \boldsymbol{\theta}_z^{\mathrm{T}}, \boldsymbol{\theta}_{u_1}^{\mathrm{T}}, \boldsymbol{\theta}_{u_2}^{\mathrm{T}}, \boldsymbol{\theta}_{u_3}^{\mathrm{T}}\right]^{\mathrm{T}}, \qquad (11)$$
$$\boldsymbol{\theta}_{(\cdot)} = \left[\mathbf{w}^{\mathrm{T}}, y_0, g\right]^{\mathrm{T}}, \quad (\cdot) = x, y, z, u_1, u_2, u_3.$$
We used $N = 25$ basis functions per dimension. After taking the start and goal pose into consideration, the learning space comprises 164 parameters.
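For illustration only, the sketch below assembles one such 164-dimensional input vector. It assumes a layout consistent with the stated parameter count, namely the $6 \times 25$ forcing-term weights plus a 7-dimensional start and goal pose (position and unit quaternion); the exact ordering and composition used in our implementation may differ.

```python
import numpy as np

def cdmp_to_vector(w, start_pose, goal_pose):
    """Flatten CDMP parameters into a single autoencoder input vector (assumed layout).

    w          : (25, 6) forcing-term weights (x, y, z, u1, u2, u3) -> 150 values
    start_pose : (7,) start position + unit quaternion              ->   7 values
    goal_pose  : (7,) goal position + unit quaternion               ->   7 values
    Total: 164 values, matching the input layer size of the autoencoder.
    """
    return np.concatenate([w.flatten(order="F"), start_pose, goal_pose])

theta = cdmp_to_vector(np.zeros((25, 6)), np.zeros(7), np.zeros(7))
assert theta.shape == (164,)
```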
To create the database for autoencoder training, we recorded all the roll-outs
of the learning and the corresponding landing spots of the ball.
We used (11) as the input data for the autoencoder network. This sets the input and output layer sizes to 164. The autoencoder consisted of 5 hidden layers with 100, 50, $L$, 50, and 100 neurons. We varied the latent layer size $L$ between 2 and 10 in different experiments. The activation function of each hidden layer is $\mathbf{y} = \tanh(\mathbf{W}\boldsymbol{\theta}^{\#} + \mathbf{b})$, where $\boldsymbol{\theta}^{\#}$ is the input into a neural network layer and $\boldsymbol{\theta}^* = \{\mathbf{W}, \mathbf{b}\}$ are the autoencoder parameters. Note that the input is different for each layer, because it is the output of the previous layer. The activation function of the output layer was linear. After the training we split the autoencoder into the encoder and the decoder part. The encoder maps the input into the latent space, $\boldsymbol{\theta}^l = g(\boldsymbol{\theta})$, and the decoder maps from the latent space to the output, $\boldsymbol{\theta}' = h(\boldsymbol{\theta}^l)$, i.e., again into CDMP parameters that describe the robot's Cartesian trajectories.
In the latent space we also use PoWER [7] to learn $\boldsymbol{\theta}^l_{n+1}$. However, the learning space in this case is only $L$-dimensional. The values of the parameters in the latent space $\boldsymbol{\theta}^l$ define the DMP parameters, and therefore the shape of the trajectory on the robot, through the decoder network
$$\boldsymbol{\theta}'_{n+1} = h\left(\boldsymbol{\theta}^l_{n+1}\right). \qquad (12)$$
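Combined with the decoder from the previous sketch, latent-space learning amounts to exploring the $L$-dimensional vector $\boldsymbol{\theta}^l$ with PoWER and decoding it through (12) before every roll-out. The `Autoencoder`/`train` helpers and the `throw_and_score` function below are illustrative placeholders, not part of our actual implementation.

```python
import torch

# model = train(Autoencoder(L=4), database)        # trained as sketched in Sec. 3
# encoder, decoder = model.encoder, model.decoder  # g and h

def decode(theta_latent, decoder):
    """Eq. (12): map latent parameters back to the 164 CDMP parameters."""
    with torch.no_grad():
        t = torch.as_tensor(theta_latent, dtype=torch.float32)
        return decoder(t).numpy()

def latent_rollout(theta_latent, decoder, throw_and_score):
    """Decode, execute the throw and return the outcome (placeholder helper)."""
    theta = decode(theta_latent, decoder)
    return throw_and_score(theta)   # e.g. distance between basket and landing spot
```

PoWER is then run exactly as in Section 2.2, but on the $L$-dimensional $\boldsymbol{\theta}^l$ instead of the 164-dimensional $\boldsymbol{\theta}$.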
5 Results
The graphs of the mean square error for trajectory positions and quaternion orientations (Fig. 3) show the expected trend that a bigger database and a bigger latent space reduce the error. Similar results were reported in [3]. In our case, the trend settles for databases larger than 200 examples and latent space sizes larger than 4, which makes these the most efficient training parameters. Further improvements can be observed for much larger latent space and database sizes (e.g., 10 dimensions and 800 examples), but this considerably increases the cost of generating the input data and would slow down the learning process.
Fig. 3. Mean square error of the trajectory position and quaternion orientation with respect to the database size and latent space size.
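The sweep behind Fig. 3 can be reproduced schematically by training one autoencoder per combination of database size and latent space size and measuring the reconstruction error on held-out samples, as sketched below. The concrete size grids, the train/test split and the `Autoencoder`/`train` helpers from the earlier sketch are assumptions for illustration.

```python
import torch
import torch.nn as nn

def reconstruction_mse(database, db_sizes=(50, 100, 200, 400, 800),
                       latent_sizes=(2, 4, 6, 8, 10), n_test=100):
    """Train one autoencoder per (database size, latent size) pair and return
    the mean squared reconstruction error on a held-out test set."""
    test, train_pool = database[:n_test], database[n_test:]
    errors = {}
    for n_db in db_sizes:
        for L in latent_sizes:
            model = train(Autoencoder(L=L), train_pool[:n_db])  # sketch from Sec. 3
            with torch.no_grad():
                errors[(n_db, L)] = nn.functional.mse_loss(model(test), test).item()
    return errors
```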
As a proof of concept we tested our RL algorithm with three different reward systems (exact, unsigned, signed [10]) and two different search spaces (task space and latent space). Learning was repeated 25 times for each case. For latent space RL we chose the neural network with latent space size 4 that was trained on 200 throws.
Figure 4 shows the error convergence through the iterations of RL. The error is given as the absolute distance in meters between the basket and the landing spot of the ball. The graphs show that all the reward systems converged successfully in both task and latent space, and that the reduced reward systems (unsigned and signed) converged to the target almost as fast as the exact reward system. Learning the parameters in latent space outperformed learning in task space in all cases, regardless of the reward.

Fig. 4. Average error of throwing through the iterations of learning. RL in configuration space is shown in the left plot, and RL in latent space in the right plot. In both plots, the exact reward is shown with the red line, the unsigned reward with the green line and the signed reward with the blue line. Shaded areas show the corresponding distributions for all the reward systems.
The average convergence rates for the different reward systems and cases are shown in Fig. 5. The top left graph shows the average iteration of the first successful throw for RL in task space, and the bottom left graph shows the same for RL in the latent space. On the right side, the maximal number of iterations needed to accomplish the task is shown for both task space (top graph) and latent space (bottom graph).
Fig. 5. Average number of throws until the first hit (left) and maximal number of throws until the first hit (right) for different reward systems.
6 Conclusion
Apart from confirming the results shown in [10], namely that the simplified reward systems can work equally well as the exact reward if only a terminal reward is available, we have also shown that the search space for RL can be successfully reduced using neural networks, even for complex systems such as a high degree-of-freedom robot arm. All the experiments were conducted on a simulated robot, which behaves slightly differently than the real system would, but allows faster generation of the required training database. In the future we plan to test this approach on the real robot as well.
References
1. Bitzer, S., Vijayakumar, S.: Latent spaces for dynamic movement primitives. In:
2009 9th IEEE-RAS International Conference on Humanoid Robots. pp. 574–581
(Dec 2009)
2. Chen, N., Bayer, J., Urban, S., van der Smagt, P.: Efficient movement representa-
tion by embedding dynamic movement primitives in deep autoencoders. In: 2015
IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids). pp.
434–440 (Nov 2015)
3. Chen, N., Karl, M., van der Smagt, P.: Dynamic Movement Primitives in Latent Space
of Time-Dependent Variational Autoencoders. 2016 IEEE-RAS 16th International
Conference on Humanoid Robots (Humanoids) 2(3), 629–636 (2016)
4. Deisenroth, M.P., Neumann, G., Peters, J.: A survey on policy search for robotics
pp. 388–403 (2013)
5. Gams, A., Petrič, T., Do, M., Nemec, B., Morimoto, J., Asfour, T., Ude, A.:
Adaptation and coaching of periodic motion primitives through physical and
visual interaction. Robotics and Autonomous Systems 75, 340 – 351 (2016),
http://www.sciencedirect.com/science/article/pii/S0921889015001992
6. Ijspeert, A., Nakanishi, J., Pastor, P., Hoffmann, H., Schaal, S.: Dynamical move-
ment primitives: Learning attractor models for motor behaviors. Neural Compu-
tation 25(2), 328–373 (2013)
7. Kober, J., Peters, J.: Policy search for motor primitives in robotics. Machine Learning 84(1-2), 171–203 (2011)
8. Kober, J., Bagnell, J.A., Peters, J.: Reinforcement learning in robotics: A survey. The International Journal of Robotics Research 32(11), 1238–1274 (2013)
9. Nemec, B., Vuga, R., Ude, A.: Efficient sensorimotor learning from multiple demon-
strations. Advanced Robotics 27(13), 1023–1031 (2013)
10. Pahič, R., Lončarević, Z., Ude, A., Nemec, B., Gams, A.: User feedback in latent
space robotic skill learning. In: 2018 IEEE-RAS 18th International Conference on
Humanoid Robots (Humanoids). pp. 270–276 (Nov 2018)
11. Schaal, S.: Is imitation learning the route to humanoid robots? Trends in Cognitive
Sciences 3(6), 233–242 (1999)
12. Ude, A., Gams, A., Asfour, T., Morimoto, J.: Task-specific generalization of dis-
crete and periodic dynamic movement primitives. IEEE Transactions on Robotics
26(5), 800–815 (Oct 2010)
13. Ude, A., Nemec, B., Petrič, T., Morimoto, J.: Orientation in Cartesian space dy-
namic movement primitives. In: 2014 IEEE International Conference on Robotics
and Automation (ICRA). pp. 2997–3004 (May 2014)
14. Vijayakumar, S., D’souza, A., Schaal, S.: Incremental online learning in high di-
mensions. Neural Comput. 17(12), 2602–2634 (Dec 2005), https://doi.org/10.1162/
089976605774320557