
Reduction of Trajectory Encoding Data using a Deep Autoencoder Network: Robotic Throwing

Zvezdan Lončarević⋆, Rok Pahič, Mihael Simonič, Aleš Ude and Andrej Gams
Jožef Stefan Institute
Jamova cesta 39, 1000 Ljubljana, Slovenia
zvezdan.loncarevic@ijs.si

Abstract. Autonomous learning and adaptation of robotic trajectories by complex robots in unstructured environments, for example with the use of reinforcement learning, very quickly encounters problems where the dimensionality of the search space is beyond the range of practical use. Different methods of reducing the dimensionality have been proposed in the literature. In this paper we explore the use of deep autoencoders, where the dimensionality of the autoencoder latent space is low. However, a database of actions is required to train a deep autoencoder network. The paper presents a study on the number of database samples required to achieve dimensionality reduction without much loss of information.

Keywords: deep autoencoder, reinforcement learning, robotic throwing

1 Introduction and Related Work

Learning and adaptation of robotic skills has been researched in depth over the last decade. However, as [11] reports, the search space for learning complete actions/skills from scratch is too large, making learning from scratch unfeasible. Several methods have been proposed to reduce the size of the learning space. For example, the search space was reduced with accurate initial demonstrations [5], or by confining it to the space between an over- and an under-attempt [9]. Policy learning has been developed for continuous actions [4]. Information has been reduced by keeping only the principal components of the data [1] and by specific methods of trajectory encoding, such as locally weighted projection regression [14], leading to latent spaces [1].

Latent spaces of deep autoencoders have also been proposed as a means of dimensionality reduction of robotic actions and skills [2], specifically in combination with policy representations in the form of dynamic movement primitives [6]. In our previous paper on this topic we have shown that skill learning (using reinforcement learning) in the latent space of a deep autoencoder is faster than learning in the space of the trajectory [10].

⋆ Holder of Ad futura Scholarship for Postgraduate Studies of Nationals of Western Balkan States for Study in the Republic of Slovenia (226. Public Call)


However, the problem of generating the database for training the autoencoder network remains. While in [10] we used simulated results to train the autoencoder network, in this paper we explore how much data is actually needed for a faithful representation of an action. To do this, we used the humanoid robot Talos to perform throwing actions with one of its arms, where the throwing trajectory was encoded in the task space. We applied reinforcement learning (RL [8]) to learn throwing at the desired spot. We recorded all the roll-outs and used them to build our database. By learning to throw at different locations, we quickly acquired a large database, and then used parts of it to see how accurately we can perform throws at untrained locations. Testing different database sizes and deep autoencoder latent space sizes can provide an insight into whether collecting a small database in the real world makes sense, or whether it is better to generate a large database in simulation and then apply RL in the latent space of an autoencoder trained on this database. This paper reports on the first steps in this direction, with a study on the required simulated database.

The rest of the paper is organized as follows. In the next section we first show how the task-space trajectories of throwing are encoded and how reinforcement learning using PoWER is applied to refine a given task. Section 3 gives details on autoencoders. The experiments are described in Section 4, followed by the results in Section 5 and a conclusion.

2 Trajectory and Learning

Robot motion trajectories can be written either in joint space or in Cartesian space. Cartesian space trajectories are typically easier for humans to interpret, especially for high degree-of-freedom robots. In this paper the motion trajectories were encoded in the task space of the end-effector, in the form of Cartesian space dynamic movement primitives (CDMPs). The original derivation of CDMPs was proposed in [13]. For clarity we provide a short recap of CDMPs below.

2.1 Cartesian space Dynamic Movement Primitives

CDMPs are composed of two parts: a position part and an orientation part. The position part of the trajectory is the same as in standard DMPs. The orientation part of the CDMP, on the other hand, is represented by unit quaternions. Unit quaternions require special treatment, both in the nonlinear dynamic equations and in the integration of these equations.

The following parameters compose a CDMP: weights w^p_k, w^o_k ∈ R^3, k = 1, ..., N, which represent the position and orientation parts of the trajectory, respectively; trajectory duration τ; and the final desired goal position g^p and orientation g^o of the robot. Variable N sets the number of radial basis functions that are used to encode the trajectory. The orientation is in a CDMP represented by a unit quaternion q = v + u ∈ S^3, where S^3 is the unit sphere in R^4, v ∈ R is its scalar part and u ∈ R^3 its vector part. To encode position (p) and orientation (q) trajectories we use the following differential equations:

τ ż = α_z (β_z (g^p − p) − z) + f_p(x),    (1)
τ ṗ = z,    (2)
τ η̇ = α_z (β_z 2 log(g^o ∗ q̄) − η) + f_o(x),    (3)
τ q̇ = (1/2) η ∗ q,    (4)
τ ẋ = −α_x x.    (5)

Parameters z and η denote the scaled linear and angular velocity (z = τ ṗ, η = τ ω). For details on the quaternion product ∗, the conjugation q̄, and the quaternion logarithm log(q), see [13]. The nonlinear parts, also termed forcing terms, f_p and f_o, are defined as

f_p(x) = D^p (Σ_{k=1}^N w^p_k Ψ_k(x)) / (Σ_{k=1}^N Ψ_k(x)) x,    (6)
f_o(x) = D^o (Σ_{k=1}^N w^o_k Ψ_k(x)) / (Σ_{k=1}^N Ψ_k(x)) x.    (7)

The forcing terms contain the parameters w^p_k, w^o_k ∈ R^3. They have to be learned, for example directly from an input Cartesian trajectory {p_j, q_j, ṗ_j, ω_j, p̈_j, ω̇_j, t_j}, j = 1, ..., T. The scaling matrices D^p, D^o ∈ R^{3×3} can be set to D^p = D^o = I; other possibilities are described in [13]. The nonlinear forcing terms are defined as a linear combination of radial basis functions Ψ_k,

Ψ_k(x) = exp(−h_k (x − c_k)^2).    (8)

Here c_k are the centers and h_k the widths of the radial basis functions. Their distribution can be chosen as in [12]: c_k = exp(−α_x (k−1)/(N−1)), h_k = 1/(c_{k+1} − c_k)^2, h_N = h_{N−1}, k = 1, ..., N. The time constant τ is set to the desired duration of the trajectory, i.e. τ = t_T − t_1. The goal position and orientation are usually set to the final position and orientation on the desired trajectory, i.e. g^p = p(t_T) and g^o = q(t_T). A detailed CDMP description and the auxiliary math are explained in [13].
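To make the encoding concrete, the following Python sketch integrates the position part of a CDMP, i.e. Eqs. (1), (2), (5), (6) and (8), with simple Euler steps. It is an illustration only, not the implementation used in the experiments; the gain values α_z, β_z, α_x, the time step and the function names are our own assumptions.

import numpy as np

def rbf(x, c, h):
    """Radial basis functions Psi_k(x) from Eq. (8)."""
    return np.exp(-h * (x - c) ** 2)

def integrate_cdmp_position(w_p, g_p, p0, tau, alpha_z=48.0, beta_z=12.0,
                            alpha_x=2.0, dt=0.002):
    """Integrate Eqs. (1), (2) and (5) for the position part of a CDMP.

    w_p : (N, 3) weights, g_p : (3,) goal position, p0 : (3,) start position.
    Returns the sampled Cartesian positions of the trajectory.
    """
    N = w_p.shape[0]
    # Basis function centers and widths, distributed as in [12]
    c = np.exp(-alpha_x * np.arange(N) / (N - 1))
    h = 1.0 / np.diff(c) ** 2
    h = np.append(h, h[-1])

    p, z, x = p0.copy(), np.zeros(3), 1.0
    path = [p.copy()]
    for _ in range(int(tau / dt)):
        psi = rbf(x, c, h)
        f_p = (w_p.T @ psi) / psi.sum() * x                        # Eq. (6), D^p = I
        z_dot = (alpha_z * (beta_z * (g_p - p) - z) + f_p) / tau   # Eq. (1)
        p_dot = z / tau                                            # Eq. (2)
        x_dot = -alpha_x * x / tau                                 # Eq. (5)
        z, p, x = z + z_dot * dt, p + p_dot * dt, x + x_dot * dt
        path.append(p.copy())
    return np.array(path)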

2.2 PoWER

The goal of the paper is to show that autonomous learning, in our case reinforcement learning, can be faster in the latent space. We use the Policy Learning by Weighting Exploration with the Returns (PoWER) [7] RL algorithm. It is an Expectation-Maximization (EM) based RL algorithm that can be combined with importance sampling to better exploit previous experience. It tries to maximize the expected return of trials using a parametrized policy, such as the aforementioned CDMPs. We use PoWER because it is robust with respect to reward functions [7]. Furthermore, it uses only the terminal reward and no intermediate rewards.
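As a rough illustration of the idea behind PoWER, a single parameter update can be sketched as reward-weighted averaging of exploration noise. This is a simplified EM-style sketch, not the exact formulation of [7]; the callback execute_and_reward and all numerical settings are hypothetical.

import numpy as np

def power_update(theta, execute_and_reward, sigma=0.05, n_rollouts=10, n_best=5):
    """One PoWER-style update: explore around theta, reweight noise by returns.

    theta              : current policy parameters (e.g. CDMP parameter vector).
    execute_and_reward : hypothetical callback running one roll-out with the
                         given parameters and returning its terminal reward.
    """
    eps = sigma * np.random.randn(n_rollouts, theta.size)      # exploration noise
    rewards = np.array([execute_and_reward(theta + e) for e in eps])

    # Importance sampling: keep only the best roll-outs of this batch.
    best = np.argsort(rewards)[-n_best:]
    w = rewards[best]
    # Reward-weighted average of the exploration noise (EM-style update).
    delta = (w[:, None] * eps[best]).sum(axis=0) / (w.sum() + 1e-12)
    return theta + delta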



Fig. 1. Illustration of the autoencoder structure with five hidden layers. Note that the number of neurons per layer in the actual autoencoder is too high for an effective illustration, so each layer is drawn with one third of the actual number of neurons. The depicted number of neurons per layer therefore does not match the numbers we used (164 for the input and output layers; 100, 50, 10, 50, and 100 for the hidden layers).

3 Deep Autoencoder Network

An autoencoder is a neural network used for unsupervised encoding of data. The network is comprised of two parts: an encoder and a decoder network. The encoder network part encodes the data. Because the number of neurons in the hidden layers is lower than in the input layer, the data is forced through a bottleneck, or latent space, in which the most relevant features are extracted. The latent space is therefore often also called the feature space. One of the typical applications of autoencoders is dimensionality reduction [2]. The decoder network part reconstructs the feature representation so that the output data θ′ matches the input data θ. Training of the parameters of the autoencoder (θ*) is described by

θ* = arg min_θ* (1/n) Σ_{i=1}^n L(θ^(i), θ′^(i))    (9)
   = arg min_θ* (1/n) Σ_{i=1}^n L(θ^(i), h(g(θ^(i)))),    (10)

where L is the Euclidean distance between the input and output vectors and n is the number of samples. Figure 1 shows an illustration of an arbitrary autoencoder architecture.
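For illustration, the objective of Eqs. (9)-(10) with a Euclidean loss can be written as follows, assuming PyTorch and generic encoder/decoder modules g and h (a sketch, not the authors' implementation):

import torch

def reconstruction_loss(theta_batch, encoder, decoder):
    """L of Eqs. (9)-(10): mean Euclidean distance between input and output."""
    theta_out = decoder(encoder(theta_batch))   # theta' = h(g(theta))
    return torch.mean(torch.norm(theta_batch - theta_out, dim=1))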

4 Experimental Evaluation

We experimentally tested how many database entries are needed for training the autoencoder network to faithfully represent the input data, with the goal of showing faster learning in the latent space.


The learning experiments were performed on a simulated humanoid robot Talos, using seven degrees of freedom of the robot's left arm for the throwing. A passive elastic spoon was used to extend the range of the throw of the ball. An image sequence of a successful throw with the simulated Talos is shown in Fig. 2.

Fig. 2. Still images of the simulated throwing experiment using the Talos humanoid

robot in Gazebo dynamical simulation. A sequence of a successful throw is shown.

We first implemented learning in the space of the CDMP parameters. Starting from an initial task-space trajectory demonstration, RL using PoWER modifies the policy so that the throw results in the ball landing in the basket. The distance and the angle of the throw have to be learned, while the learning takes place in all 6 DOF of the CDMP trajectory encoding. In the CDMP parameter space we learn the parameters θ for the next, (n+1)-th experiment, θ_{n+1}, where

θ = [θ_x^T, θ_y^T, θ_z^T, θ_u1^T, θ_u2^T, θ_u3^T]^T,    (11)
θ_(·) = [w^T, y_0, g]^T,   (·) = x, y, z, u1, u2, u3.

We used N = 25 basis functions per dimension. After taking into consideration the start and goal pose, the size of the learning space sums up to 164 parameters.
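For illustration, the parameter vector (11) could be assembled as sketched below. We assume here that the start and goal poses each contribute 3 position and 4 quaternion values, which together with the 6 × 25 weights gives 164 parameters; the field names of the CDMP object and the exact ordering are hypothetical.

import numpy as np

def cdmp_to_vector(cdmp):
    """Flatten a CDMP into the learning/autoencoder input vector of Eq. (11).

    Assumed (hypothetical) fields: cdmp.w_p and cdmp.w_o are (25, 3) weight
    arrays, cdmp.p0/cdmp.g_p are start/goal positions (3,), and
    cdmp.q0/cdmp.g_o are start/goal unit quaternions (4,).
    """
    return np.concatenate([
        cdmp.w_p.flatten(order="F"),   # 75 position weights (x, y, z)
        cdmp.w_o.flatten(order="F"),   # 75 orientation weights (u1, u2, u3)
        cdmp.p0, cdmp.q0,              # start pose: 3 + 4
        cdmp.g_p, cdmp.g_o,            # goal pose:  3 + 4
    ])                                 # 150 + 14 = 164 parameters in total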

To create the database for autoencoder training, we recorded all the roll-outs

of the learning and the corresponding landing spots of the ball.

We used (11) as the input data for the autoencoder network. This sets the input and output layer sizes to 164. The autoencoder was comprised of 5 hidden layers with 100, 50, L, 50, and 100 neurons. We varied the latent layer size L between 2 and 10 for different experiments. The activation function for each hidden layer is y = tanh(Wθ^# + b), where θ^# is the input into a neural network layer and θ* = {W, b} are the autoencoder parameters. Note that the input is different for each layer, because it is the output of the previous layer. The activation function of the output layer was linear. After the training we split the autoencoder into the encoder and the decoder parts. The encoder maps the input into the latent space, θ^l = g(θ), and the decoder maps from the latent space to the output, θ′ = h(θ^l), i.e., again into CDMP parameters that describe the robot Cartesian trajectories.
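A sketch of the described architecture, assuming PyTorch (the original implementation may differ), with tanh activations in the hidden layers, a linear output layer, and the encoder/decoder split applied after training:

import torch
import torch.nn as nn

L = 4  # latent space size; varied between 2 and 10 in the experiments

encoder = nn.Sequential(                 # g: R^164 -> R^L
    nn.Linear(164, 100), nn.Tanh(),
    nn.Linear(100, 50), nn.Tanh(),
    nn.Linear(50, L), nn.Tanh(),
)
decoder = nn.Sequential(                 # h: R^L -> R^164
    nn.Linear(L, 50), nn.Tanh(),
    nn.Linear(50, 100), nn.Tanh(),
    nn.Linear(100, 164),                 # linear output layer
)
autoencoder = nn.Sequential(encoder, decoder)  # trained with Eqs. (9)-(10)

# After training, the two halves are used separately:
#   theta_l  = encoder(theta)    maps CDMP parameters into the latent space
#   theta_out = decoder(theta_l) maps latent parameters back to CDMP parameters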

In latent space we also use PoWER [7] to learn θ^l_{n+1}. However, the learning space in this case is only L-dimensional. The values of the parameters in latent space θ^l define the DMP parameters, and therefore the shape of the trajectory on the robot, through the decoder network

θ′_{n+1} = h(θ^l_{n+1}).    (12)
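Putting the pieces together, the latent-space learning loop can be sketched as follows. This is illustrative only: power_update refers to the sketch in Section 2.2 and run_throw is a hypothetical function that executes the decoded CDMP on the robot and returns the reward.

import numpy as np
import torch

def learn_in_latent_space(theta_l0, decoder, run_throw, n_iterations=50):
    """RL in the L-dimensional latent space; Eq. (12) maps back to CDMP space."""
    theta_l = theta_l0.copy()
    for _ in range(n_iterations):
        def execute_and_reward(candidate):
            with torch.no_grad():
                theta = decoder(torch.as_tensor(candidate, dtype=torch.float32))
            return run_throw(theta.numpy())      # decode, execute, get reward
        theta_l = power_update(theta_l, execute_and_reward)
    return theta_l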

5 Results

From the graphs of the mean square error for trajectory positions and quaternion orientations (Fig. 3) we can see the expected trend that a bigger database and a bigger latent space reduce the error. A similar trend was reported in [3]. In our case, the trend settles for databases larger than 200 examples and latent space sizes larger than 4; these are the most efficient training parameters in our case. We can still see certain improvements for much bigger latent space and database sizes (e.g. 10 dimensions and 800 examples), but this leads to much higher costs of generating the input data and would slow down the learning process.

Fig. 3. Mean square error of the trajectory position and quaternion orientation with respect to database size and latent space size.

As a proof of concept, we tested our RL algorithm with three different reward systems (exact, unsigned, signed [10]) and two different search spaces (task space and latent space). Learning was repeated 25 times for all cases. For latent space RL we chose the neural network with latent space size 4 that was trained on 200 shots.

Figure 4 shows the error convergence through the iterations of RL. The error is given as the absolute distance in meters between the basket and the landing spot of the ball. The graphs show that all the reward systems converged successfully in both task and latent space, and that the reduced reward systems (unsigned and signed) converged to the target almost as fast as the exact reward system. Learning the parameters in latent space outperformed learning in task space in all cases, regardless of the reward.

Fig. 4. Average error of throwing through the iterations of learning. RL in configuration space is shown in the left plot, and in latent space in the right plot. In both plots, the exact reward is shown with the red line, the unsigned reward with the green line and the signed reward with the blue line. Shaded areas show the corresponding distributions for all the reward systems.

The average convergence rates for the different reward systems and cases are shown in Fig. 5. The top left graph shows the average iteration of the first successful shot for RL in task space, and the bottom left graph shows the average iteration of the first successful shot for RL in the latent space. On the right side, the maximal number of iterations needed for the successful accomplishment of the task is shown for both task space (top graph) and latent space (bottom graph).

Fig. 5. Average number of throws until the first hit (left) and maximal number of throws until the first hit (right) for different reward systems.


6 Conclusion

Apart from confirming the results shown in [10], namely that the simplified reward systems can work equally well as the exact reward if only a terminal reward is available, we have also shown that the search space size for RL can be successfully reduced using neural networks even for complex systems such as a high degree-of-freedom robot arm. All the experiments were conducted on a simulated robot, which behaves slightly differently than the real system would, but simulation makes it much faster to generate the required training database. In the future we plan to test this approach on the real robot as well.

References

1. Bitzer, S., Vijayakumar, S.: Latent spaces for dynamic movement primitives. In: 2009 9th IEEE-RAS International Conference on Humanoid Robots. pp. 574–581 (Dec 2009)
2. Chen, N., Bayer, J., Urban, S., van der Smagt, P.: Efficient movement representation by embedding dynamic movement primitives in deep autoencoders. In: 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids). pp. 434–440 (Nov 2015)
3. Chen, N., Karl, M., van der Smagt, P.: Dynamic movement primitives in latent space of time-dependent variational autoencoders. In: 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids). pp. 629–636 (2016)
4. Deisenroth, M.P., Neumann, G., Peters, J.: A survey on policy search for robotics. pp. 388–403 (2013)
5. Gams, A., Petrič, T., Do, M., Nemec, B., Morimoto, J., Asfour, T., Ude, A.: Adaptation and coaching of periodic motion primitives through physical and visual interaction. Robotics and Autonomous Systems 75, 340–351 (2016), http://www.sciencedirect.com/science/article/pii/S0921889015001992
6. Ijspeert, A., Nakanishi, J., Pastor, P., Hoffmann, H., Schaal, S.: Dynamical movement primitives: Learning attractor models for motor behaviors. Neural Computation 25(2), 328–373 (2013)
7. Kober, J., Peters, J.: Policy search for motor primitives in robotics. Machine Learning 84(1–2), 171–203 (2011)
8. Kober, J., Bagnell, J.A., Peters, J.: Reinforcement learning in robotics: A survey. The International Journal of Robotics Research 32(11), 1238–1274 (2013)
9. Nemec, B., Vuga, R., Ude, A.: Efficient sensorimotor learning from multiple demonstrations. Advanced Robotics 27(13), 1023–1031 (2013)
10. Pahič, R., Lončarević, Z., Ude, A., Nemec, B., Gams, A.: User feedback in latent space robotic skill learning. In: 2018 IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids). pp. 270–276 (Nov 2018)
11. Schaal, S.: Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3(6), 233–242 (1999)
12. Ude, A., Gams, A., Asfour, T., Morimoto, J.: Task-specific generalization of discrete and periodic dynamic movement primitives. IEEE Transactions on Robotics 26(5), 800–815 (Oct 2010)
13. Ude, A., Nemec, B., Petrič, T., Morimoto, J.: Orientation in Cartesian space dynamic movement primitives. In: 2014 IEEE International Conference on Robotics and Automation (ICRA). pp. 2997–3004 (May 2014)
14. Vijayakumar, S., D'souza, A., Schaal, S.: Incremental online learning in high dimensions. Neural Computation 17(12), 2602–2634 (Dec 2005), https://doi.org/10.1162/089976605774320557