Reduction of Trajectory Encoding Data Using a Deep Autoencoder Network: Robotic Throwing


Zvezdan Lončarević*, Rok Pahič, Mihael Simonič, Aleš Ude and Andrej Gams
Jožef Stefan Institute
Jamova cesta 39, 1000 Ljubljana, Slovenia
Abstract. Autonomous learning and adaptation of robotic trajectories
by complex robots in unstructured environments, for example with the
use of reinforcement learning, very quickly encounters problems where
the dimensionality of the search space is beyond the range of practical
use. Different methods of reducing the dimensionality have been proposed
in the literature. In this paper we explore the use of deep autoencoders,
where the dimensionality of the autoencoder latent space is low.
However, a database of actions is required to train a deep autoencoder
network. The paper presents a study on the number of required database
samples in order to achieve dimensionality reduction without much loss
of information.
Keywords: deep autoencoder, reinforcement learning, robotic throwing
1 Introduction and Related Work
Learning and adaptation of robotic skills has been researched in depth over the
last decade. However, as [11] reports, the search space for learning complete
actions/skills from scratch is too large, which makes learning from scratch infeasible.
Several methods have been proposed to reduce the size of the learning space.
For example, the search space was reduced with accurate initial demonstrations
[5], or by confining it to the space between an over- and an under-attempt [9].
Policy learning has been developed for continuous actions [4]. Information has
been reduced by keeping only the principal components of the data [1] and by
specific methods of trajectory encoding, such as locally weighted projection
regression [14], leading to latent spaces [1].
Latent spaces of deep autoencoders have also been proposed as a means of
dimensionality reduction of robotic actions and skills [2], specifically in
combination with policy representations in the form of dynamic movement primitives
[6]. In our previous paper on this topic we have shown that skill learning (using
reinforcement learning) in the latent space of deep autoencoders is faster than
learning in the space of the trajectory [10].
*Holder of Ad futura Scholarship for Postgraduate Studies of Nationals of Western
Balkan States for Study in the Republic of Slovenia (226. Public Call)
However, the problem of generating the database for training the autoencoder
network remains. While in [10] we used simulated results to train the autoencoder
network, in this paper we explore how much data is actually needed for faithful
representation of an action. To do this, we used the humanoid robot Talos to
perform throwing actions with one of its arms, where the trajectory of throwing
was encoded in the task space. We applied reinforcement learning (RL [8]) to
learn throwing at the desired spot. We recorded all the roll-outs and used them
to make our database. By learning to throw at different locations, we quickly
acquired a large database, and then used parts of it to see how accurately we
can perform throws at untrained locations. Testing different database sizes
and deep autoencoder latent space sizes can provide insight into whether training
on a small database in the real world makes sense, or whether it is better to train
on a large database in simulation and then apply RL in the latent space of an
autoencoder trained with this database. This paper reports on the first steps
of researching in this direction, with the research on the required simulated
The rest of the paper is organized as follows. In the next Section we first show
how the task-space trajectories of throwing are encoded and how reinforcement
learning using PoWER is applied to refine a given task. Section 3 gives details
on autoencoders. The experiments are described in Section 4, followed by results
in Section 5 and a Conclusion.
2 Trajectory and Learning
Robot motion trajectories can be written in either joint space or in Cartesian
space. Cartesian space trajectories are typically easier to interpret for humans,
especially in high degree-of-freedom robots. In this paper the trajectories of
motion were encoded in task space of the end-effector, in the form of Cartesian space
dynamic movement primitives (CDMPs). The original derivation of CDMPs was
proposed in [13]. For clarity we provide a short recap of CDMPs below.
2.1 Cartesian space Dynamic Movement Primitives
CDMPs are composed of two parts: a position part and an orientation part. The
position part of the trajectory is the same as in standard DMPs. The orientation
part of the CDMP, on the other hand, is represented by unit quaternions. Unit
quaternions require special treatment, for both the nonlinear dynamic equations
and the integration of these equations.
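To make this special treatment more concrete, the following Python sketch implements the quaternion product, conjugation, logarithm and exponential map in the form commonly used for CDMPs; see [13] for the authoritative definitions. The function names and the tuple layout (scalar part first) are assumptions made for this illustration and are not taken from the original implementation.

```python
import math

def q_mul(a, b):
    """Hamilton product of two quaternions given as (v, u1, u2, u3)."""
    av, ax, ay, az = a
    bv, bx, by, bz = b
    return (
        av * bv - ax * bx - ay * by - az * bz,
        av * bx + ax * bv + ay * bz - az * by,
        av * by - ax * bz + ay * bv + az * bx,
        av * bz + ax * by - ay * bx + az * bv,
    )

def q_conj(q):
    """Conjugate: same scalar part, negated vector part."""
    v, x, y, z = q
    return (v, -x, -y, -z)

def q_log(q):
    """Logarithm of a unit quaternion, returned as a 3-vector."""
    v, x, y, z = q
    norm_u = math.sqrt(x * x + y * y + z * z)
    if norm_u < 1e-12:
        return (0.0, 0.0, 0.0)
    angle = math.acos(max(-1.0, min(1.0, v)))  # clamp against rounding
    return (angle * x / norm_u, angle * y / norm_u, angle * z / norm_u)

def q_exp(r):
    """Exponential map: 3-vector back to a unit quaternion."""
    norm_r = math.sqrt(sum(c * c for c in r))
    if norm_r < 1e-12:
        return (1.0, 0.0, 0.0, 0.0)
    s = math.sin(norm_r) / norm_r
    return (math.cos(norm_r), s * r[0], s * r[1], s * r[2])

def orientation_error(goal, q):
    """Rotation vector 2 * log(goal * conj(q)) that takes q to goal."""
    return tuple(2.0 * c for c in q_log(q_mul(goal, q_conj(q))))

# Example: the goal is a rotation of 0.6 rad about the z axis,
# the current orientation is the identity quaternion.
goal = (math.cos(0.3), 0.0, 0.0, math.sin(0.3))
identity = (1.0, 0.0, 0.0, 0.0)
error = orientation_error(goal, identity)
```

In this example `error` is approximately (0, 0, 0.6), i.e. the axis-angle vector of the remaining rotation. The exponential map is what a quaternion integration step would use so that the integrated orientation stays on the unit sphere.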
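The following parameters compose a CDMP: weights $\mathbf{w}^p_k, \mathbf{w}^o_k \in \mathbb{R}^3$,
$k = 1, \ldots, N$, which represent the position and orientation parts of the trajectory,
respectively; the trajectory duration $\tau$; and the final desired (goal) position $\mathbf{g}_p$
and orientation $\mathbf{g}_o$ of the robot. The variable $N$ sets the number of radial basis
functions that are used to encode the trajectory. In a CDMP the orientation is represented
by a unit quaternion $\mathbf{q} = v + \mathbf{u} \in S^3$, where $S^3$ is the unit sphere in $\mathbb{R}^4$,
$v \in \mathbb{R}$ is its scalar part and $\mathbf{u} \in \mathbb{R}^3$ its vector part. To encode the position ($\mathbf{p}$)
and orientation ($\mathbf{q}$) trajectories we use the following differential equations:
$$\tau \dot{\mathbf{z}} = \alpha_z(\beta_z(\mathbf{g}_p - \mathbf{p}) - \mathbf{z}) + \mathbf{f}_p(x), \tag{1}$$
$$\tau \dot{\mathbf{p}} = \mathbf{z}, \tag{2}$$
$$\tau \dot{\boldsymbol{\eta}} = \alpha_z(\beta_z 2 \log(\mathbf{g}_o * \bar{\mathbf{q}}) - \boldsymbol{\eta}) + \mathbf{f}_o(x), \tag{3}$$
$$\tau \dot{\mathbf{q}} = \tfrac{1}{2} \boldsymbol{\eta} * \mathbf{q}, \tag{4}$$
$$\tau \dot{x} = -\alpha_x x. \tag{5}$$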
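Parameters $\mathbf{z}$, $\boldsymbol{\eta}$ denote the scaled linear and angular velocity ($\mathbf{z} = \tau \dot{\mathbf{p}}$, $\boldsymbol{\eta} = \tau \boldsymbol{\omega}$).
For details on the quaternion product $*$, conjugation $\bar{\mathbf{q}}$, and the quaternion logarithm
$\log(\mathbf{q})$, see [13]. The nonlinear parts, also termed forcing terms, $\mathbf{f}_p$ and $\mathbf{f}_o$ are
defined as
$$\mathbf{f}_p(x) = \mathbf{D}_p \frac{\sum_{k=1}^{N} \mathbf{w}^p_k \Psi_k(x)}{\sum_{k=1}^{N} \Psi_k(x)} x, \tag{6}$$
$$\mathbf{f}_o(x) = \mathbf{D}_o \frac{\sum_{k=1}^{N} \mathbf{w}^o_k \Psi_k(x)}{\sum_{k=1}^{N} \Psi_k(x)} x. \tag{7}$$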
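The forcing terms contain the parameters $\mathbf{w}^p_k, \mathbf{w}^o_k \in \mathbb{R}^3$. They have to be learned, for
example directly from an input Cartesian trajectory $\{\mathbf{p}_j, \mathbf{q}_j, \boldsymbol{\omega}_j, t_j\}_{j=1}^{T}$.
The scaling matrices $\mathbf{D}_p, \mathbf{D}_o \in \mathbb{R}^{3 \times 3}$ can be set to $\mathbf{D}_p = \mathbf{D}_o = \mathbf{I}$. Other
possibilities are described in [13]. The nonlinear forcing terms are defined as a linear
combination of radial basis functions $\Psi_k$,
$$\Psi_k(x) = \exp\left(-h_k (x - c_k)^2\right). \tag{8}$$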
Here $c_k$ are the centers and $h_k$ the widths of the radial basis functions. The
centers and widths can be distributed, as in [12], as $c_k = \exp\left(-\alpha_x \frac{k-1}{N-1}\right)$,
$h_k = \frac{1}{(c_{k+1} - c_k)^2}$, $h_N = h_{N-1}$, $k = 1, \ldots, N$. The time constant $\tau$ is set
to the desired duration of the trajectory, i.e. $\tau = t_T - t_1$. The goal position and
orientation are usually set to the final position and orientation on the desired
trajectory, i.e. $\mathbf{g}_p = \mathbf{p}_{t_T}$ and $\mathbf{g}_o = \mathbf{q}_{t_T}$. A detailed CDMP description and
the auxiliary math are given in [13].
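As an illustration of how the position part of a CDMP generates a trajectory, the sketch below integrates the transformation system, the forcing term and the phase equation for a single Cartesian dimension with explicit Euler steps. It is a minimal sketch under stated assumptions: the gains `alpha_z = 48`, `beta_z = 12`, `alpha_x = 2`, the step size `dt`, the scaling $D_p = 1$ and all function names are illustrative choices that the text does not specify.

```python
import math

def make_basis(n_basis, alpha_x):
    """Centers c_k and widths h_k of the radial basis functions."""
    centers = [math.exp(-alpha_x * k / (n_basis - 1)) for k in range(n_basis)]
    widths = [1.0 / (centers[k + 1] - centers[k]) ** 2
              for k in range(n_basis - 1)]
    widths.append(widths[-1])  # h_N = h_{N-1}
    return centers, widths

def forcing_term(x, weights, centers, widths):
    """Normalized weighted sum of basis functions, scaled by the phase x."""
    psi = [math.exp(-h * (x - c) ** 2) for c, h in zip(centers, widths)]
    return sum(w * p for w, p in zip(weights, psi)) / sum(psi) * x

def integrate_dmp(p0, goal, weights, tau, dt=0.001,
                  alpha_z=48.0, beta_z=12.0, alpha_x=2.0):
    """Explicit Euler integration of one position dimension over tau seconds."""
    centers, widths = make_basis(len(weights), alpha_x)
    p, z, x = p0, 0.0, 1.0  # start at rest with phase x = 1
    path = [p]
    for _ in range(int(round(tau / dt))):
        f = forcing_term(x, weights, centers, widths)
        dz = (alpha_z * (beta_z * (goal - p) - z) + f) / tau
        dp = z / tau
        dx = -alpha_x * x / tau
        z += dz * dt
        p += dp * dt
        x += dx * dt
        path.append(p)
    return path

# With all weights set to zero the system is a critically damped
# spring that moves from p0 to the goal.
path = integrate_dmp(p0=0.0, goal=1.0, weights=[0.0] * 25, tau=1.0)
centers, widths = make_basis(25, 2.0)
```

Nonzero weights change the shape of the path between start and goal, which is what learning modifies; the orientation part follows the same pattern but needs the quaternion operations from [13].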
2.2 PoWER
The goal of the paper is to show that autonomous learning, in our case
reinforcement learning, can be faster in the latent space. We use the Policy Learning
by Weighting Exploration with the Returns (PoWER) [7] RL algorithm. It is
an Expectation-Maximization (EM) based RL algorithm that can be combined
with importance sampling to better exploit previous experience. It tries to
maximize the expected return of trials using a parametrized policy, such as the
aforementioned CDMPs. We use PoWER because it is robust with respect to
reward functions [7]. Furthermore, it uses only the terminal reward and no
intermediate rewards.
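The text does not spell out the parameter update, so the following Python sketch shows one commonly used simplified form of a PoWER-style update with a single terminal reward per roll-out: the new parameters are the old ones plus the reward-weighted mean of the parameter differences of the best roll-outs (importance sampling). The function names, the number of retained roll-outs `n_best` and the Gaussian exploration are assumptions for illustration, not the authors' implementation.

```python
import random

def explore(theta, sigma, rng):
    """Perturb the current parameters with Gaussian exploration noise."""
    return [t + rng.gauss(0.0, sigma) for t in theta]

def power_update(theta, rollouts, n_best=5):
    """Reward-weighted update over the n_best roll-outs seen so far.

    rollouts is a list of (parameters, terminal_reward) pairs,
    with non-negative rewards.
    """
    best = sorted(rollouts, key=lambda r: r[1], reverse=True)[:n_best]
    total = sum(reward for _, reward in best)
    if total <= 0.0:
        return list(theta)  # no informative reward yet, keep parameters
    return [
        t + sum((params[i] - t) * reward for params, reward in best) / total
        for i, t in enumerate(theta)
    ]

# Two recorded roll-outs with terminal rewards 1.0 and 3.0.
theta = [0.0, 0.0]
rollouts = [([1.0, 0.0], 1.0), ([0.0, 2.0], 3.0)]
theta_next = power_update(theta, rollouts)

# A new exploratory roll-out would be sampled around the updated parameters.
candidate = explore(theta_next, sigma=0.1, rng=random.Random(0))
```

Here `theta_next` is the reward-weighted average of the two roll-outs, so the roll-out with the higher terminal reward pulls the parameters more strongly.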
Fig. 1. Illustration of the autoencoder structure with five hidden layers. Because
the number of neurons per layer in the autoencoder we used is too high for an effective
illustration, each layer is drawn with three times fewer neurons. The depicted
number of neurons per layer therefore does not match the numbers we used (164 for
the input and output layers; 100, 50, 10, 50, 100 for the hidden layers).
3 Deep Autoencoder Network
An autoencoder is a neural network used for unsupervised encoding of data. This
neural network comprises two parts: an encoder and a decoder network.
The encoder network part encodes the data. Because the number of
neurons in the hidden layers is lower than in the input layer, the data is forced
through a bottleneck, or latent space. In this space the most relevant features
are extracted. Latent space is therefore often also called feature space. One of
the typical applications of autoencoders is dimensionality reduction [2].
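The decoder network part reconstructs the feature representation so that the
output data $\boldsymbol{\theta}'$ matches the input data $\boldsymbol{\theta}$. Training of the parameters of the
autoencoder ($\boldsymbol{\theta}^\star$) is described by
$$\boldsymbol{\theta}^\star = \arg\min_{\boldsymbol{\theta}^\star} \frac{1}{n} \sum_{i=1}^{n} L(\boldsymbol{\theta}^{(i)}, \boldsymbol{\theta}'^{(i)}) \tag{9}$$
$$= \arg\min_{\boldsymbol{\theta}^\star} \frac{1}{n} \sum_{i=1}^{n} L(\boldsymbol{\theta}^{(i)}, h(g(\boldsymbol{\theta}^{(i)}))), \tag{10}$$
where $L$ is the Euclidean distance between the input and output vectors and $n$ is the
number of samples. Figure 1 shows an illustration of an arbitrary autoencoder.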
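The training objective can be illustrated with a short Python sketch that evaluates the reconstruction loss for a given encoder and decoder. The toy `encode` and `decode` functions below are hypothetical stand-ins (a projection onto the first two coordinates and zero padding), not a trained network, and averaging over the samples is an assumption that does not change the minimizer.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def reconstruction_loss(samples, encode, decode):
    """Mean distance between each sample and its reconstruction h(g(sample))."""
    return sum(euclidean(s, decode(encode(s))) for s in samples) / len(samples)

# Toy stand-ins: a 3-D input is compressed to a 2-D latent vector.
def encode(theta):
    return theta[:2]

def decode(latent):
    return list(latent) + [0.0]

samples = [[1.0, 2.0, 0.0],   # lies in the latent plane, reconstructed exactly
           [0.0, 1.0, 3.0]]   # third coordinate is lost in the bottleneck
loss = reconstruction_loss(samples, encode, decode)
```

The first sample is reconstructed without error and the second with a distance of 3, so `loss` evaluates to 1.5. Training searches for network parameters that make this quantity as small as possible over the whole database.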
4 Experimental Evaluation
We experimentally tested how many database entries for training the autoencoder
network are needed to faithfully represent the input data, with the goal
to show faster learning in the latent space.
The learning experiments were performed on a simulated humanoid robot
Talos, using seven degrees of freedom of the robot’s left arm for the throwing. A passive
elastic spoon was used to extend the range of the throw of the ball. An image
sequence of a successful throw with the simulated Talos is shown in Fig. 2.
Fig. 2. Still images of the simulated throwing experiment using the Talos humanoid
robot in Gazebo dynamical simulation. A sequence of a successful throw is shown.
We first implemented learning in the space of the CDMP parameters. Starting
from an initial task-space trajectory demonstration, RL using PoWER
modifies the policy so that the throw results in the ball landing in the basket.
The distance and the angle of the throw have to be learned, while the learning
takes place in all 6 DOF of the CDMP trajectory encoding. In the CDMP
parameter space we learn the parameters $\boldsymbol{\theta}$ for the next, $(n+1)$-th experiment, $\boldsymbol{\theta}_{n+1}$, with
$$\boldsymbol{\theta}_{(\cdot)} = [\mathbf{w}_{(\cdot)}^T, y_{0,(\cdot)}, g_{(\cdot)}]^T, \quad (\cdot) = x, y, z, u_1, u_2, u_3.$$
We used $N = 25$ basis functions per dimension. After taking the start and goal
poses into consideration, the size of the learning space sums up to 164 parameters.
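The parameter count can be checked with a few lines of Python. The breakdown below is an assumption that reproduces the stated total: 25 weights for each of the 6 encoded dimensions, plus a start pose and a goal pose of 7 values each (3 for position and 4 for the unit quaternion). The constant names are hypothetical.

```python
N_BASIS = 25                                    # basis functions per dimension
WEIGHT_DIMS = ("x", "y", "z", "u1", "u2", "u3")  # 6 DOF of the CDMP encoding
POSE_SIZE = 3 + 4                               # assumption: position + unit quaternion

n_weights = N_BASIS * len(WEIGHT_DIMS)          # 150 weights
n_params = n_weights + 2 * POSE_SIZE            # plus start pose and goal pose
```

Under these assumptions `n_params` equals 164, which matches the size of the learning space given above.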
To create the database for autoencoder training, we recorded all the roll-outs
of the learning and the corresponding landing spots of the ball.
We used (11) as the input data for the autoencoder network. This sets the
input and output layer sizes to 164. The autoencoder comprised 5 hidden
layers with 100, 50, $L$, 50, and 100 neurons. We varied the latent layer size $L$ between
2 and 10 for different experiments. The activation function for each hidden layer
is $y = \tanh(\mathbf{W} \boldsymbol{\theta}_{\#} + \mathbf{b})$, where $\boldsymbol{\theta}_{\#}$ is the input into a neural network layer
and $\boldsymbol{\theta}^\star = \{\mathbf{W}, \mathbf{b}\}$ are the autoencoder parameters. Note that the input
is different for each layer, because it is the output of the previous layer. The
activation function of the output layer was linear. After the training we split the
autoencoder into the encoder and the decoder parts. The encoder maps the input into
the latent space, $\boldsymbol{\theta}^l = g(\boldsymbol{\theta})$, and the decoder maps from the latent space to the output,
$\boldsymbol{\theta}' = h(\boldsymbol{\theta}^l)$, i.e., again into CDMP parameters that describe the robot Cartesian
trajectory.
In latent space we also use PoWER [7] to learn $\boldsymbol{\theta}^l_{n+1}$. However, the learning
space in this case is only $L$-dimensional. The values of the parameters in latent space
$\boldsymbol{\theta}^l$ define the DMP parameters, and therefore the shape of the trajectory on the
robot, through the decoder network
$$\boldsymbol{\theta}_{n+1} = h(\boldsymbol{\theta}^l_{n+1}).$$
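The split into encoder and decoder can be sketched in plain Python as follows. The layer sizes, the tanh hidden activations and the linear output layer follow the description above; the random, untrained weights, the choice $L = 4$ and all function names are assumptions for illustration only, so the numbers produced carry no meaning beyond their shapes.

```python
import math
import random

LATENT_SIZE = 4  # L, one of the values between 2 and 10
LAYER_SIZES = [164, 100, 50, LATENT_SIZE, 50, 100, 164]

def init_layers(sizes, rng):
    """Random (untrained) weight matrices and zero biases for each layer."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def forward(layers, x, linear_last):
    """Apply the layers in order; tanh everywhere except an optional linear last layer."""
    for idx, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if not (linear_last and idx == len(layers) - 1):
            x = [math.tanh(v) for v in x]
    return x

layers = init_layers(LAYER_SIZES, random.Random(0))
encoder_layers, decoder_layers = layers[:3], layers[3:]

def g(theta):
    """Encoder: 164 CDMP parameters to L latent values."""
    return forward(encoder_layers, theta, linear_last=False)

def h(latent):
    """Decoder: L latent values back to 164 CDMP parameters."""
    return forward(decoder_layers, latent, linear_last=True)

theta = [0.1] * 164          # placeholder CDMP parameter vector
latent = g(theta)            # RL in latent space explores only these L values
theta_decoded = h(latent)    # parameters that would be executed on the robot
```

In latent-space learning only the `LATENT_SIZE` entries of `latent` are perturbed and updated, while `h` maps every candidate back to a full CDMP parameter vector before a throw is executed.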
5 Results
From the graphs of the mean square error for trajectory positions and quaternion
orientations (Fig. 3) we can see the expected trend: a larger database and a
larger latent space reduce the error. A similar trend was reported in [3]. In our
case, the trend settles for databases larger than 200 examples and latent space
sizes larger than 4. These are the most efficient training parameters in our case.
Certain improvements can still be seen for much larger latent space and database
sizes (e.g. 10 dimensions and 800 examples), but this leads to much higher costs
of generating the input data and would slow down the learning process.
Fig. 3. Mean square error of the trajectory position and quaternion orientation
with respect to database size and latent space size.
As a proof of concept we tested our RL algorithm on three different reward
systems (exact, unsigned, signed [10]) and two different search spaces (task space
and latent space). Learning was done 25 times for all the cases. For the case of
latent space RL we have chosen the neural network with the latent space size 4
that was trained on 200 shots.
Figure 4 shows the error convergence through the iterations of RL. Error is
given as the absolute distance in meters between the basket and the landing spot
of the ball. Graphs show that all the reward systems in both task and latent
space converged successfully. The reduced reward systems (unsigned
and signed) were able to converge to the target almost as fast as the exact
reward system. Learning the parameters in latent space outperformed learning
in task space in all the cases, no matter the reward.
Fig. 4. Average error of throwing through the iterations of learning. RL in configuration
space is shown in the left plot, and in latent space in the right plot. In both plots, the
exact reward is shown with the red line, the unsigned reward with the green line and
the signed with the blue line. Shaded areas show the corresponding distributions for
all the reward systems.
The average convergence rates for different reward systems and cases are
shown in Fig. 5. The top left graph shows the average iteration of the first successful
shot for RL in task space, and the bottom left graph shows the same for RL in
latent space. On the right side, the maximal number of iterations needed for the
successful accomplishment of the task is shown for both task space (top graph)
and latent space (bottom graph).
Fig. 5. Average number of throws until the first hit (left) and maximal number of
throws until the first hit (right) for different reward systems.
6 Conclusion
Apart from confirming the results shown in [10], that simplified reward systems
can work as well as the exact reward if only the terminal reward is available,
we have also shown that the search space size for RL can be successfully
reduced using neural networks, even for complex systems such as a high
degree-of-freedom robot arm. All the experiments were conducted on a simulated
robot, which behaves slightly differently than the real system would but makes it
faster to generate the required training database. In the future we plan to test
this approach on the real robot as well.
References
1. Bitzer, S., Vijayakumar, S.: Latent spaces for dynamic movement primitives. In: 2009 9th IEEE-RAS International Conference on Humanoid Robots. pp. 574–581 (Dec 2009)
2. Chen, N., Bayer, J., Urban, S., van der Smagt, P.: Efficient movement representation by embedding dynamic movement primitives in deep autoencoders. In: 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids). pp. 434–440 (Nov 2015)
3. Chen, N., Karl, M., Smagt, P.V.D.: Dynamic Movement Primitives in Latent Space of Time-Dependent Variational Autoencoders. 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids) 2(3), 629–636 (2016)
4. Deisenroth, M.P., Neumann, G., Peters, J.: A survey on policy search for robotics pp. 388–403 (2013)
5. Gams, A., Petrič, T., Do, M., Nemec, B., Morimoto, J., Asfour, T., Ude, A.: Adaptation and coaching of periodic motion primitives through physical and visual interaction. Robotics and Autonomous Systems 75, 340–351 (2016)
6. Ijspeert, A., Nakanishi, J., Pastor, P., Hoffmann, H., Schaal, S.: Dynamical movement primitives: Learning attractor models for motor behaviors. Neural Computation 25(2), 328–373 (2013)
7. Kober, J., Peters, J.: Policy search for motor primitives in robotics (1-2), 171–203
8. Kober, J., Bagnell, J.A., Peters, J.: Reinforcement learning in robotics : (2013)
9. Nemec, B., Vuga, R., Ude, A.: Efficient sensorimotor learning from multiple demonstrations. Advanced Robotics 27(13), 1023–1031 (2013)
10. Pahič, R., Lončarević, Z., Ude, A., Nemec, B., Gams, A.: User feedback in latent space robotic skill learning. In: 2018 IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids). pp. 270–276 (Nov 2018)
11. Schaal, S.: Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3(6), 233–242 (1999)
12. Ude, A., Gams, A., Asfour, T., Morimoto, J.: Task-specific generalization of discrete and periodic dynamic movement primitives. IEEE Transactions on Robotics 26(5), 800–815 (Oct 2010)
13. Ude, A., Nemec, B., Petrič, T., Morimoto, J.: Orientation in cartesian space dynamic movement primitives. In: 2014 IEEE International Conference on Robotics and Automation (ICRA). pp. 2997–3004 (May 2014)
14. Vijayakumar, S., D’souza, A., Schaal, S.: Incremental online learning in high dimensions. Neural Comput. 17(12), 2602–2634 (Dec 2005)