
Stochastic Action Prediction for Imitation Learning


Abstract

Imitation learning is a data-driven approach to acquiring skills that relies on expert demonstrations to learn a policy that maps observations to actions. When performing demonstrations, experts are not always consistent and might accomplish the same task in slightly different ways. In this paper, we demonstrate inherent stochasticity in demonstrations collected for tasks including line following with a remote-controlled car and manipulation tasks including reaching, pushing, and picking and placing an object. We model stochasticity in the data distribution using autoregressive action generation, generative adversarial nets, and variational prediction and compare the performance of these approaches. We find that accounting for stochasticity in the expert data leads to substantial improvement in the success rate of task completion.
Sagar Gubbi Venkatesh1,2, Nihesh Rathod1,2, Shishir Kolathaya2 and Bharadwaj Amrutur1,2
Traditionally, robot controllers have been designed by
modeling the environment and the sensors and actuators in
the robot and then analytically arriving at the controller[1].
However, this approach becomes difficult as the complexity
of the environment and the robot increases. For example, it is
difficult to hand-craft a controller that uses visual feedback
for complex manipulation tasks[2].
An alternative way to build robot controllers is to use
learning-based methods. Imitation learning is one such data-driven
method that has been successfully applied[3] to a
wide range of problems ranging from self-driving cars[4]
and drone navigation[5] to complex manipulation tasks with
a robotic arm[2][6][7][8]. The basic principle in imitation
learning is to copy the behavior of an expert performing the
task. As the expert performs the task by teleoperating the
robot, the observations seen by the expert along with the
corresponding actions taken are recorded. A neural network
is trained on the expert data collected to clone the policy of
the expert.
For imitation learning to be successful, good quality expert
data is essential. However, even with abundant carefully
recorded demonstrations, the expert performing the demon-
strations might act differently given the same observations. In
these scenarios, the behaviour of the expert is truly random
or at least cannot be captured in the data. This is aleatoric
uncertainty as opposed to epistemic uncertainty which can be
explained away given enough data. The “true” randomness[9]
in the labels has to be modeled rather than ignored as noise in
the dataset.
*This work was supported by Yaskawa India and Robert Bosch Center
for Cyber Physical Systems.
1Department of Electrical and Communication Engineering, Indian Institute
of Science, Bangalore 560012, India
2Robert Bosch Center for Cyber Physical Systems, Indian Institute of
Science, Bangalore 560012, India
Fig. 1: An illustration of randomness in the expert data. The
peg may be moved downward or to the right to avoid the
obstacle and reach the target at the right-bottom.
An illustration of randomness in the expert data is shown
in Fig 1. The task is to move the peg towards the object at
the right-bottom corner while avoiding the obstacle. There
are multiple valid ways an expert might behave when faced
with the obstacle, and the training dataset might include
multiple ways to get around the obstacle. The peg may
be moved horizontally towards the right, or the peg may
be moved vertically downward when facing the obstacle.
If the action space is two dimensional with one of the
dimensions for moving horizontally and the other for moving
vertically, then the two valid actions are (RIGHT, HOLD)
and (HOLD, DOWN). Either of those actions is valid, but
sampling the two action dimensions independently could lead
to (RIGHT, DOWN), which would cause a collision, even
though this combination is not present in the training dataset.
We set out to answer two questions in this paper:
• Is stochasticity prevalent in datasets for imitation learning
tasks such as navigation and manipulation?
• If stochasticity is an issue, what approach to modeling
it works better for imitation learning?
To answer the first question, we built (a) an RC car that can
be controlled with a game controller, and (b) a teleoperation
system which allows the Yaskawa MotoMini and the Dobot
Magician to be controlled by a game controller. We collected
demonstrations for line following with the RC car and for
manipulation tasks including reaching, pushing, and picking
and placing an object with a robot arm.
To address the second question, we compare three
prevalent ways to model stochasticity: (a) autoregressive
conditional action generation[10][11], (b) generative adversarial
networks (GANs)[12], and (c) variational action
prediction[13]. We quantify the performance of each method
for the navigation and manipulation tasks and discuss the
merits and demerits of each method as applicable to imitation
learning.
arXiv:2101.01055v1 [cs.LG] 26 Dec 2020
The rest of this paper is organized as follows. In the
following section, papers related to this work are discussed.
In Section 3, the different potential solutions to modeling
stochasticity are discussed. Section 4 details experimental
results, and Section 5 concludes the paper.
Deep neural networks have been used to clone expert
behaviour where the expert behavior is captured by teleop-
eration of a robotic arm[2]. In [14], the authors note that
tasks can be accomplished in multiple ways. As a result,
the expert data comprising (state, action) pairs could be
multi-modal. They find that modeling the multi-modal nature
of the expert data with a Gaussian mixture model using a
mixture density network substantially improves performance
compared to minimizing the mean squared error. We explore
a closely related problem of modeling stochasticity when the
action space is multi-dimensional.
When generating images, the high dimensionality of the
output space makes it intractable to directly model the joint
probability of the output. The PixelCNN method described
in [10] models the joint distribution over the pixels of the
output image as a product of conditional distributions where
the distribution at each pixel is dependent on the values
sampled for the previous pixels in raster scan order. Although
[10] applies this to images, the same method can be applied
wherever the output space is multi-dimensional. In this paper,
we apply the PixelCNN approach to policy networks for
predicting the action to be taken.
Generative adversarial networks (GANs) have been used
to generate realistic images[12]. In the conditional version of
GAN[15], both the generator and discriminator are provided
with the data on which we wish to condition. We use
conditional GANs as policy networks: both the generator and
the discriminator are conditioned on the observation, and
the generator produces actions. One of the major problems
with GANs is mode collapse, where the generator reaches a
local minimum in which it produces only one or a few modes
of the desired distribution yet still confuses the
discriminator. Variants of GANs such as the Wasserstein
GAN[16] have been proposed to overcome mode collapse.
While GANs have most popularly been used for generating
images, they can be used as policy networks to generate
actions conditioned on observations.
Yet another way to address the problem of stochasticity
is with variational prediction. In predicting future frames
of a video given the current and past frames, variational
video prediction[17] has been used to address the problem
of low quality predictions with significant blurring when
deterministic neural networks are used. Variational video
prediction has also been leveraged for constructing world
models that can then be used for sample efficient model-based
reinforcement learning[18]. Variational predictors use
the “reparameterization trick”[13] when training to refactor
stochastic nodes in the network into a differentiable function
of its parameters and a random variable from a fixed
distribution. In [19] and [20], concrete random variables
or continuous relaxations of discrete random variables are
introduced to reparameterize categorical latent variables.
These are used as random latent variables in variational
inference[18]. Similar to GANs, variational prediction can
be used for policy networks to predict actions given the
observation.
Fig. 2: Independently sampling actions.
Fig. 3: The neural network layers for the pick-and-place task
where actions are sampled independently.
Suppose that the action space is N-dimensional. Then the
joint distribution of the action $a$ given the observation $o$ is
$p(a|o) = p(a_1, a_2, a_3, ..., a_N|o)$. Policy networks commonly
make the assumption that the actions are independent when
conditioned on the observation[2][6][14]. Under this assumption,
$$p(a|o) = \prod_{i=1}^{N} p(a_i|o) \tag{1}$$
With this assumption, each of the action probabilities may
be modeled separately. Usually, policy networks construct a
common intermediate representation and derive the individual
action probabilities $p(a_i|o)$, for $1 \le i \le N$, from the
intermediate representation (Fig. 2).
Fig. 4: Autoregressive action prediction.
However, the independence assumption does not always
hold (Fig. 1). One way to address the problem is to not make
the independence assumption and to directly model the joint
probability. However, the set of possible outcomes in the joint
action space increases exponentially with the dimensionality
of the action space. For example, if each of the actions is
categorical with 5 possible values, then the joint action space
will have 5^4 = 625 possible values for a 4-dimensional action space.
Because of this, it may be infeasible to directly model the
joint action space. In the next section, we consider alternative
ways of modeling the joint action space without making the
strong independence assumption.
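To make the failure of the independence assumption concrete, the following minimal Python sketch revisits the Fig. 1 example. The toy dataset, the helper functions, and all names are illustrative assumptions rather than the authors' code. It estimates the per-dimension marginals from two demonstrated action pairs and shows that their product assigns probability 0.25 to (RIGHT, DOWN), and half of the total probability to pairs the expert never demonstrated.

```python
from collections import Counter
from itertools import product

# Hypothetical expert demonstrations for a single observation (Fig. 1 example):
# the expert either moves right or moves down, never both at once.
expert_actions = [("RIGHT", "HOLD"), ("HOLD", "DOWN")] * 50


def marginal(actions, dim):
    """Empirical marginal distribution of one action dimension."""
    counts = Counter(a[dim] for a in actions)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def independent_joint(actions):
    """Joint distribution implied by the independence assumption (product of marginals)."""
    p_h = marginal(actions, 0)
    p_v = marginal(actions, 1)
    return {(h, v): p_h[h] * p_v[v] for h, v in product(p_h, p_v)}


joint = independent_joint(expert_actions)
seen = set(expert_actions)
# Probability mass placed on action pairs that never appear in the demonstrations.
invalid_mass = sum(p for pair, p in joint.items() if pair not in seen)

print(joint[("RIGHT", "DOWN")])  # 0.25
print(invalid_mass)              # 0.5
# Size of the full joint space for 4 categorical dimensions with 5 values each.
print(5 ** 4)                    # 625
```

The last line reproduces the count quoted above: a table over the full joint space grows as 5^N, which is why directly modeling it quickly becomes impractical.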
We consider three different ways by which stochastic
actions can be generated by a policy network.
A. Autoregressive action generation
In autoregressive action generation, the joint distribution
of the action $a$ given the observation $o$ is modeled as
$$p(a|o) = \prod_{i=1}^{N} p(a_i|a_1, a_2, ..., a_{i-1}, o) \tag{4}$$
The ordering of the action space may be arbitrarily chosen.
With the order fixed, the distribution of each action depends
on the actions before it and not on any other actions (Fig. 4).
During training, the part of the policy network that outputs
$a_i$ is fed the expert actions $\hat{a}_1, \hat{a}_2, ..., \hat{a}_{i-1}$ from the
dataset corresponding to the same observation. The actions
$a_1, a_2, ..., a_{i-1}$ are not sampled from the policy network
during training. This is also sometimes referred to as “teacher
forcing”. During inference, the first action $a_1$ is sampled
conditioned only on the observation and then fed to the part
of the policy network that generates $a_2$, and so on. For the
example in Fig. 1, the network learns that if RIGHT is sampled
as the first action, HOLD should be predicted with a high
probability for the second action.
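The factorization and the sampling procedure described above can be sketched in a few lines of Python. This is a toy, count-based stand-in for the policy network under stated assumptions: a single fixed observation, a two-dimensional action, and hypothetical helper names (`fit_autoregressive`, `sample_action`) that do not come from the paper. The conditional for the second action is fit on the expert's own first action, which mirrors teacher forcing.

```python
import random
from collections import Counter, defaultdict

# Hypothetical expert demonstrations for one observation (Fig. 1 example).
expert_actions = [("RIGHT", "HOLD"), ("HOLD", "DOWN")] * 50


def fit_autoregressive(actions):
    """Count-based estimates of p(a1 | o) and p(a2 | a1, o) for a fixed observation.

    The conditional for a2 is fit on the expert's own a1 (teacher forcing),
    not on a1 values sampled from the model.
    """
    first = Counter(a1 for a1, _ in actions)
    second = defaultdict(Counter)
    for a1, a2 in actions:
        second[a1][a2] += 1
    return first, second


def sample_from(counter, rng):
    """Draw one value with probability proportional to its count."""
    values = list(counter)
    weights = [counter[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def sample_action(first, second, rng):
    """Sample a1 first, then a2 conditioned on the sampled a1."""
    a1 = sample_from(first, rng)
    a2 = sample_from(second[a1], rng)
    return a1, a2


rng = random.Random(0)
first, second = fit_autoregressive(expert_actions)
samples = [sample_action(first, second, rng) for _ in range(1000)]

# Every sampled pair is one the expert actually demonstrated.
print(set(samples) <= set(expert_actions))  # True
```

Because the second action is drawn only from values that co-occurred with the sampled first action, the colliding pair (RIGHT, DOWN) cannot be produced, at the cost of sampling the dimensions one after another.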
One of the benefits of autoregressive action generation
is that the training is stable. Unlike GANs or variational
predictors, the training process has fewer problems such as
mode collapse or difficulty in converging. However, during
inference, since earlier actions are necessary to predict the
subsequent actions, the inference is slow and inference time
grows with the dimensionality of the action space. For action
generation, unlike image generation, the dimensionality of
the action space for most robots is often not too large and
thus the inference time is manageable.
Fig. 5: Generative adversarial network for action prediction.
B. Generative adversarial networks
In adversarial nets, a generator model is pitted against a
discriminator (Fig. 5) that learns to predict whether a sample
is from the distribution of the generator model or the data
distribution[12]. The generator model $G$, parameterized by
$\theta_g$, maps a prior noise distribution $p_z(z)$ to the generator
distribution $p_g$ as $G(z; \theta_g)$. The discriminator model $D$,
parameterized by $\theta_d$, predicts whether a sample $x$ is drawn
from the data distribution or the generator distribution. While
the generator is trained to confound the discriminator, the discriminator
is simultaneously trained to distinguish between
the generator distribution and the data distribution.
$$V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log(D(x))] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{5}$$
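As an illustration of the value function in Eq. 5, the NumPy sketch below estimates it by Monte Carlo from discriminator outputs. The function name `gan_value`, the clipping constant, and the example numbers are assumptions made for this sketch; it is not the training code used in the experiments and it omits the networks and their gradients.

```python
import numpy as np


def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(G, D) from discriminator outputs in (0, 1).

    d_real: D(x) for samples x drawn from the data distribution.
    d_fake: D(G(z)) for noise samples z drawn from the prior p_z.
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))


# A discriminator that cannot tell real from generated samples outputs 0.5 everywhere.
confused = gan_value(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])
# A discriminator that separates the two well is close to 1 on real and 0 on fake.
confident = gan_value(d_real=[0.99, 0.98, 0.97], d_fake=[0.01, 0.02, 0.03])

print(np.isclose(confused, 2 * np.log(0.5)))  # True
print(confident > confused)                   # True
```

The discriminator is trained to increase this quantity while the generator is trained to decrease it. A fully confused discriminator gives 2 log 0.5 = -log 4.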
In the conditional variant of GANs, both the generator
and discriminator are fed the data on which we wish to
condition the GAN[15]. While GANs have been widely used
for natural image generation, they can also be used for policy
networks in imitation learning by conditioning the GAN on
the observation $o$. This will induce the generator to learn
the expert data distribution conditioned on the observation
$o$. For the example in Fig. 1, the generator will learn not to
emit (RIGHT, DOWN), since the discriminator can easily
predict that it is not in the training data.
A major problem with GANs is mode collapse, where the
generator exploits a local minimum in the discriminator and
learns to produce only a part of the data distribution, which
is enough to fool the discriminator[16]. For policy networks
generating actions, this problem is exacerbated because it
can be difficult to recognize whether mode collapse has
occurred whereas with natural images, visual inspection can
be revealing. Furthermore, the consequence of mode collapse
is also difficult to predict. As we show in experimental
results, in some cases mode collapse does not harm the
performance of the robot in performing the task, whereas
for other tasks, it can result in catastrophic failure.
Unlike the autoregressive generator, the generator in
a GAN is efficient to sample from on modern GPUs as there
is no need to serially sample one action after another.
C. Variational action prediction
Action prediction can be thought of as being stochastic
due to latent events that are not present in the observation.
In variational action prediction, a latent variable $z$ that is
drawn from the prior $z \sim p(z)$ is introduced to build a
model $p_\theta(a|o, z)$ parameterized by $\theta$ (see Fig. 6). The true
posterior $p(z|a)$ is approximated with a network $q_\phi(z|a)$.
If this network predicts the parameters of a conditional
Gaussian distribution $N(\mu_\phi(o), \sigma_\phi(o))$, it can be trained
using the reparameterization trick[13]
$$z = \mu_\phi(o) + \sigma_\phi(o) \times \epsilon, \quad \epsilon \sim N(0, I) \tag{6}$$
Fig. 6: Variational action prediction.
Alternatively, if the prior $p(z)$ is a uniform distribution over
discrete categorical random variables, the network $q_\phi(z|a)$
that predicts the parameters of the distribution can be trained
using the Gumbel-softmax reparameterization trick[19][20].
We use this latter distribution in our experiments.
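For readers unfamiliar with the Gumbel-softmax relaxation[19][20], the NumPy sketch below shows only its forward sampling step. The function name, the temperature value, and the example probabilities are assumptions for illustration. NumPy computes no gradients, so in practice this would be written in an automatic-differentiation framework so that gradients can flow through the sample to the logits.

```python
import numpy as np


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw a relaxed one-hot sample from a categorical distribution.

    logits: unnormalized log-probabilities, shape (..., num_categories).
    temperature: positive scalar; smaller values give samples closer to one-hot.
    """
    # Gumbel(0, 1) noise via the inverse CDF: g = -log(-log(u)), u ~ Uniform(0, 1).
    u = rng.uniform(low=1e-12, high=1.0, size=np.shape(logits))
    gumbel = -np.log(-np.log(u))
    y = (np.asarray(logits, dtype=float) + gumbel) / temperature
    # Numerically stable softmax over the last axis.
    y = y - np.max(y, axis=-1, keepdims=True)
    expy = np.exp(y)
    return expy / np.sum(expy, axis=-1, keepdims=True)


rng = np.random.default_rng(0)
logits = np.log(np.array([0.7, 0.2, 0.1]))  # hypothetical posterior probabilities
sample = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)

print(sample.shape)                 # (3,)
print(np.isclose(sample.sum(), 1))  # True
```

The randomness enters only through the fixed Gumbel noise, so the sample is a deterministic, differentiable function of the logits, which is the property the reparameterization relies on.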
To learn the parameters $\theta$ and $\phi$, the variational lower
bound[13] is optimized. The loss function is
$$L(o, a) = -\mathbb{E}_{q_\phi(z|a)}[\log p_\theta(a|o, z)] + D_{KL}(q_\phi(z|a) \,\|\, p(z)) \tag{7}$$
The first term in the RHS of Eq. 7 is the action prediction
loss and the second term is the divergence of the variational
posterior from the prior on the latent variable.
During test time, the latent variable $z$ is sampled from
the assumed prior $p(z)$. Like GANs, action prediction with
variational prediction is fast since all the actions may be
simultaneously predicted. However, an additional hyper-
parameter that scales the KL divergence in the loss function
is introduced during training. In practice, the outcome of
training is sensitive to the weight term associated with the
KL divergence loss term[18].
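The two loss terms and the KL weight discussed above can be sketched as follows for a categorical latent with a uniform prior. This is a minimal NumPy illustration under stated assumptions: the names `kl_categorical_to_uniform`, `variational_loss`, and `kl_weight` are hypothetical, the expectation is replaced by a one-sample estimate, and the numbers are made up. With `kl_weight=1.0` the expression matches the form of Eq. 7.

```python
import numpy as np


def kl_categorical_to_uniform(q, eps=1e-12):
    """KL(q || p) for categorical q and a uniform prior p over K categories."""
    q = np.asarray(q, dtype=float)
    k = q.shape[-1]
    return np.sum(q * (np.log(q + eps) + np.log(k)), axis=-1)


def variational_loss(log_prob_action, q, kl_weight=1.0):
    """Negative log-likelihood of the expert action plus the weighted KL term.

    log_prob_action: log p_theta(a | o, z) for z drawn from q_phi (one-sample estimate).
    q: probabilities of the categorical posterior q_phi over the latent z.
    kl_weight: hypothetical name for the hyperparameter that scales the KL term.
    """
    return -log_prob_action + kl_weight * kl_categorical_to_uniform(q)


uniform_q = np.full(4, 0.25)
peaked_q = np.array([0.97, 0.01, 0.01, 0.01])

print(np.isclose(kl_categorical_to_uniform(uniform_q), 0.0))  # True
print(kl_categorical_to_uniform(peaked_q) > 0.0)              # True
print(variational_loss(np.log(0.9), peaked_q, kl_weight=0.1)
      < variational_loss(np.log(0.9), peaked_q, kl_weight=1.0))  # True
```

A small weight lets the posterior drift far from the prior used at test time, while a large weight pushes the posterior toward the prior so that the latent carries little information. This trade-off is one way to read the sensitivity noted above.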
We have two different experimental setups to investigate
stochasticity in expert demonstrations and its impact on
imitation learning. One is a toy remote-controlled (RC) car
and the other is with a robotic arm. In all cases, the network
is trained using the Adam optimizer[21].
A. Remote controlled Car
The RC car based on Texas Instruments Robot System
Learning Kit is shown in Fig. 7. It has two motors driving
the two wheels which are independently controlled through
pulse width modulation (PWM). An ESP8266 module is used
for remote control via WiFi. Commands to the motors are
sent via UDP at 5 Hz. The torque applied to the motors using
PWM can be controlled by expert demonstrators using the
analog triggers on an Xbox 360 game controller. The car
also has eight photoreceptors at the bottom along with IR
emitters (Fig. 7), which capture an 8-bit binary “image”
in which a bit is set to 1 if there is a dark line directly
below the corresponding photodiode. For each experiment,
10 expert demonstrations are collected for training, and the
trained network is evaluated 10 times. The details of the
network architecture are in the appendix.
Fig. 7: The TI remote controlled car
Fig. 8: Histogram of normalized PWM values in the training
data for the left and right wheel respectively.
TABLE I: Success rate in driving the RC car straight for 10 trials.
Method                    Success rate
Independent sampling      60%
Autoregressive sampling   90%
GAN                       70%
Variational prediction    90%
1) Driving straight: In the first experiment, the goal was
to simply drive straight. The reason why this is not trivial is
that the two motors are not identical, and they do not provide
any feedback of the torque or velocity. As a result, the
demonstrator has to observe the path being taken by the car
and adjust the commanded PWM using the analog triggers
on the game controllers so that the car moves forward on a
straight line. Figure 8 shows the distribution of the left and
right PWM values collected from expert demonstrations.
We define a trajectory as being successful if the car moves
forward by 8 feet while staying within 30 cm of the straight-
line trajectory. As a baseline, we use a Gaussian mixture
model (GMM) with 4 Gaussians. This is compared against
the other considered methods in Table I.
2) Line following: The second task with the RC car is
line following. When a photosensor is just above the black
line in Fig. 9, the sensor measures a “1”. So, when the
center of the car is above the line, the 8-bit observation is
typically “00011000”. Likewise, when the car has to turn left
because it is moving to the right of the line, the observation
might be “11000000”. In some demonstrations, the course
correction is swift and happens on even small deviations
when the observation might be “00110000”, but in other
demonstrations, it is delayed until the car reaches the edge
of the line (“11000000”). Similarly, some demonstrations
include turning at high speeds, whereas others are slow.
These are sources of stochasticity. If the velocities of the left
and right wheels are sampled independently, we can expect
frequent errors which consume additional time to correct.
The time taken to complete the course is detailed in
Table II. We see that modeling stochasticity results in faster
completion times. However, for both tasks, GANs performed
worse than autoregressive sampling and variational prediction.
We noticed that while the generator successfully
modeled the stochasticity in the dataset, the output of the
generator was biased and consistently gave out higher PWMs
for one of the wheels, which resulted in inferior performance.
Fig. 9: The RC car following a line.
TABLE II: Average time taken to complete the line following
task over 10 trials.
Method                    Duration (s)
Independent sampling      90.28
Autoregressive sampling   70.28
GAN                       85.1
Variational prediction    66.9
TABLE III: Success rate with the robotic arm for different tasks.
Method                    Reach   Push    Pick-and-place
Independent sampling      60%     26.6%   20%
Autoregressive sampling   100%    100%    100%
GAN                       100%    0%      0%
Variational prediction    100%    100%    100%
B. Robot arm
We use the Yaskawa MotoMini (Fig. 10) and the Dobot
Magician (Fig. 11) for manipulation tasks. The observations
are 150×150 pixel images from a camera above the robot
along with the current position of the end-effector and the
gripper open/close state. The action space is 4-dimensional
and includes the 3D direction of movement of the end-effector
at every time step and is discretized to {0, +1, −1}.
It also includes a binary gripper open/close command. The
camera is sampled and end-effector movement instructions
are issued to the robot at 5 frames per second. For each
task, the network is trained with 15 expert demonstrations
and evaluated over 15 trials. The performance of different
approaches is compared in Table III.
Fig. 10: The robot pushes the pen towards the edge of the
table. If the action space is independently sampled, the robot
can push the pen to a horizontal position (parallel to the
end-effector tool) from which it cannot recover.
1) Reach task: The reaching task is performed with the
Yaskawa MotoMini (Fig. 1). The goal is to reach the target
“X” while avoiding obstacles. We consider the task to be
successfully completed if the end-effector is brought within
2 cm of the target while not touching the obstacle. The
stochasticity arises because on encountering an obstacle,
experts behave differently; some move around the obstacle
and some above it. With GANs, we observed mode collapse
which caused the robot to always turn right when it encoun-
tered an obstacle. However, mode collapse does not impede
the completion of the task.
2) Pushing a pen: The push task is also performed using
the MotoMini. The task is to push a pen on a table. We
consider the task to be successfully completed if the pen
is moved by at least 30 cm across the table (Fig 10). We
noticed stochasticity in the expert demonstrations when the
end-effector is close to the tip of the pen. Some of the expert
demonstrations involve pushing the pen forward; in others,
the end-effector is moved towards the other end of the pen.
If the different dimensions of the action space are sampled
independently, the robot can push the pen to a horizontal
position (Fig 10). If that happens, the observations are now
outside the training distribution since this scenario never
occurs in the training data, and the robot does not recover.
While autoregressive action generation addresses this issue,
GANs perform poorly due to mode collapse which causes the
end effector to oscillate about the pen, and the robot does
not push the pen forward at all. While this particular set of
actions can confuse the discriminator, it does not capture all
the modes in the training data and leads to complete failure
in completing the task. We also tried training the Wasserstein
GAN[16] to address this problem, but mode collapse persisted
even with WGAN.
3) Pick and Place: The pick-and-place task was performed
with the Dobot Magician. We consider the task to be
successful if the cuboid is placed inside the hollow part at
the center of the target shaped “O” in Fig. 11. We observed
that sometimes the demonstrator overshoots the object (or
target) and corrects for this at the next timestep. So for
similar observations, we have two different actions. If the
action space is independently sampled, the robot attempts
to grip the object at a position where the object is not
present. Likewise, the object is sometimes dropped too far
from the target because the acts of moving and releasing the
gripper are independently sampled. Autoregressive sampling
addresses this problem effectively, but GANs perform poorly
due to mode collapse. As in the pushing task, the gripper
approaches the object and overshoots, but due to mode
collapse, the robot always overshoots, never picks up
the object, and oscillates around it, as shown in Fig. 11.
Fig. 11: Mode collapse in GAN causes the end effector to
oscillate around the object without picking it up.
In all our experiments, autoregressive action prediction
successfully accounts for stochasticity in the training data
and is the easiest to train since it does not introduce any
new hyperparameters. However, since actions are sequentially
sampled, the inference time is proportional to the
dimensionality of the action space. GANs do not require
such sequential sampling, but suffer from mode collapse.
Variational prediction also offers fast inference since it does
not require sequential sampling of the action space, but it
does introduce additional hyperparameters for training.
In this paper, we have described a teleoperation system
from which one can collect expert demonstrations for simple
tasks such as driving, pushing an object, and picking and
placing an object. We have shown that such demonstrations
exhibit stochasticity which can impede the performance of
imitation learning if ignored. With a multi-dimensional action
space, sampling the different actions independently results
in sub-optimal imitation learning performance. Autoregres-
sive sampling, GANs, and variational predictors are three
tractable ways to model stochasticity in the data. Autoregres-
sive sampling is easiest to train but is slow during inference
while mode collapse in GANs is a serious problem for
imitation learning.
For the line following task, instead of conv layers, a feature
vector is derived by passing the 8-bit input “image” through
FC128-ReLU-FC16. This is used to obtain the parameters of
a GMM (with 4 mixtures). For the push task, the conv layers used
are shown in Fig. 3, but the output FC layers are FC128-FC3.
We would like to thank Ashish Joglekar for his assistance
with the TI robot kit.
[1] T. Tang, H. Lin, Y. Zhao, W. Chen, and M. Tomizuka, “Autonomous
Alignment of Peg and Hole by Force/Torque Measurement for Robotic
Assembly,” 2016 IEEE International Conference on Automation Science
and Engineering (CASE’16).
[2] T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg,
and P. Abbeel, “Deep Imitation Learning for Complex Manipulation
Tasks from Virtual Reality Teleoperation,” 2018 IEEE International
Conference on Robotics and Automation (ICRA’18).
[3] S. Ross, G. Gordon, and D. Bagnell, “A Reduction of Imitation
Learning and Structured Prediction to No-Regret Online Learning,”
2011 International Conference on Artificial Intelligence and Statistics
[4] M. Bojarski, B. Firner, B. Flepp, L. Jackel, U. Muller, K. Zieba, and
D. Testa, “End-to-End Deep Learning for Self-Driving Cars,” arXiv
preprint arXiv:1604.07316, 2016.
[5] A. Giusti, J. Guzzi, D. C. Cireşan, F.-L. He, J. P. Rodriguez, F. Fontana,
M. Faessler, C. Forster, J. Schmidhuber, G. Di Caro, et al., “A machine
learning approach to visual perception of forest trails for mobile
robots,” IEEE Robotics and Automation Letters, vol. 1, no. 2, pp.
661–667, 2016.
[6] T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and
S. Levine, “One-Shot Imitation from Observing Humans via
Domain-Adaptive Meta-Learning,” 2017 Neural Information Processing
Systems (NIPS’17).
[7] T. Inoue, G. Magistris, A. Munawar, T. Yokoya, and R. Tachibana,
“Deep Reinforcement Learning for High Precision Assembly Tasks,”
2017 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS’17).
[8] S. Gubbi, B. Amrutur, “One-Shot Object Localization Using Learnt
Visual Cues via Siamese Networks,” 2019 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS’19).
[9] A. Kendall and Y. Gal, “What Uncertainties Do We Need in Bayesian
Deep Learning for Computer Vision?” 2017 Neural Information Pro-
cessing Systems (NIPS’17).
[10] A. Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves,
K. Kavukcuoglu, “Conditional Image Generation with PixelCNN
Decoders,” 2016 Neural Information Processing Systems (NIPS’16).
[11] A. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent
neural networks,” arXiv preprint arXiv:1601.06759, 2016.
[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, Y. Bengio, “Generative Adversarial Networks,”
2014 Neural Information Processing Systems (NIPS’14).
[13] D. Kingma, M. Welling, “Auto-Encoding Variational Bayes,” Neural
Information Processing Systems (NIPS 2014).
[14] R. Rahmatizadeh, P. Abolghasemi, A. Behal, L. Bölöni, “From
virtual demonstration to real-world manipulation using LSTM and
MDN,” 2018 Association for the Advancement of Artificial Intelligence
[15] M. Mirza, S. Osindero, “Conditional Generative Adversarial Nets,”
arXiv preprint arXiv:1411.1784, 2014.
[16] M. Arjovsky, S. Chintala, L. Bottou, “Wasserstein GAN,” arXiv
preprint arXiv:1701.07875, 2017.
[17] M. Babaeizadeh, C. Finn, D. Erhan, R.-H. Campbell, S. Levine,
“Stochastic Variational Video Prediction,” 2018 International Conference
on Learning Representations (ICLR’18).
[18] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R.-H. Campbell,
K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine,
A. Mohiuddin, R. Sepassi, G. Tucker, H. Michalewski, “Model-Based
Reinforcement Learning for Atari,” 2020 International Conference on
Learning Representations (ICLR’20).
[19] C.-J. Maddison, A. Mnih, Y.-W. Teh, “The Concrete Distribution: A
Continuous Relaxation of Discrete Random Variables,” 2017 Interna-
tional Conference on Learning Representations (ICLR’17).
[20] E. Jang, S. Gu, B. Poole, “Categorical Reparameterization with
Gumbel-Softmax,” 2017 International Conference on Learning Rep-
resentations (ICLR’17).
[21] D.-P. Kingma and J. Ba, “Adam: A method for stochastic optimiza-
tion,” arXiv preprint arXiv:1412.6980 (2014).
ResearchGate has not been able to resolve any citations for this publication.
Full-text available
Predicting the future in real-world settings, particularly from raw sensory observations such as images, is exceptionally challenging. Real-world events can be stochastic and unpredictable, and the high dimensionality and complexity of natural images requires the predictive model to build an intricate understanding of the natural world. Many existing methods tackle this problem by making simplifying assumptions about the environment. One common assumption is that the outcome is deterministic and there is only one plausible future. This can lead to low-quality predictions in real-world settings with stochastic dynamics. In this paper, we develop a stochastic variational video prediction (SV2P) method that predicts a different possible future for each sample of its latent variables. To the best of our knowledge, our model is the first to provide effective stochastic multi-frame prediction for real-world video. We demonstrate the capability of the proposed method in predicting detailed future frames of videos on multiple real-world datasets, both action-free and action-conditioned. We find that our proposed method produces substantially improved video predictions when compared to the same model without stochasticity, and to other stochastic video prediction methods. Our SV2P implementation will be open sourced upon publication.
Conference Paper
Full-text available
High precision assembly of mechanical parts requires accuracy exceeding the robot precision. Conventional part mating methods used in the current manufacturing requires tedious tuning of numerous parameters before deployment. We show how the robot can successfully perform a tight clearance peg-in-hole task through training a recurrent neural network with reinforcement learning. In addition to saving the manual effort, the proposed technique also shows robustness against position and angle errors for the peg-in-hole task. The neural network learns to take the optimal action by observing the robot sensors to estimate the system state. The advantages of our proposed method is validated experimentally on a 7-axis articulated robot arm.
There are two major types of uncertainty one can model. Aleatoric uncertainty captures noise inherent in the observations. On the other hand, epistemic uncertainty accounts for uncertainty in the model -- uncertainty which can be explained away given enough data. Traditionally it has been difficult to model epistemic uncertainty in computer vision, but with new Bayesian deep learning tools this is now possible. We study the benefits of modeling epistemic vs. aleatoric uncertainty in Bayesian deep learning models for vision tasks. For this we present a Bayesian deep learning framework combining input-dependent aleatoric uncertainty together with epistemic uncertainty. We study models under the framework with per-pixel semantic segmentation and depth regression tasks. Further, our explicit uncertainty formulation leads to new loss functions for these tasks, which can be interpreted as learned attenuation. This makes the loss more robust to noisy data, also giving new state-of-the-art results on segmentation and depth regression benchmarks.
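The "learned attenuation" reading of the loss can be illustrated with a small sketch. It assumes the commonly used heteroscedastic regression form, in which the model predicts a log-variance alongside each output; the abstract itself does not state the formula, and the function name `attenuated_l2` is hypothetical.

```python
import math


def attenuated_l2(target, prediction, log_variance):
    """Per-sample regression loss with a predicted log-variance (assumed form).

    The squared residual is scaled down by exp(-log_variance), so a sample the
    model flags as noisy contributes less; the + 0.5 * log_variance term
    penalises claiming high noise everywhere.
    """
    residual_sq = (target - prediction) ** 2
    return 0.5 * math.exp(-log_variance) * residual_sq + 0.5 * log_variance


# For a fixed residual r, the loss is minimised at log_variance = log(r^2).
residual = 2.0
best = attenuated_l2(residual, 0.0, math.log(residual ** 2))
too_confident = attenuated_l2(residual, 0.0, 0.0)
too_uncertain = attenuated_l2(residual, 0.0, math.log(16.0))
```

Under this assumed form, `best` is smaller than both `too_confident` and `too_uncertain`, which is the sense in which the variance term attenuates the effect of noisy targets.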
Robots assisting the disabled or elderly must perform complex manipulation tasks and must adapt to the home environment and preferences of their user. Learning from demonstration is a promising choice that would allow the non-technical user to teach the robot different tasks. However, collecting demonstrations in the home environment of a disabled user is time consuming, disruptive to the comfort of the user, and presents safety challenges. It would be desirable to perform the demonstrations in a virtual environment. In this paper we describe a solution to the challenging problem of behavior transfer from virtual demonstration to a physical robot. The virtual demonstrations are used to train a deep neural network based controller, which uses a Long Short Term Memory (LSTM) recurrent neural network to generate trajectories. The training process uses a Mixture Density Network (MDN) to calculate an error signal suitable for the multimodal nature of demonstrations. The controller learned in the virtual environment is transferred to a physical robot (a Rethink Robotics Baxter). An off-the-shelf vision component is used to substitute for geometric knowledge available in the simulation and an inverse kinematics module is used to allow the Baxter to enact the trajectory. Our experimental studies validate the three contributions of the paper: (1) the controller learned from virtual demonstrations can be used to successfully perform the manipulation tasks on a physical robot, (2) the LSTM+MDN architectural choice outperforms other choices, such as the use of feedforward networks and mean-squared error based training signals and (3) allowing imperfect demonstrations in the training set also allows the controller to learn how to correct its manipulation mistakes.
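Why a mixture-density error signal suits multimodal demonstrations can be seen in a minimal one-dimensional sketch. The function name `mdn_nll` and the example numbers are illustrative assumptions, not taken from the paper: the loss is the negative log-likelihood of the target under a Gaussian mixture whose weights, means, and standard deviations would come from the network.

```python
import math


def mdn_nll(target, weights, means, stds):
    """Negative log-likelihood of a scalar target under a 1-D Gaussian mixture.

    weights must be positive and sum to 1; stds must be positive.
    Uses log-sum-exp for numerical stability.
    """
    log_terms = []
    for w, m, s in zip(weights, means, stds):
        log_pdf = (-0.5 * math.log(2.0 * math.pi) - math.log(s)
                   - 0.5 * ((target - m) / s) ** 2)
        log_terms.append(math.log(w) + log_pdf)
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))


# Two equally likely expert behaviours: steer left (-1) or steer right (+1).
weights, means, stds = [0.5, 0.5], [-1.0, 1.0], [0.1, 0.1]
nll_at_mode = mdn_nll(1.0, weights, means, stds)
nll_at_average = mdn_nll(0.0, weights, means, stds)
```

In this toy setting the averaged action 0.0, which a mean-squared-error fit would favor, receives a far higher loss than either demonstrated mode.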
The reparameterization trick enables the optimization of large scale stochastic computation graphs via gradient descent. The essence of the trick is to refactor each stochastic node into a differentiable function of its parameters and a random variable with fixed distribution. After refactoring, the gradients of the loss propagated by the chain rule through the graph are low variance unbiased estimators of the gradients of the expected loss. While many continuous random variables have such reparameterizations, discrete random variables lack continuous reparameterizations due to the discontinuous nature of discrete states. In this work we introduce concrete random variables -- continuous relaxations of discrete random variables. The concrete distribution is a new family of distributions with closed form densities and a simple reparameterization. Whenever a discrete stochastic node of a computation graph can be refactored into a one-hot bit representation that is treated continuously, concrete stochastic nodes can be used with automatic differentiation to produce low-variance biased gradients of objectives (including objectives that depend on the log-likelihood of latent stochastic nodes) on the corresponding discrete graph. We demonstrate their effectiveness on density estimation and structured prediction tasks using neural networks.
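As a minimal sketch of the continuous case described above, a Gaussian node z ~ N(mu, sigma^2) can be rewritten as z = mu + sigma * eps with eps ~ N(0, 1). The toy objective f(z) = z^2 and the function names are illustrative assumptions; for this objective the exact gradient of E[f(z)] with respect to mu is 2 * mu, which the pathwise estimate should approach.

```python
import random


def reparam_sample(mu, sigma, eps):
    """Deterministic function of the parameters and fixed-distribution noise."""
    return mu + sigma * eps


def pathwise_grad_mu(mu, sigma, num_samples, rng):
    """Monte Carlo estimate of d/dmu E[z^2] via the chain rule.

    df/dz = 2z and dz/dmu = 1, so each sample contributes 2z.
    """
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = reparam_sample(mu, sigma, eps)
        total += 2.0 * z
    return total / num_samples


estimate = pathwise_grad_mu(1.5, 0.5, 20000, random.Random(0))
exact = 2.0 * 1.5
```

Discrete nodes have no such differentiable rewrite, which is the gap the concrete relaxation is meant to fill.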
Categorical variables are a natural choice for representing discrete structure in the world. However, stochastic neural networks rarely use categorical latent variables due to the inability to backpropagate through samples. In this work, we present an efficient gradient estimator that replaces the non-differentiable sample from a categorical distribution with a differentiable sample from a novel Gumbel-Softmax distribution. This distribution has the essential property that it can be smoothly annealed into a categorical distribution. We show that our Gumbel-Softmax estimator outperforms state-of-the-art gradient estimators on structured output prediction and unsupervised generative modeling tasks with categorical latent variables, and enables large speedups on semi-supervised classification.
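The sampling step can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `gumbel_softmax_sample` is hypothetical, and the sketch omits the gradient computation that an automatic differentiation framework would provide.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng=None):
    """Draw one relaxed categorical sample (a point on the simplex).

    Adds Gumbel(0, 1) noise to each logit, divides by the temperature,
    and applies a numerically stable softmax.
    """
    rng = rng or random
    noisy = []
    for logit in logits:
        # Inverse-transform sampling: g = -log(-log(u)), u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / temperature)
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
logits = [math.log(0.7), math.log(0.2), math.log(0.1)]
sample = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)
```

The index of the largest entry follows the categorical distribution given by the softmax of the logits, and lowering the temperature pushes each sample toward a one-hot vector.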
In past years, many methods have been developed for robotic peg-hole insertion to automate the assembly process. However, many of them are based on the assumption that the peg and hole are well aligned before insertion starts. In practice, if there is a large pose (position/orientation) misalignment, the peg and hole may suffer from a three-point contact condition where the traditional assembly methods cannot work. To deal with this problem, this paper proposes an autonomous alignment method based on force/torque measurement before the insertion phase. A three-point contact model is built and the pose misalignment between the peg and hole is estimated by force and geometric analysis. With the estimated values, the robot can autonomously correct the misalignment before applying traditional assembly methods to perform insertions. A series of experiments on a FANUC industrial robot and an H7h7 tolerance peg-hole testbed validate the effectiveness of the proposed method. Experimental results show that the robot is able to perform peg-hole insertion from three-point contact conditions with a 96% success rate.