# Learning Control in Robotics

Authors: Stefan Schaal and Christopher G. Atkeson

## Abstract and Figures

Recent trends in robot learning are to use trajectory-based optimal control techniques and reinforcement learning to scale to complex robotic systems. On the one hand, increased computational power and multiprocessing, and on the other hand, probabilistic reinforcement learning methods and function approximation, have contributed to a steadily increasing interest in robot learning. Imitation learning has helped significantly to start learning with reasonable initial behavior. However, many applications are still restricted to rather low-dimensional domains and toy applications. Future work will have to demonstrate the continual and autonomous learning abilities, which were alluded to in the introduction.
IEEE Robotics & Automation Magazine
1070-9932/10/$26.00 © 2010 IEEE. June 2010.

Trajectory-Based Optimal Control Techniques

In a not too distant future, robots will be a natural part of daily life in human society, providing assistance in many areas ranging from clinical applications, education, and caregiving to normal household environments [1]. It is hard to imagine that all possible tasks can be preprogrammed in such robots. Robots need to be able to learn, either by themselves or with the help of human supervision. Additionally, wear and tear on robots in daily use needs to be automatically compensated for, which requires a form of continuous self-calibration, another form of learning. Finally, robots need to react to stochastic and dynamic environments, i.e., they need to learn how to optimally adapt to uncertainty and unforeseen changes. Robot learning is going to be a key ingredient for the future of autonomous robots.

While robot learning covers a rather large field, from learning to perceive, to plan, to make decisions, etc., we will focus this review on topics of learning control, in particular, as it is concerned with learning control in simulated or actual physical robots. In general, learning control refers to the process of acquiring a control strategy for a particular control system and a particular task by trial and error. Learning control is usually distinguished from adaptive control [2] in that the learning system can have rather general optimization objectives (not just, e.g., minimal tracking error) and is permitted to fail during the process of learning, while adaptive control emphasizes fast convergence without failure. Thus, learning control resembles the way that humans and animals acquire new movement strategies, while adaptive control is a special case of learning control that fulfills stringent performance constraints, e.g., as needed in life-critical systems like airplanes. Learning control has been an active topic of research for at least three decades.
However, given the lack of working robots that actually use learning components, more work needs to be done before robot learning will make it beyond the laboratory environment. This article will survey some ongoing and past activities in robot learning to assess where the field stands and where it is going. We will largely focus on nonwheeled robots and less on topics of state estimation, as typically explored in wheeled robots [3]–[6], and we emphasize learning in continuous state-action spaces rather than discrete state-action spaces [7], [8]. We will illustrate the different topics of robot learning with examples from our own research with anthropomorphic and humanoid robots.

By Stefan Schaal and Christopher G. Atkeson. Digital Object Identifier 10.1109/MRA.2010.936957

## The Basics of Learning Control

A key question in learning control is what it is that should be learned. To address this issue, it is helpful to begin with one of the most general frameworks of learning control, as originally developed in the middle of the 20th century in the fields of optimization theory, optimal control, and in particular, dynamic programming [9], [10]. Here, the goal of learning control was formalized as the need to acquire a task-dependent control policy $\pi$ that maps a continuous-valued state vector $x$ of a controlled system and its environment, possibly in a time $t$ dependent way, to a continuous-valued control vector $u$:

$$u = \pi(x, t, \theta). \tag{1}$$

The parameter vector $\theta$ contains the problem-specific parameters in the policy $\pi$ that need to be adjusted by the learning system. The controlled system can generally be expressed as a nonlinear dynamics function

$$\dot{x} = f(x, u, t, \varepsilon_x) \tag{2}$$

with observation equations

$$y = h(x, u, t, \varepsilon_y) \tag{3}$$

that describe how the observations $y$ of the system are derived from the full-state vector $x$; the terms $\varepsilon_x$ and $\varepsilon_y$ denote noise terms.
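
To make the roles of (1)-(3) concrete, the following is a minimal sketch of a closed-loop rollout. Everything specific in it is an assumption chosen for illustration and is not taken from the article: the plant is a unit point mass, the policy is a linear state feedback with two parameters, only the position is observed, and the names `policy`, `dynamics`, `observe`, and `rollout` are hypothetical.

```python
import random

def policy(x, t, theta):
    """u = pi(x, t, theta): linear state feedback (an illustrative choice)."""
    return -(theta[0] * x[0] + theta[1] * x[1])

def dynamics(x, u, t, noise=0.0):
    """x_dot = f(x, u, t, eps_x) for a unit point mass; x = (position, velocity)."""
    return (x[1], u + noise)

def observe(x, u, t, noise=0.0):
    """y = h(x, u, t, eps_y): only the position is measured."""
    return x[0] + noise

def rollout(theta, x0=(1.0, 0.0), dt=0.01, steps=1000, seed=0):
    """Integrate the closed loop with Euler steps and return the observations."""
    rng = random.Random(seed)
    x, t = x0, 0.0
    ys = []
    for _ in range(steps):
        u = policy(x, t, theta)
        dx = dynamics(x, u, t, noise=rng.gauss(0.0, 0.01))
        x = (x[0] + dt * dx[0], x[1] + dt * dx[1])
        t += dt
        ys.append(observe(x, u, t, noise=rng.gauss(0.0, 0.001)))
    return ys
```

Calling `rollout((4.0, 4.0))` drives the position toward zero; a learning system in the sense of this section would adjust `theta` rather than have it supplied by hand.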
Thus, learning control means finding a (usually nonlinear) function $\pi$ that is adequate for a given desired behavior and movement system. A repertoire of motor skills is composed of many such policies that are sequenced and superimposed to achieve complex motor skills.

How the control policy is learned, however, can proceed in many different ways. Assuming that the model equations (2) and (3) are unknown, one classical approach is to learn these models using methods of function approximation and then compute a controller based on the estimated model, which is often discussed as the certainty-equivalence principle in the adaptive control literature [2]. Such techniques are summarized under the name model-based learning, indirect learning, or internal model learning. Alternatively, model-free learning of the policy is possible given an optimization or reward criterion, usually using methods from optimal control or reinforcement learning. Such model-free learning is also known as direct learning, since the policy is learned directly, i.e., without a detour through model identification.

It is useful to distinguish between several general classes of motor tasks that could be the goal of learning. Regulator tasks keep the system at a particular set point of operation; a typical example is balancing a pole on a fingertip or standing upright on two legs. Tracking tasks require the control system to follow a given desired trajectory within the abilities of the control system. Discrete movement tasks, also called one-shot tasks, are defined by achieving a particular goal at which the motor skill terminates. A basketball foul shot or grasping a cup of coffee are representative examples. Periodic movement tasks are typical in the domain of locomotion. Finally, complex movement tasks are composed of sequencing and superimposing simpler motor skills, e.g., leading to complex manipulation skills like emptying a dishwasher or assembling a bookshelf.
From the viewpoint of machine learning, robot learning can be classified as supervised learning, reinforcement learning, learning modularizations, or learning feature representations that subserve learning. All learning methods can benefit from giving the learning system prior knowledge about how to accomplish a motor task, and imitation learning or learning from demonstration is a popular approach to introduce this bias.

In summary, the goal of robot learning is to find an appropriate control policy to accomplish a given movement task, assuming that no traditional methods exist to compute the control policy. Approaches to robot learning can be classified and discussed using three dimensions: direct versus indirect control, the learning method used, and the class of tasks in question (Figure 1).

## Approaches to Robot Learning

We will use the classification in Figure 1 in the following sections to guide our survey of current and previous work in robot learning. Given space constraints, this survey is not meant to be comprehensive but rather to present illustrative projects in the various areas.

### Learning Internal Models for Control

Using learning to acquire internal models for control is useful when the analytical models are too complex to derive, and/or when it can be expected that the models change over time, e.g., due to wear and tear. Various kinds of internal models are used in robotics. The most well known are kinematics and dynamics models. For instance, the direct kinematics of a robot relates joint variables $q$ to end-effector variables $y$, i.e., $y = g(q)$ [11]. Dynamics models include kinetic terms like forces or torques, as in (2). The previous models are forward models, i.e., they model the causal relationship between input and output variables, and they are proper functions. Often, however, what is needed in control are inverse models, e.g., the inverse kinematics $q = g^{-1}(y)$ or the inverse dynamics $u = f^{-1}(q, \dot{q}, t)$.
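
The difference between a forward model and an inverse model can be illustrated with a small sketch. It assumes a planar two-link arm with unit link lengths, which is a textbook example and not the robot used in the article; the function names are hypothetical. The forward kinematics is a proper function, whereas the inverse returns two joint configurations for the same end-effector position, and averaging them does not reach the target.

```python
import math

def forward_kinematics(q, l1=1.0, l2=1.0):
    """y = g(q) for a planar two-link arm (link lengths are illustrative)."""
    q1, q2 = q
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return (x, y)

def inverse_kinematics(target, l1=1.0, l2=1.0):
    """Return both joint solutions (the two elbow configurations) for a reachable target."""
    px, py = target
    c2 = (px * px + py * py - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    c2 = max(-1.0, min(1.0, c2))  # guard against rounding just outside [-1, 1]
    solutions = []
    for q2 in (math.acos(c2), -math.acos(c2)):
        q1 = math.atan2(py, px) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
        solutions.append((q1, q2))
    return solutions

def average_configuration(solutions):
    """Naive average of joint solutions, as an averaging regressor might produce."""
    n = len(solutions)
    return (sum(s[0] for s in solutions) / n, sum(s[1] for s in solutions) / n)
```

For the target `(1.0, 1.0)`, both entries of `inverse_kinematics((1.0, 1.0))` map back to the target under `forward_kinematics`, while `average_configuration` of the two yields a stretched-out arm that ends up elsewhere.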
As discussed in [12], inverse models are often not functions, as the inverse relationships may be a one-to-many map, i.e., just a relation. Such cases pose a problem to learning methods and can be addressed with special techniques and representations [13]–[16].

Figure 1. Classification of robot learning along three dimensions: direct versus indirect control (model-free control, model-based control); class of task (regulator task, tracking task, one-shot tasks, periodic tasks, complex/composite tasks); and learning method (supervised learning, reinforcement learning, learning modularity, learning representations, imitation learning). Topics further out on the arrows can be considered more complex research topics than topics closer to the center.

Nonlinear function approximation is needed to learn internal models. It should be noted, as will be explained later, that function approximation is also required for other robot learning problems, e.g., to represent value functions, reward functions, or policies in reinforcement learning; thus, function approximation has a wide applicability in robot learning. While most machine-learning problems in function approximation work by processing a given data set in an offline fashion, robot learning has several features that require specialized algorithms:

- Data are available in abundance, typically at a rate from 60 to 1,000 data points per second.
- Given this continuous stream of data, learning should never stop, but continue forever without degradation over time. For instance, degradation happens in many algorithms if the same data point is given to the learning system repeatedly, e.g., when the robot is standing still.
- Given the high dimensionality of most interesting robotic systems, the complexity of the function to be learned is often unknown in advance, and the function approximation system needs to be able to add new learning resources as learning proceeds.
- Learning should happen in real time, be data efficient (squeeze the most information out of each data point), and be computationally efficient (to achieve real-time learning and lookup).
- Learning needs to be robust toward shifting input distributions, e.g., as typical when practicing calligraphy on one day and tennis on another day, a topic discussed in the context of catastrophic interference [17].
- Learning needs to be able to detect relevant features in the input from ideally hundreds or thousands of input dimensions, and it needs to exclude automatically irrelevant and redundant inputs.

These requirements narrow down the learning algorithms that are applicable to function approximation for robot learning. One approach that has favorable performance is learning with piecewise linear models using nonparametric regression techniques [17]–[22]. Essentially, this technique finds, in the spirit of a first-order Taylor series expansion, the linearization of the function at an input point, and the region (also called a kernel) in which this linearization holds within a certain error bound. Learning this region is the most complex part of these techniques, and the latest developments use Bayesian statistics [23] and dimensionality reduction [22].

A new development, largely due to increasingly faster computing hardware, is the application of Gaussian process regression (GPR) to function approximation in robots [24]–[26].
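
As a rough illustration of what GPR computes, the sketch below fits a one-dimensional function with a squared-exponential kernel and fixed hyperparameters. The kernel, the noise level, and all function names are assumptions made for this example and are not taken from the cited work; real applications would also optimize the hyperparameters. The key step is solving a linear system with the $N \times N$ kernel matrix, done here with a Cholesky factorization whose cost grows cubically in the number of training points $N$.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Lower-triangular factor L with A = L L^T (A must be positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_cholesky(L, b):
    """Solve (L L^T) x = b by forward and backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_fit(xs, ys, noise=1e-2, length=1.0):
    """Return alpha = (K + noise * I)^{-1} y for the N x N kernel matrix K."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j], length) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve_cholesky(cholesky(K), ys)

def gp_predict(x_new, xs, alpha, length=1.0):
    """Posterior mean at x_new."""
    return sum(a * rbf(x_new, xi, length) for a, xi in zip(alpha, xs))
```

With training inputs on a coarse grid and targets from a sine function, `gp_predict` interpolates smoothly between the samples.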
GPR is a powerful function approximation tool that has gained popularity due to its sound theory, high fitting accuracy, and the relative ease of application with public-domain software libraries. As it requires an iterative optimization that needs to invert a matrix of size $N \times N$, where $N$ is the number of training data points, GPR quickly saturates the computational resources with moderately many data points. Thus, scalability to continual and real-time learning in complex robots will require further research developments; some research along these lines is given in [25] and [27].

#### Example Application

As mentioned earlier, learning inverse models can be challenging, since the inverse model problem is often a relation and not a function, with a one-to-many mapping. Applying any arbitrary nonlinear function approximation method to the inverse model problem can lead to unpredictably bad performance, as the training data can form nonconvex solution spaces in which averaging is inappropriate [12]. A particularly interesting approach in control involves learning local linearizations of a forward model (which is a proper function) and learning an inverse mapping within the local region of the forward model; see also [15] and [28].

Ting et al. [23] demonstrated such a forward-inverse model learning approach with Bayesian locally weighted regression (BLWR) to learn an inverse kinematics model for a haptic robot arm (Figure 2) for a task-space tracking task. Training data consisted of the arm's joint angles $q$, joint velocities $\dot{q}$, end-effector position in Cartesian space $y$, and end-effector velocities $\dot{y}$. From this data, a differential forward kinematics model $\dot{y} = J(q)\dot{q}$ was learned, where $J$ is the Jacobian matrix. The transformation from $\dot{q}$ to $\dot{y}$ can be assumed to be locally linear at a particular configuration $q$ of the robot arm. BLWR is used to learn the forward model in a piecewise linear fashion.
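
The idea of learning a local linear forward model $\dot{y} = J(q)\dot{q}$ from data and inverting it locally can be sketched as follows. This is not BLWR: it substitutes plain kernel-weighted least squares with a fixed kernel width, uses a simulated, nonredundant planar two-link arm to generate data, and all names and constants are assumptions made for illustration.

```python
import math
import random

def jacobian(q, l1=1.0, l2=1.0):
    """Analytical Jacobian of a planar two-link arm; used only to simulate data."""
    q1, q2 = q
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def inverse_2x2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]

def make_data(n=5000, seed=0):
    """Simulated samples (q, q_dot, y_dot) standing in for recorded robot data."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        q = (rng.uniform(0.0, 1.5), rng.uniform(0.5, 2.0))
        qd = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        data.append((q, qd, matvec(jacobian(q), qd)))
    return data

def fit_local_jacobian(data, q_query, width=0.1):
    """Weighted least squares fit of y_dot = J q_dot around q_query."""
    A = [[0.0, 0.0], [0.0, 0.0]]  # sum of w * q_dot q_dot^T
    B = [[0.0, 0.0], [0.0, 0.0]]  # sum of w * y_dot q_dot^T
    for q, qd, yd in data:
        d2 = (q[0] - q_query[0]) ** 2 + (q[1] - q_query[1]) ** 2
        w = math.exp(-0.5 * d2 / width ** 2)  # Gaussian kernel on joint-space distance
        for i in range(2):
            for j in range(2):
                A[i][j] += w * qd[i] * qd[j]
                B[i][j] += w * yd[i] * qd[j]
    A_inv = inverse_2x2(A)
    # J_hat = B A^{-1}
    return [[sum(B[i][k] * A_inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
```

A joint velocity command for a desired end-effector velocity then follows from the local inverse, e.g., `matvec(inverse_2x2(fit_local_jacobian(make_data(), (0.7, 1.2))), [0.1, 0.0])`. In a redundant arm the local inverse is not unique, which is where the additional reward weighting discussed in this section comes in.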
The goal of the robot task is to track a desired trajectory $(y, \dot{y})$ specified only in terms of $x$, $z$ Cartesian positions and velocities, i.e., the movement is supposed to be in a vertical plane in front of the robot, but the exact position of the vertical plane is not given. Thus, the task has one degree of redundancy. To learn an inverse kinematics model, the local regions from the piecewise linear forward model can be reused since any local inverse is also locally linear within these regions. Moreover, for locally linear models, all solution spaces for the inverse model are locally convex, such that an inverse can be learned without problems. The redundancy issue can be solved by applying an additional weight to each data point according to a reward function, resulting in reward-weighted locally weighted regression [15]. Figure 2 shows the performance of the learned inverse model (Learned IK) in a figure-eight tracking task. The learned model performs with root-mean-squared tracking errors in positions and velocities very close to those of the analytical inverse kinematics solution. This performance was acquired from five minutes of real-time training data.

Figure 2. (a) Phantom robot. (b) Learned inverse kinematics solution, plotted as $z$ (m) over $x$ (m) for the desired trajectory and the learned IK; the difference between the actual and desired trajectory is small.

### Model-Based Learning

In considering model-based learning, it is useful to start by assuming that the model is perfect. Later, we will address the question of how to design a controller that is robust to flaws in the learned model.

#### Conventional Dynamic Programming

Designing controllers for linear models is well understood. Work in reinforcement learning has focused on using techniques derived from dynamic programming to design controllers for models that are nonlinear.
A large part of our own work has emphasized pushing back the curse of dimensionality, as the memory and computational cost of dynamic programming increase exponentially with the dimensionality of the state-action space.

Dynamic programming provides a way to find globally optimal control policies when the model of the control system is known. This section focuses on offline planning of nonlinear control policies for control problems with continuous states and actions, deterministic, time-invariant, discrete-time dynamics $x_{k+1} = f(x_k, u_k)$, and a time-invariant one-step cost or reward function $L(x, u)$; equivalent formulations exist for continuous-time systems [29]–[31]. We are addressing steady-state policies, i.e., policies that are not time variant and have an infinite time horizon.

One approach to dynamic programming is to approximate the value function $V(x)$ (the optimal total future cost from each state, $V(x) = \min_{u_k} \sum_{k=0}^{\infty} L(x_k, u_k)$) by repeatedly solving the Bellman equation $V(x) = \min_u \{ L(x, u) + V(f(x, u)) \}$ at sampled states $x$ until the value function estimates have converged to globally optimal values. Typically, the value function and control law are represented on a regular grid; it should be noted that more efficient adaptive grid methods [32], [33] or function approximation methods [7] also exist. Some type of interpolation is used to approximate these functions within each grid cell. If each dimension of the state and action is represented with a resolution $R$, and the dimensionality of the state is $d_x$ and that of the action is $d_u$, the computational cost of the conventional approach is proportional to $R^{d_x} \times R^{d_u}$ and the memory cost is proportional to $R^{d_x}$. This is known as the curse of dimensionality [9].

We have shown that dynamic programming can be sped up by randomly sampling actions on each sweep rather than exhaustively minimizing the Bellman equation with respect to the action [34].
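
A minimal sketch of grid-based value iteration with random action sampling is given below. The problem is an assumption chosen to keep the example small: a one-dimensional state with dynamics $x_{k+1} = x_k + u_k\,\Delta t$ clipped to $[-1, 1]$, one-step cost $(x^2 + u^2)\,\Delta t$, linear interpolation of the value function between grid points, and one random candidate action per state and sweep that is compared against the reevaluated current best action. None of these constants or names come from the article or from [34].

```python
import random

def value_iteration_random_actions(n=21, dt=0.1, sweeps=500, seed=0):
    rng = random.Random(seed)
    lo, hi = -1.0, 1.0
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    V = [0.0] * n          # value function on the grid
    best_u = [0.0] * n     # current best action stored for each grid state

    def interp(x):
        """Linear interpolation of V between neighboring grid points."""
        pos = (x - lo) / (hi - lo) * (n - 1)
        i0 = min(int(pos), n - 2)
        frac = pos - i0
        return (1.0 - frac) * V[i0] + frac * V[i0 + 1]

    def backup(x, u):
        """One-step cost plus interpolated value of the successor state."""
        x_next = min(hi, max(lo, x + dt * u))
        return (x * x + u * u) * dt + interp(x_next)

    for _ in range(sweeps):
        for i, x in enumerate(xs):
            candidate = rng.uniform(-1.0, 1.0)     # a single random action
            cost_best = backup(x, best_u[i])       # reevaluate the stored best action
            cost_cand = backup(x, candidate)
            if cost_cand < cost_best:
                best_u[i], cost_best = candidate, cost_cand
            V[i] = cost_best
    return xs, V, best_u
```

After enough sweeps, the stored actions push the state toward the origin from both sides, and the value function grows with the distance from the origin.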
At each state on each update, the current best action is reevaluated and compared to some number of random actions. Our studies have found that only looking at one random action on each update is most efficient. It is more effective to propagate information about future values by reevaluating the current best action on each update than it is to put a lot of resources into searching for the absolute best action. With this speedup in action search, currently available cluster computers can easily handle ten-dimensional problems (approximately $10^{10}$ points can handle grids of size $50^6$, $20^8$, or $10^{10}$, for example).

Current supercomputers are created by networking hundreds or thousands of conventional computers. The obvious way to implement dynamic programming on such a cluster is to partition the grid representing the value function and policy across the individual computing nodes, with the borders shared between multiple nodes. When a border cell is updated by its host node, the new value must be communicated to all nodes that have copies of that cell. We have implemented dynamic programming in a cluster of up to 100 nodes, with each node having eight CPU cores and 16 GB of memory. For example, running a cluster of 40 nodes on a six-dimensional problem with $50^6$ cells, about 6 GB is used on each node to store its region of the value function and policy.

#### Decomposing Problems

One way to reduce the curse of dimensionality is to break problems into parts and develop a controller for each part separately. Each subsystem could be ten-dimensional, given the earlier results, and a system that combined two subsystems could be 20-dimensional. For example, we are interested in developing a controller for biped walking [35]. We can approximately model the dynamics of a biped with separate models for sagittal and lateral control. These models are linked by common actions, such as when to put down and lift the feet.
Thus, there are two parts of the state vector $x$: variables that are part of the sagittal state $x_s$ and variables that are part of the lateral state $x_l$. There are three parts of the action vector $u$: variables that are part of the sagittal action $u_s$, variables that are part of the lateral action $u_l$, and variables that affect both systems $u_{sl}$. We can perform dynamic programming on the sagittal system and produce a value function $V_s(x_s)$ and do the same with the lateral system $V_l(x_l)$. We can choose an optimal action by minimizing $L(x, u) + V(f(x, u))$ with respect to $u$, with $V(x)$ approximated by $V_s(x_s) + V_l(x_l)$. This approximation ignores the linking of the two systems in the future and can be improved by adding elements to the one-step costs for each subsystem that bias the shared actions to behave as if the other system were present. For example, deviations from the timing usually seen in the complete system can be penalized.

#### Trajectory Optimization and Trajectory Libraries

Another way to handle complex systems is trajectory optimization. Given a model, a variety of approaches can be used to find a locally optimal sequence of commands for a given initial position and one-step cost [36]–[38]. Interestingly, trajectory optimization is quite popular for generating motion in animation [39]. However, trajectory optimization is not so popular in robotics, because it appears that it does not produce a control law but just a fixed sequence of commands. This is not a correct view.

To generate a control policy, trajectory optimization can be applied to many initial conditions, and the resulting commands can be interpolated as needed. If that is the case, why do we need to deal with dynamic programming and the curse of dimensionality? Dynamic programming is a global optimizer, while trajectory optimization finds local optima. Often, the local optima found are not acceptable.
Some way to bias trajectory optimization to produce reasonable trajectories would be useful. Also, if interpolation of the results will be done, it would be useful to produce consistent results so that similar initial conditions lead to similar costs. There may be discontinuities between nearby trajectories that must be handled by interpolation of actions.

One trick to improve trajectories is to use neighboring trajectories to somehow bias or guide the optimization process. A simple way to do this is to use a neighboring trajectory as the initial trajectory in the trajectory-optimization process. Trajectories can be reoptimized using each neighbor in turn as the initial trajectory, and the best result so far can be retained. We have explored building explicit libraries of optimized trajectories to handle large perturbations in bipedal standing balance [40]. One way of using the library is to use the optimized action corresponding to the nearest state in the library at each time step. Another way is to store the derivative of the optimized action with respect to state and use that derivative to modify the suggested action. A third way is to look up states from multiple trajectories and generate a weighted blend of the suggested actions.

The first and second derivatives of a trajectory's cost with respect to state can be used to generate a local Taylor series model of the value function: $V(x) = V_0 + V_x x + x^T V_{xx} x$. Given a quadratic local model of the value function, it is possible to compute the optimal action and its first derivative, the feedback gains. These observations led to a trajectory optimization method based on second-order gradient descent, differential dynamic programming (DDP) [29].
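
The first two ways of using a trajectory library described above (nearest-state lookup, and lookup corrected by the stored derivative of the action with respect to state) can be sketched as follows. The dictionary layout with keys `"x"`, `"u"`, and `"K"` and the function names are hypothetical; the library entries would normally come from optimized trajectories, whereas here they are filled in by hand.

```python
def nearest_entry(library, x):
    """Library entry whose stored state is closest to x (squared Euclidean distance)."""
    return min(library, key=lambda e: sum((a - b) ** 2 for a, b in zip(e["x"], x)))

def lookup_action(library, x, use_gain=True):
    """Action of the nearest entry, optionally corrected to first order:
    u = u_i + K_i (x - x_i), where K_i is the stored derivative du/dx."""
    e = nearest_entry(library, x)
    u = e["u"]
    if use_gain:
        u += sum(k * (a - b) for k, a, b in zip(e["K"], x, e["x"]))
    return u

# Hand-made library for illustration: states, actions, and gains consistent
# with the linear policy u = -2 * x[0] - 1 * x[1].
example_library = [
    {"x": (1.0, 0.0), "u": -2.0, "K": (-2.0, -1.0)},
    {"x": (0.5, -0.5), "u": -0.5, "K": (-2.0, -1.0)},
    {"x": (0.0, 0.0), "u": 0.0, "K": (-2.0, -1.0)},
]
```

For the query state `(0.9, -0.1)`, the plain nearest-state lookup returns the stored action of the entry at `(1.0, 0.0)`, while the gain-corrected lookup adjusts it for the offset between the query and the stored state.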
Although DDP is no longer considered the most efficient way to find an optimal trajectory [sequential quadratic programming (SQP) methods are currently preferred in many fields such as aerospace and animation], the local models of the value function and policy that DDP produces are useful for machine learning. For example, the local model of the policy can be used in a trajectory library to interpolate or extrapolate actions. Discrepancies in adjacent local models of the value function can be used to determine where to allocate additional library resources.

#### Robustness

Robustness has not been addressed well in robot learning. Studies often focus on robustness to additive noise. It is much more difficult to design controllers that are robust to the correlated errors caused by parameter error or model structure error. One approach to designing robust controllers is to optimize controller parameters by simulating a controller controlling a noisy robot [41]. It is more useful to optimize controller parameters controlling a set of robots, each with different robot parameters. This allows the effect of correlated controller errors across time to be handled in the optimization.

It is not clear how to perform a similar optimization over a set of models in dynamic programming. Using additive noise and performing stochastic dynamic programming does not capture the effect of correlated errors. One approach is to make the model parameters into model states and perform stochastic dynamic programming on information states that describe distributions of actual states and model parameters. However, this creates a large increase in the number of states, which is not practical for dynamic programming.

Bar-Shalom and Tse showed that DDP can be used to locally optimize controller robustness as well as exploration [42], [43].
This work provides an efficient solution to optimize the typically high-dimensional information state, which includes the means and covariances of the original model states and the means and covariances of the model parameters. Representing the uncertainty using a parametric probability distribution (means and covariances) also reduces the computational cost of propagating uncertainty forward in time. The dynamics of the system are given by an extended Kalman filter. The key observation is that the cost of uncertainty (the state and model parameter covariances) is given by $\mathrm{Trace}(V_{xx} \Sigma)$, the trace of the product of the second derivative of the value function and the covariance matrix of the state. Minimizing the additional cost due to uncertainty makes the controller more robust and guides exploration.

#### Example Application

Figure 3. The robot swinging up an inverted pendulum.

We implemented DDP on an actual robot as part of a learning from demonstration experiment (Figure 3). Several robustness issues arose since models are never perfect, especially learned models. 1) We needed initial trajectories that were consistent with the learned models, and sometimes reasonable or feasible trajectories do not exist due to modeling error in the learned model. 2) During optimization, the forward integration of a learned model in time often blows up when the learned model is inaccurate or when the plant is unstable and the current policy fails to stabilize it. 3) The backward integration to produce a value function and a corresponding policy uses derivatives of the learned model, which are often quite inaccurate in the early stages of learning, producing inaccurate value function estimates and ineffective policies. 4) Dynamic planners amplify modeling error, because they take advantage of any modeling error that reduces cost, and because some planners use derivatives, which can be quite inaccurate.
5) The new knowledge gained in attempting a task may not change the predictions the system makes about the task (falling down might not tell us much about the forces needed in walking). In the task shown in Figure 3, we used a direct reinforcement learning approach that adjusted the task goals in addition to optimal control to overcome modeling errors that the learning system did not handle [44].

We use another form of one-link pendulum swing-up as an example problem to provide the reader with a visualizable example of a value function and policy (Figure 4). In this one-link pendulum swing-up, a motor at the base of the pendulum swings a rigid arm from the downward stable equilibrium to the upright unstable equilibrium and balances the arm there. What makes this challenging is that the one-step cost function penalizes the amount of torque used and the deviation of the current position from the goal. The controller must try to minimize the total cost of the trajectory. The one-step cost function for this example is a weighted sum of the squared position errors ($\tilde{\theta}$: difference between current angle and the goal angle) and the squared torques $\tau$:

$$L(x, u) = 0.1\,\tilde{\theta}^2 T + \tau^2 T,$$

where 0.1 weights the position error relative to the torque penalty and $T$ is the time step of the simulation (0.01 s). Including the time step $T$ in the optimization criterion allows comparison with controllers with different time steps and continuous-time controllers. There are no costs associated with the joint velocity. Figure 4 shows the optimal value function and policy. The optimal trajectory is shown as a yellow line in the value function plot and as a black line with a yellow border in the policy plot [Figure 4(b) and (c)]. The value function is cut off above 20 so that we can see the details of the part of the value function that determines the optimal trajectory. The goal is at the state (0, 0).

### Model-Free Learning

There are several popular methods of approaching model-free robot learning.
Value function-based methods are discussed in the context of actor-critic methods, temporal difference (TD) learning, and Q-learning. A novel wave of algorithms avoids value functions and focuses on directly learning the policy, either with gradient methods or probabilistic methods.

#### Value Function Approaches

Instead of using dynamic programming, the value function $V(x)$ can be estimated with TD learning [7], [45]. Essentially, TD enforces the validity of the Bellman equations for temporally adjacent states, which can be shown to lead to a spatially consistent estimate of the value function for a given policy. To improve the policy, TD needs to be coupled to a simultaneous policy update using actor-critic methods [7].

Alternatively, instead of the value function $V(x)$, the action value function $Q(x, u)$ can be used, which is defined as $Q(x, u) = L(x_0, u_0) + \min_{u_k} \sum_{k=1}^{\infty} L(x_k, u_k)$, with $x_0 = x$ and $u_0 = u$ [7], [46]. Knowing $Q(x, u)$ for all actions in a state allows choosing the one with the maximal (or minimal for penalty costs) Q-value as the optimal action. Q-learning can be conceived of as TD learning in the joint space of states and actions.

TD and Q-learning work well for discrete state-action spaces but become more problematic in continuous state-action scenarios. In continuous spaces, function approximators need to be used to represent the value function and policy. Achieving reliable estimation of these functions usually requires a large number of samples that densely fill the relevant space for learning, which is hard to accomplish in actual experiments with complex robot systems. There are also no guarantees that, during learning, the robot will not be given unsafe commands. Thus, many practical approaches learn first
Thus, many practical approaches learn first Value Velocity (r/s) –6 20 15 10 5 0 –5 –10 –15 –20 –5 –4 –3 –2 –1 0 1 2 3 0 10 20 Torque (N · m) Position (r) Position (r) Velocity (r/s) –6 20 15 10 5 0 –5 –10 –15 –20 –5 –4 –3 –2 (a) (b) (c) (d) –1 0 1 2 3 0 10 20 –10 –6 –5 –4 –3 –2 Position (r) Velocity (r/s) 10123 –8 –6 –4 –2 0 2 4 6 8 10 Figure 4. (a) Configurations from the simulated one link pendulum optimal trajectory every half second and at the end of the trajectory. (b) Value function for one-link example. (c) Policy for one-link example. (d) Trajectory-based approach: random states (dots) and trajectories (black lines) used to plan one-link swing-up, superimposed on a contour map of the value function [33]. IEEE Robotics & Automation Magazine JUNE 2010 25 in simulations (which is essentially a model-based approach) until reasonable performance is achieved, before continuing to experiment on an actual robot to adjust the control policy to the true physics of the world [47]. In the end, it is intractable to find a globally optimal control policy in high dimensional robot systems, as global optimality requires exploration of the entire state-action space. Thus, local optimization such as trajectory optimization seems to be more practical, using initialization of the policy from some informed guess, for instance, imitation learning [44], [48] [51]. Fitted Q-iteration is an example of a model-free learning algorithm that approximates the Q-function only along some sampled trajectories [52], [53]. Recent developments have given up on estimating the value function and rather focus directly on learning the control policy from trajectory rollouts, which is the topic of the following sections. 
### Policy Gradient Methods

Policy gradient methods usually assume that the cost of a motor skill can be written as

$$J(x_0) = E_\tau\left\{ \sum_{k=0}^{N} \gamma^k L(x_k,u_k) \right\}, \qquad (4)$$

which is the expected sum of discounted rewards ($\gamma \in [0,1]$) over a (potentially infinite) time horizon $N$. The expectation $E\{\cdot\}$ is taken over all trajectories $\tau$ that start in state $x_0$. The goal is to find the motor commands $u_k$ that optimize this cost function. Most approaches assume that there is a start state $x = x_0$ and/or a start state distribution [54]. The control policy is also often compactly parameterized, e.g., by means of a basis function representation $u = \theta^T \phi(x)$, where $\theta$ are the policy parameters [see also (1)], and $\phi(x)$ is a vector of nonlinear basis functions provided by the user. Mainly for the purpose of exploration, the policy can be chosen to be stochastic, e.g., with a normal distribution $u \sim N(\theta^T \phi(x), \Sigma)$, although cases exist where only a stochastic policy is optimal [54].

The essence of policy gradient methods is to compute the gradient $\partial J / \partial \theta$ and optimize (4) with gradient-based incremental updates. As discussed in more detail in [55], a variety of algorithms exist to compute the gradient. Finite difference gradients [56] perform a perturbation analysis of the parameter vector $\theta$ and estimate the gradient from a first-order numerical Taylor series expansion. The REINFORCE algorithm [57], [58] is a straightforward derivative computation of the logarithm of (4), assuming as the probability of a trajectory $p_\theta(\tau) = p(x_0) \prod_{k=1}^{N} p(x_k \mid x_{k-1}, u_{k-1})\, \pi_\theta(u_{k-1} \mid x_{k-1})$, and emphasizing that the parameters $\theta$ only appear in the stochastic policy $\pi_\theta$ such that many terms in the gradient computation drop out. GPOMDP [59] and methods based on the policy gradient theorem [54] are more efficient versions of REINFORCE (for more details, see [55]). Peters and Schaal [60] suggested a second-order gradient method derived from insights of [61] and [62], which is currently among the fastest gradient-learning approaches.
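To make the likelihood-ratio idea behind REINFORCE more concrete, the following is a minimal sketch on a scalar toy problem. The dynamics $x_{k+1} = x_k + u_k$, the cost $x_k^2 + 0.1 u_k^2$, the single basis function $\phi(x) = x$, and all constants and function names are assumptions chosen for illustration; they are not taken from [57] or [58]. Because the system dynamics do not depend on $\theta$, only the policy terms $\partial \log \pi_\theta(u_k \mid x_k) / \partial \theta$ remain in the gradient of the log trajectory probability.

```python
import random

GAMMA, HORIZON, SIGMA = 0.95, 5, 0.1  # discount, rollout length, exploration noise (assumed)

def rollout(theta, rng=None):
    """Run one trajectory; return its discounted cost and d/dtheta log p_theta(tau).

    Without an rng the rollout is noise-free, which is used for evaluation only.
    """
    x, ret, grad_log_p = 1.0, 0.0, 0.0
    for k in range(HORIZON):
        noise = rng.gauss(0.0, SIGMA) if rng else 0.0
        u = theta * x + noise  # Gaussian policy u ~ N(theta * phi(x), SIGMA^2), phi(x) = x
        ret += GAMMA**k * (x * x + 0.1 * u * u)
        # d/dtheta of log N(u; theta * x, SIGMA^2) is (u - theta * x) * x / SIGMA^2
        grad_log_p += noise * x / SIGMA**2
        x = x + u  # toy dynamics
    return ret, grad_log_p

def reinforce(theta=0.0, n_updates=300, n_rollouts=50, lr=0.01, seed=0):
    rng = random.Random(seed)
    for _ in range(n_updates):
        samples = [rollout(theta, rng) for _ in range(n_rollouts)]
        baseline = sum(r for r, _ in samples) / n_rollouts  # reduces estimator variance
        grad = sum((r - baseline) * g for r, g in samples) / n_rollouts
        theta -= lr * grad  # gradient descent, because J is a cost
    return theta

theta = reinforce()
print(theta, rollout(0.0)[0], rollout(theta)[0])
```

The step size `lr` and the noise level `SIGMA` had to be picked by hand here, which is a small-scale instance of the tuning burden of gradient methods.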
Reference [63] emphasized that the way noise is injected into the stochastic policy can strongly influence the efficiency of the gradient updates. Policy gradient methods can scale to high-dimensional state-action spaces, at the cost of finding only locally optimal control policies, and have become rather popular in robotics [64]–[66]. One drawback of policy gradients is that they require manual tuning of the parameters of the gradient update (e.g., learning rates), which can be tedious. Probabilistic methods, as discussed in the next section, try to eliminate gradient computations.

### Probabilistic Direct Policy Learning

Transforming reinforcement learning into a probabilistic estimation approach is inspired by the hope of bringing to bear the wealth of statistical learning techniques that were developed over the last 20 years of machine-learning research. An early attempt can be found in [67], where reinforcement learning was formulated as an expectation-maximization (EM) algorithm [68]. The important idea was to treat the reward $L(x,u)$ as a pseudoprobability, i.e., it has to be strictly positive, and the integral over the state-action space of the reward has to result in a finite number. Transforming traditional convex reward functions with the exponential function is often used to achieve this property, at the cost that the learning problem gets slightly altered by this change of cost function. Equation (4) can thus be thought of as a likelihood, and the corresponding log likelihood becomes

$$\log J(x) = \log \int_\tau p_\theta(\tau) R(\tau)\, d\tau, \quad \text{where } R(\tau) = \sum_{k=0}^{N} \gamma^k L(x_k,u_k). \qquad (5)$$

This log likelihood can be optimized with the EM algorithm. In [15], such an approach was used to learn operational space controllers, where the reinforcement learning component enabled a consistent resolution of redundancy. In [69], the previous approach was extended to learning from trajectories; see also the contribution by Kober and Peters (pp. 55–62).
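The following sketch illustrates the general flavor of such an EM-style update; it is a simplified illustration under stated assumptions, not the algorithm of [67] or [69]. The toy system, the exploration in parameter space, the `temperature` used in the exponential transformation, and all names are hypothetical. Costs are mapped to strictly positive weights with the exponential function, and the new parameter is the weighted mean of the sampled parameters, so no gradient or learning rate appears.

```python
import math
import random

def episode_cost(theta):
    """Discounted cost of a noise-free rollout with policy u = theta * x (toy system, assumed)."""
    x, total = 1.0, 0.0
    for k in range(5):
        u = theta * x
        total += 0.95**k * (x * x + 0.1 * u * u)
        x += u
    return total

def em_policy_search(theta=0.0, n_iters=30, n_samples=20, sigma=0.2,
                     temperature=0.5, seed=0):
    rng = random.Random(seed)
    for _ in range(n_iters):
        # E-step: sample perturbed parameters and turn their costs into pseudoprobabilities
        thetas = [theta + rng.gauss(0.0, sigma) for _ in range(n_samples)]
        costs = [episode_cost(t) for t in thetas]
        best = min(costs)  # subtracting the minimum only rescales all weights
        weights = [math.exp(-(c - best) / temperature) for c in costs]
        # M-step: weighted maximum-likelihood estimate of the mean parameter
        theta = sum(w * t for w, t in zip(weights, thetas)) / sum(weights)
    return theta

theta = em_policy_search()
print(theta, episode_cost(0.0), episode_cost(theta))
```

The exponential transformation changes the objective slightly, as noted above: a smaller `temperature` concentrates the weights on the best samples, while a larger one averages more broadly.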
Extending [70], reference [71] added a more thorough treatment of learning in the infinite discounted horizon case, where the algorithm can essentially determine the most suitable temporal window for optimization.

Another way of transforming reinforcement learning into a statistical estimation problem was suggested in [72] and [73]. Here, it was realized that optimization with the stochastic Hamilton-Jacobi-Bellman equations can be transformed into a path-integral estimation problem, which can be derived with the Feynman-Kac theorem [31], [74]. While this formulation is normally based on value functions and requires a model-based approach, Theodorou et al. [31] realized that even model-free methods can be obtained. The resulting reinforcement learning algorithm resembles the one of [69], however, without the requirement that reinforcement is a pseudoprobability. Because of its grounding in first principles of optimal control theory, its simplicity, and no open learning parameters except for the exploration noise, this algorithm might be one of the most straightforward methods of trajectory-based reinforcement learning to date. It should also be mentioned that [75] developed a model-based reinforcement learning framework with a special probabilistic control cost for discrete state-action spaces that, in its limit to continuous state-action spaces, will result in a path-integral formulation.

### Example Application

Figure 5 illustrates our application of path-integral reinforcement learning to a robot-learning problem [31]. The robot dog is to jump across a gap. The jump should make as much forward progress as possible, as it is a maneuver in a legged locomotion competition, which scores the speed of the robot. The robot has three degrees of freedom (DoFs) per leg, and thus a total of 12 DoFs. Each DoF was represented as a parameterized movement primitive [76] with 50 basis functions.
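As a rough illustration of a path-integral style update, the sketch below applies probability-weighted averaging of exploration noise to a scalar toy problem. It is a strongly simplified stand-in and not the algorithm of [31] as applied to the robot dog: the parameters are simply one open-loop command per time step (instead of movement primitive weights), and the toy dynamics, the cost, the terminal penalty, the range normalization with `sharpness`, and all names are assumptions. The only open parameter of the update itself is the exploration noise `sigma`.

```python
import math
import random

def step_costs(controls):
    """Per-step costs of an open-loop control sequence on the toy system x_{k+1} = x_k + u_k."""
    x, costs = 1.0, []
    for u in controls:
        costs.append(x * x + 0.1 * u * u)
        x += u
    costs[-1] += 10.0 * x * x  # terminal penalty for not reaching the origin (assumed)
    return costs

def path_integral_update(theta, rng, n_rollouts=20, sigma=0.1, sharpness=10.0):
    """One probability-weighted averaging step over noisy rollouts."""
    n = len(theta)
    noises = [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(n_rollouts)]
    cost_to_go = []
    for eps in noises:
        costs = step_costs([t + e for t, e in zip(theta, eps)])
        tail, acc = [0.0] * n, 0.0
        for k in reversed(range(n)):
            acc += costs[k]
            tail[k] = acc  # cost accumulated from step k to the end of the rollout
        cost_to_go.append(tail)
    new_theta = list(theta)
    for k in range(n):
        column = [s[k] for s in cost_to_go]
        lo, hi = min(column), max(column)
        # Low cost-to-go -> high weight; costs are range-normalized before exponentiation
        weights = [math.exp(-sharpness * (s - lo) / (hi - lo + 1e-12)) for s in column]
        total = sum(weights)
        new_theta[k] += sum(w * eps[k] for w, eps in zip(weights, noises)) / total
    return new_theta

rng = random.Random(0)
theta = [0.0] * 10  # one command per time step, initialized to "do nothing"
initial_cost = sum(step_costs(theta))
for _ in range(100):
    theta = path_integral_update(theta, rng)
final_cost = sum(step_costs(theta))
print(initial_cost, final_cost)
```

No gradient and no learning rate are involved: each parameter moves toward the exploration noise of the rollouts that achieved a low cost-to-go from that time step onward.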
An initial seed behavior was taught by learning from demonstration, which allowed the robot to barely reach the other side of the gap without falling into the gap. The demonstration was generated from a manual adjustment of knot points in a spline-based trajectory plan for each leg. Path-integral reinforcement learning primarily used the forward progress as a reward and slightly penalized the squared acceleration of each DoF and the squared norm of the parameter vector, i.e., a typical form of complexity regularization [77]. Learning was performed on a physics simulator of the robot dog, as the real robot dog was not available for this experiment. Figure 5 illustrates that after about 30 trials, the performance of the robot was significantly improved, such that after the jump, almost the entire body was lying on the other side of the gap. It should be noted that applying path-integral reinforcement learning was algorithmically very simple, and manual tuning only focused on generating a good cost function.

## Imitation Learning, Policy Parameterizations, and Inverse Reinforcement Learning

While space constraints will not allow us to go into more detail, three interwoven topics in robot learning are worth mentioning.

First, imitation learning has become a popular topic to initialize and speed up robot learning. Reviews on this topic can be found, for instance, in [48], [49], and [78].

Second, determining useful parameterizations for control policies is a topic that is often discussed in conjunction with imitation learning. Many different approaches have been suggested in the literature, for instance, based on splines [79], hidden Markov models [80], nonlinear attractor systems [76], and other methods. Billard et al. [78] provide a survey of this topic.

Finally, designing useful reward functions remains one of the most time-consuming and frustrating topics in robot learning.
Thus, extracting the reward function from observed behavior is a topic of great importance for robot learning and imitation learning under the assumption that the observed behavior is optimal under a certain criterion. Inverse reinforcement learning [81], apprenticeship learning [82], and maximum margin planning [83] are some of the prominent examples in the literature.

Figure 5. (a) Actual and simulated robot dog. (b) Learning curve of optimizing the jump behavior with path-integral reinforcement learning (cost versus number of rollouts).

## Conclusions

Recent trends in robot learning are to use trajectory-based optimal control techniques and reinforcement learning to scale complex robotic systems. On the one hand, increased computational power and multiprocessing, and on the other hand, probabilistic reinforcement learning methods and function approximation, have contributed to a steadily increasing interest in robot learning. Imitation learning has helped significantly to start learning with reasonable initial behavior. However, many applications are still restricted to rather low-dimensional domains and toy applications. Future work will have to demonstrate the continual and autonomous learning abilities, which were alluded to in the introduction.

## Acknowledgments

This research was supported in part by National Science Foundation grants ECS-0326095, EEC-0540865, ECCS-0824077, IIS-0535282, CNS-0619937, IIS-0917318, CBET-0922784, and EECS-0926052, the DARPA program on Learning Locomotion, the Okawa Foundation, and the ATR Computational Neuroscience Laboratories.

## Keywords

Robot learning, learning control, reinforcement learning, optimal control.

## References

[1] S. Schaal, “The new robotics: Towards human-centered machines,” HFSP J. Frontiers Interdisciplinary Res. Life Sci., vol. 1, no. 2, pp. 115–126, 2007.

[2] K. J. Åström and B. Wittenmark, Adaptive Control. Reading, MA: Addison-Wesley, 1989.

[3] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. Cambridge, MA: MIT Press, 2005.

[4] M. Buehler, The DARPA Urban Challenge: Autonomous Vehicles in City Traffic, 1st ed. New York: Springer-Verlag, 2009.

[5] M. Buehler, K. Iagnemma, and S. Singh, The 2005 DARPA Grand Challenge: The Great Robot Race. New York: Springer-Verlag, 2007.

[6] M. Roy, G. Gordon, and S. Thrun, “Finding approximate POMDP solutions through belief compression,” J. Artif. Intell. Res., vol. 23, pp. 1–40, 2005.

[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.

[8] J. Si, Handbook of Learning and Approximate Dynamic Programming. Hoboken, NJ: IEEE Press/Wiley-Interscience, 2004.

[9] R. Bellman, Dynamic Programming. Princeton, NJ: Princeton Univ. Press, 1957.

[10] P. Dyer and S. R. McReynolds, The Computation and Theory of Optimal Control. New York: Academic, 1970.

[11] L. Sciavicco and B. Siciliano, Modelling and Control of Robot Manipulators. New York: Springer-Verlag, 2000.

[12] M. I. Jordan and D. E. Rumelhart, “Supervised learning with a distal teacher,” Cogn. Sci., vol. 16, pp. 307–354, 1992.

[13] A. D’Souza, S. Vijayakumar, and S. Schaal, “Learning inverse kinematics,” in Proc. IEEE Int. Conf. Intelligent Robots and Systems (IROS 2001), Maui, HI, Oct. 29–Nov. 3, 2001, pp. 298–301.

[14] D. Bullock, S. Grossberg, and F. H. Guenther, “A self-organizing neural model of motor equivalent reaching and tool use by a multijoint arm,” J. Cogn. Neurosci., vol. 5, no. 4, pp. 408–435, 1993.

[15] J. Peters and S. Schaal, “Learning to control in operational space,” Int. J. Robot. Res., vol. 27, pp. 197–212, 2008.

[16] Z. Ghahramani and M. I. Jordan, “Supervised learning from incomplete data via an EM approach,” in Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, Eds. San Mateo, CA: Morgan Kaufmann, 1994, pp. 120–127.

[17] S. Schaal and C. G.
Atkeson, “Constructive incremental learning from only local information,” Neural Comput., vol. 10, no. 8, pp. 2047–2084, 1998.

[18] W. S. Cleveland, “Robust locally weighted regression and smoothing scatterplots,” J. Amer. Statist. Assoc., vol. 74, pp. 829–836, 1979.

[19] C. G. Atkeson, “Using local models to control movement,” in Advances in Neural Information Processing Systems 1, D. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1989, pp. 157–183.

[20] C. G. Atkeson, A. W. Moore, and S. Schaal, “Locally weighted learning,” Artif. Intell. Rev., vol. 11, no. 1–5, pp. 11–73, 1997.

[21] C. G. Atkeson, A. W. Moore, and S. Schaal, “Locally weighted learning for control,” Artif. Intell. Rev., vol. 11, no. 1–5, pp. 75–113, 1997.

[22] S. Vijayakumar, A. D’Souza, and S. Schaal, “Incremental online learning in high dimensions,” Neural Comput., vol. 17, no. 12, pp. 2602–2634, 2005.

[23] J.-A. Ting, A. D’Souza, S. Vijayakumar, and S. Schaal, “A Bayesian approach to empirical local linearizations for robotics,” in Proc. Int. Conf. Robotics and Automation (ICRA2008), Pasadena, CA, May 19–23, 2008, pp. 2860–2865.

[24] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006.

[25] D. Nguyen-Tuong, M. Seeger, and J. Peters, “Local Gaussian process regression for real time online model learning and control,” in Proc. Advances in Neural Information Processing Systems 21 (NIPS 2008), D. Schuurmans, J. Benigio, and D. Koller, Eds. Vancouver, BC, Dec. 8–11, 2009, pp. 1193–1200.

[26] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, “Gaussian process dynamic programming,” Neurocomputing, vol. 72, no. 7–9, pp. 1508–1524, 2009.

[27] L. Csató and M. Opper, “Sparse representation for Gaussian process models,” in Proc. Advances in Neural Information Processing Systems 13 (NIPS 2000), Denver, CO, 2001, pp. 444–450.

[28] D. M. Wolpert and M. Kawato, “Multiple paired forward and inverse models for motor control,” Neural Netw., vol. 11, no. 7–8, pp.
1317–1329, 1998.

[29] D. H. Jacobson and D. Q. Mayne, Differential Dynamic Programming. New York: American Elsevier, 1970.

[30] K. Doya, “Reinforcement learning in continuous time and space,” Neural Comput., vol. 12, no. 1, pp. 219–245, Jan. 2000.

[31] E. Theodorou, J. Buchli, and S. Schaal, “Reinforcement learning in high dimensional state spaces: A path integral approach,” submitted for publication.

[32] R. Munos and A. Moore, “Variable resolution discretization in optimal control,” Mach. Learn., vol. 49, no. 2/3, p. 33, 2002.

[33] C. G. Atkeson and B. J. Stephens, “Random sampling of states in dynamic programming,” IEEE Trans. Syst., Man, Cybern. B, vol. 38, no. 4, pp. 924–929, 2008.

[34] C. G. Atkeson, “Randomly sampling actions in dynamic programming,” in Proc. IEEE Int. Symp. Approximate Dynamic Programming and Reinforcement Learning, 2007, ADPRL’07, pp. 185–192.

[35] E. Whitman and C. G. Atkeson, “Control of a walking biped using a combination of simple policies,” in Proc. IEEE/RAS Int. Conf. Humanoid Robotics, Paris, France, Dec. 7–10, 2009, pp. 520–527.

[36] Tomlab Optimization Inc. (2010). PROPT: Matlab optimal control software [Online]. Available: http://tomdyn.com/

[37] Technische Universität Darmstadt. (2010). DIRCOL: A direct collocation method for the numerical solution of optimal control problems [Online]. Available: http://www.sim.informatik.tu-darmstadt.de/sw/dircol

[38] Stanford Business Software Corporation. (2010). SNOPT: Software for large-scale nonlinear programming [Online]. Available: http://www.sbsi-sol-optimize.com/asp/sol_product_snopt.htm

[39] A. Safonova, J. K. Hodgins, and N. S. Pollard, “Synthesizing physically realistic human motion in low-dimensional, behavior-specific spaces,” ACM Trans. Graph. J. (SIGGRAPH 2004 Proc.), vol. 23, no. 3, pp. 514–521, 2004.

[40] L. Chenggang and C. G. Atkeson, “Standing balance control using a trajectory library,” presented at the IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS 2009), 2009.
[41] A. Ng, “Pegasus: A policy search method for large MDPs and POMDPs,” presented at the Uncertainty in Artificial Intelligence (UAI), 2000.

[42] E. Tse, Y. Bar-Shalom, and L. Meier, III, “Wide-sense adaptive dual control for nonlinear stochastic systems,” IEEE Trans. Automat. Contr., vol. 18, no. 2, pp. 98–108, 1973.

[43] Y. Bar-Shalom and E. Tse, “Caution, probing and the value of information in the control of uncertain systems,” Ann. Econ. Social Meas., vol. 4, no. 3, pp. 323–338, 1976.

[44] C. G. Atkeson and S. Schaal, “Robot learning from demonstration,” in Proc. 14th Int. Conf. Machine Learning (ICML’97), D. H. Fisher, Jr., Ed. Nashville, TN, July 8–12, 1997, pp. 12–20.

[45] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Mach. Learn., vol. 3, no. 1, pp. 9–44, 1988.

[46] C. J. C. H. Watkins, “Learning with delayed rewards,” Ph.D. thesis, Cambridge Univ., U.K., 1989.

[47] J. Morimoto and K. Doya, “Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning,” Robot. Auton. Syst., vol. 36, no. 1, pp. 37–51, 2001.

[48] S. Schaal, “Is imitation learning the route to humanoid robots?” Trends Cogn. Sci., vol. 3, no. 6, pp. 233–242, 1999.

[49] S. Schaal, A. Ijspeert, and A. Billard, “Computational approaches to motor learning by imitation,” Philos. Trans. R. Soc. London B, Biol. Sci., vol. 358, no. 1431, pp. 537–547, 2003.

[50] C. G. Atkeson and S. Schaal, “Learning tasks from a single demonstration,” in Proc. IEEE Int. Conf. Robotics and Automation (ICRA’97), Albuquerque, NM, Apr. 20–25, 1997, pp. 1706–1712.

[51] S. Schaal, “Learning from demonstration,” in Proc. Advances in Neural Information Processing Systems 9, M. C. Mozer, M. Jordan, and T. Petsche, Eds. Cambridge, MA, 1997, pp. 1040–1046.

[52] D. Ernst, P. Geurts, and L. Wehenkel, “Tree-based batch mode reinforcement learning,” J. Mach. Learn. Res., vol. 6, pp. 503–556, 2005.

[53] G. Neumann and J.
Peters, “Fitted Q-iteration by advantage weighted regression,” in Proc. Advances in Neural Information Processing Systems 21 (NIPS 2008), D. Schuurmans, J. Benigio, and D. Koller, Eds. Vancouver, BC, Dec. 8–11, 2009, pp. 1177–1184.

[54] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Proc. Advances in Neural Processing Systems 12, S. A. Solla, T. K. Leen, and K.-R. Müller, Eds. Denver, CO, 2000.

[55] J. Peters and S. Schaal, “Reinforcement learning of motor skills with policy gradients,” Neural Netw., vol. 21, no. 4, pp. 682–697, May 2008.

[56] P. Sadegh and J. Spall, “Optimal random perturbations for stochastic approximation using a simultaneous perturbation gradient approximation,” presented at the Proc. American Control Conf., 1997.

[57] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Mach. Learn., vol. 8, no. 3–4, pp. 229–256, 1992.

[58] V. Gullapalli, “A stochastic reinforcement learning algorithm for learning real-valued functions,” Neural Netw., vol. 3, no. 6, pp. 671–692, 1990.

[59] D. Aberdeen and J. Baxter, “Scaling internal-state policy-gradient methods for POMDPs,” in Proc. 19th Int. Conf. Machine Learning (ICML-2002), Sydney, Australia, 2002, pp. 3–10.

[60] J. Peters and S. Schaal, “Natural actor critic,” Neurocomputing, vol. 71, no. 7–9, pp. 1180–1190, 2008.

[61] S. Amari, “Natural gradient learning for over- and under-complete bases in ICA,” Neural Comput., vol. 11, no. 8, pp. 1875–1883, Nov. 1999.

[62] S. Kakade, “Natural policy gradient,” presented at the Advances in Neural Information Processing Systems, Vancouver, CA, 2002.

[63] T. Rückstieß, M. Felder, and J. Schmidhuber, “State-dependent exploration for policy gradient methods,” presented at the European Conf.
Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2008, Part II, LNAI 5212, 2008.

[64] G. Endo, J. Morimoto, T. Matsubara, J. Nakanishi, and G. Cheng, “Learning CPG-based biped locomotion with a policy gradient method: Application to a humanoid robot,” Int. J. Robot. Res., vol. 27, no. 2, pp. 213–228, 2008.

[65] R. Tedrake, T. W. Zhang, and S. Seung, “Stochastic policy gradient reinforcement learning on a simple 3D biped,” in Proc. Int. Conf. Intelligent Robots and Systems (IROS 2004), Sendai, Japan, Oct. 2004, pp. 2849–2854.

[66] J. Peters and S. Schaal, “Policy gradient methods for robotics,” in Proc. IEEE Int. Conf. Intelligent Robotics Systems (IROS 2006), Beijing, Oct. 9–15, 2006, pp. 2219–2225.

[67] P. Dayan and G. Hinton, “Using EM for reinforcement learning,” Neural Comput., vol. 9, no. 2, pp. 271–278, 1997.

[68] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Statist. Soc. B, vol. 39, no. 1, pp. 1–38, 1977.

[69] J. Kober and J. Peters, “Learning motor primitives in robotics,” in Proc. Advances in Neural Information Processing Systems 21 (NIPS 2008), D. Schuurmans, J. Benigio, and D. Koller, Eds. Vancouver, BC, Dec. 8–11, 2009, pp. 297–304.

[70] M. Toussaint and A. Storkey, “Probabilistic inference for solving discrete and continuous state Markov decision processes,” presented at the 23rd Int. Conf. Machine Learning (ICML 2006), 2006.

[71] N. Vlassis, M. Toussaint, G. Kontes, and S. Piperidis, “Learning model-free control by a Monte-Carlo EM algorithm,” Auton. Robots, vol. 27, no. 2, pp. 123–130, 2009.

[72] H. J. Kappen, “Linear theory for control of nonlinear stochastic systems,” Phys. Rev. Lett., vol. 95, no. 20, pp. 200201–200204, Nov. 2005.

[73] H. J. Kappen, “An introduction to stochastic control theory, path integrals and reinforcement learning,” in Cooperative Behavior in Neural Systems, vol. 887, J. Marro, P. L. Garrido, and J. J. Torres, Eds. 2007, pp.
149–181.

[74] E. Theodorou, J. Buchli, and S. Schaal, “Path integral stochastic optimal control for rigid body dynamics,” presented at the IEEE Int. Symp. Approximate Dynamic Programming and Reinforcement Learning (ADPRL2009), Nashville, TN, Mar. 30–Apr. 2, 2009.

[75] E. Todorov, “Efficient computation of optimal actions,” Proc. Nat. Acad. Sci. USA, vol. 106, no. 28, pp. 11478–11483, July 2009.

[76] A. Ijspeert, J. Nakanishi, and S. Schaal, “Learning attractor landscapes for learning motor primitives,” in Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, and K. Obermayer, Eds. 2003, pp. 1547–1554.

[77] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer-Verlag, 2006.

[78] A. Billard, S. Calinon, R. Dillmann, and S. Schaal, “Robot programming by demonstration,” in Handbook of Robotics, vol. 1, B. Siciliano and O. Khatib, Eds. Cambridge, MA: MIT Press, 2008, ch. 59.

[79] Y. Wada and M. Kawato, “Trajectory formation of arm movement by a neural network with forward and inverse dynamics models,” Syst. Comput. Jpn., vol. 24, pp. 37–50, 1994.

[80] T. Inamura, I. Toshima, H. Tanie, and Y. Nakamura, “Embodied symbol emergence based on mimesis theory,” Int. J. Robot. Res., vol. 23, no. 4–5, p. 363, Apr.-May 2004.

[81] A. Y. Ng and S. Russell, “Algorithms for inverse reinforcement learning,” in Proc. 17th Int. Conf. Machine Learning (ICML 2000), Stanford, CA, 2000, pp. 663–670.

[82] P. Abbeel and A. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proc. 21st Int. Conf. Machine Learning, 2004.

[83] N. Ratliff, D. Silver, and J. A. Bagnell, “Learning to search: Functional gradient techniques for imitation learning,” Auton. Robots, vol. 27, no. 1, pp. 25–53, 2009.

Stefan Schaal is a professor of computer science, neuroscience, and biomedical engineering at the University of Southern California and an invited researcher at the ATR Computational Neuroscience Laboratory in Japan.
He has coauthored more than 200 papers in refereed journals and conferences. He is a cofounder of the IEEE/RAS International Conference on Humanoid Robotics as well as Robotics: Science and Systems. He serves on the editorial board of Neural Networks, International Journal of Humanoid Robotics, and Frontiers in Neurorobotics. He is a Member of the German National Academic Foundation (Studienstiftung des Deutschen Volkes), Alexander von Humboldt Foundation, Society for Neuroscience, the Society for Neural Control of Movement, the IEEE, and AAAS. His research interests include topics of statistical and machine learning, neural networks, computational neuroscience, functional brain imaging, nonlinear dynamics, nonlinear control theory, and biomimetic robotics.

Christopher G. Atkeson received his M.S. degree in applied mathematics (computer science) from Harvard University and his Ph.D. degree in brain and cognitive sciences from Massachusetts Institute of Technology (MIT). He is a professor at the Robotics Institute and Human-Computer Interaction Institute, Carnegie Mellon University. He joined the MIT faculty in 1986 and moved to the Georgia Institute of Technology College of Computing in 1994. He has received the National Science Foundation Presidential Young Investigator Award, Sloan Research Fellowship, and Teaching Award from the MIT Graduate Student Council. His research focuses on humanoid robotics and robot learning by using challenging dynamic tasks such as juggling. His specific research interests include nonparametric learning, memory-based learning including approaches based on trajectory libraries, reinforcement learning, and other forms of learning based on optimal control, learning from demonstration, and modeling human behavior.

Address for Correspondence: Stefan Schaal, Computer Science, Neuroscience, and Biomedical Engineering, University of Southern California, Los Angeles, CA 90089-2905 USA. E-mail: sschaal@usc.edu.
IEEE Robotics & Automation Magazine JUNE 2010 29 ... or other equations of motion defining the system. This task can be cast as an instance of meta-optimization [45] or trajectory learning for control [24,46,47]. We formalize the meta-optimization problem as a list of problem- ... Article Full-text available Reduction of the circuit depth of quantum circuits is a crucial bottleneck to enabling quantum technology. This depth is inversely proportional to the number of available quantum gates that have been synthesized. Moreover, quantum gate-synthesis and control problems exhibit a vast range of external parameter dependencies, both physical and application specific. In this paper, we address the possibility of learning families of optimal-control pulses that depend adaptively on various parameters, in order to obtain a global optimal mapping from the space of potential parameter values to the control space and hence to produce continuous classes of gates. Our proposed method is tested on different experimentally relevant quantum gates and proves capable of producing high-fidelity pulses even in the presence of multiple variables or uncertain parameters with wide ranges. ... Moreover, ML could be implemented to improve the trajectory-planning process [27][28][29] or reduce vibrations [30,31]. ... Article Full-text available Positioning accuracy in robotics is a key issue for the manufacturing process. One of the possible ways to achieve high accuracy is the implementation of machine learning (ML), which allows robots to learn from their own practical experience and find the best way to perform the prescribed operation. Usually, accuracy improvement methods cover the generation of a positioning error map for the whole robot workspace, providing corresponding correction models. However, most practical cases require extremely high positioning accuracy only at a few essential points on the trajectory. 
This paper provides a methodology for the online deep Q-learning-based approach intended to increase positioning accuracy at key points by analyzing experimentally predetermined robot properties and their impact on overall accuracy. Using the KUKA-YouBot robot as a test system, we perform accuracy measurement experiments in the following three axes: (i) after a long operational break, (ii) using different loads, and (iii) at different speeds. To use this data for ML, the relationships between the robot’s operating time from switching on, load, and positioning accuracy are defined. In addition, the gripper vibrations are evaluated when the robot arm moves at various speeds in vertical and horizontal planes. It is found that the robot’s degrees of freedom (DOFs) clearances are significantly influenced by operational heat, which affects its static and dynamic accuracy. Implementation of the proposed ML-based compensation method resulted in a positioning error decrease at the trajectory key points by more than 30%. ... Thus, the solution of the transformed stochastic HJB is formulated as a conditional expectation value with respect to the system dynamics. As a result, the optimal control can be estimated using Monte Carlo methods drawing samples of stochastic trajectories [5]. While in general the resulting optimal feedback control function has an unknown structure, there are different approaches for its representation. ... Article Full-text available In this paper, a novel feature-based sampling strategy for nonlinear Model Predictive Path Integral (MPPI) control is presented. Using the MPPI approach, the optimal feedback control is calculated by solving a stochastic optimal control (OCP) problem online by evaluating the weighted inference of sampled stochastic trajectories. While the MPPI algorithm can be excellently parallelized, the closed-loop performance strongly depends on the information quality of the sampled trajectories. 
To draw samples, a proposal density is used. The solver’s, and thus the controller’s, performance is of high quality if the sampled trajectories drawn from this proposal density are located in low-cost regions of state-space. In classical MPPI control, the explored state-space is strongly constrained by assumptions that refer to the control value’s covariance matrix, which are necessary for transforming the stochastic Hamilton–Jacobi–Bellman (HJB) equation into a linear second-order partial differential equation. To achieve excellent performance even with discontinuous cost functions, in this novel approach, knowledge-based features are introduced to constitute the proposal density and thus the low-cost region of state-space for exploration. This paper addresses the question of how the performance of the MPPI algorithm can be improved using a feature-based mixture of base densities. Furthermore, the developed algorithm is applied to an autonomous vessel that follows a track and concurrently avoids collisions using an emergency braking feature. Therefore, the presented feature-based MPPI algorithm is applied and analyzed in both simulation and full-scale experiments.
... Learning from demonstration (LfD) and imitation learning allow agents to execute a task by observing the task being performed (Hussein et al., 2017). In the robotics domain, a goal of imitation learning is to produce a mapping, $\pi$, from states to actions, known as a control policy (Argall et al., 2009; Schaal and Atkeson, 2010), that has the maximum likelihood of producing the demonstration dataset $D = \{\rho_1, \rho_2, \ldots, \rho_n\}$, where each $\rho = (s_1, a_1), (s_2, a_2), \ldots$ ...
Article
Full-text available
Generalizing prior experiences to complete new tasks is a challenging and unsolved problem in robotics. In this work, we explore a novel framework for control of complex systems called Primitive Imitation for Control (PICO). The approach combines ideas from imitation learning, task decomposition, and novel task sequencing to generalize from demonstrations to new behaviors. Demonstrations are automatically decomposed into existing or missing sub-behaviors which allows the framework to identify novel behaviors while not duplicating existing behaviors. Generalization to new tasks is achieved through dynamic blending of behavior primitives. We evaluated the approach using demonstrations from two different robotic platforms. The experimental results show that PICO is able to detect the presence of a novel behavior primitive and build the missing control policy.
... Learning to make control decisions online in a stable and efficient manner is important in computer animation (Ling et al., 2020; Zhang & van de Panne, 2018), resource management (Zhou et al., 2011; Ignaciuk & Bartoszewicz, 2010), robotics (Andrychowicz et al., 2020; Xie et al., 2018; Schaal & Atkeson, 2010), and autonomous vehicles (Chen et al., 2020; Sadigh et al., 2016). Online decision making has a variety of challenges: from partial-observability and asymmetric information (Warrington et al., 2021; Choudhury et al., 2018), to function approximation and bootstrapping error (van Hasselt et al., 2018). ...
Preprint
Full-text available
We consider online imitation learning (OIL), where the task is to find a policy that imitates the behavior of an expert via active interaction with the environment. We aim to bridge the gap between the theory and practice of policy optimization algorithms for OIL by analyzing one of the most popular OIL algorithms, DAGGER. Specifically, if the class of policies is sufficiently expressive to contain the expert policy, we prove that DAGGER achieves constant regret. Unlike previous bounds that require the losses to be strongly-convex, our result only requires the weaker assumption that the losses be strongly-convex with respect to the policy's sufficient statistics (not its parameterization). In order to ensure convergence for a wider class of policies and losses, we augment DAGGER with an additional regularization term. In particular, we propose a variant of Follow-the-Regularized-Leader (FTRL) and its adaptive variant for OIL and develop a memory-efficient implementation, which matches the memory requirements of FTL. Assuming that the loss functions are smooth and convex with respect to the parameters of the policy, we also prove that FTRL achieves constant regret for any sufficiently expressive policy class, while retaining $O(\sqrt{T})$ regret in the worst case. We demonstrate the effectiveness of these algorithms with experiments on synthetic and high-dimensional control tasks.
... In general, learning control is the process of acquiring control strategies for a specific control system and a specific task through iterative trials [205]. It enables the estimation of unknown information as the system proceeds. ...
Article
Full-text available
The ball screw feed-drive system (BSFDS) is a precision transmission mechanism widely used in micron-scale positioning and motion trajectory control. Its desired specifications, including high acceleration, speed, accuracy, and stability, are challenged by vibration, friction, thermal error, uncertainty, etc. Inspired by these challenges, the modeling and control issues have been widely studied and discussed for decades. This paper presents an overview of modeling and control approaches, including identification, linear parameter varying, thermal error modeling and control, nonlinear control, and robust control. In particular, it reviews the emerging control issues and approaches, such as artificial intelligence, learning control, and data-driven control, which have increased in recent years.
... A standard MPC controller is not designed to handle uncertain events. Hence, parametric uncertainty can significantly affect control performance (Schaal and Christopher (2010); Fesharaki et al. (2017)). Currently, most of the climate control greenhouse applications assume perfect knowledge of the model parameters (for example, Blasco et al. (2007); Ding et al. (2018)). ...
Preprint
Full-text available
Achieving optimal resource use efficiency is a key challenge in modern greenhouse production systems. Optimal performance in terms of crop yield and resource efficiency can in theory be achieved via optimal control. Standard optimal controllers are not designed to deal with uncertainty, whereas considerable model prediction errors occur due to the mismatch between the model and the real system. This paper explores the relation between prediction uncertainty and performance with respect to crop yield, CO₂ demand, ventilation demand, and heating energy. This is done using the following steps: 1) formulation of parametric uncertainty underlying prediction uncertainty, 2) extension of an existing controller model with parametric uncertainty, 3) design of a sample-based robust model predictive controller, and 4) analysis of control performance under increasing parametric uncertainty. The results predict that control performance is highly sensitive to parametric uncertainty. A relative parameter uncertainty of 20% reduced crop yield by 11% compared to the case without uncertainty. Furthermore, a 20% uncertainty decreased CO₂ demand by 80%, whereas it increased ventilation demand by 96% and heating energy demand by 90%.
Book
At the dawn of the new millennium, robotics is undergoing a major transformation in scope and dimension. From a largely dominant industrial focus, robotics is rapidly expanding into the challenges of unstructured environments. Interacting with, assisting, serving, and exploring with humans, the emerging robots will increasingly touch people and their lives. The goal of the new series of Springer Tracts in Advanced Robotics (STAR) is to bring, in a timely fashion, the latest advances and developments in robotics on the basis of their significance and quality. It is our hope that the wider dissemination of research developments will stimulate more exchanges and collaborations among the research community and contribute to further advancement of this rapidly growing field. The volume edited by Martin Buehler, Karl Iagnemma and Sanjiv Singh presents a unique and extensive collection of the scientific results by the teams which took part in the DARPA Grand Challenge in October 2005 in the Nevada desert. This event reached an incredible peak of popularity in the media, the race of the century, as someone called it! The Grand Challenge demonstrated the fast-growing progress toward the development of robotics technology, as it showed the feasibility of using mobile robots operating autonomously in real-world scenarios.
Article
The minimum torque-change model predicts and reproduces human multijoint movement data quite well. However, there are three criticisms of the current neural network models for trajectory formation based on the minimum torque-change criterion: (1) their spatial representation of time; (2) backpropagation is essential; and (3) they require too many iterations. Accordingly, a new neural network model for trajectory formation is proposed based on the minimum torque-change criterion. This neural network model basically uses a forward dynamics model, an inverse dynamics model and a trajectory formation mechanism which generates an approximate minimum torque-change trajectory. It does not require spatial representation of time or backpropagation. Furthermore, there are fewer iterations required to obtain an approximate optimal solution. Finally, the proposed neural network model can be applied broadly in the engineering field because it is a new method for solving optimization problems with boundary conditions.
Article
This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates or even storing information from which such estimates could be computed. Specific examples of such algorithms are presented, some of which bear a close relationship to certain existing algorithms while others are novel but potentially interesting in their own right. Also given are results that show how such algorithms can be naturally integrated with backpropagation. We close with a brief discussion of a number of additional issues surrounding the use of such algorithms, including what is known about their limiting behaviors as well as further considerations that might be used to help develop similar but potentially more powerful reinforcement learning algorithms.
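The update rule summarized in this abstract can be illustrated with a minimal sketch. It assumes a single Bernoulli-logistic unit acting in an immediate-reinforcement task; the two-armed bandit, its reward probabilities, the learning rate, and the running-average baseline are hypothetical choices made for this example, not details taken from the article.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reinforce_bandit(reward_probs=(0.2, 0.8), alpha=0.1, steps=5000, seed=0):
    """Train one Bernoulli-logistic unit with a REINFORCE-style update.

    reward_probs[y] is the (hypothetical) chance of reward 1 for output y.
    Returns the final probability of emitting output 1.
    """
    rng = random.Random(seed)
    w = 0.0         # single bias weight; the unit has no inputs in this toy task
    baseline = 0.0  # running average of reward, used as reinforcement baseline
    for _ in range(steps):
        p = sigmoid(w)
        y = 1 if rng.random() < p else 0                    # stochastic output
        r = 1.0 if rng.random() < reward_probs[y] else 0.0  # sampled reward
        # Eligibility of a Bernoulli-logistic unit: d ln g / d w = y - p.
        w += alpha * (r - baseline) * (y - p)
        baseline += 0.05 * (r - baseline)
    return sigmoid(w)
```

Calling `reinforce_bandit()` should drive the output probability toward the better-rewarded action. No gradient of the expected reward is computed explicitly; the weight only ever sees the sampled reward and the unit's own eligibility.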
Article
We address the role of noise and the issue of efficient computation in stochastic optimal control problems. We consider a class of nonlinear control problems that can be formulated as a path integral and where the noise plays the role of temperature. The path integral displays symmetry breaking and there exists a critical noise value that separates regimes where optimal control yields qualitatively different solutions. The path integral can be computed efficiently by Monte Carlo integration or by a Laplace approximation, and can therefore be used to solve high dimensional stochastic control problems.
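As a rough illustration of the Monte Carlo route mentioned in this abstract, the sketch below estimates the cost-to-go and the optimal control for a one-dimensional toy problem by weighting uncontrolled noisy rollouts. The dynamics, the quadratic end cost, the absence of a state cost along the path, and all parameter values are assumptions of this example rather than details from the article.

```python
import math
import random


def path_integral_control(x0=1.0, horizon=1.0, steps=5, nu=1.0, R=1.0,
                          n_samples=20000, seed=0):
    """Monte Carlo estimate of cost-to-go J and control u at (x0, t=0).

    Toy problem (assumed here): dx = u dt + dxi with Var(dxi) = nu dt,
    control cost (R/2) u^2, end cost x_T^2 / 2, no state cost on the path.
    """
    rng = random.Random(seed)
    lam = nu * R                 # "temperature" tying noise to control cost
    dt = horizon / steps
    sigma = math.sqrt(nu * dt)
    sum_w = 0.0
    sum_w_first = 0.0
    for _ in range(n_samples):
        x = x0
        first = 0.0
        for k in range(steps):
            dxi = rng.gauss(0.0, sigma)   # uncontrolled (u = 0) diffusion step
            if k == 0:
                first = dxi
            x += dxi
        w = math.exp(-0.5 * x * x / lam)  # path weight exp(-S / lambda)
        sum_w += w
        sum_w_first += w * first
    J = -lam * math.log(sum_w / n_samples)
    u = sum_w_first / (sum_w * dt)        # weighted mean of the first noise step
    return J, u
```

For this linear-quadratic special case the estimates can be checked against the Riccati solution: with the default parameters, J is close to 0.25 + 0.5 ln 2 and u is close to -0.5, up to sampling error.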
Article
Three studies are presented which explore the information conveyed by demonstrations and which evaluate the effectiveness of demonstrations in providing the learner with sufficient information to model both kinematic and kinetic features of motor skill. In all three experiments adult female participants viewed a video-recorded demonstration of an adult female dance expert’s performance of a two-dimensional dance step. Immediately after viewing the demonstration participants attempted to replicate the movement. Performance attempts were video-recorded and biomechanically analysed on a number of measures including movement outcome accuracy, movement form (angular displacement), absolute and relative timing of movements, and peak force on landing and take-off. Results of the programme of research suggest that the demonstration viewing condition affected the quality of visual information conveyed which in turn affected subsequent efforts at replication. Results are discussed in terms of coordination and control aspects of motor skill acquisition.
Article
A Markov Decision Process (MDP) is a natural framework for formulating sequential decision-making problems under uncertainty. In recent years, researchers have greatly advanced algorithms for learning and acting in MDPs. This article reviews such algorithms, beginning with well-known dynamic programming methods for solving MDPs such as policy iteration and value iteration, then describes approximate dynamic programming methods such as trajectory-based value iteration, and finally moves to reinforcement learning methods such as Q-Learning, SARSA, and least-squares policy iteration. We describe algorithms in a unified framework, giving pseudocode together with memory and iteration complexity analysis for each. Empirical evaluations of these techniques with four representations across four domains provide insight into how these algorithms perform with various feature sets in terms of running time and performance.
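Value iteration, the first dynamic programming method named in this abstract, can be sketched in a few lines for a tabular MDP. The data layout (`P[s][a]` as a list of `(probability, next_state)` pairs, `R[s][a]` as expected immediate reward) and the three-state chain below are hypothetical choices for illustration and are not the article's pseudocode.

```python
def q_value(P, R, V, s, a, gamma):
    """One-step lookahead value of taking action a in state s."""
    return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])


def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Tabular value iteration with in-place (Gauss-Seidel) sweeps."""
    n = len(P)
    V = [0.0] * n
    while True:
        delta = 0.0
        for s in range(n):
            best = max(q_value(P, R, V, s, a, gamma) for a in range(len(P[s])))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values (ties pick action 0).
    policy = [max(range(len(P[s])), key=lambda a: q_value(P, R, V, s, a, gamma))
              for s in range(n)]
    return V, policy


# Hypothetical chain: actions 0 = left, 1 = right; state 2 is absorbing.
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 0)], [(1.0, 2)]],
     [[(1.0, 2)], [(1.0, 2)]]]
R = [[0.0, 0.0],
     [0.0, 1.0],   # stepping right from state 1 into state 2 pays reward 1
     [0.0, 0.0]]
V, policy = value_iteration(P, R)
```

On this chain the values settle at 0.9, 1.0, and 0.0 for states 0, 1, and 2, and the greedy policy moves right in the two non-absorbing states.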