Autonomous inverted helicopter flight via
reinforcement learning
Andrew Y. Ng1, Adam Coates1, Mark Diel2, Varun Ganapathi1, Jamie
Schulte1, Ben Tse2, Eric Berger1, and Eric Liang1
1Computer Science Department, Stanford University, Stanford, CA 94305
2Whirled Air Helicopters, Menlo Park, CA 94025
Abstract. Helicopters have highly stochastic, nonlinear dynamics, and autonomous
helicopter flight is widely regarded to be a challenging control problem. As heli-
copters are highly unstable at low speeds, it is particularly difficult to design con-
trollers for low speed aerobatic maneuvers. In this paper, we describe a successful
application of reinforcement learning to designing a controller for sustained in-
verted flight on an autonomous helicopter. Using data collected from the helicopter
in flight, we began by learning a stochastic, nonlinear model of the helicopter’s
dynamics. Then, a reinforcement learning algorithm was applied to automatically
learn a controller for autonomous inverted hovering. Finally, the resulting controller
was successfully tested on our autonomous helicopter platform.
1 Introduction
Autonomous helicopter flight represents a challenging control problem with
high dimensional, asymmetric, noisy, nonlinear, non-minimum phase dynam-
ics, and helicopters are widely regarded to be significantly harder to control
than fixed-wing aircraft [3,10]. But helicopters are uniquely suited to many
applications requiring either low-speed flight or stable hovering. The con-
trol of autonomous helicopters thus provides an important and challenging
testbed for learning and control algorithms.
Some recent examples of successful autonomous helicopter flight are given
in [7,2,9,8]. Because helicopter flight is usually open-loop stable at high speeds
but unstable at low speeds, we believe low-speed helicopter maneuvers are
particularly interesting and challenging. In previous work, Ng et al. (2004)
considered the problem of learning to fly low-speed maneuvers very accu-
rately. In this paper, we describe a successful application of machine learning
to performing a simple low-speed aerobatic maneuver—autonomous sustained
inverted hovering.
2 Helicopter platform
To carry out flight experiments, we began by instrumenting a Bergen in-
dustrial twin helicopter (length 59”, height 22”) for autonomous flight. This
helicopter is powered by a twin cylinder 46cc engine, and has an unloaded
weight of 18 lbs.

Fig. 1. Helicopter in configuration for upright-only flight (single GPS antenna).
Our initial flight tests indicated that the Bergen industrial twin’s original
rotor-head was unlikely to be sufficiently strong to withstand the forces en-
countered in aerobatic maneuvers. We therefore replaced the rotor-head with
one from an X-Cell 60 helicopter. We also instrumented the helicopter with
a PC104 flight computer, an Inertial Science ISIS-IMU (accelerometers and
turning-rate gyroscopes), a Novatel GPS unit, and a MicroStrain 3d mag-
netic compass. The PC104 was mounted in a plastic enclosure at the nose
of the helicopter, and the GPS antenna, IMU, and magnetic compass were
mounted on the tail boom. The IMU in particular was mounted fairly close
to the fuselage, to minimize measurement noise arising from tail-boom vibra-
tions. The fuel tank, originally mounted at the nose, was also moved to the
rear. Figure 1 shows our helicopter in this initial instrumented configuration.
Readings from all the sensors are fed to the onboard PC104 flight com-
puter, which runs a Kalman filter to obtain position and orientation estimates
for the helicopter at 100Hz. A custom takeover board also allows the com-
puter either to read the human pilot’s commands that are being sent to the
helicopter control surfaces, or to send its own commands to the helicopter.
The onboard computer also communicates with a ground station via 802.11b
wireless.
Most GPS antennas (particularly differential, L1/L2 ones) are directional,
and a single antenna pointing upwards relative to the helicopter would be un-
able to see any satellites if the helicopter is inverted. Thus, a single, upward-
pointing antenna cannot be used to localize the helicopter in inverted flight.
We therefore added to our system a second antenna facing downwards, and
used a computer-controlled relay for switching between them. By examining
the Kalman filter output, our onboard computer automatically selects the
upward-facing antenna. (See Figure 2a.) We also tried a system in which
the two antennas were simultaneously connected to the receiver via a Y-cable
(without a relay). In our experiments, this suffered from significant GPS
multipath problems and was not usable.

Fig. 2. (a) Dual GPS antenna configuration (one antenna is mounted on the tail-boom
facing up; the other is shown facing down in the lower-left corner of the picture).
The small box on the left side of the picture (mounted on the left side of the
tail-boom) is a computer-controlled relay. (b) Graphical simulator of the helicopter,
built using the learned helicopter dynamics.
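The antenna-selection logic described in this section can be sketched as follows. This is a minimal illustration rather than the flight code; the 90-degree roll threshold, the hysteresis band, and the function and antenna names are assumptions made for the example.

```python
import numpy as np

def select_gps_antenna(estimated_roll_rad: float,
                       current_antenna: str,
                       hysteresis_rad: float = np.deg2rad(20.0)) -> str:
    """Choose which GPS antenna should face the sky, given the Kalman
    filter's roll estimate. A hysteresis band (assumed here) keeps the
    relay from chattering when the helicopter is near 90 degrees of roll."""
    upside_down = abs(estimated_roll_rad) > np.pi / 2 + hysteresis_rad
    upright = abs(estimated_roll_rad) < np.pi / 2 - hysteresis_rad
    if upside_down:
        return "down_facing"   # inverted: the belly-mounted antenna now points at the sky
    if upright:
        return "up_facing"
    return current_antenna      # inside the hysteresis band: keep the current antenna

# Example: inverted flight (roll near 180 degrees) selects the down-facing antenna.
print(select_gps_antenna(np.deg2rad(178.0), "up_facing"))  # -> "down_facing"
```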
3 Machine learning for controller design
A helicopter such as ours has a high center of gravity when in inverted hover,
making inverted flight significantly less stable than upright flight (which is
also unstable at low speeds). Indeed, there are far more human RC pilots
who can perform high-speed aerobatic maneuvers than can keep a helicopter
in sustained inverted hover. Thus, designing a stable controller for sustained
inverted flight appears to be a difficult control problem.
Most helicopters are flown using four controls:
a[1] and a[2]: The longitudinal (front-back) and latitudinal (left-right)
cyclic pitch controls cause the helicopter to pitch forward/backwards or
sideways, and can thereby also be used to affect acceleration in the lon-
gitudinal and latitudinal directions.
a[3]: The main rotor collective pitch control causes the main rotor blades
to rotate along an axis that runs along the length of the rotor blade, and
thereby affects the angle at which the main rotor’s blades are tilted rela-
tive to the plane of rotation. As the main rotor blades sweep through the
air, they generate an amount of upward thrust that (generally) increases
with this angle. By varying the collective pitch angle, we can affect the
main rotor’s thrust. For inverted flight, by setting a negative collective
pitch angle, we can cause the helicopter to produce negative thrust.
a[4]: The tail rotor collective pitch control affects tail rotor thrust, and
can be used to yaw (turn) the helicopter.
A fifth control, the throttle, is commanded as a pre-set function of the main
rotor collective pitch, and can safely be ignored for the rest of this paper.
To design the controller for our helicopter, we began by learning a stochas-
tic, nonlinear, model of the helicopter dynamics. Then, a reinforcement learn-
ing/policy search algorithm was used to automatically design a controller.
3.1 Model identification
We applied supervised learning to identify a model of the helicopter's dy-
namics. We began by asking a human pilot to fly the helicopter upside-down,
and logged the pilot commands and the helicopter state $s$, comprising its position
$(x, y, z)$, orientation (roll $\phi$, pitch $\theta$, yaw $\omega$), velocity $(\dot{x}, \dot{y}, \dot{z})$ and angular
velocities $(\dot{\phi}, \dot{\theta}, \dot{\omega})$. A total of 391 s of flight data was collected for model iden-
tification. Our goal was to learn a model that, given the state $s_t$ and the
action $a_t$ commanded by the pilot at time $t$, would give a good estimate of
the probability distribution $P_{s_t a_t}(s_{t+1})$ of the resulting state of the helicopter
$s_{t+1}$ one time step later.
Following standard practice in system identification [4], we converted the
original 12-dimensional helicopter state into a reduced 8-dimensional state
represented in body coordinates $s^b = [\phi, \theta, \dot{x}, \dot{y}, \dot{z}, \dot{\phi}, \dot{\theta}, \dot{\omega}]$. Where there is risk
of confusion, we will use superscripts $s$ and $b$ to distinguish between spatial
(world) coordinates and body coordinates. The body coordinate representa-
tion specifies the helicopter state using a coordinate frame in which the $x$,
$y$, and $z$ axes are forwards, sideways, and down relative to the current ori-
entation of the helicopter, instead of north, east and down. Thus, $\dot{x}^b$ is the
forward velocity, whereas $\dot{x}^s$ is the velocity in the northern direction. ($\phi$ and
$\theta$ are always expressed in world coordinates, because roll and pitch relative
to the body coordinate frame are always zero.) By using a body coordinate
representation, we encode into our model certain “symmetries” of helicopter
flight, such as that the helicopter's dynamics are the same regardless of its
absolute position and orientation (assuming the absence of obstacles).¹

¹ Actually, by handling the effects of gravity explicitly, it is possible to obtain an
even better model that uses a further reduced, 6-dimensional, state, by eliminating
the state variables $\phi$ and $\theta$. We found this additional reduction useful and included
it in the final version of our model; however, a full discussion is beyond the scope
of this paper.
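To illustrate the body coordinate representation, the sketch below rotates a world-frame (north, east, down) velocity into the body frame from the current roll, pitch, and yaw. The ZYX Euler-angle convention and the function name are assumptions of this example; the paper does not specify the exact convention used onboard.

```python
import numpy as np

def world_to_body_velocity(v_world, roll, pitch, yaw):
    """Rotate a velocity from world (NED) coordinates into body coordinates,
    assuming a ZYX (yaw-pitch-roll) Euler angle convention."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    # Rotation matrix taking body-frame vectors to the world frame (ZYX convention).
    R_wb = np.array([
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp,     cp * sr,                cp * cr],
    ])
    # World-to-body is the transpose (the inverse of a rotation matrix).
    return R_wb.T @ np.asarray(v_world)

# A northward velocity seen by a helicopter yawed 90 degrees (nose pointing east)
# appears as a sideways (negative y) velocity in the body frame.
print(world_to_body_velocity([1.0, 0.0, 0.0], 0.0, 0.0, np.pi / 2))
```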
Even in the reduced coordinate representation, only a subset of the state
variables needs to be modeled explicitly using learning. Specifically, the roll $\phi$
and pitch $\theta$ (and yaw $\omega$) angles of the helicopter over time can be computed
exactly as a function of the roll rate $\dot{\phi}$, pitch rate $\dot{\theta}$ and yaw rate $\dot{\omega}$. Thus,
given a model that predicts only the angular velocities, we can numerically
integrate the velocities over time to obtain orientations.
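A minimal sketch of this integration step, assuming a simple forward-Euler update at the model's 0.1 s timestep and treating the logged rates directly as Euler-angle derivatives; the integration scheme actually used onboard is not stated in the paper.

```python
import numpy as np

def integrate_angles(initial_angles, angular_rates, dt=0.1):
    """Forward-Euler integration of (roll, pitch, yaw) rates into angles.

    initial_angles: shape (3,) array [phi, theta, omega] at time 0.
    angular_rates:  shape (T, 3) array of rates, one row per 0.1 s step.
    Treating the rates as Euler-angle derivatives is a simplification of this sketch.
    """
    angles = [np.asarray(initial_angles, dtype=float)]
    for rate in np.asarray(angular_rates, dtype=float):
        nxt = angles[-1] + dt * rate
        # Wrap yaw into (-pi, pi] so angle differences stay well defined.
        nxt[2] = (nxt[2] + np.pi) % (2 * np.pi) - np.pi
        angles.append(nxt)
    return np.stack(angles)

# One second of a constant 10 deg/s yaw rate turns the helicopter about 10 degrees.
rates = np.tile([0.0, 0.0, np.deg2rad(10.0)], (10, 1))
print(np.rad2deg(integrate_angles([0.0, 0.0, 0.0], rates)[-1]))
```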
We identified our model at 10 Hz, so that the difference in time between $s_t$
and $s_{t+1}$ was 0.1 seconds. We used linear regression to learn to predict, given
$s^b_t \in \mathbb{R}^8$ and $a_t \in \mathbb{R}^4$, a sub-vector of the state variables at the next timestep,
$[\dot{x}^b_{t+1}, \dot{y}^b_{t+1}, \dot{z}^b_{t+1}, \dot{\phi}^b_{t+1}, \dot{\theta}^b_{t+1}, \dot{\omega}^b_{t+1}]$. This body coordinate model is then con-
verted back into a world coordinate model, for example by integrating an-
gular velocities to obtain world coordinate angles. Note that because the
process of integrating angular velocities expressed in body coordinates to
obtain angles expressed in world coordinates is nonlinear, the final model
resulting from this process is also necessarily nonlinear. After recovering the
world coordinate orientations via integration, it is also straightforward to ob-
tain the rest of the world coordinates state. (For example, the mapping from
body coordinate velocity to world coordinate velocity is simply a rotation.)
Lastly, because helicopter dynamics are inherently stochastic, a determin-
istic model would be unlikely to fully capture a helicopter’s range of possible
behaviors. We modeled the errors in the one-step predictions of our model as
Gaussian, and estimated the magnitude of the noise variance via maximum
likelihood.
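The identification procedure described above can be sketched as a linear least-squares fit from the current body-coordinate state and action to the next step's velocities and angular rates, with the residual variance of each output taken as the maximum-likelihood Gaussian noise estimate. The function names and the use of plain least squares are illustrative assumptions that follow the description in the text, not the authors' actual code.

```python
import numpy as np

def fit_one_step_model(S_body, A, S_next_sub):
    """Fit a stochastic one-step dynamics model by linear regression.

    S_body:     (N, 8) body-coordinate states [phi, theta, xd, yd, zd, phid, thetad, omegad]
    A:          (N, 4) pilot actions
    S_next_sub: (N, 6) next-step velocities and angular rates to predict
    Returns (W, sigma): regression weights and per-output noise standard deviations.
    """
    X = np.hstack([S_body, A, np.ones((len(S_body), 1))])  # inputs plus a bias term
    W, *_ = np.linalg.lstsq(X, S_next_sub, rcond=None)      # least-squares fit
    residuals = S_next_sub - X @ W
    sigma = residuals.std(axis=0)                            # ML estimate of the Gaussian noise
    return W, sigma

def sample_next(W, sigma, s_body, a, rng):
    """Draw one sample of the next-step velocities from the learned stochastic model."""
    x = np.concatenate([s_body, a, [1.0]])
    return x @ W + rng.normal(0.0, sigma)

# Usage with synthetic stand-in data (the real inputs would be the 391 s flight log at 10 Hz).
rng = np.random.default_rng(0)
S, A, Y = rng.normal(size=(3910, 8)), rng.normal(size=(3910, 4)), rng.normal(size=(3910, 6))
W, sigma = fit_one_step_model(S, A, Y)
print(sample_next(W, sigma, S[0], A[0], rng).shape)  # (6,)
```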
The result of this procedure is a stochastic, nonlinear model of our heli-
copter’s dynamics. To verify the learned model, we also implemented a graph-
ical simulator (see Figure 2b) with a joystick control interface similar to that
on the real helicopter. This allows the pilot to fly the helicopter in simulation
and verify the simulator’s modeled dynamics. The same graphical simulator
was subsequently also used for controller visualization and testing.
3.2 Controller design via reinforcement learning
Having built a model/simulator of the helicopter, we then applied reinforce-
ment learning to learn a good controller.
Reinforcement learning [11] gives a set of tools for solving control problems
posed in the Markov decision process (MDP) formalism. An MDP is a tuple
$(S, s_0, A, \{P_{sa}\}, \gamma, R)$. In our problem, $S$ is the set of states (expressed in
world coordinates) comprising all possible helicopter positions, orientations,
velocities and angular velocities; $s_0 \in S$ is the initial state; $A = [-1, 1]^4$ is the
set of all possible control actions; $P_{sa}(\cdot)$ are the state transition probabilities
for taking action $a$ in state $s$; $\gamma \in [0, 1)$ is a discount factor; and $R : S \mapsto \mathbb{R}$ is
a reward function. The dynamics of an MDP proceed as follows: The system
is first initialized in state $s_0$. Based on the initial state, we get to choose
some control action $a_0 \in A$. As a result of our choice, the system transitions
randomly to some new state $s_1$ according to the state transition probabilities
$P_{s_0 a_0}(\cdot)$. We then get to pick a new action $a_1$, as a result of which the system
transitions to $s_2 \sim P_{s_1 a_1}$, and so on.
A function $\pi : S \mapsto A$ is called a policy (or controller). If we take action
$\pi(s)$ whenever we are in state $s$, then we say that we are acting according to
$\pi$. The reward function $R$ indicates how well we are doing at any particular
time, and the goal of the reinforcement learning algorithm is to find a policy
$\pi$ so as to maximize

$$U(\pi) \,\dot{=}\, E_{s_0, s_1, \ldots}\!\left[\,\sum_{t=0}^{\infty} \gamma^t R(s_t) \;\Big|\; \pi\right], \qquad (1)$$

where the expectation is over the random sequence of states visited by acting
according to $\pi$, starting from state $s_0$. Because $\gamma < 1$, rewards in the distant
future are automatically given less weight in the sum above.
For the problem of autonomous hovering, we used a quadratic reward
function

$$R(s^s) = -\big(\alpha_x (x - x^*)^2 + \alpha_y (y - y^*)^2 + \alpha_z (z - z^*)^2
+ \alpha_{\dot{x}} \dot{x}^2 + \alpha_{\dot{y}} \dot{y}^2 + \alpha_{\dot{z}} \dot{z}^2 + \alpha_{\omega} (\omega - \omega^*)^2\big), \qquad (2)$$

where the position $(x^*, y^*, z^*)$ and orientation $\omega^*$ specify where we want
the helicopter to hover. (The term $\omega - \omega^*$, which is a difference between two
angles, is computed with appropriate wrapping around $2\pi$.) The coefficients
$\alpha_i$ were chosen to roughly scale each of the terms in (2) to the same order
of magnitude (a standard heuristic in LQR control [1]). Note that our re-
ward function did not penalize deviations from zero roll and pitch, because
a helicopter hovering stably in place typically has to be tilted slightly.²

² For example, the tail rotor generates a sideways force that would tend to cause
the helicopter to drift sideways if the helicopter were perfectly level. This sideways
force is counteracted by having the helicopter tilted slightly in the opposite direction,
so that the main rotor generates a slight sideways force in an opposite direction to
that generated by the tail rotor, in addition to an upwards force.
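A direct transcription of the reward in Equation (2); the dictionary-based interface and any coefficient values are assumptions of this sketch, and the only subtlety is the wrapping of the heading error.

```python
import numpy as np

def hover_reward(s, target, alpha):
    """Quadratic hover reward of Equation (2).

    s:      dict with keys x, y, z, xd, yd, zd, omega (world coordinates)
    target: dict with keys x, y, z, omega (desired hover position and heading)
    alpha:  dict of per-term coefficients, chosen to balance term magnitudes
    """
    # Wrap the heading error into (-pi, pi] before squaring.
    yaw_err = (s["omega"] - target["omega"] + np.pi) % (2 * np.pi) - np.pi
    return -(alpha["x"] * (s["x"] - target["x"]) ** 2
             + alpha["y"] * (s["y"] - target["y"]) ** 2
             + alpha["z"] * (s["z"] - target["z"]) ** 2
             + alpha["xd"] * s["xd"] ** 2
             + alpha["yd"] * s["yd"] ** 2
             + alpha["zd"] * s["zd"] ** 2
             + alpha["omega"] * yaw_err ** 2)

# Placeholder coefficients; real values would be scaled so every term is comparable.
alpha = {k: 1.0 for k in ("x", "y", "z", "xd", "yd", "zd", "omega")}
```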
For the policy $\pi$, we chose as our representation a simplified version of
the neural network used in [7]. Specifically, the longitudinal cyclic pitch $a[1]$
was commanded as a function of $x^b - x^{b*}$ (the error in position in the $x$ direction,
expressed in body coordinates), $\dot{x}^b$, and pitch $\theta$; the latitudinal cyclic pitch
$a[2]$ was commanded as a function of $y^b - y^{b*}$, $\dot{y}^b$ and roll $\phi$; the main rotor
collective pitch $a[3]$ was commanded as a function of $z^b - z^{b*}$ and $\dot{z}^b$; and
the tail rotor collective pitch $a[4]$ was commanded as a function of $\omega - \omega^*$.³
Thus, the learning problem was to choose the gains for the controller so that
we obtain a policy $\pi$ with large $U(\pi)$.

³ Actually, we found that a refinement of this representation worked slightly better.
Specifically, rather than expressing the position and velocity errors in the body
coordinate frame, we instead expressed them in a coordinate frame whose $x$ and $y$
axes lie in the horizontal plane (parallel to the ground), and whose $x$ axis has the
same yaw angle as the helicopter.
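Read as a set of gains, the controller structure above can be sketched as below. The linear-gain form, the gain and error names, and the clipping of each control to [-1, 1] are assumptions of this illustration; the actual representation was a simplified version of the neural network of [7].

```python
import numpy as np

def hover_policy(err, gains):
    """Map tracking errors to the four controls a[1..4], each clipped to [-1, 1].

    err:   dict with body-coordinate position errors (ex, ey, ez), velocities
           (xd, yd, zd), attitude (roll, pitch) and heading error (eomega)
    gains: dict of per-channel gain tuples; these are the quantities being tuned
    """
    a1 = gains["x"][0] * err["ex"] + gains["x"][1] * err["xd"] + gains["x"][2] * err["pitch"]
    a2 = gains["y"][0] * err["ey"] + gains["y"][1] * err["yd"] + gains["y"][2] * err["roll"]
    a3 = gains["z"][0] * err["ez"] + gains["z"][1] * err["zd"]
    a4 = gains["yaw"][0] * err["eomega"]
    return np.clip([a1, a2, a3, a4], -1.0, 1.0)
```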
Given a particular policy $\pi$, computing $U(\pi)$ exactly would require taking
an expectation over a complex distribution over state sequences (Equation 1).
For nonlinear, stochastic MDPs, it is in general intractable to compute this
expectation exactly. However, given a simulator for the MDP, we can ap-
proximate this expectation via Monte Carlo. Specifically, in our application,
the learned model described in Section 3.1 can be used to sample $s_{t+1} \sim P_{s_t a_t}$
for any state-action pair $(s_t, a_t)$. Thus, by sampling $s_1 \sim P_{s_0 \pi(s_0)}$, $s_2 \sim P_{s_1 \pi(s_1)}$,
$\ldots$, we obtain a random state sequence $s_0, s_1, s_2, \ldots$ drawn from the distri-
bution resulting from flying the helicopter (in simulation) using controller $\pi$.
By summing up $\sum_{t=0}^{\infty} \gamma^t R(s_t)$, we obtain one “sample” with which to esti-
mate $U(\pi)$.⁴ More generally, we can repeat this entire process $m$ times, and
average to obtain an estimate $\hat{U}(\pi)$ of $U(\pi)$.

⁴ In practice, we truncate the state sequence after a large but finite number of
steps. Because of discounting, this introduces at most a small error into the
approximation.
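The Monte Carlo estimate of $U(\pi)$ can be sketched as follows: roll out the learned model under the policy for a long, truncated horizon, accumulate discounted rewards, and average over $m$ rollouts. The horizon length, default discount, and function names are assumptions made for the example.

```python
import numpy as np

def estimate_utility(policy, sample_next_state, reward, s0,
                     gamma=0.99, horizon=2000, m=30, rngs=None):
    """Monte Carlo estimate of U(pi): the average truncated, discounted return
    over m simulated rollouts of the learned helicopter model under policy pi.

    sample_next_state(s, a, rng) draws s_{t+1} ~ P_{s_t a_t}; truncating the sum
    after `horizon` steps introduces only a small error because of discounting.
    """
    if rngs is None:
        # Fresh generators with fixed seeds, re-created on every call.
        rngs = [np.random.default_rng(i) for i in range(m)]
    returns = []
    for rng in rngs:
        s, total, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            total += discount * reward(s)
            s = sample_next_state(s, policy(s), rng)
            discount *= gamma
        returns.append(total)
    return float(np.mean(returns))
```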
One can now try to search for a $\pi$ that optimizes $\hat{U}(\pi)$. Unfortunately, op-
timizing $\hat{U}(\pi)$ represents a difficult stochastic optimization problem. Each
evaluation of $\hat{U}(\pi)$ is defined via a random Monte Carlo procedure, so multi-
ple evaluations of $\hat{U}(\pi)$ for even the same $\pi$ will in general give back slightly
different, noisy, answers. This makes it difficult to find “$\arg\max_\pi \hat{U}(\pi)$” us-
ing standard search algorithms. But using the Pegasus method (Ng and
Jordan, 2000), we can turn this stochastic optimization problem into an or-
dinary deterministic problem, so that any standard search algorithm can now
be applied. Specifically, the computation of $\hat{U}(\pi)$ makes multiple calls to the
helicopter dynamical simulator, which in turn makes multiple calls to a ran-
dom number generator to generate the samples $s_{t+1} \sim P_{s_t a_t}$. If we fix in
advance the sequence of random numbers used by the simulator, then there
is no longer any randomness in the evaluation of $\hat{U}(\pi)$, and in particular
finding $\max_\pi \hat{U}(\pi)$ involves only solving a standard, deterministic, optimiza-
tion problem. (For more details, see [6], which also proves that the “sample
complexity”, i.e., the number of Monte Carlo samples $m$ we need to average
over in order to obtain an accurate approximation, is at most polynomial
in all quantities of interest.) To find a good controller, we therefore applied
a greedy hillclimbing algorithm (coordinate ascent) to search for a policy $\pi$
with large $\hat{U}(\pi)$.
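A sketch of the resulting search, assuming the estimate_utility routine sketched earlier: because that routine re-creates its random number generators from the same fixed seeds on every call, $\hat{U}(\pi)$ becomes a deterministic function of the policy parameters (the Pegasus idea), and plain coordinate ascent can then be applied. The step size, number of sweeps, and halving schedule are arbitrary illustrative choices.

```python
import numpy as np

def coordinate_ascent(u_hat, theta0, step=0.1, sweeps=50):
    """Greedy hillclimbing over policy parameters, one coordinate at a time.

    u_hat(theta) must be deterministic, e.g. a Pegasus-style estimate that
    reuses the same fixed random seeds for every policy it evaluates.
    """
    theta = np.array(theta0, dtype=float)
    best = u_hat(theta)
    for _ in range(sweeps):
        improved = False
        for i in range(len(theta)):
            for delta in (step, -step):
                cand = theta.copy()
                cand[i] += delta
                val = u_hat(cand)
                if val > best:
                    theta, best, improved = cand, val, True
        if not improved:
            step *= 0.5  # refine the search once no single-coordinate move helps
    return theta, best

# u_hat would wrap estimate_utility with a policy parameterized by theta; since the
# generators are rebuilt from fixed seeds, repeated calls with the same theta agree exactly.
```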
We note that in earlier work, Ng et al. (2004) also used a similar approach
to learn to fly expert-league RC helicopter competition maneuvers, including
a nose-in circle (where the helicopter is flown in a circle, but with the nose
of the helicopter continuously pointed at the center of rotation) and other
maneuvers.
4 Experimental Results
Using the reinforcement learning approach described in Section 3, we found
that we were able to design new controllers for the helicopter extremely
quickly. We first completed the inverted flight hardware and collected (human
pilot) flight data on 3rd Dec 2003. Using reinforcement learning, we completed
our controller design by 5th Dec. In our flight experiment on 6th Dec, we suc-
cessfully demonstrated our controller on the hardware platform by having a
human pilot first take off and flip the helicopter upside down, immediately
after which our controller took over and was able to keep the helicopter in
stable, sustained inverted flight. Once the helicopter hardware for inverted
flight was completed, building on our pre-existing software (implemented for
upright flight only), the total time to design and demonstrate a stable
inverted flight controller was less than 72 hours, including the time needed
to write new learning software.

Fig. 3. Helicopter in autonomous sustained inverted hover.

A picture of the helicopter in sustained autonomous hover is shown in
Figure 3. To our knowledge, this is the first helicopter capable of sustained
inverted flight under computer control. A video of the helicopter in inverted
autonomous flight is also at
http://www.cs.stanford.edu/~ang/rl-videos/
Other videos, such as of a learned controller flying the competition maneuvers
mentioned earlier, are also available at the URL above.
5 Conclusions
In this paper, we described a successful application of reinforcement learning
to the problem of designing a controller for autonomous inverted flight on
a helicopter. Although not the focus of this paper, we also note that, using
controllers designed via reinforcement learning and shaping [5], our helicopter
is also capable of normal (upright) flight, including hovering and waypoint
following.
We also found that a side benefit of being able to automatically learn
new controllers quickly and with very little human effort is that it becomes
significantly easier to rapidly reconfigure the helicopter for different flight
applications. For example, we frequently change the helicopter’s configura-
tion (such as replacing the tail rotor assembly with a new, improved one)
or payload (such as mounting or removing sensor payloads, additional com-
puters, etc.). These modifications significantly change the dynamics of the
helicopter, by affecting its mass, center of gravity, and responses to the con-
trols. But by using our existing learning software, it has proved generally
quite easy to quickly design a new controller for the helicopter after each
time it is reconfigured.
Acknowledgments
We give warm thanks to Sebastian Thrun for his assistance and advice on
this project, to Jin Kim for helpful discussions, and to Perry Kavros for his
help constructing the helicopter. This work was supported by DARPA under
contract number N66001-01-C-6018.
References
1. B. D. O. Anderson and J. B. Moore. Optimal Control: Linear Quadratic Meth-
ods. Prentice-Hall, 1989.
2. J. Bagnell and J. Schneider. Autonomous helicopter control using reinforcement
learning policy search methods. In Int’l Conf. Robotics and Automation. IEEE,
2001.
3. J. Leishman. Principles of Helicopter Aerodynamics. Cambridge Univ. Press,
2000.
4. B. Mettler, M. Tischler, and T. Kanade. System identification of small-size
unmanned helicopter dynamics. In American Helicopter Society, 55th Forum,
1999.
5. Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under
reward transformations: Theory and application to reward shaping. In Pro-
ceedings of the Sixteenth International Conference on Machine Learning, pages
278–287, Bled, Slovenia, July 1999. Morgan Kaufmann.
6. Andrew Y. Ng and Michael I. Jordan. Pegasus: A policy search method for
large MDPs and POMDPs. In Uncertainty in Artificial Intelligence, Proceedings
of Sixteenth Conference, pages 406–415, 2000.
7. Andrew Y. Ng, H. Jin Kim, Michael Jordan, and Shankar Sastry. Autonomous
helicopter flight via reinforcement learning. In Neural Information Processing
Systems 16, 2004.
8. Jonathan M. Roberts, Peter I. Corke, and Gregg Buskey. Low-cost flight control
system for a small autonomous helicopter. In IEEE International Conference
on Robotics and Automation, 2003.
9. T. Schouwenaars, B. Mettler, E. Feron, and J. How. Hybrid architecture for
full-envelope autonomous rotorcraft guidance. In American Helicopter Society
59th Annual Forum, 2003.
10. J. Seddon. Basic Helicopter Aerodynamics. AIAA Education Series. American
Institute of Aeronautics and Astronautics, 1990.
11. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Intro-
duction. MIT Press, 1998.