
Autonomous inverted helicopter flight via reinforcement learning

Andrew Y. Ng¹, Adam Coates¹, Mark Diel², Varun Ganapathi¹, Jamie Schulte¹, Ben Tse², Eric Berger¹, and Eric Liang¹

¹ Computer Science Department, Stanford University, Stanford, CA 94305
² Whirled Air Helicopters, Menlo Park, CA 94025

Abstract. Helicopters have highly stochastic, nonlinear dynamics, and autonomous helicopter flight is widely regarded to be a challenging control problem. Because helicopters are highly unstable at low speeds, it is particularly difficult to design controllers for low-speed aerobatic maneuvers. In this paper, we describe a successful application of reinforcement learning to designing a controller for sustained inverted flight on an autonomous helicopter. Using data collected from the helicopter in flight, we began by learning a stochastic, nonlinear model of the helicopter's dynamics. Then, a reinforcement learning algorithm was applied to automatically learn a controller for autonomous inverted hovering. Finally, the resulting controller was successfully tested on our autonomous helicopter platform.

1 Introduction

Autonomous helicopter flight represents a challenging control problem with high-dimensional, asymmetric, noisy, nonlinear, non-minimum phase dynamics, and helicopters are widely regarded to be significantly harder to control than fixed-wing aircraft [3,10]. But helicopters are uniquely suited to many applications requiring either low-speed flight or stable hovering. The control of autonomous helicopters thus provides an important and challenging testbed for learning and control algorithms.

Some recent examples of successful autonomous helicopter flight are given in [7,2,9,8]. Because helicopter flight is usually open-loop stable at high speeds but unstable at low speeds, we believe low-speed helicopter maneuvers are particularly interesting and challenging. In previous work, Ng et al. (2004) considered the problem of learning to fly low-speed maneuvers very accurately. In this paper, we describe a successful application of machine learning to performing a simple low-speed aerobatic maneuver: autonomous sustained inverted hovering.

2 Helicopter platform

To carry out flight experiments, we began by instrumenting a Bergen industrial twin helicopter (length 59", height 22") for autonomous flight. This helicopter is powered by a twin-cylinder 46cc engine, and has an unloaded weight of 18 lbs.

Fig. 1. Helicopter in configuration for upright-only flight (single GPS antenna).

Our initial flight tests indicated that the Bergen industrial twin's original rotor-head was unlikely to be sufficiently strong to withstand the forces encountered in aerobatic maneuvers. We therefore replaced the rotor-head with one from an X-Cell 60 helicopter. We also instrumented the helicopter with a PC104 flight computer, an Inertial Science ISIS-IMU (accelerometers and turning-rate gyroscopes), a Novatel GPS unit, and a MicroStrain 3d magnetic compass. The PC104 was mounted in a plastic enclosure at the nose of the helicopter, and the GPS antenna, IMU, and magnetic compass were mounted on the tail boom. The IMU in particular was mounted fairly close to the fuselage, to minimize measurement noise arising from tail-boom vibrations. The fuel tank, originally mounted at the nose, was also moved to the rear. Figure 1 shows our helicopter in this initial instrumented configuration.

Readings from all the sensors are fed to the onboard PC104 flight computer, which runs a Kalman filter to obtain position and orientation estimates for the helicopter at 100Hz. A custom takeover board also allows the computer either to read the human pilot's commands that are being sent to the helicopter control surfaces, or to send its own commands to the helicopter. The onboard computer also communicates with a ground station via 802.11b wireless.

Most GPS antennas (particularly differential, L1/L2 ones) are directional, and a single antenna pointing upwards relative to the helicopter would be unable to see any satellites if the helicopter is inverted. Thus, a single, upward-pointing antenna cannot be used to localize the helicopter in inverted flight. We therefore added to our system a second antenna facing downwards, and used a computer-controlled relay for switching between them. By examining the Kalman filter output, our onboard computer automatically selects the upward-facing antenna. (See Figure 2a.) We also tried a system in which the two antennas were simultaneously connected to the receiver via a Y-cable (without a relay). In our experiments, this suffered from significant GPS multipath problems and was not usable.

Fig. 2. (a) Dual GPS antenna configuration (one antenna is mounted on the tail-boom facing up; the other is shown facing down in the lower-left corner of the picture). The small box on the left side of the picture (mounted on the left side of the tail-boom) is a computer-controlled relay. (b) Graphical simulator of helicopter, built using the learned helicopter dynamics.

3 Machine learning for controller design

A helicopter such as ours has a high center of gravity when in inverted hover, making inverted flight significantly less stable than upright flight (which is also unstable at low speeds). Indeed, there are far more human RC pilots who can perform high-speed aerobatic maneuvers than can keep a helicopter in sustained inverted hover. Thus, designing a stable controller for sustained inverted flight appears to be a difficult control problem.

Most helicopters are flown using four controls:

• a[1] and a[2]: The longitudinal (front-back) and latitudinal (left-right) cyclic pitch controls cause the helicopter to pitch forward/backwards or sideways, and can thereby also be used to affect acceleration in the longitudinal and latitudinal directions.

• a[3]: The main rotor collective pitch control causes the main rotor blades to rotate along an axis that runs along the length of the rotor blade, and thereby affects the angle at which the main rotor's blades are tilted relative to the plane of rotation. As the main rotor blades sweep through the air, they generate an amount of upward thrust that (generally) increases with this angle. By varying the collective pitch angle, we can affect the main rotor's thrust. For inverted flight, by setting a negative collective pitch angle, we can cause the helicopter to produce negative thrust.

• a[4]: The tail rotor collective pitch control affects tail rotor thrust, and can be used to yaw (turn) the helicopter.

A fifth control, the throttle, is commanded as a pre-set function of the main rotor collective pitch, and can safely be ignored for the rest of this paper.

To design the controller for our helicopter, we began by learning a stochastic, nonlinear model of the helicopter dynamics. Then, a reinforcement learning/policy search algorithm was used to automatically design a controller.

3.1 Model identification

We applied supervised learning to identify a model of the helicopter's dynamics. We began by asking a human pilot to fly the helicopter upside-down, and logged the pilot commands and helicopter state s, comprising its position (x, y, z), orientation (roll φ, pitch θ, yaw ω), velocity (ẋ, ẏ, ż) and angular velocities (φ̇, θ̇, ω̇). A total of 391s of flight data was collected for model identification. Our goal was to learn a model that, given the state s_t and the action a_t commanded by the pilot at time t, would give a good estimate of the probability distribution P_{s_t a_t}(s_{t+1}) of the resulting state of the helicopter s_{t+1} one time step later.

Following standard practice in system identification [4], we converted the original 12-dimensional helicopter state into a reduced 8-dimensional state represented in body coordinates s^b = [φ, θ, ẋ, ẏ, ż, φ̇, θ̇, ω̇]. Where there is risk of confusion, we will use superscripts s and b to distinguish between spatial (world) coordinates and body coordinates. The body coordinate representation specifies the helicopter state using a coordinate frame in which the x, y, and z axes are forwards, sideways, and down relative to the current orientation of the helicopter, instead of north, east and down. Thus, ẋ^b is the forward velocity, whereas ẋ^s is the velocity in the northern direction. (φ and θ are always expressed in world coordinates, because roll and pitch relative to the body coordinate frame are always zero.) By using a body coordinate representation, we encode into our model certain "symmetries" of helicopter flight, such as that the helicopter's dynamics are the same regardless of its absolute position and orientation (assuming the absence of obstacles).¹

Even in the reduced coordinate representation, only a subset of the state variables needs to be modeled explicitly using learning. Specifically, the roll φ, pitch θ (and yaw ω) angles of the helicopter over time can be computed exactly as a function of the roll rate φ̇, pitch rate θ̇ and yaw rate ω̇. Thus, given a model that predicts only the angular velocities, we can numerically integrate the velocities over time to obtain orientations.
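To make the integration step concrete, the following sketch (illustrative code, not the authors' implementation) forward-Euler integrates a sequence of predicted yaw-rate samples into yaw angles at the paper's 10Hz model rate. For simplicity it treats yaw alone; recovering roll and pitch involves the full Euler-angle kinematics, but follows the same pattern.

```python
# Illustrative sketch: recover an orientation angle by numerically
# integrating predicted angular rates at the 10 Hz model rate
# (dt = 0.1 s), as described in the text. Yaw only, for simplicity.
DT = 0.1  # model timestep (10 Hz)

def integrate_yaw(yaw0, yaw_rates):
    """Forward-Euler integration of yaw-rate samples into yaw angles."""
    yaws = [yaw0]
    for rate in yaw_rates:
        yaws.append(yaws[-1] + rate * DT)
    return yaws

# One second of a constant 0.5 rad/s yaw rate advances yaw by 0.5 rad.
angles = integrate_yaw(0.0, [0.5] * 10)
```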

We identified our model at 10Hz, so that the difference in time between s_t and s_{t+1} was 0.1 seconds. We used linear regression to learn to predict, given s^b_t ∈ ℝ⁸ and a_t ∈ ℝ⁴, a sub-vector of the state variables at the next timestep [ẋ^b_{t+1}, ẏ^b_{t+1}, ż^b_{t+1}, φ̇^b_{t+1}, θ̇^b_{t+1}, ω̇^b_{t+1}]. This body coordinate model is then converted back into a world coordinates model, for example by integrating angular velocities to obtain world coordinate angles. Note that because the process of integrating angular velocities expressed in body coordinates to obtain angles expressed in world coordinates is nonlinear, the final model resulting from this process is also necessarily nonlinear. After recovering the world coordinate orientations via integration, it is also straightforward to obtain the rest of the world coordinate state. (For example, the mapping from body coordinate velocity to world coordinate velocity is simply a rotation.)

¹ Actually, by handling the effects of gravity explicitly, it is possible to obtain an even better model that uses a further reduced, 6-dimensional state, by eliminating the state variables φ and θ. We found this additional reduction useful and included it in the final version of our model; however, a full discussion is beyond the scope of this paper.

Lastly, because helicopter dynamics are inherently stochastic, a deterministic model would be unlikely to fully capture a helicopter's range of possible behaviors. We modeled the errors in the one-step predictions of our model as Gaussian, and estimated the magnitude of the noise variance via maximum likelihood.
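The identification step can be sketched as follows: linear regression from the stacked 8-D body state and 4-D action to the six next-step velocity components, with the Gaussian noise variance estimated from the residuals. This is a minimal illustration on synthetic data, not the authors' code (they fit 391 s of logged pilot flight data).

```python
import numpy as np

# Illustrative sketch of the model identification step: least-squares
# regression from [s_t^b, a_t] (8 + 4 inputs) to the 6 velocity/rate
# components at t+1, plus a maximum-likelihood Gaussian noise variance
# taken from the residuals. The data here is synthetic.
rng = np.random.default_rng(0)
n = 500
X = np.hstack([rng.normal(size=(n, 8)),           # s_t in body coords
               rng.uniform(-1, 1, size=(n, 4))])  # a_t in [-1, 1]^4
true_W = rng.normal(size=(12, 6))                 # "true" dynamics, for the demo
Y = X @ true_W + 0.05 * rng.normal(size=(n, 6))   # noisy next-step velocities

W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)    # linear regression fit
resid = Y - X @ W
noise_var = resid.var(axis=0)                     # ML estimate of Gaussian variance
```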

The result of this procedure is a stochastic, nonlinear model of our helicopter's dynamics. To verify the learned model, we also implemented a graphical simulator (see Figure 2b) with a joystick control interface similar to that on the real helicopter. This allows the pilot to fly the helicopter in simulation and verify the simulator's modeled dynamics. The same graphical simulator was subsequently also used for controller visualization and testing.

3.2 Controller design via reinforcement learning

Having built a model/simulator of the helicopter, we then applied reinforcement learning to learn a good controller.

Reinforcement learning [11] gives a set of tools for solving control problems posed in the Markov decision process (MDP) formalism. An MDP is a tuple (S, s₀, A, {P_sa}, γ, R). In our problem, S is the set of states (expressed in world coordinates) comprising all possible helicopter positions, orientations, velocities and angular velocities; s₀ ∈ S is the initial state; A = [−1, 1]⁴ is the set of all possible control actions; P_sa(·) are the state transition probabilities for taking action a in state s; γ ∈ [0, 1) is a discount factor; and R : S → ℝ is a reward function. The dynamics of an MDP proceed as follows: The system is first initialized in state s₀. Based on the initial state, we get to choose some control action a₀ ∈ A. As a result of our choice, the system transitions randomly to some new state s₁ according to the state transition probabilities P_{s₀a₀}(·). We then get to pick a new action a₁, as a result of which the system transitions to s₂ ∼ P_{s₁a₁}, and so on.
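The process above can be written as a generic rollout loop. In the sketch below, the one-dimensional dynamics and the controller are toy placeholders standing in for the learned helicopter model and policy:

```python
import random

# Illustrative sketch of MDP dynamics: starting from s0, repeatedly
# choose a_t = pi(s_t) and sample s_{t+1} ~ P_{s_t a_t}. The toy 1-D
# "dynamics" below stand in for the learned helicopter simulator.
def rollout(s0, policy, sample_next, steps):
    states, s = [s0], s0
    for _ in range(steps):
        a = policy(s)          # choose action a_t = pi(s_t)
        s = sample_next(s, a)  # sample s_{t+1} ~ P_{s_t a_t}
        states.append(s)
    return states

random.seed(0)
traj = rollout(0.0,
               policy=lambda s: -0.5 * s,  # toy stabilizing controller
               sample_next=lambda s, a: s + a + random.gauss(0, 0.01),
               steps=20)
```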

A function π : S → A is called a policy (or controller). If we take action π(s) whenever we are in state s, then we say that we are acting according to π. The reward function R indicates how well we are doing at any particular time, and the goal of the reinforcement learning algorithm is to find a policy π so as to maximize

    U(π) = E_{s₀,s₁,...}[ Σ_{t=0}^{∞} γ^t R(s_t) | π ],    (1)

where the expectation is over the random sequence of states visited by acting according to π, starting from state s₀. Because γ < 1, rewards in the distant future are automatically given less weight in the sum above.

For the problem of autonomous hovering, we used a quadratic reward function

    R(s^s) = −(α_x(x − x*)² + α_y(y − y*)² + α_z(z − z*)²
              + α_ẋ ẋ² + α_ẏ ẏ² + α_ż ż² + α_ω(ω − ω*)²),    (2)

where the position (x*, y*, z*) and orientation ω* specify where we want the helicopter to hover. (The term ω − ω*, which is a difference between two angles, is computed with appropriate wrapping around 2π.) The coefficients α_i were chosen to roughly scale each of the terms in (2) to the same order of magnitude (a standard heuristic in LQR control [1]). Note that our reward function did not penalize deviations from zero roll and pitch, because a helicopter hovering stably in place typically has to be tilted slightly.²
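A direct transcription of the reward in Equation (2) might look like the following; the weights α here are illustrative placeholders, not the values used on the helicopter, and the yaw error is wrapped into (−π, π] as the text describes:

```python
import math

# Illustrative transcription of the quadratic hover reward (Equation 2).
# alpha weights are placeholders; yaw error is wrapped around 2*pi.
def hover_reward(s, target, alpha):
    """s, target: dicts with x, y, z, xd, yd, zd, yaw. alpha: weights."""
    yaw_err = (s["yaw"] - target["yaw"] + math.pi) % (2 * math.pi) - math.pi
    return -(alpha["x"] * (s["x"] - target["x"]) ** 2
             + alpha["y"] * (s["y"] - target["y"]) ** 2
             + alpha["z"] * (s["z"] - target["z"]) ** 2
             + alpha["xd"] * s["xd"] ** 2
             + alpha["yd"] * s["yd"] ** 2
             + alpha["zd"] * s["zd"] ** 2
             + alpha["yaw"] * yaw_err ** 2)

alpha = {k: 1.0 for k in ("x", "y", "z", "xd", "yd", "zd", "yaw")}
target = {"x": 0, "y": 0, "z": 0, "yaw": 0}
at_target = {"x": 0, "y": 0, "z": 0, "xd": 0, "yd": 0, "zd": 0, "yaw": 0}
```

Note that the reward is maximal (zero) exactly at the hover target, and every deviation contributes a negative quadratic penalty.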

For the policy π, we chose as our representation a simplified version of the neural network used in [7]. Specifically, the longitudinal cyclic pitch a[1] was commanded as a function of x^b − x*^b (error in position in the x direction, expressed in body coordinates), ẋ^b, and pitch θ; the latitudinal cyclic pitch a[2] was commanded as a function of y^b − y*^b, ẏ^b and roll φ; the main rotor collective pitch a[3] was commanded as a function of z^b − z*^b and ż^b; and the tail rotor collective pitch a[4] was commanded as a function of ω − ω*.³ Thus, the learning problem was to choose the gains for the controller so that we obtain a policy π with large U(π).
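The controller structure described above amounts to four clipped linear maps, each reading only its own small set of error terms. The sketch below uses hypothetical gain values and variable names; the real gains are exactly what the learning procedure must choose:

```python
# Illustrative sketch of the controller structure: each of the four
# controls is a clipped linear function of its own error terms. Gains
# here are placeholders, not learned values.
def clip(u):
    return max(-1.0, min(1.0, u))  # actions live in A = [-1, 1]^4

def policy(err, gains):
    """err: dict of body-frame errors; gains: dict of gain tuples."""
    a1 = clip(gains["a1"][0] * err["x"] + gains["a1"][1] * err["xd"]
              + gains["a1"][2] * err["pitch"])    # longitudinal cyclic
    a2 = clip(gains["a2"][0] * err["y"] + gains["a2"][1] * err["yd"]
              + gains["a2"][2] * err["roll"])     # latitudinal cyclic
    a3 = clip(gains["a3"][0] * err["z"] + gains["a3"][1] * err["zd"])  # main collective
    a4 = clip(gains["a4"][0] * err["yaw"])        # tail collective
    return (a1, a2, a3, a4)

gains = {"a1": (1.0, 1.0, 1.0), "a2": (1.0, 1.0, 1.0),
         "a3": (1.0, 1.0), "a4": (1.0,)}
err = {"x": 0.1, "xd": 0.0, "pitch": 0.0, "y": 0.0, "yd": 0.0,
       "roll": 0.0, "z": 0.0, "zd": 0.0, "yaw": 0.0}
a = policy(err, gains)  # small x error yields a small forward cyclic command
```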

Given a particular policy π, computing U(π) exactly would require taking an expectation over a complex distribution over state sequences (Equation 1). For nonlinear, stochastic MDPs, it is in general intractable to exactly compute this expectation. However, given a simulator for the MDP, we can approximate this expectation via Monte Carlo. Specifically, in our application, the learned model described in Section 3.1 can be used to sample s_{t+1} ∼ P_{s_t a_t} for any state-action pair s_t, a_t. Thus, by sampling s₁ ∼ P_{s₀π(s₀)}, s₂ ∼ P_{s₁π(s₁)}, ..., we obtain a random state sequence s₀, s₁, s₂, ... drawn from the distribution resulting from flying the helicopter (in simulation) using controller π. By summing up Σ_{t=0}^{∞} γ^t R(s_t), we obtain one "sample" with which to estimate U(π).⁴ More generally, we can repeat this entire process m times, and average to obtain an estimate Û(π) of U(π).

² For example, the tail rotor generates a sideways force that would tend to cause the helicopter to drift sideways if the helicopter were perfectly level. This sideways force is counteracted by having the helicopter tilted slightly in the opposite direction, so that the main rotor generates a slight sideways force in an opposite direction to that generated by the tail rotor, in addition to an upwards force.

³ Actually, we found that a refinement of this representation worked slightly better. Specifically, rather than expressing the position and velocity errors in the body coordinate frame, we instead expressed them in a coordinate frame whose x and y axes lie in the horizontal plane, parallel to the ground, and whose x axis has the same yaw angle as the helicopter.
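This Monte Carlo estimate, with a finite-horizon truncation of each rollout, can be sketched as follows (the toy dynamics and reward passed in stand in for the learned simulator and Equation 2):

```python
# Illustrative sketch of the Monte Carlo estimate of U(pi): average the
# discounted return of m simulated rollouts, each truncated at a finite
# horizon. policy/sample_next/reward are placeholders for the learned
# simulator and the quadratic hover reward.
def estimate_utility(policy, sample_next, reward, s0, gamma, horizon, m):
    total = 0.0
    for _ in range(m):
        s, ret, g = s0, 0.0, 1.0
        for _ in range(horizon):
            ret += g * reward(s)          # accumulate gamma^t * R(s_t)
            s = sample_next(s, policy(s)) # s_{t+1} ~ P_{s_t a_t}
            g *= gamma
        total += ret
    return total / m  # U_hat(pi)
```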

One can now try to search for π that optimizes Û(π). Unfortunately, optimizing Û(π) represents a difficult stochastic optimization problem. Each evaluation of Û(π) is defined via a random Monte Carlo procedure, so multiple evaluations of Û(π) for even the same π will in general give back slightly different, noisy, answers. This makes it difficult to find "arg max_π Û(π)" using standard search algorithms. But using the Pegasus method (Ng and Jordan, 2000), we can turn this stochastic optimization problem into an ordinary deterministic problem, so that any standard search algorithm can now be applied. Specifically, the computation of Û(π) makes multiple calls to the helicopter dynamical simulator, which in turn makes multiple calls to a random number generator to generate the samples s_{t+1} ∼ P_{s_t a_t}. If we fix in advance the sequence of random numbers used by the simulator, then there is no longer any randomness in the evaluation of Û(π), and in particular finding max_π Û(π) involves only solving a standard, deterministic, optimization problem. (For more details, see [6], which also proves that the "sample complexity", i.e., the number of Monte Carlo samples m we need to average over in order to obtain an accurate approximation, is at most polynomial in all quantities of interest.) To find a good controller, we therefore applied a greedy hillclimbing algorithm (coordinate ascent) to search for a policy π with large Û(π).
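The Pegasus idea plus hillclimbing can be illustrated on a toy one-dimensional problem: freezing the simulator's noise sequences in advance makes Û a deterministic function of the policy parameters, which an ordinary coordinate search can then optimize. Everything below is an illustrative stand-in, not the authors' system:

```python
import random

# Illustrative sketch of the Pegasus trick: pre-draw all simulator
# noise, so that U_hat(gain) is a deterministic function of the policy
# parameter, then hillclimb on one coordinate. A 1-D "helicopter" with
# a quadratic cost stands in for the learned model and Equation (2).
def make_fixed_eval(horizon, m, seed=0):
    rng = random.Random(seed)
    noise = [[rng.gauss(0, 0.05) for _ in range(horizon)] for _ in range(m)]

    def u_hat(gain):  # deterministic given the frozen noise sequences
        total = 0.0
        for eps in noise:
            s, g, ret = 1.0, 1.0, 0.0
            for e in eps:
                ret += g * -(s ** 2)                    # quadratic cost
                a = max(-1.0, min(1.0, -gain * s))      # clipped action
                s = s + a + e
                g *= 0.95
            total += ret
        return total / m

    return u_hat

def coordinate_ascent_1d(u_hat, gain=0.0, step=0.1, iters=50):
    """Greedy hillclimbing on a single policy parameter."""
    for _ in range(iters):
        best = max((gain - step, gain, gain + step), key=u_hat)
        if best == gain:
            step /= 2  # no improvement: refine the step size
        gain = best
    return gain

u_hat = make_fixed_eval(horizon=30, m=5)
g = coordinate_ascent_1d(u_hat)  # climbs toward a stabilizing gain
```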

We note that in earlier work, Ng et al. (2004) also used a similar approach to learn to fly expert-league RC helicopter competition maneuvers, including a nose-in circle (where the helicopter is flown in a circle, but with the nose of the helicopter continuously pointed at the center of rotation) and other maneuvers.

4 Experimental Results

Using the reinforcement learning approach described in Section 3, we found that we were able to design new controllers for the helicopter extremely quickly. We first completed the inverted flight hardware and collected (human pilot) flight data on 3rd Dec 2003. Using reinforcement learning, we completed our controller design by 5th Dec. In our flight experiment on 6th Dec, we successfully demonstrated our controller on the hardware platform by having a human pilot first take off and flip the helicopter upside down, immediately after which our controller took over and was able to keep the helicopter in stable, sustained inverted flight. Once the helicopter hardware for inverted flight was completed, building on our pre-existing software (implemented for upright flight only), the total time to design and demonstrate a stable inverted flight controller was less than 72 hours, including the time needed to write new learning software.

⁴ In practice, we truncate the state sequence after a large but finite number of steps. Because of discounting, this introduces at most a small error into the approximation.

Fig. 3. Helicopter in autonomous sustained inverted hover.

A picture of the helicopter in sustained autonomous hover is shown in Figure 3. To our knowledge, this is the first helicopter capable of sustained inverted flight under computer control. A video of the helicopter in inverted autonomous flight is also at

http://www.cs.stanford.edu/~ang/rl-videos/

Other videos, such as of a learned controller flying the competition maneuvers mentioned earlier, are also available at the URL above.

5 Conclusions

In this paper, we described a successful application of reinforcement learning to the problem of designing a controller for autonomous inverted flight on a helicopter. Although not the focus of this paper, we also note that, using controllers designed via reinforcement learning and shaping [5], our helicopter is also capable of normal (upright) flight, including hovering and waypoint following.

We also found that a side benefit of being able to automatically learn new controllers quickly and with very little human effort is that it becomes significantly easier to rapidly reconfigure the helicopter for different flight applications. For example, we frequently change the helicopter's configuration (such as replacing the tail rotor assembly with a new, improved one) or payload (such as mounting or removing sensor payloads, additional computers, etc.). These modifications significantly change the dynamics of the helicopter, by affecting its mass, center of gravity, and responses to the controls. But by using our existing learning software, it has proved generally quite easy to quickly design a new controller for the helicopter after each time it is reconfigured.

Acknowledgments

We give warm thanks to Sebastian Thrun for his assistance and advice on

this project, to Jin Kim for helpful discussions, and to Perry Kavros for his

help constructing the helicopter. This work was supported by DARPA under

contract number N66001-01-C-6018.

References

1. B. D. O. Anderson and J. B. Moore. Optimal Control: Linear Quadratic Methods. Prentice-Hall, 1989.
2. J. Bagnell and J. Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In Int'l Conf. Robotics and Automation. IEEE, 2001.
3. J. Leishman. Principles of Helicopter Aerodynamics. Cambridge Univ. Press, 2000.
4. B. Mettler, M. Tischler, and T. Kanade. System identification of small-size unmanned helicopter dynamics. In American Helicopter Society, 55th Forum, 1999.
5. Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 278-287, Bled, Slovenia, July 1999. Morgan Kaufmann.
6. Andrew Y. Ng and Michael I. Jordan. Pegasus: A policy search method for large MDPs and POMDPs. In Uncertainty in Artificial Intelligence, Proceedings of the Sixteenth Conference, pages 406-415, 2000.
7. Andrew Y. Ng, H. Jin Kim, Michael Jordan, and Shankar Sastry. Autonomous helicopter flight via reinforcement learning. In Neural Information Processing Systems 16, 2004.
8. Jonathan M. Roberts, Peter I. Corke, and Gregg Buskey. Low-cost flight control system for a small autonomous helicopter. In IEEE International Conference on Robotics and Automation, 2003.
9. T. Schouwenaars, B. Mettler, E. Feron, and J. How. Hybrid architecture for full-envelope autonomous rotorcraft guidance. In American Helicopter Society 59th Annual Forum, 2003.
10. J. Seddon. Basic Helicopter Aerodynamics. AIAA Education Series. American Institute of Aeronautics and Astronautics, 1990.
11. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.