Content uploaded by Sepp Hochreiter

Author content

All content in this area was uploaded by Sepp Hochreiter on Apr 03, 2016

Content may be subject to copyright.

LONG SHORT-TERM MEMORY

Neural Computation 9(8):1735-1780, 1997

Sepp Hochreiter

Fakultät für Informatik

Technische Universität München

80290 München, Germany

hochreit@informatik.tu-muenchen.de

http://www7.informatik.tu-muenchen.de/~hochreit

Jürgen Schmidhuber

IDSIA

Corso Elvezia 36

6900 Lugano, Switzerland

juergen@idsia.ch

http://www.idsia.ch/~juergen

Abstract

Learning to store information over extended time intervals via recurrent backpropagation takes a very long time, mostly due to insufficient, decaying error back flow. We briefly review Hochreiter's 1991 analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called "Long Short-Term Memory" (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through "constant error carrousels" within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is $O(1)$. Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with RTRL, BPTT, Recurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long time lag tasks that have never been solved by previous recurrent network algorithms.

1 INTRODUCTION

Recurrent networks can in principle use their feedback connections to store representations of recent input events in the form of activations ("short-term memory", as opposed to "long-term memory" embodied by slowly changing weights). This is potentially significant for many applications, including speech processing, non-Markovian control, and music composition (e.g., Mozer 1992). The most widely used algorithms for learning what to put in short-term memory, however, take too much time or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, existing methods do not provide clear practical advantages over, say, backprop in feedforward nets with limited time windows. This paper will review an analysis of the problem and suggest a remedy.

The problem. With conventional "Back-Propagation Through Time" (BPTT, e.g., Williams and Zipser 1992, Werbos 1988) or "Real-Time Recurrent Learning" (RTRL, e.g., Robinson and Fallside 1987), error signals "flowing backwards in time" tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights (Hochreiter 1991). Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all (see Section 3).

The remedy. This paper presents "Long Short-Term Memory" (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. LSTM is designed to overcome these error back-flow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short time lag capabilities. This is achieved by an efficient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error flow through internal states of special units (provided the gradient computation is truncated at certain architecture-specific points; this does not affect long-term error flow, though).

Outline of paper. Section 2 will briefly review previous work. Section 3 begins with an outline of the detailed analysis of vanishing errors due to Hochreiter (1991). It will then introduce a naive approach to constant error backprop for didactic purposes, and highlight its problems concerning information storage and retrieval. These problems will lead to the LSTM architecture as described in Section 4. Section 5 will present numerous experiments and comparisons with competing methods. LSTM outperforms them, and also learns to solve complex, artificial tasks no other recurrent net algorithm has solved. Section 6 will discuss LSTM's limitations and advantages. The appendix contains a detailed description of the algorithm (A.1), and explicit error flow formulae (A.2).

2 PREVIOUS WORK

This section will focus on recurrent nets with time-varying inputs (as opposed to nets with stationary inputs and fixpoint-based gradient calculations, e.g., Almeida 1987, Pineda 1987).

Gradient-descent variants. The approaches of Elman (1988), Fahlman (1991), Williams (1989), Schmidhuber (1992a), Pearlmutter (1989), and many of the related algorithms in Pearlmutter's comprehensive overview (1995) suffer from the same problems as BPTT and RTRL (see Sections 1 and 3).

Time-delays. Other methods that seem practical for short time lags only are Time-Delay Neural Networks (Lang et al. 1990) and Plate's method (Plate 1993), which updates unit activations based on a weighted sum of old activations (see also de Vries and Principe 1991). Lin et al. (1995) propose variants of time-delay networks called NARX networks.

Time constants. To deal with long time lags, Mozer (1992) uses time constants influencing changes of unit activations (de Vries and Principe's above-mentioned approach (1991) may in fact be viewed as a mixture of TDNN and time constants). For long time lags, however, the time constants need external fine tuning (Mozer 1992). Sun et al.'s alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which makes long-term storage impractical.

Ring's approach. Ring (1993) also proposed a method for bridging long time lags. Whenever a unit in his network receives conflicting error signals, he adds a higher order unit influencing appropriate connections. Although his approach can sometimes be extremely fast, to bridge a time lag involving 100 steps may require the addition of 100 units. Also, Ring's net does not generalize to unseen lag durations.

Bengio et al.'s approaches. Bengio et al. (1994) investigate methods such as simulated annealing, multi-grid random search, time-weighted pseudo-Newton optimization, and discrete error propagation. Their "latch" and "2-sequence" problems are very similar to problem 3a with minimal time lag 100 (see Experiment 3). Bengio and Frasconi (1994) also propose an EM approach for propagating targets. With $n$ so-called "state networks", at a given time, their system can be in one of only $n$ different states. See also the beginning of Section 5. But to solve continuous problems such as the "adding problem" (Section 5.4), their system would require an unacceptable number of states (i.e., state networks).

Kalman filters. Puskorius and Feldkamp (1994) use Kalman filter techniques to improve recurrent net performance. Since they use "a derivative discount factor imposed to decay exponentially the effects of past dynamic derivatives," there is no reason to believe that their Kalman Filter Trained Recurrent Networks will be useful for very long minimal time lags.

Second order nets. We will see that LSTM uses multiplicative units (MUs) to protect error flow from unwanted perturbations. It is not the first recurrent net method using MUs though. For instance, Watrous and Kuhn (1992) use MUs in second order nets. Some differences to LSTM are: (1) Watrous and Kuhn's architecture does not enforce constant error flow and is not designed to solve long time lag problems. (2) It has fully connected second-order sigma-pi units, while the LSTM architecture's MUs are used only to gate access to constant error flow. (3) Watrous and Kuhn's algorithm costs $O(W^2)$ operations per time step, ours only $O(W)$, where $W$ is the number of weights. See also Miller and Giles (1993) for additional work on MUs.

Simple weight guessing. To avoid long time lag problems of gradient-based approaches we may simply randomly initialize all network weights until the resulting net happens to classify all training sequences correctly. In fact, recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that simple weight guessing solves many of the problems in (Bengio 1994, Bengio and Frasconi 1994, Miller and Giles 1993, Lin et al. 1995) faster than the algorithms proposed therein. This does not mean that weight guessing is a good algorithm. It just means that the problems are very simple. More realistic tasks require either many free parameters (e.g., input weights) or high weight precision (e.g., for continuous-valued parameters), such that guessing becomes completely infeasible.
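As a rough illustration of the guessing procedure described above, the following Python sketch repeatedly draws random weights for a single recurrent tanh unit until it classifies two toy "latch" sequences correctly. The toy task, the weight interval [-10, 10], the trial limit, and all function names are assumptions made for this example; they are not taken from the experiments cited above.

```python
import math
import random

def run_unit(weights, sequence):
    """Run one recurrent tanh unit over a sequence and return its final activation."""
    w_in, w_rec, bias = weights
    y = 0.0
    for x in sequence:
        y = math.tanh(w_in * x + w_rec * y + bias)
    return y

def classifies_all(weights, training_set):
    """True if the sign of the final activation matches the target for every sequence."""
    return all(run_unit(weights, seq) * target > 0.0 for seq, target in training_set)

def guess_weights(training_set, rng, max_trials=10000, scale=10.0):
    """Draw random weights until all training sequences are classified correctly."""
    for trial in range(1, max_trials + 1):
        weights = tuple(rng.uniform(-scale, scale) for _ in range(3))
        if classifies_all(weights, training_set):
            return weights, trial
    return None, max_trials

# Toy latch task: the class is given by the first input; 50 uninformative zeros follow.
training_set = [([1.0] + [0.0] * 50, 1.0), ([-1.0] + [0.0] * 50, -1.0)]
weights, trials = guess_weights(training_set, random.Random(0))
```

With only three free parameters a solution is found quickly, which matches the point made above: success of guessing says more about the simplicity of a task than about the quality of the algorithm.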

Adaptive sequence chunkers. Schmidhuber's hierarchical chunker systems (1992b, 1993) do have a capability to bridge arbitrary time lags, but only if there is local predictability across the subsequences causing the time lags (see also Mozer 1992). For instance, in his postdoctoral thesis (1993), Schmidhuber uses hierarchical recurrent nets to rapidly solve certain grammar learning tasks involving minimal time lags in excess of 1000 steps. The performance of chunker systems, however, deteriorates as the noise level increases and the input sequences become less compressible. LSTM does not suffer from this problem.

3 CONSTANT ERROR BACKPROP

3.1 EXPONENTIALLY DECAYING ERROR

Conventional BPTT (e.g., Williams and Zipser 1992). Output unit $k$'s target at time $t$ is denoted by $d_k(t)$. Using mean squared error, $k$'s error signal is

$$\vartheta_k(t) = f'_k(\mathit{net}_k(t))\,(d_k(t) - y^k(t)),$$

where

$$y^i(t) = f_i(\mathit{net}_i(t))$$

is the activation of a non-input unit $i$ with differentiable activation function $f_i$,

$$\mathit{net}_i(t) = \sum_j w_{ij}\, y^j(t-1)$$

is unit $i$'s current net input, and $w_{ij}$ is the weight on the connection from unit $j$ to $i$. Some non-output unit $j$'s backpropagated error signal is

$$\vartheta_j(t) = f'_j(\mathit{net}_j(t)) \sum_i w_{ij}\, \vartheta_i(t+1).$$

The corresponding contribution to $w_{jl}$'s total weight update is $\alpha\, \vartheta_j(t)\, y^l(t-1)$, where $\alpha$ is the learning rate, and $l$ stands for an arbitrary unit connected to unit $j$.

Outline of Hochreiter's analysis (1991, pages 19-21). Suppose we have a fully connected net whose non-input unit indices range from 1 to $n$. Let us focus on local error flow from unit $u$ to unit $v$ (later we will see that the analysis immediately extends to global error flow). The error occurring at an arbitrary unit $u$ at time step $t$ is propagated "back into time" for $q$ time steps, to an arbitrary unit $v$. This will scale the error by the following factor:

$$\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} =
\begin{cases}
f'_v(\mathit{net}_v(t-1))\, w_{uv} & q = 1 \\
f'_v(\mathit{net}_v(t-q)) \sum_{l=1}^{n} \frac{\partial \vartheta_l(t-q+1)}{\partial \vartheta_u(t)}\, w_{lv} & q > 1
\end{cases}
\qquad (1)$$

With $l_q = v$ and $l_0 = u$, we obtain:

$$\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} = \sum_{l_1=1}^{n} \cdots \sum_{l_{q-1}=1}^{n} \prod_{m=1}^{q} f'_{l_m}(\mathit{net}_{l_m}(t-m))\, w_{l_m l_{m-1}} \qquad (2)$$

(proof by induction). The sum of the $n^{q-1}$ terms $\prod_{m=1}^{q} f'_{l_m}(\mathit{net}_{l_m}(t-m))\, w_{l_m l_{m-1}}$ determines the total error back flow (note that since the summation terms may have different signs, increasing the number of units $n$ does not necessarily increase error flow).

Intuitive explanation of equation (2). If

$$|f'_{l_m}(\mathit{net}_{l_m}(t-m))\, w_{l_m l_{m-1}}| > 1.0$$

for all $m$ (as can happen, e.g., with linear $f_{l_m}$) then the largest product increases exponentially with $q$. That is, the error blows up, and conflicting error signals arriving at unit $v$ can lead to oscillating weights and unstable learning (for error blow-ups or bifurcations see also Pineda 1988, Baldi and Pineda 1991, Doya 1992). On the other hand, if

$$|f'_{l_m}(\mathit{net}_{l_m}(t-m))\, w_{l_m l_{m-1}}| < 1.0$$

for all $m$, then the largest product decreases exponentially with $q$. That is, the error vanishes, and nothing can be learned in acceptable time.

If $f_{l_m}$ is the logistic sigmoid function, then the maximal value of $f'_{l_m}$ is 0.25. If $y^{l_{m-1}}$ is constant and not equal to zero, then $|f'_{l_m}(\mathit{net}_{l_m})\, w_{l_m l_{m-1}}|$ takes on maximal values where

$$w_{l_m l_{m-1}} = \frac{1}{y^{l_{m-1}}} \coth\left(\frac{1}{2}\, \mathit{net}_{l_m}\right),$$

goes to zero for $|w_{l_m l_{m-1}}| \to \infty$, and is less than 1.0 for $|w_{l_m l_{m-1}}| < 4.0$ (e.g., if the absolute maximal weight value $w_{max}$ is smaller than 4.0). Hence with conventional logistic sigmoid activation functions, the error flow tends to vanish as long as the weights have absolute values below 4.0, especially in the beginning of the training phase. In general the use of larger initial weights will not help though: as seen above, for $|w_{l_m l_{m-1}}| \to \infty$ the relevant derivative goes to zero "faster" than the absolute weight can grow (also, some weights will have to change their signs by crossing zero). Likewise, increasing the learning rate does not help either: it will not change the ratio of long-range error flow and short-range error flow. BPTT is too sensitive to recent distractions. (A very similar, more recent analysis was presented by Bengio et al. 1994.)
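To make the exponential dependence on $q$ concrete, the sketch below evaluates a single path product from equation (2) under the simplifying assumption that every factor along the path is identical. The function names and the example weights (3.9 and 1.1) are illustrative choices, not values from the paper.

```python
import math

def logistic_prime(net):
    """Derivative of the logistic sigmoid; its maximum is 0.25 at net = 0."""
    s = 1.0 / (1.0 + math.exp(-net))
    return s * (1.0 - s)

def path_product(f_prime, weight, net, q):
    """One summand of equation (2) when all q factors f'(net) * w are equal."""
    return (f_prime(net) * weight) ** q

lags = (1, 10, 100)

# Logistic units with |w| < 4: each factor is at most 0.25 * 3.9 = 0.975 < 1,
# so the product decays as the lag q grows.
vanishing = [path_product(logistic_prime, 3.9, 0.0, q) for q in lags]

# Linear units (f' = 1) with |w| > 1: each factor exceeds 1, so the product grows.
exploding = [path_product(lambda net: 1.0, 1.1, 0.0, q) for q in lags]
```

Even with the derivative at its maximum and a weight just below 4.0, the product shrinks steadily with the lag; away from `net = 0` the decay is faster still.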

Global error flow. The local error flow analysis above immediately shows that global error flow vanishes, too. To see this, compute

$$\sum_{u:\ u\ \text{output unit}} \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)}.$$

Weak upper bound for scaling factor. The following, slightly extended vanishing error analysis also takes $n$, the number of units, into account. For $q > 1$, formula (2) can be rewritten as

$$(W_{u^T})^T\, F'(t-1) \prod_{m=2}^{q-1} \left(W F'(t-m)\right) W_v\, f'_v(\mathit{net}_v(t-q)),$$

where the weight matrix $W$ is defined by $[W]_{ij} := w_{ij}$, $v$'s outgoing weight vector $W_v$ is defined by $[W_v]_i := [W]_{iv} = w_{iv}$, $u$'s incoming weight vector $W_{u^T}$ is defined by $[W_{u^T}]_i := [W]_{ui} = w_{ui}$, and for $m = 1, \ldots, q$, $F'(t-m)$ is the diagonal matrix of first order derivatives defined as: $[F'(t-m)]_{ij} := 0$ if $i \neq j$, and $[F'(t-m)]_{ij} := f'_i(\mathit{net}_i(t-m))$ otherwise. Here $T$ is the transposition operator, $[A]_{ij}$ is the element in the $i$-th column and $j$-th row of matrix $A$, and $[x]_i$ is the $i$-th component of vector $x$.

Using a matrix norm $\|\cdot\|_A$ compatible with vector norm $\|\cdot\|_x$, we define

$$f'_{max} := \max_{m=1,\ldots,q} \{ \|F'(t-m)\|_A \}.$$

For $\max_{i=1,\ldots,n} \{|x_i|\} \le \|x\|_x$ we get $|x^T y| \le n\, \|x\|_x\, \|y\|_x$. Since

$$|f'_v(\mathit{net}_v(t-q))| \le \|F'(t-q)\|_A \le f'_{max},$$

we obtain the following inequality:

$$\left| \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} \right| \le n\, (f'_{max})^q\, \|W_v\|_x\, \|W_{u^T}\|_x\, \|W\|_A^{q-2} \le n\, \left(f'_{max}\, \|W\|_A\right)^q.$$

This inequality results from

$$\|W_v\|_x = \|W e_v\|_x \le \|W\|_A\, \|e_v\|_x \le \|W\|_A$$

and

$$\|W_{u^T}\|_x = \|e_u W\|_x \le \|W\|_A\, \|e_u\|_x \le \|W\|_A,$$

where $e_k$ is the unit vector whose components are 0 except for the $k$-th component, which is 1. Note that this is a weak, extreme case upper bound: it will be reached only if all $\|F'(t-m)\|_A$ take on maximal values, and if the contributions of all paths across which error flows back from unit $u$ to unit $v$ have the same sign. A large $\|W\|_A$, however, typically results in small values of $\|F'(t-m)\|_A$, as confirmed by experiments (see, e.g., Hochreiter 1991).

For example, with norms

$$\|W\|_A := \max_r \sum_s |w_{rs}|$$

and

$$\|x\|_x := \max_r |x_r|,$$

we have $f'_{max} = 0.25$ for the logistic sigmoid. We observe that if

$$|w_{ij}| \le w_{max} < \frac{4.0}{n} \quad \forall i, j,$$

then $\|W\|_A \le n\, w_{max} < 4.0$ will result in exponential decay: by setting $\tau := \frac{n\, w_{max}}{4.0} < 1.0$, we obtain

$$\left| \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} \right| \le n\, (\tau)^q.$$

We refer to Hochreiter's 1991 thesis for additional results.
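The bound can be checked numerically on a small example. The sketch below evaluates recursion (1) for a randomly drawn, fully connected net of logistic units and compares the result with $n\,\tau^q$. The network size, weight range, net inputs, random seed, and function names are all assumptions chosen for illustration.

```python
import math
import random

def logistic_prime(net):
    s = 1.0 / (1.0 + math.exp(-net))
    return s * (1.0 - s)

def scaling_factor(w, nets, u, v):
    """Evaluate recursion (1): d(theta_v(t-q)) / d(theta_u(t)).

    w[i][j] is the weight from unit j to unit i; nets[m-1][i] is net_i(t-m).
    """
    n = len(w)
    q = len(nets)
    # Case q = 1, computed for every possible target unit l.
    d = [logistic_prime(nets[0][l]) * w[u][l] for l in range(n)]
    # Case q > 1, applied once per additional step back in time.
    for m in range(2, q + 1):
        d = [logistic_prime(nets[m - 1][k]) * sum(d[l] * w[l][k] for l in range(n))
             for k in range(n)]
    return d[v]

def weak_upper_bound(n, w_max, q):
    """n * tau**q with tau = n * w_max / 4.0 (logistic sigmoid, max-norms)."""
    return n * (n * w_max / 4.0) ** q

rng = random.Random(1)
n, q, w_max = 4, 10, 0.5          # tau = 0.5 < 1, so the bound decays with q
w = [[rng.uniform(-w_max, w_max) for _ in range(n)] for _ in range(n)]
nets = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(q)]
factor = scaling_factor(w, nets, u=0, v=1)
bound = weak_upper_bound(n, w_max, q)
```

In this setting the computed factor is expected to lie well below the bound, in line with the remark that the bound is weak and is only reached in an extreme case.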

3.2 CONSTANT ERROR FLOW: NAIVE APPROACH

A single unit. To avoid vanishing error signals, how can we achieve constant error flow through a single unit $j$ with a single connection to itself? According to the rules above, at time $t$, $j$'s local error back flow is $\vartheta_j(t) = f'_j(\mathit{net}_j(t))\, \vartheta_j(t+1)\, w_{jj}$. To enforce constant error flow through $j$, we require

$$f'_j(\mathit{net}_j(t))\, w_{jj} = 1.0.$$

Note the similarity to Mozer's fixed time constant system (1992): a time constant of 1.0 is appropriate for potentially infinite time lags (footnote 1).

The constant error carrousel. Integrating the differential equation above, we obtain $f_j(\mathit{net}_j(t)) = \frac{\mathit{net}_j(t)}{w_{jj}}$ for arbitrary $\mathit{net}_j(t)$. This means: $f_j$ has to be linear, and unit $j$'s activation has to remain constant:

$$y^j(t+1) = f_j(\mathit{net}_j(t+1)) = f_j(w_{jj}\, y^j(t)) = y^j(t).$$

In the experiments, this will be ensured by using the identity function $f_j$: $f_j(x) = x,\ \forall x$, and by setting $w_{jj} = 1.0$. We refer to this as the constant error carrousel (CEC). CEC will be LSTM's central feature (see Section 4).

Footnote 1: We do not use the expression "time constant" in the differential sense, as, e.g., Pearlmutter (1995).
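The effect of the CEC condition can be illustrated with a few lines of Python. The sketch below applies the single-unit back flow rule $\vartheta_j(t) = f'_j(\mathit{net}_j(t))\, \vartheta_j(t+1)\, w_{jj}$ over 1000 steps, once for the identity function with $w_{jj} = 1.0$ and once for a logistic unit with the same weight. The constant net input of 0.3 and the function names are assumptions made for illustration.

```python
import math

def logistic_prime(net):
    s = 1.0 / (1.0 + math.exp(-net))
    return s * (1.0 - s)

def identity_prime(net):
    return 1.0

def error_after_backflow(f_prime, w_jj, nets, error_now=1.0):
    """Apply theta_j(t) = f'_j(net_j(t)) * theta_j(t+1) * w_jj once per step back in time."""
    error = error_now
    for net in reversed(nets):
        error = f_prime(net) * error * w_jj
    return error

nets = [0.3] * 1000                      # net inputs over 1000 time steps
cec_error = error_after_backflow(identity_prime, 1.0, nets)
sigmoid_error = error_after_backflow(logistic_prime, 1.0, nets)
```

With the identity function and unit self-weight the error is passed back unchanged, whereas the logistic unit shrinks it at every step until it is numerically indistinguishable from zero.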

Of course unit $j$ will not only be connected to itself but also to other units. This invokes two obvious, related problems (also inherent in all other gradient-based approaches):

1. Input weight conflict: for simplicity, let us focus on a single additional input weight $w_{ji}$. Assume that the total error can be reduced by switching on unit $j$ in response to a certain input, and keeping it active for a long time (until it helps to compute a desired output). Provided unit $i$'s activation is non-zero, since the same incoming weight has to be used for both storing certain inputs and ignoring others, $w_{ji}$ will often receive conflicting weight update signals during this time (recall that $j$ is linear): these signals will attempt to make $w_{ji}$ participate in (1) storing the input (by switching on $j$) and (2) protecting the input (by preventing $j$ from being switched off by irrelevant later inputs). This conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "write operations" through input weights.

2. Output weight conflict: assume $j$ is switched on and currently stores some previous input. For simplicity, let us focus on a single additional outgoing weight $w_{kj}$. The same $w_{kj}$ has to be used for both retrieving $j$'s content at certain times and preventing $j$ from disturbing $k$ at other times. As long as unit $j$'s activation is non-zero, $w_{kj}$ will attract conflicting weight update signals generated during sequence processing: these signals will attempt to make $w_{kj}$ participate in (1) accessing the information stored in $j$ and, at different times, (2) protecting unit $k$ from being perturbed by $j$. For instance, with many tasks there are certain "short time lag errors" that can be reduced in early training stages. However, at later training stages $j$ may suddenly start to cause avoidable errors in situations that already seemed under control by attempting to participate in reducing more difficult "long time lag errors". Again, this conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "read operations" through output weights.

Of course, input and output weight conflicts are not specific to long time lags, but occur for short time lags as well. Their effects, however, become particularly pronounced in the long time lag case: as the time lag increases, (1) stored information must be protected against perturbation for longer and longer periods, and, especially in advanced stages of learning, (2) more and more already correct outputs also require protection against perturbation.

Due to the problems above the naive approach does not work well except in case of certain simple problems involving local input/output representations and non-repeating input patterns (see Hochreiter 1991 and Silva et al. 1996). The next section shows how to do it right.

4 LONG SHORT-TERM MEMORY

Memory cells and gate units. To construct an architecture that allows for constant error flow through special, self-connected units without the disadvantages of the naive approach, we extend the constant error carrousel CEC embodied by the self-connected, linear unit $j$ from Section 3.2 by introducing additional features. A multiplicative input gate unit is introduced to protect the memory contents stored in $j$ from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit is introduced which protects other units from perturbation by currently irrelevant memory contents stored in $j$.

The resulting, more complex unit is called a memory cell (see Figure 1). The $j$-th memory cell is denoted $c_j$. Each memory cell is built around a central linear unit with a fixed self-connection (the CEC). In addition to $\mathit{net}_{c_j}$, $c_j$ gets input from a multiplicative unit $out_j$ (the "output gate"), and from another multiplicative unit $in_j$ (the "input gate"). $in_j$'s activation at time $t$ is denoted by $y^{in_j}(t)$, $out_j$'s by $y^{out_j}(t)$. We have

$$y^{out_j}(t) = f_{out_j}(\mathit{net}_{out_j}(t)); \quad y^{in_j}(t) = f_{in_j}(\mathit{net}_{in_j}(t));$$

where

$$\mathit{net}_{out_j}(t) = \sum_u w_{out_j u}\, y^u(t-1),$$

and

$$\mathit{net}_{in_j}(t) = \sum_u w_{in_j u}\, y^u(t-1).$$

We also have

$$\mathit{net}_{c_j}(t) = \sum_u w_{c_j u}\, y^u(t-1).$$

The summation indices $u$ may stand for input units, gate units, memory cells, or even conventional hidden units if there are any (see also paragraph on "network topology" below). All these different types of units may convey useful information about the current state of the net. For instance, an input gate (output gate) may use inputs from other memory cells to decide whether to store (access) certain information in its memory cell. There even may be recurrent self-connections like $w_{c_j c_j}$. It is up to the user to define the network topology. See Figure 2 for an example.

At time $t$, $c_j$'s output $y^{c_j}(t)$ is computed as

$$y^{c_j}(t) = y^{out_j}(t)\, h(s_{c_j}(t)),$$

where the "internal state" $s_{c_j}(t)$ is

$$s_{c_j}(0) = 0, \quad s_{c_j}(t) = s_{c_j}(t-1) + y^{in_j}(t)\, g(\mathit{net}_{c_j}(t)) \quad \text{for } t > 0.$$

The differentiable function $g$ squashes $\mathit{net}_{c_j}$; the differentiable function $h$ scales memory cell outputs computed from the internal state $s_{c_j}$.
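The update equations above translate directly into code. The following Python sketch implements one forward step of a single memory cell and uses it to write a value, hold it for 1000 steps with the input gate closed, and then read it out. The logistic gate activations, the particular choices of $g$ (range $[-2, 2]$) and $h$ (range $[-1, 1]$), the net input values, and the function names are assumptions for this illustration; in a full network the net inputs would be the weighted sums defined above.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def g(x):
    """Input squashing function; the range [-2, 2] is an assumption for this sketch."""
    return 4.0 * logistic(x) - 2.0

def h(x):
    """Output scaling function; the range [-1, 1] is an assumption for this sketch."""
    return 2.0 * logistic(x) - 1.0

def memory_cell_step(s_prev, net_c, net_in, net_out):
    """One time step of a single memory cell c_j with logistic gate units."""
    y_in = logistic(net_in)              # input gate activation
    y_out = logistic(net_out)            # output gate activation
    s = s_prev + y_in * g(net_c)         # CEC: self-connection with fixed weight 1.0
    y_c = y_out * h(s)                   # gated cell output
    return s, y_c

# Write once (input gate open), then keep both gates closed for 1000 steps.
s, _ = memory_cell_step(0.0, net_c=50.0, net_in=50.0, net_out=-50.0)
for _ in range(1000):
    s, y_closed = memory_cell_step(s, net_c=-50.0, net_in=-50.0, net_out=-50.0)
# Finally open the output gate to read the stored value.
s, y_read = memory_cell_step(s, net_c=0.0, net_in=-50.0, net_out=50.0)
```

While the input gate stays closed, the later cell inputs leave the internal state essentially untouched, and the closed output gate keeps the cell from influencing other units until the value is read.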

Figure 1: Architecture of memory cell $c_j$ (the box) and its gate units $in_j$, $out_j$. The self-recurrent connection (with weight 1.0) indicates feedback with a delay of 1 time step. It builds the basis of the "constant error carrousel" CEC. The gate units open and close access to CEC. See text and appendix A.1 for details.

Why gate units? To avoid input weight conflicts, $in_j$ controls the error flow to memory cell $c_j$'s input connections $w_{c_j i}$. To circumvent $c_j$'s output weight conflicts, $out_j$ controls the error flow from unit $j$'s output connections. In other words, the net can use $in_j$ to decide when to keep or override information in memory cell $c_j$, and $out_j$ to decide when to access memory cell $c_j$ and when to prevent other units from being perturbed by $c_j$ (see Figure 1).

Error signals trapped within a memory cell's CEC cannot change, but different error signals flowing into the cell (at different times) via its output gate may get superimposed. The output gate will have to learn which errors to trap in its CEC, by appropriately scaling them. The input gate will have to learn when to release errors, again by appropriately scaling them. Essentially, the multiplicative gate units open and close access to constant error flow through CEC.

Distributed output representations typically do require output gates. Both gate types are not always necessary, though; one may be sufficient. For instance, in Experiments 2a and 2b in Section 5, it will be possible to use input gates only. In fact, output gates are not required in case of local output encoding: preventing memory cells from perturbing already learned outputs can be done by simply setting the corresponding weights to zero. Even in this case, however, output gates can be beneficial: they prevent the net's attempts at storing long time lag memories (which are usually hard to learn) from perturbing activations representing easily learnable short time lag memories. (This will prove quite useful in Experiment 1, for instance.)

Network topology. We use networks with one input layer, one hidden layer, and one output layer. The (fully) self-connected hidden layer contains memory cells and corresponding gate units (for convenience, we refer to both memory cells and gate units as being located in the hidden layer). The hidden layer may also contain "conventional" hidden units providing inputs to gate units and memory cells. All units (except for gate units) in all layers have directed connections (serve as inputs) to all units in the layer above (or to all higher layers; Experiments 2a and 2b).

Memory cell blocks. $S$ memory cells sharing the same input gate and the same output gate form a structure called a "memory cell block of size $S$". Memory cell blocks facilitate information storage; as with conventional neural nets, it is not so easy to code a distributed input within a single cell. Since each memory cell block has as many gate units as a single memory cell (namely two), the block architecture can be even slightly more efficient (see paragraph "computational complexity"). A memory cell block of size 1 is just a simple memory cell. In the experiments (Section 5), we will use memory cell blocks of various sizes.

Learning. We use a variant of RTRL (e.g., Robinson and Fallside 1987) which properly takes into account the altered, multiplicative dynamics caused by input and output gates. However, to ensure non-decaying error backprop through internal states of memory cells, as with truncated BPTT (e.g., Williams and Peng 1990), errors arriving at "memory cell net inputs" (for cell $c_j$, this includes $\mathit{net}_{c_j}$, $\mathit{net}_{in_j}$, $\mathit{net}_{out_j}$) do not get propagated back further in time (although they do serve to change the incoming weights). Only within memory cells (footnote 2), errors are propagated back through previous internal states $s_{c_j}$. To visualize this: once an error signal arrives at a memory cell output, it gets scaled by output gate activation and $h'$. Then it is within the memory cell's CEC, where it can flow back indefinitely without ever being scaled. Only when it leaves the memory cell through the input gate and $g$, it is scaled once more by input gate activation and $g'$. It then serves to change the incoming weights before it is truncated (see appendix for explicit formulae).
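The constant flow inside the cell can be written out in one line. The following is a short worked derivation under the truncation described above, that is, assuming that the derivatives of $\mathit{net}_{c_j}(t)$ and $\mathit{net}_{in_j}(t)$ with respect to earlier internal states are treated as zero. The symbol $\approx_{tr}$ is used here to mean "equal up to truncation" and is a notational assumption of this sketch; the explicit formulae are those of the appendix.

```latex
s_{c_j}(t) = s_{c_j}(t-1) + y^{in_j}(t)\, g(\mathit{net}_{c_j}(t))
\;\Longrightarrow\;
\frac{\partial s_{c_j}(t)}{\partial s_{c_j}(t-1)} \approx_{tr} 1
\;\Longrightarrow\;
\frac{\partial s_{c_j}(t)}{\partial s_{c_j}(t-q)}
  = \prod_{m=1}^{q} \frac{\partial s_{c_j}(t-m+1)}{\partial s_{c_j}(t-m)}
  \approx_{tr} 1 \quad \text{for all } q \ge 1 .
```

Each factor of the product equals 1 because of the fixed self-connection, so an error that has entered the CEC is neither amplified nor attenuated, whatever the lag $q$.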

Computational complexity. As with Mozer's focused recurrent backprop algorithm (Mozer 1989), only the derivatives $\frac{\partial s_{c_j}}{\partial w_{il}}$ need to be stored and updated. Hence the LSTM algorithm is very efficient, with an excellent update complexity of $O(W)$, where $W$ is the number of weights (see details in appendix A.1). Hence, LSTM and BPTT for fully recurrent nets have the same update complexity per time step (while RTRL's is much worse). Unlike full BPTT, however, LSTM is local in space and time (footnote 3): there is no need to store activation values observed during sequence processing in a stack with potentially unlimited size.

Footnote 2: For intra-cellular backprop in a quite different context see also Doya and Yoshizawa (1989).

Footnote 3: Following Schmidhuber (1989), we say that a recurrent net algorithm is local in space if the update complexity per time step and weight does not depend on network size. We say that a method is local in time if its storage requirements do not depend on input sequence length. For instance, RTRL is local in time but not in space. BPTT is local in space but not in time.

Figure 2: Example of a net with 8 input units, 4 output units, and 2 memory cell blocks of size 2. $in_1$ marks the input gate, $out_1$ marks the output gate, and $cell_1/block_1$ marks the first memory cell of block 1. $cell_1/block_1$'s architecture is identical to the one in Figure 1, with gate units $in_1$ and $out_1$ (note that by rotating Figure 1 by 90 degrees anti-clockwise, it will match with the corresponding parts of Figure 2). The example assumes dense connectivity: each gate unit and each memory cell see all non-output units. For simplicity, however, outgoing weights of only one type of unit are shown for each layer. With the efficient, truncated update rule, error flows only through connections to output units, and through fixed self-connections within cell blocks (not shown here; see Figure 1). Error flow is truncated once it "wants" to leave memory cells or gate units. Therefore, no connection shown above serves to propagate error back to the unit from which the connection originates (except for connections to output units), although the connections themselves are modifiable. That is why the truncated LSTM algorithm is so efficient, despite its ability to bridge very long time lags. See text and appendix A.1 for details. Figure 2 actually shows the architecture used for Experiment 6a; only the bias of the non-input units is omitted.

Abuse problem and solutions. In the beginning of the learning phase, error reduction may be possible without storing information over time. The network will thus tend to abuse memory cells, e.g., as bias cells (i.e., it might make their activations constant and use the outgoing connections as adaptive thresholds for other units). The potential difficulty is: it may take a long time to release abused memory cells and make them available for further learning. A similar "abuse problem" appears if two memory cells store the same (redundant) information. There are at least two solutions to the abuse problem: (1) Sequential network construction (e.g., Fahlman 1991): a memory cell and the corresponding gate units are added to the network whenever the error stops decreasing (see Experiment 2 in Section 5). (2) Output gate bias: each output gate gets a negative initial bias, to push initial memory cell activations towards zero. Memory cells with more negative bias automatically get "allocated" later (see Experiments 1, 3, 4, 5, 6 in Section 5).
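The output gate bias remedy for the abuse problem can be illustrated numerically. In the sketch below, the initial cell output $y^{c_j} = y^{out_j}\, h(s_{c_j})$ is evaluated for three cells whose output gates see only their bias. The bias values, the logistic gate activation, the fixed value of $h(s_{c_j})$, and the function names are hypothetical choices for this example.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def initial_cell_output(output_gate_bias, h_of_state=1.0):
    """Cell output y_c = y_out * h(s) when the output gate sees only its bias."""
    return logistic(output_gate_bias) * h_of_state

biases = [-1.0, -2.0, -3.0]              # hypothetical initial biases for three cells
outputs = [initial_cell_output(b) for b in biases]
```

The more negative the bias, the closer the initial cell output is to zero, so cells with more negative bias start to contribute to the net's outputs only later in training.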

Internal state drift and remedies. If memory cell $c_j$'s inputs are mostly positive or mostly negative, then its internal state $s_j$ will tend to drift away over time. This is potentially dangerous, for the $h'(s_j)$ will then adopt very small values, and the gradient will vanish. One way to circumvent this problem is to choose an appropriate function $h$. But $h(x) = x$, for instance, has the disadvantage of unrestricted memory cell output range. Our simple but effective way of solving drift problems at the beginning of learning is to initially bias the input gate $in_j$ towards zero. Although there is a tradeoff between the magnitudes of $h'(s_j)$ on the one hand and of $y^{in_j}$ and $f'_{in_j}$ on the other, the potential negative effect of input gate bias is negligible compared to the one of the drifting effect. With logistic sigmoid activation functions, there appears to be no need for fine-tuning the initial bias, as confirmed by Experiments 4 and 5 in Section 5.4.
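The drift effect can be illustrated numerically. The following is a minimal sketch, not the authors' code. It assumes scaled logistic forms for $g$ (range $[-2, 2]$) and $h$ (range $[-1, 1]$), a constant cell input, and an input gate whose net input is held at its bias; the names `drift` and `h_prime` are hypothetical.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def g(x):
    """Assumed input squashing function with range [-2, 2]."""
    return 4.0 * logistic(x) - 2.0


def h_prime(s):
    """Derivative of the assumed h(s) = 2 * logistic(s) - 1 (range [-1, 1])."""
    y = logistic(s)
    return 2.0 * y * (1.0 - y)


def drift(steps, net_c, in_gate_bias):
    """Accumulate the internal state for a constant cell input net_c.

    Simplification: the input gate's net input is held at its bias,
    so y_in = logistic(in_gate_bias) at every step.
    """
    y_in = logistic(in_gate_bias)
    s = 0.0
    for _ in range(steps):
        s += y_in * g(net_c)  # s(t) = s(t-1) + y_in * g(net_c)
    return s, h_prime(s)


if __name__ == "__main__":
    for bias in (0.0, -5.0):
        s, hp = drift(steps=50, net_c=1.0, in_gate_bias=bias)
        print(f"bias={bias:+.1f}  s={s:.3f}  h'(s)={hp:.3e}")
```

With an unbiased gate the state grows steadily and $h'(s_j)$ collapses towards zero; with a strongly negative gate bias the state stays near zero and $h'(s_j)$ remains close to its maximum.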

5 EXPERIMENTS

Introduction. Which tasks are appropriate to demonstrate the quality of a novel long time lag algorithm? First of all, minimal time lags between relevant input signals and corresponding teacher signals must be long for all training sequences. In fact, many previous recurrent net algorithms sometimes manage to generalize from very short training sequences to very long test sequences. See, e.g., Pollack (1991). But a real long time lag problem does not have any short time lag exemplars in the training set. For instance, Elman's training procedure, BPTT, offline RTRL, online RTRL, etc., fail miserably on real long time lag problems. See, e.g., Hochreiter (1991) and Mozer (1992). A second important requirement is that the tasks should be complex enough such that they cannot be solved quickly by simple-minded strategies such as random weight guessing.

Guessing can outperform many long time lag algorithms. Recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that many long time lag tasks used in previous work can be solved more quickly by simple random weight guessing than by the proposed algorithms. For instance, guessing solved a variant of Bengio and Frasconi's "parity problem" (1994) much faster (see footnote 4) than the seven methods tested by Bengio et al. (1994) and Bengio and Frasconi (1994). The same holds for some of Miller and Giles' problems (1993). Of course, this does not mean that guessing is a good algorithm. It just means that some previously used problems are not extremely appropriate to demonstrate the quality of previously proposed algorithms.

What's common to Experiments 1-6. All our experiments (except for Experiment 1) involve long minimal time lags: there are no short time lag training exemplars facilitating learning. Solutions to most of our tasks are sparse in weight space. They require either many parameters/inputs or high weight precision, such that random weight guessing becomes infeasible.

We always use on-line learning (as opposed to batch learning), and logistic sigmoids as activation functions. For Experiments 1 and 2, initial weights are chosen in the range $[-0.2, 0.2]$, for the other experiments in $[-0.1, 0.1]$. Training sequences are generated randomly according to the various task descriptions. In slight deviation from the notation in Appendix A.1, each discrete time step of each input sequence involves three processing steps: (1) Use current input to set the input units. (2) Compute activations of hidden units (including input gates, output gates, memory cells). (3) Compute output unit activations. Except for Experiments 1, 2a, and 2b, sequence elements are randomly generated on-line, and error signals are generated only at sequence ends. Net activations are reset after each processed input sequence.

For comparisons with recurrent nets taught by gradient descent, we give results only for RTRL, except for comparison 2a, which also includes BPTT. Note, however, that untruncated BPTT (see, e.g., Williams and Peng 1990) computes exactly the same gradient as offline RTRL. With long time lag problems, offline RTRL (or BPTT) and the online version of RTRL (no activation resets, online weight changes) lead to almost identical, negative results (as confirmed by additional simulations in Hochreiter 1991; see also Mozer 1992). This is because offline RTRL, online RTRL, and full BPTT all suffer badly from exponential error decay.
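To give a feeling for what exponential error decay means in numbers, here is a rough sketch for the simplest possible case: a single self-recurrent logistic unit whose net input stays constant. Both this simplification and the function names are assumptions made for illustration. Per time step the back-propagated error is scaled by $f'(net) \cdot w$, and the logistic derivative never exceeds 0.25.

```python
import math


def logistic_prime(net):
    y = 1.0 / (1.0 + math.exp(-net))
    return y * (1.0 - y)


def error_scale_conventional(w, net, lag):
    """Magnitude of the error scaling factor after `lag` steps for one
    self-recurrent logistic unit with weight w and constant net input."""
    return abs(logistic_prime(net) * w) ** lag


def error_scale_constant(lag):
    """Linear unit with a fixed self-connection of 1.0: factor 1 per step."""
    return 1.0 ** lag


if __name__ == "__main__":
    for lag in (10, 100):
        print(lag, error_scale_conventional(w=1.0, net=0.0, lag=lag),
              error_scale_constant(lag))
```

Even at the point where the logistic derivative is maximal (net input 0), a self-weight of 1.0 shrinks the error by a factor of 0.25 per step, whereas a linear unit with a fixed self-connection of 1.0 leaves it unchanged.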

Our LSTM architectures are selected quite arbitrarily. If nothing is known about the complexity of a given problem, a more systematic approach would be: start with a very small net consisting of one memory cell. If this does not work, try two cells, etc. Alternatively, use sequential network construction (e.g., Fahlman 1991).

Footnote 4: It should be mentioned, however, that different input representations and different types of noise may lead to worse guessing performance (Yoshua Bengio, personal communication, 1996).

Outline of experiments.

Experiment 1 focuses on a standard benchmark test for recurrent nets: the embedded Reber grammar. Since it allows for training sequences with short time lags, it is not a long time lag problem. We include it because (1) it provides a nice example where LSTM's output gates are truly beneficial, and (2) it is a popular benchmark for recurrent nets that has been used by many authors; we want to include at least one experiment where conventional BPTT and RTRL do not fail completely (LSTM, however, clearly outperforms them). The embedded Reber grammar's minimal time lags represent a border case in the sense that it is still possible to learn to bridge them with conventional algorithms. Only slightly longer minimal time lags would make this almost impossible. The more interesting tasks in our paper, however, are those that RTRL, BPTT, etc. cannot solve at all.

Experiment 2 focuses on noise-free and noisy sequences involving numerous input symbols distracting from the few important ones. The most difficult task (Task 2c) involves hundreds of distractor symbols at random positions, and minimal time lags of 1000 steps. LSTM solves it, while BPTT and RTRL already fail in case of 10-step minimal time lags (see also, e.g., Hochreiter 1991 and Mozer 1992). For this reason RTRL and BPTT are omitted in the remaining, more complex experiments, all of which involve much longer time lags.

Experiment 3 addresses long time lag problems with noise and signal on the same input line. Experiments 3a/3b focus on Bengio et al.'s 1994 "2-sequence problem". Because this problem actually can be solved quickly by random weight guessing, we also include a far more difficult 2-sequence problem (3c) which requires learning real-valued, conditional expectations of noisy targets, given the inputs.

Experiments 4 and 5 involve distributed, continuous-valued input representations and require learning to store precise, real values for very long time periods. Relevant input signals can occur at quite different positions in input sequences. Again minimal time lags involve hundreds of steps. Similar tasks have never been solved by other recurrent net algorithms.

Experiment 6 involves tasks of a different complex type that also has not been solved by other recurrent net algorithms. Again, relevant input signals can occur at quite different positions in input sequences. The experiment shows that LSTM can extract information conveyed by the temporal order of widely separated inputs.

Subsection 5.7 will provide a detailed summary of experimental conditions in two tables for

reference.

5.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR

Task. Our first task is to learn the "embedded Reber grammar", e.g. Smith and Zipser (1989), Cleeremans et al. (1989), and Fahlman (1991). Since it allows for training sequences with short time lags (of as few as 9 steps), it is not a long time lag problem. We include it for two reasons: (1) it is a popular recurrent net benchmark used by many authors (we wanted to have at least one experiment where RTRL and BPTT do not fail completely), and (2) it shows nicely how output gates can be beneficial.

[Figure 3: Transition diagram for the Reber grammar. Edge labels: B, T, S, X, P, V, E.]

[Figure 4: Transition diagram for the embedded Reber grammar. Each box represents a copy of the Reber grammar (see Figure 3).]

Starting at the leftmost node of the directed graph in Figure 4, symbol strings are generated sequentially (beginning with the empty string) by following edges, and appending the associated symbols to the current string, until the rightmost node is reached. Edges are chosen randomly if there is a choice (probability: 0.5). The net's task is to read strings, one symbol at a time, and to permanently predict the next symbol (error signals occur at every time step). To correctly predict the symbol before last, the net has to remember the second symbol.
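The string generation procedure can be sketched as follows. This is an illustrative sketch under one loud assumption: since the transition diagrams are not reproduced here, the table `REBER` encodes the commonly used standard Reber grammar, which may differ in node numbering from Figure 3. All names are hypothetical.

```python
import random

# Assumed standard Reber grammar: state -> list of (symbol, next_state).
# State 5 is the final state, from which only "E" is emitted.
REBER = {
    0: [("T", 1), ("P", 2)],
    1: [("S", 1), ("X", 3)],
    2: [("T", 2), ("V", 4)],
    3: [("X", 2), ("S", 5)],
    4: [("P", 3), ("V", 5)],
}


def reber_string(rng):
    """Generate one Reber string; each edge is chosen with probability 0.5."""
    symbols = ["B"]
    state = 0
    while state != 5:
        symbol, state = rng.choice(REBER[state])
        symbols.append(symbol)
    symbols.append("E")
    return "".join(symbols)


def embedded_reber_string(rng):
    """B, then T or P, then a Reber string, then the same T or P, then E."""
    branch = rng.choice("TP")
    return "B" + branch + reber_string(rng) + branch + "E"


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(embedded_reber_string(rng))
```

Under this grammar the shortest embedded strings have 9 symbols, and the symbol before last always equals the second symbol, which is the dependency the net has to bridge.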

Comparison. We compare LSTM to "Elman nets trained by Elman's training procedure" (ELM) (results taken from Cleeremans et al. 1989), Fahlman's "Recurrent Cascade-Correlation" (RCC) (results taken from Fahlman 1991), and RTRL (results taken from Smith and Zipser (1989), where only the few successful trials are listed). It should be mentioned that Smith and Zipser actually make the task easier by increasing the probability of short time lag exemplars. We didn't do this for LSTM.

Training/Testing. We use a local input/output representation (7 input units, 7 output units). Following Fahlman, we use 256 training strings and 256 separate test strings. The training set is generated randomly; training exemplars are picked randomly from the training set. Test sequences are generated randomly, too, but sequences already used in the training set are not used for testing. After string presentation, all activations are reinitialized with zeros. A trial is considered successful if all string symbols of all sequences in both test set and training set are predicted correctly, that is, if the output unit(s) corresponding to the possible next symbol(s) is (are) always the most active ones.
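One way to read the success criterion is that, at every time step, the $k$ most active output units must be exactly the $k$ units of the grammatically possible next symbols. The sketch below implements this reading; it is an interpretation, not the authors' evaluation code, and the function name is hypothetical. Ties between output activations are not treated specially.

```python
def prediction_correct(outputs, legal_next):
    """Return True if the units in `legal_next` are the most active ones.

    outputs:    list of output unit activations for one time step
    legal_next: collection of indices of the possible next symbols
    """
    k = len(legal_next)
    ranked = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    return set(ranked[:k]) == set(legal_next)


if __name__ == "__main__":
    activations = [0.10, 0.80, 0.70, 0.05]
    print(prediction_correct(activations, {1, 2}))  # units 1 and 2 are on top
    print(prediction_correct(activations, {0, 1}))  # unit 0 is not among the top two
```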

Architectures. Architectures for RTRL, ELM, RCC are reported in the references listed above. For LSTM, we use 3 (4) memory cell blocks. Each block has 2 (1) memory cells. The output layer's only incoming connections originate at memory cells. Each memory cell and each gate unit receives incoming connections from all memory cells and gate units (the hidden layer is fully connected; less connectivity may work as well). The input layer has forward connections to all units in the hidden layer. The gate units are biased. These architecture parameters make it easy to store at least 3 input signals (architectures 3-2 and 4-1 are employed to obtain comparable numbers of weights for both architectures: 264 for 4-1 and 276 for 3-2). Other parameters may be appropriate as well, however. All sigmoid functions are logistic with output range $[0, 1]$, except for $h$, whose range is $[-1, 1]$, and $g$, whose range is $[-2, 2]$. All weights are initialized in $[-0.2, 0.2]$, except for the output gate biases, which are initialized to -1, -2, and -3, respectively (see abuse problem, solution (2) of Section 4). We tried learning rates of 0.1, 0.2 and 0.5.
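The quoted weight counts can be checked with simple arithmetic from the connectivity described above. The sketch assumes that only gate units carry a bias (memory cells and output units have none) and that each block has exactly one input gate and one output gate; the function name is hypothetical.

```python
def lstm_weight_count(n_inputs, n_outputs, n_blocks, cells_per_block):
    """Count weights for the fully connected hidden layer described in the text."""
    n_cells = n_blocks * cells_per_block
    n_gates = 2 * n_blocks                # one input gate and one output gate per block
    n_hidden = n_cells + n_gates
    # Every hidden unit sees all hidden units and all input units.
    hidden_weights = n_hidden * (n_hidden + n_inputs)
    gate_biases = n_gates                 # assumption: only gates are biased
    # Output units receive connections from memory cells only.
    output_weights = n_outputs * n_cells
    return hidden_weights + gate_biases + output_weights


if __name__ == "__main__":
    print(lstm_weight_count(7, 7, n_blocks=4, cells_per_block=1))  # 264
    print(lstm_weight_count(7, 7, n_blocks=3, cells_per_block=2))  # 276
```

Both architectures have 12 hidden units; the difference of 12 weights comes from two extra memory cells feeding the 7 output units (14 more weights) minus two gate biases.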

Results. We use 3 different, randomly generated pairs of training and test sets. With each such pair we run 10 trials with different initial weights. See Table 1 for results (mean of 30 trials). Unlike the other methods, LSTM always learns to solve the task. Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster.

Importance of output gates. The experiment provides a nice example where the output gate is truly beneficial. Learning to store the first T or P should not perturb activations representing the more easily learnable transitions of the original Reber grammar. This is the job of the output gates. Without output gates, we did not achieve fast learning.

5.2 EXPERIMENT 2: NOISE-FREE AND NOISY SEQUENCES

Task 2a: noise-free sequences with long time lags. There are $p + 1$ possible input symbols denoted $a_1, \ldots, a_{p-1}, a_p = x, a_{p+1} = y$. $a_i$ is "locally" represented by the $(p + 1)$-dimensional vector whose $i$-th component is 1 (all other components are 0). A net with $p + 1$ input units and $p + 1$ output units sequentially observes input symbol sequences, one at a time, permanently trying to predict the next symbol; error signals occur at every single time step. To emphasize the "long time lag problem", we use a training set consisting of only two very similar sequences: $(y, a_1, a_2, \ldots, a_{p-1}, y)$ and $(x, a_1, a_2, \ldots, a_{p-1}, x)$. Each is selected with probability 0.5. To predict the final element, the net has to learn to store a representation of the first element for $p$ time steps.
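The two training sequences and their local encoding can be sketched as follows. Symbols are handled as 1-based indices, with index $p$ standing for $x$ and index $p + 1$ for $y$; the function names are hypothetical and the sketch covers data generation only, not training.

```python
import random


def one_hot(i, p):
    """Local representation of a_i as a (p+1)-dimensional vector (1-based i)."""
    return [1.0 if k == i else 0.0 for k in range(1, p + 2)]


def task_2a_indices(p, rng):
    """Return (x, a_1, ..., a_{p-1}, x) or (y, a_1, ..., a_{p-1}, y) as indices."""
    first = rng.choice((p, p + 1))  # a_p = x, a_{p+1} = y, probability 0.5 each
    return [first] + list(range(1, p)) + [first]


def task_2a_example(p, rng):
    """Inputs are all symbols but the last; targets are the respective next symbols."""
    idx = task_2a_indices(p, rng)
    inputs = [one_hot(i, p) for i in idx[:-1]]
    targets = [one_hot(i, p) for i in idx[1:]]
    return inputs, targets


if __name__ == "__main__":
    rng = random.Random(1)
    print(task_2a_indices(4, rng))
```

Each sequence yields $p$ input/target pairs, and only the last target depends on the very first input.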

We compare "Real-Time Recurrent Learning" for fully recurrent nets (RTRL), "Back-Propagation Through Time" (BPTT), the sometimes very successful 2-net "Neural Sequence Chunker" (CH, Schmidhuber 1992b), and our new method (LSTM). In all cases, weights are initialized in $[-0.2, 0.2]$. Due to limited computation time, training is stopped after 5 million sequence presentations. A successful run is one that fulfills the following criterion: after training, during 10,000 successive, randomly chosen input sequences, the maximal absolute error of all output units is always below 0.25.

method   hidden units       # weights   learning rate   % of success      success after
RTRL     3                  170         0.05            "some fraction"   173,000
RTRL     12                 494         0.1             "some fraction"   25,000
ELM      15                 435         -               0                 >200,000
RCC      7-9                119-198     -               50                182,000
LSTM     4 blocks, size 1   264         0.1             100               39,740
LSTM     3 blocks, size 2   276         0.1             100               21,730
LSTM     3 blocks, size 2   276         0.2             97                14,060
LSTM     4 blocks, size 1   264         0.5             97                9,500
LSTM     3 blocks, size 2   276         0.5             100               8,440

Table 1: EXPERIMENT 1: Embedded Reber grammar: percentage of successful trials and number of sequence presentations until success for RTRL (results taken from Smith and Zipser 1989), "Elman net trained by Elman's procedure" (results taken from Cleeremans et al. 1989), "Recurrent Cascade-Correlation" (results taken from Fahlman 1991) and our new approach (LSTM). Weight numbers in the first 4 rows are estimates; the corresponding papers do not provide all the technical details. Only LSTM almost always learns to solve the task (only two failures out of 150 trials). Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster (the number of required training examples in the bottom row varies between 3,800 and 24,100).

Architectures. RTRL: one self-recurrent hidden unit, $p + 1$ non-recurrent output units. Each layer has connections from all layers below. All units use the logistic sigmoid activation function with range $[0, 1]$.

BPTT: same architecture as the one trained by RTRL.

CH: both net architectures like RTRL's, but one has an additional output for predicting the hidden unit of the other one (see Schmidhuber 1992b for details).

LSTM: like with RTRL, but the hidden unit is replaced by a memory cell and an input gate (no output gate required). $g$ is the logistic sigmoid, and $h$ is the identity function $h: h(x) = x, \forall x$. Memory cell and input gate are added once the error has stopped decreasing (see abuse problem: solution (1) in Section 4).

Results. Using RTRL and a short 4 time step delay ($p = 4$), $7/9$ of all trials were successful. No trial was successful with $p = 10$. With long time lags, only the neural sequence chunker and LSTM achieved successful trials, while BPTT and RTRL failed. With $p = 100$, the 2-net sequence chunker solved the task in only $1/3$ of all trials. LSTM, however, always learned to solve the task. Comparing successful trials only, LSTM learned much faster. See Table 2 for details. It should be mentioned, however, that a hierarchical chunker can also always quickly solve this task (Schmidhuber 1992c, 1993).

Task 2b: no local regularities. With the task above, the chunker sometimes learns to correctly predict the final element, but only because of predictable local regularities in the input stream that allow for compressing the sequence. In an additional, more difficult task (involving many more different possible sequences), we remove compressibility by replacing the deterministic subsequence $(a_1, a_2, \ldots, a_{p-1})$ by a random subsequence (of length $p - 1$) over the alphabet $a_1, a_2, \ldots, a_{p-1}$. We obtain 2 classes (two sets of sequences) $\{(y, a_{i_1}, a_{i_2}, \ldots, a_{i_{p-1}}, y) \mid 1 \le i_1, i_2, \ldots, i_{p-1} \le p - 1\}$ and $\{(x, a_{i_1}, a_{i_2}, \ldots, a_{i_{p-1}}, x) \mid 1 \le i_1, i_2, \ldots, i_{p-1} \le p - 1\}$. Again, every next sequence element has to be predicted. The only totally predictable targets, however, are $x$ and $y$, which occur at sequence ends. Training exemplars are chosen randomly from the 2 classes. Architectures and parameters are the same as in Experiment 2a. A successful run is one that fulfills the following criterion: after training, during 10,000 successive, randomly chosen input sequences, the maximal absolute error of all output units is below 0.25 at sequence end.

Method   Delay p   Learning rate   # weights   % Successful trials   Success after
RTRL     4         1.0             36          78                    1,043,000
RTRL     4         4.0             36          56                    892,000
RTRL     4         10.0            36          22                    254,000
RTRL     10        1.0-10.0        144         0                     >5,000,000
RTRL     100       1.0-10.0        10404       0                     >5,000,000
BPTT     100       1.0-10.0        10404       0                     >5,000,000
CH       100       1.0             10506       33                    32,400
LSTM     100       1.0             10504       100                   5,040

Table 2: Task 2a: Percentage of successful trials and number of training sequences until success, for "Real-Time Recurrent Learning" (RTRL), "Back-Propagation Through Time" (BPTT), neural sequence chunking (CH), and the new method (LSTM). Table entries refer to means of 18 trials. With 100 time step delays, only CH and LSTM achieve successful trials. Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster.
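For Task 2b, the two sequence classes and the sequence-end criterion can be sketched as follows. As before, symbols are 1-based indices with $p$ standing for $x$ and $p + 1$ for $y$. The function names are hypothetical, and drawing the middle symbols independently and uniformly is an assumption about what "random subsequence" means here.

```python
import random


def task_2b_indices(p, rng):
    """First and last symbol are both x or both y; the p-1 symbols in between
    are drawn at random from a_1, ..., a_{p-1} (assumed uniform, independent)."""
    first = rng.choice((p, p + 1))
    middle = [rng.randint(1, p - 1) for _ in range(p - 1)]
    return [first] + middle + [first]


def sequences_per_class(p):
    """Number of distinct sequences in each of the two classes."""
    return (p - 1) ** (p - 1)


def sequence_end_ok(outputs, targets, threshold=0.25):
    """Criterion at sequence end: maximal absolute output error below threshold."""
    return max(abs(o - t) for o, t in zip(outputs, targets)) < threshold


if __name__ == "__main__":
    rng = random.Random(2)
    print(task_2b_indices(5, rng))
    print(sequences_per_class(5))
```

Because the middle part is random, the only reliable cue for the final symbol is the first one, so the sequence cannot be compressed by exploiting local regularities.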

Results.

As expected, the c