Wider and Deeper, Cheaper and Faster:
Tensorized LSTMs for Sequence Learning
Zhen He1,2, Shaobing Gao3, Liang Xiao2, Daxue Liu2, Hangen He2, and David Barber∗1,4
1University College London, 2National University of Defense Technology, 3Sichuan University, 4Alan Turing Institute
Abstract
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer-term temporal information. The capacity of an LSTM network can be increased by widening and adding layers. However, usually the former introduces additional parameters, while the latter increases the runtime. As an alternative, we propose the Tensorized LSTM, in which the hidden states are represented by tensors and updated via a cross-layer convolution. By increasing the tensor size, the network can be widened efficiently without additional parameters, since the parameters are shared across different locations in the tensor; by delaying the output, the network can be deepened implicitly with little additional runtime, since deep computations for each timestep are merged into temporal computations of the sequence. Experiments conducted on five challenging sequence learning tasks show the potential of the proposed model.
1 Introduction
We consider the time-series prediction task of producing a desired output $y_t$ at each timestep $t \in \{1, \dots, T\}$, given an observed input sequence $x_{1:t} = \{x_1, x_2, \dots, x_t\}$, where $x_t \in \mathbb{R}^R$ and $y_t \in \mathbb{R}^S$ are vectors¹. The Recurrent Neural Network (RNN) [17, 43] is a powerful model that learns how to use a hidden state vector $h_t \in \mathbb{R}^M$ to encapsulate the relevant features of the entire input history $x_{1:t}$ up to timestep $t$. Let $h^{cat}_{t-1} \in \mathbb{R}^{R+M}$ be the concatenation of the current input $x_t$ and the previous hidden state $h_{t-1}$:
$$h^{cat}_{t-1} = [x_t, h_{t-1}] \quad (1)$$
The update of the hidden state $h_t$ is defined as:
$$a_t = h^{cat}_{t-1} W^h + b^h \quad (2)$$
$$h_t = \phi(a_t) \quad (3)$$
where $W^h \in \mathbb{R}^{(R+M) \times M}$ is the weight, $b^h \in \mathbb{R}^M$ the bias, $a_t \in \mathbb{R}^M$ the hidden activation, and $\phi(\cdot)$ the element-wise tanh function. Finally, the output $y_t$ at timestep $t$ is generated by:
$$y_t = \varphi(h_t W^y + b^y) \quad (4)$$
where $W^y \in \mathbb{R}^{M \times S}$ and $b^y \in \mathbb{R}^S$, and $\varphi(\cdot)$ can be any differentiable function, depending on the task.
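As a concrete reference, here is a minimal NumPy sketch of one step of this vanilla RNN; the function and variable names are ours, for illustration only:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_h, b_h, W_y, b_y, out_fn=lambda z: z):
    """One vanilla RNN step, Eqs. (1)-(4); vectors are rows, as in the paper."""
    h_cat = np.concatenate([x_t, h_prev])   # Eq. (1): concatenate input and state
    h_t = np.tanh(h_cat @ W_h + b_h)        # Eqs. (2)-(3): phi is element-wise tanh
    y_t = out_fn(h_t @ W_y + b_y)           # Eq. (4): out_fn is the task-dependent varphi
    return h_t, y_t
```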
However, this vanilla RNN has difficulties in modeling long-range dependencies due to the vanishing/exploding gradient problem [4]. Long Short-Term Memories (LSTMs) [19, 24] alleviate these problems by employing memory cells to preserve information for longer, and adopting gating mechanisms to modulate the information flow. Given the success of the LSTM in sequence modeling, it is natural to consider how to increase the complexity of the model and thereby increase the set of tasks for which the LSTM can be profitably applied.

∗Corresponding authors: Shaobing Gao <gaoshaobing@scu.edu.cn> and Zhen He <hezhen.cs@gmail.com>.
¹Vectors are assumed to be in row form throughout this paper.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
We consider the capacity of a network to consist of two components: the width (the amount of information handled in parallel) and the depth (the number of computation steps) [5]. A naive way to widen the LSTM is to increase the number of units in a hidden layer; however, the number of parameters then scales quadratically with the number of units. To deepen the LSTM, the popular Stacked LSTM (sLSTM) stacks multiple LSTM layers [20]; however, the runtime is proportional to the number of layers, and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers.
In this paper, we introduce a way to both widen and deepen the LSTM whilst keeping the number of parameters and the runtime largely unchanged. In summary, we make the following contributions:
(a) We tensorize RNN hidden state vectors into higher-dimensional tensors, which allow more flexible parameter sharing and can be widened more efficiently without additional parameters.
(b) Based on (a), we merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).
(c) We extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a novel memory cell convolution to help prevent vanishing/exploding gradients.
2 Method
2.1 Tensorizing Hidden States
It can be seen from (2) that, in an RNN, the number of parameters scales quadratically with the size of the hidden state. A popular way to limit the number of parameters when widening the network is to organize the parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements [6, 15, 18, 26, 32, 39, 46, 47, 51], which is known as tensor factorization. This implicitly widens the network, since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. Another common way to reduce the number of parameters is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs) [34, 35].
We adopt parameter sharing to cut down the number of parameters for RNNs since, compared with factorization, it has the following advantages: (i) scalability, i.e., the number of shared parameters can be set independently of the hidden state size, and (ii) separability, i.e., the information flow can be carefully managed by controlling the receptive field, allowing one to shift RNN deep computations to the temporal domain (see Sec. 2.2). We also explicitly tensorize the RNN hidden state vectors since, compared with vectors, tensors offer better: (i) flexibility, i.e., one can specify which dimensions share parameters and then increase the size of those dimensions without introducing additional parameters, and (ii) efficiency, i.e., with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the number of parameters (see Sec. 2.3).
For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state $h_t \in \mathbb{R}^M$ to become $H_t \in \mathbb{R}^{P \times M}$, where $P$ is the tensor size and $M$ the channel size. We locally connect the first dimension of $H_t$ in order to share parameters, and fully connect the second dimension of $H_t$ to allow global interactions. This is analogous to the CNN, which fully connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. Also, if one compares $H_t$ to the hidden state of a Stacked RNN (sRNN) (see Fig. 1(a)), then $P$ is akin to the number of stacked hidden layers, and $M$ the size of each hidden layer. We start by describing our model based on 2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.
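To make the efficiency argument concrete, the following back-of-the-envelope comparison (our own illustration, with arbitrary sizes) contrasts the quadratic parameter growth of naive widening in Eq. (2) with the fixed shared kernel introduced in Sec. 2.2:

```python
# Widening a vanilla RNN P-fold grows its recurrent weight matrix roughly
# quadratically (Eq. (2) with hidden size P*M), whereas the tRNN's shared
# cross-layer kernel (Eq. (6), Sec. 2.2) has a size independent of P.
R, M, P, K = 50, 100, 4, 3                 # arbitrary illustrative sizes
naive_wide = (R + P * M) * (P * M)         # W^h for a width-(P*M) vanilla RNN
trnn_kernel = K * M * M                    # W^h for the tRNN, any tensor size P
print(naive_wide, trnn_kernel)             # 180000 vs. 30000
```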
2.2 Merging Deep Computations
Since an RNN is already deep in its temporal direction, we can deepen an input-to-output computation by associating the input $x_t$ with a (delayed) future output. In doing this, we need to ensure that the output $y_t$ is separable, i.e., not influenced by any future input $x_{t'}$ ($t' > t$). Thus, we concatenate the projection of $x_t$ to the top of the previous hidden state $H_{t-1}$, then gradually shift the input information downwards as the temporal computation proceeds, and finally generate $y_t$ from the bottom of $H_{t+L-1}$, where $L-1$ is the number of delayed timesteps for computations of depth $L$. An example with $L=3$ is shown in Fig. 1(b). This is in fact a skewed sRNN, as used in [1] (also similar to [48]). However, our method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable; e.g., one can increase the local connections and use feedback (see Fig. 1(c)), which can be beneficial for sRNNs [10]. In order to share parameters, we update $H_t$ using a convolution with a learnable kernel. In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).

Figure 1: Examples of sRNN, tRNNs, and tLSTMs. (a) A 3-layer sRNN. (b) A 2D tRNN without (–) feedback (F) connections, which can be thought of as a skewed version of (a). (c) A 2D tRNN. (d) A 2D tLSTM without (–) memory (M) cell convolutions. (e) A 2D tLSTM. In each model, the blank circles in columns 1 to 4 denote the hidden state at timesteps $t-1$ to $t+2$, respectively, and the blue region denotes the receptive field of the current output $y_t$. In (b)-(e), the outputs are delayed by $L-1=2$ timesteps, where $L=3$ is the depth.
To describe the resulting tRNN model, let $H^{cat}_{t-1} \in \mathbb{R}^{(P+1) \times M}$ be the concatenated hidden state, and $p \in \mathbb{Z}_+$ the location in the tensor. The channel vector $h^{cat}_{t-1,p} \in \mathbb{R}^M$ at location $p$ of $H^{cat}_{t-1}$ is defined as:
$$h^{cat}_{t-1,p} = \begin{cases} x_t W^x + b^x & \text{if } p = 1 \\ h_{t-1,p-1} & \text{if } p > 1 \end{cases} \quad (5)$$
where $W^x \in \mathbb{R}^{R \times M}$ and $b^x \in \mathbb{R}^M$. Then, the update of the tensor $H_t$ is implemented via a convolution:
$$A_t = H^{cat}_{t-1} \circledast \{W^h, b^h\} \quad (6)$$
$$H_t = \phi(A_t) \quad (7)$$
where $W^h \in \mathbb{R}^{K \times M^i \times M^o}$ is the kernel weight of size $K$, with $M^i = M$ input channels and $M^o = M$ output channels, $b^h \in \mathbb{R}^{M^o}$ is the kernel bias, $A_t \in \mathbb{R}^{P \times M^o}$ is the hidden activation, and $\circledast$ is the convolution operator (see Appendix A.1 for a more detailed definition). Since the kernel convolves across different hidden layers, we call it the cross-layer convolution. The kernel enables interaction, both bottom-up and top-down, across layers. Finally, we generate $y_t$ from the channel vector $h_{t+L-1,P} \in \mathbb{R}^M$, which is located at the bottom of $H_{t+L-1}$:
$$y_t = \varphi(h_{t+L-1,P} W^y + b^y) \quad (8)$$
where $W^y \in \mathbb{R}^{M \times S}$ and $b^y \in \mathbb{R}^S$. To guarantee that the receptive field of $y_t$ only covers the current and previous inputs $x_{1:t}$ (see Fig. 1(c)), $L$, $P$, and $K$ should satisfy the constraint:
$$L = \left\lceil \frac{2P}{K - K \bmod 2} \right\rceil \quad (9)$$
where $\lceil \cdot \rceil$ is the ceiling operation. For the derivation of (9), please see Appendix B.
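A one-line check of this constraint (our sketch), which recovers the $L = P$ setting used later in the experiments:

```python
import math

def depth(P, K):
    """Eq. (9): the depth L implied by tensor size P and kernel size K."""
    return math.ceil(2 * P / (K - K % 2))

assert depth(4, 3) == 4   # K = 3 gives L = P, as used in Sec. 3
assert depth(4, 2) == 4   # K = 2 (the tLSTM-F setting) also gives L = P
```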
We call the model defined in (5)-(8) the Tensorized RNN (tRNN). The model can be widened by increasing the tensor size $P$, whilst the number of parameters remains fixed (thanks to the convolution). Also, unlike the sRNN with runtime complexity $O(TL)$, the tRNN reduces the runtime complexity to $O(T+L)$, which means that increasing either the sequence length $T$ or the network depth $L$ does not significantly increase the runtime.
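Putting (5)-(7) together, here is a NumPy sketch of one 2D tRNN step. The names are ours and the window placement follows the indexing of Eq. (24) in Appendix A.1; this is an illustration, not the authors' Torch7 implementation:

```python
import numpy as np

def trnn_step(x_t, H_prev, W_x, b_x, W_h, b_h):
    """One 2D tRNN update, Eqs. (5)-(7).
    Shapes: x_t (R,), H_prev (P, M), W_x (R, M), W_h (K, M, M_o), b_h (M_o,)."""
    P, M = H_prev.shape
    K, _, M_o = W_h.shape
    # Eq. (5): projected input on top, previous state shifted down one location.
    H_cat = np.vstack([(x_t @ W_x + b_x)[None, :], H_prev])      # (P+1, M)
    # Eq. (6): cross-layer convolution; out-of-range rows act as zero padding.
    A_t = np.zeros((P, M_o))
    for p in range(P):
        for k in range(K):
            src = p + 1 - (K - 1) // 2 + k   # 0-indexed form of p - (K-1)/2 + k
            if 0 <= src <= P:
                A_t[p] += H_cat[src] @ W_h[k]
    return np.tanh(A_t + b_h)                # Eq. (7)
```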
2.3 Extending to LSTMs
To allow the tRNN to capture long-range temporal dependencies, one can straightforwardly extend it to an LSTM by replacing the tRNN tensor update equations (6)-(7) as follows:
$$[A^g_t, A^i_t, A^f_t, A^o_t] = H^{cat}_{t-1} \circledast \{W^h, b^h\} \quad (10)$$
$$[G_t, I_t, F_t, O_t] = [\phi(A^g_t), \sigma(A^i_t), \sigma(A^f_t), \sigma(A^o_t)] \quad (11)$$
$$C_t = G_t \odot I_t + C_{t-1} \odot F_t \quad (12)$$
$$H_t = \phi(C_t) \odot O_t \quad (13)$$
where the kernel $\{W^h, b^h\}$ is of size $K$, with $M^i = M$ input channels and $M^o = 4M$ output channels, $A^g_t, A^i_t, A^f_t, A^o_t \in \mathbb{R}^{P \times M}$ are activations for the new content $G_t$, input gate $I_t$, forget gate $F_t$, and output gate $O_t$, respectively, $\sigma(\cdot)$ is the element-wise sigmoid function, and $C_t \in \mathbb{R}^{P \times M}$ is the memory cell. However, since in (12) the previous memory cell $C_{t-1}$ is only gated along the temporal direction (see Fig. 1(d)), long-range dependencies from the input to the output might be lost when the tensor size $P$ becomes large.
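A NumPy sketch of the gate computation (11)-(13), assuming the activation $A_t$ of Eq. (10) has already been produced by the cross-layer convolution with $M^o = 4M$; the names are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tlstm_cell_update(A_t, C_prev):
    """Eqs. (11)-(13): split the (P, 4M) activation into gates, update the cell.
    Without the memory cell convolution, C_prev is gated only along time."""
    A_g, A_i, A_f, A_o = np.split(A_t, 4, axis=1)
    G, I, F, O = np.tanh(A_g), sigmoid(A_i), sigmoid(A_f), sigmoid(A_o)
    C_t = G * I + C_prev * F      # Eq. (12)
    H_t = np.tanh(C_t) * O        # Eq. (13)
    return H_t, C_t
```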
Memory Cell Convolution. To capture long-range dependencies from multiple directions, we additionally introduce a novel memory cell convolution, by which the memory cells can have a larger receptive field (see Fig. 1(e)). We also dynamically generate this convolution kernel so that it is both time- and location-dependent, allowing for flexible control over long-range dependencies from different directions. This results in our tLSTM tensor update equations:
$$[A^g_t, A^i_t, A^f_t, A^o_t, A^q_t] = H^{cat}_{t-1} \circledast \{W^h, b^h\} \quad (14)$$
$$[G_t, I_t, F_t, O_t, Q_t] = [\phi(A^g_t), \sigma(A^i_t), \sigma(A^f_t), \sigma(A^o_t), \varsigma(A^q_t)] \quad (15)$$
$$W^c_t(p) = \mathrm{reshape}(q_{t,p}, [K, 1, 1]) \quad (16)$$
$$C^{conv}_{t-1} = C_{t-1} \circledast W^c_t(p) \quad (17)$$
$$C_t = G_t \odot I_t + C^{conv}_{t-1} \odot F_t \quad (18)$$
$$H_t = \phi(C_t) \odot O_t \quad (19)$$

Figure 2: Illustration of generating the memory cell convolution kernel, where (a) is for 2D tensors and (b) for 3D tensors.
where, in contrast to (10)-(13), the kernel $\{W^h, b^h\}$ has additional $\langle K \rangle$ output channels² to generate the activation $A^q_t \in \mathbb{R}^{P \times \langle K \rangle}$ for the dynamic kernel bank $Q_t \in \mathbb{R}^{P \times \langle K \rangle}$, $q_{t,p} \in \mathbb{R}^{\langle K \rangle}$ is the vectorized adaptive kernel at location $p$ of $Q_t$, and $W^c_t(p) \in \mathbb{R}^{K \times 1 \times 1}$ is the dynamic kernel of size $K$ with a single input/output channel, which is reshaped from $q_{t,p}$ (see Fig. 2(a) for an illustration). In (17), each channel of the previous memory cell $C_{t-1}$ is convolved with $W^c_t(p)$, whose values vary with $p$, forming a memory cell convolution (see Appendix A.2 for a more detailed definition), which produces a convolved memory cell $C^{conv}_{t-1} \in \mathbb{R}^{P \times M}$. Note that in (15) we employ a softmax function $\varsigma(\cdot)$ to normalize the channel dimension of $Q_t$, which, similar to [37], can stabilize the value of the memory cells and help to prevent vanishing/exploding gradients (see Appendix C for details).

²The operator $\langle \cdot \rangle$ returns the cumulative product of all elements in the input variable.
The idea of dynamically generating network weights has been used in many works [6, 14, 15, 23, 44, 46]; in [14], location-dependent convolutional kernels are also dynamically generated to improve CNNs. In contrast to these works, we focus on broadening the receptive field of the tLSTM memory cells. Whilst the flexibility is retained, fewer parameters are required to generate the kernel, since the kernel is shared by different memory cell channels.
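A 2D NumPy sketch of the memory cell convolution, Eqs. (15)-(17), with the replication padding of Appendix A.2; the names are ours, and an odd $K$ with a centered window is assumed for simplicity:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_cell_conv(C_prev, A_q):
    """Eqs. (15)-(17) in 2D: A_q has shape (P, K); C_prev has shape (P, M).
    Each location p mixes its K-row neighbourhood with its own kernel q_{t,p}."""
    P, M = C_prev.shape
    K = A_q.shape[1]
    Q = softmax(A_q, axis=1)           # Eq. (15): varsigma normalizes each kernel
    r = (K - 1) // 2                   # kernel radius for odd K
    # Appendix A.2: replicate boundary values so stored information is kept.
    C_pad = np.vstack([np.repeat(C_prev[:1], r, axis=0), C_prev,
                       np.repeat(C_prev[-1:], r, axis=0)])
    return np.stack([Q[p] @ C_pad[p:p + K] for p in range(P)])   # (P, M)
```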
Channel Normalization. To improve training, we adapt Layer Normalization (LN) [3] to our tLSTM. Similar to the observation in [3] that LN does not work well in CNNs, where channel vectors at different locations have very different statistics, we find that LN is also unsuitable for the tLSTM, where lower-level information is near the input while higher-level information is near the output. We therefore normalize the channel vectors at different locations with their own statistics, forming a Channel Normalization (CN), with its operator $CN(\cdot)$:
$$CN(Z; \Gamma, B) = \hat{Z} \odot \Gamma + B \quad (20)$$
where $Z, \hat{Z}, \Gamma, B \in \mathbb{R}^{P \times M^z}$ are the original tensor, normalized tensor, gain parameter, and bias parameter, respectively. The $m^z$-th channel of $Z$, i.e., $z_{m^z} \in \mathbb{R}^P$, is normalized element-wise:
$$\hat{z}_{m^z} = (z_{m^z} - z^\mu) / z^\sigma \quad (21)$$
where $z^\mu, z^\sigma \in \mathbb{R}^P$ are the mean and standard deviation along the channel dimension of $Z$, respectively, and $\hat{z}_{m^z} \in \mathbb{R}^P$ is the $m^z$-th channel of $\hat{Z}$. Note that the number of parameters introduced by CN/LN can be neglected, as it is very small compared to the number of other parameters in the model.
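A minimal NumPy sketch of CN (our names; the small epsilon is our addition for numerical stability):

```python
import numpy as np

def channel_norm(Z, gamma, beta, eps=1e-5):
    """Eqs. (20)-(21): each of the P locations is normalized with the mean and
    std of its own channel vector. Z, gamma, beta all have shape (P, M_z)."""
    mu = Z.mean(axis=1, keepdims=True)      # per-location mean over channels
    sd = Z.std(axis=1, keepdims=True)       # per-location std over channels
    return (Z - mu) / (sd + eps) * gamma + beta
```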
Using Higher-Dimensional Tensors. One can observe from (9) that, when fixing the kernel size $K$, the tensor size $P$ of a 2D tLSTM grows linearly w.r.t. its depth $L$. How can we expand the tensor volume more rapidly so that the network can be widened more efficiently? We can achieve this goal by leveraging higher-dimensional tensors. Based on the previous definitions for 2D tLSTMs, we replace the 2D tensors with $D$-dimensional ($D > 2$) tensors, obtaining $H_t, C_t \in \mathbb{R}^{P_1 \times P_2 \times \ldots \times P_{D-1} \times M}$ with the tensor size $P = [P_1, P_2, \ldots, P_{D-1}]$. Since the hidden states are no longer matrices, we concatenate the projection of $x_t$ to one corner of $H_{t-1}$, and thus (5) is extended as:
$$h^{cat}_{t-1,p} = \begin{cases} x_t W^x + b^x & \text{if } p_d = 1 \text{ for } d = 1, 2, \ldots, D-1 \\ h_{t-1,p-1} & \text{if } p_d > 1 \text{ for } d = 1, 2, \ldots, D-1 \\ 0 & \text{otherwise} \end{cases} \quad (22)$$
where $h^{cat}_{t-1,p} \in \mathbb{R}^M$ is the channel vector at location $p \in \mathbb{Z}^{D-1}_+$ of the concatenated hidden state $H^{cat}_{t-1} \in \mathbb{R}^{(P_1+1) \times (P_2+1) \times \ldots \times (P_{D-1}+1) \times M}$. For the tensor update, the convolution kernels $W^h$ and $W^c_t(\cdot)$ also increase their dimensionality, with kernel size $K = [K_1, K_2, \ldots, K_{D-1}]$. Note that $W^c_t(\cdot)$ is reshaped from the vector, as illustrated in Fig. 2(b). Correspondingly, we generate the output $y_t$ from the opposite corner of $H_{t+L-1}$, and therefore (8) is modified as:
$$y_t = \varphi(h_{t+L-1,P} W^y + b^y) \quad (23)$$
For convenience, we set $P_d = P$ and $K_d = K$ for $d = 1, 2, \ldots, D-1$, so that all dimensions of $P$ and $K$ can satisfy (9) with the same depth $L$. In addition, CN still normalizes the channel dimension of the tensors.
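For concreteness, here is a sketch of the corner concatenation of Eq. (22) for $D = 3$, with illustrative names of our own:

```python
import numpy as np

def concat_corner_3d(x_t, H_prev, W_x, b_x):
    """Eq. (22) with D = 3: inject the projected input at one corner, shift the
    previous state diagonally, and zero everything else. H_prev: (P1, P2, M)."""
    P1, P2, M = H_prev.shape
    H_cat = np.zeros((P1 + 1, P2 + 1, M))
    H_cat[0, 0] = x_t @ W_x + b_x    # p_d = 1 for all d: the input corner
    H_cat[1:, 1:] = H_prev           # p_d > 1 for all d: diagonal shift
    return H_cat                     # remaining entries stay zero
```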
3 Experiments
We evaluate the tLSTM on five challenging sequence learning tasks under different configurations:
(a) sLSTM (baseline): our implementation of the sLSTM [21] with parameters shared across all layers.
(b) 2D tLSTM: the standard 2D tLSTM, as defined in (14)-(19).
(c) 2D tLSTM–M: removing (–) memory (M) cell convolutions from (b), as defined in (10)-(13).
(d) 2D tLSTM–F: removing (–) feedback (F) connections from (b).
(e) 3D tLSTM: tensorizing (b) into a 3D tLSTM.
(f) 3D tLSTM+LN: applying (+) LN [3] to (e).
(g) 3D tLSTM+CN: applying (+) CN to (e), as defined in (20).
To compare different configurations, we also use $L$ to denote the number of layers of an sLSTM, and $M$ to denote the hidden size of each sLSTM layer. We set the kernel size $K$ to 2 for 2D tLSTM–F and to 3 for the other tLSTMs, in which case $L = P$ according to (9).
For each configuration, we fix the number of parameters and increase the tensor size to see whether the performance of the tLSTM can be boosted without additional parameters. We also investigate how the runtime is affected by the depth, where the runtime is measured as the average GPU milliseconds spent on a forward and backward pass over one timestep of a single example. Next, we compare the tLSTM against state-of-the-art methods to evaluate its ability. Finally, we visualize the internal working mechanism of the tLSTM. Please see Appendix D for training details.
3.1 Wikipedia Language Modeling

The Hutter Prize Wikipedia dataset [25] consists of 100 million characters drawn from an alphabet of 205 different characters, including letters, XML markup, and special symbols. We model the dataset at the character level, and try to predict the next character of the input sequence.

Figure 3: Performance and runtime of different configurations on Wikipedia.

We fix the number of parameters to 10M, corresponding to channel sizes $M$ of 1120 for the sLSTM and 2D tLSTM–F, 901 for the other 2D tLSTMs, and 522 for the 3D tLSTMs. All configurations are evaluated with depths $L = 1, 2, 3, 4$. We use bits-per-character (BPC) to measure model performance.
Results are shown in Fig. 3. When $L \le 2$, the sLSTM and 2D tLSTM–F outperform other models because of a larger $M$. As $L$ increases, the performances of the sLSTM and 2D tLSTM–M improve but saturate when $L \ge 3$, while tLSTMs with memory cell convolutions improve with increasing $L$ and finally outperform both the sLSTM and 2D tLSTM–M. When $L = 4$, 2D tLSTM–F is surpassed by the 2D tLSTM, which is in turn surpassed by the 3D tLSTM. The performance of 3D tLSTM+LN benefits from LN only when $L \le 2$, whereas 3D tLSTM+CN consistently improves over the 3D tLSTM across depths.
Table 1: Test BPC on Wikipedia.
Model                              BPC     # Param.
MI-LSTM [51]                       1.44    ≈17M
mLSTM [32]                         1.42    ≈20M
HyperLSTM+LN [23]                  1.34    26.5M
HM-LSTM+LN [11]                    1.32    ≈35M
Large RHN [54]                     1.27    ≈46M
Large FS-LSTM-4 [38]               1.245   ≈47M
2 × Large FS-LSTM-4 [38]           1.198   ≈94M
3D tLSTM+CN (L = 6, M = 1200)      1.264   50.1M
Whilst the runtime of the sLSTM is almost proportional to $L$, it is nearly constant in each tLSTM configuration, largely independent of $L$.

We compare a larger model, a 3D tLSTM+CN with $L = 6$ and $M = 1200$, to the state-of-the-art methods on the test set, as reported in Table 1. Our model achieves 1.264 BPC with 50.1M parameters, and is competitive with the best performing methods [38, 54] of similar parameter counts.
3.2 Algorithmic Tasks

Figure 4: Performance and runtime of different configurations on the addition (left) and memorization (right) tasks.

(a) Addition: The task is to sum two 15-digit integers. The network first reads two integers, one digit per timestep, and then predicts the summation. We follow the processing of [30], where a symbol '-' is used to delimit the integers as well as to pad the input/target sequence. A 3-digit integer addition task is of the form:
Input:  -123-900-----
Target: ---------1023

(b) Memorization: The goal of this task is to memorize a sequence of 20 random symbols. Similar to the addition task, we use 65 different symbols. A 5-symbol memorization task is of the form:
Input:  -abccb------
Target: -------abccb
We evaluate all configurations with $L = 1, 4, 7, 10$ on both tasks, where $M$ is 400 for addition and 100 for memorization. Performance is measured by the symbol prediction accuracy.
Fig. 4 shows the results. In both tasks, a large $L$ degrades the performances of the sLSTM and 2D tLSTM–M. In contrast, the performance of the 2D tLSTM–F steadily improves as $L$ increases, and is further enhanced by using feedback connections, higher-dimensional tensors, and CN, while LN helps only when $L = 1$. Note that in both tasks, the correct solution can be found (when 100% test accuracy is achieved) due to the repetitive nature of the task. In our experiments, we also observe that for the addition task, the 3D tLSTM+CN with $L = 7$ outperforms other configurations and finds the solution with only 298K training samples, while for the memorization task, the 3D tLSTM+CN with $L = 10$ beats other configurations and achieves perfect memorization after seeing 54K training samples. Also, unlike in the sLSTM, the runtime of all tLSTMs is largely unaffected by $L$.
Table 2: Test accuracies on two algorithmic tasks.
                              Addition             Memorization
Model                         Acc.    # Samp.      Acc.    # Samp.
Stacked LSTM [21]             51%     5M           >50%    900K
Grid LSTM [30]                >99%    550K         >99%    150K
3D tLSTM+CN (L = 7)           >99%    298K         >99%    115K
3D tLSTM+CN (L = 10)          >99%    317K         >99%    54K
We further compare the best performing configurations to the state-of-the-art methods for both tasks (see Table 2). Our models solve both tasks significantly faster (i.e., using fewer training samples) than other models, achieving new state-of-the-art results.
3.3 MNIST Image Classification

Figure 5: Performance and runtime of different configurations on sequential MNIST (left) and sequential pMNIST (right).

The MNIST dataset [35] consists of 50000/10000/10000 handwritten digit images of size 28×28 for training/validation/test. We have two tasks on this dataset:

(a) Sequential MNIST: The goal is to classify the digit after sequentially reading the pixels in scanline order [33]. This is therefore a 784-timestep sequence learning task where a single output is produced at the last timestep; the task requires very long-range dependencies in the sequence.

(b) Sequential Permuted MNIST: We permute the original image pixels in a fixed random order as in [2], resulting in a permuted MNIST (pMNIST) problem that has even longer-range dependencies across pixels and is harder.

In both tasks, all configurations are evaluated with $M = 100$ and $L = 1, 3, 5$. Model performance is measured by the classification accuracy.
Results are shown in Fig. 5. The sLSTM and 2D tLSTM–M no longer benefit from the increased depth when $L = 5$. Both increasing the depth and tensorization boost the performance of the 2D tLSTM. However, removing feedback connections from the 2D tLSTM seems not to affect performance. On the other hand, CN enhances the 3D tLSTM, and when $L \ge 3$ it outperforms LN. The 3D tLSTM+CN with $L = 5$ achieves the highest performance in both tasks, with a validation accuracy of 99.1% for MNIST and 95.6% for pMNIST. The runtime of tLSTMs is negligibly affected by $L$, and all tLSTMs become faster than the sLSTM when $L = 5$.
Figure 6: Visualization of the diagonal channel means of the tLSTM memory cells for each task. In each horizontal bar, the rows from top to bottom correspond to the diagonal locations from $p^{in}$ to $p^{out}$, the columns from left to right correspond to different timesteps (from 1 to $T+L-1$ for the full sequence, where $L-1$ is the time delay), and the values are normalized to be in the range $[0, 1]$ for better visualization. Both full sequences in (d) and (e) are zoomed out horizontally.
Table 3: Test accuracies (%) on sequential MNIST/pMNIST.
Model                        MNIST   pMNIST
iRNN [33]                    97.0    82.0
LSTM [2]                     98.2    88.0
uRNN [2]                     95.1    91.4
Full-capacity uRNN [49]      96.9    94.1
sTANH [53]                   98.1    94.0
BN-LSTM [13]                 99.0    95.4
Dilated GRU [8]              99.2    94.6
Dilated CNN [40] in [8]      98.3    96.7
3D tLSTM+CN (L = 3)          99.2    94.9
3D tLSTM+CN (L = 5)          99.0    95.7
We also compare the configurations with the highest test accuracies to the state-of-the-art methods (see Table 3). For sequential MNIST, our 3D tLSTM+CN with $L = 3$ performs as well as the state-of-the-art Dilated GRU model [8], with a test accuracy of 99.2%. For sequential pMNIST, our 3D tLSTM+CN with $L = 5$ has a test accuracy of 95.7%, which is close to the state-of-the-art 96.7% produced by the Dilated CNN [40] in [8].
3.4 Analysis
The experimental results of different model configurations on different tasks suggest that the performance of tLSTMs can be improved by increasing the tensor size and network depth, requiring no additional parameters and little additional runtime. As the network gets wider and deeper, we found that the memory cell convolution mechanism is crucial for maintaining the improvement in performance. Also, we found that feedback connections are useful for tasks with sequential output (e.g., our Wikipedia and algorithmic tasks). Moreover, the tLSTM can be further strengthened via tensorization or CN.
It is also intriguing to examine the internal working mechanism of the tLSTM. Thus, we visualize the memory cell, which gives insight into how information is routed. For each task, the best performing tLSTM is run on a random example. We record the channel mean (the mean over channels, e.g., of size $P \times P$ for 3D tLSTMs) of the memory cell at each timestep, and visualize the diagonal values of the channel mean from location $p^{in} = [1, 1]$ (near the input) to $p^{out} = [P, P]$ (near the output).
The visualization results in Fig. 6 reveal the distinct behaviors of the tLSTM when dealing with different tasks: (i) Wikipedia: the input can be carried to the output location with less modification if it is sufficient to determine the next character, and vice versa; (ii) addition: the first integer is gradually encoded into memories and then interacts (performs addition) with the second integer, producing the sum; (iii) memorization: the network behaves like a shift register that continues to move the input symbol to the output location at the correct timestep; (iv) sequential MNIST: the network is more sensitive to pixel value changes (representing the contour, or topology, of the digit) and can gradually accumulate evidence for the final prediction; (v) sequential pMNIST: the network is sensitive to high-value pixels (representing the foreground digit); we conjecture that this is because the permutation destroys the topology of the digit, making each high-value pixel potentially important.

From Fig. 6 we can also observe phenomena common to all tasks: (i) at each timestep, the values at different tensor locations are markedly different, implying that wider (larger) tensors can encode more information with less effort to compress it; (ii) from the input to the output, the values become increasingly distinct and are shifted in time, revealing that deep computations are indeed performed together with temporal computations, with long-range dependencies carried by memory cells.
Figure 7: Examples of models related to tLSTMs. (a) A single-layer cLSTM [48] with vector array input. (b) A 3-layer sLSTM [21]. (c) A 3-layer Grid LSTM [30]. (d) A 3-layer RHN [54]. (e) A 3-layer QRNN [7] with kernel size 2, where costly computations are done by temporal convolution.
4 Related Work
Convolutional LSTMs. Convolutional LSTMs (cLSTMs) have been proposed to parallelize the computation of LSTMs when the input at each timestep is structured (see Fig. 7(a)), e.g., a vector array [48], a vector matrix [41, 42, 50, 52], or a vector tensor [9, 45]. Unlike cLSTMs, the tLSTM aims to increase the capacity of LSTMs when the input at each timestep is non-structured, i.e., a single vector, and is advantageous over cLSTMs in that: (i) it performs the convolution across different hidden layers, whose structure is independent of the input structure, and integrates information bottom-up and top-down; the cLSTM instead performs the convolution within each hidden layer, whose structure is coupled with the input structure, and thus falls back to the vanilla LSTM if the input at each timestep is a single vector; (ii) it can be widened efficiently without additional parameters by increasing the tensor size, whereas the cLSTM can be widened only by increasing the kernel size or the number of kernel channels, which significantly increases the number of parameters; (iii) it can be deepened with little additional runtime by delaying the output, whereas the cLSTM can be deepened only by using more hidden layers, which significantly increases the runtime; (iv) it captures long-range dependencies from multiple directions through the memory cell convolution, whereas the cLSTM struggles to capture long-range dependencies from multiple directions, since its memory cells are only gated along one direction.
Deep LSTMs. Deep LSTMs (dLSTMs) extend sLSTMs by making them deeper (see Fig. 7(b)-(d)). To keep the number of parameters small and ease training, Graves [22], Kalchbrenner et al. [30], Mujika et al. [38], and Zilly et al. [54] apply another RNN/LSTM along the depth direction of dLSTMs, which, however, multiplies the runtime. Though there are implementations that accelerate the deep computation [1, 16], they generally aim at simple architectures such as sLSTMs. Compared with dLSTMs, the tLSTM performs the deep computation with little additional runtime, and employs a cross-layer convolution to enable a feedback mechanism. Moreover, the capacity of the tLSTM can be increased more efficiently by using higher-dimensional tensors, whereas in a dLSTM all hidden layers as a whole are only equivalent to a 2D tensor (i.e., a stack of hidden vectors), the dimensionality of which is fixed.
Other Parallelization Methods. Some methods [7, 8, 28, 29, 36, 40] parallelize the temporal computation over the sequence (e.g., using temporal convolution, as in Fig. 7(e)) during training, in which case the full input/target sequences are accessible. However, during online inference, when the input arrives sequentially, the temporal computations can no longer be parallelized and will be blocked by the deep computations at each timestep, making these methods potentially unsuitable for real-time applications that demand a high sampling/output frequency. Unlike these methods, the tLSTM can speed up not only training but also online inference for many tasks, since it performs the deep computation by means of the temporal computation. This is also human-like: we convert each signal to an action and meanwhile receive new signals in a non-blocking way. Note that for the online inference of tasks that use the previous output $y_{t-1}$ for the current input $x_t$ (e.g., autoregressive sequence generation), the tLSTM cannot parallelize the deep computation, since it needs to delay $L-1$ timesteps to obtain $y_{t-1}$.
5 Conclusion
We introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the
temporal computation to perform the deep computation for sequential tasks. We validated our model
on a variety of tasks, showing its potential over other popular approaches.
Acknowledgements
This work is supported by the NSFC grant 91220301, the Alan Turing Institute under the EPSRC
grant EP/N510129/1, and the China Scholarship Council.
References

[1] Jeremy Appleyard, Tomas Kocisky, and Phil Blunsom. Optimizing performance of recurrent neural networks on GPUs. arXiv preprint arXiv:1604.01946, 2016.
[2] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In ICML, 2016.
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[4] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE TNN, 5(2):157–166, 1994.
[5] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009.
[6] Luca Bertinetto, João F. Henriques, Jack Valmadre, Philip Torr, and Andrea Vedaldi. Learning feed-forward one-shot learners. In NIPS, 2016.
[7] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural networks. In ICLR, 2017.
[8] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark Hasegawa-Johnson, and Thomas Huang. Dilated recurrent neural networks. In NIPS, 2017.
[9] Jianxu Chen, Lin Yang, Yizhe Zhang, Mark Alber, and Danny Z. Chen. Combining fully convolutional and recurrent neural networks for 3D biomedical image segmentation. In NIPS, 2016.
[10] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. In ICML, 2015.
[11] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. In ICLR, 2017.
[12] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A Matlab-like environment for machine learning. In NIPS Workshop, 2011.
[13] Tim Cooijmans, Nicolas Ballas, César Laurent, and Aaron Courville. Recurrent batch normalization. In ICLR, 2017.
[14] Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In NIPS, 2016.
[15] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In NIPS, 2013.
[16] Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, and Sanjeev Satheesh. Persistent RNNs: Stashing recurrent weights on-chip. In ICML, 2016.
[17] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
[18] Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate tensorization: compressing convolutional and FC layers alike. In NIPS Workshop, 2016.
[19] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
[20] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
[21] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[22] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.
[23] David Ha, Andrew Dai, and Quoc V. Le. HyperNetworks. In ICLR, 2017.
[24] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[25] Marcus Hutter. The human knowledge compression contest. URL http://prize.hutter1.net, 2012.
[26] Ozan Irsoy and Claire Cardie. Modeling compositionality with multiplicative recurrent neural networks. In ICLR, 2015.
[27] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In ICML, 2015.
[28] Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In NIPS, 2016.
[29] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In ICLR, 2016.
[30] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. In ICLR, 2016.
[31] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[32] Ben Krause, Liang Lu, Iain Murray, and Steve Renals. Multiplicative LSTM for sequence modelling. In ICLR Workshop, 2017.
[33] Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
[34] Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[35] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[36] Tao Lei and Yu Zhang. Training RNNs as fast as CNNs. arXiv preprint arXiv:1709.02755, 2017.
[37] Gundram Leifert, Tobias Strauß, Tobias Grüning, Welf Wustlich, and Roger Labahn. Cells in multidimensional recurrent neural networks. JMLR, 17(1):3313–3349, 2016.
[38] Asier Mujika, Florian Meier, and Angelika Steger. Fast-slow recurrent neural networks. In NIPS, 2017.
[39] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P. Vetrov. Tensorizing neural networks. In NIPS, 2015.
[40] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[41] Viorica Patraucean, Ankur Handa, and Roberto Cipolla. Spatio-temporal video autoencoder with differentiable memory. In ICLR Workshop, 2016.
[42] Bernardino Romera-Paredes and Philip Hilaire Sean Torr. Recurrent instance segmentation. In ECCV, 2016.
[43] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
[44] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
[45] Marijn F. Stollenga, Wonmin Byeon, Marcus Liwicki, and Juergen Schmidhuber. Parallel multi-dimensional LSTM, with application to fast biomedical volumetric image segmentation. In NIPS, 2015.
[46] Ilya Sutskever, James Martens, and Geoffrey E. Hinton. Generating text with recurrent neural networks. In ICML, 2011.
[47] Graham W. Taylor and Geoffrey E. Hinton. Factored conditional restricted Boltzmann machines for modeling motion style. In ICML, 2009.
[48] Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
[49] Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Full-capacity unitary recurrent neural networks. In NIPS, 2016.
[50] Lin Wu, Chunhua Shen, and Anton van den Hengel. Deep recurrent convolutional networks for video-based person re-identification: An end-to-end approach. arXiv preprint arXiv:1606.01609, 2016.
[51] Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan Salakhutdinov. On multiplicative integration with recurrent neural networks. In NIPS, 2016.
[52] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
[53] Saizheng Zhang, Yuhuai Wu, Tong Che, Zhouhan Lin, Roland Memisevic, Ruslan R. Salakhutdinov, and Yoshua Bengio. Architectural complexity measures of recurrent neural networks. In NIPS, 2016.
[54] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks. In ICML, 2017.
A Mathematical Definition for Cross-Layer Convolutions
A.1 Hidden State Convolution
The hidden state convolution in (6) is defined as:
$$A_{t,p,m^o} = \sum_{k=1}^{K} \sum_{m^i=1}^{M^i} H^{cat}_{t-1,\, p - \frac{K-1}{2} + k,\, m^i} \cdot W^h_{k, m^i, m^o} + b^h_{m^o} \quad (24)$$
where $m^o \in \{1, 2, \dots, M^o\}$ and zero padding is applied to keep the tensor size.
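The following NumPy sketch vectorizes Eq. (24) over output locations; the padding amounts are our reading of the index convention above (odd $K \ge 3$ assumed), and the names are illustrative:

```python
import numpy as np

def cross_layer_conv(H_cat, W_h, b_h):
    """Eq. (24), vectorized over output locations.
    H_cat: (P+1, M_i); W_h: (K, M_i, M_o); returns A_t of shape (P, M_o)."""
    P = H_cat.shape[0] - 1
    K, M_i, M_o = W_h.shape
    # zero padding keeps P output locations despite the (P+1)-row input
    pad_top = max(0, (K - 1) // 2 - 1)
    pad_bot = K - 1 - (K - 1) // 2
    H_pad = np.vstack([np.zeros((pad_top, M_i)), H_cat, np.zeros((pad_bot, M_i))])
    windows = np.stack([H_pad[p:p + K] for p in range(P)])   # (P, K, M_i)
    return np.tensordot(windows, W_h, axes=([1, 2], [0, 1])) + b_h
```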
A.2 Memory Cell Convolution
The memory cell convolution in (17) is defined as:
$$C^{conv}_{t-1,p,m} = \sum_{k=1}^{K} C_{t-1,\, p - \frac{K-1}{2} + k,\, m} \cdot W^c_{t,k,1,1}(p) \quad (25)$$
To prevent the stored information from being flushed away, $C_{t-1}$ is padded with the replication of its boundary values instead of zeros or input projections.
B Derivation for the Constraint of L, P, and K

Figure 8: Illustration of calculating the constraint of $L$, $P$, and $K$. Each column is a concatenated hidden state tensor with tensor size $P+1=4$ and channel size $M$. The volume of the output receptive field (blue region) is determined by the kernel radius $K^r$. The output $y_t$ for the current timestep $t$ is delayed by $L-1=2$ timesteps.

Here we derive the constraint on $L$, $P$, and $K$ that is defined in (9). The kernel center location is ceiled in case the kernel size $K$ is not odd. Then, the kernel radius $K^r$ can be calculated as:
$$K^r = \frac{K - K \bmod 2}{2} \quad (26)$$
As shown in Fig. 8, to guarantee that the receptive field of $y_t$ covers $x_{1:t}$ but does not cover $x_{t+1:T}$, the following constraint should be satisfied:
$$\tan \angle AOD \le \tan \angle BOD < \tan \angle COD \quad (27)$$
which means:
$$\frac{P}{L} \le \frac{K^r}{1} < \frac{P}{L-1} \quad (28)$$
Plugging (26) into (28), we get:
$$L = \left\lceil \frac{2P}{K - K \bmod 2} \right\rceil \quad (29)$$
C Memory Cell Convolution Helps to Prevent Vanishing/Exploding Gradients

Leifert et al. [37] have proved that the lambda gate, which is very similar to our memory cell convolution kernel, can help to prevent vanishing/exploding gradients (see Theorems 17-18 in [37]). The differences between our approach and their lambda gate are: (i) we normalize the kernel values through a softmax function, while they normalize the gate values by dividing them by their sum, and (ii) we share the kernel across all channels, while they do not. However, as neither modification affects the conditions of validity of Theorems 17-18 in [37], our memory cell convolution can also help to prevent vanishing/exploding gradients.
D Training Details
D.1 Objective Function
The training objective is to minimize the negative log-likelihood (NLL) of the training sequences w.r.t. the (vectorized) parameter $\theta$, i.e.,
$$\min_\theta \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} -\ln p\left(y^d_{n,t} \mid f(x^d_{n,1:t}; \theta)\right) \quad (30)$$
where $N$ is the number of training sequences, $T_n$ the length of the $n$-th training sequence, and $p(y^d_{n,t} \mid f(x^d_{n,1:t}; \theta))$ the likelihood of the target $y^d_{n,t}$ conditioned on its prediction $y_{n,t} = f(x^d_{n,1:t}; \theta)$. Since all experiments are classification problems, $y^d_{n,t}$ is represented as the one-hot encoding of the class label, and the output function $\varphi(\cdot)$ is defined as a softmax function, which is used to generate the class distribution $y_{n,t}$. Then, the likelihood can be calculated as $p(y^d_{n,t} \mid y_{n,t}) = y_{n,t,s}$ with $s$ such that $y^d_{n,t,s} = 1$.
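A sketch of the per-sequence NLL of Eq. (30) for this classification setting, with names of our own:

```python
import numpy as np

def sequence_nll(probs, targets):
    """Eq. (30), one sequence: probs[t] is the softmax output y_{n,t} over the
    S classes, targets[t] the index s of the one-hot label y^d_{n,t}."""
    return -np.log(probs[np.arange(len(targets)), targets]).sum()
```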
D.2 Common Settings
In all tasks, the NLL (see (30)) is used as the training objective and is minimized by Adam [31] with a learning rate of 0.001. Forget gate biases are set to 4 for the image classification tasks and to 1 [27] for the others. All models are implemented in Torch7 [12] and accelerated by cuDNN on Tesla K80 GPUs. We only apply CN to the output of the tLSTM hidden state, as we have tried different combinations and found this to be the most robust way to consistently improve performance across all tasks. With CN, the output of the hidden state becomes:
$$H_t = \phi(CN(C_t; \Gamma, B)) \odot O_t \quad (31)$$
D.3 Wikipedia Language Modeling
As in [10], we split the dataset into 90M/5M/5M characters for training/validation/test. In each iteration, we feed the model a mini-batch of 100 subsequences of length 50. During the forward pass, the hidden values at the last timestep are preserved to initialize the next iteration. We terminate training after 50 epochs.
D.4 Algorithmic Tasks
Following [30], for both tasks we randomly generate 5M samples for training and 100 samples for test, and set the mini-batch size to 15. Training proceeds for at most 1 epoch³ and is terminated early if 100% test accuracy is achieved.

³To simulate the online learning process, we use all training samples only once.
D.5 MNIST Image Classification

We set the mini-batch size to 50 and use early stopping for training. The training loss is calculated at the last timestep.