Wider and Deeper, Cheaper and Faster:
Tensorized LSTMs for Sequence Learning
Zhen He (1,2), Shaobing Gao (3), Liang Xiao (2), Daxue Liu (2), Hangen He (2), and David Barber (1,4)
(1) University College London, (2) National University of Defense Technology, (3) Sichuan University, (4) Alan Turing Institute
Abstract
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability
of Recurrent Neural Networks to store longer term temporal information. The
capacity of an LSTM network can be increased by widening and adding layers.
However, usually the former introduces additional parameters, while the latter
increases the runtime. As an alternative we propose the Tensorized LSTM in
which the hidden states are represented by tensors and updated via a cross-layer
convolution. By increasing the tensor size, the network can be widened efficiently
without additional parameters since the parameters are shared across different
locations in the tensor; by delaying the output, the network can be deepened
implicitly with little additional runtime since deep computations for each timestep
are merged into temporal computations of the sequence. Experiments conducted on
five challenging sequence learning tasks show the potential of
the proposed model.
1 Introduction
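We consider the time-series prediction task of producing a desired output $\mathbf{y}_t$ at each timestep $t \in \{1, \dots, T\}$ given an observed input sequence $\mathbf{x}_{1:t} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_t\}$, where $\mathbf{x}_t \in \mathbb{R}^R$ and $\mathbf{y}_t \in \mathbb{R}^S$ are vectors¹. The Recurrent Neural Network (RNN) [17, 43] is a powerful model that learns how to use a hidden state vector $\mathbf{h}_t \in \mathbb{R}^M$ to encapsulate the relevant features of the entire input history $\mathbf{x}_{1:t}$ up to timestep $t$. Let $\mathbf{h}^{\mathrm{cat}}_{t-1} \in \mathbb{R}^{R+M}$ be the concatenation of the current input $\mathbf{x}_t$ and the previous hidden state $\mathbf{h}_{t-1}$:
$$\mathbf{h}^{\mathrm{cat}}_{t-1} = [\mathbf{x}_t, \mathbf{h}_{t-1}] \tag{1}$$
The update of the hidden state $\mathbf{h}_t$ is defined as:
$$\mathbf{a}_t = \mathbf{h}^{\mathrm{cat}}_{t-1} \mathbf{W}^h + \mathbf{b}^h \tag{2}$$
$$\mathbf{h}_t = \phi(\mathbf{a}_t) \tag{3}$$
where $\mathbf{W}^h \in \mathbb{R}^{(R+M) \times M}$ is the weight, $\mathbf{b}^h \in \mathbb{R}^M$ the bias, $\mathbf{a}_t \in \mathbb{R}^M$ the hidden activation, and $\phi(\cdot)$ the element-wise tanh function. Finally, the output $\mathbf{y}_t$ at timestep $t$ is generated by:
$$\mathbf{y}_t = \varphi(\mathbf{h}_t \mathbf{W}^y + \mathbf{b}^y) \tag{4}$$
where $\mathbf{W}^y \in \mathbb{R}^{M \times S}$ and $\mathbf{b}^y \in \mathbb{R}^S$, and $\varphi(\cdot)$ can be any differentiable function, depending on the task.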
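As an illustration of (1)-(3), the following is a minimal pure-Python sketch of one vanilla RNN step using row vectors. The helper names (`matvec_row`, `rnn_step`) and the constant toy weights are assumptions made for this example and do not come from the paper.

```python
import math

def matvec_row(v, W, b):
    """Row vector times matrix plus bias: v W + b."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) + b[j]
            for j in range(len(b))]

def rnn_step(x_t, h_prev, W_h, b_h):
    """Eqs. (1)-(3): concatenate, apply the affine map, squash with tanh."""
    h_cat = x_t + h_prev                  # eq. (1), a vector of length R + M
    a_t = matvec_row(h_cat, W_h, b_h)     # eq. (2)
    return [math.tanh(a) for a in a_t]    # eq. (3)

# Toy sizes R = 2, M = 3 with hypothetical constant weights.
R, M = 2, 3
W_h = [[0.1] * M for _ in range(R + M)]
b_h = [0.0] * M
h = [0.0] * M
for x_t in ([1.0, 0.0], [0.0, 1.0]):
    h = rnn_step(x_t, h, W_h, b_h)
```

The weight matrix has (R+M) x M entries, which is the source of the quadratic parameter growth discussed later.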
However, this vanilla RNN has difficulties in modeling long-range dependencies due to the vanishing/exploding gradient problem [4]. Long Short-Term Memories (LSTMs) [19, 24] alleviate these problems by employing memory cells to preserve information for longer, and adopting gating mechanisms to modulate the information flow. Given the success of the LSTM in sequence modeling, it is natural to consider how to increase the complexity of the model and thereby increase the set of tasks for which the LSTM can be profitably applied.

Corresponding authors: Shaobing Gao <gaoshaobing@scu.edu.cn> and Zhen He <hezhen.cs@gmail.com>.
¹ Vectors are assumed to be in row form throughout this paper.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

We consider the capacity of a network to consist of two components: the width (the amount of information handled in parallel) and the depth (the number of computation steps) [5]. A naive way to widen the LSTM is to increase the number of units in a hidden layer; however, the parameter number scales quadratically with the number of units. To deepen the LSTM, the popular Stacked LSTM (sLSTM) stacks multiple LSTM layers [20]; however, runtime is proportional to the number of layers and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers.
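The quadratic scaling can be made concrete with a small count. The sketch below assumes a standard LSTM layer without peephole connections, with four gates that each have an (R+M) x M weight and an M-dimensional bias; the function name is hypothetical.

```python
def lstm_layer_params(R, M):
    """Weights and biases of one LSTM layer: four gates, each (R+M) x M plus M."""
    return 4 * ((R + M) * M + M)

# Doubling the width roughly quadruples the parameter count.
for M in (256, 512, 1024):
    print(M, lstm_layer_params(R=M, M=M))
```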
In this paper, we introduce a way to both widen and deepen the LSTM whilst keeping the parameter
number and runtime largely unchanged. In summary, we make the following contributions:
(a) We tensorize RNN hidden state vectors into higher-dimensional tensors which allow more flexible parameter sharing and can be widened more efficiently without additional parameters.
(b) Based on (a), we merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).
(c) We extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a novel memory cell convolution to help to prevent the vanishing/exploding gradients.
2 Method
2.1 Tensorizing Hidden States
It can be seen from (2) that in an RNN, the parameter number scales quadratically with the size of the hidden state. A popular way to limit the parameter number when widening the network is to organize parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements [6, 15, 18, 26, 32, 39, 46, 47, 51], which is known as tensor factorization. This implicitly widens the network since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. Another common way to reduce the parameter number is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs) [34, 35].
We adopt parameter sharing to cut down the parameter number for RNNs, since compared with factorization, it has the following advantages: (i) scalability, i.e., the number of shared parameters can be set independently of the hidden state size, and (ii) separability, i.e., the information flow can be carefully managed by controlling the receptive field, allowing one to shift RNN deep computations to the temporal domain (see Sec. 2.2). We also explicitly tensorize the RNN hidden state vectors, since compared with vectors, tensors offer better: (i) flexibility, i.e., one can specify the dimensions along which parameters are shared and then simply increase the size of those dimensions without introducing additional parameters, and (ii) efficiency, i.e., with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the parameter number (see Sec. 2.3).
For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state $\mathbf{h}_t \in \mathbb{R}^M$ to become $\mathbf{H}_t \in \mathbb{R}^{P \times M}$, where $P$ is the tensor size, and $M$ the channel size. We locally-connect the first dimension of $\mathbf{H}_t$ in order to share parameters, and fully-connect the second dimension of $\mathbf{H}_t$ to allow global interactions. This is analogous to the CNN which fully-connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. Also, if one compares $\mathbf{H}_t$ to the hidden state of a Stacked RNN (sRNN) (see Fig. 1(a)), then $P$ is akin to the number of stacked hidden layers, and $M$ the size of each hidden layer. We start to describe our model based on 2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.
2.2 Merging Deep Computations
Since an RNN is already deep in its temporal direction, we can deepen an input-to-output computation by associating the input $\mathbf{x}_t$ with a (delayed) future output. In doing this, we need to ensure that the output $\mathbf{y}_t$ is separable, i.e., not influenced by any future input $\mathbf{x}_{t'}$ ($t' > t$). Thus, we concatenate the projection of $\mathbf{x}_t$ to the top of the previous hidden state $\mathbf{H}_{t-1}$, then gradually shift the input information down when the temporal computation proceeds, and finally generate $\mathbf{y}_t$ from the bottom of $\mathbf{H}_{t+L-1}$, where $L-1$ is the number of delayed timesteps for computations of depth $L$. An example with $L=3$ is shown in Fig. 1(b). This is in fact a skewed sRNN as used in [1] (also similar to [48]). However, our method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable, e.g., one can increase the local connections and use feedback (see Fig. 1(c)), which can be beneficial for sRNNs [10]. In order to share parameters, we update $\mathbf{H}_t$ using a convolution with a learnable kernel. In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).

Figure 1: Examples of sRNN, tRNNs and tLSTMs. (a) A 3-layer sRNN. (b) A 2D tRNN without (–) feedback (F) connections, which can be thought of as a skewed version of (a). (c) A 2D tRNN. (d) A 2D tLSTM without (–) memory (M) cell convolutions. (e) A 2D tLSTM. In each model, the blank circles in columns 1 to 4 denote the hidden state at timesteps $t-1$ to $t+2$, respectively, and the blue region denotes the receptive field of the current output $\mathbf{y}_t$. In (b)-(e), the outputs are delayed by $L-1=2$ timesteps, where $L=3$ is the depth.

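To describe the resulting tRNN model, let $\mathbf{H}^{\mathrm{cat}}_{t-1} \in \mathbb{R}^{(P+1) \times M}$ be the concatenated hidden state, and $p \in \mathbb{Z}_+$ the location at a tensor. The channel vector $\mathbf{h}^{\mathrm{cat}}_{t-1,p} \in \mathbb{R}^M$ at location $p$ of $\mathbf{H}^{\mathrm{cat}}_{t-1}$ is defined as:
$$\mathbf{h}^{\mathrm{cat}}_{t-1,p} = \begin{cases} \mathbf{x}_t \mathbf{W}^x + \mathbf{b}^x & \text{if } p = 1 \\ \mathbf{h}_{t-1,p-1} & \text{if } p > 1 \end{cases} \tag{5}$$
where $\mathbf{W}^x \in \mathbb{R}^{R \times M}$ and $\mathbf{b}^x \in \mathbb{R}^M$. Then, the update of tensor $\mathbf{H}_t$ is implemented via a convolution:
$$\mathbf{A}_t = \mathbf{H}^{\mathrm{cat}}_{t-1} \circledast \{\mathbf{W}^h, \mathbf{b}^h\} \tag{6}$$
$$\mathbf{H}_t = \phi(\mathbf{A}_t) \tag{7}$$
where $\mathbf{W}^h \in \mathbb{R}^{K \times M^i \times M^o}$ is the kernel weight of size $K$, with $M^i = M$ input channels and $M^o = M$ output channels, $\mathbf{b}^h \in \mathbb{R}^{M^o}$ is the kernel bias, $\mathbf{A}_t \in \mathbb{R}^{P \times M^o}$ is the hidden activation, and $\circledast$ is the convolution operator (see Appendix A.1 for a more detailed definition). Since the kernel convolves across different hidden layers, we call it the cross-layer convolution. The kernel enables interaction, both bottom-up and top-down across layers. Finally, we generate $\mathbf{y}_t$ from the channel vector $\mathbf{h}_{t+L-1,P} \in \mathbb{R}^M$ which is located at the bottom of $\mathbf{H}_{t+L-1}$:
$$\mathbf{y}_t = \varphi(\mathbf{h}_{t+L-1,P} \mathbf{W}^y + \mathbf{b}^y) \tag{8}$$
where $\mathbf{W}^y \in \mathbb{R}^{M \times S}$ and $\mathbf{b}^y \in \mathbb{R}^S$. To guarantee that the receptive field of $\mathbf{y}_t$ only covers the current and previous inputs $\mathbf{x}_{1:t}$ (see Fig. 1(c)), $L$, $P$, and $K$ should satisfy the constraint:
$$L = \left\lceil \frac{2P}{K - K \bmod 2} \right\rceil \tag{9}$$
where $\lceil \cdot \rceil$ is the ceil operation. For the derivation of (9), please see Appendix B.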
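Constraint (9) is simple to evaluate numerically. The sketch below uses a hypothetical function name and rejects kernel sizes below 2, for which the denominator of (9) would be zero; that guard is an assumption of this example.

```python
def trnn_depth(P, K):
    """Depth L from eq. (9): L = ceil(2P / (K - K mod 2))."""
    if P < 1 or K < 2:
        raise ValueError("expects P >= 1 and K >= 2")
    step = K - K % 2              # even part of the kernel size
    return -(-2 * P // step)      # integer ceiling division

# For K = 2 and K = 3 the depth equals the tensor size P.
print(trnn_depth(P=3, K=2), trnn_depth(P=3, K=3))
```

With the kernel sizes 2 and 3 the function returns `P`, which matches the statement in Sec. 3 that $L = P$ in those settings.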
We call the model defined in (5)-(8) the Tensorized RNN (tRNN). The model can be widened by increasing the tensor size $P$, whilst the parameter number remains fixed (thanks to the convolution). Also, unlike the sRNN of runtime complexity $O(TL)$, tRNN breaks down the runtime complexity to $O(T+L)$, which means either increasing the sequence length $T$ or the network depth $L$ would not significantly increase the runtime.
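The update (5)-(7) can be sketched in pure Python for a 2D hidden state. The alignment of kernel taps and the zero padding used below are assumptions chosen to be consistent with constraint (9); Appendix A.1 holds the authors' exact definition of the convolution. All function names are hypothetical.

```python
import math

def trnn_step(x_proj, h_prev, w, b):
    """One tRNN update, eqs. (5)-(7), for a 2D hidden state.

    x_proj : list[M]        projected input x_t W^x + b^x
    h_prev : list[P][M]     previous hidden state H_{t-1}
    w      : list[K][M][M]  kernel weight W^h (tap, in-channel, out-channel)
    b      : list[M]        kernel bias b^h
    """
    P, M, K = len(h_prev), len(b), len(w)
    h_cat = [x_proj] + h_prev            # eq. (5): (P+1) x M
    h_new = []
    for p in range(P):
        a = list(b)
        for k in range(K):
            row = p + 1 - K // 2 + k     # assumed alignment of kernel taps
            if 0 <= row <= P:            # rows outside H^cat act as zero padding
                for i in range(M):
                    for o in range(M):
                        a[o] += h_cat[row][i] * w[k][i][o]
        h_new.append([math.tanh(v) for v in a])   # eq. (7)
    return h_new

def first_arrival(P, K, steps=10):
    """Timestep at which an input fed at t = 0 first reaches the bottom location."""
    w = [[[1.0]] for _ in range(K)]      # M = 1, all-ones kernel
    h = [[0.0] for _ in range(P)]
    for t in range(steps):
        x_proj = [1.0] if t == 0 else [0.0]
        h = trnn_step(x_proj, h, w, [0.0])
        if h[-1][0] != 0.0:
            return t
    return None

print(first_arrival(P=3, K=3))   # delay in timesteps before the bottom location responds
```

Under these assumptions the measured delay equals $L-1$ with $L$ given by (9). Each call to `trnn_step` touches all $P$ locations once, and in a parallel implementation those locations are independent, which is what lets the depth be absorbed into the temporal loop.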
2.3 Extending to LSTMs
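To allow the tRNN to capture long-range temporal dependencies, one can straightforwardly extend it to an LSTM by replacing the tRNN tensor update equations of (6)-(7) as follows:
$$[\mathbf{A}^g_t, \mathbf{A}^i_t, \mathbf{A}^f_t, \mathbf{A}^o_t] = \mathbf{H}^{\mathrm{cat}}_{t-1} \circledast \{\mathbf{W}^h, \mathbf{b}^h\} \tag{10}$$
$$[\mathbf{G}_t, \mathbf{I}_t, \mathbf{F}_t, \mathbf{O}_t] = [\phi(\mathbf{A}^g_t), \sigma(\mathbf{A}^i_t), \sigma(\mathbf{A}^f_t), \sigma(\mathbf{A}^o_t)] \tag{11}$$
$$\mathbf{C}_t = \mathbf{G}_t \odot \mathbf{I}_t + \mathbf{C}_{t-1} \odot \mathbf{F}_t \tag{12}$$
$$\mathbf{H}_t = \phi(\mathbf{C}_t) \odot \mathbf{O}_t \tag{13}$$
where the kernel $\{\mathbf{W}^h, \mathbf{b}^h\}$ is of size $K$, with $M^i = M$ input channels and $M^o = 4M$ output channels, $\mathbf{A}^g_t, \mathbf{A}^i_t, \mathbf{A}^f_t, \mathbf{A}^o_t \in \mathbb{R}^{P \times M}$ are activations for the new content $\mathbf{G}_t$, input gate $\mathbf{I}_t$, forget gate $\mathbf{F}_t$, and output gate $\mathbf{O}_t$, respectively, $\sigma(\cdot)$ is the element-wise sigmoid function, and $\mathbf{C}_t \in \mathbb{R}^{P \times M}$ is the memory cell. However, since in (12) the previous memory cell $\mathbf{C}_{t-1}$ is only gated along the temporal direction (see Fig. 1(d)), long-range dependencies from the input to output might be lost when the tensor size $P$ becomes large.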
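Memory Cell Convolution. To capture long-range dependencies from multiple directions, we additionally introduce a novel memory cell convolution, by which the memory cells can have a larger receptive field (see Fig. 1(e)). We also dynamically generate this convolution kernel so that it is both time- and location-dependent, allowing for flexible control over long-range dependencies from different directions. This results in our tLSTM tensor update equations:
$$[\mathbf{A}^g_t, \mathbf{A}^i_t, \mathbf{A}^f_t, \mathbf{A}^o_t, \mathbf{A}^q_t] = \mathbf{H}^{\mathrm{cat}}_{t-1} \circledast \{\mathbf{W}^h, \mathbf{b}^h\} \tag{14}$$
$$[\mathbf{G}_t, \mathbf{I}_t, \mathbf{F}_t, \mathbf{O}_t, \mathbf{Q}_t] = [\phi(\mathbf{A}^g_t), \sigma(\mathbf{A}^i_t), \sigma(\mathbf{A}^f_t), \sigma(\mathbf{A}^o_t), \varsigma(\mathbf{A}^q_t)] \tag{15}$$
$$\mathbf{W}^c_t(p) = \mathrm{reshape}(\mathbf{q}_{t,p}, [K, 1, 1]) \tag{16}$$
$$\mathbf{C}^{\mathrm{conv}}_{t-1} = \mathbf{C}_{t-1} \circledast \mathbf{W}^c_t(p) \tag{17}$$
$$\mathbf{C}_t = \mathbf{G}_t \odot \mathbf{I}_t + \mathbf{C}^{\mathrm{conv}}_{t-1} \odot \mathbf{F}_t \tag{18}$$
$$\mathbf{H}_t = \phi(\mathbf{C}_t) \odot \mathbf{O}_t \tag{19}$$
where, in contrast to (10)-(13), the kernel $\{\mathbf{W}^h, \mathbf{b}^h\}$ has additional $\langle K \rangle$ output channels² to generate the activation $\mathbf{A}^q_t \in \mathbb{R}^{P \times \langle K \rangle}$ for the dynamic kernel bank $\mathbf{Q}_t \in \mathbb{R}^{P \times \langle K \rangle}$, $\mathbf{q}_{t,p} \in \mathbb{R}^{\langle K \rangle}$ is the vectorized adaptive kernel at the location $p$ of $\mathbf{Q}_t$, and $\mathbf{W}^c_t(p) \in \mathbb{R}^{K \times 1 \times 1}$ is the dynamic kernel of size $K$ with a single input/output channel, which is reshaped from $\mathbf{q}_{t,p}$ (see Fig. 2(a) for an illustration). In (17), each channel of the previous memory cell $\mathbf{C}_{t-1}$ is convolved with $\mathbf{W}^c_t(p)$ whose values vary with $p$, forming a memory cell convolution (see Appendix A.2 for a more detailed definition), which produces a convolved memory cell $\mathbf{C}^{\mathrm{conv}}_{t-1} \in \mathbb{R}^{P \times M}$. Note that in (15) we employ a softmax function $\varsigma(\cdot)$ to normalize the channel dimension of $\mathbf{Q}_t$, which, similar to [37], can stabilize the value of memory cells and help to prevent the vanishing/exploding gradients (see Appendix C for details).

Figure 2: Illustration of generating the memory cell convolution kernel, where (a) is for 2D tensors and (b) for 3D tensors.
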
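A minimal sketch of the memory cell convolution (16)-(18) for 2D tensors follows. The centred tap alignment and zero padding at the tensor borders are assumptions of this example (the exact definition is in Appendix A.2), and the function names are hypothetical. The gates and the kernel bank are taken as already activated.

```python
import math

def softmax(v):
    """Normalise a kernel vector so that its entries are positive and sum to one."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def memory_cell_conv(c_prev, q):
    """Eqs. (16)-(17) for 2D tensors.

    c_prev : list[P][M]  previous memory cell C_{t-1}
    q      : list[P][K]  kernel bank Q_t, one kernel per location
    """
    P, M, K = len(c_prev), len(c_prev[0]), len(q[0])
    c_conv = []
    for p in range(P):
        out = [0.0] * M
        for k in range(K):
            row = p - K // 2 + k         # assumed centred alignment
            if 0 <= row < P:             # zero padding outside the tensor
                for m in range(M):       # the same kernel is shared by all channels
                    out[m] += q[p][k] * c_prev[row][m]
        c_conv.append(out)
    return c_conv

def cell_update(g, i, f, c_conv):
    """Eq. (18): C_t = G_t * I_t + C^conv_{t-1} * F_t, element-wise."""
    return [[g[p][m] * i[p][m] + c_conv[p][m] * f[p][m]
             for m in range(len(g[p]))] for p in range(len(g))]
```

Because each location's kernel is a softmax output, the convolved cell is a weighted average of neighbouring cells (away from the borders), so the magnitude of the memory cannot grow through the convolution alone.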
The idea of dynamically generating network weights has been used in many works [6, 14, 15, 23, 44, 46], where in [14] location-dependent convolutional kernels are also dynamically generated to improve CNNs. In contrast to these works, we focus on broadening the receptive field of tLSTM memory cells. Whilst the flexibility is retained, fewer parameters are required to generate the kernel since the kernel is shared by different memory cell channels.
² The operator $\langle \cdot \rangle$ returns the cumulative product of all elements in the input variable.

Channel Normalization. To improve training, we adapt Layer Normalization (LN) [3] to our tLSTM. Similar to the observation in [3] that LN does not work well in CNNs where channel vectors at different locations have very different statistics, we find that LN is also unsuitable for tLSTM where lower level information is near the input while higher level information is near the output. We therefore normalize the channel vectors at different locations with their own statistics, forming a Channel Normalization (CN), with its operator $\mathrm{CN}(\cdot)$:
$$\mathrm{CN}(\mathbf{Z}; \boldsymbol{\Gamma}, \mathbf{B}) = \hat{\mathbf{Z}} \odot \boldsymbol{\Gamma} + \mathbf{B} \tag{20}$$
where $\mathbf{Z}, \hat{\mathbf{Z}}, \boldsymbol{\Gamma}, \mathbf{B} \in \mathbb{R}^{P \times M^z}$ are the original tensor, normalized tensor, gain parameter, and bias parameter, respectively. The $m^z$-th channel of $\mathbf{Z}$, i.e. $\mathbf{z}_{m^z} \in \mathbb{R}^P$, is normalized element-wise:
$$\hat{\mathbf{z}}_{m^z} = (\mathbf{z}_{m^z} - \mathbf{z}^{\mu}) / \mathbf{z}^{\sigma} \tag{21}$$
where $\mathbf{z}^{\mu}, \mathbf{z}^{\sigma} \in \mathbb{R}^P$ are the mean and standard deviation along the channel dimension of $\mathbf{Z}$, respectively, and $\hat{\mathbf{z}}_{m^z} \in \mathbb{R}^P$ is the $m^z$-th channel of $\hat{\mathbf{Z}}$. Note that the number of parameters introduced by CN/LN can be neglected as it is very small compared to the number of other parameters in the model.
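Equations (20)-(21) amount to normalizing every location's channel vector separately. The sketch below is one possible reading; the `eps` term and the use of the population standard deviation are assumptions added for numerical safety and are not part of (21).

```python
import math

def channel_norm(Z, gain, bias, eps=1e-5):
    """Eqs. (20)-(21): normalise each location's channel vector with its own statistics.

    Z, gain, bias : list[P][M_z]
    eps           : small stabiliser (an assumption; eq. (21) has no such term)
    """
    out = []
    for z, g, b in zip(Z, gain, bias):
        mu = sum(z) / len(z)                                        # z^mu at this location
        sigma = math.sqrt(sum((v - mu) ** 2 for v in z) / len(z))   # z^sigma at this location
        out.append([(v - mu) / (sigma + eps) * gv + bv
                    for v, gv, bv in zip(z, g, b)])
    return out

# Two locations whose channel statistics differ by a factor of ten.
Z = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
ones = [[1.0] * 3 for _ in Z]
zeros = [[0.0] * 3 for _ in Z]
normed = channel_norm(Z, ones, zeros)
```

After normalization the two rows are nearly identical, which is the property that distinguishes CN from a normalization sharing one set of statistics across all locations.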
Using Higher-Dimensional Tensors. One can observe from (9) that when fixing the kernel size $K$, the tensor size $P$ of a 2D tLSTM grows linearly w.r.t. its depth $L$. How can we expand the tensor volume more rapidly so that the network can be widened more efficiently? We can achieve this goal by leveraging higher-dimensional tensors. Based on previous definitions for 2D tLSTMs, we replace the 2D tensors with $D$-dimensional ($D > 2$) tensors, obtaining $\mathbf{H}_t, \mathbf{C}_t \in \mathbb{R}^{P_1 \times P_2 \times \dots \times P_{D-1} \times M}$ with the tensor size $\mathbf{P} = [P_1, P_2, \dots, P_{D-1}]$. Since the hidden states are no longer matrices, we concatenate the projection of $\mathbf{x}_t$ to one corner of $\mathbf{H}_{t-1}$, and thus (5) is extended as:
$$\mathbf{h}^{\mathrm{cat}}_{t-1,\mathbf{p}} = \begin{cases} \mathbf{x}_t \mathbf{W}^x + \mathbf{b}^x & \text{if } p_d = 1 \text{ for } d = 1, 2, \dots, D-1 \\ \mathbf{h}_{t-1,\mathbf{p}-\mathbf{1}} & \text{if } p_d > 1 \text{ for } d = 1, 2, \dots, D-1 \\ \mathbf{0} & \text{otherwise} \end{cases} \tag{22}$$
where $\mathbf{h}^{\mathrm{cat}}_{t-1,\mathbf{p}} \in \mathbb{R}^M$ is the channel vector at location $\mathbf{p} \in \mathbb{Z}_+^{D-1}$ of the concatenated hidden state $\mathbf{H}^{\mathrm{cat}}_{t-1} \in \mathbb{R}^{(P_1+1) \times (P_2+1) \times \dots \times (P_{D-1}+1) \times M}$. For the tensor update, the convolution kernels $\mathbf{W}^h$ and $\mathbf{W}^c_t(\cdot)$ also increase their dimensionality with kernel size $\mathbf{K} = [K_1, K_2, \dots, K_{D-1}]$. Note that $\mathbf{W}^c_t(\cdot)$ is reshaped from the vector, as illustrated in Fig. 2(b). Correspondingly, we generate the output $\mathbf{y}_t$ from the opposite corner of $\mathbf{H}_{t+L-1}$, and therefore (8) is modified as:
$$\mathbf{y}_t = \varphi(\mathbf{h}_{t+L-1,\mathbf{P}} \mathbf{W}^y + \mathbf{b}^y) \tag{23}$$
For convenience, we set $P_d = P$ and $K_d = K$ for $d = 1, 2, \dots, D-1$ so that all dimensions of $\mathbf{P}$ and $\mathbf{K}$ can satisfy (9) with the same depth $L$. In addition, CN still normalizes the channel dimension of tensors.
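To see why the parameter number does not depend on the tensor size, one can count the entries of the cross-layer kernel in (14). The sketch below assumes $K_d = K$ in every tensor dimension and counts only the kernel weight and bias (not the input/output projections or normalization parameters); the function name is hypothetical.

```python
def tlstm_kernel_params(K, M, D=2):
    """Weights and biases of the cross-layer kernel {W^h, b^h} in eq. (14).

    K : kernel size per tensor dimension (K_d = K for all d)
    M : channel size
    D : tensor dimensionality (2 for matrices, 3 for 3D tensors)
    """
    taps = K ** (D - 1)            # <K>, the number of kernel locations
    out_channels = 4 * M + taps    # four gate activations plus A^q
    return taps * M * out_channels + out_channels

print(tlstm_kernel_params(K=3, M=901, D=2))   # 2D tLSTM channel size from Sec. 3.1
print(tlstm_kernel_params(K=3, M=522, D=3))   # 3D tLSTM channel size from Sec. 3.1
```

The tensor size $P$ does not appear in the count at all. Under these assumptions both settings come out slightly below the 10M budget quoted in Sec. 3.1, which is plausible once the input and output projections are added, although the paper's exact accounting may differ.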
3 Experiments
We evaluate tLSTM on five challenging sequence learning tasks under different configurations:
(a) sLSTM (baseline): our implementation of sLSTM [21] with parameters shared across all layers.
(b) 2D tLSTM: the standard 2D tLSTM, as defined in (14)-(19).
(c) 2D tLSTM–M: removing (–) memory (M) cell convolutions from (b), as defined in (10)-(13).
(d) 2D tLSTM–F: removing (–) feedback (F) connections from (b).
(e) 3D tLSTM: tensorizing (b) into 3D tLSTM.
(f) 3D tLSTM+LN: applying (+) LN [3] to (e).
(g) 3D tLSTM+CN: applying (+) CN to (e), as defined in (20).
To compare different configurations, we also use $L$ to denote the number of layers of a sLSTM, and $M$ to denote the hidden size of each sLSTM layer. We set the kernel size $K$ to 2 for 2D tLSTM–F and 3 for other tLSTMs, in which case we have $L = P$ according to (9).
For each configuration, we fix the parameter number and increase the tensor size to see if the performance of tLSTM can be boosted without increasing the parameter number. We also investigate how the runtime is affected by the depth, where the runtime is measured by the average GPU milliseconds spent by a forward and backward pass over one timestep of a single example. Next, we compare tLSTM against the state-of-the-art methods to evaluate its ability. Finally, we visualize the internal working mechanism of tLSTM. Please see Appendix D for training details.
3.1 Wikipedia Language Modeling
Figure 3: Performance and runtime of different configurations on Wikipedia.
The Hutter Prize Wikipedia dataset [25] consists of 100 million characters taken from 205 different characters including alphabets, XML markups and special symbols. We model the dataset at the character-level, and try to predict the next character of the input sequence.
We fix the parameter number to 10M, corresponding to channel sizes $M$ of 1120 for sLSTM and 2D tLSTM–F, 901 for other 2D tLSTMs, and 522 for 3D tLSTMs. All configurations are evaluated with depths $L = 1, 2, 3, 4$. We use Bits-per-character (BPC) to measure the model performance.
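For reference, BPC is the average negative base-2 log-probability that the model assigns to the correct next character. A minimal sketch, with a hypothetical function name:

```python
import math

def bits_per_character(probs_of_target):
    """Average negative log2-probability assigned to the correct next character."""
    return -sum(math.log2(p) for p in probs_of_target) / len(probs_of_target)

# A model that guesses uniformly over the 205 characters of the dataset.
uniform_bpc = bits_per_character([1.0 / 205] * 1000)
```

The uniform baseline gives $\log_2 205$ bits, so any BPC well below that reflects structure the model has learned.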
Results are shown in Fig. 3. When $L \le 2$, sLSTM and 2D tLSTM–F outperform other models because of a larger $M$. With $L$ increasing, the performances of sLSTM and 2D tLSTM–M improve but become saturated when $L \ge 3$, while tLSTMs with memory cell convolutions improve with increasing $L$ and finally outperform both sLSTM and 2D tLSTM–M. When $L = 4$, 2D tLSTM–F is surpassed by 2D tLSTM, which is in turn surpassed by 3D tLSTM. The performance of 3D tLSTM+LN benefits from LN only when $L \le 2$. However, 3D tLSTM+CN consistently improves 3D tLSTM with different $L$.
Table 1: Test BPC on Wikipedia.
Model                                BPC      # Param.
MI-LSTM [51]                         1.44     17M
mLSTM [32]                           1.42     20M
HyperLSTM+LN [23]                    1.34     26.5M
HM-LSTM+LN [11]                      1.32     35M
Large RHN [54]                       1.27     46M
Large FS-LSTM-4 [38]                 1.245    47M
2 x Large FS-LSTM-4 [38]             1.198    94M
3D tLSTM+CN (L = 6, M = 1200)        1.264    50.1M

Whilst the runtime of sLSTM is almost proportional to $L$, it is nearly constant in each tLSTM configuration and largely independent of $L$.
We compare a larger model, i.e. a 3D tLSTM+CN with $L = 6$ and $M = 1200$, to the state-of-the-art methods on the test set, as reported in Table 1. Our model achieves 1.264 BPC with 50.1M parameters, and is competitive with the best performing methods [38, 54] with similar parameter numbers.
3.2 Algorithmic Tasks
Figure 4: Performance and runtime of different configurations on the addition (left) and memorization (right) tasks.
(a) Addition: The task is to sum two 15-digit integers. The network first reads two integers with one digit per timestep, and then predicts the summation. We follow the processing of [30], where a symbol '-' is used to delimit the integers as well as pad the input/target sequence. A 3-digit integer addition task is of the form:
Input:  -123-900-----
Target: --------1023-
(b) Memorization: The goal of this task is to memorize a sequence of 20 random symbols. Similar to the addition task, we use 65 different symbols. A 5-symbol memorization task is of the form:
Input:  -abccb------
Target: ------abccb-
We evaluate all configurations with $L = 1, 4, 7, 10$ on both tasks, where $M$ is 400 for addition and 100 for memorization. The performance is measured by the symbol prediction accuracy.
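The two sequence layouts shown above can be reproduced with a few lines of Python. This is a sketch inferred from the two printed examples only: the handling of sums shorter than n+1 digits (left-aligned and padded with '-') and of leading zeros is an assumption, and the function names are hypothetical.

```python
def addition_example(a, b, n):
    """Input/target strings for adding two n-digit integers a and b."""
    prefix = "-" + str(a).zfill(n) + "-" + str(b).zfill(n)
    total = str(a + b).ljust(n + 1, "-")       # assumed: left-aligned, '-' padded
    return prefix + "-" * (n + 2), "-" * len(prefix) + total + "-"

def memorization_example(symbols):
    """Input/target strings for memorizing a string of symbols."""
    n = len(symbols)
    return "-" + symbols + "-" * (n + 1), "-" * (n + 1) + symbols + "-"

inp, tgt = addition_example(123, 900, n=3)
print("Input: ", inp)
print("Target:", tgt)
```

In both tasks the input and target strings have the same length, so one symbol is read and one symbol is predicted at every timestep.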
Fig. 4 shows the results. In both tasks, large $L$ degrades the performances of sLSTM and 2D tLSTM–M. In contrast, the performance of 2D tLSTM–F steadily improves with $L$ increasing, and is further enhanced by using feedback connections, higher-dimensional tensors, and CN, while LN helps only when $L = 1$. Note that in both tasks, the correct solution can be found (when 100% test accuracy is achieved) due to the repetitive nature of the task. In our experiment, we also observe that for the addition task, 3D tLSTM+CN with $L = 7$ outperforms other configurations and finds the solution with only 298K training samples, while for the memorization task, 3D tLSTM+CN with $L = 10$ beats other configurations and achieves perfect memorization after seeing 54K training samples. Also, unlike in sLSTM, the runtime of all tLSTMs is largely unaffected by $L$.
Table 2: Test accuracies on two algorithmic tasks.
                          Addition            Memorization
Model                     Acc.    # Samp.     Acc.    # Samp.
Stacked LSTM [21]         51%     5M          >50%    900K
Grid LSTM [30]            >99%    550K        >99%    150K
3D tLSTM+CN (L = 7)       >99%    298K        >99%    115K
3D tLSTM+CN (L = 10)      >99%    317K        >99%    54K

We further compare the best performing configurations to the state-of-the-art methods for both tasks (see Table 2). Our models solve both tasks significantly faster (i.e., using fewer training samples) than other models, achieving the new state-of-the-art results.
3.3 MNIST Image Classification
Figure 5: Performance and runtime of different configurations on sequential MNIST (left) and sequential pMNIST (right).
The MNIST dataset [35] consists of 50000/10000/10000 handwritten digit images of size $28 \times 28$ for training/validation/test. We have two tasks on this dataset:
(a) Sequential MNIST: The goal is to classify the digit after sequentially reading the pixels in a scanline order [33]. It is therefore a 784 timestep sequence learning task where a single output is produced at the last timestep; the task requires very long range dependencies in the sequence.
(b) Sequential Permuted MNIST: We permute the original image pixels in a fixed random order as in [2], resulting in a permuted MNIST (pMNIST) problem that has even longer range dependencies across pixels and is harder.
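The two input formats can be illustrated as follows. This is a sketch only: the seed, the function names, and the synthetic stand-in image are assumptions, and the actual permutation used in the experiments is not specified here.

```python
import random

def to_sequence(image):
    """Flatten a 2D image (list of rows) into a scanline-order pixel sequence."""
    return [pixel for row in image for pixel in row]

def fixed_permutation(length, seed=0):
    """One random pixel order, reused for every image (the seed is a hypothetical choice)."""
    order = list(range(length))
    random.Random(seed).shuffle(order)
    return order

def permute(sequence, order):
    """Reorder a pixel sequence according to a fixed index order."""
    return [sequence[i] for i in order]

image = [[r * 28 + c for c in range(28)] for r in range(28)]  # stand-in for a digit image
seq = to_sequence(image)              # sequential MNIST input, 784 timesteps
order = fixed_permutation(len(seq))
pseq = permute(seq, order)            # sequential pMNIST input, same pixels in a new order
```

The permutation is drawn once and applied to every image, so the task remains learnable while the local spatial structure of neighbouring pixels is destroyed.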
In both tasks, all configurations are evaluated with $M = 100$ and $L = 1, 3, 5$. The model performance is measured by the classification accuracy.
Results are shown in Fig. 5. sLSTM and 2D tLSTM–M no longer benefit from the increased depth when $L = 5$. Both increasing the depth and tensorization boost the performance of 2D tLSTM. However, removing feedback connections from 2D tLSTM seems not to affect the performance. On the other hand, CN enhances the 3D tLSTM and when $L \ge 3$ it outperforms LN. 3D tLSTM+CN with $L = 5$ achieves the highest performances in both tasks, with a validation accuracy of 99.1% for MNIST and 95.6% for pMNIST. The runtime of tLSTMs is negligibly affected by $L$, and all tLSTMs become faster than sLSTM when $L = 5$.
Figure 6: Visualization of the diagonal channel means of the tLSTM memory cells for each task. In each horizontal bar, the rows from top to bottom correspond to the diagonal locations from $\mathbf{p}^{\mathrm{in}}$ to $\mathbf{p}^{\mathrm{out}}$, the columns from left to right correspond to different timesteps (from $1$ to $T+L-1$ for the full sequence, where $L-1$ is the time delay), and the values are normalized to be in range $[0, 1]$ for better visualization. Both full sequences in (d) and (e) are zoomed out horizontally.
Table 3: Test accuracies (%) on sequential MNIST/pMNIST.
Model                        MNIST    pMNIST
iRNN [33]                    97.0     82.0
LSTM [2]                     98.2     88.0
uRNN [2]                     95.1     91.4
Full-capacity uRNN [49]      96.9     94.1
sTANH [53]                   98.1     94.0
BN-LSTM [13]                 99.0     95.4
Dilated GRU [8]              99.2     94.6
Dilated CNN [40] in [8]      98.3     96.7
3D tLSTM+CN (L = 3)          99.2     94.9
3D tLSTM+CN (L = 5)          99.0     95.7

We also compare the configurations of the highest test accuracies to the state-of-the-art methods (see Table 3). For sequential MNIST, our 3D tLSTM+CN with $L = 3$ performs as well as the state-of-the-art Dilated GRU model [8], with a test accuracy of 99.2%. For the sequential pMNIST, our 3D tLSTM+CN with $L = 5$ has a test accuracy of 95.7%, which is close to the state-of-the-art of 96.7% produced by the Dilated CNN [40] in [8].
3.4 Analysis
The experimental results of different model configurations on different tasks suggest that the performance of tLSTMs can be improved by increasing the tensor size and network depth, requiring no additional parameters and little additional runtime. As the network gets wider and deeper, we found that the memory cell convolution mechanism is crucial to maintain improvement in performance. Also, we found that feedback connections are useful for tasks of sequential output (e.g., our Wikipedia and algorithmic tasks). Moreover, tLSTM can be further strengthened via tensorization or CN.
It is also intriguing to examine the internal working mechanism of tLSTM. Thus, we visualize the memory cell which gives insight into how information is routed. For each task, the best performing tLSTM is run on a random example. We record the channel mean (the mean over channels, e.g., it is of size $P \times P$ for 3D tLSTMs) of the memory cell at each timestep, and visualize the diagonal values of the channel mean from location $\mathbf{p}^{\mathrm{in}} = [1, 1]$ (near the input) to $\mathbf{p}^{\mathrm{out}} = [P, P]$ (near the output).
Visualization results in Fig. 6 reveal the distinct behaviors of tLSTM when dealing with different tasks:
(i) Wikipedia: the input can be carried to the output location with less modification if it is sufficient
to determine the next character, and vice versa; (ii) addition: the first integer is gradually encoded
into memories and then interacts (performs addition) with the second integer, producing the sum; (iii)
memorization: the network behaves like a shift register that continues to move the input symbol to the
output location at the correct timestep; (iv) sequential MNIST: the network is more sensitive to the
pixel value change (representing the contour, or topology of the digit) and can gradually accumulate
evidence for the final prediction; (v) sequential pMNIST: the network is sensitive to high value pixels
(representing the foreground digit), and we conjecture that this is because the permutation destroys
the topology of the digit, making each high value pixel potentially important.
From Fig. 6 we can also observe common phenomena for all tasks: (i) at each timestep, the values
at different tensor locations are markedly different, implying that wider (larger) tensors can encode
more information, with less effort to compress it; (ii) from the input to the output, the values become
increasingly distinct and are shifted by time, revealing that deep computations are indeed performed
together with temporal computations, with long-range dependencies carried by memory cells.
Figure 7: Examples of models related to tLSTMs. (a) A single layer cLSTM [48] with vector array input. (b) A 3-layer sLSTM [21]. (c) A 3-layer Grid LSTM [30]. (d) A 3-layer RHN [54]. (e) A 3-layer QRNN [7] with kernel size 2, where costly computations are done by temporal convolution.
4 Related Work
Convolutional LSTMs. Convolutional LSTMs (cLSTMs) are proposed to parallelize the computation of LSTMs when the input at each timestep is structured (see Fig. 7(a)), e.g., a vector array [48], a vector matrix [41, 42, 50, 52], or a vector tensor [9, 45]. Unlike cLSTMs, tLSTM aims to increase the capacity of LSTMs when the input at each timestep is non-structured, i.e., a single vector, and is advantageous over cLSTMs in that: (i) it performs the convolution across different hidden layers whose structure is independent of the input structure, and integrates information bottom-up and top-down; while cLSTM performs the convolution within each hidden layer whose structure is coupled with the input structure, thus will fall back to the vanilla LSTM if the input at each timestep is a single vector; (ii) it can be widened efficiently without additional parameters by increasing the tensor size; while cLSTM can be widened by increasing the kernel size or kernel channel, which significantly increases the number of parameters; (iii) it can be deepened with little additional runtime by delaying the output; while cLSTM can be deepened by using more hidden layers, which significantly increases the runtime; (iv) it captures long-range dependencies from multiple directions through the memory cell convolution; while cLSTM struggles to capture long-range dependencies from multiple directions since memory cells are only gated along one direction.
Deep LSTMs. Deep LSTMs (dLSTMs) extend sLSTMs by making them deeper (see Fig. 7(b)-(d)). To keep the parameter number small and ease training, Graves [22], Kalchbrenner et al. [30], Mujika et al. [38], and Zilly et al. [54] apply another RNN/LSTM along the depth direction of dLSTMs, which, however, multiplies the runtime. Though there are implementations to accelerate the deep computation [1, 16], they generally aim at simple architectures such as sLSTMs. Compared with dLSTMs, tLSTM performs the deep computation with little additional runtime, and employs a cross-layer convolution to enable the feedback mechanism. Moreover, the capacity of tLSTM can be increased more efficiently by using higher-dimensional tensors, whereas in dLSTM all hidden layers as a whole are only equivalent to a 2D tensor (i.e., a stack of hidden vectors), the dimensionality of which is fixed.
Other Parallelization Methods. Some methods [7, 8, 28, 29, 36, 40] parallelize the temporal computation of the sequence (e.g., use the temporal convolution, as in Fig. 7(e)) during training, in which case full input/target sequences are accessible. However, during online inference, when the input arrives sequentially, temporal computations can no longer be parallelized and will be blocked by deep computations of each timestep, making these methods potentially unsuitable for real-time applications that demand a high sampling/output frequency. Unlike these methods, tLSTM can speed up not only training but also online inference for many tasks since it performs the deep computation by the temporal computation, which is also human-like: we convert each signal to an action and meanwhile receive new signals in a non-blocking way. Note that for the online inference of tasks that use the previous output $\mathbf{y}_{t-1}$ for the current input $\mathbf{x}_t$ (e.g., autoregressive sequence generation), tLSTM cannot parallelize the deep computation since it must wait $L-1$ timesteps to obtain $\mathbf{y}_{t-1}$.
5 Conclusion
We introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the
temporal computation to perform the deep computation for sequential tasks. We validated our model
on a variety of tasks, showing its potential over other popular approaches.
Acknowledgements
This work is supported by the NSFC grant 91220301, the Alan Turing Institute under the EPSRC
grant EP/N510129/1, and the China Scholarship Council.
References
[1] Jeremy Appleyard, Tomas Kocisky, and Phil Blunsom. Optimizing performance of recurrent neural networks on gpus. arXiv preprint arXiv:1604.01946, 2016. 3, 9
[2] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In ICML, 2016. 7, 8
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016. 4, 5
[4] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE TNN, 5(2):157–166, 1994. 1
[5] Yoshua Bengio. Learning deep architectures for ai. Foundations and Trends® in Machine Learning, 2009. 2
[6] Luca Bertinetto, João F Henriques, Jack Valmadre, Philip Torr, and Andrea Vedaldi. Learning feed-forward one-shot learners. In NIPS, 2016. 2, 4
[7] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural networks. In ICLR, 2017. 9
[8] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark Hasegawa-Johnson, and Thomas Huang. Dilated recurrent neural networks. In NIPS, 2017. 8, 9
[9] Jianxu Chen, Lin Yang, Yizhe Zhang, Mark Alber, and Danny Z Chen. Combining fully convolutional and recurrent neural networks for 3d biomedical image segmentation. In NIPS, 2016. 9
[10] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. In ICML, 2015. 3, 13
[11] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. In ICLR, 2017. 6
[12] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A matlab-like environment for machine learning. In NIPS Workshop, 2011. 13
[13] Tim Cooijmans, Nicolas Ballas, César Laurent, and Aaron Courville. Recurrent batch normalization. In ICLR, 2017. 8
[14] Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In NIPS, 2016. 4
[15] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In NIPS, 2013. 2, 4
[16] Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, and Sanjeev Satheesh. Persistent rnns: Stashing recurrent weights on-chip. In ICML, 2016. 9
[17] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990. 1
[18] Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate tensorization: compressing convolutional and fc layers alike. In NIPS Workshop, 2016. 2
[19] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with lstm. Neural computation, 12(10):2451–2471, 2000. 1
[20] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013. 2
[21] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013. 5, 7, 9
[22] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016. 9
[23] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. In ICLR, 2017. 4, 6
[24] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. 1
[25] Marcus Hutter. The human knowledge compression contest. URL http://prize.hutter1.net, 2012. 6
[26] Ozan Irsoy and Claire Cardie. Modeling compositionality with multiplicative recurrent neural networks. In ICLR, 2015. 2
[27] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In ICML, 2015. 13
[28] Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In NIPS, 2016. 9
[29] Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. In ICLR, 2016. 9
[30] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. In ICLR, 2016. 6, 7, 9, 13
[31] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 13
[32] Ben Krause, Liang Lu, Iain Murray, and Steve Renals. Multiplicative lstm for sequence modelling. In ICLR Workshop, 2017. 2, 6
[33] Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015. 7, 8
[34] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989. 2
[35] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. 2, 7
[36] Tao Lei and Yu Zhang. Training rnns as fast as cnns. arXiv preprint arXiv:1709.02755, 2017. 9
[37] Gundram Leifert, Tobias Strauß, Tobias Grüning, Welf Wustlich, and Roger Labahn. Cells in multidimensional recurrent neural networks. JMLR, 17(1):3313–3349, 2016. 4, 13
[38] Asier Mujika, Florian Meier, and Angelika Steger. Fast-slow recurrent neural networks. In NIPS, 2017. 6, 9
[39] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In NIPS, 2015. 2
[40] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016. 8, 9
[41] Viorica Patraucean, Ankur Handa, and Roberto Cipolla. Spatio-temporal video autoencoder with differentiable memory. In ICLR Workshop, 2016. 9
[42] Bernardino Romera-Paredes and Philip Hilaire Sean Torr. Recurrent instance segmentation. In ECCV, 2016. 9
[43] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986. 1
[44] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992. 4
[45] Marijn F Stollenga, Wonmin Byeon, Marcus Liwicki, and Juergen Schmidhuber. Parallel multi-dimensional lstm, with application to fast biomedical volumetric image segmentation. In NIPS, 2015. 9
[46] Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In ICML, 2011. 2, 4
[47] Graham W Taylor and Geoffrey E Hinton. Factored conditional restricted boltzmann machines for modeling motion style. In ICML, 2009. 2
[48] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016. 3, 9
[49] Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Full-capacity unitary recurrent neural networks. In NIPS, 2016. 8
[50] Lin Wu, Chunhua Shen, and Anton van den Hengel. Deep recurrent convolutional networks for video-based person re-identification: An end-to-end approach. arXiv preprint arXiv:1606.01609, 2016. 9
[51] Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan Salakhutdinov. On multiplicative integration with recurrent neural networks. In NIPS, 2016. 2, 6
[52] SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In NIPS, 2015. 9
[53] Saizheng Zhang, Yuhuai Wu, Tong Che, Zhouhan Lin, Roland Memisevic, Ruslan R Salakhutdinov, and Yoshua Bengio. Architectural complexity measures of recurrent neural networks. In NIPS, 2016. 8
[54] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks. In ICML, 2017. 6, 9
A Mathematical Definition for Cross-Layer Convolutions
A.1 Hidden State Convolution
The hidden state convolution in (6) is defined as:
$$A_{t,p,m^o} = \sum_{k=1}^{K} \sum_{m^i=1}^{M^i} H^{cat}_{t-1,\,p-\frac{K-1}{2}+k,\,m^i} \cdot W^{h}_{k,m^i,m^o} + b^{h}_{m^o} \tag{24}$$
where $m^o \in \{1, 2, \cdots, M^o\}$ and zero padding is applied to keep the tensor size.
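As a rough illustration of (24), the following plain-Python sketch evaluates the sum for a single timestep. It is not the authors' Torch7 code. The function name `hidden_state_conv`, the nested-list layout, and the choice to return as many locations as the input has are assumptions made for this example. The index expression is transcribed exactly as printed (1-based, without re-centering), locations outside the tensor are read as zeros, and the kernel size is assumed to be odd so that (K-1)/2 is an integer.

```python
def hidden_state_conv(h_cat, w, b):
    """Evaluate the sum in (24) for one timestep (illustrative sketch).

    h_cat[p-1][mi-1]   : concatenated hidden state at t-1
    w[k-1][mi-1][mo-1] : kernel weights W^h
    b[mo-1]            : bias b^h
    Indices p, k, mi, mo are 1-based in the text and shifted by one here.
    """
    num_loc, K = len(h_cat), len(w)
    if K % 2 != 1:
        raise ValueError("this sketch assumes an odd kernel size K")
    M_i, M_o = len(w[0]), len(b)
    half = (K - 1) // 2
    out = []
    for p in range(1, num_loc + 1):
        row = []
        for mo in range(M_o):
            acc = b[mo]
            for k in range(1, K + 1):
                q = p - half + k          # source location, as written in (24)
                if 1 <= q <= num_loc:     # zero padding outside the tensor
                    for mi in range(M_i):
                        acc += h_cat[q - 1][mi] * w[k - 1][mi][mo]
            row.append(acc)
        out.append(row)
    return out
```

Because the kernel `w` does not depend on the location `p`, enlarging the tensor adds no parameters, which is the sharing property the paper relies on.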
A.2 Memory Cell Convolution
The memory cell convolution in (17) is defined as:
$$C^{conv}_{t-1,p,m} = \sum_{k=1}^{K} C_{t-1,\,p-\frac{K-1}{2}+k,\,m} \cdot W^{c}_{t,k,1,1}(p) \tag{25}$$
To prevent the stored information from being flushed away, $C_{t-1}$ is padded with the replication of its
boundary values instead of zeros or input projections.
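The effect of replication padding in (25) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation. The name `memory_cell_conv` and the list layout are hypothetical, the per-location kernels are passed in already normalized (for instance by a softmax), the kernel size is assumed to be odd, and the index expression is transcribed as printed.

```python
def memory_cell_conv(c_prev, kernels):
    """Evaluate (25) for one timestep (illustrative sketch).

    c_prev[p-1][m-1]  : memory cell tensor C_{t-1}
    kernels[p-1][k-1] : kernel W^c_{t,k,1,1}(p), one K-tap kernel per
                        location p, shared by all channels m
    """
    P, M = len(c_prev), len(c_prev[0])
    out = []
    for p in range(1, P + 1):
        taps = kernels[p - 1]
        K = len(taps)
        if K % 2 != 1:
            raise ValueError("this sketch assumes an odd kernel size K")
        half = (K - 1) // 2
        row = [0.0] * M
        for k in range(1, K + 1):
            q = p - half + k              # source location, as written in (25)
            q = min(max(q, 1), P)         # replicate the boundary values
            for m in range(M):
                row[m] += c_prev[q - 1][m] * taps[k - 1]
        out.append(row)
    return out
```

With taps that sum to one, a constant memory tensor passes through unchanged even at the boundaries, whereas zero padding would shrink the boundary values at every step.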
B Derivation for the Constraint of L, P, and K
Figure 8: Illustration of calculating the constraint of $L$, $P$, and $K$. Each column is a concatenated
hidden state tensor with tensor size $P+1=4$ and channel size $M$. The volume of the output receptive
field (blue region) is determined by the kernel radius $K_r$. The output $y_t$ for the current timestep $t$ is
delayed by $L-1=2$ timesteps.
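Here we derive the constraint of $L$, $P$, and $K$ that is defined in (9). The kernel center location is
rounded up when the kernel size $K$ is not odd. Then, the kernel radius $K_r$ can be calculated by:
$$K_r = \frac{K - K \bmod 2}{2} \tag{26}$$
As shown in Fig. 8, to guarantee that the receptive field of $y_t$ covers $x_{1:t}$ while not covering $x_{t+1:T}$,
the following constraint should be satisfied:
$$\tan\angle AOD \le \tan\angle BOD < \tan\angle COD \tag{27}$$
which means:
$$\frac{P}{L} \le \frac{K_r}{1} < \frac{P}{L-1} \tag{28}$$
Rearranging (28) gives $L-1 < P/K_r \le L$, i.e., $L$ is the smallest integer not less than $P/K_r$.
Plugging (26) into (28), we get:
$$L = \left\lceil \frac{2P}{K - K \bmod 2} \right\rceil \tag{29}$$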
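The relation between (26), (28) and (29) can be checked numerically with a small helper. The function names `kernel_radius` and `depth_from_tensor_size` are hypothetical and chosen only for this sketch. The case K = 1 is rejected because the kernel radius is then zero and (29) is undefined.

```python
def kernel_radius(K):
    """Kernel radius K_r from (26)."""
    return (K - K % 2) // 2


def depth_from_tensor_size(P, K):
    """Depth L from (29): the ceiling of 2P / (K - K mod 2)."""
    denom = K - K % 2
    if denom <= 0:
        raise ValueError("K must be at least 2 for (29) to be defined")
    return (2 * P + denom - 1) // denom   # integer ceiling division


if __name__ == "__main__":
    # Setting of Fig. 8: tensor size P + 1 = 4, i.e. P = 3.
    for K in (2, 3, 5):
        print(K, kernel_radius(K), depth_from_tensor_size(3, K))
```

For the setting of Fig. 8 (P = 3) with a kernel radius of 1, the helper returns L = 3, which agrees with the delay of L - 1 = 2 timesteps stated in the caption.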
C Memory Cell Convolution Helps to Prevent the Vanishing/Exploding Gradients
Leifert et al. [37] have proved that the lambda gate, which is very similar to our memory cell
convolution kernel, can help to prevent the vanishing/exploding gradients (see Theorems 17-18 in
[37]). The differences between our approach and their lambda gate are: (i) we normalize the kernel
values through a softmax function, while they normalize the gate values by dividing them by their
sum, and (ii) we share the kernel across all channels, while they do not. However, as neither modification
affects the conditions of validity for Theorems 17-18 in [37], our memory cell convolution can also
help to prevent the vanishing/exploding gradients.
D Training Details
D.1 Objective Function
The training objective is to minimize the negative log-likelihood (NLL) of the training sequences
w.r.t. the parameter $\theta$ (vectorized), i.e.,
$$\min_{\theta} \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} -\ln p\big(y^d_{n,t} \mid f(x^d_{n,1:t}; \theta)\big) \tag{30}$$
where $N$ is the number of training sequences, $T_n$ the length of the $n$-th training sequence, and
$p(y^d_{n,t} \mid f(x^d_{n,1:t}; \theta))$ the likelihood of target $y^d_{n,t}$ conditioned on its prediction
$y_{n,t} = f(x^d_{n,1:t}; \theta)$.
Since all experiments are classification problems, $y^d_{n,t}$ is represented as the one-hot encoding of the
class label, and the output function $\phi(\cdot)$ is defined as a softmax function, which is used to generate
the class distribution $y_{n,t}$. Then, the likelihood can be calculated by
$p(y^d_{n,t} \mid y_{n,t}) = y_{n,t,s}|_{y^d_{n,t,s}=1}$.
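To make the objective in (30) concrete, here is a minimal plain-Python sketch for a toy batch. It is an illustration only. The names `softmax` and `sequence_nll` are hypothetical, the model outputs are assumed to be given as unnormalized scores (logits) per timestep, and targets are passed as class indices s, which is equivalent to picking the entry where the one-hot target equals 1.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_nll(logit_seqs, label_seqs):
    """Objective of (30): per-sequence sums of -ln y_{n,t,s}, averaged over N.

    logit_seqs[n][t] : list of class scores for sequence n at timestep t
    label_seqs[n][t] : index s of the target class (position of the 1 in
                       the one-hot target)
    """
    total = 0.0
    for logits_n, labels_n in zip(logit_seqs, label_seqs):
        for logits, s in zip(logits_n, labels_n):
            total += -math.log(softmax(logits)[s])
    return total / len(logit_seqs)
```

Note that the sum runs over all timesteps of each sequence and is divided by the number of sequences N only, so longer sequences contribute more to the objective.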
D.2 Common Settings
In all tasks, the NLL (see (30)) is used as the training objective and is minimized by Adam [31] with
a learning rate of 0.001. Forget gate biases are set to 4 for image classification tasks and 1 [27] for
others. All models are implemented in Torch7 [12] and accelerated by cuDNN on Tesla K80 GPUs.
We only apply CN to the output of the tLSTM hidden state, as we have tried different combinations
and found this to be the most robust way that can always improve the performance for all tasks. With
CN, the output of the hidden state becomes:
$$H_t = \varphi(\mathrm{CN}(C_t; \Gamma, B)) \odot O_t \tag{31}$$
D.3 Wikipedia Language Modeling
As in [10], we split the dataset into 90M/5M/5M for training/validation/test. In each iteration, we
feed the model with a mini-batch of 100 subsequences of length 50. During the forward pass, the
hidden values at the last timestep are preserved to initialize the next iteration. We terminate training
after 50 epochs.
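One common way to realize this kind of batching, so that carrying the hidden state from one iteration to the next is meaningful, is to cut the character stream into as many contiguous rows as the mini-batch size and slide a window along all rows at once. The sketch below illustrates that idea and is not necessarily the authors' data pipeline. The function name `contiguous_batches` and the next-symbol targets are assumptions made for this example.

```python
def contiguous_batches(stream, batch_size, seq_len):
    """Yield (inputs, targets) mini-batches from one long symbol stream.

    Row b of every batch continues row b of the previous batch, so hidden
    values kept from the last timestep line up with the next inputs.
    Targets are the inputs shifted by one symbol (an assumption of this
    sketch, as in next-character prediction).
    """
    row_len = (len(stream) - 1) // batch_size   # keep one symbol for targets
    rows = [stream[b * row_len: b * row_len + row_len + 1]
            for b in range(batch_size)]
    for start in range(0, row_len - seq_len + 1, seq_len):
        inputs = [row[start:start + seq_len] for row in rows]
        targets = [row[start + 1:start + seq_len + 1] for row in rows]
        yield inputs, targets
```

With the settings described above, this would be called with `batch_size=100` and `seq_len=50` on the training split, and the recurrent state would be reset only at the start of each epoch.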
D.4 Algorithmic Tasks
Following [30], for both tasks we randomly generate 5M samples for training and 100 samples for
test, and set the mini-batch size to 15. Training proceeds for at most 1 epoch³ and will be terminated
if 100% test accuracy is achieved.
D.5 MNIST Image Classification
We set the mini-batch size to 50 and use early stopping for training. The training loss is calculated at
the last timestep.
³ To simulate the online learning process, we use all training samples only once.
Article
Recurrent neural networks are a powerful tool for modeling sequential data, but the dependence of each timestep's computation on the previous timestep's output limits parallelism and makes RNNs unwieldy for very long sequences. We introduce quasi-recurrent neural networks (QRNNs), an approach to neural sequence modeling that alternates convolutional layers, which apply in parallel across timesteps, and a minimalist recurrent pooling function that applies in parallel across channels. Despite lacking trainable recurrent layers, stacked QRNNs have better predictive accuracy than stacked LSTMs of the same hidden size. Due to their increased parallelism, they are up to 16 times faster at train and test time. Experiments on language modeling, sentiment classification, and character-level neural machine translation demonstrate these advantages and underline the viability of QRNNs as a basic building block for a variety of sequence tasks.
Article
Several mechanisms to focus attention of a neural network on selected parts of its input or memory have been used successfully in deep learning models in recent years. Attention has improved image classification, image captioning, speech recognition, generative models, and learning algorithmic tasks, but it had probably the largest impact on neural machine translation. Recently, similar improvements have been obtained using alternative mechanisms that do not focus on a single part of a memory but operate on all of it in parallel, in a uniform way. Such a mechanism, which we call active memory, improved over attention in algorithmic tasks, image processing, and in generative modelling. So far, however, active memory has not improved over attention for most natural language processing tasks, in particular for machine translation. We analyze this shortcoming in this paper and propose an extended model of active memory that matches existing attention models on neural machine translation and generalizes better to longer sentences. We investigate this model and explain why previous active memory models did not succeed. Finally, we discuss when active memory brings most benefits and where attention can be a better choice.
Conference Paper
Instance segmentation is the problem of detecting and delineating each distinct object of interest appearing in an image. Current instance segmentation approaches consist of ensembles of modules that are trained independently of each other, thus missing opportunities for joint learning. Here we propose a new instance segmentation paradigm consisting in an end-to-end method that learns how to segment instances sequentially. The model is based on a recurrent neural network that sequentially finds objects and their segmentations one at a time. This net is provided with a spatial memory that keeps track of what pixels have been explained and allows occlusion handling. In order to train the model we designed a principled loss function that accurately represents the properties of the instance segmentation problem. In the experiments carried out, we found that our method outperforms recent approaches on multiple person segmentation, and all state of the art approaches on the Plant Phenotyping dataset for leaf counting.