Wider and Deeper, Cheaper and Faster:
Tensorized LSTMs for Sequence Learning
Zhen He1,2, Shaobing Gao3, Liang Xiao2, Daxue Liu2, Hangen He2, and David Barber∗1,4
1University College London, 2National University of Defense Technology, 3Sichuan University, 4Alan Turing Institute
Abstract
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer-term temporal information. The capacity of an LSTM network can be increased by widening and adding layers. However, usually the former introduces additional parameters, while the latter increases the runtime. As an alternative, we propose the Tensorized LSTM, in which the hidden states are represented by tensors and updated via a cross-layer convolution. By increasing the tensor size, the network can be widened efficiently without additional parameters, since the parameters are shared across different locations in the tensor; by delaying the output, the network can be deepened implicitly with little additional runtime, since deep computations for each timestep are merged into temporal computations of the sequence. Experiments conducted on five challenging sequence learning tasks show the potential of the proposed model.
1 Introduction
We consider the time-series prediction task of producing a desired output $y_t$ at each timestep $t \in \{1, \dots, T\}$, given an observed input sequence $x_{1:t} = \{x_1, x_2, \dots, x_t\}$, where $x_t \in \mathbb{R}^R$ and $y_t \in \mathbb{R}^S$ are vectors¹. The Recurrent Neural Network (RNN) [17, 43] is a powerful model that learns how to use a hidden state vector $h_t \in \mathbb{R}^M$ to encapsulate the relevant features of the entire input history $x_{1:t}$ up to timestep $t$. Let $h^{cat}_{t-1} \in \mathbb{R}^{R+M}$ be the concatenation of the current input $x_t$ and the previous hidden state $h_{t-1}$:
$$h^{cat}_{t-1} = [x_t, h_{t-1}] \quad (1)$$
The update of the hidden state $h_t$ is defined as:
$$a_t = h^{cat}_{t-1} W^h + b^h \quad (2)$$
$$h_t = \phi(a_t) \quad (3)$$
where $W^h \in \mathbb{R}^{(R+M) \times M}$ is the weight, $b^h \in \mathbb{R}^M$ the bias, $a_t \in \mathbb{R}^M$ the hidden activation, and $\phi(\cdot)$ the element-wise tanh function. Finally, the output $y_t$ at timestep $t$ is generated by:
$$y_t = \varphi(h_t W^y + b^y) \quad (4)$$
where $W^y \in \mathbb{R}^{M \times S}$ and $b^y \in \mathbb{R}^S$, and $\varphi(\cdot)$ can be any differentiable function, depending on the task.
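As a concrete reference, here is a minimal NumPy sketch of one step of this vanilla RNN; the function and variable names are ours, for illustration only:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_h, b_h, W_y, b_y, out_fn=lambda z: z):
    """One vanilla RNN step, Eqs. (1)-(4); vectors are rows, as in the paper."""
    h_cat = np.concatenate([x_t, h_prev])   # Eq. (1): concatenate input and state
    h_t = np.tanh(h_cat @ W_h + b_h)        # Eqs. (2)-(3): phi is element-wise tanh
    y_t = out_fn(h_t @ W_y + b_y)           # Eq. (4): out_fn is the task-dependent varphi
    return h_t, y_t
```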
However, this vanilla RNN has difficulties in modeling long-range dependencies due to the vanishing/exploding gradient problem [4]. Long Short-Term Memories (LSTMs) [19, 24] alleviate these problems by employing memory cells to preserve information for longer, and adopting gating mechanisms to modulate the information flow. Given the success of the LSTM in sequence modeling, it is natural to consider how to increase the complexity of the model and thereby increase the set of tasks for which the LSTM can be profitably applied.

∗Corresponding authors: Shaobing Gao <gaoshaobing@scu.edu.cn> and Zhen He <hezhen.cs@gmail.com>.
¹Vectors are assumed to be in row form throughout this paper.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
We consider the capacity of a network to consist of two components: the width (the amount of information handled in parallel) and the depth (the number of computation steps) [5]. A naive way to widen the LSTM is to increase the number of units in a hidden layer; however, the number of parameters then scales quadratically with the number of units. To deepen the LSTM, the popular Stacked LSTM (sLSTM) stacks multiple LSTM layers [20]; however, the runtime is proportional to the number of layers, and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers.
In this paper, we introduce a way to both widen and deepen the LSTM whilst keeping the number of parameters and the runtime largely unchanged. In summary, we make the following contributions:
(a) We tensorize RNN hidden state vectors into higher-dimensional tensors, which allow more flexible parameter sharing and can be widened more efficiently without additional parameters.
(b) Based on (a), we merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).
(c) We extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a novel memory cell convolution to help prevent vanishing/exploding gradients.
2 Method
2.1 Tensorizing Hidden States
It can be seen from (2) that, in an RNN, the number of parameters scales quadratically with the size of the hidden state. A popular way to limit the number of parameters when widening the network is to organize the parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements [6, 15, 18, 26, 32, 39, 46, 47, 51], which is known as tensor factorization. This implicitly widens the network, since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. Another common way to reduce the number of parameters is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs) [34, 35].
We adopt parameter sharing to cut down the number of parameters for RNNs since, compared with factorization, it has the following advantages: (i) scalability, i.e., the number of shared parameters can be set independently of the hidden state size, and (ii) separability, i.e., the information flow can be carefully managed by controlling the receptive field, allowing one to shift RNN deep computations to the temporal domain (see Sec. 2.2). We also explicitly tensorize the RNN hidden state vectors since, compared with vectors, tensors offer better: (i) flexibility, i.e., one can specify which dimensions share parameters and then increase the size of those dimensions without introducing additional parameters, and (ii) efficiency, i.e., with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the number of parameters (see Sec. 2.3).
For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state $h_t \in \mathbb{R}^M$ to become $H_t \in \mathbb{R}^{P \times M}$, where $P$ is the tensor size and $M$ the channel size. We locally connect the first dimension of $H_t$ in order to share parameters, and fully connect the second dimension of $H_t$ to allow global interactions. This is analogous to the CNN, which fully connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. Also, if one compares $H_t$ to the hidden state of a Stacked RNN (sRNN) (see Fig. 1(a)), then $P$ is akin to the number of stacked hidden layers, and $M$ the size of each hidden layer. We start by describing our model based on 2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.
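To make the efficiency argument concrete, the following back-of-the-envelope comparison (our own illustration, with arbitrary sizes) contrasts the quadratic parameter growth of naive widening in Eq. (2) with the fixed shared kernel introduced in Sec. 2.2:

```python
# Widening a vanilla RNN P-fold grows its recurrent weight matrix roughly
# quadratically (Eq. (2) with hidden size P*M), whereas the tRNN's shared
# cross-layer kernel (Eq. (6), Sec. 2.2) has a size independent of P.
R, M, P, K = 50, 100, 4, 3                 # arbitrary illustrative sizes
naive_wide = (R + P * M) * (P * M)         # W^h for a width-(P*M) vanilla RNN
trnn_kernel = K * M * M                    # W^h for the tRNN, any tensor size P
print(naive_wide, trnn_kernel)             # 180000 vs. 30000
```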
2.2 Merging Deep Computations
Since an RNN is already deep in its temporal direction, we can deepen an input-to-output computation by associating the input $x_t$ with a (delayed) future output. In doing this, we need to ensure that the output $y_t$ is separable, i.e., not influenced by any future input $x_{t'}$ ($t' > t$). Thus, we concatenate the projection of $x_t$ to the top of the previous hidden state $H_{t-1}$, then gradually shift the input information downwards as the temporal computation proceeds, and finally generate $y_t$ from the bottom of $H_{t+L-1}$, where $L-1$ is the number of delayed timesteps for computations of depth $L$. An example with $L=3$ is shown in Fig. 1(b). This is in fact a skewed sRNN, as used in [1] (also similar to [48]). However, our method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable; e.g., one can increase the local connections and use feedback (see Fig. 1(c)), which can be beneficial for sRNNs [10]. In order to share parameters, we update $H_t$ using a convolution with a learnable kernel. In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).

Figure 1: Examples of sRNN, tRNNs, and tLSTMs. (a) A 3-layer sRNN. (b) A 2D tRNN without (–) feedback (F) connections, which can be thought of as a skewed version of (a). (c) A 2D tRNN. (d) A 2D tLSTM without (–) memory (M) cell convolutions. (e) A 2D tLSTM. In each model, the blank circles in columns 1 to 4 denote the hidden state at timesteps $t-1$ to $t+2$, respectively, and the blue region denotes the receptive field of the current output $y_t$. In (b)-(e), the outputs are delayed by $L-1=2$ timesteps, where $L=3$ is the depth.
To describe the resulting tRNN model, let $H^{cat}_{t-1} \in \mathbb{R}^{(P+1) \times M}$ be the concatenated hidden state, and $p \in \mathbb{Z}_+$ the location in the tensor. The channel vector $h^{cat}_{t-1,p} \in \mathbb{R}^M$ at location $p$ of $H^{cat}_{t-1}$ is defined as:
$$h^{cat}_{t-1,p} = \begin{cases} x_t W^x + b^x & \text{if } p = 1 \\ h_{t-1,p-1} & \text{if } p > 1 \end{cases} \quad (5)$$
where $W^x \in \mathbb{R}^{R \times M}$ and $b^x \in \mathbb{R}^M$. Then, the update of the tensor $H_t$ is implemented via a convolution:
$$A_t = H^{cat}_{t-1} \circledast \{W^h, b^h\} \quad (6)$$
$$H_t = \phi(A_t) \quad (7)$$
where $W^h \in \mathbb{R}^{K \times M^i \times M^o}$ is the kernel weight of size $K$, with $M^i = M$ input channels and $M^o = M$ output channels, $b^h \in \mathbb{R}^{M^o}$ is the kernel bias, $A_t \in \mathbb{R}^{P \times M^o}$ is the hidden activation, and $\circledast$ is the convolution operator (see Appendix A.1 for a more detailed definition). Since the kernel convolves across different hidden layers, we call it the cross-layer convolution. The kernel enables interaction, both bottom-up and top-down, across layers. Finally, we generate $y_t$ from the channel vector $h_{t+L-1,P} \in \mathbb{R}^M$, which is located at the bottom of $H_{t+L-1}$:
$$y_t = \varphi(h_{t+L-1,P} W^y + b^y) \quad (8)$$
where $W^y \in \mathbb{R}^{M \times S}$ and $b^y \in \mathbb{R}^S$. To guarantee that the receptive field of $y_t$ only covers the current and previous inputs $x_{1:t}$ (see Fig. 1(c)), $L$, $P$, and $K$ should satisfy the constraint:
$$L = \left\lceil \frac{2P}{K - K \bmod 2} \right\rceil \quad (9)$$
where $\lceil \cdot \rceil$ is the ceiling operation. For the derivation of (9), please see Appendix B.
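A one-line check of this constraint (our sketch), which recovers the $L = P$ setting used later in the experiments:

```python
import math

def depth(P, K):
    """Eq. (9): the depth L implied by tensor size P and kernel size K."""
    return math.ceil(2 * P / (K - K % 2))

assert depth(4, 3) == 4   # K = 3 gives L = P, as used in Sec. 3
assert depth(4, 2) == 4   # K = 2 (the tLSTM-F setting) also gives L = P
```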
We call the model defined in (5)-(8) the Tensorized RNN (tRNN). The model can be widened by increasing the tensor size $P$, whilst the number of parameters remains fixed (thanks to the convolution). Also, unlike the sRNN with runtime complexity $O(TL)$, the tRNN reduces the runtime complexity to $O(T+L)$, which means that increasing either the sequence length $T$ or the network depth $L$ does not significantly increase the runtime.
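Putting (5)-(7) together, here is a NumPy sketch of one 2D tRNN step. The names are ours and the window placement follows the indexing of Eq. (24) in Appendix A.1; this is an illustration, not the authors' Torch7 implementation:

```python
import numpy as np

def trnn_step(x_t, H_prev, W_x, b_x, W_h, b_h):
    """One 2D tRNN update, Eqs. (5)-(7).
    Shapes: x_t (R,), H_prev (P, M), W_x (R, M), W_h (K, M, M_o), b_h (M_o,)."""
    P, M = H_prev.shape
    K, _, M_o = W_h.shape
    # Eq. (5): projected input on top, previous state shifted down one location.
    H_cat = np.vstack([(x_t @ W_x + b_x)[None, :], H_prev])      # (P+1, M)
    # Eq. (6): cross-layer convolution; out-of-range rows act as zero padding.
    A_t = np.zeros((P, M_o))
    for p in range(P):
        for k in range(K):
            src = p + 1 - (K - 1) // 2 + k   # 0-indexed form of p - (K-1)/2 + k
            if 0 <= src <= P:
                A_t[p] += H_cat[src] @ W_h[k]
    return np.tanh(A_t + b_h)                # Eq. (7)
```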
2.3 Extending to LSTMs
To allow the tRNN to capture long-range temporal dependencies, one can straightforwardly extend it to an LSTM by replacing the tRNN tensor update equations (6)-(7) as follows:
$$[A^g_t, A^i_t, A^f_t, A^o_t] = H^{cat}_{t-1} \circledast \{W^h, b^h\} \quad (10)$$
$$[G_t, I_t, F_t, O_t] = [\phi(A^g_t), \sigma(A^i_t), \sigma(A^f_t), \sigma(A^o_t)] \quad (11)$$
$$C_t = G_t \odot I_t + C_{t-1} \odot F_t \quad (12)$$
$$H_t = \phi(C_t) \odot O_t \quad (13)$$
where the kernel $\{W^h, b^h\}$ is of size $K$, with $M^i = M$ input channels and $M^o = 4M$ output channels, $A^g_t, A^i_t, A^f_t, A^o_t \in \mathbb{R}^{P \times M}$ are activations for the new content $G_t$, input gate $I_t$, forget gate $F_t$, and output gate $O_t$, respectively, $\sigma(\cdot)$ is the element-wise sigmoid function, and $C_t \in \mathbb{R}^{P \times M}$ is the memory cell. However, since in (12) the previous memory cell $C_{t-1}$ is only gated along the temporal direction (see Fig. 1(d)), long-range dependencies from the input to the output might be lost when the tensor size $P$ becomes large.
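A NumPy sketch of the gate computation (11)-(13), assuming the activation $A_t$ of Eq. (10) has already been produced by the cross-layer convolution with $M^o = 4M$; the names are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tlstm_cell_update(A_t, C_prev):
    """Eqs. (11)-(13): split the (P, 4M) activation into gates, update the cell.
    Without the memory cell convolution, C_prev is gated only along time."""
    A_g, A_i, A_f, A_o = np.split(A_t, 4, axis=1)
    G, I, F, O = np.tanh(A_g), sigmoid(A_i), sigmoid(A_f), sigmoid(A_o)
    C_t = G * I + C_prev * F      # Eq. (12)
    H_t = np.tanh(C_t) * O        # Eq. (13)
    return H_t, C_t
```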
Memory Cell Convolution. To capture long-range dependencies from multiple directions, we additionally introduce a novel memory cell convolution, by which the memory cells can have a larger receptive field (see Fig. 1(e)). We also dynamically generate this convolution kernel so that it is both time- and location-dependent, allowing for flexible control over long-range dependencies from different directions. This results in our tLSTM tensor update equations:
$$[A^g_t, A^i_t, A^f_t, A^o_t, A^q_t] = H^{cat}_{t-1} \circledast \{W^h, b^h\} \quad (14)$$
$$[G_t, I_t, F_t, O_t, Q_t] = [\phi(A^g_t), \sigma(A^i_t), \sigma(A^f_t), \sigma(A^o_t), \varsigma(A^q_t)] \quad (15)$$
$$W^c_t(p) = \mathrm{reshape}(q_{t,p}, [K, 1, 1]) \quad (16)$$
$$C^{conv}_{t-1} = C_{t-1} \circledast W^c_t(p) \quad (17)$$
$$C_t = G_t \odot I_t + C^{conv}_{t-1} \odot F_t \quad (18)$$
$$H_t = \phi(C_t) \odot O_t \quad (19)$$

Figure 2: Illustration of generating the memory cell convolution kernel, where (a) is for 2D tensors and (b) for 3D tensors.
where, in contrast to (10)-(13), the kernel $\{W^h, b^h\}$ has additional $\langle K \rangle$ output channels² to generate the activation $A^q_t \in \mathbb{R}^{P \times \langle K \rangle}$ for the dynamic kernel bank $Q_t \in \mathbb{R}^{P \times \langle K \rangle}$, $q_{t,p} \in \mathbb{R}^{\langle K \rangle}$ is the vectorized adaptive kernel at location $p$ of $Q_t$, and $W^c_t(p) \in \mathbb{R}^{K \times 1 \times 1}$ is the dynamic kernel of size $K$ with a single input/output channel, which is reshaped from $q_{t,p}$ (see Fig. 2(a) for an illustration). In (17), each channel of the previous memory cell $C_{t-1}$ is convolved with $W^c_t(p)$, whose values vary with $p$, forming a memory cell convolution (see Appendix A.2 for a more detailed definition), which produces a convolved memory cell $C^{conv}_{t-1} \in \mathbb{R}^{P \times M}$. Note that in (15) we employ a softmax function $\varsigma(\cdot)$ to normalize the channel dimension of $Q_t$, which, similar to [37], can stabilize the value of the memory cells and help to prevent vanishing/exploding gradients (see Appendix C for details).

²The operator $\langle \cdot \rangle$ returns the cumulative product of all elements in the input variable.
The idea of dynamically generating network weights has been used in many works [6, 14, 15, 23, 44, 46]; in [14], location-dependent convolutional kernels are also dynamically generated to improve CNNs. In contrast to these works, we focus on broadening the receptive field of the tLSTM memory cells. Whilst the flexibility is retained, fewer parameters are required to generate the kernel, since the kernel is shared by different memory cell channels.
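A 2D NumPy sketch of the memory cell convolution, Eqs. (15)-(17), with the replication padding of Appendix A.2; the names are ours, and an odd $K$ with a centered window is assumed for simplicity:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_cell_conv(C_prev, A_q):
    """Eqs. (15)-(17) in 2D: A_q has shape (P, K); C_prev has shape (P, M).
    Each location p mixes its K-row neighbourhood with its own kernel q_{t,p}."""
    P, M = C_prev.shape
    K = A_q.shape[1]
    Q = softmax(A_q, axis=1)           # Eq. (15): varsigma normalizes each kernel
    r = (K - 1) // 2                   # kernel radius for odd K
    # Appendix A.2: replicate boundary values so stored information is kept.
    C_pad = np.vstack([np.repeat(C_prev[:1], r, axis=0), C_prev,
                       np.repeat(C_prev[-1:], r, axis=0)])
    return np.stack([Q[p] @ C_pad[p:p + K] for p in range(P)])   # (P, M)
```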
Channel Normalization. To improve training, we adapt Layer Normalization (LN) [3] to our tLSTM. Similar to the observation in [3] that LN does not work well in CNNs, where channel vectors at different locations have very different statistics, we find that LN is also unsuitable for the tLSTM, where lower-level information is near the input while higher-level information is near the output. We therefore normalize the channel vectors at different locations with their own statistics, forming a Channel Normalization (CN), with its operator $CN(\cdot)$:
$$CN(Z; \Gamma, B) = \hat{Z} \odot \Gamma + B \quad (20)$$
where $Z, \hat{Z}, \Gamma, B \in \mathbb{R}^{P \times M^z}$ are the original tensor, normalized tensor, gain parameter, and bias parameter, respectively. The $m^z$-th channel of $Z$, i.e., $z_{m^z} \in \mathbb{R}^P$, is normalized element-wise:
$$\hat{z}_{m^z} = (z_{m^z} - z^\mu) / z^\sigma \quad (21)$$
where $z^\mu, z^\sigma \in \mathbb{R}^P$ are the mean and standard deviation along the channel dimension of $Z$, respectively, and $\hat{z}_{m^z} \in \mathbb{R}^P$ is the $m^z$-th channel of $\hat{Z}$. Note that the number of parameters introduced by CN/LN can be neglected, as it is very small compared to the number of other parameters in the model.
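A minimal NumPy sketch of CN (our names; the small epsilon is our addition for numerical stability):

```python
import numpy as np

def channel_norm(Z, gamma, beta, eps=1e-5):
    """Eqs. (20)-(21): each of the P locations is normalized with the mean and
    std of its own channel vector. Z, gamma, beta all have shape (P, M_z)."""
    mu = Z.mean(axis=1, keepdims=True)      # per-location mean over channels
    sd = Z.std(axis=1, keepdims=True)       # per-location std over channels
    return (Z - mu) / (sd + eps) * gamma + beta
```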
Using Higher-Dimensional Tensors. One can observe from (9) that, when fixing the kernel size $K$, the tensor size $P$ of a 2D tLSTM grows linearly w.r.t. its depth $L$. How can we expand the tensor volume more rapidly so that the network can be widened more efficiently? We can achieve this goal by leveraging higher-dimensional tensors. Based on the previous definitions for 2D tLSTMs, we replace the 2D tensors with $D$-dimensional ($D > 2$) tensors, obtaining $H_t, C_t \in \mathbb{R}^{P_1 \times P_2 \times \ldots \times P_{D-1} \times M}$ with the tensor size $P = [P_1, P_2, \ldots, P_{D-1}]$. Since the hidden states are no longer matrices, we concatenate the projection of $x_t$ to one corner of $H_{t-1}$, and thus (5) is extended as:
$$h^{cat}_{t-1,p} = \begin{cases} x_t W^x + b^x & \text{if } p_d = 1 \text{ for } d = 1, 2, \ldots, D-1 \\ h_{t-1,p-1} & \text{if } p_d > 1 \text{ for } d = 1, 2, \ldots, D-1 \\ 0 & \text{otherwise} \end{cases} \quad (22)$$
where $h^{cat}_{t-1,p} \in \mathbb{R}^M$ is the channel vector at location $p \in \mathbb{Z}^{D-1}_+$ of the concatenated hidden state $H^{cat}_{t-1} \in \mathbb{R}^{(P_1+1) \times (P_2+1) \times \ldots \times (P_{D-1}+1) \times M}$. For the tensor update, the convolution kernels $W^h$ and $W^c_t(\cdot)$ also increase their dimensionality, with kernel size $K = [K_1, K_2, \ldots, K_{D-1}]$. Note that $W^c_t(\cdot)$ is reshaped from the vector, as illustrated in Fig. 2(b). Correspondingly, we generate the output $y_t$ from the opposite corner of $H_{t+L-1}$, and therefore (8) is modified as:
$$y_t = \varphi(h_{t+L-1,P} W^y + b^y) \quad (23)$$
For convenience, we set $P_d = P$ and $K_d = K$ for $d = 1, 2, \ldots, D-1$, so that all dimensions of $P$ and $K$ can satisfy (9) with the same depth $L$. In addition, CN still normalizes the channel dimension of the tensors.
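For concreteness, here is a sketch of the corner concatenation of Eq. (22) for $D = 3$, with illustrative names of our own:

```python
import numpy as np

def concat_corner_3d(x_t, H_prev, W_x, b_x):
    """Eq. (22) with D = 3: inject the projected input at one corner, shift the
    previous state diagonally, and zero everything else. H_prev: (P1, P2, M)."""
    P1, P2, M = H_prev.shape
    H_cat = np.zeros((P1 + 1, P2 + 1, M))
    H_cat[0, 0] = x_t @ W_x + b_x    # p_d = 1 for all d: the input corner
    H_cat[1:, 1:] = H_prev           # p_d > 1 for all d: diagonal shift
    return H_cat                     # remaining entries stay zero
```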
3 Experiments
We evaluate the tLSTM on five challenging sequence learning tasks under different configurations:
(a) sLSTM (baseline): our implementation of the sLSTM [21] with parameters shared across all layers.
(b) 2D tLSTM: the standard 2D tLSTM, as defined in (14)-(19).
(c) 2D tLSTM–M: removing (–) memory (M) cell convolutions from (b), as defined in (10)-(13).
(d) 2D tLSTM–F: removing (–) feedback (F) connections from (b).
(e) 3D tLSTM: tensorizing (b) into a 3D tLSTM.
(f) 3D tLSTM+LN: applying (+) LN [3] to (e).
(g) 3D tLSTM+CN: applying (+) CN to (e), as defined in (20).
To compare different configurations, we also use $L$ to denote the number of layers of an sLSTM, and $M$ to denote the hidden size of each sLSTM layer. We set the kernel size $K$ to 2 for 2D tLSTM–F and to 3 for the other tLSTMs, in which case $L = P$ according to (9).
For each configuration, we fix the number of parameters and increase the tensor size to see whether the performance of the tLSTM can be boosted without additional parameters. We also investigate how the runtime is affected by the depth, where the runtime is measured as the average GPU milliseconds spent on a forward and backward pass over one timestep of a single example. Next, we compare the tLSTM against state-of-the-art methods to evaluate its ability. Finally, we visualize the internal working mechanism of the tLSTM. Please see Appendix D for training details.
3.1 Wikipedia Language Modeling

The Hutter Prize Wikipedia dataset [25] consists of 100 million characters drawn from an alphabet of 205 different characters, including letters, XML markup, and special symbols. We model the dataset at the character level, and try to predict the next character of the input sequence.

Figure 3: Performance and runtime of different configurations on Wikipedia.

We fix the number of parameters to 10M, corresponding to channel sizes $M$ of 1120 for the sLSTM and 2D tLSTM–F, 901 for the other 2D tLSTMs, and 522 for the 3D tLSTMs. All configurations are evaluated with depths $L = 1, 2, 3, 4$. We use bits-per-character (BPC) to measure model performance.
Results are shown in Fig. 3. When $L \le 2$, the sLSTM and 2D tLSTM–F outperform other models because of a larger $M$. As $L$ increases, the performances of the sLSTM and 2D tLSTM–M improve but saturate when $L \ge 3$, while tLSTMs with memory cell convolutions improve with increasing $L$ and finally outperform both the sLSTM and 2D tLSTM–M. When $L = 4$, 2D tLSTM–F is surpassed by the 2D tLSTM, which is in turn surpassed by the 3D tLSTM. The performance of 3D tLSTM+LN benefits from LN only when $L \le 2$, whereas 3D tLSTM+CN consistently improves over the 3D tLSTM across depths.
Table 1: Test BPC on Wikipedia.
Model                              BPC     # Param.
MI-LSTM [51]                       1.44    ≈17M
mLSTM [32]                         1.42    ≈20M
HyperLSTM+LN [23]                  1.34    26.5M
HM-LSTM+LN [11]                    1.32    ≈35M
Large RHN [54]                     1.27    ≈46M
Large FS-LSTM-4 [38]               1.245   ≈47M
2 × Large FS-LSTM-4 [38]           1.198   ≈94M
3D tLSTM+CN (L = 6, M = 1200)      1.264   50.1M
Whilst the runtime of the sLSTM is almost proportional to $L$, it is nearly constant in each tLSTM configuration, largely independent of $L$.

We compare a larger model, a 3D tLSTM+CN with $L = 6$ and $M = 1200$, to the state-of-the-art methods on the test set, as reported in Table 1. Our model achieves 1.264 BPC with 50.1M parameters, and is competitive with the best performing methods [38, 54] of similar parameter counts.
3.2 Algorithmic Tasks

Figure 4: Performance and runtime of different configurations on the addition (left) and memorization (right) tasks.

(a) Addition: The task is to sum two 15-digit integers. The network first reads two integers, one digit per timestep, and then predicts the summation. We follow the processing of [30], where a symbol '-' is used to delimit the integers as well as to pad the input/target sequence. A 3-digit integer addition task is of the form:
Input:  -123-900-----
Target: ---------1023

(b) Memorization: The goal of this task is to memorize a sequence of 20 random symbols. Similar to the addition task, we use 65 different symbols. A 5-symbol memorization task is of the form:
Input:  -abccb------
Target: -------abccb
We evaluate all configurations with $L = 1, 4, 7, 10$ on both tasks, where $M$ is 400 for addition and 100 for memorization. Performance is measured by the symbol prediction accuracy.
Fig. 4 shows the results. In both tasks, a large $L$ degrades the performances of the sLSTM and 2D tLSTM–M. In contrast, the performance of the 2D tLSTM–F steadily improves as $L$ increases, and is further enhanced by using feedback connections, higher-dimensional tensors, and CN, while LN helps only when $L = 1$. Note that in both tasks, the correct solution can be found (when 100% test accuracy is achieved) due to the repetitive nature of the task. In our experiments, we also observe that for the addition task, the 3D tLSTM+CN with $L = 7$ outperforms other configurations and finds the solution with only 298K training samples, while for the memorization task, the 3D tLSTM+CN with $L = 10$ beats other configurations and achieves perfect memorization after seeing 54K training samples. Also, unlike in the sLSTM, the runtime of all tLSTMs is largely unaffected by $L$.
Table 2: Test accuracies on two algorithmic tasks.
                              Addition             Memorization
Model                         Acc.    # Samp.      Acc.    # Samp.
Stacked LSTM [21]             51%     5M           >50%    900K
Grid LSTM [30]                >99%    550K         >99%    150K
3D tLSTM+CN (L = 7)           >99%    298K         >99%    115K
3D tLSTM+CN (L = 10)          >99%    317K         >99%    54K
We further compare the best performing configurations to the state-of-the-art methods for both tasks (see Table 2). Our models solve both tasks significantly faster (i.e., using fewer training samples) than other models, achieving new state-of-the-art results.
3.3 MNIST Image Classification

Figure 5: Performance and runtime of different configurations on sequential MNIST (left) and sequential pMNIST (right).

The MNIST dataset [35] consists of 50000/10000/10000 handwritten digit images of size 28×28 for training/validation/test. We have two tasks on this dataset:

(a) Sequential MNIST: The goal is to classify the digit after sequentially reading the pixels in scanline order [33]. This is therefore a 784-timestep sequence learning task where a single output is produced at the last timestep; the task requires very long-range dependencies in the sequence.

(b) Sequential Permuted MNIST: We permute the original image pixels in a fixed random order as in [2], resulting in a permuted MNIST (pMNIST) problem that has even longer-range dependencies across pixels and is harder.

In both tasks, all configurations are evaluated with $M = 100$ and $L = 1, 3, 5$. Model performance is measured by the classification accuracy.
Results are shown in Fig. 5. The sLSTM and 2D tLSTM–M no longer benefit from the increased depth when $L = 5$. Both increasing the depth and tensorization boost the performance of the 2D tLSTM. However, removing feedback connections from the 2D tLSTM seems not to affect performance. On the other hand, CN enhances the 3D tLSTM, and when $L \ge 3$ it outperforms LN. The 3D tLSTM+CN with $L = 5$ achieves the highest performance in both tasks, with a validation accuracy of 99.1% for MNIST and 95.6% for pMNIST. The runtime of tLSTMs is negligibly affected by $L$, and all tLSTMs become faster than the sLSTM when $L = 5$.
Figure 6: Visualization of the diagonal channel means of the tLSTM memory cells for each task. In each horizontal bar, the rows from top to bottom correspond to the diagonal locations from $p^{in}$ to $p^{out}$, the columns from left to right correspond to different timesteps (from 1 to $T+L-1$ for the full sequence, where $L-1$ is the time delay), and the values are normalized to be in the range $[0, 1]$ for better visualization. Both full sequences in (d) and (e) are zoomed out horizontally.
Table 3: Test accuracies (%) on sequential MNIST/pMNIST.
Model                        MNIST   pMNIST
iRNN [33]                    97.0    82.0
LSTM [2]                     98.2    88.0
uRNN [2]                     95.1    91.4
Full-capacity uRNN [49]      96.9    94.1
sTANH [53]                   98.1    94.0
BN-LSTM [13]                 99.0    95.4
Dilated GRU [8]              99.2    94.6
Dilated CNN [40] in [8]      98.3    96.7
3D tLSTM+CN (L = 3)          99.2    94.9
3D tLSTM+CN (L = 5)          99.0    95.7
We also compare the configurations with the highest test accuracies to the state-of-the-art methods (see Table 3). For sequential MNIST, our 3D tLSTM+CN with $L = 3$ performs as well as the state-of-the-art Dilated GRU model [8], with a test accuracy of 99.2%. For sequential pMNIST, our 3D tLSTM+CN with $L = 5$ has a test accuracy of 95.7%, which is close to the state-of-the-art 96.7% produced by the Dilated CNN [40] in [8].
3.4 Analysis
The experimental results of different model configurations on different tasks suggest that the performance of tLSTMs can be improved by increasing the tensor size and network depth, requiring no additional parameters and little additional runtime. As the network gets wider and deeper, we found that the memory cell convolution mechanism is crucial for maintaining the improvement in performance. Also, we found that feedback connections are useful for tasks with sequential output (e.g., our Wikipedia and algorithmic tasks). Moreover, the tLSTM can be further strengthened via tensorization or CN.
It is also intriguing to examine the internal working mechanism of the tLSTM. Thus, we visualize the memory cell, which gives insight into how information is routed. For each task, the best performing tLSTM is run on a random example. We record the channel mean (the mean over channels, e.g., of size $P \times P$ for 3D tLSTMs) of the memory cell at each timestep, and visualize the diagonal values of the channel mean from location $p^{in} = [1, 1]$ (near the input) to $p^{out} = [P, P]$ (near the output).
The visualization results in Fig. 6 reveal the distinct behaviors of the tLSTM when dealing with different tasks: (i) Wikipedia: the input can be carried to the output location with less modification if it is sufficient to determine the next character, and vice versa; (ii) addition: the first integer is gradually encoded into memories and then interacts (performs addition) with the second integer, producing the sum; (iii) memorization: the network behaves like a shift register that continues to move the input symbol to the output location at the correct timestep; (iv) sequential MNIST: the network is more sensitive to pixel value changes (representing the contour, or topology, of the digit) and can gradually accumulate evidence for the final prediction; (v) sequential pMNIST: the network is sensitive to high-value pixels (representing the foreground digit); we conjecture that this is because the permutation destroys the topology of the digit, making each high-value pixel potentially important.

From Fig. 6 we can also observe phenomena common to all tasks: (i) at each timestep, the values at different tensor locations are markedly different, implying that wider (larger) tensors can encode more information with less effort to compress it; (ii) from the input to the output, the values become increasingly distinct and are shifted in time, revealing that deep computations are indeed performed together with temporal computations, with long-range dependencies carried by memory cells.
Figure 7: Examples of models related to tLSTMs. (a) A single-layer cLSTM [48] with vector array input. (b) A 3-layer sLSTM [21]. (c) A 3-layer Grid LSTM [30]. (d) A 3-layer RHN [54]. (e) A 3-layer QRNN [7] with kernel size 2, where costly computations are done by temporal convolution.
4 Related Work
Convolutional LSTMs. Convolutional LSTMs (cLSTMs) have been proposed to parallelize the computation of LSTMs when the input at each timestep is structured (see Fig. 7(a)), e.g., a vector array [48], a vector matrix [41, 42, 50, 52], or a vector tensor [9, 45]. Unlike cLSTMs, the tLSTM aims to increase the capacity of LSTMs when the input at each timestep is non-structured, i.e., a single vector, and is advantageous over cLSTMs in that: (i) it performs the convolution across different hidden layers, whose structure is independent of the input structure, and integrates information bottom-up and top-down; the cLSTM instead performs the convolution within each hidden layer, whose structure is coupled with the input structure, and thus falls back to the vanilla LSTM if the input at each timestep is a single vector; (ii) it can be widened efficiently without additional parameters by increasing the tensor size, whereas the cLSTM can be widened only by increasing the kernel size or the number of kernel channels, which significantly increases the number of parameters; (iii) it can be deepened with little additional runtime by delaying the output, whereas the cLSTM can be deepened only by using more hidden layers, which significantly increases the runtime; (iv) it captures long-range dependencies from multiple directions through the memory cell convolution, whereas the cLSTM struggles to capture long-range dependencies from multiple directions, since its memory cells are only gated along one direction.
Deep LSTMs. Deep LSTMs (dLSTMs) extend sLSTMs by making them deeper (see Fig. 7(b)-(d)). To keep the number of parameters small and ease training, Graves [22], Kalchbrenner et al. [30], Mujika et al. [38], and Zilly et al. [54] apply another RNN/LSTM along the depth direction of dLSTMs, which, however, multiplies the runtime. Though there are implementations that accelerate the deep computation [1, 16], they generally aim at simple architectures such as sLSTMs. Compared with dLSTMs, the tLSTM performs the deep computation with little additional runtime, and employs a cross-layer convolution to enable a feedback mechanism. Moreover, the capacity of the tLSTM can be increased more efficiently by using higher-dimensional tensors, whereas in a dLSTM all hidden layers as a whole are only equivalent to a 2D tensor (i.e., a stack of hidden vectors), the dimensionality of which is fixed.
Other Parallelization Methods. Some methods [7, 8, 28, 29, 36, 40] parallelize the temporal computation over the sequence (e.g., using temporal convolution, as in Fig. 7(e)) during training, in which case the full input/target sequences are accessible. However, during online inference, when the input arrives sequentially, the temporal computations can no longer be parallelized and will be blocked by the deep computations at each timestep, making these methods potentially unsuitable for real-time applications that demand a high sampling/output frequency. Unlike these methods, the tLSTM can speed up not only training but also online inference for many tasks, since it performs the deep computation by means of the temporal computation. This is also human-like: we convert each signal to an action and meanwhile receive new signals in a non-blocking way. Note that for the online inference of tasks that use the previous output $y_{t-1}$ for the current input $x_t$ (e.g., autoregressive sequence generation), the tLSTM cannot parallelize the deep computation, since it needs to delay $L-1$ timesteps to obtain $y_{t-1}$.
5 Conclusion
We introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the
temporal computation to perform the deep computation for sequential tasks. We validated our model
on a variety of tasks, showing its potential over other popular approaches.
Acknowledgements
This work is supported by the NSFC grant 91220301, the Alan Turing Institute under the EPSRC
grant EP/N510129/1, and the China Scholarship Council.
References

[1] Jeremy Appleyard, Tomas Kocisky, and Phil Blunsom. Optimizing performance of recurrent neural networks on GPUs. arXiv preprint arXiv:1604.01946, 2016.
[2] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In ICML, 2016.
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[4] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE TNN, 5(2):157–166, 1994.
[5] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009.
[6] Luca Bertinetto, João F. Henriques, Jack Valmadre, Philip Torr, and Andrea Vedaldi. Learning feed-forward one-shot learners. In NIPS, 2016.
[7] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural networks. In ICLR, 2017.
[8] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark Hasegawa-Johnson, and Thomas Huang. Dilated recurrent neural networks. In NIPS, 2017.
[9] Jianxu Chen, Lin Yang, Yizhe Zhang, Mark Alber, and Danny Z. Chen. Combining fully convolutional and recurrent neural networks for 3D biomedical image segmentation. In NIPS, 2016.
[10] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. In ICML, 2015.
[11] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. In ICLR, 2017.
[12] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A Matlab-like environment for machine learning. In NIPS Workshop, 2011.
[13] Tim Cooijmans, Nicolas Ballas, César Laurent, and Aaron Courville. Recurrent batch normalization. In ICLR, 2017.
[14] Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In NIPS, 2016.
[15] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In NIPS, 2013.
[16] Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, and Sanjeev Satheesh. Persistent RNNs: Stashing recurrent weights on-chip. In ICML, 2016.
[17] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
[18] Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate tensorization: compressing convolutional and FC layers alike. In NIPS Workshop, 2016.
[19] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
[20] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
[21] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[22] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.
[23] David Ha, Andrew Dai, and Quoc V. Le. HyperNetworks. In ICLR, 2017.
[24] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[25] Marcus Hutter. The human knowledge compression contest. URL http://prize.hutter1.net, 2012.
[26] Ozan Irsoy and Claire Cardie. Modeling compositionality with multiplicative recurrent neural networks. In ICLR, 2015.
[27] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In ICML, 2015.
[28] Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In NIPS, 2016.
[29] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In ICLR, 2016.
[30] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. In ICLR, 2016.
[31] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[32] Ben Krause, Liang Lu, Iain Murray, and Steve Renals. Multiplicative LSTM for sequence modelling. In ICLR Workshop, 2017.
[33] Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
[34] Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[35] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[36] Tao Lei and Yu Zhang. Training RNNs as fast as CNNs. arXiv preprint arXiv:1709.02755, 2017.
[37] Gundram Leifert, Tobias Strauß, Tobias Grüning, Welf Wustlich, and Roger Labahn. Cells in multidimensional recurrent neural networks. JMLR, 17(1):3313–3349, 2016.
[38] Asier Mujika, Florian Meier, and Angelika Steger. Fast-slow recurrent neural networks. In NIPS, 2017.
[39] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P. Vetrov. Tensorizing neural networks. In NIPS, 2015.
[40] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[41] Viorica Patraucean, Ankur Handa, and Roberto Cipolla. Spatio-temporal video autoencoder with differentiable memory. In ICLR Workshop, 2016.
[42] Bernardino Romera-Paredes and Philip Hilaire Sean Torr. Recurrent instance segmentation. In ECCV, 2016.
[43] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
[44] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
[45] Marijn F. Stollenga, Wonmin Byeon, Marcus Liwicki, and Juergen Schmidhuber. Parallel multi-dimensional LSTM, with application to fast biomedical volumetric image segmentation. In NIPS, 2015.
[46] Ilya Sutskever, James Martens, and Geoffrey E. Hinton. Generating text with recurrent neural networks. In ICML, 2011.
[47] Graham W. Taylor and Geoffrey E. Hinton. Factored conditional restricted Boltzmann machines for modeling motion style. In ICML, 2009.
[48] Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
[49] Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Full-capacity unitary recurrent neural networks. In NIPS, 2016.
[50] Lin Wu, Chunhua Shen, and Anton van den Hengel. Deep recurrent convolutional networks for video-based person re-identification: An end-to-end approach. arXiv preprint arXiv:1606.01609, 2016.
[51] Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan Salakhutdinov. On multiplicative integration with recurrent neural networks. In NIPS, 2016.
[52] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
[53] Saizheng Zhang, Yuhuai Wu, Tong Che, Zhouhan Lin, Roland Memisevic, Ruslan R. Salakhutdinov, and Yoshua Bengio. Architectural complexity measures of recurrent neural networks. In NIPS, 2016.
[54] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks. In ICML, 2017.
A Mathematical Definition for Cross-Layer Convolutions
A.1 Hidden State Convolution
The hidden state convolution in (6) is defined as:
$$A_{t,p,m^o} = \sum_{k=1}^{K} \sum_{m^i=1}^{M^i} H^{cat}_{t-1,\, p - \frac{K-1}{2} + k,\, m^i} \cdot W^h_{k, m^i, m^o} + b^h_{m^o} \quad (24)$$
where $m^o \in \{1, 2, \dots, M^o\}$ and zero padding is applied to keep the tensor size.
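The following NumPy sketch vectorizes Eq. (24) over output locations; the padding amounts are our reading of the index convention above (odd $K \ge 3$ assumed), and the names are illustrative:

```python
import numpy as np

def cross_layer_conv(H_cat, W_h, b_h):
    """Eq. (24), vectorized over output locations.
    H_cat: (P+1, M_i); W_h: (K, M_i, M_o); returns A_t of shape (P, M_o)."""
    P = H_cat.shape[0] - 1
    K, M_i, M_o = W_h.shape
    # zero padding keeps P output locations despite the (P+1)-row input
    pad_top = max(0, (K - 1) // 2 - 1)
    pad_bot = K - 1 - (K - 1) // 2
    H_pad = np.vstack([np.zeros((pad_top, M_i)), H_cat, np.zeros((pad_bot, M_i))])
    windows = np.stack([H_pad[p:p + K] for p in range(P)])   # (P, K, M_i)
    return np.tensordot(windows, W_h, axes=([1, 2], [0, 1])) + b_h
```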
A.2 Memory Cell Convolution
The memory cell convolution in (17) is defined as:
$$C^{conv}_{t-1,p,m} = \sum_{k=1}^{K} C_{t-1,\, p - \frac{K-1}{2} + k,\, m} \cdot W^c_{t,k,1,1}(p) \quad (25)$$
To prevent the stored information from being flushed away, $C_{t-1}$ is padded with the replication of its boundary values instead of zeros or input projections.
B Derivation for the Constraint of L, P, and K

Figure 8: Illustration of calculating the constraint of $L$, $P$, and $K$. Each column is a concatenated hidden state tensor with tensor size $P+1=4$ and channel size $M$. The volume of the output receptive field (blue region) is determined by the kernel radius $K^r$. The output $y_t$ for the current timestep $t$ is delayed by $L-1=2$ timesteps.

Here we derive the constraint on $L$, $P$, and $K$ that is defined in (9). The kernel center location is ceiled in case the kernel size $K$ is not odd. Then, the kernel radius $K^r$ can be calculated as:
$$K^r = \frac{K - K \bmod 2}{2} \quad (26)$$
As shown in Fig. 8, to guarantee that the receptive field of $y_t$ covers $x_{1:t}$ but does not cover $x_{t+1:T}$, the following constraint should be satisfied:
$$\tan \angle AOD \le \tan \angle BOD < \tan \angle COD \quad (27)$$
which means:
$$\frac{P}{L} \le \frac{K^r}{1} < \frac{P}{L-1} \quad (28)$$
Plugging (26) into (28), we get:
$$L = \left\lceil \frac{2P}{K - K \bmod 2} \right\rceil \quad (29)$$
C Memory Cell Convolution Helps to Prevent Vanishing/Exploding Gradients

Leifert et al. [37] have proved that the lambda gate, which is very similar to our memory cell convolution kernel, can help to prevent vanishing/exploding gradients (see Theorems 17-18 in [37]). The differences between our approach and their lambda gate are: (i) we normalize the kernel values through a softmax function, while they normalize the gate values by dividing them by their sum, and (ii) we share the kernel across all channels, while they do not. However, as neither modification affects the conditions of validity of Theorems 17-18 in [37], our memory cell convolution can also help to prevent vanishing/exploding gradients.
D Training Details
D.1 Objective Function
The training objective is to minimize the negative log-likelihood (NLL) of the training sequences w.r.t. the (vectorized) parameter $\theta$, i.e.,
$$\min_\theta \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} -\ln p\left(y^d_{n,t} \mid f(x^d_{n,1:t}; \theta)\right) \quad (30)$$
where $N$ is the number of training sequences, $T_n$ the length of the $n$-th training sequence, and $p(y^d_{n,t} \mid f(x^d_{n,1:t}; \theta))$ the likelihood of the target $y^d_{n,t}$ conditioned on its prediction $y_{n,t} = f(x^d_{n,1:t}; \theta)$. Since all experiments are classification problems, $y^d_{n,t}$ is represented as the one-hot encoding of the class label, and the output function $\varphi(\cdot)$ is defined as a softmax function, which is used to generate the class distribution $y_{n,t}$. Then, the likelihood can be calculated as $p(y^d_{n,t} \mid y_{n,t}) = y_{n,t,s}$ with $s$ such that $y^d_{n,t,s} = 1$.
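A sketch of the per-sequence NLL of Eq. (30) for this classification setting, with names of our own:

```python
import numpy as np

def sequence_nll(probs, targets):
    """Eq. (30), one sequence: probs[t] is the softmax output y_{n,t} over the
    S classes, targets[t] the index s of the one-hot label y^d_{n,t}."""
    return -np.log(probs[np.arange(len(targets)), targets]).sum()
```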
D.2 Common Settings
In all tasks, the NLL (see (30)) is used as the training objective and is minimized by Adam [31] with a learning rate of 0.001. Forget gate biases are set to 4 for the image classification tasks and to 1 [27] for the others. All models are implemented in Torch7 [12] and accelerated by cuDNN on Tesla K80 GPUs. We only apply CN to the output of the tLSTM hidden state, as we have tried different combinations and found this to be the most robust way to consistently improve performance across all tasks. With CN, the output of the hidden state becomes:
$$H_t = \phi(CN(C_t; \Gamma, B)) \odot O_t \quad (31)$$
D.3 Wikipedia Language Modeling
As in [10], we split the dataset into 90M/5M/5M characters for training/validation/test. In each iteration, we feed the model a mini-batch of 100 subsequences of length 50. During the forward pass, the hidden values at the last timestep are preserved to initialize the next iteration. We terminate training after 50 epochs.
D.4 Algorithmic Tasks
Following [30], for both tasks we randomly generate 5M samples for training and 100 samples for test, and set the mini-batch size to 15. Training proceeds for at most 1 epoch³ and is terminated early if 100% test accuracy is achieved.

³To simulate the online learning process, we use all training samples only once.
D.5 MNIST Image Classification

We set the mini-batch size to 50 and use early stopping for training. The training loss is calculated at the last timestep.