Preprint. Work in progress.
LANGUAGE MODELING USING LMUS:
10X BETTER DATA EFFICIENCY OR IMPROVED
SCALING COMPARED TO TRANSFORMERS
Narsimha Chilkuri, Eric Hunsberger, Aaron Voelker, Gurshaant Malik & Chris Eliasmith
Applied Brain Research
Waterloo, Canada
first.last@appliedbrainresearch.com
ABSTRACT
Recent studies have demonstrated that the performance of transformers on the task of language modeling obeys a power-law relationship with model size over six orders of magnitude. While transformers exhibit impressive scaling, their performance hinges on processing large amounts of data, and their computational and memory requirements grow quadratically with sequence length. Motivated by these considerations, we construct a Legendre Memory Unit based model that introduces a general prior for sequence processing and exhibits an O(n) and O(n ln n) (or better) dependency for memory and computation, respectively. Over three orders of magnitude, we show that our new architecture attains the same accuracy as transformers with 10x fewer tokens. We also show that for the same amount of training our model improves the loss over transformers about as much as transformers improve over LSTMs. Additionally, we demonstrate that adding global self-attention complements our architecture and the augmented model improves performance even further.
1 INTRODUCTION
Self-attention architectures such as the transformer (Vaswani et al., 2017) have been extremely successful in dealing with sequential data and have come to replace LSTMs and other RNN-based methods, especially in the domain of Natural Language Processing (Radford et al., 2018; 2019; Devlin et al., 2018). Transformers facilitate parallelization within training examples, and this allows them to fully leverage hardware accelerators such as GPUs, making training on datasets as large as 750GB feasible (Raffel et al., 2019; Gao et al., 2020). In addition to parallelizing training, self-attention architectures are much better at handling long-range dependencies than traditional RNNs, which allows them to take advantage of contexts much longer than the 100-1000 tokens typical of RNNs such as the LSTM (Voelker & Eliasmith, 2018; Kaplan et al., 2020).
Transformers are general-purpose architectures that can be applied to a wide variety of problems and modalities. One of the drawbacks of such generality is the lack of a priori structure, which makes them heavily reliant on large quantities of data to achieve good results. Another limiting factor is that self-attention involves the computation of the attention matrix, QK^T, which is of shape n × n, with n being the sequence length. Thus, a transformer's compute and memory requirements grow quadratically with respect to the sequence length.
In this work, we explore a way of addressing these limitations. We base our approach on the non-parametric Linear Time-Invariant (LTI) component of the Legendre Memory Unit (Voelker et al., 2019). This LTI system, which we refer to here as the LMU,¹ projects a sliding window of the input sequence onto Legendre polynomials to provide a temporal representation and compression of the input signal. Although the LTI system is an RNN, it has been shown to support both sequential and parallel processing of sequences (Chilkuri & Eliasmith, 2021).
¹ We refer to the LTI system as the LMU as it is the distinguishing layer of our architectures. The original LMU was defined to include the LTI system as well as a subsequent nonlinear layer. We have essentially expanded this nonlinear layer to include a variety of familiar layers.
Figure 1: (left) Illustration of the standard sequential implementation for computing the hidden state m_4. The input x_1 is fed into the linear recurrent unit to compute the hidden state m_1, which, along with x_2, is then used to compute the next hidden state, and so on. (right) Illustration of the time-domain parallel implementation for computing the hidden state m_4. The inputs x_1-x_4 are used to compute the intermediate multiplies, which are then added together to compute the hidden state m_4, all without the need for any sequential operations.
Another crucial component of our model is a modified attention mechanism that operates only on the output of the LMU at each time step, and not across time steps. The LMU state at each step captures information about the past tokens, and hence we call this attention mechanism implicit self-attention.
Recent work (Kaplan et al., 2020; Gao et al., 2020) has explored the scaling properties of transformers in the context of autoregressive language modelling. They show that the performance of transformers scales as a power-law with model size (excluding embedding parameters), dataset size, and the amount of compute used for training, when not bottlenecked by the other two. Inspired by these results, we validate our method by studying the scaling properties of our models. To that end, we start by reviewing the necessary background in Section 2, discuss related work in Section 3, and present the architectural details of our model in Section 4. In Section 5, we present our experiments with models containing up to 10 million parameters (or 1 million non-embedding parameters), which demonstrate a smooth power-law relationship with respect to the model size, and a loss that scales better than transformers. When the loss is matched to that of transformers, 10x fewer tokens are required.
2 BACKGROUND: THE LEGENDRE MEMORY UNIT
The non-parametric LTI component of the LMU (Voelker & Eliasmith, 2018) is the focus of our study. This LTI system is mathematically derived to project a sliding window of length θ of the input sequence onto q Legendre polynomials. Thus, the two main hyper-parameters to choose when using it are θ and q. Naturally, if we desire to capture the fine-grained details of the input sequence, a large θ, which sets the length of the sliding window, should be accompanied by a large q, the number of Legendre polynomials used in approximating the input. We present the state-update equations of the LTI component of the LMU below,²

m_t = Ā m_{t-1} + B̄ x_t,   (1)

where the matrices Ā = e^A ∈ R^{q×q} and B̄ = A^{-1}(e^A - I)B ∈ R^{q×1} are frozen during training, with A and B defined as follows:

A_{i,j} = (2i + 1)/θ · { -1,            i < j
                         (-1)^{i-j+1},  i ≥ j },   (2)

B_i = (2i + 1)(-1)^i / θ.   (3)
Crucially, when needed, the LTI equation (1) above can be evaluated as a convolution, in parallel, as shown below (Chilkuri & Eliasmith, 2021):

m_t = Σ_{j=1}^{t} Ā^{t-j} B̄ x_j,   (4)

² Focusing on one-dimensional inputs for now.
or equivalently, defining

H = [Ā⁰B̄  ĀB̄  …] ∈ R^{q×n},   (5)
x = [x_n  x_{n-1}  x_{n-2}  …  x_1]^T ∈ R^{n×1},   (6)

the above convolution equation can be written as an element-wise multiplication in the Fourier space as follows:

m_{1:n} = F^{-1}{ F{H} · F{x} }.   (7)
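To make the two evaluation modes concrete, the following is a minimal NumPy/SciPy sketch (our illustration, not the authors' implementation): it builds A and B from equations (2)-(3), discretizes them as in equation (1), and checks that the recurrent update and the FFT-based convolution of equations (4)-(7) agree on a random one-dimensional input (as in footnote 2). The function names, the use of scipy.linalg.expm, and the zero-padding length are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm

def lmu_matrices(q, theta):
    # Continuous-time LMU matrices, equations (2) and (3).
    A = np.zeros((q, q))
    B = np.zeros((q, 1))
    for i in range(q):
        B[i, 0] = (2 * i + 1) * (-1) ** i / theta
        for j in range(q):
            A[i, j] = (2 * i + 1) / theta * (-1 if i < j else (-1) ** (i - j + 1))
    # Discretization as stated in equation (1).
    A_bar = expm(A)
    B_bar = np.linalg.solve(A, A_bar - np.eye(q)) @ B
    return A_bar, B_bar

def lmu_sequential(x, A_bar, B_bar):
    # Recurrent evaluation: m_t = A_bar m_{t-1} + B_bar x_t.
    m = np.zeros((A_bar.shape[0], 1))
    states = []
    for x_t in x:
        m = A_bar @ m + B_bar * x_t
        states.append(m.copy())
    return np.hstack(states)                    # shape (q, n); column t-1 holds m_t

def lmu_parallel(x, A_bar, B_bar):
    # Convolution with the impulse response H = [B_bar, A_bar B_bar, ...] of
    # equations (4)-(7), evaluated with FFTs; zero-padding avoids circular wrap-around.
    n, q = len(x), A_bar.shape[0]
    H = np.empty((q, n))
    h = B_bar.copy()
    for t in range(n):
        H[:, t] = h[:, 0]
        h = A_bar @ h
    M = np.fft.irfft(np.fft.rfft(H, 2 * n, axis=1) * np.fft.rfft(x, 2 * n), 2 * n, axis=1)
    return M[:, :n]                             # shape (q, n), matches the recurrence

x = np.random.randn(64)
A_bar, B_bar = lmu_matrices(q=8, theta=16.0)
assert np.allclose(lmu_sequential(x, A_bar, B_bar), lmu_parallel(x, A_bar, B_bar))
```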
3 RELATED WORK
Our work falls under the broad category of combining convolution with self-attention. Recent work
has demonstrated that combining these two modules can be beneficial for language modeling, speech
and other NLP applications (Yang et al., 2019; Wu et al., 2020; Gulati et al., 2020). For instance, Wu
et al. (2020) introduce the Long-short Range Attention module that features a two-branch design,
with the self-attention branch learning global interactions and the convolutional branch learning
local interactions. Gulati et al. (2020) on the other hand propose a model that uses a single branch
architecture, with the convolutional block connected directly to the attention block, and demonstrate
improved performance on the task of speech recognition.
While our work is mathematically related to the models mentioned above, it differs from the previous approaches in three crucial ways: first, we do not learn the convolutional weights, but instead work with the analytically defined weights of the LMU; second, we introduce the novel implicit self-attention module that works on the hidden states of the LMU at each time-step; finally, we systematically study the scaling properties of our model in the context of language modeling.
Our model is also related to the many studies on reducing the quadratic computational and memory complexity of self-attention (Tay et al., 2020). Out of the several ways of improving efficiency, our work shares some similarity with the sliding-window attention approach (Beltagy et al., 2020; Zaheer et al., 2020). While these methods use a form of masking of the full self-attention matrix to constrain attention to the k-nearest tokens, our technique relies on the LMU to compute optimal compressed representations of sliding windows of input vectors, each of which is then used as input to the implicit attention module.
4 ARCHITECTURE
In this paper, we modify the architecture presented in Chilkuri & Eliasmith (2021) to better deal
with the task of language modelling, especially when the sequences are long and high-dimensional.
Starting with the base LMU model, we describe the major components of our model below. An
illustration of our architecture is presented in Figure 2.
Memory Matrix   Consider an input sequence {x_1, x_2, ..., x_n} of length n, where the individual elements are of dimension x_i ∈ R^d. The most natural way of using an LMU-based model on such sequences is to set θ = n and use an appropriately large q. The downside of using the LMU in such a manner, however, is that the hidden state of the LMU scales with input dimension and order: m ∈ R^{dq}. For example, in Section 5, we deal with n = 1024 and d that is as large as 204. Even when using a small q of 100, we may end up with hidden states that are as large as R^{20k}, which is highly undesirable.
One way around this issue is to take inspiration from standard convolutional network architectures (CNNs) and work with a smaller sliding window, θ ≈ 10, which in turn allows us to use a small LMU order, q ≈ 5, thus taming the hidden state dimension (see Chilkuri & Eliasmith (2021) for more details). However, enforcing a small sliding window prompts the use of many stacked LMU layers in order to increase the 'receptive field' (or the effective θ) of the model, very similar to how CNNs often use small kernels with many convolutional layers. Unsurprisingly, such an approach results in very deep models, which can be problematic to train.
Here, we choose to follow the middle path, i.e., 0 ≪ q ≪ n, but instead of working directly with the potentially high-dimensional hidden state m, we disentangle the input dimensions from the order.
Figure 2: The LMU and implicit self-attention architecture along with output dimensions. In the illustration, n refers to the sequence length, q is the order, q′ is the reduced order, and d is the embedding dimension. Normalization layers and skip connections are not shown. One variant uses the FFN component right after the input, and the other variant uses global attention.
In other words, we perform our operations on the matrix M ∈ R^{d×q} and not on the vector m ∈ R^{dq}. This is beneficial because while a fully-connected layer needs d²·q² parameters to process the m vector, processing the matrix M with the help of two fully connected layers, one for the rows and one for the columns, requires only d² + q² parameters.
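For concreteness, with the values used for the largest model in Section 5 (d ≈ 204 and q = 250), a fully-connected layer on the flattened vector m would need d²·q² ≈ 2.6 × 10⁹ parameters, whereas the factored row and column projections need only d² + q² ≈ 1.0 × 10⁵.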
Implicit Self-Attention   The main feature distinguishing our architecture from past work is the LMU. As mentioned above, the output of the LMU layer compresses past history at each time-step, which is captured by the matrix M ∈ R^{q×d}. Our modified self-attention acts on this matrix to combine temporal information. As a result, self-attention does not act directly on the input sequence, but rather on a compressed version of the input, which is available at each moment in time and covers a window determined by θ. Ignoring the bias vectors, normalization layers, and skip-connections, we first execute the following sets of operations simultaneously:
Q = σ(L_1 M),  K = σ(L_2 M),  V = σ(L_3 M),   (8)

where L_i ∈ R^{q′×q}, σ is a non-linearity such as gelu, and the matrices Q, K, and V are all in R^{q′×d}. In our experiments, we have found the setting q′ = q/10 to work well, and thus our attention matrices contain far fewer elements than n². For example, in our largest model we set q = 250, resulting in q′ = 25.
Following the computation of Q, K, and V, two additional computations result in a d-dimensional vector m:

M′ = softmax(QK^T)V,   (9)
m = pM′,   (10)

where p ∈ R^{1×q′}.
In practice, equation (8) can be made far more efficient computationally, especially during inference,
by following the recipe outlined in Section A.1.
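As a concrete illustration of equations (8)-(10), here is a minimal NumPy sketch of the implicit self-attention computation at a single time-step (our reading, not the authors' code), with biases, normalization layers, and skip connections omitted as in the text. The tanh-based GELU approximation, the softmax axis, and the random initialization are illustrative assumptions.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU non-linearity (illustrative choice of sigma).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def implicit_self_attention(M, L1, L2, L3, p):
    """Equations (8)-(10) for one time-step.

    M  : (q, d)   LMU memory matrix at this step
    Li : (qr, q)  learned reduction matrices, qr = q' (the reduced order)
    p  : (1, qr)  learned pooling vector
    """
    Q, K, V = gelu(L1 @ M), gelu(L2 @ M), gelu(L3 @ M)   # each (qr, d)
    M_prime = softmax(Q @ K.T) @ V                        # (qr, qr) attention, then (qr, d)
    return (p @ M_prime)[0]                               # d-dimensional output vector m

# Shapes from the paper's largest setting: q = 250, q' = 25, d = 204.
q, qr, d = 250, 25, 204
rng = np.random.default_rng(0)
M = rng.standard_normal((q, d))
L1, L2, L3 = (rng.standard_normal((qr, q)) / np.sqrt(q) for _ in range(3))
p = rng.standard_normal((1, qr)) / np.sqrt(qr)
print(implicit_self_attention(M, L1, L2, L3, p).shape)    # (204,)
```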
Feedforward Network   We have also found it beneficial to include a feedforward network (FFN) (Vaswani et al., 2017) before the LMU and after the implicit self-attention block.
Table 1: Memory and computation scaling with sequence length during training and inference.
Layer             Memory   Compute
Full Attention    O(n²)    O(n²)
LMU (Parallel)    O(n)     O(n ln n)
LMU (Recurrent)   O(1)     O(n)
Table 2: Parameter counts and compute (forward pass) for one layer of the network, per token. The first row indicates the number of FLOPs when following the implementation in Section A.1. Additional background information regarding various implementations of the LMU is provided in Appendix A.2.

Operation       Parameters   FLOPs per Token
LMU + Q+K+V     3qq′         3d[5(q′ + 1)(log₂ n + 1) + 6q′] + 6qq′
QK^T            –            2dq′²
M′              –            2dq′² + dq′
m               q′           2dq′
FFN             2dd′         4dd′
The FFN component is defined below:

y(x) = σ(xW_1 + b_1)W_2 + b_2,   (11)

where W_1 ∈ R^{d×d′}, W_2 ∈ R^{d′×d}, b_1 ∈ R^{d′} and b_2 ∈ R^{d}.
Global Self-Attention   We also explore the use of a global self-attention block in place of the FFN component before the LMU. We find that introducing this layer, while computationally expensive, further improves the cross-entropy score. We believe that the improvement comes from the fact that the LMU and self-attention are complementary: the LMU's implicit self-attention is good at prediction with limited context, and the traditional self-attention captures long-range dependencies. We wish to explore the use of efficient self-attention blocks, which scale better than O(n²), in the future.
Complexity   As shown in Table 1, our architecture employing the parallel LMU along with implicit self-attention has memory requirements that are linear with respect to the sequence length, and computational requirements that grow as n ln n. When we use the recurrent version of the LMU, the memory and compute requirements scale as O(1) and O(n), respectively. Recurrent implementations, while not as efficient on GPU architectures for large batch sizes, are ideally suited to edge applications, especially with efficient hardware support. Notably, if we add global attention to our model, then both compute and memory become quadratic, just like the original transformer. We also list the number of floating point operations per token for the (parallel) LMU model in Table 2.
5 EXPERIMENTS
Dataset   We train our models on the publicly available internet text dataset called OpenWebText2 (OWT2).³ Similar to the WebText2 dataset (Radford et al., 2019), OWT2 was created using URLs extracted from Reddit submissions with a minimum score of 3 as a proxy for quality, and it consists of Reddit submissions from 2005 up until April 2020. After additional filtering, applying the pre-trained GPT2 tokenizer containing 50257 tokens (Radford et al., 2019) results in approximately 8 billion tokens in total. We use a train/validation/test split of 96/3/1%.
³ https://www.eleuther.ai/projects/open-web-text2/
Figure 3: Cross-entropy scores in nats, averaged across all the tokens in the sequence. The transformer and LSTM fits are from Kaplan et al. (2020). Our models perform better than the transformer and LSTM models up to 1 million non-embedding parameters.
Training Details We train our models in an autoregressive manner using the Adam optimizer with
all the default settings. We use sequences containing 1024 tokens, and in cases where the documents
have fewer than 1024 tokens, we pack multiple documents into the same sequence, separated by the
<|endofsequence|> token. We use a learning rate schedule with a linear warmup and cosine
decay to zero, while also reducing the learning rate on plateau. We chose to train our models to
process a maximum of 13 billion tokens; at a batch size of 512, this amounts to training for 25000
steps. Additionally, one of the important considerations when doing NLP experiments is the size of
the embedding vectors, d. In this work, in order to facilitate a fair comparison to the transformer
models (Kaplan et al., 2020), we use the following rule to determine d:

d = √(N/24),

where N represents the number of non-embedding (and trainable) parameters.
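For example, at the largest scale studied here, N ≈ 10⁶ non-embedding parameters gives d = √(10⁶/24) ≈ 204, the embedding dimension quoted in Section 4.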
Results Here we present the results of our experiments that use the LMU architecture described
above with the non-embedding (and trainable) parameters ranging from 55k to 1M (i.e., from 2.5M
to 10M, if we include all parameters). The cross-entropy results are presented in Figure 3. For the
transformer models, we list the scores obtained by using the following power-law fit,

Transformer(N, S) = (N / (6.5·10¹³))^(-0.077) + (S / S_min(S))^(-0.76),   (12)
where N refers to the number of non-embedding parameters and S is the total number of training steps (set to 25000 at a batch size of 512, or equivalent). The power-law is obtained from Kaplan et al. (2020); the fits were generated by training several transformer models, ranging in size from 768 to 1.5 billion non-embedding parameters, on OpenAI's WebText2 dataset. In addition, we compare against the power-law for LSTM models, also from Kaplan et al. (2020), that use 10x more training steps than the transformer and LMU models:

LSTM(N) = (N / (7.45·10¹⁴))^(-0.071).
Similar to the transformer and LSTM models, we notice that the performance of our models depends
strongly on scale, with the loss of the LMU model exhibiting the following power-law relationship
with respect to N:
LMU(N) = (N / (1.95·10¹⁴))^(-0.072).
Figure 4: (left) Comparison of per-token loss of an LMU model (with global attention) and a transformer model. (right) Per-token loss of an LMU model (without global attention) alongside the transformer's loss.
The LMU model with global self-attention scales as follows:

LMU_G(N) = (N / (3.80·10¹⁴))^(-0.069).
It remains to be seen whether our models retain this performance advantage when N > 10⁶.
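For reference, the size-only fits quoted above are simple to evaluate; the sketch below merely transcribes them (constants and exponents as written, with the sign convention of Kaplan et al. (2020) reconstructed here). The transformer fit is omitted because it additionally depends on S_min(S), which is defined in Kaplan et al. (2020).

```python
def lstm_loss(N):
    # LSTM fit from Kaplan et al. (2020), as quoted in Section 5 (cross-entropy in nats).
    return (N / 7.45e14) ** (-0.071)

def lmu_loss(N):
    # Power-law fit for the LMU model.
    return (N / 1.95e14) ** (-0.072)

def lmu_global_attention_loss(N):
    # Power-law fit for the LMU model with global self-attention.
    return (N / 3.80e14) ** (-0.069)

# Evaluating the fitted curves (not new measurements) at N = 1e6 non-embedding parameters:
for name, fit in [("LSTM", lstm_loss), ("LMU", lmu_loss), ("LMU + attention", lmu_global_attention_loss)]:
    print(f"{name}: {fit(1e6):.2f} nats")
```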
The utility of adding global attention becomes clear when we observe the per-token loss plots of
the three models. In Figure 4 (right), we notice that although the LMU’s per-token loss is better
overall than the transformer’s, it flattens relatively early, around 100 tokens, suggesting that implicit
attention alone does not capture long context. The LMU and Attention model, on the other hand,
continues improving with increasing context, similar to the transformer.
It is also interesting to note that we can compare LMUs and transformers by determining approxi-
mately how much training is required for a transformer to match LMU loss. Figure 5 demonstrates
that our models, trained on 13 billion tokens, have similar scaling to transformers trained on 130 bil-
lion tokens. Consequently, the LMU architecture is 10x more data efficient. In addition, our LMU
models with global attention continue to outperform transformer models trained on 10x more tokens
(or with 10x more training steps) by a significant margin.
6 DISCUSSION
Semi-supervised learning has proven to be a very effective technique in Natural Language Processing. General-purpose language models pre-trained on a large corpus of text in an unsupervised manner and fine-tuned on tasks such as sentiment analysis and question answering often outperform highly task-specific architectures that receive no pre-training. The performance of models on the task of language modelling is thus a crucial metric that is indicative of the downstream performance of such models on a slew of tasks involving natural language.
While the performance of our models on the task of language modelling suggests an interesting trend, given the scale of our experiments we do not consider this to be definitive evidence for the superiority of our LMU architecture. A core objective for future research is therefore to show that the observed trends hold over six orders of magnitude, as demonstrated by Kaplan et al. (2020) for transformers.
7 CONCLUSION
In this work, we employ the Legendre Memory Unit to construct a model that is well-suited to handling long sequences with high-dimensional elements, and we apply our architectures to model natural language in the infinite-data limit.
Figure 5: Approximately matching the loss between transformers and LMUs requires 10x more
training for the transformer. The LMU and Attention model continues to significantly outperform
transformers with 10x less training.
We demonstrate that: (1) like established architectures such as transformers and LSTMs, our models exhibit a power-law relationship between the cross-entropy loss and model size; and (2) at the small-to-medium scale, our models have better scaling properties than other approaches.
REFERENCES
Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer.
arXiv preprint arXiv:2004.05150, 2020.
Narsimha Reddy Chilkuri and Chris Eliasmith. Parallelizing Legendre memory unit training. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 1898–1907. PMLR, 18–24 Jul 2021. URL http://proceedings.mlr.press/v139/chilkuri21a.html.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason
Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text
for language modeling. arXiv preprint arXiv:2101.00027, 2020.
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo
Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer
for speech recognition. arXiv preprint arXiv:2005.08100, 2020.
Steven G. Johnson and Matteo Frigo. Implementing FFTs in Practice. In C. Sidney Burrus (ed.), Fast Fourier Transforms. 2012. URL https://cnx.org/contents/ulXtQbN7@15/Implementing-FFTs-in-Practice.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child,
Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language
models. arXiv preprint arXiv:2001.08361, 2020.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language under-
standing by generative pre-training. 2018.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language
models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. arXiv preprint arXiv:1910.10683, 2019.
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. arXiv
preprint arXiv:2009.06732, 2020.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information
processing systems, pp. 5998–6008, 2017.
Aaron Voelker, Ivana Kajić, and Chris Eliasmith. Legendre memory units: Continuous-time representation in recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 15544–15553, 2019.
Aaron R Voelker and Chris Eliasmith. Improving spiking dynamical networks: Accurate delays,
higher-order synapses, and time cells. Neural computation, 30(3):569–609, 2018.
Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. Lite transformer with long-short range attention. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ByeMPlHKPH.
Baosong Yang, Longyue Wang, Derek Wong, Lidia S Chao, and Zhaopeng Tu. Convolutional self-
attention networks. arXiv preprint arXiv:1904.03107, 2019.
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago
Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for
longer sequences. In NeurIPS, 2020.
A APPENDIX
A.1 REDUCED-ORDER LMU
When implementing the LMU in these models, we use a parallelizable approach that computes the impulse responses of the LMU (which are essentially the Legendre polynomials) and convolves those with the input sequences (either using raw convolution or FFT-based convolution). Specifically, given the q-dimensional impulse response H, we compute the LMU memory state M ∈ R^{d×q} at the current time as

M = X ∗ H,   (13)

where X ∈ R^{n×d} is the time-series of previous inputs to the LMU, H ∈ R^{q×n} is the LMU impulse response, and ∗ is the convolution operator.
We have found that our models are most expressive when using a value of q that is significantly larger than q′, as this allows the LMU to "remember" the time history with high fidelity, but only use the parts of the history that are most relevant. Rather than explicitly computing the full LMU output M and then reducing it with the L_i transformations as per equation (8), we propose applying the L_i transformations directly to the impulse responses,

H̃_i = L_i H,   (14)

and then applying these individually to directly compute Q, K, and V:

Q = σ(X ∗ H̃_1),  K = σ(X ∗ H̃_2),  V = σ(X ∗ H̃_3).   (15)

This is mathematically equivalent to the formulation expressed previously, but uses significantly fewer operations per token, particularly when q is large or the ratio of q′ to q is small:

L_{LMU+Q+K+V} = 3[d·L_FFT(q′) + 2qq′],   (16)

where L_FFT(q′) is given by Equation 19 with q → q′.
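A minimal NumPy sketch of this reduced-order evaluation is given below (our illustration, not the authors' implementation). It assumes H is the q × n impulse-response matrix of equation (5), uses an FFT-based causal convolution, and takes a tanh-approximated GELU for σ; these choices, and the function names, are assumptions.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU non-linearity (illustrative choice of sigma).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def causal_conv(X, H):
    """Causal (linear) convolution of the inputs with a bank of impulse responses.

    X : (n, d) input sequence, H : (k, n) impulse responses.
    Returns an (n, k, d) array: the memory at every time-step.
    """
    n, _ = X.shape
    Xf = np.fft.rfft(X, 2 * n, axis=0)             # (n + 1, d)
    Hf = np.fft.rfft(H, 2 * n, axis=1)             # (k, n + 1)
    Mf = Hf[:, :, None] * Xf[None, :, :]           # (k, n + 1, d)
    M = np.fft.irfft(Mf, 2 * n, axis=1)[:, :n, :]  # (k, n, d), causal part only
    return np.transpose(M, (1, 0, 2))              # (n, k, d)

def reduced_qkv(X, H, L1, L2, L3):
    # Equations (14)-(15): fold the L_i reductions into the impulse responses, so the
    # full (n, q, d) memory tensor is never materialized.
    H1, H2, H3 = L1 @ H, L2 @ H, L3 @ H            # each (q', n)
    return (gelu(causal_conv(X, H1)),              # Q: (n, q', d)
            gelu(causal_conv(X, H2)),              # K: (n, q', d)
            gelu(causal_conv(X, H3)))              # V: (n, q', d)
```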
A.2 LMU IMPLEMENTATION TRADE-OFFS
The LMU itself is an LTI dynamical system, with a number of options for implementation. One implementation is to perform the update each timestep in state-space, using state-space matrices discretized with the zero-order hold (ZOH) method for high accuracy. The operations required (per LMU layer and per token) are the multiplications by the Ā and B̄ matrices (with number of elements q² and q, respectively):

L_SS = 2d(q² + q).   (17)
Another option is to use an explicit Runge-Kutta method to update the LMU states. By taking advantage of the unique structure of the A and B matrices (Equations 2 and 3), this implementation is able to reduce the complexity from O(q²) to O(q), requiring the following approximate number of operations:

L_RK = 6rdq,   (18)

where r is the order of the Runge-Kutta method. The disadvantage of this option is that it does not implement the exact same dynamics as the ideal system discretized with ZOH, and it is less numerically stable, particularly for higher values of q.
A disadvantage of both these options is that they must update the LMU states sequentially, which is particularly ill-suited to highly parallel hardware (e.g., GPUs) with a long sequence of inputs available. In this case, we can take the impulse response of the LMU system (discretized with ZOH) and convolve it with the input in the FFT domain. This implements the exact same dynamics as the ZOH state-space system, but with a complexity that is O(q) rather than O(q²):

L_FFT = (d/n)[C(2n) + c_m qn + qC(2n)] = d[5(log₂ n + 1)(q + 1) + 6q].   (19)
Here, C(n) is the number of FLOPs for a radix-2 Cooley-Tukey FFT implementation (Johnson & Frigo, 2012):

C(n) = 2C(n/2) + (n/2)(c_m + 2c_a)   (20)
     = 5n log₂ n,   (21)

where c_m = 6 is the number of FLOPs per complex multiply and c_a = 2 is the number of FLOPs per complex addition. For our standard sequence length of n = 1024, this results in:

L_FFT,1024 = d(61q + 55).   (22)
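The operation counts above are straightforward to tabulate; the short sketch below transcribes equations (17)-(19), taking c_m = 6 and c_a = 2, and checks the n = 1024 special case of equation (22). The helper names, the Runge-Kutta order, and the example values of d and q are illustrative.

```python
import math

def flops_state_space(d, q):
    # Equation (17): per-token cost of the ZOH state-space update.
    return 2 * d * (q ** 2 + q)

def flops_runge_kutta(d, q, r=4):
    # Equation (18): per-token cost of an explicit Runge-Kutta update of order r
    # (r = 4 is an illustrative default).
    return 6 * r * d * q

def flops_fft(d, q, n):
    # Equation (19): per-token cost of the FFT-based convolution.
    return d * (5 * (math.log2(n) + 1) * (q + 1) + 6 * q)

# Sanity check of equation (22): for n = 1024, L_FFT reduces to d(61q + 55).
d, q, n = 204, 250, 1024
assert math.isclose(flops_fft(d, q, n), d * (61 * q + 55))
print(flops_state_space(d, q), flops_runge_kutta(d, q), flops_fft(d, q, n))
```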