A FORMAL FRAMEWORK FOR UNDERSTANDING
LENGTH GENERALIZATION IN TRANSFORMERS
Xinting Huang1  Andy Yang2  Satwik Bhattamishra3  Yash Sarrof1
Andreas Krebs4  Hattie Zhou5  Preetum Nakkiran6  Michael Hahn1
1Saarland University  2University of Notre Dame  3University of Oxford
4University of Tübingen  5Mila, Université de Montréal  6Apple
ABSTRACT
A major challenge for transformers is generalizing to sequences longer than those observed during training. While previous works have empirically shown that transformers can either succeed or fail at length generalization depending on the task, theoretical understanding of this phenomenon remains limited. In this work, we introduce a rigorous theoretical framework to analyze length generalization in causal transformers with learnable absolute positional encodings. In particular, we characterize those functions that are identifiable in the limit from sufficiently long inputs with absolute positional encodings under an idealized inference scheme using a norm-based regularizer. This enables us to prove the possibility of length generalization for a rich family of problems. We experimentally validate the theory as a predictor of success and failure of length generalization across a range of algorithmic and formal language tasks. Our theory not only explains a broad set of empirical observations but also opens the way to provably predicting length generalization capabilities in transformers.
1 INTRODUCTION
A key problem in neural sequence modeling is generalization from shorter to longer sequences, known as length generalization. A wide range of empirical research has found that transformers' ability at length generalization is mixed, with success found on some problems and failure on others (e.g. Bhattamishra et al., 2020; Anil et al., 2022; Wang et al., 2024a; Kazemnejad et al., 2023; Zhou et al., 2024b; Awasthi and Gupta, 2023; Jelassi et al., 2023; 2024; Chang and Bisk, 2024). For instance, while transformer decoders can easily copy long strings (Bhattamishra et al., 2024), length generalization is substantially more brittle and depends on the absence of repetitions in the string (Zhou et al., 2024a; Jelassi et al., 2024). Similarly, while transformers can in principle simulate many finite-state automata (Liu et al., 2023), their success at length generalization in practice varies widely across different automata (Liu et al., 2023; Bhattamishra et al., 2020). Theoretical understanding of these phenomena is largely lacking, making it difficult to anticipate on which problems transformers will succeed or fail to generalize beyond the length of their training inputs.
An important step towards theoretical understanding was made in the RASP-L Conjecture (Zhou et al., 2024a). This conjecture states that transformers show good length generalization exactly on those problems that have simple programs in RASP-L, a fragment of the RASP language (Weiss et al., 2021) with substantial restrictions on the ways in which positional information can be used. While Zhou et al. (2024a) provided empirical evidence in support of this idea, two important gaps remain: First, the RASP-L language has not been fully formalized and its expressiveness itself is not well understood; thus, it is largely open how to prove that a certain problem is indeed not representable in it. Second, while compelling empirical evidence supports a link between definability in RASP fragments and length generalization, no formal proof exists.
We present a general theoretical framework analyzing length generalization as ultimate identifiability in the limit: When the input-output behavior of a function is observed at longer and longer input lengths, we ask under what conditions a learner can at some point converge on inferring the ground-truth function. We answer this in the positive for a specific idealized learning strategy and a well-defined class of functions: whenever a function belongs to this class, transformers are guaranteed to length-generalize in an idealized setting.
XH and AY are co-first authors.
Lead senior author. Contact: mhahn@lst.uni-saarland.de
arXiv:2410.02140v1 [cs.LG] 3 Oct 2024
Our results apply to multilayer transformers, focusing on causal transformers with absolute positional encodings (APE) or without positional encodings (NoPE). A key technical challenge in analyzing length generalization for absolute positional encodings is the scaling of the transformer's parameter count with the input length. To address this, we define a transformer-like limiting object, the Limit Transformer, which encapsulates the computations of a sequence of transformers operating on longer and longer inputs into a single object. We then define an idealized inference procedure in which transformers are fitted to reproduce a target function on successively longer inputs while minimizing a specific norm-based regularizer. Our main theoretical result states that the inference procedure will ultimately lead to length generalization for sufficiently long training inputs, provided the ground-truth function is expressible by a single such limiting object across all input lengths:
Theorem 1 (Informal Version of Theorem 7). Let f be the target function, expressible by a single Limit Transformer at all input lengths, subject to restrictions on the use of positional information. Choose transformers T_n (n = 1, 2, 3, ...) with context size n, where T_n reproduces the behavior of f up to length n/2 while minimizing a norm-based regularizer. Then, for large n, T_n will match the output of the target function f up to length n.
We then show that the expressivity of Limit Transformers can be understood for many functions. We extend a recently introduced RASP variant (C-RASP, Yang and Chiang, 2024) to provide lower bounds, showing that transformers will succeed at length generalization on various concrete problems under the inference procedure. Conversely, we employ communication complexity to obtain upper bounds on the class of functions for which length generalization is predicted. Experiments confirm the success of the theory at predicting empirical length generalization behavior across various algorithmic tasks and formal languages. Overall, our results formalize the RASP-L Conjecture and take a step toward a theoretical understanding of length generalization.
2 MODEL OF TRANSFORMERS
Positional Encoding Scheme We study two positional encoding schemes. One uses no positional encoding at all; we refer to this as NoPE (No Positional Encoding). The other one uses Absolute Positional Encodings (APE), with learned per-position embedding vectors p_1, ..., p_N. We follow Zhou et al. (2024a) in requiring transformers to be able to perform a task at different offsets within a longer context. Whereas Zhou et al. (2024a) concatenated different examples of a task, we simply encode an input x of length |x| = k ≤ N using positional encodings p_{1+o}, ..., p_{k+o}, where o is an offset such that k + o ≤ N, and require that the transformer correctly performs the task independently of the offset o ≥ 0. This mimics the computations in language models, where the same reasoning task can typically appear at different places in a long context. For simplicity, we treat positions outside of the input, including those preceding the offset, as empty.
Parameterization We focus on transformers with causal masking; for simplicity, we will use the term "transformer" for these throughout. A transformer T is parameterized by a finite alphabet Σ, a width d ∈ N, a token embedding matrix E ∈ R^{|Σ|×d}, a context width N(T) ∈ N ∪ {+∞}, positional encodings {p_i ∈ R^d : 1 ≤ i < N(T) + 1}, a depth L ∈ N and head count H ∈ N, key, query, and value matrices {K_{l,h}, Q_{l,h}, V_{l,h} ∈ R^{d×d} : 1 ≤ l ≤ L, 1 ≤ h ≤ H}, MLP matrices and biases {A_l ∈ R^{d×d}, B_l ∈ R^{d×d}, b_l ∈ R^d : 1 ≤ l ≤ L}, and an unembedding matrix U ∈ R^{|Σ|×d}. For matrices, we use ‖·‖ and ‖·‖_F to denote the spectral and Frobenius norm, respectively.
Computation of Activations and Outputs We assume a standard causal transformer, with a few technical points: We explicitly scale attention logits with the logarithm of the input length, omit layer norm, allow Heaviside activations in addition to ReLU activations, and assume that, while the transformer may overall compute at infinite precision, attention logits are operated over at fixed fractional precision. We next define all computations formally. Reserving a special SOS symbol not in Σ, written as "$", we take the set of input strings to be S, the set of strings x over Σ ∪ {$} where x_1 = $ and $ does not occur in x_{2...|x|}. We now define the computation of the transformer T on an input x ∈ S where 1 ≤ |x| < N(T) + 1 (that is, |x| ≤ N(T) if N(T) < ∞, and |x| is any finite length
otherwise). If L is the number of layers, then we write the output of layer l = 1, ..., L at position i = 1, ..., N(T) as y_i^{(l)} ∈ R^d. Let o ≥ 0 be any offset such that |x| + o < N(T) + 1, that is, x still fits into the transformer's context width if encoded at offset o. Given this offset, we set

y_i^{(0)} = E_{x_i} + p_{i+o}, \qquad i = 1, \ldots, |x| \qquad (1)

where x_i ∈ Σ is the input symbol at position i. Attention logits, at query position i and key position j, are computed as

a_{i,j}^{(l,h)} = (y_j^{(l-1)})^T K_{l,h}^T Q_{l,h} \, y_i^{(l-1)} \qquad \text{for } 1 \le j \le i \le |x|;\; l = 1, \ldots, L;\; h = 1, \ldots, H \qquad (2)
We assume standard softmax attention, but incorporate scaling with log |x|, following prior work finding it necessary to theoretically represent sparse functions and circumvent theoretical limitations of soft attention (Chiang and Cholak, 2022; Edelman et al., 2022):

Y_i^{(l)} := y_i^{(l-1)} + \sum_{h=1}^{H} \frac{\sum_{j=1}^{i} \exp(\log|x| \cdot a_{i,j}^{(l,h)}) \, V_{l,h} \, y_j^{(l-1)}}{\sum_{j=1}^{i} \exp(\log|x| \cdot a_{i,j}^{(l,h)})} \qquad (3)
After each attention block, the activations are passed through a one-layer MLP:
y_i^{(l)} := Y_i^{(l)} + B_l \cdot \psi_l(A_l Y_i^{(l)} + b_l) \qquad (4)
where we allow the activation function ψ_l to be, in each coordinate, either ReLU or Heaviside (see Appendix D.1 for discussion of this). We omit layer norm, as it plays no important role in our results; see Appendix D.3 for how it can be accounted for.
We assume an infinite-precision setup for the activations, with the restriction that attention logits (2) and the output of the exp(·) function are both rounded to p fractional bits of precision before further processing. This is a mild restriction preventing tiny changes in attention patterns from potentially snowballing into large changes in the output due to infinite precision. See Appendix D.2.
We conceive of a transformer T as a map from strings x ∈ S (|x| ≤ N(T)) to vectors of next-token prediction logits, T(x, o) ∈ R^{|x|×|Σ|}, where for i = 1, ..., |x|, T(x, o)_i = U y_i^{(L)} for the unembedding matrix U ∈ R^{|Σ|×d}, and o is the offset. In line with the assumed setup, we focus on transformers whose input-output behavior is invariant across offsets: T(x, o) = T(x, o') for any 0 ≤ o, o' ≤ N(T) − |x|. Let F(Σ) be the class of all maps f mapping x ∈ S to f(x) ∈ R^{|x|×|Σ|}.
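To make the preceding definitions concrete, here is a minimal NumPy sketch of the forward computation in equations (1)-(4) (our own illustration, not code released with the paper): it uses a single head per layer, 0-indexed positions, ReLU activations, and the offset-shifted positional encodings with log|x| scaling of attention logits; layer norm and the fixed-precision rounding of logits are omitted.

import numpy as np

def causal_softmax(scores, scale):
    # Softmax over keys j <= i, with logits scaled by log|x| as in eq. (3).
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))
    z = np.where(mask, scale * scores, -np.inf)
    z = z - z.max(axis=1, keepdims=True)
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)

def layer(Y, K, Q, V, A, B, b, act):
    # Attention logits a_{i,j} = y_j^T K^T Q y_i (eq. 2); rows of Y are the y_i.
    scores = Y @ Q.T @ K @ Y.T
    W = causal_softmax(scores, scale=np.log(Y.shape[0]))
    Y = Y + W @ (Y @ V.T)                  # residual + attention output (eq. 3)
    return Y + act(Y @ A.T + b) @ B.T      # residual + one-layer MLP   (eq. 4)

def transformer(x_ids, E, P, layers, U, offset=0):
    # eq. (1): token embedding plus offset-shifted positional encoding.
    Y = E[x_ids] + P[offset + np.arange(len(x_ids))]
    relu = lambda z: np.maximum(z, 0.0)    # Heaviside would also be allowed
    for (K, Q, V, A, B, b) in layers:      # one head per layer, for brevity
        Y = layer(Y, K, Q, V, A, B, b, act=relu)
    return Y @ U.T                         # next-token prediction logits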
3 THEORETICAL FRAMEWORK
3.1 LIMIT TRANSFORMERS
Our theory addresses the setting of transformers with absolute positional encodings, where the width may grow with the input length. Importantly, we cannot view the ground-truth function as realized by a single transformer: Even if one assigned such a transformer an infinite number of positional encodings, it would still effectively only be able to distinguish between a bounded number of positions, because the width of positional encodings within a single transformer is bounded. Instead, we will derive a parameterization of transformers that allows us to convert sequences of transformers operating on longer and longer sequences to a single limiting transformer-like object. Our key technical idea is to reparameterize the transformer in terms of product functions, inner products of parameter vectors as mediated by parameter matrices, such as

E_\sigma^T K_{1,h}^T Q_{1,h} E_\tau, \quad p_i^T K_{1,h}^T Q_{1,h} p_j, \quad p_i^T K_{2,h}^T Q_{2,h} V_{1,h'} p_j, \quad U_\sigma V_{2,h} V_{1,h'} E_\sigma \qquad (5)
and various others; see Appendix F.1 for the full formal definition. We first note that the transformer's computations are uniquely specified by such products. The number of products as in (5) depends, among others, on |Σ|, L, H, N(T), but crucially not on d. We will use this parameterization to translate sequences T_1, T_2, T_3, ... of transformers running on inputs of length 1, 2, 3, ... to limiting transformer-like objects that are applicable at all input lengths, while keeping the width d bounded even if the widths of T_n diverge to infinity. This limiting object, a Limit Transformer, differs from an ordinary transformer, as defined in Section 2, in just a few respects. Formally:
Definition 2. A Limit Transformer is a transformer T where:
1. N(T) = +∞
2. All parameters (including positional encodings p_i, and the output of ϕ_{l,h}) are expressed in p-bit precision, for some p ∈ N
3. In deviation from ordinary transformers, attention logits on input length N are computed as

a_{i,j}^{(l,h)} = (y_j^{(l-1)})^T K_{l,h}^T Q_{l,h} \, y_i^{(l-1)} + \phi_{l,h}(j, i) \qquad (6)

where ϕ_{l,h} : N × N → R.
Intuitively, a Limit Transformer can use positional information in two ways: through bounded-width and bounded-precision positional encodings p_i, and through potentially more complicated functions ϕ_{l,h}. Given a transformer, we obtain a Limit Transformer by encoding products of the form p_i^T K_{l,h}^T Q_{l,h} p_j into the functions ϕ_{l,h}. All other product functions involving positional encodings are expressed in terms of the positional encodings of the Limit Transformer. Our main result will link length generalization to expressibility by Limit Transformers satisfying specific properties:
Definition 3. A function f : N × N → R is "translation-invariant" if f(i, j) = f(i + τ, j + τ) for all i ≤ j and τ ≥ 0, and "local" if there is τ such that f(i, j) = 0 when j > i + τ. A Limit Transformer satisfies (1) PERIODIC if p_i = p_{i+Δ} for all i, for some Δ > 0, and (2) LOCAL if each ϕ_{l,h} is translation-invariant and local.
Given a set Θ_n of transformers, the parameterization in terms of inner products permits a translation from a transformer T ∈ Θ_n to a Limit Transformer, whose width is bounded in terms of R(T), and which further satisfies PERIODIC and LOCAL. This is formalized in Lemma 52 in the Appendix.
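As a concrete illustration of Definitions 2 and 3 (the code and names below are ours, for illustration only), the positional information of a Limit Transformer can be packaged as a periodic encoding table of period Δ together with a translation-invariant, local logit offset ϕ(j, i) that depends only on the distance i − j and vanishes beyond a window τ; the first-layer function of the induction-head construction in Section 4.1 is one instance.

from typing import Callable, Sequence

def periodic_encoding(table: Sequence, delta: int) -> Callable[[int], object]:
    # PERIODIC (Definition 3): p_i = p_{i+delta} for all i; positions are 1-indexed.
    assert len(table) == delta
    return lambda i: table[(i - 1) % delta]

def local_phi(window: Sequence[float], tau: int) -> Callable[[int, int], float]:
    # LOCAL and translation-invariant: phi(j, i) depends only on the distance i - j
    # (key j, query i, with j <= i) and is 0 once that distance exceeds tau.
    def phi(j: int, i: int) -> float:
        gap = i - j
        return window[gap] if 0 <= gap <= tau else 0.0
    return phi

# First-layer function of the induction head (Section 4.1):
# attend to the immediately preceding position, phi(j, i) = 1 iff j + 1 = i.
phi_prev = local_phi(window=[0.0, 1.0], tau=1)
assert phi_prev(4, 5) == 1.0 and phi_prev(2, 5) == 0.0 and phi_prev(5, 5) == 0.0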
3.2 DEFINITION OF INFERENCE PROCEDURE
To define the inference procedure, we specify the following hypothesis class at each input length n:
Definition 4 (Hypothesis Class). For each n = 1, 2, 3, ..., define the hypothesis class Θ_n as the set of transformers T (as defined in Section 2) where (1) N(T) = n, (2) each parameter vector and matrix of T is represented at p bits of precision, for some p ∈ N, (3) each product function involving positional encodings is translation-invariant. That is, every product function involving exactly one positional encoding is constant across positions, and for every 1 ≤ i, j, i + Δ, j + Δ ≤ n,

p_i^T M_1 \cdots M_k \, p_j = p_{i+\Delta}^T M_1 \cdots M_k \, p_{j+\Delta} \qquad (7)

whenever M_1 ⋯ M_k is a product of parameter matrices linking the input layer.¹
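For intuition about condition (7) (the check below is our own illustration, not part of the paper's construction): the condition says that the matrix G with entries G[i, j] = p_i^T M_1 ⋯ M_k p_j is constant along diagonals, i.e. Toeplitz. For example, two-dimensional rotational encodings p_i = (cos θi, sin θi) satisfy it for M = I, since then p_i^T p_j = cos(θ(j − i)), whereas generic encodings do not.

import numpy as np

def satisfies_condition_7(P, M, tol=1e-9):
    # Condition (7): p_i^T M p_j = p_{i+D}^T M p_{j+D} for all valid i, j, D,
    # i.e. G[i, j] = p_i^T M p_j is constant along every diagonal (Toeplitz).
    G = P @ M @ P.T
    n = G.shape[0]
    return all(np.ptp(np.diagonal(G, offset=k)) <= tol for k in range(-(n - 1), n))

theta, n = 0.3, 64
pos = np.arange(1, n + 1)
P_rot = np.stack([np.cos(theta * pos), np.sin(theta * pos)], axis=1)
print(satisfies_condition_7(P_rot, np.eye(2)))                  # True
print(satisfies_condition_7(np.random.randn(n, 2), np.eye(2)))  # False (generically)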
Note that the width d of the transformers T ∈ Θ_n is unconstrained. The most interesting requirement here is the third one: We ask that, while the positional encodings p_i will typically vary with position, their contributions to the transformer's computations are offset-independent. This is a stronger requirement than for the input-output behavior to be offset-independent: we ask for the transformer's "algorithm" itself to be the same across offsets. This is a substantive condition, but we believe it to be a natural requirement in the context of length generalization. Our inference procedure will use a regularizer R favoring simpler hypotheses. The following will be sufficient:
Definition 5 (Regularizer). Let T ∈ Θ_n, thus N(T) = n. Define R(T) as the sum of (1) L + H; (2) the precision p used in Definition 4 and the precision p used for rounding logits and the output of exp(·) (Section 2); (3) max_{l,h} rank(V_{l,h}); (4) max_{l,h} ‖K_{l,h}^T Q_{l,h}‖, max_{l,h} ‖V_{l,h}‖, max_l ‖A_l‖_F, ‖B_l‖_F, ‖U‖; (5) max_i ‖p_i‖_2, max_σ ‖E_σ‖_2, max_l ‖b_l‖_2; (6) the term

\sum_{l=1}^{L} \sum_{h=1}^{H} \sum_{j=1}^{N(T)} \left( p_1^T K_{l,h}^T Q_{l,h} \, p_j \right)^2 \qquad (8)
The idea of (8) is to discourage accidental attention between far-away positions that do not appear
together during training, which could hamper length generalization. Due to translation invariance,
¹Such as K_{1,h}^T Q_{1,h}, V_{2,h'}^T K_{3,h}^T Q_{3,h} V_{1,h''}, and similar. See Appendix F.2 for a formal definition.
this term entails a bound on products for all pairs p_i, p_j (i ≤ j) entering causal attention. While such a regularizer is not part of standard training, standard initialization tends to lead to bounded values for (8) when d is large (Appendix G.1); it thus captures an implicit bias of standard initialization and training. Importantly, the width d does not explicitly enter R; as a consequence, for any sufficiently large C, the number of transformers T_n ∈ Θ_n with R(T_n) ≤ C is infinite, simply because d is not constrained. Nonetheless, this regularizer will be sufficient for identification under our idealized inference procedure, which observes the input-output behavior of the target function f on inputs of length ≤ n/2 and selects a transformer T ∈ Θ_n with maximal context window n that exactly fits that input-output behavior while minimizing the regularizer R(T):
Definition 6 (Inference Procedure). Given a function f ∈ F(Σ), the Inference Procedure obtains a sequence of transformers T_1, T_2, ⋯ with T_n ∈ Θ_n as follows. Define U_n as the set of T ∈ Θ_n matching the behavior of f on all inputs of length ≤ n/2. Then choose T_n ∈ U_n such that

R(T_n) \le \frac{1}{n} + \inf_{T \in U_n} R(T) \qquad (9)

In (9), we do not simply ask for minimizing the regularizer, as the set of elements of U_n with R(T) smaller than a given value need not be finite, and thus a minimum need not be attained by any T_n. Importantly, we only ask T_n to match the behavior of f up to length n/2, formalizing the idea of training on shorter inputs and testing on longer ones; our identifiability guarantee will provide conditions under which T_n will end up matching f correctly up to length n, representing length generalization. While we take the testing length to be twice the training length, there is nothing special about this; our analysis works whenever the training length diverges to infinity.
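The selection rule of Definition 6 can be spelled out as follows (an illustrative sketch only: the real Θ_n is infinite and cannot be enumerated, so "candidates" here stands for a finite toy pool, and all names are ours).

def inference_step(n, candidates, f, inputs_up_to, R):
    # U_n: candidates that exactly match f on all inputs of length <= n/2.
    short_inputs = list(inputs_up_to(n // 2))
    U_n = [T for T in candidates if all(T(x) == f(x) for x in short_inputs)]
    if not U_n:
        return None
    best = min(R(T) for T in U_n)        # stand-in for inf_{T in U_n} R(T)
    # Return any T_n with R(T_n) <= 1/n + inf_{T in U_n} R(T), as in eq. (9).
    return next(T for T in U_n if R(T) <= best + 1.0 / n)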
3.3 MAIN RESULT: CONVERGENCE OF INFERENCE PROCEDURE
Our main result asymptotically characterizes length generalization under the inference procedure from Definition 6. For functions representable by Limit Transformers satisfying LOCAL and PERIODIC, we guarantee that any run of the Inference Procedure will ultimately achieve length generalization, so that transformers with context length n, chosen to fit the target function on inputs of length ≤ n/2, will, when n is sufficiently large, also perform correctly at all lengths ≤ n. Formally,
Theorem 7 (Guaranteed Length Generalization in the Limit). Let f ∈ F(Σ). Then the following are equivalent:
1. f is expressible by a Limit Transformer satisfying PERIODIC and LOCAL.
2. (Guaranteed Length Generalization) Applying the Inference Procedure from Definition 6 to f generates a sequence T_1, T_2, ... with sup_{n=1,2,3,...} R(T_n) < ∞, for which there is some N_0 such that, for all m > N_0, T_m matches f on all inputs of any length k ≤ m.
The formal proof is in Appendix B.1. Intuitively, if f is expressible by a Limit Transformer satisfying PERIODIC and LOCAL, then, even though the Inference Procedure produces infinitely many distinct transformers T_1, T_2, ..., these can only traverse a finite set of underlying algorithms, each described by some Limit Transformer. PERIODIC and LOCAL ensure that the Limit Transformer's parameter count effectively remains finite, as its position-related parameters can be fully specified in terms of p_1, ..., p_Δ and ϕ(1, 1), ..., ϕ(1, τ). The regularizer bounds width, depth, and precision of the Limit Transformers; this keeps the set of algorithms traversed finite. Each of these finitely many algorithms will either be ruled out at some input length n or else match the behavior of f at all input lengths. At some finite N_0, only the latter type of algorithm remains; hence, transformers produced after this point will match the target function. The proof also entails a result on NoPE length generalization: Applying the inference procedure to f while constraining p_i ≡ 0 will lead to length generalization when f is expressible by a Limit Transformer where all p_i and all ϕ_{l,h} are zero (Corollary 18). While Theorem 7 guarantees length generalization from length n/2 to length n for expressible problems, it does not rule out length generalization for inexpressible problems. Such a statement becomes possible if we allow arbitrary scaling of training vs. testing lengths (Appendix B.4). Besides length generalization guarantees, Limit Transformers are also useful in providing expressiveness results for transformers with absolute positional encodings (Appendix C).
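The elimination argument behind the proof can be mimicked in a toy setting (again purely illustrative, with names of our own choosing): if the produced transformers only ever realize finitely many underlying algorithms, every incorrect algorithm is ruled out at some finite length, after which all survivors agree with f.

def surviving_algorithms(algorithms, f, inputs_of_length, max_n):
    # Track which of finitely many candidate algorithms remain consistent with f
    # when observed on all inputs of length <= n/2, for n = 1, ..., max_n.
    alive, history = list(algorithms), []
    for n in range(1, max_n + 1):
        obs = [x for k in range(1, n // 2 + 1) for x in inputs_of_length(k)]
        alive = [a for a in alive if all(a(x) == f(x) for x in obs)]
        history.append(len(alive))
    # After some finite N_0 the count stops shrinking; every remaining algorithm
    # then matches f on all observed lengths.
    return alive, history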
4 WHICH FUNCTIONS ARE IDENTIFIABLE? EXPRESSIVENESS OF LIMIT TRANSFORMERS AND C-RASP
We have found that, if a target function f is expressible by a Limit Transformer satisfying PERIODIC and LOCAL, then Theorem 7 indicates length generalization under our Inference Procedure. In order to understand the ramifications of this result, we now study which functions Limit Transformers can express; for these functions, Theorem 7 will then guarantee length generalization.
4.1 SIMPLE EXAMPLE: INDUCTION HEAD
We consider the task of predicting the next token in proportion to the frequency at which different tokens had previously followed tokens matching the current one:

f(x_1 \ldots x_N)_{i,\sigma} = \frac{\#\{k < i : x_k = x_i,\; x_{k+1} = \sigma\}}{\#\{k < i : x_k = x_i\}} \qquad (10)

We can construct a Limit Transformer with two layers and one head, where in the first layer ϕ(i, j) = 1 if i + 1 = j and 0 else, and each head copies the preceding symbol's embedding. In the second layer, attention focuses on positions with the same symbol. The transformer outputs next-token predictions in proportion to bigram frequencies in the context, up to approximation error O(1/n) (due to logit scaling). Hence, Theorem 7 guarantees that the Inference Procedure will length-generalize on (10), providing a length generalization guarantee for an induction head circuit (Olsson et al., 2022). A special case of (10) occurs when each symbol occurs exactly once; here, such an induction head circuit suffices to copy a string (Zhou et al., 2024a), so we obtain a length generalization guarantee for copying such strings (see Section 5).
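For concreteness, the target function (10) can be written out directly (our own reference implementation of the definition; positions are 0-indexed, and rows with no earlier occurrence of the current token are left as zeros, where (10) is undefined).

from collections import Counter

def induction_head_target(x, alphabet):
    out = []
    for i in range(len(x)):
        follow, matches = Counter(), 0
        for k in range(i):                       # k < i
            if x[k] == x[i]:
                matches += 1
                follow[x[k + 1]] += 1            # k + 1 <= i, so always in range
        out.append({s: follow[s] / matches if matches else 0.0 for s in alphabet})
    return out

# At position 2 of "abab" (the second 'a'), the earlier 'a' was followed by 'b':
assert induction_head_target("abab", "ab")[2] == {"a": 0.0, "b": 1.0}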
4.2 LENGTH GENERALIZATION FOR C-RASP
We next present a large class of functions for which Theorem 7 guarantees length generalization.
We extend the C-RASP formalism (Yang and Chiang, 2024) with positional information, and then
show that any function defined by a C-RASP program is expressible by a Limit Transformer; hence,
transformers will, by Theorem 7, length-generalize on those functions. We first define C-RASP:
Definition 8 (C-RASP). Let Σ be an alphabet, let Φ be a set of unary relations ϕ : N → {0, 1}, and let Ψ be a set of binary relations ψ : N × N → {0, 1}. A C-RASP[Φ, Ψ] program P is defined as a sequence P_1, ..., P_k of C-RASP operations. There are two sorts of operations:

Boolean-Valued Operations
  Initial:      P(i) := Q_σ(i)   for σ ∈ Σ
  Boolean:      P(i) := ¬P_1(i);   P(i) := P_1(i) ∧ P_2(i)
  Constant:     P(i) := ⊤
  Positional:   P(i) := ϕ(i)   for ϕ ∈ Φ
  Comparison:   P(i) := C_1(i) ≤ C_2(i)

Count-Valued Operations
  Counting:     C(i) := #[j ≤ i, ψ(i, j)] P(j)   for ψ ∈ Ψ ∪ {⊤}
  Conditional:  C(i) := P(i) ? C_1(i) : C_2(i)
  Addition:     C(i) := C_1(i) + C_2(i)
  Subtraction:  C(i) := C_1(i) − C_2(i)
  Constant:     C(i) := 1
A Counting operation returns the number of positions j ≤ i where P(j) and ψ(i, j) hold. A Conditional operation returns C_1(i) if P(i), and C_2(i) otherwise. We use the value of the last Boolean-valued operation, at the last position of the string, to determine acceptance by a C-RASP program. That is, if the program is run on input w with final operation L, then we accept w if and only if L(|w|) is true. C-RASP[periodic, local] is the class of C-RASP programs where each ϕ(i) is periodic in i, and each ψ(i, j) is translation-invariant and local (Definition 3). We also write C-RASP[periodic, local] for the class of all languages accepted by some C-RASP[periodic, local] program. As an example, we present a program recognizing L = Σ*abΣ* over Σ = {a, b}:
6
Preprint
C-RASP program for L = Σ*abΣ* over Σ = {a, b}:
(1) C_a(i)  := #[j ≤ i, j = i − 1] Q_a(j)      (number of immediately preceding a's)
(2) P_a(i)  := C_a(i) ≥ 1                      (position i − 1 holds an a)
(3) Q_ab(i) := Q_b(i) ∧ P_a(i)                 (a substring ab ends at position i)
(4) C_ab(i) := #[j ≤ i] Q_ab(j)                (number of substrings ab)
(5) L(i)    := C_ab(i) ≥ 1                     (at least one ab precedes position i)
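The program can be traced in plain Python (a hand-written simulation of these five operations, not a general C-RASP interpreter); each operation is evaluated as a vector over positions, and the string is accepted iff the final Boolean holds at the last position.

def accepts_sigma_ab_sigma(w):
    # Simulate the C-RASP program above for L = Sigma* ab Sigma* on w over {a, b}.
    n = len(w)
    Q_a = [c == "a" for c in w]
    Q_b = [c == "b" for c in w]
    # (1) C_a(i): count positions j <= i with j = i - 1 where Q_a holds
    C_a = [sum(Q_a[j] for j in range(i + 1) if j == i - 1) for i in range(n)]
    # (2) P_a(i): position i - 1 holds an a
    P_a = [C_a[i] >= 1 for i in range(n)]
    # (3) Q_ab(i): a substring "ab" ends at position i
    Q_ab = [Q_b[i] and P_a[i] for i in range(n)]
    # (4) C_ab(i): number of substrings "ab" ending at or before i
    C_ab = [sum(Q_ab[: i + 1]) for i in range(n)]
    # (5) accept iff the final operation is true at the last position
    return n > 0 and C_ab[n - 1] >= 1

assert accepts_sigma_ab_sigma("bbabb") and not accepts_sigma_ab_sigma("ba")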
Any C-RASP[periodic, local] program can be translated to a Limit Transformer with corresponding positional functions. We say a Limit Transformer T accepts an input if the value in the last dimension in the last position of the output is greater than 0, and rejects otherwise.
Theorem 9. For every C-RASP[Φ, Ψ] program P with local functions Ψ and periodic functions Φ, there exists a Limit Transformer T that satisfies PERIODIC and LOCAL such that for all w ∈ Σ*, P accepts w iff T accepts $w. If P uses no local or periodic relations, then T requires no functions ϕ_{l,h} or positional encodings p_i.
The proof is in Appendix B.6. As a consequence, the Inference Procedure will ultimately length-generalize on inputs from a function f expressible by a C-RASP[periodic, local] program. If the C-RASP program requires no positional functions (i.e., it is in C-RASP[]), then length generalization will succeed even with NoPE transformers. We establish that various functions are in C-RASP:
Theorem 10. Membership in the following languages is definable in C-RASP[]: (1) MAJORITY, (2) DYCK-1, (3) a^n b^n c^n.
The proof is in Appendix C.1. By Theorem 9, these positive results translate into statements about length generalization under the Inference Procedure. For these tasks, length generalization even with NoPE is empirically already well-documented (Bhattamishra et al., 2020). Further, C-RASP[local] can implement versions of the Induction Head task from Section 4.1; see Appendix C.2.1 and C.2.2.
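To see why such languages need no positional relations, note that membership can be decided from prefix counts alone, which is exactly the kind of quantity a C-RASP[] Counting operation provides. The sketch below (ours, not the constructions from Appendix C.1) decides two of the languages from Theorem 10 in this way.

def is_dyck1(w):
    # Dyck-1 over "()" via counts: count the positions at which a prefix has more
    # closing than opening brackets; accept iff there are none and totals match.
    opens = closes = violations = 0
    for c in w:
        opens += c == "("
        closes += c == ")"
        violations += closes > opens
    return violations == 0 and opens == closes

def is_anbncn(w):
    # a^n b^n c^n via counts: equal letter counts, and no position at which a
    # letter appears after a letter that should only come later.
    na = nb = nc = out_of_order = 0
    for c in w:
        na, nb, nc = na + (c == "a"), nb + (c == "b"), nc + (c == "c")
        if (c == "a" and (nb or nc)) or (c == "b" and nc):
            out_of_order += 1
    return out_of_order == 0 and na == nb == nc

assert is_dyck1("(()())") and not is_dyck1("())(")
assert is_anbncn("aabbcc") and not is_anbncn("abcabc")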
C-RASP also helps understand why transformers show varying abilities even on simple finite-state
languages (Bhattamishra et al., 2020; Liu et al., 2023; 2024), a fact poorly understood theoretically.
For instance, we have the following:
Lemma 11. Consider the alphabet Σ = {a, b, e}.
1. PARITY := b*(ab*ab*)* ∉ C-RASP[periodic, local]
2. (aa)* ∈ C-RASP[periodic, local] and (aa)* ∉ C-RASP[]
3. Σ*be* ∉ C-RASP[periodic, local]
4. L ∈ C-RASP[] for piecewise testable L
The proof builds on logics with majority quantifiers, whose expressiveness covers C-RASP (Appendix C.3). Notably, all of these languages are recognizable by simple finite-state automata which are expressible by transformers (Liu et al., 2023), but empirical length generalization behavior differs in line with C-RASP expressiveness (Section 5). PARITY (1) has long been found difficult for transformers (e.g. Hahn, 2020; Bhattamishra et al., 2020; Anil et al., 2022; Chiang and Cholak, 2022; Delétang et al., 2023; Hahn and Rofin, 2024). Result (2) exemplifies the effect of different positional relations. The language Σ*be* (3) is a simple model of FlipFlop (Liu et al., 2024), a language on which transformers empirically struggle to generalize perfectly despite its simplicity for recurrent models (Liu et al., 2024; Sarrof et al., 2024). The class (4) is useful for determining the expressibility of languages in C-RASP[], as in Section E.1.2.
4.3 LIMITATIONS: LOGARITHMIC COMMUNICATION COMPLEXITY
Having shown that various functions are definable by Limit Transformers, we now provide a simple
technique for showing that various functions are not definable by Limit Transformers. Informally,
any function satisfying the conditions in Theorem 7 has logarithmic communication complexity.
Formally:
Theorem 12. Let T be a Limit Transformer satisfying PERIODIC and LOCAL. On an input x ∈ Σ^{2N}, assume Alice has access to x_{1...N} and Bob has access to x_{N+1...2N}. Then Alice can communicate C log N bits to Bob, where C depends on T but not N, so that Bob can compute each activation in the second half, y_i^{(l)} (N + 1 ≤ i ≤ 2N).
[Figure 1: accuracy (y-axis, 0-100) across test bins (x-axis: Bin 1, Bin 2, Bin 3) for the algorithmic tasks Binary Majority, Binary Majority Interleave, Majority, Sort, Copy Unique, Copy Repeat, Parity, Addition (left) and regular languages 1-17 (right). Legend: Found C-RASP[Periodic, Local] Program / No C-RASP[Periodic, Local] Program / Found C-RASP[] Program / No C-RASP[] Program.]
Figure 1: Experimental results (y-axis: accuracy), at lengths ≤ 50 (Bin 1, training), [51, 100] (Bin 2), and [101, 150] (Bin 3, generalization), for APE (solid) and NoPE (dotted). Green lines indicate that we found a C-RASP program (C-RASP[periodic, local] for APE, C-RASP[] for NoPE), red lines indicate that we proved nonexistence, or found no program. Random baselines are indicated in gray in (left), and very close to zero in (right). On the algorithmic problems (left), we replicate prior empirical findings; C-RASP expressiveness predicts observed length generalization. On the regular languages (right, with same x- and y-axes as left, Table 2), length generalization tracks C-RASP expressiveness established in Lemma 11 ((1) = (aa)*, (17) = Σ*be*) and other results (see Appendix E.1). C-RASP expressiveness performs much better than circuit complexity and standard notions of regular language complexity in predicting length generalization (Appendix, Figures 3-4).
The proof is in Appendix B.3. In principle, computing activations in the second half of the input requires full knowledge of the first half of the input, because positions in the second half can freely attend to positions in the first half. In this situation, one would expect Bob to need full knowledge of N/2 input symbols from Alice's part, exponentially more than the C log N claimed in the theorem. This is indeed needed if T performs the task of, say, checking if x_{1...N} and x_{N+1...2N} are identical. However, if T satisfies PERIODIC and LOCAL, attention must largely be determined by the presence of tokens and token sequences; when an attention head's behavior is determined by positional information, it can only focus its attention on a local neighborhood or equally distribute it over a periodic pattern. Intuitively, in such cases, the set of possible queries and keys can be grouped into a finite partitioning, of size bounded independently of N. It then suffices for Alice to communicate, for each possible group of keys, an aggregate of the value vectors at the positions where a matching key is computed. The proof (Appendix B.3) formalizes this. As a corollary:
Corollary 13. The following problems are not expressible by Limit Transformers satisfying PERIODIC and LOCAL: (1) copying arbitrary strings, (2) addition of n-digit numbers.
This is proven in Appendix B.3. As a consequence, any run of the Inference Procedure on these functions will output solutions T_1, T_2, T_3, ... for which the depth, number of heads, parameter norms or ranks, MLP dimensions, or precision p must increase with the input length n; indeed, length generalization is empirically challenging for these functions (Section 5).
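To illustrate the aggregation idea behind Theorem 12 (a schematic of the counting argument in our own notation, not the protocol of Appendix B.3; scalar values stand in for value vectors): if the keys occurring in Alice's half fall into boundedly many groups, she only needs to send a per-group sum and count, each of which takes O(log N) bits at fixed precision.

from collections import defaultdict

def alice_message(keys, values, group_of):
    # Alice aggregates her half: for every key group, the sum of the (scalar,
    # fixed-precision) values at matching positions, plus a position count.
    sums, counts = defaultdict(float), defaultdict(int)
    for k, v in zip(keys, values):
        g = group_of(k)
        sums[g] += v
        counts[g] += 1
    return dict(sums), dict(counts)        # O(#groups * log N) bits in total

def bob_combines(weight_of_group, message):
    # Bob weights each group's aggregate by his (query-dependent) attention
    # weight for that group, recovering the contribution of Alice's half to
    # the softmax numerator and denominator at his positions.
    sums, counts = message
    numerator = sum(w * sums[g] for g, w in weight_of_group.items() if g in sums)
    denominator = sum(w * counts[g] for g, w in weight_of_group.items() if g in counts)
    return numerator, denominator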
5 EXPERIMENTS
We evaluate the expressiveness of Limit Transformers and C-RASP as a predictor of empirical length generalization of NoPE and APE transformers. Based on Theorems 7 and 9, we expect that APE transformers should length-generalize on problems with a C-RASP[periodic, local] program and that NoPE transformers will be successful in those cases where we found a C-RASP[] program. We test this prediction on a suite of algorithmic problems and formal languages, largely taken from prior empirical work on length generalization (Bhattamishra et al., 2020; Zhou et al., 2024a), but evaluated within a uniform framework.
Setup For each task, the model is trained on inputs whose LEN is in the range [l_min, 50], where l_min is the minimum length for this task. LEN is the length of the input in the algorithmic tasks (Appendix E.2), and the overall sequence length in the formal language tasks. The model is tested on 3 test sets, where LEN is in the range [l_min, 50], [51, 100], [101, 150]; these lengths are based on the source of the regular languages benchmark (Bhattamishra et al., 2020). We trained using a standard AdamW setup; see details in Appendix E.3. Hyperparameters are selected by searching in order of increasing complexity until we find a setting that performs well up to length 100. We interpret results on lengths [101, 150] as a measure of length generalization. Each model has as many positional encodings as needed to encode the longest inputs (at least 150); each input is presented with a random offset in agreement with the theoretical setup. On algorithmic sequence-to-sequence tasks, we train with cross-entropy loss on the output. On formal languages, where next-symbol predictions are generally not deterministic, we instead train the model to predict the set of legal next symbols, with each such set coded as an atomic symbol (as in Bhattamishra et al., 2020; Sarrof et al., 2024). At test time, predictions are considered correct on a sequence if and only if the output at every step is correct; the random baseline is thus very low on the formal language benchmark. We report accuracy, the fraction of test sequences where predictions are correct at every step.
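Concretely, the reported metric is sequence-level accuracy (a short sketch of our own):

def sequence_accuracy(predictions, targets):
    # Fraction of test sequences whose prediction is correct at every single step.
    def exact(pred, tgt):
        return len(pred) == len(tgt) and all(p == t for p, t in zip(pred, tgt))
    return sum(exact(p, t) for p, t in zip(predictions, targets)) / len(targets)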
Algorithmic Problems We evaluate on 8 algorithmic problems, which largely overlap with Zhou et al. (2024a), but are tailored to those where C-RASP expressiveness can be clearly settled. Tasks are defined formally in Appendix E.2.1. A new problem here is BINARY MAJORITY INTERLEAVE, which interleaves multiple MAJORITY functions and can be solved by C-RASP using periodic functions. Length generalization behavior matches C-RASP expressiveness; C-RASP[] expressiveness predicts the success of NoPE (see Figure 1). In agreement with prior empirical results (Zhou et al., 2024a; Jelassi et al., 2023), COPY is difficult in the presence of repetition and easy when it is avoided; these findings match C-RASP expressiveness (Corollary 13 and Section 4.1).
Formal Languages We applied the experimental framework to 17 regular languages assembled by Bhattamishra et al. (2020), who evaluated length generalization in transformers and LSTMs. Whereas LSTMs perform strongly across the board, the behavior of transformers on these regular languages has so far eluded theoretical understanding. While it is known that transformers struggle with PARITY, it has remained unclear why they would struggle to length-generalize on some seemingly very simple languages. We found C-RASP[periodic, local] programs for 13 of the languages and proved nonexistence for the others (Appendix E.1.2). Length generalization succeeded in those cases where we had found a C-RASP[periodic, local] program (see Figure 1, right). In those cases where a C-RASP[] program exists, generalization also succeeded with NoPE. Generalization failed for languages where no C-RASP program exists, such as Σ*be* (#17 in Figure 1; Lemma 11).
6 DISCUSSION
Prior work has empirically found that transformers' length generalization capabilities differ between tasks, but theoretical understanding has been lacking. We have introduced a formal framework analyzing length generalization in an idealized inference procedure. The framework explains what is common across the diverse tasks where prior research has empirically observed successful length generalization, in terms of expressiveness in two simple mathematical formalisms, Limit Transformers and C-RASP. We also proved that various problems, on which length generalization is less successful empirically, are not expressible in one or both of these formalisms. Beyond length generalization, the framework further sheds light on the expressiveness of APE transformers. Our results on length generalization study an idealized regularizer and assume perfect fitting of the training distribution. Making the guarantee from Theorem 7 more realistic by incorporating SGD training dynamics and subsampling of training data is an interesting problem for future research.
Our results can be viewed as formalizing the RASP-L Conjecture (Zhou et al., 2024a). Both Limit
Transformers and C-RASP[periodic,local] formalize intuitions underlying RASP-L in restricting
how positional information can be used. An important advance over Zhou et al. (2024a) is that we
settle the expressiveness of these formalisms for many problems, and are able to explicitly prove a
variety of problems with poor empirical length generalization, such as copying with repeated strings,
to be inexpressible by Limit Transformers. Our results provide a step towards rigorously confirming
the idea that expressiveness in such restricted formalisms predicts length generalization.
Expressiveness of Transformers A substantial line of research has studied the in-principle expressiveness of transformers (Strobl et al., 2024). Transformers express a subset of the class TC^0 (Merrill and Sabharwal, 2023b; Strobl, 2023), but it is unknown if this inclusion is proper. All problems considered in Section 5 are in TC^0, but empirical length generalization behavior largely tracks C-RASP[periodic, local] expressiveness, which defines a proper subclass of TC^0 (Appendix C.3.1). While it remains open if the expressive power of transformers exhausts TC^0, our results suggest a separation between TC^0 and those problems for which length generalization is possible with absolute positional encodings. In particular, our results suggest that the existence of APE transformers that perform a task across larger ranges of input lengths is linked to the expressiveness of Limit Transformers (Section C). It is an open question how far new, yet-to-be-discovered positional encoding schemes may increase the range of length generalization; empirical evidence indicates that NoPE and APE may be hard to beat by other general-purpose encodings (Kazemnejad et al., 2023).
The proof of Theorem 12 is closely linked to previous communication-complexity bounds for transformer layers (Sanford et al., 2023; 2024; Peng et al., 2024; Bhattamishra et al., 2024), which importantly were shown only for individual layers, not multilayer transformers. Indeed, Bhattamishra et al. (2024) showed that such a logarithmic bound is not in general possible for arbitrary multilayer transformers. In contrast, our result applies even to multilayer models, which is enabled by the restrictions on the ways in which positional information can be used in a Limit Transformer.
Length Generalization of Transformers Various studies have empirically evaluated length generalization in transformers. Our work is most closely related to Zhou et al. (2024a), discussed above. Bhattamishra et al. (2020) study length generalization on formal languages; we find that C-RASP[periodic, local] expressiveness explains behavior on their benchmark well (Section 5). Anil et al. (2022) show that language models, finetuned on various reasoning problems, do not length-generalize well. Wang et al. (2024a) evaluate length generalization of NoPE transformers on real-world tasks. Kazemnejad et al. (2023) explore length generalization across different positional encoding schemes, finding NoPE to perform surprisingly well. Zhou et al. (2024b) show that length generalization for addition improves with specific encoding schemes and input formats. Jelassi et al. (2024) show that transformers can succeed in length generalization on copying when inputs avoid n-gram repetition. Chang and Bisk (2024) empirically find limitations in generalization in counting.
In contrast to the rich landscape of empirical studies, theoretical understanding of length generalization has been limited. Most relevant, Ahuja and Mansouri (2024) study length generalization in simple neural architectures, including a one-layer transformer setup with linear (not softmax) attention. Our results, in contrast, apply to multi-layer softmax transformers and make statements about many concrete problems that have been studied empirically. Some other works (e.g. Hou et al., 2024; Xiao and Liu, 2023) provide length-generalizing constructions for certain problems but leave open whether learning would lead to such constructions. Wang et al. (2024b) show that GD training leads to length generalization on a specific token selection task.
Limitations The main limitation of our results is that we study idealized asymptotic identification of a global minimum with perfect knowledge of behavior on the training distribution (cf. Sec. 6 and Q.4 in App. A for more discussion). Extending Theorem 7 to account for subsampling of the training data and learning dynamics is an important problem for future research. In particular, providing a practical upper bound on the threshold N_0 at which length generalization is expected is an interesting problem. Our study focuses on absolute positional encodings; extending it to other positional encodings (e.g. Su et al., 2024; Press et al., 2021; Ruoss et al., 2023) is another important problem for future research.
7 CONCLUSION
We have introduced a theoretical framework that unifies a broad array of empirical findings about successes and failures of length generalization in transformers with absolute positional encodings. Our framework is based on the analysis of an idealized inference procedure, for which length generalization provably happens whenever the ground-truth function is expressible with only limited access to positional information. By providing upper and lower bounds on the expressiveness of these objects, we accurately predict the success and failure of length generalization across various algorithmic tasks and formal languages.
CONTRIBUTIONS
MH coordinated the project. MH and XH developed the conceptual framework of Theorem 7, with input from the other authors. XH and YS contributed Section 5. AY contributed Section 4.2 with input from MH and AK. MH, XH, AY, YS jointly developed the translation to Limit Transformers; MH worked out the formalization. SB contributed Proposition 54, Lemma 56, and provided conceptual input throughout the project. AK contributed to settling the C-RASP expressiveness of formal languages. PN and HZ provided conceptual and writing-level input over the course of the project. MH drafted the remaining portions of the paper and the proof of Theorem 7, including definitions and lemmas.
ACKNOWLEDGMENTS
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) Project-
ID 232722074 SFB 1102. MH thanks Lena Strobl, Dana Angluin, David Chiang, Mark Rofin,
Anthony Lin, and Georg Zetzsche for conversations on related topics.
REFERENCES
Ahuja, K. and Mansouri, A. (2024). On provable length and compositional generalization. arXiv
preprint arXiv:2402.04875.
Anil, C., Wu, Y., Andreassen, A., Lewkowycz, A., Misra, V., Ramasesh, V., Slone, A., Gur-Ari, G.,
Dyer, E., and Neyshabur, B. (2022). Exploring length generalization in large language models.
Advances in Neural Information Processing Systems, 35:38546–38556.
Awasthi, P. and Gupta, A. (2023). Improving length-generalization in transformers via task hinting.
CoRR, abs/2310.00726.
Barcelo, P., Kozachinskiy, A., Lin, A. W., and Podolskii, V. (2024). Logical languages accepted by
transformer encoders with hard attention. In The Twelfth International Conference on Learning
Representations.
Barrington, D. A. M., Compton, K., Straubing, H., and Thérien, D. (1992). Regular languages in NC^1. Journal of Computer and System Sciences, 44(3):478–499.
Behle, C., Krebs, A., and Mercer, M. (2007). Linear circuits, two-variable logic and weakly blocked monoids. In Kučera, L. and Kučera, A., editors, Mathematical Foundations of Computer Science 2007, pages 147–158, Berlin, Heidelberg. Springer Berlin Heidelberg.
Behle, C., Krebs, A., and Reifferscheid, S. (2009). Regular languages definable by majority quan-
tifiers with two variables. In Diekert, V. and Nowotka, D., editors, Developments in Language
Theory, pages 91–102, Berlin, Heidelberg. Springer Berlin Heidelberg.
Bhattamishra, S., Ahuja, K., and Goyal, N. (2020). On the ability and limitations of transformers to
recognize formal languages. In Webber, B., Cohn, T., He, Y., and Liu, Y., editors, Proceedings
of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020,
Online, November 16-20, 2020, pages 7096–7116. Association for Computational Linguistics.
Bhattamishra, S., Hahn, M., Blunsom, P., and Kanade, V. (2024). Separations in the representational
capabilities of transformers and recurrent architectures. CoRR, abs/2406.09347.
Cadilhac, M. and Paperman, C. (2022). The regular languages of wire linear AC0.Acta Informatica,
59(4):321–336.
Chang, Y. and Bisk, Y. (2024). Language models need inductive biases to count inductively. CoRR,
abs/2405.20131.
Chiang, D. and Cholak, P. (2022). Overcoming a theoretical limitation of self-attention. In Muresan,
S., Nakov, P., and Villavicencio, A., editors, Proceedings of the 60th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland,
May 22-27, 2022, pages 7654–7664. Association for Computational Linguistics.
De la Higuera, C. (2010). Grammatical inference: learning automata and grammars. Cambridge
University Press.
Delétang, G., Ruoss, A., Grau-Moya, J., Genewein, T., Wenliang, L. K., Catt, E., Cundy, C., Hutter, M., Legg, S., Veness, J., and Ortega, P. A. (2023). Neural networks and the Chomsky hierarchy.
Edelman, B. L., Edelman, E., Goel, S., Malach, E., and Tsilivis, N. (2024). The evolution of
statistical induction heads: In-context learning markov chains. arXiv preprint arXiv:2402.11004.
Edelman, B. L., Goel, S., Kakade, S., and Zhang, C. (2022). Inductive biases and variable creation in
self-attention mechanisms. In International Conference on Machine Learning, pages 5793–5831.
PMLR.
Hahn, M. (2020). Theoretical limitations of self-attention in neural sequence models. Transactions
of the Association for Computational Linguistics, 8:156–171.
Hahn, M. and Rofin, M. (2024). Why are sensitive functions hard for transformers? In Proceedings
of the 2024 Annual Conference of the Association for Computational Linguistics (ACL 2024).
arXiv Preprint 2402.09963.
Hao, Y., Angluin, D., and Frank, R. (2022). Formal language recognition by hard attention trans-
formers: Perspectives from circuit complexity. Transactions of the Association for Computational
Linguistics, 10:800–810.
Hou, K., Brandfonbrener, D., Kakade, S. M., Jelassi, S., and Malach, E. (2024). Universal length
generalization with turing programs. CoRR, abs/2407.03310.
Jelassi, S., Brandfonbrener, D., Kakade, S. M., and Malach, E. (2024). Repeat after me: Trans-
formers are better than state space models at copying. In Forty-first International Conference on
Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net.
Jelassi, S., d’Ascoli, S., Domingo-Enrich, C., Wu, Y., Li, Y., and Charton, F. (2023). Length gener-
alization in arithmetic transformers. CoRR, abs/2306.15400.
Kazemnejad, A., Padhi, I., Ramamurthy, K. N., Das, P., and Reddy, S. (2023). The impact of
positional encoding on length generalization in transformers. In Oh, A., Naumann, T., Globerson,
A., Saenko, K., Hardt, M., and Levine, S., editors, Advances in Neural Information Processing
Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023,
New Orleans, LA, USA, December 10 - 16, 2023.
Krebs, A. (2008). Typed semigroups, majority logic, and threshold circuits. PhD thesis, Universität Tübingen.
Lange, K.-J. (2004). Some results on majority quantifiers over words. In Proceedings. 19th IEEE
Annual Conference on Computational Complexity, 2004., pages 123–129. IEEE.
Liu, B., Ash, J., Goel, S., Krishnamurthy, A., and Zhang, C. (2024). Exposing attention glitches
with flip-flop language modeling. Advances in Neural Information Processing Systems, 36.
Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. (2023). Transformers learn shortcuts
to automata. In The Eleventh International Conference on Learning Representations.
McNaughton, R. and Papert, S. A. (1971). Counter-Free Automata (MIT research monograph no.
65). The MIT Press.
Merrill, W. and Sabharwal, A. (2023a). The expressive power of transformers with chain of thought.
In NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning.
Merrill, W. and Sabharwal, A. (2023b). A logic for expressing log-precision transformers. In Thirty-
seventh Conference on Neural Information Processing Systems.
Merrill, W. and Sabharwal, A. (2023c). The parallelism tradeoff: Limitations of log-precision trans-
formers. Transactions of the Association for Computational Linguistics, 11:531–545.
Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. (2017). Exploring generalization in
deep learning. Advances in neural information processing systems, 30.
Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell,
A., Bai, Y., Chen, A., et al. (2022). In-context learning and induction heads. arXiv preprint
arXiv:2209.11895.
Peng, B., Narayanan, S., and Papadimitriou, C. (2024). On limitations of the transformer architec-
ture. arXiv preprint arXiv:2402.08164.
Press, O., Smith, N. A., and Lewis, M. (2021). Train short, test long: Attention with linear biases
enables input length extrapolation. arXiv preprint arXiv:2108.12409.
Ruoss, A., Delétang, G., Genewein, T., Grau-Moya, J., Csordás, R., Bennani, M., Legg, S., and Veness, J. (2023). Randomized positional encodings boost length generalization of transformers. arXiv preprint.
Sanford, C., Hsu, D., and Telgarsky, M. (2024). One-layer transformers fail to solve the induction
heads task. arXiv preprint.
Sanford, C., Hsu, D. J., and Telgarsky, M. (2023). Representational strengths and limitations of
transformers. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S.,
editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural
Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16,
2023.
Sarrof, Y., Veitsman, Y., and Hahn, M. (2024). The expressive capacity of state space models: A
formal language perspective. CoRR, abs/2405.17394.
Schützenberger, M. P. (1965). On finite monoids having only trivial subgroups. Inf. Control., 8(2):190–194.
Shazeer, N. (2020). GLU variants improve transformer. arXiv preprint arXiv:2002.05202.
Strobl, L. (2023). Average-hard attention transformers are constant-depth uniform threshold circuits.
CoRR, abs/2308.03212.
Strobl, L., Merrill, W., Weiss, G., Chiang, D., and Angluin, D. (2024). What Formal Languages Can
Transformers Express? A Survey. Transactions of the Association for Computational Linguistics,
12:543–561.
Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. (2024). Roformer: Enhanced transformer
with rotary position embedding. Neurocomputing, 568:127063.
Tesson, P. and Thérien, D. (2002). Diamonds are forever: The variety DA. In Semigroups, algorithms, automata and languages, pages 475–499. World Scientific.
Tomita, M. (1982). Dynamic construction of finite-state automata from examples using hill-
climbing. In Proceedings of the Fourth Annual Conference of the Cognitive Science Society,
pages 105–108.
Wang, J., Ji, T., Wu, Y., Yan, H., Gui, T., Zhang, Q., Huang, X., and Wang, X. (2024a). Length gen-
eralization of causal transformers without position encoding. In Ku, L., Martins, A., and Sriku-
mar, V., editors, Findings of the Association for Computational Linguistics, ACL 2024, Bangkok,
Thailand and virtual meeting, August 11-16, 2024, pages 14024–14040. Association for Compu-
tational Linguistics.
Wang, Z., Wei, S., Hsu, D., and Lee, J. D. (2024b). Transformers provably learn sparse token
selection while fully-connected nets cannot. In Forty-first International Conference on Machine
Learning.
Weiss, G., Goldberg, Y., and Yahav, E. (2021). Thinking like transformers. In International Confer-
ence on Machine Learning, pages 11080–11090. PMLR.
Xiao, C. and Liu, B. (2023). Conditions for length generalization in learning reasoning skills. CoRR,
abs/2311.16173.
Yang, A. and Chiang, D. (2024). Counting like transformers: Compiling temporal counting logic
into softmax transformers. In First Conference on Language Modeling.
Yang, A., Chiang, D., and Angluin, D. (2023). Masked hard-attention transformers recognize exactly
the star-free languages. arXiv Preprint.
Zhou, H., Bradley, A., Littwin, E., Razin, N., Saremi, O., Susskind, J. M., Bengio, S., and Nakkiran,
P. (2024a). What algorithms can transformers learn? A study in length generalization. In The
Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May
7-11, 2024. OpenReview.net.
Zhou, Y., Alon, U., Chen, X., Wang, X., Agarwal, R., and Zhou, D. (2024b). Transformers can
achieve length generalization but not robustly. CoRR, abs/2402.09371.
Contents
1 Introduction
2 Model of Transformers
3 Theoretical Framework
  3.1 Limit Transformers
  3.2 Definition of Inference Procedure
  3.3 Main Result: Convergence of Inference Procedure
4 Which Functions are Identifiable? Expressiveness of Limit Transformers and C-RASP
  4.1 Simple Example: Induction Head
  4.2 Length Generalization for C-RASP
  4.3 Limitations: Logarithmic Communication Complexity
5 Experiments
6 Discussion
7 Conclusion
A FAQ
B Proofs about Limit Transformers
  B.1 Proof of Theorem 7
  B.2 Result for NoPE Transformers
  B.3 Logarithmic Communication Complexity for Limit Transformer
  B.4 Statement of Main Theorem for Arbitrary Training Lengths
  B.5 Corollary about Expressivity
  B.6 From C-RASP to Limit Transformers
C Expressivity Proofs for C-RASP
  C.1 C-RASP Constructions
    C.1.1 Majority
    C.1.2 Dyck-1
    C.1.3 a^n b^n c^n
    C.1.4 Existential Quantification
    C.1.5 Piecewise Testable Languages
  C.2 C-RASP[periodic, local] Constructions
    C.2.1 Induction Head (Argmax Version)
    C.2.2 Induction Head (All possible next symbols)
    C.2.3 (aa)*
  C.3 Expressibility of Regular Languages in C-RASP[periodic, local]
    C.3.1 Link to Majority Logic
    C.3.2 Inexpressibility of Σ*be*
    C.3.3 Inexpressibility of PARITY
D Discussion of Design Choices
  D.1 MLP Activation Functions
  D.2 Fixed Precision
  D.3 Layer Norm
E Additional Details for Experiments
  E.1 Regular Languages from the Bhattamishra et al. 2020 Benchmark
    E.1.1 Language Definitions
    E.1.2 C-RASP Expressiveness
  E.2 Algorithmic Tasks
    E.2.1 Task Definitions for Algorithmic Problems
    E.2.2 Limit Transformers and C-RASP expressiveness on algorithmic tasks
  E.3 Details of Experimental Setup
F Translating between Transformers and Limit Transformers
  F.1 Product Parameterization
  F.2 Formal Definition of Hypothesis Class
  F.3 From Limit Transformers to Transformers
  F.4 From Transformers to Limit Transformers
    F.4.1 Proving Lemma 52 (I): Preliminaries
    F.4.2 Proving Lemma 52 (II): Construction of T
    F.4.3 Proving Lemma 52 (III): Proving Correctness
G Additional Supporting Results
  G.1 Regularizer at Initialization
  G.2 Empirical Length Generalization of Positional Functions
  G.3 Bound for Encodings Norm in Terms of Function Complexity
A FAQ
(1) What is the point of introducing Limit Transformers?
Limit Transformers are a mathematical formalism helping us prove a length generalization guarantee
(Theorem 7) for a broad class of functions, not just one specific function. They thus serve as an
object that can help us prove things about standard transformers.
(2) What is the relation between Limit Transformers and C-RASP? Why use two different for-
malisms?
Limit Transformers are closely connected to standard transformers and provide a convenient formal-
ism for formalizing a length generalization guarantee in our inference setup (Theorem 7); they also
provide bounds on APE transformer expressiveness as a side result (Appendix C). C-RASP is a for-
malism based on the RASP language (Weiss et al., 2021), intended to provide a formal abstraction of
the kinds of computations that transformers can perform in a human-readable format. Limit Trans-
formers with PERIODIC and L OCA L express all the functions definable in C-RASP[periodic,local],
though it is open if this inclusion is strict. We provide rigorous tools for understanding the ex-
pressiveness of both formalisms. For Limit Transformers, we prove a logarithmic communication
complexity bound (Theorem 12). C-RASP brings additional use in understanding expressiveness
from two angles. First, one can conveniently prove functions expressible by writing down programs,
as we did in Section 4.2. Second, to prove negative results, we can bring to bear a set of deep results
about logics using majority quantifiers (Krebs, 2008), which allow us to settle the expressiveness
of many problems provably. Positive results translate into positive results about Limit Transformer
expressiveness and hence, length generalization under our idealized learning setup. While it is open
if problems not expressible in C-RASP cannot in principle show length generalization, experimental
results suggest that such an implication might hold in many cases.
(3) Why are Limit Transformers needed can’t one just consider transformers whose parameters
have infinite precision and hence can accommodate infinitely many different positional encodings
pi?
The key advantage of Limit Transformers is that they effectively have finite parameter counts when-
ever they satisfy LOC AL and PERIODIC, which is useful in establishing Theorem 7. In an ordinary
transformer, due to fixed width, effectively distinguishing unboundedly many positions requires in-
finitely many parameters pi. Even then, a function as simple as ϕ(i, j) = δij cannot be exactly
represented for infinitely many i, j by a product piQTKpjat bounded width d.
(4) Why is the idealized setup considered for the analysis, as opposed to more practical frameworks
of learning?
Proving guarantees in a more practical setting (SGD training dynamics, subsampling of training
data) would, of course, be ideal. However, such guarantees have been notoriously difficult to estab-
lish for deep learning models (Neyshabur et al., 2017). Standard frameworks for learning, such as
16
Preprint
PAC-learning, assume that the training and test distributions are the same, which precludes out-of-
distribution guarantees such as length generalization. Even within the PAC-learning framework, ob-
taining nontrivial guarantees for deep neural networks remains challenging without making strong
assumptions. Instead of analyzing the learning and generalization of Transformers trained with
gradient-based methods, our work aims to understand the length generalization properties of Trans-
formers from an architectural perspective. A substantial body of work (cf. Section 6) has empirically
investigated the length generalization properties of Transformers and found a complex array of em-
pirical behavior, while theoretical understanding has been very limited. Hence, consolidating the
theoretical relation between these empirical observations and the computational model of Trans-
formers seems like an important direction. Our work provides a formal framework, based on an
idealized model of learning, that separates the tasks on which Transformers succeed and those on
which they fail to length-generalize. The learning model considered in our work is closely related to
the “identification in the limit” setting, which has been widely studied for decades in the context of
learning automata and grammars (De la Higuera, 2010). Our framework is successful in explaining
a wide range of empirical observations (Figure 1). This is a substantial advance, as no prior theo-
retical framework has been able to explain the empirical patterns in Figure 1 to the extent that our
framework can. We hope that further work can build on these insights to establish guarantees that
reproduce this success while narrowing the gap between theoretical analysis and practical learning.
(5) Why does the length generalization condition in Theorem 7 ask for supnR(Tn)<? Isn’t
asking for length generalization sufficient?
If supnR(Tn) = , a transformer Tminimizing R(T)while fitting behavior at some length will
be unlikely to work at substantially longer lengths because performing the task correctly at longer
and longer lengths requires unbounded increase in R(T). It might still happen that generalization
from length n
2to length nis possible in certain problems not expressible by Limit Transformers.
However, this will depend on the problem and the specific scaling of test lengths relative to training
lengths; for problems not satisfying the conditions in Theorem 7, length generalization will fail
when the test length is made sufficiently longer than the training length, even as the training length
diverges to infinity. We make this formal in Section B.4.
(6) Given a task, how can one settle Limit Transformer and C-RASP expressiveness?
Showing that a task is definable by Limit Transformers or C-RASP simply requires providing an
explicit construction, as we exemplify for various tasks (Section C.1). For showing that a task is not
definable in these formalisms, we provide a battery of methods that allow us to provide an answer
for many tasks: communication complexity (Theorem 12) applies to both formalisms; for showing
non-definability in C-RASP, reduction to specific languages already proven not to be expressible
(such as Parity and Lbb, see Appendix E.1.2) is frequently useful.
(7) Why does the guarantee specifically apply to PERIODIC and LOC AL Limit Transformers? What
is special about such positional relations?
Local positional relations are important because, if a product function of the form pT
iQTKpj,
where the rank of Q,Kis not constrained, takes nonzero values at unboundedly long distances ji,
there is no general reason why the function should length-generalize. Independent initialization of
the pi’s tends to lead to values close to zero for most of these products (Appendix G.1); our Inference
Procedure incorporates this via the term (8). Given this, one expects a learned model to still exhibit
small products at distances not present in the training distribution, and hence a failure of length
generalization in the presence of nonlocal product functions.
The situation is different for products involving Vl,h matrices, whose rank is penalized by R(·); these
are able to represent not local, but periodic functions. In the finite-precision setup, a translation-
invariant product function of the form pT
iM1. . . Mkpjmust be periodic in jiwhenever one
of the matrices M1. . . Mkhas bounded rank as the number of positions considered diverges to
infinity, with period bounded in terms of the rank (Lemma 48). Hence, in a transformer TΘn,
any product function involving one or more Vl,h matrices needs to be periodic with period bounded
in terms of R(T).
17
Preprint
B PROO FS AB OUT LIMIT TRANSFORMERS
B.1 PROO F OF TH EO REM 7
We re-state and then prove Theorem 7:
Theorem 14 (Guaranteed Length Generalization in the Limit, restated from Theorem 7).Let f
F(Σ). Then the following are equivalent:
1. fis expressible by a Limit Transformer satisfying PERIODIC and LOCA L.
2. (Guaranteed Length Generalization) Consider the inference procedure from Definition 6
applied to fwith R, generating a sequence T1, T2, . . . . For any such sequence, there is
some N0such that, for all m > N0,Tmmatches fon all inputs of any length km, and
supn=1,2,3,... R(Tn)<.
Remark 15. We note that a limit transformer Trepresenting fneed not itself be offset-invariant.
It is sufficient to have
T(x, 0) = f(x)(6)
Lemma 47 shows that such a function has a sequence of transformers TnΘnwhich are offset-
invariant, even without assuming Tto be offset-invariant.
High-Level Proof Sketch The key to the proof is a compactness property: Any sequence
T1, T2, . . . (TiΘi) where supiR(Ti)<has a subsequence of transformers whose behav-
ior across inputs can be summarized into a single Limit Transformer. For 12, given a sequence
generated by the Inference Procedure, we show that Rstays bounded and use the compactness
property to show that a subsequence exhibits behavior equivalent to f. To show that, in fact, all pos-
sible sequences Tngenerated by the Inference Procedure ultimately exhibit behavior equivalent to
f, when nis large, we show that subsequences failing to length-generalize would exhibit increasing
attention dot products between far-away positions as input length increases. However, due to the
penalty on attention dot products in R, any such sequence would, for large n, need to have a higher
value of Rthan sequences avoiding such an increase. For 21, we obtain the Limit Transformer
from the compactness property applied to the sequence generated by the Inference Procedure. The
penalty on attention dot products enforces that it satisfies LO CAL; the bounds on the MLP and value
matrices enforce that the positional encodings in the Limit Transformer can be taken to be periodic.
Preliminaries and Formal Proof We now proceed to the formal proof. We make crucial use of
the two technical Lemmas 47 and 52, which provide translations between ordinary transformers and
Limit Transformers.
The following definition will be used:
Definition 16. If TΘi, then define R(T)to be R(T)minus the term in Eq. (8). That is,
R(T) = R(T) + X
l,h X
1jN(T)|pT
1KT
l,hQl,h pj|2(7)
The following lemma will be used for both directions of the main theorem:
Lemma 17. Let T1, T2, . . . , where TnΘn, be a sequence generated by the Inference Procedure
based on the functional behavior of a function f F, and such that
sup
n=1,2,3,... R(Tn)<(8)
Then fis expressible by a Limit Transformer satisfying PERIODIC and LOCA L, and there is some
N0such that, for all m>N0,Tmmatches fon all inputs of length km.
Proof. From the sequence T1, T2, . . . generated by the Inference Procedure, we obtain, using
Lemma 52, Limit Transformers ˜
T1,˜
T2, . . . such that supiR(˜
Ti)<where
˜
Ti(x, o) = Ti(x, o),i, o, x;|x|+oi(9)
18
Preprint
and, in each Tn,˜
Tn,
ϕl,h(i, j ) = pT
iKT
l,hQl,h pj(10)
Due to supiR(˜
Ti)<, we know that, except for the functions ϕl,h, only a finite number of
Limit Transformer parameter settings will be traversed by ˜
Ti. Each function ϕl,k is local; however,
a priori, they might not be local for any single finite τacross the different ˜
Tn. We will show that this
is not possible, i.e., we will show that all ϕl,k are local for a single finite τacross the different ˜
Tn.
This will occupy us for the remainder of the proof.
First, we note that R(Tn)converges because infTUnR(T)is bounded and monotonically increas-
ing in n. For each τand each n, we consider
Dn(τ) = X
l,h
τ
X
i=1 |ϕl,h(1, i)|2 R(Tn)
where the ϕl,h function is taken from ˜
Tnwhen defining Dn.
Consider R(Tn)(Equation 7). Let
R0:= lim inf
n→∞ R(Tn)(11)
and let ν1, ν2, ν3, . . . be such that
lim
i→∞ R(Tνi) = R0(12)
Then, for some D0,
lim
i→∞ Dνi(νi) = D0(13)
and
lim
n→∞ R(Tn) = lim
i→∞ R(Tνi) = R0+D0(14)
Indeed,
D0= lim sup
n→∞
Dn(n)(15)
because2
D0+R0= lim
n→∞(R(Tn) + Dn(n)) = lim inf
n→∞ R(Tn) + lim sup
n→∞
Dn(n)(16)
Define, for each τN,
D(τ) = lim inf
i→∞ Dνi(τ)(17)
As this function is monotonically increasing, and as ϕl,h has bounded precision, there must be τ
such that D(τ) = limτ→∞ D(τ).
Now define a sequence T
nby selecting, for each n, an i(n)such that νi(n)nand
Dνi(n)(n) = lim inf
j→∞ Dνj(n) = D(n)D(τ)
(with equality when νjτω); then define T
nas the restriction of Tνi(n)to positions up to n. As
Tνi(n)agrees with the behavior of fup to length νi(n)
2n
2, we also find that T
nagrees with the
behavior of fup to length n
2. Then
lim sup
n→∞ R(T
n) = lim sup
n→∞ R(T
n) + Dνi(n)(n)
= lim sup
n→∞ R(Tνi(n)) + D(τ)
=R0+D(τ)
2In general, if an+bnconverges and an, bnare bounded, then the limit lim(an+bn)equals lim sup an+
lim inf bn. For, assume lim sup an+ lim inf bn>lim(an+bn)(similar if >is replaced by <). Then
let i(n)be a subsequence such that ai(n)lim sup an. Then lim(an+bn) = lim(ai(n)+bi(n)) =
lim sup an+ lim bi(n)lim sup an+ lim inf bn>lim(an+bn), contradiction.
19
Preprint
Since Tnwas created by the Inference Procedure, we have
lim sup
n→∞ R(T
n)lim
n→∞ R(Tn)(18)
On the other hand, since R(T
n) R(Tνi(n)), we also have
lim sup
n→∞ R(T
n)lim
n→∞ R(Tn)(19)
giving
lim sup
n→∞ R(T
n) = lim
n→∞ R(Tn) = D0+R0(20)
Hence,
R0+D(τ) = lim sup
n→∞ R(T
n)
= lim
n→∞ R(Tn)
=R0+D0
and D(τ) = D0. Now assume there are infinitely many nsuch that ϕl,h is not τ-local in Tn,
hence, infinitely many nsuch that Dn(n)Dn(τ)+22p. Then:
D0= lim sup
i→∞
Dn(n)lim sup
n→∞
Dn(τ)+22plim inf
i→∞ Dνi(τ)+22p=D0+ 22p(21)
This is a contradiction.
We thus have shown that the functions ϕl,k (˜
Tn)must be local for a uniform τ. We thus know
that the sequence ˜
Tionly traverses a finite set of possible Limit Transformers. The set of traversed
functions becomes stationary at some i=N0; all of these must be functionally equivalent to f.
Hence, Tiis functionally equivalent to fat all lengths ias soon as iexceeds some threshold
N0.
We now prove the theorem.
Proof of the Theorem. Both directions are corollaries of Lemma 17.
21: This directly follows from Lemma 17.
12: By Lemma 47, for each i= 1,2,3, . . . , there are b
TiΘisuch that R:= supiR(b
Ti)<
such that b
Ti(x, o) = f(x, o),i, o, x;|x|+oi(22)
such that
pT
iKT
l,hQl,h pj=ϕl,h(i, j)(23)
By LO CA L,
R(b
Ti)<(24)
and we conclude
lim sup
i→∞ R(Ti)lim sup
i→∞ R(b
Ti)<(25)
Lemma 17 now provides N0>0and a function gsuch that for all m>N0,
Tm(x, o) = g(x),x:|x|+om(26)
On the other hand, for any string xS, we have
f(x) = Tn(x, 0),n2|x|(27)
Hence, fgand for all m>N0,
Tm(x, o) = f(x),x:|x|+om(28)
20
Preprint
B.2 RE SULT FO R NOPE TRANSFORMERS
Corollary 18. For ease of the reader, we mark the differences to Theorem 7 in blue font.
Let f F(Σ). Then the following are equivalent:
1. fis expressible by a Limit Transformer satisfying where all pi0,ϕl,h 0.
2. (Guaranteed Length Generalization) Consider the inference procedure from Definition 6
applied to fwith Rwhile constraining all pi0, generating a sequence T1, T2, . . . . For
any such sequence, there is some N0such that, for all m>N0,Tmmatches fon all inputs
of any length km, and supn=1,2,3,... R(Tn)<.
Proof. Retracing the proof of Lemma 47 shows that, when translating a Limit Transformer to an
ordinary transformer, the positional encodings can be taken to be zero when pi0,ϕl,h 0in the
Limit Transformer. Retracing the proof of Lemma 52 shows that, when pi0in a transformer, the
resulting Limit Transformer will have zero positional encodings and zero outputs for all ϕl,h. The
proof of Theorem 7 then applies equally to show Corollary 18.
B.3 LOGARITHMIC COMMUNICATION COMPLEXITY FOR LIMIT TRANSFORMER
Theorem 19 (Restated from Theorem 12).Let Tbe a Limit Transformer satisfying PERIODIC and
LOC AL. Assume that Toperates in precision O(log N), i.e., attention weights are rounded to
O(log N)precision. On an input xΣ2N, assume Alice has access to x1...N and Bob has access
to xN+1...2N. There is a communication protocol in which Alice and Bob exchange at most Clog N
bits, where Cdepends on Tbut not Nor x, and Bob can compute each activation in the second
half, y(l)
i(N+ 1 i2N). Further, Cis bounded linearly by R(T).
Proof. First, note that all activations y(l)
iare computed at log Nprecision because parameters are
at fixed precision and the output of exp(·)in the softmax attention computation is computed at fixed
fractional precision. We first consider the attention logits, in the case where j < N i:
a(l,h)
i,j = Roundp[(y(l1)
j)TKT
l,hQl,h y(l1)
i+ϕl,h(i, j )]
where Roundp[. . . ]rounds each entry to the closest number with pfractional bits. It is certainly
sufficient to have access to
a(l,h)
i,j =Roundp[y(l1)
j]T
KT
l,hQl,h y(l1)
i+ Roundp[ϕl,h(i, j )]
where pdepends on pand the largest singular value of KT
l,hQl,h , which is a finite constant. We can
thus partition the positions j= 1, . . . , N 1into a bounded number of sets, indexed by
1. Roundp[y(l1)
j]
2. max(Nj, N L)where L= max{k:ϕl,h (1, k)= 0}.
Due to the finite precision rounding of logits and the locality of positional relations, we can maintain
a finite set of keys and queries (though not values). This is fundamental to getting a logarithmic
communication bound.
We show the claim by induction over the layers.
We can write
Y(l)
i=y(l1)
i+
H
X
h=1 Pi
j=1 exp(log |x| · a(l,h)
i,j )Vl,hy(l1)
j
Pi
j=1 exp(log |x| · ai,j )
21
Preprint
The residual stream is known to Bob by inductive hypothesis. We need to understand the term inside
the sum. The green terms are fully known to Alice, and the blue ones are fully known to Bob by
inductive hypothesis:
Pi
j=1 exp(log |x| · a(l,h)
i,j )Vl,hy(l1)
j
Pi
j=1 exp(log |x| · ai,j )
=PN1
j=1 exp(log |x| · a(l,h)
i,j )Vl,hy(l1)
j
PN1
j=1 exp(log |x| · ai,j )+Pi
j=Nexp(log |x| · ai,j )
+Pi
j=Nexp(log |x| · a(l,h)
i,j )Vl,hy(l1)
j
PN1
j=1 exp(log |x| · ai,j )+Pi
j=Nexp(log |x| · ai,j )
Alice can communicate the green terms for every set in the partitioning of the indices j < N defined
above. In fact, it is sufficient to communicate the number of relevant positions and the sum of the
vectors Vl,hy(l1)
j.
Corollary 20 (Restated from Corollary 13).The following problems are not expressible by Limit
Transformers satisfying PERIODIC and LOCA L: (1) copying strings with repeated n-grams, (2)
addition of n-digit numbers.
Proof. Formally, we define copying as the task of, given a prefix $x#, autoregressively predicting
x. Copying with repeated n-grams means that there is no restriction on the repetition of consecutive
subspans of xof any length; this is in contrast to copying tasks with restrictions on the repetition
of n-grams (for some n) in x(Jelassi et al., 2024; Zhou et al., 2024a), which we study separately
(Appendix E.2).
Formally, we define addition as the task of, given a prefix $x+y=, where x, y are binary strings,
to output the sum of the numbers denoted by x, y in binary.
The communication complexity lower bound for copying follows from a standard communication
complexity lower bound for determining string equality. The bound follows for addition since the
special case of adding 0 to a number amounts to copying.
Remark 21. Analogous bounds follow for various other algorithmic and formal language problems.
For instance, the special case of multiplying with 1 amounts to copying; hence, such a bound holds
for multiplication. For the unbounded-depth Dyck over two bracket types, we can consider a word
of the form (i1. . . (iN)jN. . . )j1, which is in the Dyck language if and only if ik=jkfor all k, again
allowing a reduction to the communication complexity lower bound for determining string equality.
B.4 STATE ME NT OF MAIN THEO RE M FOR ARBI TR ARY TRAINING LENGT HS
Our main theorem considers generalization from length n
2to length n. Here, we discuss an alterna-
tive version applying to arbitrary scaling of training vs testing lengths. In particular, in such a setup,
we explicitly obtain failure of length generalization for inexpressible functions, though potentially
requiring testing on lengths more than twice the lengths used in training. We use the following
definition:
Definition 22. Atraining length is a function t:NNsatisfying limt→∞ t(n) = +and
t(n)nfor all n.
If t(n)is a training length, then the t(n)-Inference Procedure determines TnΘ(n)to match fat
all inputs of lengths t(n)while minimizing R(Tn)up to 1
n.
The special case of t(n) = n
2is the Inference Procedure from Definition 6.
We then state:
Theorem 23. Let f F(Σ). The following are equivalent:
1. fis expressible by a Limit Transformer satisfying PERIODIC and LOCA L.
22
Preprint
2. Let t(n)be any training length. Then the t(n)-Inference Procedure will output solutions
T1, T2, . . . such that, for some N0, for all m>N0,Tmmatches fat all lengths m.
Intuitively, this says that, when selected to fit the behavior of fon sufficiently long inputs
of length t(n), the output of the Inference Procedure will generalize to unboundedly longer
inputs of length n, where ncan be arbitrarily larger than t(n).
Corollary 24. Assume f F(Σ) is not expressible by a Limit Transformer satisfying PERIODIC
and LOC AL. Then, for some training length t(n), the t(n)-Inference Procedure outputs a sequence
Tnwhere infinitely many Tnfail to match fat length n.
Remark 25. There are two important differences compared to Theorem 7. First, the second condi-
tion refers to length generalization for all arbitrary training lengths t(n), not specifically training
length n
2. Second, the second condition does not ask for supiR(Ti)<, but simply asks for Tnto
ultimately length generalize.
Proof of Theorem 23. 12 The proof of Theorem 7 remains valid in this direction without any
changes, as it does not specifically rely on the training lengths being half the overall context size.
21 We show the contrapositive. Assume fis not expressible by a Limit Transformer satisfying
PERIODIC and LO CA L. Then, using the same arguments as in the proof of Lemma 173, any sequence
TnΘnthat matches fwill have lim infn→∞ R(Tn) = (). Now consider kN; we will
assign every ka number nk> k, starting with n0= 0. For each n>k, there is ˆ
Tk,n Θnthat
matches fup to length kwhile Uk:= supnR(ˆ
Tk,n)<for every fixed k. Now select nk> nk1
such that no TΘnkwith R(T)Uk+ 1 matches fat length nk; this is possible because of
(). We thus obtain a sequence (k, nk)N×N. By construction, there are infinitely many distinct
different values nk. Then define
t(n) := max ({k:nkn})(29)
Then t(n)is a training length. By definition, the t(n)-Inference Procedure will, whenever nis one of
the nk’s, find a transformer Tnkwith R(Tnk)Uk+1
nkthat fails to match fat length n=nk.
B.5 CO RO LL ARY A BOU T EXPRESSIVITY
We have introduced Limit Transformers as a formalism for distilling computations of transformers
performing on longer and longer sequences into a single limiting object, helping understand length
generalization. Here, we show that they also provide a simple lower bound for the expressiveness of
causal transformers across input lengths:
Corollary 26. Let f F(Σ). Assume fis expressible by a Limit Transformer satisfying PERIODIC
and LOC AL. Then at each input length N, there exists a transformer TNperforming fon all inputs
of length up to Nsuch that:
1. The parameters of TNare expressed at pbit precision, with pindependent of N
2. The number of heads and layers of TNis bounded independently of N.
3. The width dof TNis bounded as O(N).
We note that an important aspect is that TNperforms correctly not just at length N, but at all lengths
up to N. This distinguishes the result from constructions guaranteeing the existence of a transformer
at a fixed length. For instance, Bhattamishra et al. (2024) provide a transformer for testing equality
between length N-strings (which could also be used for copying), but this construction uses specific
positional encodings that depend on the input length. In contrast, the result here provides conditions
under which a transformer can perform a task at all lengths up to a given bound; in this stronger
3Assume there is a sequence TnΘnthat matches fand has lim infn→∞ R(Tn)<. Translating each
element to a Limit Transformer leads to a sequence where, except perhaps for the functions ϕl,h, only a finite
number of settings will be traversed. Now, as in the proof of Lemma 17, one can use D(τ)to construct a
sequence of Limit Transformers that are local for a single τ. The important difference to Lemma 17 is that
here we are not assuming the sequence (Tn)nto be constructed by the inference procedure, but we nonetheless
obtain such a sequence.
23
Preprint
setup, no APE transformer for copying with uniform complexity bounds as provided by Corollary 26
is known, and the problem is indeed not expressible by Limit Transformers satisfying PERIODIC and
LOC AL (Corollary 13). In contrast, Corollary 26 provides APE constructions performing correctly
up to any given length for a wide class of problems including C-RASP[periodic,local].
Another important feature is that the construction provides a fixed precision for the parameters, as
is the case in real-world implementations. We note that, if parameters are at fixed precision, it is
generally not possible to find a single transformer across all input lengths in the APE setting; hence,
it is unavoidable that the width of the transformers will need to increase as the input length increases.
Importantly, many other aspects of the transformer’s complexity, such as the number of heads and
layers, remain bounded.
Proof. The statement is an immediate corollary of Lemma 47, which provides transformers
T1, T2, . . . with bounded R(TN), which by Definition 5 entails a uniform bound on precision,
heads, and layers. The construction provided in the proof of Lemma 47 provides a width bounded
as O(N).
B.6 FROM C-RASP TO LIMIT TRANSFORMERS
The proofs are adaptations of the proofs from Yang and Chiang (2024).
Theorem 27 (Restated from Theorem 9).For every C-RASP,Ψ] program Pwith local functions
Ψand any periodic functions Φthere exists a Limit Transformer Tthat satisfies PERIODIC and
LOC AL such that for all wΣ,Paccepts wiff Taccepts $w. If Puses no local or periodic
relations, then Trequires no functions ϕl,h or positional encodings pi.
Remark 28. We note that the Limit Transformer Tprovided by the proof of Theorem 27 emulates
the C-RASP program Pat zero offset: That is, Paccepts wiff a predetermined entry in the last
output dimension of T($w, 0) is above some threshold. In principle, its computations may not be
offset-invariant, i.e., for the constructed T, the output T($w, o)may depend on o. Importantly,
the proof of Theorem 7 does not require a Limit Transformer computing fto be offset-invariant, but
just requires it to compute fwhen the offset is zero. This is because Lemma 47 ensures that, for any
Limit Transformer Tsatisfying LOCAL and PERIODIC, even if it is not offset-invariant, there are
transformers TnΘnwhose behavior matches T(·,0).
Proof of Theorem 27. C-RASP has two sorts of operations, a Boolean sort and a Count sort. We will
simulate each operation in the transformer by storing the Boolean values as {0,1}, and storing the
counts as c
i+1 . That is, we say that a Limit Transformer Tsimulates a C-RASP program Pif for
every operation Pkof Pthere is a dimension dkin Tsuch that when Pk(i)when run on wis true iff
T($w)i+1,dk= 1 (and 0otherwise) for Boolean operation and Pk(i) = ciff T($w)i+1,k =c
i+1
for count operations.
The theorem will be shown by induction on the length of P. As a clarifying note, we use 0-indexing
everywhere in this proof. If Pis of length 0, we only have initial Qσ(i)vectors, which can be
simulated by appropriately setting the word embedding. Otherwise, assume all programs of length
kare simulated by some transformer, and we have cases for each type of operation Pk+1(i)
can be. All cases are identical to Yang and Chiang (2024) except for comparison, conditional, and
counting.
First, we must address the SOS token $. There exists a transformer layer that sets the entire vector
to 0in the initial position while leaving all other layers untouched. For instance, we may use a
conditional operation, as described later in the proof.
If Pk+1(i) := ϕ(i), a periodic positional function in Φ, then it is simulated in Tby appropriately
setting piin the positional encoding.
If Pk+1(i) := P(i) ? C1(i):C2(i), we can implement the following function: for P {−1,1}
and V[0,1]
f(P, V ) = V P =1
0P= 1
24
Preprint
This is achieved by f(P, V ) = ReLU(VP)ReLU(P). Thus, the desired Conditional Output
can be defined in a single FFN as f(P, V1) + f(P, V2), where the first layer and ReLU compute
each fterm and the second layer adds them together.
If Pk+1(i) := C1(i)C2(i). By the inductive hypothesis C1(i)and C2(i)are stored in dimensions
d1and d2as the value C1(i)
i+1 and C2(i)
i+1 . It suffices to check that C2(i)
i+1 =C1(i)
i+1 0.
To compute this, we use the Heaviside activation function, which we used in our model of MLPs as
discussed in D.1.
21 0 1 2
1
0
1
x
hs(x)
Thus, there exists an MLP which, letting x1and x2be the values in dimensions d1and d2, computes
(hs(x2x1) + 1)/2in the dimension reserved for Pk+1, which will be the Boolean value in {0,1}
corresponding to C1(i)C2(i).
If C(i) := #[ji]P(j)(using ψ(i, j) = ), then the desired sum is computed using uniform
attention since the boolean representation of P(j)is just 0or 1. We enforced that P(0) is false, so it
does not contribute to the sum. This is described in more detail in Yang and Chiang (2024), though
the case here is simpler.
If C(i) := #[ji, ψ(i, j )] P(j), we can think of it as implementing #[ji]ψ(i, j)P(j).
Suppose ψis a local function of the following form
ψ(i, j) = 1j=i
0else
Then C(i)will either be 1or 0depending if P(i)is true or false. If we set the query and key
matrices to 0we get
sij = log N·ψ(i, j)
We assume the log is base 2, but the argument is similar for others. Then we can have attention
compute
ci,k =X
ji
exp (log N·ψ(i, j)) ·P(j)
X
ji
exp (log N·ψ(i, j)) =X
ji
N(ψ(i,j)
ln 2 )·P(j)
X
ji
N(ψ(i,j)
ln 2 )
If P(i)and ¬P(j)for j=i, then we have a lower bound:
N(1
ln 2 )
N(1
ln 2 )+i1ci,k
If ¬P(i)and P(j)for j=ithen we have an upper bound:
ci,k i1
N(1
ln 2 )+i1
25
Preprint
Since N1
ln 2 i, and we know that P(i) ci,k 1
2, we can construct an MLP that computes
the correct value. It will output either 0
i+1 or 1
i+1 , in the dimension reserved for Pk+1(i), for instance
by using a conditional operation that checks that the output of the attention layer ci+1,k 1
2, which
was shown in an earlier case.
C EXPRESSIVITY PROOFS FO R C-RASP
C.1 C-RASP CONSTRUCTIONS
C.1.1 MAJORITY
MAJORITY is the language of strings over Σ = {0,1}with at least as many 1s as 0’s.
MAJORITY
C1(i) := #[ji]Q1(i)(1)
C0(i) := #[ji]Q0(i)(2)
M(i) := C1(i)C0(i)(3)
C.1.2 DYCK-1
Dyck-1 is the language of strings over Σ = {0,1}with at least as many 1s as 0’s.
Dyck-1
C((i) := #[ji]Q((j)The number of (up to position i(1)
C)(i) := #[ji]Q)(j)The number of )up to position i(2)
V(i) := C((i)< C)(i)Violation: there are more )than ((3)
CV(i) := #[ji]V(j)The number of Violations (4)
M(i) := CV(i) = 0 Matched: zero Violations (5)
B(i) := C((i) = C)(i)Balanced: same number of (and )(6)
D(i) := M(i)B(i)String is Matched and Balanced (7)
C.1.3 anbncn
Let Σ = {a, b, c}. This is another example of a counter language which C-RASP can express and
which transformers have been observed to length generalize on (Bhattamishra et al., 2020).
anbncn
Ca(i) := #[ji]Qa(j)(1)
Cb(i) := #[ji]Qb(j)(2)
Cc(i) := #[ji]Qc(j)(3)
A(i) := Cb(i) + Cc(i) = 0 (4)
B(i) := Cc(i) = 0 (5)
CA(i) := #[ji]Qa(j)A(j)(6)
CB(i) := #[ji]Qb(j)B(j)(7)
Ga(i) := CA(i) = Ca(i)(8)
Gb(i) := CB(i) = Cb(i)(9)
Gabc(i) := Ca(i) = Cb(i) = Cc(i)(10)
L(i) := Ga(i)Gb(i)Gabc(i)(11)
26
Preprint
C.1.4 EXISTENTIAL QUAN TIFICATI ON
This is generally a useful primitive, so to save a little space we can add a macro for existential
quantification towards the left in C-RASP. This is easily defined using counting:
P(i) :=
A(i)
C(i) := #[ji]A(j)(1)
P(i) := C(i)1(2)
And we abbreviate this using P(i) :=
A(i). We demonstrate its use below.
C.1.5 PIECEWISE TES TABLE LA NG UAG ES
Piecewise testable languages are Boolean combinations of languages of the form
Σa1Σa2Σ. . . ΣanΣ. This allows us to check for the presence of noncontiguous sub-
strings, which contrasts with the proof in C.3.2 that implies the presence of contiguous substrings
cannot be expressed in C-RASP[].
It suffices to show programs for languages of the form L= Σa1Σa2Σ. . . ΣanΣ, since
Boolean combinations are recognizable using Boolean operations of C-RASP. For Lwe have the
following C-RASP program which has the final accepting operation Ln:
Σa1Σa2Σ. . . ΣanΣ
L1(i) :=
Qa1(i)a1occurred (1)
L2(i) :=
Qa2(i)L1(i)a2occurred, preceded by a1(2)
.
.
. (3)
Ln(i) :=
Qan(i)Ln1(i)anoccurred, preceded by an1, . . . , preceded by a1(4)
C.2 C-RASP[PERIODIC,LOCAL]CONSTRUCTIONS
C.2.1 INDUCTION HEA D (ARGMAX VERSIO N)
As an example consider Σ = {a, b, c}. Predicate N EX Ta(i)is true iff the next token should be an
a. First we can define predecessor
CPa(i) := #[ji, j =i1] Qa(j)
P REDa(i) := C Pa(i)1
Then we can count bigram occurence by counting
Cab := #[ji]Qb(j)P REDa(j)
Then each NEXTa(i)predicate can be defined by checking the current symbol and finding the
most frequently occuring bigram.
27
Preprint
NEXTa(i)(Argmax) over Σ = {a, b, c}
.
.
. (1)
MOREaa,ab (i) := Caa(i)Cab (i)(2)
MOREaa,ac (i) := Caa(i)Cac (i)(3)
NEX Ta(i) := Qa(i)M OREaa,ab(i)M OREaa,ac(i)(4)
This corresponds to testing, for the fin Equation 10, for which σthe entry f(x1. . . xN)N is
maximal.
C.2.2 INDUCTION HEA D (ALL POSSIBLE NEXT SYMBOLS)
Consider Σ. For aΣ, predicate N E X Ta(i)is true iff the next token can possibly be an a. As in
Section C.2.1, first, we can define predecessor
CPa(i) := #[ji, j =i1] Qa(j)
P REDa(i) := C Pa(i)1
Then we can check for bigram occurrence by counting
CBIGRAMab := #[ji]Qb(j)P RE Da(j)
EX I ST Sab := CBIGRAMab(i)1
If a bigram σa ever occurred previously in the string, nonzero probability is assigned to predicting
awhen at symbol σ. Then each N EX Ta(i)predicate can be defined as follows
NEXTa(i)(All Possible) over Σ = {a, b, c}
.
.
. (1)
NEX Ta(i) := _
σΣ
[Qσ(i)EX IS T Sσa(i)] (2)
where WσΣcan be expressed using the Boolean operations and ¬as defined in Section 4.2. This
corresponds to testing, for the fin Equation 10, for which σwe have f(x1. . . xN)N,σ >0.
Generating based on this program Consider an input prefix of the form #x#, where #denotes
a separator symbol. If we iteratively generate the next symbol aby selecting aΣwhere NEXTa
holds at the last position, we generate a string #x#ywhere all bigrams in #yhad already occurred
in #x, a simple version of the in-context Markov chains studied by (Edelman et al., 2024).
Special Case: Unique Copying In the special case of an input prefix where each symbol occurs
at most once in x, the generation procedure defined above will copy x, and (assume we stop at #)
resulting in an overall string of the form #x#x#. This is essentially the RASP-L construction of
unique copying noted by Zhou et al. (2024a).
Necessity of Positional Relations Intuitively, an induction head circuit requires positional in-
formation; indeed, we observe length generalization in unique copying with APE but not with
NoPE (Figure 1). Formally, we can prove as follows that the predicate NEXTadefined above
for each aΣ, while definable in C-RASP[local], is not definable in C-RASP[]. Consider
Σ = {a, b}; then the predicate NEXTbcan be used to define the (disjoint) union of the languages
ΣabΣa,ΣbbΣb. As the first one is definable in C-RASP[]4and the union is disjoint, the sec-
ond would be definable if NEX Tbis. This contradicts the fact that ΣbbΣ∈ C-RASP[], because
4It is sufficient to check whether aand bboth are present and whether one bhas a ain its preceding context;
as Σ = {a, b}, this is equivalent to ab being a substring
28
Preprint
ΣbbΣ∈ \
MAJ2[<](Lemma 6.11 in Krebs (2008)) and the inclusion C-RASP[]\
MAJ2[<]
(see Section C.3.1).
C.2.3 (aa)
The following function that checks the parity of a position mod 2is a periodic function.
ϕ(i) := i0 mod 2
So the following program recognizes (aa)
(aa)
C¬a(i) := #[ji]¬Qa(j)(1)
A(i) := C¬a(i) = 0 (2)
D(i) := ϕ(i)A(i)(3)
The Boolean value of the last operation in the last position of the string is the accepting value. This
is true if the string is of even length and contains only as. Overall, we have constructed a program
in C-RASP[periodic,local].
C.3 EXPRESSIBILITY OF RE GU LAR LA NG UAG ES IN C-RASP[PERIODIC,LOCAL]
Lemma 29 (Restated from Lemma 11).Consider the alphabet Σ = {a, b, e}.
1. P ARI T Y := b(abab)∈ C-RASP[periodic,local]
2. (aa)C-RASP[periodic,local]and (aa)∈ C-RASP[]
3. (a|b|e)be∈ C-RASP[periodic,local]
4. LC-RASP[]for piecewise testable L
Proof. 1–3 are shown in Lemmas 36 (for 3.), 38 (for 2.), 41 (for 1.), and Appendix C.2.3 (for 2.). 4.
is shown in Appendix C.1.5.
C.3.1 LIN K TO MAJORITY LO GI C
In understanding the expressiveness of C-RASP, we draw on an established body of work on logics
using MAJORITY quantifiers. Merrill and Sabharwal (2023a); Strobl (2023) show that the ex-
pressiveness of transformers is upper-bounded by uniform TC0, which can be defined as the logic
FOM[BIT]. This logic is defined in terms of MAJORITY quantifiers and various predicates. C-
RASP[periodic,local] can be viewed as a highly restricted fragment of this logic. Specifically, it is
contained in \
MAJ2[<, +1, M od], which was studied by Krebs (2008); Behle et al. (2007; 2009);
results about that logic help understand the expressiveness of C-RASP:
Definition 30. \
MAJ2[<, +1, M od]is the logic defined by the constructs
1. The construct
[
M aj x ϕ1, . . . , ϕc
2. The predicates Qa(x)for aΣ
3. Numerical predicates for q, j N:M odm,r(x),S ucc(y, x)
4. Boolean connectives
29
Preprint
5. First-order quantifiers5
such that only two variables (say, xand y) can appear within a formula. We define the semantics,
when xΣ, by defining
1. for the majority quantifier:
w|=[
M aj x ϕ1, . . . , ϕc 0<
n
X
i=1
c
X
j=1 1if w|x=i|=ϕj
1else
2. for the predicates:
w|x=i|=Qa(x)wi=a
w|x=i|=M odm,r (i)ir(mod m)
w|x=i,y=j|=Succ(j, i)j+ 1 = i
Semantics of Boolean connectives and first-order quantifiers follow the standard definition.
A language L Σis definable in \
MAJ2[<, +1, M od]if there is a formula ϕwithout free variables
such that w L if and only if w|=ϕ.
The logic \
MAJ2[<]results by omitting the numerical predicates defined under (3).
It is straightforward to convert C-RASP programs into formulas of \
MAJ2[<, +1,MOD]. As we
shall see later in Section C.3.3, the inclusion is strict because P ARI T Y is expressible even in
\
MAJ2[<].
Proposition 31. C-RASP[periodic,local]\
MAJ2[<, +1, M OD]
We make two remarks about the corollaries of the result:
Remark 32. First, the proof, simply by omitting the positional relations, also yields a corresponding
inclusion without positional relations: C-RASP[]\
MAJ2[<].
Second, the result implies that C-RASP[periodic,local]defines a subclass of TC0, in fact, all
C-RASP[periodic,local]programs translate into uniform TC0circuits with a linear number of gates
by results in Krebs (2008, Theorem 4.33 and Figure 4.4). The inclusion is strict, e.g., PARITY has a
linear-size TC0circuit but is not definable by C-RASP[periodic,local], as we show below.
Proof of Proposition 31. First, every periodic positional function ϕ(i)is a Boolean function
Modm,r (i) i=rmod m. For local functions ψ(i, j), it suffices to only consider func-
tions of the form ψ(i, j) j=i+cfor cZ. This is because the counting operation
C(i) := #[ji, |ij| c]P(j)is equivalent to ˆ
C(i)where
ˆ
C(i) := #[ji, j =icj=i(c1) . . . j=i+c]P(j)
And it is possible to further reduce this by distributing the disjunctions over many counting op-
erations so that each one only contains a single disjunct as positional function. It helps that
a predicate fst(i)is definable in \
MAJ2[<, +1,MOD]which is true iff i= 0. For instance
fst(i) := [
Majjji, ⊤⟩
For each Boolean C-RASP operation P(i), there exists a \
MAJ2[<, +1,MOD]formula ˆ
P(i)with
one free variable that is equivalent. By induction, all cases are straightforward except for comparison
operations.
5These can be simulated by majority quantifiers with two variables by Proposition 5.5 in Krebs (2008),
which is based on Corollary 3.3 in Lange (2004). Nonetheless, as the simulation is unobvious, they are useful
for writing formulas in \
MAJ2[<, +1, M OD].
30
Preprint
For comparison operations, we will first show a formula that is equivalent for all nonempty strings.
Accounting for the empty string is easy, depending on the constants in the comparison. WLOG we
are able to rewrite the formula (not in standard C-RASP notation) as the following, where αk, β Z
X
kK
αk·#[ji]Pk(j)
+
X
mM
αm·#[ji, j =i+cm]Pm(j)
+β > 0
We’ve grouped the uniform counting operations that have ψ=together. Then using a case dis-
junction, we can rewrite it all the local counting operations as the following (using I[ϕ]as notational
convenience to turn ϕ(j) {⊥,⊤} to the corresponding value in {0,1}):
_
τ∈{0,1}M:I[Pm(jcm)]=τm
X
kK
αk·#[ji]Pk(j)
+
X
mM
αmτm
+β > 0
We can see that for nonempty strings within each case, the additive constant βcan be reformulated
as β+PmMαmτm·#[ji]fst(j), and we can just add it to the summation using αk+1 =
β+PmMαmτm. Then, it is possible to define formulas for each case disjunction using a
series of existential quantifiers and succ(j, i). For instance:
ϕ(j2) i. (Succ(i, j ) j. (Succ(j, i)ϕ(j)))
This means it now suffices to focus on the sum of counting terms and simulate that using a [
Maj
formula. For k(K+ 1), if αk>0consider the list of formulas
Lk:= [ ˆ
Pk(j),ˆ
Pk(j),..., ˆ
Pk(j)
| {z }
αkmany
,,,...,
| {z }
αkmany
]
Intuitively, the [
Maj j quantifier can only check if the total count is greater than half the possible
positions, so to check if a count is >0we need to pad the quantifier with a bunch of trivially true
formulas to ensure the total count is at least half by default. And if αk<0we use
Lk:= [¬ˆ
Pk(j),¬ˆ
Pk(j),...,¬ˆ
Pk(j))
| {z }
αkmany
,,,...,
| {z }
αkmany
]
Let L=L1++L2+ + . . . ++LK+1 be the concatenation of all these lists, and let φ1, φ2, . . . , φ|L|
list out the formulas in L. Then we claim the following formula will compute the correct value for
nonempty strings.
ϕ1(i) := [
Maj jφ1, φ2, . . . , φ|L|
For empty strings, if β+PmMαmτm>0then define ϕ0(i) := ¬[
Maj j⟨⊤⟩ . Otherwise,
use ϕ0(i) := ¬[
Maj j⟨⊤⟩ . Then we can define
ˆ
P(i) := ϕ0(i)ϕ1(i)
And verifying the correctness of this is straightforward.
31
Preprint
C.3.2 INEXPRESSIBILITY OF Σbe
Krebs (2008); Behle et al. (2007; 2009) used infinite groups to establish results about the expres-
siveness of \
MAJ2[<]; by Corollary 31, these results entail results on C-RASP[periodic,local]. In
particular, Lemma 6.11 in Krebs (2008) shows that Lbb ∈ \
MAJ2[<]; this result turns out to have
profound consequences for C-RASP expressiveness.
Definition 33. Let Σ = {a, b, e}. Define Lbb := ΣbebΣ.
Lemma 34. Let φbe a \
MAJ2[<, +1, M OD]formula. There exists a morphism h(σ) = esσes1
and a \
MAJ2[<]formula ψsuch that for every wΣ,h(w)ϕ wψ
Proof. Let Mbe the least common multiple of all occurring moduli in φ. Let Cbe the maximum
nesting depth of Succ(x, y )predicates (which must be bounded by the quantifier depth of φ). In-
tuitively, we can think of Cas the largest number where a subformula φ(x+C)occurs in φ. Let
s=MC, and define the morphism h(σ) = esσes1. Here we will use the notation ϕc(x)that is
true at position xin wwhenever ϕis true at position x+cin h(w), for c[s, s 1].
We will show that for every formula φ(x)of \
MAJ2[<, +1, M OD]with at most one free variable,
we can define φs(x), φ(s1)(x). . . φ(s1) (x)such that for all i[0,|w|1] and c[s, s 1]
h(w)φ(i+c) wφc(i)
Intuitively, what this does is it takes every interval of [xs, x + (s1)] around each position in
h(w)and stores it vertically at that position in w. We will induct on the complexity of φ. If φ(x)
is Qe(x), then φ0(x) := Qe(x), and then φc(x) = for every other c= 0, since the morphism h
pads neutral symbols ein h(w)between every symbol from w. If φ(x)is Qσ(x)for σ=e, we have
that φ0:= Qσ(x)and φ+c(x) = for every other c= 0. If φ(x)is M odm,r(x), the φccan also
be “hardcoded” similarly, as every position in h(w)that has a symbol from wis going to be = 0
mod s.
Boolean formulas are also straightforward. The only hard case is if we have a formula φ(x) =
\
MAJ yφ1(x, y ), . . . , φk(x, y). We can think of φspecifying the constraint
X
ik
#y[φi(x, y)]
> k ·|w|
2
The idea here is to rewrite ψi(x, y)in terms of its unary formulas (which we can apply the induc-
tive hypothesis to) and then split h(w)into some intervals, upon which evaluating φ+c(x)will be
simpler. First we can rewrite each φi(x, y)as
Fi(α1(x), . . . , αq(x), β1(y), . . . , βr(y), χ1(x, y), . . . , χp(x, y))
Where Fiis a Boolean function, the αare unary in x, the βare unary in y, and the χ(x, y)are in-
equalities of xand y, possibly with +1’s, of the form xy+ 1, for example. To save space,
we will abbreviate the above expression by grouping the α, β, χ formulas together notationally
Fi(χi(x, y), αi(x), βi(y)).
The χformulas are not unary, but we can “eliminate” the χterms by casework over intervals of
the string. We will show this by example for a summation with only one #yterm. This argument
works identically if we had many of #yterms, but it would add notational clutter. So if we had a
formula φ(x) = \
MAJ yφ1(x, y )we could think of it as in the form
φ(x) = # yF(χ(x, y), α(x), β (y)|w|
2
Then we can construct the formula φc(x)for c[s, s 1] by using the following partition of
intervals of the string. Let Ξbe the set of inequalities
32
Preprint
Ξ = {y < x s, y =xs, . . . , y =x, . . . , y =x+ (s1), y > x + (s1)}
And we define some notation. For ξΞ, let χξ(x, y ) {⊤,⊥} evaluate χin the case ξholds.
For instance if χ(x, y)is y < x + 1, then χy >x+2(x, y ) = . Since the intervals defined by
ξΞdisjoint and cover the entirety of the string, every χcan be evaluated in this manner. Then,
we can essentially compute the sum in each interval, and only precision is needed in the interval
[xs, x + (s1)], so φ(x)is equivalent to
#yhy < x F(χy <xs(x, y), αc(x), β s(y)i
+# yhy < x F(χy <xs(x, y), αc(x), β (s1)(y)i
.
.
.
+# yhy < x F(χy <xs(x, y), αc(x), β +(s1)(y)i
+# yhy < x F(χy <xs(x, y), αc(x), β s(y)i
+# yhx=yF(χy=xs(x, y), αc(x), β s(y)i
+# yhx=yF(χy=x(s1)(x, y), αc(x), β(s1) (y)i
.
.
.
+# yhx=yF(χy=x+(s1)(x, y), αc(x), β(s1) (y)i
+# yhx=yF(χy=x+s(x, y), αc(x), β s(y)i
+# yhx<yF(χx+s<y(x, y), αc(x), βs(y)i
+# yhx<yF(χx+s<y(x, y), αc(x), β(s1) (y)i
.
.
.
+# yhx<yF(χx+s<y(x, y), αc(x), β(s1) (y)i
+# yhx<yF(χx+s<y(x, y), αc(x), βs(y)i|w|
2
By the inductive hypothesis, all αcand βcare definable solely in terms of \
MAJ2[<], so the entire
formula is equivalent to a \
MAJ y formula that quantifies over all the bracketed formulas above,
as well as equally many trivially true formulas, as described more clearly in Proposition C.3.1. As
mentioned before, since this argument also applies to a summation of #yterms, this completes
the proof. Then for any φ(x)in \
MAJ2[<, +1, M OD], after performing the above translation the
resulting formula φ0(x)is our desired formula in \
MAJ2[<].
Lemma 35. Lbb ∈ \
MAJ2[<, +1, M OD]
Proof. Assume for sake of contradiction that Lbb is definable by a formula φof \
MAJ2[<
,+1, M OD]. Let hand ψbe as guaranteed by the above lemma. Then for wΣ,wLbb
h(w)Lbb. This means ψdefines Lbb which contradicts Lemma 6.11 in Krebs (2008), which has
shown that Lbb ∈ \
MAJ2[<].
33
Preprint
Lemma 36. For Σ = {a, b, e}, it holds that
Σbe∈ C-RASP[periodic,local](4)
Proof. To get a contradiction, note that a C-RASP[periodic,local]program Φfor Σbecould be
used to construct one for Lbb, as:
C1(i) := #[ji, j =i1] Φ(j)
P REVΦ(i) := C1(i)1
C2(i) := #[ji]Qb(j)P RE VΦ(j)
Lbb := C2(i)1
C.3.3 INEXPRESSIBILITY OF PARITY
First, let the depth of a C-RASP operation be the maximum depth of nesting of counting operations
in it. For instance if C(i) := #[ji]P(j), the depth of the Cis depth of P(i)plus one. None of
the other operations are greater than the depth of its dependencies. We will induct on program depth
for the following proof:
Lemma 37. Let Σ = {a}. For any C-RASP[]program Pthere exists an nsuch that for all w
where |w| n, either all such ware accepted by Por all are rejected
Proof. If Pis depth 0, it is equivalent to either Qa(i)or ¬Qa(i), which either rejects every string
or accepts every string.
Otherwise, let all C-RASP programs of depth kgive constant output for strings above length n, and
then consider a program Pof depth k+ 1.Pwill be equivalent to a Boolean combination of linear
constraints. We will see that each linear constraint becomes constant for strings above a certain
length. Consider any linear constraints over Xmany counts Cxof depth k:
L(i) :=
X
xX
αx#[ji]Cx(j)
c
For string of length in, this is equivalent to
L(i) :=
X
xX
αx((in)I[Cx(n)] + #[jn]Cx(j))
c
Where I[Cx(n)] denotes the truth value of Cx(n) {0,1}. Rearrange this to
L(i) :=
(in)X
xX
αx(I[Cx(n)])
+
X
xX
αx(#[jn]Cx(j))
c
The sums c1=PxXαx(I[Cx(n)]) and c2=PxXαx(#[jn]Cx(j)) are constants depend-
ing on the formula and n.
(in)c1+c2c
Depending on if c1, c2are positive or negative, we either derive a lower bound mafter which the
linear constraint L(i)is always true, or always false for im. Since any formula of depth k+ 1
is a Boolean combination of these linear constraints, we take the max of all the m’s from them, and
any string larger than this will always be accepted or rejected by P.
34
Preprint
Lemma 38. (aa)∈ C-RASP[]
That is, no C-RASP[]program can determine if a general string has an even length. The same
proof applies to testing whether the string length is a multiple of any other fixed integer.
Proof. Using the previous lemma, for every C-RASP[]program there exists an nsuch that the
program accepts (aa)niff it accepts (aa)na. So no program can recognize (aa).
We will use this to show that no C-RASP[periodic,local]program cannot recognize P ARI T Y , as
the extra positional operations do not give sufficient expressive power. We start with an observation
that simplifies the proof
Proposition 39. As syntactic sugar we allow j < i as a mask in C-RASP counting operations.
Proof. Consider the counting operation C(i) := #[ji]P(j). We can define the program
I(i) := P(i)?1:0
C(i) := #[ji]P(j)
C(i) := C(i)I(i)
And essentially, this operation will compute the count
C(i) := #[j < i]P(j)
Lemma 40. Let Pbe a C-RASP[periodic,local]program over Σ = {a, b}. There is some s > 0
and a morphism h(a) = bsabs1and a C-RASP[]program ˆ
Pover Σ = {a}such that for all
wa, if Paccepts h(w)iff ˆ
Paccepts w.
Proof. Choose sto be the least multiple of all moduli occurring in Pthat is also greater than all the
|c|in local functions j=i+c. For every operation P(i)of P, we will define ˆ
Pc(i)for c[s, s1]
such that ˆ
Pc(i)when run on wis equivalent to P(s+i(2s) + c)when run on h(w).
If P(i)is Qa(i)or Qb(i), it is straightforward, as ˆ
Pc(i)is true iff c= 0. Modular predicates are
also capable of being “hardcoded”, as positions in ware always 0 mod sin h(w). All other kinds
of operations are also straightforward using the inductive hypothesis. The only ones that need care
are counting operations. First, consider a counting operation without positional functions:
C(i) := #[ji]A(j)
We can define each ˆ
Cc(i)using a program like the following. The idea is that the entire window of
[js, j + (s1)] around each j < i can be counted up completely, but around iwe only consider
35
Preprint
the interval [s, i +c]:
Cs(i) := #[j < i]ˆ
As(j)
C(s1)(i) := #[j < i]ˆ
A(s1)(j)
.
.
.
C(s1)(i) := #[j < i]ˆ
A(s1)(j)
Is(i) := ˆ
As(j)?1:0
I(s1)(i) := ˆ
A(s1)(j)?1:0
.
.
.
Ic(i) := ˆ
Ac(j)?1:0
ˆ
Cc(i) := X
t[s,s1]
Ct(i) + X
t[s,c]
It(i)
Otherwise, if we have a counting operation that involves a local positional function
C(i) := #[ji, j =i+d]A(j)
Then the operation returns either the count 1or 0and we can just use
ˆ
Cc(i) := (c=dˆ
Ac(i)) ? 1 :0
Since dwill not exceed ±s,ˆ
Adexists. Using these constructions we can see that (bsabs1)nis
accepted by operation P(i)in Piff anis accepted by the constructed operation ˆ
P(s1)(i).
Lemma 41. P ARI T Y ∈ C-RASP[periodic,local]
Proof. If such a program existed, it would be able to distinguish between (bsabs1)2nand
(bsabs1)2n+1 for all n(using the sguaranteed by the previous lemma). However, this implies
the existence of a C-RASP[]program over Σ = {a}that recognizes (aa). This contradicts
Lemma 38.
D DISCUSSION OF DESIGN CHOICES
D.1 MLP ACTIVATION FUNCTIONS
Our analysis allows ReLU and the Heaviside function as activation functions in MLPs. ReLU is a
standard choice in theoretical studies of neural networks and transformers (e.g. Bhattamishra et al.,
2024; Sanford et al., 2023). Modern LLMs also use other functions such as SwiGLU (Shazeer,
2020), but universal approximation theorems guarantee that ReLU networks can approximate
smooth functions on bounded domains well. While the choice of ReLU is not necessarily key to
our results, it is important that the number of active units provides a meaningful upper bound on the
complexity of the function expressed. Our results would continue to go through if ϕis an arbitrary
activation function but operates at p-bit precision.
We also allow the Heaviside function as a second activation function. Heaviside allows exactly per-
forming threshold computations at arbitrary input lengths, which is relevant to simulating C-RASP
at arbitrary input lengths. This includes simple problems such as MAJORITY, on which transform-
ers empirically do well. Real-world transformers generally do not include this function, though
ReLU MLPs can approximate it arbitrarily closely. Theoretically modeling length generalization on
such functions with only ReLU MLPs is an interesting possible extension of our theory.
36
Preprint
D.2 FI XED PRECISION
As described in Section 2, we assume that attention logits and the exponentials inside softmax are
rounded to pfractional bits of precision before further processing. This allows us to cluster keys and
queries into finite numbers of clusters, and compute all activations at logarithmic precision, used for
proving the logarithmic communication complexity bound (Theorem 12). We note that logarithmic
precision of the intermediate activations is also key to upper bounds of transformers in terms of TC0
shown by Merrill and Sabharwal (2023b).
We also assume that the parameters of transformers and Limit Transformers are expressed at fixed precision (Definitions 4 and 2), and we penalize the precision p used for representing the parameters as part of the regularizer used in our inference procedure (Definition 5). Indeed, penalizing unbounded precision of parameter values is necessary to enable full identification of a transformer algorithm from behavior at finite lengths. For any real number α ≥ 0, a one-layer transformer with real-valued parameters can express the function
F_α(x) = 1 if α · #1(x) ≥ #0(x), and 0 else.
For any two distinct α, β, the functions F_α and F_β are distinct on sufficiently long inputs (though, when α and β are close, very long inputs will be needed to distinguish them). Thus, there are uncountably many distinct functions implemented by transformers; however, their distinction relies on infinite precision. In an infinite-precision setup, one cannot hope to identify algorithms implemented by transformers from finite data, no matter the input length and the regularization applied to the model size. In contrast, when parameters are representable in finite precision (as in real computers), the number of distinct algorithms expressed by Limit Transformers is countable, and ultimate identification from long inputs is possible when the precision required for representing the parameters is penalized.
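To make the rounding scheme concrete, here is a minimal sketch (our own illustration, not the paper's code) of rounding a value to p fractional bits, as described above for attention logits and the exponentials inside softmax; it also illustrates how two thresholds that are too close to resolve at this precision lead to identical behavior on a given input:

def round_to_fractional_bits(x, p):
    # keep p bits after the binary point, i.e., round to a multiple of 2**-p
    scale = 2 ** p
    return round(x * scale) / scale

# two nearby thresholds alpha and beta (hypothetical values) yield identical
# rounded quantities alpha*#1(x) - #0(x) once their gap is below 2**-p
alpha, beta, p = 0.5, 0.5 + 1e-9, 8
count_ones, count_zeros = 1000, 500
print(round_to_fractional_bits(alpha * count_ones - count_zeros, p),
      round_to_fractional_bits(beta * count_ones - count_zeros, p))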
D.3 LAYER NORM
Real-world transformers use Layer Norm or RMSNorm, whereby activations y_i^(l) are rescaled to have norm or standard deviation √d. Layer norm can be incorporated into the translation to Limit Transformers (Lemma 52) by recording terms of the form
v^T (∏_{S∈S_1} S)^T (∏_{S∈S_2} S) w    (5)
when v ∈ VO_{l_1}, w ∈ VO_{l_2}, and S_1, S_2 ∈ P. We can record these products in further dimensions of ŷ_i^(l), so that ||y_i^(l)||_2^2 is recoverable from ŷ_i^(l). The simplest approach is then to modify the definition of Limit Transformers by normalizing ŷ_i^(l) based on this recovered norm.
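The following minimal sketch (our own numerical illustration; dimension counts and names are assumptions) shows the idea in the simplest case: if the squared norm of an activation is recorded in an extra dimension, a normalized version can be recovered from the augmented vector alone:

import numpy as np

d = 4
y = np.random.default_rng(0).standard_normal(d)
# augmented activation: original entries plus the recorded squared norm
y_hat = np.concatenate([y, [np.dot(y, y)]])

# recover the norm from the recorded entry and rescale to norm sqrt(d),
# mirroring the normalization step described above
recovered_norm = np.sqrt(y_hat[-1])
y_normalized = y_hat[:d] / recovered_norm * np.sqrt(d)
print(np.linalg.norm(y_normalized))   # approximately sqrt(d) = 2.0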
E ADDITIONAL DETAILS FOR EXPERIMENTS
E.1 REGULAR LANGUAGES FROM THE BHATTAMISHRA ET AL. (2020) BENCHMARK
E.1.1 LANGUAGE DEFINITIONS
Descriptions follow Bhattamishra et al. (2020).
Tomita Grammars. Definitions are shown in Table 1.
D_n. The languages D_n are defined on the alphabet Σ = {a, b} by the recursion D_n = (aD_{n−1}b)*.
PARITY. PARITY is b*(ab*ab*)*. It is contained in the set of algorithmic tasks.
Others. Other languages: (aa)*, (aaaa)*, and (abab)* (not star-free), as well as aa*bb*cc*dd*ee*, {a, b}*d{b, c}*, and {0,1,2}*02* (star-free).
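For reference, here is a minimal sketch (our own illustration, not part of the benchmark code; the base case D_0 = {ε} is an assumption) of membership checkers for a few of these languages using Python regular expressions; the D_n checker follows the recursion directly:

import re

def in_aa_star(w):            # (aa)*
    return re.fullmatch(r"(aa)*", w) is not None

def in_tomita4(w):            # Tomita 4: binary strings without a 000 substring
    return "000" not in w and set(w) <= {"0", "1"}

def d_n_pattern(n):           # D_n = (a D_{n-1} b)*, with D_0 = {epsilon} assumed
    inner = "" if n == 0 else d_n_pattern(n - 1)
    return f"(a{inner}b)*"

def in_d_n(w, n):
    return re.fullmatch(d_n_pattern(n), w) is not None

print(in_aa_star("aaaa"), in_tomita4("0100"), in_d_n("aabb", 2))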
Grammar | Star-Free | Definition
1 | Yes | 1*
2 | Yes | (10)*
3 | No  | strings without an odd-length block of ones followed by an odd-length block of zeros (i.e., no 01^{2n+1}0^{2m+1}1 substrings)
4 | Yes | strings without any 000 substring
5 | No  | strings of even length with an even number of 1s
6 | No  | strings where (number of 0s − number of 1s) is divisible by 3
7 | Yes | 0*1*0*1*
Table 1: Tomita grammars (originally due to Tomita, 1982), following Bhattamishra et al. (2020).
#  | Language          | C-RASP[] | C-RASP[periodic, local] | Star-Free? | Dot-Depth | AC0?
1  | Tomita 1          | yes | yes | yes | 1  | yes
2  | Tomita 2          | yes | yes | yes | 1  | yes
3  | Tomita 3          | no  | no  | no  | -  | yes
4  | Tomita 4          | no  | yes | yes | 1  | yes
5  | Tomita 5          | no  | no  | no  | -  | no
6  | Tomita 6          | no  | no  | no  | -  | no
7  | Tomita 7          | yes | yes | yes | 1  | yes
8  | D2                | yes | yes | yes | 2  | yes
9  | D3                | yes | yes | yes | 3  | yes
10 | D4                | yes | yes | yes | 4  | yes
11 | D12               | yes | yes | yes | 12 | yes
-  | PARITY            | no  | no  | no  | -  | no
12 | (aa)*             | no  | yes | no  | -  | yes
13 | (aaaa)*           | no  | yes | no  | -  | yes
14 | (abab)*           | no  | yes | no  | -  | yes
15 | aa*bb*cc*dd*ee*   | yes | yes | yes | 1  | yes
16 | {a,b}*d{b,c}*     | yes | yes | yes | 1  | yes
17 | {0,1,2}*02*       | no  | no  | yes | 2  | yes
Table 2: The finite-state languages in the benchmark from Bhattamishra et al. (2020), with the numbering from Figure 1, C-RASP expressiveness properties, and three established notions of complexity (star-freeness, dot-depth, and membership in the circuit complexity class AC0). In the C-RASP columns, "yes" means we found a C-RASP program; "no" means we proved that no C-RASP program can exist. Note that {0,1,2}*02* is equivalent to the language Σ*be* from Lemma 11. Note also that we discuss PARITY with the algorithmic benchmark, as it is included in Zhou et al. (2024a). See discussion in Appendix E.1.2. See Figure 2 for a version of Figure 1 (right) with languages labeled.
E.1.2 C-RASP EXPRESSIVENESS
All languages in the benchmark are in TC0, and all are expressible in principle by transformers (Liu et al., 2023). We were able to provably settle the C-RASP expressiveness of all languages, with results shown in Table 2. We compare C-RASP expressiveness with the standard notions of complexity of finite-state languages considered by Bhattamishra et al. (2020): whether languages are star-free, and (among the star-free ones) their dot-depth. While many star-free languages in the sample show length generalization, star-freeness does not overall account for the observed behavior (Figure 4). Within the star-free languages, a standard complexity metric is dot-depth (Figure 2); this again does not accurately predict length generalization: it succeeds for a language with dot-depth 12 but fails for a language with dot-depth 2. We also considered the circuit complexity of regular languages (Barrington et al., 1992). All regular languages included in the sample are in the class TC0; most are also in AC0, a smaller class sometimes compared to transformers (Hao et al., 2022; Barcelo et al., 2024). Transformers show poor length generalization on the non-AC0 regular languages (see footnote 6), but also fail on various languages that are in AC0. On algorithmic problems, transformers succeed on some non-AC0 problems such as Majority (Table 3). Overall, C-RASP expressiveness is much more successful than previously considered notions of complexity in accounting for the empirical length generalization behavior of transformers.
6 By the results of Barrington et al. (1992), regular languages outside of AC0 are all, informally speaking, at least as hard as PARITY, and indeed they provably are not in C-RASP[periodic, local].
Proof Sketches for Membership Claims We sketch proofs for all C-RASP expressiveness claims in Table 2. We first note that C-RASP[periodic, local] is closed under inverse images of morphisms that map each symbol to a string of the same length. That is:
Lemma 42. C-RASP[periodic, local] is closed under inverse images of morphisms where each symbol is mapped to a string of the same length. That is, if h: Σ_1 → Σ_2^n (for some fixed n) is extended to a map Σ_1* → Σ_2* and L ∈ C-RASP[periodic, local], then h^{−1}(L) ∈ C-RASP[periodic, local].
Proof. Let h be such a morphism. We translate any operation P(i) of a C-RASP[periodic, local] program operating on h(w) into an equivalent family of operations P̂_0(i), ..., P̂_{n−1}(i) operating on w, where P̂_c corresponds to offset c within the length-n block of h.
For P(i) = Q_a(i), we can use P̂_0(i) := Q_{h(a)_0}(i), P̂_1(i) := Q_{h(a)_1}(i), and so on.
For P(i) = Mod_{m,r}(i), then for c ≤ n we can define (relying on the fact that m and n are fixed)
P̂_c(i) := ⋁_{k<m} ( Mod_{m,k}(i) ∧ Mod_{m,(c−k) mod m}(i) )
For P(i) = #[j ≤ i, ψ(i, j)] V(j), where ψ(i, j) = ⊤, we can, as before, define P̂_c(i) via
C_0(i) := #[j < i] V̂_0(j)
C_1(i) := #[j < i] V̂_1(j)
...
C_{n−1}(i) := #[j < i] V̂_{n−1}(j)
I_0(i) := V̂_0(i) ? 1 : 0
I_1(i) := V̂_1(i) ? 1 : 0
...
I_c(i) := V̂_c(i) ? 1 : 0
P̂_c(i) := ∑_{t∈[0,n−1]} C_t(i) + ∑_{t∈[0,c]} I_t(i)
If ψ(i, j) := (j = i − d), a little more care is needed, as the single position j at which to check V(j) may lie far behind the current position. First, by an argument similar to Proposition C.3.3, it is possible to simulate the counting operation #[j ≤ i − c, ψ(i, j)] V(j) for any constant c ∈ ℕ. We can do this by counting #[j ≤ i] (j = i − k) ∧ ψ(i, j) ∧ V(j) for each k ≤ c, since (j = i − k) ∧ ψ(i, j) remains a local function, and then subtracting these from the original count over j ≤ i.
Then, for c ≤ d:
C_0(i) := #[j ≤ i − ⌈d/n⌉] (0 = n − (d mod n)) ∧ V̂_0(j)
C_1(i) := #[j ≤ i − ⌈d/n⌉] (1 = n − (d mod n)) ∧ V̂_1(j)
...
C_{n−1}(i) := #[j ≤ i − ⌈d/n⌉] (n−1 = n − (d mod n)) ∧ V̂_{n−1}(j)
P̂_c(i) := ∑_{t∈[0,n−1]} C_t(i)
For c > d, the position to check occurs beyond i, so we use:
I_0(i) := ((c − d = 0) ∧ V̂_0(i)) ? 1 : 0
I_1(i) := ((c − d = 1) ∧ V̂_1(i)) ? 1 : 0
...
I_c(i) := ((c − d = c) ∧ V̂_c(i)) ? 1 : 0
P̂_c(i) := ∑_{t∈[0,c]} I_t(i)
All other cases are straightforward, and it can be verified that these constructions are correct.
Tomita 1 ∈ C-RASP[] A C-RASP program can detect the presence of a symbol other than 1 and flag a violation.
Tomita 2 ∈ C-RASP[] A C-RASP program expresses: at each position, either the current symbol is a 0 and the counts of ones and zeros are balanced, or the current symbol is a 1 and the count of ones is one more than the count of zeros.
Tomita 3 ∉ C-RASP[] For strings of the form 010^K, the outputs converge as K → ∞ by the same argument as for (aa)* in Lemma 38. Hence, Tomita 3 ∉ C-RASP[].
Tomita 3 ∈ C-RASP[periodic,local]Informally, the only way periodic and local predicates are
likely to help is if the lengths of contiguous blocks of zeros and ones were bounded (local), or
the parity of the lengths of 1and 0substrings were globally linked to the parity of the positions
of transitions 10 and 01 (periodic), but neither is the case. Sketching a formal proof, assume a
C-RASP[periodic,local]program is given for Tomita 3. First, we eliminate the periodic predicates
by labeling every symbol with the position modulo s, where sis a multiple of 2 and all moduli
appearing in periodic functions; giving an extended alphabet 11,...,1s; 01,...,0s. For sufficiently
large cthat is a co-prime with s, we can then also eliminate local functions by merging an adjacent
block of length caround every transition between ones and zeros into a single symbol Λ; indexed
by the first symbol inside the block and whether the transition happens at the floor(s/2)-th (second
part has even length) or ceil(s/2)-th (second part has odd length) position in the block. The resulting
language, over an extended alphabet, is recognized by a C-RASP[]program capable of determining
whether a string of the form
Λ11,...Λ0(1+c)%s,... Λ1(1+2c)%s,... . . . (6)
contains a substring of the form
Λ0...,even Λ1... ,evenΛ0...,even or Λ0... ,odd Λ1...,odd Λ0...,odd (7)
This is impossible for the same reasons that ΣbbbΣ(Tomita-4) is not in C-RASP[].7
7 Formally, Theorems 6.10 and 6.12 in Krebs (2008) show that the regular languages in \widehat{MAJ}_2[<] are contained in DA ∗ G. The second component in this product can capture the first subscript of each Λ, but not the second. Since the language Σ*aaaΣ* ∪ Σ*bbbΣ* over Σ = {a, b} is not in DA (shown, e.g., via Theorem 2c in Tesson and Thérien (2002), which would entail syntactic congruence of (aabb)^ω bb (aabb)^ω and (aabb)^ω), the claimed C-RASP[] program cannot exist.
Tomita 4 ∈ C-RASP[]By Lemma 6.11 in Krebs (2008) (discussed in Appendix C.3.2),
ΣbbΣ∈ \
MAJ2[<]. In analogy, ΣbbbΣ∈ \
MAJ2[<].
Tomita 4 C-RASP[periodic,local]AC-RASP program tests whether there is a position with a
0where the preceding position also holds a 0and the position preceding that also holds a 0.
Tomita 5 ∈ C-RASP[periodic,local]This language is the intersection of PARITY with the strings
of even length. C-RASP[periodic,local]inexpressiveness follows from the same arguments as for
PARITY (Lemma 41).
Tomita 6 ∈ C-RASP[periodic,local]Consider first the language L3where the number of 1’s is
divisible by 3. This is not in C-RASP[periodic,local]in analogy to PARITY. Now consider the
length-preserving morphism h(1) = 001,h(0) = 000. Then h(w) LT omita 6w L3.
Tomita 7 ∈ C-RASP[] Tomita 7 is 0*1*0*1*; writing a for 0 and b for 1, it can be shown to be equivalent to Σ* \ Σ*bΣ*aΣ*bΣ*aΣ*, which was constructed in Section C.1.5. Interestingly, directly implementing this C-RASP[] construction in a transformer appears to require at least four layers; this is in contrast to the C-RASP[periodic, local] construction discussed next, where two layers are sufficient.8
8 The first layer collects bigrams; the second layer counts the bigrams by comparing the count of each bigram to the count of SOS tokens (which is known to be one).
Tomita 7 ∈ C-RASP[periodic, local] A C-RASP program can count the number of positions carrying the bigrams 01 and 10 to detect a violation.
D_n ∈ C-RASP[] Similar to Tomita 2.
PARITY ∉ C-RASP[periodic, local] See Lemma 41.
(aa)* ∉ C-RASP[] See Lemma 38.
(aa)* ∈ C-RASP[periodic, local] See Example C.2.3.
(aaaa)* ∉ C-RASP[], ∈ C-RASP[periodic, local] Analogous to (aa)*.
(abab)* ∉ C-RASP[], ∈ C-RASP[periodic, local] Analogous to (aa)*.
aa*bb*cc*dd*ee* ∈ C-RASP[] A C-RASP program indicates, first, the presence of "a", "b", "c", "d", "e", and, second, that every "a" is preceded only by "a"; every "b" is preceded only by "a" or "b"; every "c" is preceded only by "a", "b", or "c"; and analogously for "d" and "e".
{a, b}*d{b, c}* ∈ C-RASP[] There is a single d; a can only appear before it; c can only appear after it.
{0,1,2}*02* ∉ C-RASP[periodic, local] See Lemma 36. Note that this is equivalent to the language Σ*be* over the alphabet Σ = {a, b, e}.
E.2 ALGORITHMIC TASKS
E.2.1 TASK DEFINITIONS FOR ALGORITHMIC PROBLEMS
The tasks are generally from Zhou et al. (2024a), except for Binary Majority Interleave. Here, we define each formally.
Binary Majority. The binary majority problem identifies the most frequent bit in a sequence of random bits. An example is SOS 0 1 ... 0 SEP 1 EOS. The part 0 1 ... 0 is the sequence of random bits; we define LEN to be the length of this part. We constrain the sequences such that the numbers of 0s and 1s are never equal. The model is trained with the language modeling loss on the part 1 EOS; in other words, it is only trained to predict the most frequent bit and the EOS token. The minimum length lmin of this task is 1.
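A minimal sketch (our own illustration, not the data-generation code used in the experiments; token names follow the example above) of generating one such instance:

import random

def binary_majority_example(length):
    # resample until the counts of 0s and 1s differ, as required by the task
    while True:
        bits = [random.randint(0, 1) for _ in range(length)]
        if bits.count(0) != bits.count(1):
            break
    label = 1 if bits.count(1) > bits.count(0) else 0
    # the loss is applied only to the tokens after SEP (label and EOS)
    return ["SOS"] + [str(b) for b in bits] + ["SEP", str(label), "EOS"]

print(binary_majority_example(5))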
Binary Majority Interleave. The sequences in this problem are created by interleaving multiple binary majority (see above) inputs while avoiding repeated special tokens (e.g., SOS). We use 3 binary majority sequences to compose one sequence in this task. Formally, given 3 binary sequences of the same length, x^1_1 ··· x^1_n, x^2_1 ··· x^2_n, and x^3_1 ··· x^3_n, and their corresponding labels (most frequent bits) y^1, y^2, y^3, the interleaved input is SOS x^1_1 x^2_1 x^3_1 x^1_2 x^2_2 x^3_2 ··· x^1_n x^2_n x^3_n SEP y^1 y^2 y^3 EOS.
An example is SOS 1 0 1 1 0 0 ... SEP 1 0 0 EOS. LEN in this problem refers to the length between SOS and SEP (exclusive). The model is trained with the language modeling loss on the part 1 0 0 EOS, and lmin = 3.
Majority. This problem is similar to the binary majority problem except that the vocabulary is bigger. An example is SOS c b a b SEP b EOS, where c b a b is a sequence of random tokens, each of which is sampled independently from an alphabet of 26 symbols. LEN is defined as the length of this part. We constrain the sequences such that there is always a unique answer. The model is trained with the language modeling loss on the part b EOS. lmin = 1.
Sort. In the sort problem, the model outputs a sorted version of the given sequence. An example is SOS 14 23 6 9 SEP 6 9 14 23 EOS, where 14 23 6 9 is a sequence of unique numbers. LEN in this problem refers to the length of this part. The model is trained with the language modeling loss on the part 6 9 14 23 EOS. In this problem, lmin = 1. The total vocabulary size, excluding special tokens, is equal to the maximum testing length, i.e., 150.
Copy (unique). In this problem, the model outputs the same sequence as the given sequence, which consists of unique tokens. An example is SOS 14 23 6 9 SEP 14 23 6 9 EOS, where the first 14 23 6 9 is a sequence of unique numbers. LEN in this problem refers to the length of this part. The model is trained with the language modeling loss on the second part 14 23 6 9 EOS. In this problem, lmin = 1. The total vocabulary size, excluding special tokens, is equal to the maximum testing length, i.e., 150.
Copy (repeat). In this problem, the model outputs the same sequence as the given sequence, which can contain repeated tokens. An example is SOS b a b SEP b a b EOS, where the first b a b is a sequence of random symbols. As in Zhou et al. (2024a), we use an alphabet of only 2 symbols. LEN in this problem refers to the length of this part. Each symbol is sampled independently and uniformly. The model is trained with the language modeling loss on the second part b a b EOS. In this problem, lmin = 1.
Parity. In the parity problem, the model recognizes whether the given sequence contains an even number of 1s and outputs a corresponding token. An example is SOS 1 0 0 1 0 SEP e EOS. The bits between SOS and SEP are a random sequence of bits; LEN in this problem refers to the length of this part. The token between SEP and EOS is the label; it can be either "e" or "o", meaning an even or odd number of 1s. The model is trained with the loss on the part e EOS. lmin = 0 for this problem. The bits are randomly sampled such that the number of 1s is distributed uniformly for a fixed LEN.
Addition. In the addition problem, the model does binary addition. An example is SOS 1 0 1 + 1 0 = 1 1 1 EOS. The two operands are sampled randomly. LEN in this problem refers to their total length, including "+" and "=". The model is trained with the loss on the part 1 1 1 EOS. lmin = 4 for this problem. Note that we do not pad operands with leading zeros to make them of equal length. The length of the first operand is sampled uniformly in [1, LEN−2], and the remaining length goes to the second operand. After determining the lengths, random bits are sampled uniformly.
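A minimal sketch (our own illustration, not the experiments' data generator; tokenization details and the requirement that both operands be non-empty are assumptions) of generating one addition instance:

import random

def addition_example(total_len):
    # total_len ("LEN") covers both operands plus the '+' and '=' tokens
    n_bits = total_len - 2
    len_a = random.randint(1, n_bits - 1)   # first operand length; the rest goes to operand B
    len_b = n_bits - len_a
    a = [random.randint(0, 1) for _ in range(len_a)]
    b = [random.randint(0, 1) for _ in range(len_b)]
    s = bin(int("".join(map(str, a)), 2) + int("".join(map(str, b)), 2))[2:]
    # the loss is applied only to the sum and EOS
    return ["SOS"] + list(map(str, a)) + ["+"] + list(map(str, b)) + ["="] + list(s) + ["EOS"]

print(addition_example(7))   # e.g. SOS 1 0 1 + 1 0 = 1 1 1 EOS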
Problem | C-RASP[] | C-RASP[periodic, local] | AC0?
Binary Majority | yes | yes | no
Binary Majority Interleave | none found | yes | no
Majority | yes | yes | no
Sort | yes | yes | yes
Copy (unique) | no | yes | yes
Copy (repeat) | no | no | yes
Parity | no | no | no
Addition | no | no | yes
Table 3: Expressiveness properties of algorithmic tasks as defined in Appendix E.2.1 and discussed in Appendix E.2.2. In the C-RASP columns, "yes" means we found a C-RASP program; "no" means we proved that no C-RASP program can exist; "none found" means that we found no program despite best efforts. All problems are expressible in TC0, the tightest known upper bound on the expressiveness of transformers. We also show membership in the circuit complexity class AC0, a smaller class sometimes compared to transformers (Hao et al., 2022; Barcelo et al., 2024); it is not predictive of length generalization either here or in the regular languages benchmark (e.g., Addition is in AC0 but Majority is not).
E.2.2 LIMIT TRANSFORMERS AND C-RASP EXPRESSIVENESS ON ALGORITHMIC TASKS
See Table 3. We provide proof sketches.
Binary Majority ∈ C-RASP[] A single count operation is sufficient.
Binary Majority Interleave and C-RASP[] We did not find a C-RASP[] program, though we do not have a rigorous proof of nonexistence. Note that, by Lemma 38, even the (seemingly easier) task of determining whether a given input is well-formed (input length is a multiple of the number of different majority sequences) cannot be solved by C-RASP[].
Binary Majority Interleave ∈ C-RASP[periodic, local] Periodic functions can be used to separately implement each count operation.
Majority ∈ C-RASP[] Similar to Binary Majority.
Copy (unique) ∈ C-RASP[periodic, local] If character (or n-gram) repetition is prevented, then the sequence length is bounded by the alphabet size, so the space of possible inputs becomes finite, seemingly precluding the asymptotic analysis done in Theorem 7. To overcome this (apparent) challenge, we consider two formalizations of this task as operating on unbounded-length inputs.
First, as explained in Zhou et al. (2024a) (and relatedly by Jelassi et al. (2023)), the unique copying task can be realized with an induction head circuit (Section 4.1 and Appendix C.2.2). More specifically, each position first records whether SEP has already appeared. An induction circuit then predicts new tokens in proportion to how frequently they have previously followed appearances of the current token before SEP (Section 4.1). Copying without repetition is a special case where each token occurs at most once, so the output of f in (10) is always 0 or 1. We show in Appendix C.2.2 that the induction head construction from Section 4.1 is expressible in C-RASP[periodic, local] but not in C-RASP[].
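The following minimal sketch (our own illustration of the prediction rule described above, not the attention-level construction from Section 4.1; token and function names, and the special case for the position directly after SEP, are assumptions) spells out the counting rule at the symbolic level:

from collections import Counter

def predict_next(tokens):
    # tokens: full prefix, e.g. ["SOS", "a", "c", "b", "SEP", "a"]
    sep = tokens.index("SEP")
    cur = tokens[-1]
    if cur == "SEP":                 # assumed special case: copy the first input token
        return tokens[1]
    # count how often each symbol followed an occurrence of `cur` before SEP
    follows = Counter(tokens[j + 1] for j in range(sep) if tokens[j] == cur)
    # with unique tokens, this counter has at most one entry, with count 1
    return follows.most_common(1)[0][0] if follows else "EOS"

print(predict_next(["SOS", "a", "c", "b", "SEP", "a"]))   # -> "c"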
We also considered a second formalization of the task, in terms of repeated copying, where, given an input such as SOS a c b SEP, the model repeatedly copies the string, always predicting the next character, leading to an unbounded sequence SOS a c b SEP a c b SEP a c b SEP.... This turns the copying task into a function f ∈ F(Σ) that operates on unboundedly long sequences, outputting next-token predictions at each position.
Copy (repeat) ∈ C-RASP[periodic,local]One proof proceeds via communication complexity:
By Corollary 13, copying of general strings is not expressible by Limit Transformers and hence
not in C-RASP[periodic,local]. While valid, this proof does not make transparent why length
43
Preprint
generalization is much easier if repetition is avoided. A different approach, not using communi-
cation complexity and crucially using the presence of repetition proceeds from the fact that over
the alphabet Σ {a, b, e}, the language ΣbebΣ∈ C-RASP[periodic,local](Lemma 35),
and uses it to deduce that, given an input of the form vbekbw#vbek(v, w Σ,klarge), no
C-RASP[periodic,local] program can reliably determine whether a bshould follow.
Sort ∈ C-RASP[] As explained in Zhou et al. (2024a), sorting can be realized by selecting the smallest number in the input that is larger than the last output symbol. This algorithm does not require local or periodic positional information.
Note that, as in Copy (unique), the input length is bounded by the alphabet size in this case, but we can view the task as defined on unbounded lengths by the same trick as for Copy (unique), whereby an initial sequence such as SOS a c b SEP is repeatedly sorted, leading to an unbounded sequence SOS a c b SEP a b c SEP a b c SEP....
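To make the selection rule concrete, the following minimal sketch (our own illustration, not the construction used in the proofs; function and token names are assumptions) implements it directly:

def sort_next(tokens):
    sep = tokens.index("SEP")
    inputs = [int(t) for t in tokens[1:sep]]            # numbers between SOS and SEP
    last = int(tokens[-1]) if len(tokens) > sep + 1 else None
    # select the smallest input element strictly larger than the last output
    candidates = [x for x in inputs if last is None or x > last]
    return str(min(candidates)) if candidates else "EOS"

# Example: after "SOS 14 23 6 9 SEP 6 9" the rule outputs "14".
print(sort_next(["SOS", "14", "23", "6", "9", "SEP", "6", "9"]))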
Parity ∈ C-RASP[periodic,local]See Lemma 41.
Addition ∈ C-RASP[periodic,local]Addition is at least as hard as copying because the special
case of adding zero to a number amounts to copying (Corollary 13).
E.3 DETAILS OF EXPERIMENTAL SETUP
As mentioned in the main paper, at train time we add random offsets to position indices so that all position embeddings are trained. The offsets are sampled uniformly at random in the range [0, N − |x|] (see Section 2). Like Zhou et al. (2024a), we sample independent training batches on the fly instead of using a finite-size training set. In contrast, each test set contains 2000 samples that are drawn at the beginning of each experiment.
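A minimal sketch (our own illustration, not the training code; names are assumptions) of the offset sampling described above:

import random

def offset_position_ids(seq_len, max_positions):
    # shift position ids by a uniform offset in [0, N - |x|] so that all
    # absolute position embeddings receive gradient during training
    offset = random.randint(0, max_positions - seq_len)
    return list(range(offset, offset + seq_len))

print(offset_position_ids(seq_len=10, max_positions=150))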
For the problems where we train models with the language modeling loss, the length of inputs is sampled uniformly between the minimum and maximum length of the specified range. This holds for the training data and all test sets. As mentioned before, we also use predictive modeling: at each step, the model outputs a label indicating the set of possible next characters, including EOS. The models are trained on whole sequences of tokens. In decoder-only models, standard predictive modeling approaches are less straightforward; we therefore assess predictions by combining the input and output spaces. For every position in the sequence, we evaluate the predicted character by comparing the output (where each embedding represents a subset of possible next tokens) against the expected value.
We train decoder-only transformers from scratch, using the implementation from Hugging Face Transformers.9 We train models for a maximum of 30k steps with a batch size of 64. We stop training early once the model's accuracy reaches 100% on the in-distribution test set (the one in range [lmin, 50]). The model is trained with a dropout rate of 0.0. We use AdamW with a weight decay rate of 0.01.
In preliminary experiments, we found that different model architectures, while all achieving 100% accuracy on in-distribution data, may perform differently on out-of-distribution data. To draw a conclusion about how the model performs on a problem in general, we determine the hyperparameters as follows: we consider configurations with {1, 2, 4} layers, {1, 2, 4} heads, model dimension in {16, 64, 256}, and learning rate in {0.001, 0.0001}. We sweep all configurations by iterating over every combination and choose the one that achieves the highest accuracy on [51, 100] among those configurations whose accuracy on [lmin, 50] is 100%. When there are multiple such options, e.g., because their accuracy on [51, 100] is 100%, the one with the simplest architecture is selected (when estimating complexity, we assume the priority: number of layers > number of heads > model dimension); a sketch of this selection rule is given below. The final hyperparameters used for each task are shown in Tables 6 and 7.
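A minimal sketch (our own illustration, not the actual experiment code; the dictionary fields are assumptions) of this selection rule:

def select_config(results):
    # keep only configurations that fit the training length range perfectly
    ok = [r for r in results if r["acc_in_dist"] == 1.0]
    # rank by OOD accuracy on [51, 100], then prefer simpler architectures
    # (fewer layers, then fewer heads, then smaller model dimension)
    return max(ok, key=lambda r: (r["acc_51_100"],
                                  -r["layers"], -r["heads"], -r["dim"]))

configs = [
    {"layers": 1, "heads": 1, "dim": 16,  "acc_in_dist": 1.0, "acc_51_100": 0.98},
    {"layers": 2, "heads": 2, "dim": 64,  "acc_in_dist": 1.0, "acc_51_100": 0.98},
    {"layers": 4, "heads": 4, "dim": 256, "acc_in_dist": 0.9, "acc_51_100": 0.99},
]
print(select_config(configs))   # picks the 1-layer model (simplest among the tied best)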
When no configuration from the search space defined above achieves 100% accuracy on [lmin, 50], e.g., in the case of Addition, we use an extra configuration with 12 layers, 12 heads, model dimension 768, and learning rate 1e-4 or 3e-5 (if 1e-4 does not work), together with a larger maximum number of iterations (60k); we also use the first 3k steps as warm-up steps.
9 https://huggingface.co/docs/transformers/en/model_doc/gpt2#transformers.GPT2LMHeadModel
Problem | Model Size | LR | Max Steps
Tomita-1, 2 | 1 layer; 1 head; 16 dim | 1e-3 | 30k
D2, D3, D4, D12 | 1 layer; 4 head; 128 dim | 1e-4 | 30k
Tomita-4, 7 | 4 layer; 2 head; 64 dim | 1e-3 | 30k
{a,b}*d{b,c}*, aa*bb*cc*dd*ee* | 6 layer; 4 head; 64 dim | 1e-4 | 30k
(aa)*, (aaaa)*, (abab)* | 6 layer; 4 head; 256 dim | 1e-4 | 60k
Tomita-3, 5, 6 | 6 layer; 4 head; 256 dim | 1e-4 | 60k
{0,1,2}*02*, Parity | 6 layer; 4 head; 256 dim | 1e-4 | 60k
Table 4: Experimental hyperparameters for testing NoPE on the regular languages.
Problem | Model Size | LR | Max Steps
Tomita-1, 2 | 1 layer; 1 head; 16 dim | 1e-3 | 30k
D2, D3, D4, D12 | 1 layer; 4 head; 128 dim | 1e-4 | 30k
Tomita-4, 7 | 4 layer; 2 head; 128 dim | 1e-3 | 30k
{a,b}*d{b,c}*, aa*bb*cc*dd*ee* | 4 layer; 4 head; 64 dim | 1e-4 | 30k
(aa)*, (aaaa)*, (abab)*, Parity | 4 layer; 4 head; 128 dim | 1e-4 | 40k
{0,1,2}*02*, Tomita-3, 5, 6 | 6 layer; 4 head; 128 dim | 1e-3 | 30k
Table 5: Experimental hyperparameters for testing APE on the regular languages.
After we determine the hyperparameter configuration, we run the experiments with multiple random seeds and report the average accuracy of 5 successful runs (those runs where the model achieves 100% accuracy on in-distribution data). We do not select successful runs in cases where we use the biggest architecture (the 12-layer configuration), because in many cases the accuracy on in-distribution data plateaus at around 99%.
The random baseline plotted for the algorithmic tasks (Figure 1, left) is computed using a 2-layer MLP with a token embedding layer; hence, the model predicts the next token solely based on the current token. It is trained with the same hyperparameters as the transformers; the learning rate is 1e-3.
F TRANSLATING BETWEEN TRANSFORMERS AND LIMIT TRANSFORMERS
Here, we formally introduce the product parameterization, formally state the hypothesis class, and
state and prove the technical lemmas establishing the correspondence between ordinary transformers
and limit transformers.
F.1 PROD UC T PARAMETERIZATION
This parameterization is defined as follows:
Problem Model Size LR Max Steps
Binary Majority 1 layer; 1 head; 16 dim 1e-3 30k
Binary Majority Interleave 2 layer; 4 head; 256 dim 1e-4 30k
Majority 1 layer; 2 head; 256 dim 1e-3 30k
Sort 1 layer; 2 head; 256 dim 1e-4 30k
Copy (unique) 2 layer; 1 head; 64 dim 1e-3 30k
Copy (repeat) 4 layer; 4 head; 256 dim 1e-3 30k
Parity 4 layer; 2 head; 256 dim 1e-4 30k
Addition 12 layer; 12 head; 768 dim 1e-4 60k (3k)
Table 6: Experimental hyperparameters for testing APE on each problem. In the last column, the number in parentheses indicates the warm-up steps; no warm-up is used when no number is given.
Problem Model Size LR Max Steps
Binary Majority 1 layer; 1 head; 16 dim 1e-3 30k
Binary Majority Interleave 12 layer; 12 head; 768 dim 1e-4 60k (3k)
Majority 1 layer; 1 head; 64 dim 1e-3 30k
Sort 1 layer; 1 head; 256 dim 1e-3 30k
Copy (unique) 4 layer; 4 head; 256 dim 1e-3 30k
Copy (repeat) 4 layer; 4 head; 256 dim 1e-3 30k
Parity 12 layer; 12 head; 768 dim 3e-5 60k (3k)
Addition 12 layer; 12 head; 768 dim 1e-4 60k (3k)
Table 7: Experimental hyperparameters for testing NoPE on each problem. In the last column, the number in parentheses indicates the warm-up steps; no warm-up is used when no number is given.
[Figure 2: per-language bar plots of accuracy (%) across three test-length bins (Bin 1 to Bin 3) for the regular languages (aa)*, (aaaa)*, (abab)*, Tomita-3, Tomita-5, Tomita-6, D-2, D-3, D-4, D-12, Tomita-1, Tomita-2, Tomita-4, Tomita-7, aa*bb*cc*dd*ee*, {a,b}*d{b,c}*, {0,1,2}*02*; the legend distinguishes languages for which a C-RASP[periodic, local] and/or C-RASP[] program was found from those for which none exists.]
Figure 2: Detailed results for regular languages with language names, corresponding to the right part of Figure 1 but with individual languages labeled.
Definition 43 (Product Parameterization). For l = 1, ..., L, set
VO_0 = {p_i : i} ∪ {E_σ : σ}
VO_l = {(B_l)_{·,s} : s = 1, ..., d}
VI_0 = ∅
VI_l = {(A_l)_{s,·} : s = 1, ..., d} ∪ {U_σ : σ ∈ Σ}
VO = ∪_{l=0,1,...,L} VO_l
VI = ∪_{l=0,1,...,L} VI_l
P = { {V_{l_1,h_1}, ..., V_{l_k,h_k}} : 0 ≤ k ≤ L; l_1 < ··· < l_k; 1 ≤ h_i ≤ H }
Given a transformer T with N(T) < ∞, define:
α_{l,h,S_1,S_2,v,w} := v^T (∏_{S∈S_1} S)^T K_{l,h}^T Q_{l,h} (∏_{S∈S_2} S) w ∈ R
for 1 ≤ l ≤ L; 1 ≤ h ≤ H; v ∈ VO; w ∈ VO; S_1, S_2 ∈ P, and
β_{S,v,w} := v^T (∏_{S∈S} S) w ∈ R
for v ∈ VI_{l_1}; w ∈ VO_{l_2}; S ∈ P, where the matrix product over a set S ∈ P,
∏_{S∈S} S,    (8)
is computed in descending order of layers, with the S associated with the lowest layer at the right.
[Figure 3: per-task bar plots of accuracy (%) across three test-length bins (Bin 1 to Bin 3), for the algorithmic problems (top: Binary Majority, Binary Majority Interleave, Majority, Sort, Copy Unique, Copy Repeat, Parity, Addition) and the regular languages (bottom); colors distinguish tasks in AC0 from tasks not in AC0.]
Figure 3: Membership in the circuit complexity class AC0 does not predict transformers' length generalization on algorithmic problems (top) or regular languages (bottom). Prior work has often linked the expressiveness of transformers to circuit complexity (e.g. Hahn, 2020; Hao et al., 2022; Merrill and Sabharwal, 2023c; Strobl, 2023; Barcelo et al., 2024). All tasks included in our experiments are in the class TC0, the tightest known upper bound on transformers' expressiveness. A well-known circuit complexity class within TC0 is AC0, known to upper-bound the power of certain hard-attention models of transformers (Hao et al., 2022; Barcelo et al., 2024), which may raise hopes that it helps understand transformers' practical abilities. However, membership in this class does not predict transformers' length generalization behavior. On the algorithmic problems, there is no apparent correlation at all: majority-type problems, which the attention mechanism can easily implement, are not in AC0, but problems with super-logarithmic communication complexity such as copying and addition (Corollary 13) are contained in it. On the regular languages, AC0 exactly covers the class FO[reg]. This class can be proven to include all regular languages in C-RASP, but it also includes various languages that transformers length-generalize poorly on, such as Tomita-3. A natural subclass, obtained by restricting the size of AC0 circuits to a linear number of wires, yields the class FO2[Reg] (Cadilhac and Paperman, 2022), which does not match transformers' behavior well either; e.g., it includes {0,1,2}*02* (bottom right, equal to Σ*be* from Lemma 11) but does not include D-12. Taken together, established circuit complexity classes do not account for transformers' length generalization behavior. Compare to the C-RASP results in Figures 1 and 2.
[Figure 4: per-language bar plots of accuracy (%) across three test-length bins (Bin 1 to Bin 3) for the regular languages; panel (A) "Star-Free vs Non-Star-Free Languages" colors languages by star-freeness and positional-encoding type (APE vs. NoPE), panel (B) "Dot-Depth" colors star-free languages by dot-depth (1, 2, 3, 4, 12).]
Figure 4: (A) Comparing length generalization with a standard notion of the complexity of finite-state languages: star-free languages (green) do not require modular counting (McNaughton and Papert, 1971), have simpler algebraic representations in terms of group-free monoids (Schützenberger, 1965), are easily represented by modern state-space models (Sarrof et al., 2024), and match the expressiveness of a formal model of hard-attention transformers (Yang et al., 2023). However, they do not consistently lead to length generalization in transformers, which on the other hand length-generalize on some non-star-free languages such as (aa)*. The expressiveness of C-RASP correctly accounts for the observed behavior. (B) Within the star-free languages, a standard complexity metric is dot-depth, with increased dot-depth indicating increased complexity (non-star-free languages are plotted in gray). Dot-depth does not predict length generalization, which succeeds on some languages at dot-depths 1 and 12 and fails on some languages at intermediate depths. See Figure 3 for further discussion of another existing notion of complexity, circuit complexity, which is also much less successful than C-RASP expressiveness at predicting length generalization. Compare to the C-RASP results in Figures 1 and 2.
For instance,
∏_{S ∈ {V_{1,h}, V_{3,h'}, V_{4,h''}}} S = V_{4,h''} V_{3,h'} V_{1,h}    (9)
Remark 44. Here, we exemplify the Product Parameterization (Definition 43):
α_{1,h,∅,∅,p_i,E_σ} = p_i^T K_{1,h}^T Q_{1,h} E_σ
α_{2,h,{V_{1,h'}},∅,p_i,p_j} = p_i^T V_{1,h'}^T K_{2,h}^T Q_{2,h} p_j
α_{3,h,{V_{2,h'},V_{1,h''}},{V_{1,h'''}},E_σ,E_τ} = E_σ^T V_{1,h''}^T V_{2,h'}^T K_{3,h}^T Q_{3,h} V_{1,h'''} E_τ
β_{∅,(A_1)_{s,·},p_i} = (A_1)_{s,·}^T p_i
β_{{V_{1,h}},(A_3)_{s,·},E_σ} = (A_3)_{s,·}^T V_{1,h} E_σ
β_{{V_{3,h},V_{1,h'}},U_τ,E_σ} = U_τ^T V_{3,h} V_{1,h'} E_σ
β_{{V_{3,h},V_{2,h'}},U_τ,(B_1)_{·,s}} = U_τ^T V_{3,h} V_{2,h'} (B_1)_{·,s}
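As a sanity check of the notation, the following minimal sketch (our own illustration with random toy parameters; all names and shapes are assumptions) computes the first two α products listed above with NumPy:

import numpy as np

rng = np.random.default_rng(0)
d = 4
K1, Q1, V1 = (rng.standard_normal((d, d)) for _ in range(3))
K2, Q2 = (rng.standard_normal((d, d)) for _ in range(2))
p = rng.standard_normal((8, d))        # positional encodings p_1, ..., p_8
E = {"a": rng.standard_normal(d)}      # token embedding E_a

# alpha_{1,h,{},{},p_i,E_a} = p_i^T K_1^T Q_1 E_a
alpha_1 = p[0] @ K1.T @ Q1 @ E["a"]

# alpha_{2,h,{V_1},{},p_i,p_j} = p_i^T V_1^T K_2^T Q_2 p_j
alpha_2 = p[0] @ V1.T @ K2.T @ Q2 @ p[3]

print(alpha_1, alpha_2)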
Remark 45. For ease of notation, we have not restricted the layers from which the different vector parameters are taken in the definition of α and β; hence, they also include products that are not relevant to actual computations, such as
p_i^T V_{2,h}^T K_{1,h'}^T Q_{1,h'} V_{3,h''} p_j    (10)
where a vector of the form V_{3,h''} p_j cannot actually feed into the computation of queries in the first layer. This is simply for simplicity of notation; such products will not impact results, though one could explicitly exclude them if one wants to obtain tighter quantitative bounds on the parameter count of Limit Transformers in Lemma 52.
F.2 FORMAL DEFINITION OF HYPOTHESIS CLASS
Definition 46 (Hypothesis Class, corresponds to Definition 4). Let p ∈ ℕ be fixed. For each n = 1, 2, 3, ..., define the hypothesis class Θ_n as the set of transformers T (as defined in Section 2) where
1. N(T) = n
2. each parameter vector and matrix of T is represented at p bits of precision
3. each product function (Definition 43) involving positional encodings is translation-invariant. That is, every product function involving exactly one positional encoding is constant across positions, and for every 1 ≤ i, j, i+∆, j+∆ ≤ n,
α_{l,h,S_1,S_2,p_i,p_j} = α_{l,h,S_1,S_2,p_{i+∆},p_{j+∆}}
for all l, h, S_1, S_2 making these objects well-defined.
F.3 FROM LIMIT TRANSFORMERS TO TRANSFORMERS
Lemma 47. Let T be a Limit Transformer satisfying PERIODIC and LOCAL. Then there are transformers T_1, T_2, ... (T_i ∈ Θ_i) such that, for all i ∈ ℕ, x ∈ S, o ∈ {0, ..., i − |x|}:
T_i(x, o) ≡ T(x, 0) when o + |x| ≤ i    (11)
and
sup_i R(T_i) < ∞    (12)
Proof. We use hats to indicate the parameters of the constructed transformers. Fix N ∈ ℕ; we construct T_N.
Let ∆ be the periodicity of the p_i. The construction sets L̂ = L + 2 and Ĥ = max{1, H, ∆}. Each activation has d̂ := d + N + 3∆ + 2 dimensions, which can be partitioned into six regions:
d dimensions (Region I: main region)
N dimensions (Region II: position region)
∆ dimensions (Region III: periodic region I)
∆ dimensions (Region IV: periodic region II)
1 dimension (Region V: SOS region)
∆+1 dimensions (Region VI: copied SOS region)    (13)
Region I directly emulates the computations of the Limit Transformer. Region II carries absolute positional information, used for simulating the positional functions ϕ_{l,h}.
We define the token and positional encodings as
p̂_i = ( 0 ∈ R^d ; e_i ∈ R^N ; e_{(i%∆)+1} ∈ R^∆ ; 0 ∈ R^∆ ; 0 ∈ R ; 0 ∈ R^{∆+1} )
Ê_σ = ( E_σ ∈ R^d ; 0 ∈ R^N ; 0 ∈ R^∆ ; 0 ∈ R^∆ ; 1_{σ=$} ∈ R ; 0 ∈ R^{∆+1} )
where e_i is the i-th unit vector. Region I holds token information. Region II holds exact position information. Region III holds modular position information. Region V indicates whether the token is start-of-sequence (SOS) or not. Content will be written to Regions IV and VI by attention components.
Intuition At first sight, a simple intuitive translation just uses Regions I and II, placing all parameters of T into Region I, and taking one-hot vectors e_i for Region II to encode exact positional information, so that the functions ϕ_{l,h} can be implemented by products p̂_i^T K̂^T Q̂ p̂_j. One would thus use the simpler encodings Ẽ_σ := [E_σ, 0] and p̃_i := [p_i, e_i]. Such a translation would be able to reproduce the input-output behavior of T. However, it would fall short in two ways. First, the positional encodings p_i can give rise to patterns whereby p_i^T K^T Q p_j is periodic in j − i when j − i is large, and thus bounded away from zero at unboundedly many distances j − i, making (8) unbounded. We avoid this by routing modular positional information through a value matrix V_{1,1} before making it available to attention computations; intuitively, this is possible because the vectors p_i have a bounded-dimensional span. Modular positional information starts out in Region III of p̂_i and is copied by V_{1,1} into Region IV; no K^T Q matrix in the construction directly addresses Region III. Second, transformers in Θ_N must satisfy the requirement that all product functions are translation-invariant; this requirement need not be satisfied by T (e.g., MLPs could respond differently to different p_i), and thus also not by the simple translation sketched above. We overcome this by adding ∆ different attention heads, each assigned to some k ∈ {0, ..., ∆−1}, each of which primarily attends to the SOS symbol (based on Region V), with a stronger weight in the k-th head falling on SOS if the distance between the query position and the SOS position is congruent to k modulo ∆ (based on Region IV). These attention weights are written, via the value matrix, to Region VI. An MLP then compares each of these attention weights to the weight resulting from uniform attention, thereby determining the head with the highest attention weight, and places the matching encoding p_i into Region I. This construction maintains translation-invariance of all product functions; most importantly, it makes all β functions involving positional encodings equal to zero: the MLP reads from Region VI, whose entries are linear combinations of entries from Region V, which is zero in all p̂_i. Crucially, the dependence of the positional encoding p_i written to Region I on the original positional information in p̂_i is mediated entirely through attention weights, which do not enter the definition of the product functions. Once these computations are completed, Region I matches the activations in T at offset 0, and a direct simulation of T can proceed based on Regions I and II. Taken together, we expand the intuitive simple construction to make rigorous the intuition that bounded-rank positional information can be utilized by transformers even under the constraints that (8) be bounded and that product functions be translation-invariant.
Layer 1: Copying Periodic Positional Information to Region IV In the lowest layer, each position attends to itself and moves the periodic positional information from Region III to Region IV. Formally, K_{1,1} = Q_{1,1} is the block matrix that equals C · I_{N×N} on Region II and is zero everywhere else, for some scalar C > 0 to be chosen later, and V_{1,1} is the block matrix whose only nonzero block is an identity I_{∆×∆} copying Region III into Region IV. After attention (the MLP does nothing, A_1, B_1, b_1 ≡ 0), the output is
ŷ_i^(1) = ( E_{x_i} ; e_i ; e_{(i%∆)+1} ; (1 − ε) e_{(i%∆)+1} ; 1_{x_i=$} ; 0 )    (14)
where ε < 0.1 when C is sufficiently large. Region III will not be addressed by any further downstream matrices or vectors. The idea behind this operation is to ensure that no K̂^T Q̂ matrix has to read directly from Region III; rather, any dependence of attention logits on modular positional information is mediated by the V_{1,1} matrix. This allows us to keep (8) bounded even in the presence of such dependence, as we detail below. Note that this strategy importantly relies on R(T) < ∞, so that rank(V_{1,1}) is bounded independently of N. Intuitively, modular positional information (unlike the full positional information encoded in Region II) can be routed through bounded-rank components, which enables keeping (8) bounded.
Layer 2: Determining Position Relative to SOS We now add a second layer, in which we attend with ∆ + 1 different heads: the s-th head tests whether the distance to SOS is congruent to s modulo ∆, and the (∆+1)-st head attends uniformly. By determining which head attends most strongly to SOS, we can read out the relative position with an MLP without breaking the translation invariance of product functions.
For h = 1, ..., ∆, let Q_{2,h} be such that it acts on Region IV by
Q_{2,h} ( ... ; ... ; ... ; e_i ; ... ; ... ) = ( 0 ; 0 ; 0 ; e_{(i+h)%∆} ; 0 ; 0 )    (15)
Let K_{2,h} (h = 1, ..., ∆) be the identity matrix restricted to Regions IV and V. Further, let
Q_{2,∆+1} = K_{2,∆+1} ≡ 0    (16)
Then, for h = 1, ..., ∆,
(ŷ_i^(1))^T K̂_{2,h}^T Q̂_{2,h} ŷ_j^(1) = 1_{i−j ≡ h (mod ∆)}    (17)
and
(ŷ_i^(1))^T K_{2,∆+1}^T Q_{2,∆+1} ŷ_j^(1) = 0    (18)
Intuitively, heads 1, ..., ∆ attend preferentially to positions at a given distance modulo ∆; head ∆+1 attends everywhere. Define V_{2,h} for h = 1, ..., ∆+1 by
V_{2,h} ( ... ; ... ; ... ; ... ; z ; ... ) = ( 0 ; 0 ; 0 ; 0 ; 0 ; z · e_h )    (19)
As the only vector parameter with a nonzero entry in Region V is the token embedding of the SOS token, the outcome of this attention block effectively writes into Region VI the attention weight falling on the SOS token for each of the ∆+1 attention heads. We then use a Heaviside MLP, with a number of hidden units bounded in terms of ∆, to determine for which s = 1, ..., ∆ entry s of Region VI is greater (as opposed to smaller) than entry ∆+1. The MLP, via the B matrix, then writes p_s to Region I. A special case occurs at SOS, where Region V is 1 (it is 0 everywhere else); here, the MLP units described above are disabled (a 1 in Region V causes a large negative number to be added to their inputs) and the MLP instead writes p_1 to Region I. Overall, ∆+1 MLP units with Heaviside activation are sufficient for this construction, and after the MLP, the i-th position in the string has p_{((i−1)%∆)+1} added to Region I. Overall,
ŷ_i^(2) = ( E_{x_i} + p_{((i−1)%∆)+1} ; e_{i+o} ; · ; · ; · ; · ) = ( y_i^(0) ; e_{i+o} ; · ; · ; · ; · )    (20)
where o is the offset, and the second equality holds if the Limit Transformer is run at offset 0. In layers 2, ..., L̂, only Regions I and II will receive any further consideration.
Higher Layers: Emulating T Next, for l ≥ 1, define K̂_{l+2,h}^T Q̂_{l+2,h} ∈ R^{d̂×d̂} as the block-diagonal matrix
K̂_{l+2,h}^T Q̂_{l+2,h} = blockdiag( K_{l,h}^T Q_{l,h} , W_{l,h} , 0 , 0 , 0 , 0 )    (21)
where, for i ≥ j,
(W_{l,h})_{i,j} = ϕ_{l,h}(i, j)    (22)
so as to satisfy
e_i^T W_{l,h} e_j = ϕ_{l,h}(i, j)    (23)
Then W_{l,h} is a sum of matrices, each of which has the value ϕ_{l,h}(1, i) for some i on an (off-)diagonal and zeros elsewhere; hence
∥W_{l,h}∥_2 ≤ ∑_i |ϕ_{l,h}(1, i)|    (24)
Overall, ∥K̂_{l+2,h}^T Q̂_{l+2,h}∥ can be bounded, independently of N, in terms of ∥K_{l,h}^T Q_{l,h}∥ and ∥ϕ_{l,h}∥_1. As ϕ_{l,h} is local, this 1-norm is finite.
We construct all other parameter matrices and vectors by placing the corresponding parameter of the Limit Transformer into Region I and leaving all other regions zero. Now, by induction, the first d dimensions of any activation match those of the Limit Transformer, but shifted by two layers:
ŷ_i^(l+2) = ( y_i^(l) ; e_{i+o} ; · ; · ; · ; · )    (25)
As the U matrix only reads from Region I, the output will also be the same as for the Limit Transformer.
For the excess heads, the matrices are simply set to 0, in the first or the higher layers depending on whether H or ∆ is larger.
Bounding Norms and Ranks For l ≥ 2, the ranks of the V_{l,h} and the norms of A_l, B_l, U, as well as the ℓ2 norms of e, b, c, are the same as they were in the Limit Transformer. The increases from the first and second layers are bounded in terms of ∆, and hence in terms of R(T).
Verifying Boundedness of (8) By construction, p̂_i^T K̂_{1,h}^T Q̂_{1,h} p̂_j = C² δ_{ij} δ_{h,1} and p̂_i^T K̂_{2,h}^T Q̂_{2,h} p̂_j ≡ 0. For the higher layers, boundedness follows because T satisfies LOCAL.
Verifying Translation Invariance We need to verify that all product functions are translation-invariant. Each p̂_i has nonzero entries only in Regions II and III. In the first layer, we have α_{1,1,∅,∅,p_i,p_j} = C² δ_{ij}; hence, these are translation-invariant. In the second layer, we have
α_{2,s,∅,∅,p_i,p_j} = ( 0 ; e_i ; e_{(i%∆)+1} ; 0 ; 0 ; 0 )^T K_{2,s}^T Q_{2,s} ( 0 ; e_j ; e_{(j%∆)+1} ; 0 ; 0 ; 0 ) = 0    (26)
α_{2,s,{V_1},∅,p_i,p_j} = 0    (27)
α_{2,s,{V_1},{V_1},p_i,p_j} = ( 0 ; e_i ; e_{(i%∆)+1} ; 0 ; 0 ; 0 )^T V_{1,1}^T K_{2,s}^T Q_{2,s} V_{1,1} ( 0 ; e_j ; e_{(j%∆)+1} ; 0 ; 0 ; 0 )    (28)
= ( 0 ; 0 ; 0 ; e_{(i%∆)+1} ; 0 ; 0 )^T K_{2,s}^T Q_{2,s} ( 0 ; 0 ; 0 ; e_{(j%∆)+1} ; 0 ; 0 )    (29)
= 1_{j−i ≡ s (mod ∆)}    (30)
These are all translation-invariant. All α_{2+l,h,∅,∅,p_i,p_j} equal a function ϕ_{l,h}(i, j) and hence are translation-invariant. No V matrix ever reads from Region II. Overall, all α products are translation-invariant. In higher layers, K^T Q matrices read from Regions I and II. The positional encodings p̂_i write to Region II but never, not even when mediated through value matrices, to Region I, and the products are translation-invariant because the functions ϕ_{l,h} are.
Consider
β_{∅,(A_1)_{s,·},p_i} = 0
because the first-layer MLP does nothing. Second,
β_{∅,(A_2)_{s,·},p_i} = 0
because the second-layer MLP only reads from Region VI, and neither p_i nor V_1 p_i has any entries in Region VI. Also,
β_{∅,U_σ,p_i} = 0
β_{{V_1},U_σ,p_i} = 0
β_{{V_2,V_1},U_σ,p_i} = 0
β_{{V_2},U_σ,p_i} = 0
since none of p_i, V_1 p_i, V_2 V_1 p_i, V_2 p_i has any entries in Region I, where U_σ has its entries.
Since the V_{2,s} all read only from Region V, whose entries have no connection to p̂_i, we also have:
β_{{V_2,...},(A_l)_{s,·},p_i} = 0
β_{{V_2,...},U_σ,p_i} = 0
β_{{V_2,V_1,...},(A_l)_{s,·},p_i} = 0
β_{{V_2,V_1,...},U_σ,p_i} = 0
Overall, all β products involving p_i are translation-invariant.
F.4 FROM TRANSFORMERS TO LIMIT TRANSFORMERS
We first establish several smaller lemmas. The first lemma informally says that, when f(i, j) := p_i^T A p_j is translation-invariant in (i, j) and A has bounded rank, then f(i, j) is periodic. This lemma is key to the prominent role of the PERIODIC property in Theorem 7.
Lemma 48. Let p ∈ ℕ and N ∈ ℕ. Let p_1, ..., p_N ∈ R^k with ∥p_i∥_2 < C, and let A ∈ R^{k×k}. Let f(i, j) := p_i^T A p_j be translation-invariant, in the sense that
∀ 0 ≤ i ≤ j: ∀ M ≥ 0: j + M ≤ N ⟹ f(i, j) = f(i + M, j + M)    (32)
Further assume that, for i ≤ j, f(i, j) can be expressed with p fractional bits.10 Define, for n ≥ 0,
G(n) := f(1, 1 + n)    (33)
Then there is ∆ ∈ ℕ, upper-bounded in terms of rank(A), p, C, and ∥A∥ (but not N), such that
∀ n: n + ∆ < N ⟹ G(n) = G(n + ∆)    (34)
Proof. Let ρ := rank(A); it is > 0 without loss of generality. We write the singular value decomposition of A as
A = U^T Σ V    (35)
where Σ ∈ R^{ρ×ρ}, U, V ∈ R^{ρ×d}, and ∥U∥, ∥V∥ ≤ 1. Then we can write
p_i^T A p_j = p_i^T U^T Σ V p_j = ( U p_i ; V p_i )^T ( I_{ρ×ρ} 0_{ρ×ρ} )^T Σ ( 0_{ρ×ρ} I_{ρ×ρ} ) ( U p_j ; V p_j )    (36)
We will henceforth replace A by ( I_{ρ×ρ} 0_{ρ×ρ} )^T Σ ( 0_{ρ×ρ} I_{ρ×ρ} ) ∈ R^{2ρ×2ρ} and p_i by ( U p_i ; V p_i ) ∈ R^{2 rank(A)}; note that this increases the norms of these objects at most by a multiplicative constant while preserving the products p_i^T A p_j.
As ∥p_i∥_2 < C for all i, we find that, when N is sufficiently large, for each ε > 0 there are p_i, p_j (i < j) such that
∥p_i − p_j∥_2 ≤ ε    (37)
Take ∆ := j − i. The minimum required distance can be upper-bounded by considering the ε-packing number of {v : ∥v∥ ≤ C} and applying the pigeonhole principle. Hence, overall, ∆ can be upper-bounded in terms of ε, rank(A), and C. Importantly, this bound is independent of N. Hence, for k ∈ {i + ∆, ..., N}:
|G(k − i) − G(k − i − ∆)|
= |f(i, k) − f(i + ∆, k)|
= |f(i, k) − f(i + (j − i), k)|
= |p_i^T A p_k − p_j^T A p_k|
≤ ∥p_i − p_j∥_2 ∥A p_k∥_2
≤ ε · ∥A p_k∥_2
Equivalently, using the substitution l := k − i − ∆, we have, for any l ∈ {0, ..., N − i − ∆}:
|G(l + ∆) − G(l)| ≤ ε · ∥A p_k∥_2    (38)
Take ε = 2^{−p} / (4 C ∥A∥); then
G(l) = G(l + ∆)    (39)
due to the assumption about fixed-precision outputs.
10 In particular, this is satisfied if p_i and A are each expressed at some fixed precision q, where p ≥ 3q.
Lemma 49. Assume each parameter in a transformer is represented at p bits of precision. Then each product function is exactly represented at (4 + 2L)p bits of precision.
Proof. Each product function consists of at most two vectors, a key and a query matrix, and up to 2L value matrices. This results in a sum of numbers, each of which is a product of up to 4 + 2L numbers that are individual parameters. As each number is represented at p bits of precision, each product is represented at (4 + 2L)p bits of precision.
It will be useful to define a complexity metric applicable to Limit Transformers:
Definition 50. For a Limit Transformer T, define R(T) as the sum of
1. L + H + d
2. the precision p used for expressing the parameters (Definition 2), and the precision p used for rounding attention logits and the output of exp(·) (Section 2)
3. the maximum norm of all parameter vectors and matrices (including positional encodings)
4. the minimal periodicity ∆ of the positional encodings11
5. max_{l,h} ∑_{i=1}^∞ |ϕ_{l,h}(1, 1 + i)|^2 (short: max_{l,h} ∥ϕ_{l,h}∥_2^2).
Proposition 51. Let A ∈ R_+. Let U be the set of Limit Transformers T such that R(T) ≤ A. Then the set of parameter settings in U, other than the ϕ_{l,h} functions, is finite.
Proof. Immediate.
We now state the key lemma translating ordinary transformers to Limit Transformers using the product parameterization:
Lemma 52. Let T ∈ Θ_n, at p bits of precision, and let the alphabet Σ be fixed.12 Then there is a Limit Transformer T̂ such that T ≡ T̂ at length n,
R(T̂) ≤ F(R(T))    (40)
for some universal function F: R_+ → R_+, and
p_i^T K_{l,h}^T Q_{l,h} p_j = ϕ_{l,h}(i, j)    (41)
for the p_i, K_{l,h}, Q_{l,h} parameters of T, for each l, h. In particular, T̂ satisfies PERIODIC and each ϕ_{l,h} is translation-invariant.
We prove the lemma in the remainder of this section.
F.4.1 PROVING LEMMA 52 (I): PRELIMINARIES
We discuss various preliminaries before presenting the construction, explaining its intuition, and explaining how its soundness is formally proven.
11 Formally, 1 plus the supremum of the set of ∆'s for which p_i ≢ p_{i+∆}.
12 The bound on R(T̂) depends on it, but during the inference procedure, the alphabet is assumed fixed.
Basic Idea We will construct T̂ so that the entries of every parameter are 0, 1, or one of the product functions from Definition 43. This automatically ensures that its parameters are represented at a fixed precision p bounded in terms of R(T), and with each entry bounded in terms of the spectral norms of the parameter matrices and the ℓ2 norms of the parameter vectors, hence also bounded in terms of R(T). Importantly, we will use the definition of R(T) and the definition of Θ_n to restrict attention to a number of product functions that is bounded only in terms of R(T), independently of n.
Bounding Active MLP Units First, given ∥A_l∥_F < ∞, the number of nonzero entries is bounded by ∥A_l∥_F^2 · 2^{2p}, which is bounded in terms of R(T). Similarly, the number of nonzero entries in b_l is bounded by ∥b_l∥_2^2 · 2^{2p}, and similarly for B_l. Let d_MLP ≤ d be, across l = 1, ..., L, the maximum of the maximum number of nonzero entries in b_l and of the maximum number of nonzero rows in A_l. Without loss of generality, by reordering rows in A_l and columns in B_l, we may assume that, in each layer, these entries are in the first d_MLP dimensions. Then d_MLP is upper-bounded in terms of these quantities (or d, if the bounds exceed d); this is bounded in terms of R(T) and independent of n. In each layer, only d_MLP of the MLP units have nonzero input weights A_l, output weights B_l, or biases b_l. Removing product functions belonging to the inactive units, we set:
V̂I_l := VI_l \ {(A_l)_{s,·} : s = d_MLP + 1, ..., d}
V̂I := ∪_{l=1}^L V̂I_l
Then the size of V̂I is bounded in terms of R(T).
Periodicity of Bounded-Rank Positional Functions By Lemma 48, all products p_i^T ... p_j in which the intervening material has bounded rank are periodic in j − i, with period bounded in terms of the rank of the intervening material and hence in terms of R(T). Let ∆ be the least common multiple of the periods obtained from Lemma 48 across the (finitely many) different products of the form α_{l,h,S_1,S_2,p_i,p_j} where S_1 ∪ S_2 ≠ ∅. Define
V̂O_0 = {p_i : i = 1, ..., ∆} ∪ {E_σ : σ}
V̂O_l = {(B_l)_{·,s} : s = 1, ..., d_MLP}, l = 1, ..., L
V̂O = V̂O_0 ∪ ∪_{l=1}^L V̂O_l
Then the size of V̂O is bounded in terms of R(T).
F.4.2 PROVING LEMMA 52 (II): CONSTRUCTION OF T̂
Translation in Terms of Regions We use hats (i.e., ·̂) to mark the parameters and activations of the Limit Transformer, distinguishing them from the parameters and activations of the original transformer T.
Each d-dimensional parameter vector and activation (residual streams and attention outputs) is translated to a vector consisting of three regions, each having a fixed number of dimensions bounded in terms of R(T). That is, each vector parameter or activation v (e.g., p_i, E_σ, y_i^(l)) is translated to a parameter or activation v̂ (e.g., p̂_i, Ê_σ, ŷ_i^(l)) consisting of the following three regions:
v̂ = ( Γ_{S,w}(v̂) : S ∈ P, w ∈ ∪_l V̂I_l ;
      Λ_{S_1,T_1,T_2,l,h,w_1,w_2}(v) : 1 ≤ l ≤ L, 1 ≤ h ≤ H, S_1 ⊆ T_1, T_1, T_2 ∈ P, w_1, w_2 ∈ V̂O ;
      Ω_{S_2,T_1,T_2,l,h,w_1,w_2}(v) : 1 ≤ l ≤ L, 1 ≤ h ≤ H, S_2 ⊆ T_2, T_1, T_2 ∈ P, w_1, w_2 ∈ V̂O )    (42)
Intuition of the Construction The first region, denoted Γ_{S,w}(v̂), has one entry for every choice of S ∈ P and w ∈ ∪_l V̂I_l. Intuitively, the entry Γ_{S,w}(v̂) describes the outcome of applying all value matrices in S and then finally the vector w^T:
Γ_{S,w}(v̂) = w^T (∏_{S∈S} S) v ∈ R    (43)
(Recall Definition 43 for the notation ∏_{S∈S} S.) The second and third regions each have one entry for every choice of 1 ≤ l ≤ L, 1 ≤ h ≤ H, S_2 ⊆ T_2, T_1, T_2 ∈ P, w_1, w_2 ∈ V̂O. These regions contain the information necessary for computing attention logits. Intuitively, T_1, T_2 describe the value matrices through which w_1 and w_2, respectively, pass before the computation of attention logits in layer l. For parameter vectors w_1, w_2 (e.g., token embeddings or columns of a B_l matrix; positional encodings are somewhat special), we simply expect (note the duplicated arguments T_1, T_2; these will be explained in the next paragraph):
Λ_{T_1,T_1,T_2,l,h,w_1,w_2}(w_1) Ω_{T_2,T_1,T_2,l,h,w_1,w_2}(w_2) = w_1^T (∏_{S∈T_1} S)^T K_{l,h}^T Q_{l,h} (∏_{S∈T_2} S) w_2    (44)
Thus, Λ can be viewed as holding key parameters, whereas Ω can be viewed as holding query parameters, for the contribution that the pair (w_1, w_2) makes to attention logits in layer l after passing through the value matrices in T_1, T_2, respectively. As a convention, at the level of parameter vectors, the Λ component will hold the attention logit contribution (the RHS of this equation), whereas the Ω component will just hold zeros and ones. At the level of intermediate activations v (y_i^(l) or Y_i^(l)), the situation is slightly more complex: here,
Λ_{S_1,T_1,T_2,k,h,w_1,w_2}(ŷ_i^(l))    (45)
denotes the contribution to attention logits for head h at layer k arising from multiples of (∏_{S∈T_1} S) ŵ_1 in an activation ŷ_i^(k−1) interacting with multiples of (∏_{S∈T_2} S) ŵ_2 in an activation ŷ_j^(k−1); a similar idea applies to Ω. However, additional care is needed to ensure that only contributions from value matrices that were actually passed through are counted. The additional argument S_1 serves as a "to-do list": it records which of the value matrices in T_1 still have to be traversed; whenever an activation passes through a value matrix V_{l,h}, the value matrix V̂_{l,h} of the Limit Transformer moves entries from Λ_{S_1,T_1,T_2,k,h,w_1,w_2} to Λ_{S_1\{V_{l,h}},T_1,T_2,k,h,w_1,w_2}, effectively removing itself from the to-do list. The same principle applies to Ω, which maintains a to-do list S_2 for T_2. In the end, only those components whose to-do lists are empty (formally, S_1 = S_2 = ∅) enter the attention logit computations of the Limit Transformer:
ϕ_{k,h}(i, j) + ∑_{v,w,T_1,T_2} Λ_{∅,T_1,T_2,k,h,v,w}(ŷ_i^(l)) · Ω_{∅,T_1,T_2,k,h,v,w}(ŷ_j^(l)) = (y_i^(l))^T K_{k,h}^T Q_{k,h} y_j^(l)    (46)
where the sum runs over all v, w ∈ V̂O, and T_1, T_2 run over all sets of value matrices from layers ≤ l.
In the remainder of the proof, we present a detailed formal construction implementing this intuition. We first define, for each vector v ∈ {p_i, E_σ, b_l : i, l, σ}, its translation v̂; throughout, we define each of the three regions.
Vector Parameters b_l We take (b̂_l)_s := (b_l)_s for s = 1, ..., d_MLP.
Vector Parameters E_σ ∈ VO The first region provides products with other vectors appearing at higher layers (rows/columns of the A_l matrices and the unembedding matrix). The second and third regions provide products leading up to keys and values.
Γ_{S,w}(Ê_σ) := β_{S,w,E_σ} if S ∈ P, w ∈ V̂I; 0 else
Λ_{S_1,T_1,T_2,l,h,w_1,w_2}(Ê_σ) := α_{l,h,T_1,T_2,w_1,w_2} if E_σ = w_1 and S_1 = T_1; 0 else
Ω_{S_2,T_1,T_2,l,h,w_1,w_2}(Ê_σ) := 1 if E_σ = w_2 and S_2 = T_2; 0 else
Vector Parameters: $\hat{p}_i$. For $v=p_i$, the construction is analogous; however, we zero out the entries for $\Lambda_{\emptyset,\emptyset,\emptyset,l,h,p_i,p_j}$ and $\Omega_{\emptyset,\emptyset,\emptyset,l,h,p_i,p_j}$, as these will be taken care of by the $\phi_{l,h}$ functions. Formally:
$\Gamma_{\mathcal{S},w}(\hat{p}_i) := \begin{cases}\beta_{\mathcal{S},w,p_i} & \text{if } \mathcal{S}\subseteq\mathcal{P},\ w\in\widehat{V_I}\\ 0 & \text{else}\end{cases}$
$\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,l,h,w_1,w_2}(\hat{p}_i) := \begin{cases}\alpha_{l,h,\mathcal{T}_1,\mathcal{T}_2,w_1,w_2} & \text{if } p_{i\%\Delta}=w_1;\ (w_2\notin\{p_j : j\}\ \vee\ \mathcal{T}_1\cup\mathcal{T}_2\neq\emptyset);\ \mathcal{S}_1=\mathcal{T}_1\\ 0 & \text{else}\end{cases}$
$\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,l,h,w_1,w_2}(\hat{p}_i) := \begin{cases}1 & \text{if } p_{i\%\Delta}=w_2;\ (w_1\notin\{p_j : j\}\ \vee\ \mathcal{T}_1\cup\mathcal{T}_2\neq\emptyset);\ \mathcal{S}_2=\mathcal{T}_2\\ 0 & \text{else}\end{cases}$
We need to establish that $\hat{T}$ satisfies PERIODIC with the period $\Delta$ given above. First, $\beta_{\mathcal{S},w,p_i}$ is independent of $i$ by translation-invariance, and thus trivially periodic in $i$. Second, the $\Lambda$ and $\Omega$ entries are periodic in $i$ with period $\Delta$ by construction.
Matrix Parameters: $\widehat{V_{l,h}}$. Each entry of the $\widehat{V_{l,h}}$ matrix is zero or one. We define it implicitly, in terms of its action on the three different regions:
$\Gamma_{\mathcal{S},w}\big(\widehat{V_{l,h}}\,\hat{y}^{(l-1)}_j\big) = \begin{cases}\Gamma_{\mathcal{S}\cup\{V_{l,h}\},w}\big(\hat{y}^{(l-1)}_j\big) & V_{l,h}\notin\mathcal{S}\\ 0 & \text{else}\end{cases}$
$\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,l',h',w_1,w_2}\big(\widehat{V_{l,h}}\,\hat{y}^{(l-1)}_j\big) = \begin{cases}\Lambda_{\mathcal{S}_1\cup\{V_{l,h}\},\mathcal{T}_1,\mathcal{T}_2,l',h',w_1,w_2}\big(\hat{y}^{(l-1)}_j\big) & V_{l,h}\notin\mathcal{S}_1,\ V_{l,h}\in\mathcal{T}_1\\ 0 & \text{else}\end{cases}$
$\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,l',h',w_1,w_2}\big(\widehat{V_{l,h}}\,\hat{y}^{(l-1)}_j\big) = \begin{cases}\Omega_{\mathcal{S}_2\cup\{V_{l,h}\},\mathcal{T}_1,\mathcal{T}_2,l',h',w_1,w_2}\big(\hat{y}^{(l-1)}_j\big) & V_{l,h}\notin\mathcal{S}_2,\ V_{l,h}\in\mathcal{T}_2\\ 0 & \text{else}\end{cases}$
Matrix Parameters: $\widehat{A_l},\widehat{B_l}$. Let $s\in\{1,\dots,d_{MLP}\}$. For the $s$-th unit in the MLP at layer $l$, we first define the $s$-th row of $\widehat{A_l}$ by setting
$(\widehat{A_l})_{s,\cdot}\,\widehat{Y^{(l)}_i} = \Gamma_{\emptyset,(A_l)_{s,\cdot}}\big(\widehat{Y^{(l)}_i}\big)\ \in\mathbb{R}$
and define the $s$-th column of $\widehat{B_l}$ as follows, writing $\hat{X}\in\mathbb{R}$ for the $s$-th hidden unit activation:
$\Gamma_{\mathcal{S},w}\big((\widehat{B_l})_{\cdot,s}\,\hat{X}\big) := \hat{X}\cdot\beta_{\mathcal{S},w,(B_l)_{\cdot,s}}$
$\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,l',h,w_1,w_2}\big((\widehat{B_l})_{\cdot,s}\,\hat{X}\big) := \begin{cases}\hat{X}\cdot\alpha_{l',h,\mathcal{T}_1,\mathcal{T}_2,(B_l)_{\cdot,s},w_2} & \text{if } w_1=(B_l)_{\cdot,s},\ \mathcal{S}_1=\mathcal{T}_1\\ 0 & \text{else}\end{cases}$
$\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,l',h,w_1,w_2}\big((\widehat{B_l})_{\cdot,s}\,\hat{X}\big) := \begin{cases}\hat{X}\cdot 1 & \text{if } w_2=(B_l)_{\cdot,s},\ \mathcal{S}_2=\mathcal{T}_2\\ 0 & \text{else}\end{cases}$
This defines the $\widehat{A_l}$ and $\widehat{B_l}$ matrix parameters, and we can write, letting $\psi_{l,s}$ denote the activation function (ReLU or Heaviside) applied to the $s$-th hidden MLP unit:
$\hat{y}^{(l)}_i = \widehat{Y^{(l)}_i} + \sum_{s=1}^{d_{MLP}}(\widehat{B_l})_{\cdot,s}\cdot\psi_{l,s}\Big((\widehat{A_l})_{s,\cdot}\,\widehat{Y^{(l)}_i}+(\hat{b}_l)_s\Big)$    (47)
or equivalently
$\hat{y}^{(l)}_i = \widehat{Y^{(l)}_i} + \widehat{B_l}\cdot\phi_l\Big(\widehat{A_l}\cdot\widehat{Y^{(l)}_i}+\hat{b}_l\Big)$    (48)
matching the formulation of MLPs in our model of transformers (Equation 4).
Note that the hidden dimension of the MLP in the Limit Transformer is now $d_{MLP}$, which will be smaller than $\hat{d}$. We thus pad the remaining rows/columns of $\widehat{A_l},\widehat{B_l}$, and the remaining entries of $\hat{b}_l$, with zeros.
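As a sanity check on this padding step, the short numpy sketch below (our own illustration with generic dimensions, not part of the construction) confirms that zero-padding the hidden dimension of a residual MLP leaves its output unchanged.

import numpy as np

rng = np.random.default_rng(0)
d, d_mlp, d_hat = 8, 4, 12
A, B, b = rng.normal(size=(d_mlp, d)), rng.normal(size=(d, d_mlp)), rng.normal(size=d_mlp)

A_pad = np.zeros((d_hat, d)); A_pad[:d_mlp] = A        # pad remaining rows of A with zeros
B_pad = np.zeros((d, d_hat)); B_pad[:, :d_mlp] = B     # pad remaining columns of B with zeros
b_pad = np.zeros(d_hat); b_pad[:d_mlp] = b             # pad remaining entries of b with zeros

relu = lambda x: np.maximum(x, 0.0)
Y = rng.normal(size=d)                                  # a pre-MLP activation
y     = Y + B     @ relu(A     @ Y + b)                 # original MLP residual update
y_pad = Y + B_pad @ relu(A_pad @ Y + b_pad)             # padded MLP residual update
print(np.allclose(y, y_pad))                            # True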
A partial order on sets of value matrices. For $\mathcal{S},\mathcal{T}\subseteq\mathcal{P}$, we write $\mathcal{T}\ge_l\mathcal{S}$ to denote that
1. $\mathcal{T}\supseteq\mathcal{S}$
2. $\forall l'\in\{1,\dots,L\},\ \forall h:\ [(V_{l',h}\in\mathcal{S})\Rightarrow l'>l]$
3. $\forall l'\in\{1,\dots,L\},\ \forall h:\ [(V_{l',h}\in\mathcal{T}\setminus\mathcal{S})\Rightarrow l'\le l]$
Intuitively, $\mathcal{T}\ge_l\mathcal{S}$ says that "among the value matrices in $\mathcal{T}$, the activation has already passed through all value matrices at layer $l$ and below". For example:
$\{V_{1,h}, V_{2,h'}, V_{3,h''}, V_{5,h'''}\}\ \ge_2\ \{V_{3,h''}, V_{5,h'''}\}$
$\{V_{1,h}, V_{2,h'}, V_{3,h''}, V_{5,h'''}\}\ \not\ge_2\ \{V_{4,h'''}\}$
$\{V_{1,h}, V_{2,h'}, V_{3,h''}, V_{5,h'''}\}\ \not\ge_1\ \{V_{3,h''}, V_{5,h'''}\}$
$\{V_{1,h}, V_{2,h'}, V_{3,h''}, V_{5,h'''}\}\ \not\ge_3\ \{V_{3,h''}, V_{5,h'''}\}$
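The small Python sketch below (illustrative only; the (layer, head) encoding of value matrices is our own) implements the three conditions of the partial order and checks the four examples just listed.

def geq_l(T, S, l):
    """Check T >=_l S: T contains S, every matrix in S sits above layer l,
    and every matrix in T \\ S sits at layer l or below."""
    if not S <= T:                                   # condition 1: S is a subset of T
        return False
    if any(layer <= l for (layer, _) in S):          # condition 2: S only has layers > l
        return False
    if any(layer > l for (layer, _) in T - S):       # condition 3: T \\ S only has layers <= l
        return False
    return True

T = {(1, "h"), (2, "h'"), (3, "h''"), (5, "h'''")}
print(geq_l(T, {(3, "h''"), (5, "h'''")}, 2))   # True
print(geq_l(T, {(4, "h'''")}, 2))               # False: V_{4,h'''} is not in T
print(geq_l(T, {(3, "h''"), (5, "h'''")}, 1))   # False: V_{2,h'} in T \ S lies above layer 1
print(geq_l(T, {(3, "h''"), (5, "h'''")}, 3))   # False: V_{3,h''} in S does not lie above layer 3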
Matrix Parameters: $\widehat{K^T_{l,h}Q_{l,h}}$. We again define them implicitly in terms of regions; this can be realized using matrices $\widehat{K_{l,h}},\widehat{Q_{l,h}}$ where all entries are 0 or 1. Importantly, we sum only those entries where the "to-do-lists" $\mathcal{S}_1,\mathcal{S}_2$ are empty, and the sets $\mathcal{T}_1,\mathcal{T}_2$ only contain value matrices at layers $\le l$:
$(\hat{y}^{(l)}_i)^T\,\widehat{K^T_{l,h}Q_{l,h}}\,\hat{y}^{(l)}_j = \sum_{v,w,\ \mathcal{T}_1\ge_l\emptyset,\ \mathcal{T}_2\ge_l\emptyset}\Lambda_{\emptyset,\mathcal{T}_1,\mathcal{T}_2,l,h,v,w}(\hat{y}^{(l)}_i)\cdot\Omega_{\emptyset,\mathcal{T}_1,\mathcal{T}_2,l,h,v,w}(\hat{y}^{(l)}_j)$    (49)
Matrix Parameters: $U$. The $U$ matrix is translated as follows:
$\hat{U}_\sigma^T\,\hat{y}^{(L)}_i = \Gamma_{\emptyset,U_\sigma}(\hat{y}^{(L)}_i)$    (50)
Positional Function. Define, for $l=1,\dots,L$ and $h=1,\dots,H$, when $1\le i\le j\le N(T)$:
$\phi_{l,h}(i,j) = p_i^T K^T_{l,h}Q_{l,h}\,p_j$    (51)
As $T\in\Theta_n$, $\phi_{l,h}(i,j)$ only depends on $j-i$.
Bounding $R(\hat{T})$. First, we showed above that $\hat{d}$ is upper-bounded in terms of $R(T)$. Second, all parameters are represented at precision bounded in terms of $R(T)$: those parameters that are taken from product functions have precision $4Lp$ bits; those involving the SVDs of $K^TQ$ matrices also have bounded precision. Third, the norm of all parameter vectors is bounded in terms of $R(T)$ by construction. Fourth, $\Delta$ is bounded in terms of $R(T)$ as discussed above. The boundedness of the fifth term is immediate.
Summary. We have constructed a Limit Transformer $\hat{T}$ such that
$R(\hat{T})\le F(R(T))$    (52)
for some universal function $F:\mathbb{R}^+\to\mathbb{R}^+$, and
$p_i^T K^T_{l,h}Q_{l,h}\,p_j = \phi_{l,h}(i,j)$    (53)
for the $p_i, K_{l,h}, Q_{l,h}$ parameters of $T$, for each $l,h$. By assumption on $T$, each $\phi_{l,h}$ is translation-invariant. We have also constructed $\hat{p}_i$ with period $\Delta$, so that $\hat{T}$ satisfies PERIODIC.
F.4.3 PROVING LEMMA 52 (III): PROVING CORRECTNESS
In order to conclude Lemma 52, it remains to establish the correctness of the translation; that is, that $\hat{T}$ and $T$ agree at length $N(T)$. To do this, it suffices to show that both transformers produce the same next-token predictions for each $i=1,\dots,N(T)$:
$\hat{U}_\sigma^T\,\hat{y}^{(L)}_i = \Gamma_{\emptyset,U_\sigma}(\hat{y}^{(L)}_i) = U_\sigma^T\,y^{(L)}_i$    (54)
Informally, proving this requires showing that the attention logits and MLP activations in $\hat{T}$ match those in $T$; the result then follows from the linearity of $\Gamma_{\emptyset,U_\sigma}$, the way $\Gamma_{\dots}$ is defined for the vector parameters, and the way the value matrices $\widehat{V_{l,h}}$ move information. Formally, we prove the correctness of the translation inductively, by showing the following equalities. Recall (from Definition 43) that, when $\mathcal{S}\subseteq\mathcal{P}$ is a set of value matrices, we write $\prod_{S\in\mathcal{S}}S$ for the product of these matrices, ordered by layers, with the matrix associated with the lowest layer at the right. Then
Lemma 53 (Preservation of Products by Translation). For each layer $l\in\{0,1,\dots,L\}$, for any $\mathcal{S},\mathcal{S}_1,\mathcal{S}_2\subseteq\mathcal{P}$, for any $k>l$, for any $m\le l$, and for any $w\in\widehat{V_I}$, the following hold (shown by induction over the layers $l$):
(A) Preservation of Products with Vector Parameters:
$\Gamma_{\mathcal{S},w}(\hat{y}^{(l)}_i) = w^T\Big(\prod_{S\in\mathcal{S}}S\Big)y^{(l)}_i$
if $\forall l'\in\{1,\dots,L\},\ \forall h:\ [(V_{l',h}\in\mathcal{S})\Rightarrow l'>l]$.
(B) Preservation of Attention Logits (I):
$\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}(\hat{y}^{(l)}_i)\cdot\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}(\hat{y}^{(l)}_j) = (y^{(l)}_i)^T\Big(\prod_{S\in\mathcal{S}_1}S\Big)^T K^T_{k,h}Q_{k,h}\Big(\prod_{S\in\mathcal{S}_2}S\Big)y^{(l)}_j$
if $\mathcal{S}_1\cup\mathcal{S}_2\neq\emptyset$.
(C) Preservation of Attention Logits (II):
$\phi_{k,h}(i,j) + \sum_{v,w,\ \mathcal{T}_1\ge_l\emptyset,\ \mathcal{T}_2\ge_l\emptyset}\Lambda_{\emptyset,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}(\hat{y}^{(l)}_i)\cdot\Omega_{\emptyset,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}(\hat{y}^{(l)}_j) = (y^{(l)}_i)^T K^T_{k,h}Q_{k,h}\,y^{(l)}_j$
(D) Preservation of Attention Logits (III):
$\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big((\widehat{B_m})_{\cdot,s}\big)\cdot\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}(\hat{y}^{(l)}_j) = (B_m)_{\cdot,s}^T\Big(\prod_{S\in\mathcal{S}_1}S\Big)^T K^T_{k,h}Q_{k,h}\Big(\prod_{S\in\mathcal{S}_2}S\Big)y^{(l)}_j$
(E) Preservation of Attention Logits (IV):
$\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}(\hat{y}^{(l)}_i)\cdot\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big((\widehat{B_m})_{\cdot,s}\big) = (y^{(l)}_i)^T\Big(\prod_{S\in\mathcal{S}_1}S\Big)^T K^T_{k,h}Q_{k,h}\Big(\prod_{S\in\mathcal{S}_2}S\Big)(B_m)_{\cdot,s}$
(55)
and analogous statements hold with the post-MLP activations $y^{(l)}_i,\hat{y}^{(l)}_i$ replaced by the pre-MLP activations $Y^{(l)}_i,\widehat{Y^{(l)}_i}$.
From (A), we in particular obtain the correctness of the translation by noting that next-token predictions are replicated:
$\hat{U}_\sigma^T\,\hat{y}^{(L)}_i = \Gamma_{\emptyset,U_\sigma}(\hat{y}^{(L)}_i) = U_\sigma^T\,y^{(L)}_i$    (56)
Proof of Lemma 53. The formal proof proceeds by induction over $l$. It is conceptually straightforward, consisting of expanding definitions and taking care of the special treatment of positional encodings. We show it in considerable detail to build intuition. For the inductive base, at $l=0$, where $y^{(0)}_i$ is a sum of a word embedding and a positional embedding, the claims are immediate from the definitions. For expository purposes, and to build intuition for the more complex inductive step, we show them in more detail. Starting from (for simplicity, we take the offset to be zero here):
$y^{(0)}_i = E_{x_i} + p_i,\qquad \hat{y}^{(0)}_i = \widehat{E_{x_i}} + \hat{p}_i$    (57)
we first, for (A), write
$\Gamma_{\mathcal{S},w}(\widehat{E_{x_i}}+\hat{p}_i) = \Gamma_{\mathcal{S},w}(\widehat{E_{x_i}})+\Gamma_{\mathcal{S},w}(\hat{p}_i) = \beta_{\mathcal{S},w,E_{x_i}}+\beta_{\mathcal{S},w,p_{i\%\Delta}} = w^T\Big(\prod_{S\in\mathcal{S}}S\Big)y^{(0)}_i$
proving case (A) of the inductive base. Second, for (B) and (C), write, using the linearity of $\Lambda_{\dots}$ and $\Omega_{\dots}$:
$\sum_{v,w,\ \mathcal{T}_1\ge_0\mathcal{S}_1,\ \mathcal{T}_2\ge_0\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}(\widehat{E_{x_i}}+\hat{p}_i)\cdot\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}(\widehat{E_{x_j}}+\hat{p}_j)$
$\quad=\sum_{v,w,\ \mathcal{T}_1\ge_0\mathcal{S}_1,\ \mathcal{T}_2\ge_0\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}(\widehat{E_{x_i}})\cdot\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}(\widehat{E_{x_j}})$
$\quad+\sum_{v,w,\ \mathcal{T}_1\ge_0\mathcal{S}_1,\ \mathcal{T}_2\ge_0\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}(\widehat{E_{x_i}})\cdot\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}(\hat{p}_j)$
$\quad+\sum_{v,w,\ \mathcal{T}_1\ge_0\mathcal{S}_1,\ \mathcal{T}_2\ge_0\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}(\hat{p}_i)\cdot\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}(\widehat{E_{x_j}})$
$\quad+\sum_{v,w,\ \mathcal{T}_1\ge_0\mathcal{S}_1,\ \mathcal{T}_2\ge_0\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}(\hat{p}_i)\cdot\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}(\hat{p}_j)$
The only way of satisfying $\mathcal{T}\ge_0\mathcal{S}$ is for $\mathcal{T}$ to equal $\mathcal{S}$. After plugging in the definitions, the sums collapse due to the indicator terms in the definitions of the token and positional encodings, and we obtain after simplifying:
$=\alpha_{k,h,\mathcal{S}_1,\mathcal{S}_2,E_{x_i},E_{x_j}} + \alpha_{k,h,\mathcal{S}_1,\mathcal{S}_2,E_{x_i},p_{j\%\Delta}} + \alpha_{k,h,\mathcal{S}_1,\mathcal{S}_2,p_{i\%\Delta},E_{x_j}} + \alpha_{k,h,\mathcal{S}_1,\mathcal{S}_2,p_{i\%\Delta},p_{j\%\Delta}}\cdot 1_{\mathcal{S}_1\cup\mathcal{S}_2\neq\emptyset}$
By translation-invariance, the second and third terms are independent of the positional encoding arguments. By our choice of $\Delta$ at the beginning of the proof, the fourth term equals $\alpha_{k,h,\mathcal{S}_1,\mathcal{S}_2,p_i,p_j}\cdot 1_{\mathcal{S}_1\cup\mathcal{S}_2\neq\emptyset}$, as it is periodic in $(i,j)$ with period $\Delta$. We can thus rewrite the above as
$=\alpha_{k,h,\mathcal{S}_1,\mathcal{S}_2,E_{x_i},E_{x_j}} + \alpha_{k,h,\mathcal{S}_1,\mathcal{S}_2,E_{x_i},p_j} + \alpha_{k,h,\mathcal{S}_1,\mathcal{S}_2,p_i,E_{x_j}} + \alpha_{k,h,\mathcal{S}_1,\mathcal{S}_2,p_i,p_j}\cdot 1_{\mathcal{S}_1\cup\mathcal{S}_2\neq\emptyset}$
Applying the definition of $\alpha_{\dots}$, the above equals
$=\begin{cases}(y^{(0)}_i)^T K^T_{k,h}Q_{k,h}\,y^{(0)}_j - \phi_{k,h}(i,j) & \mathcal{S}_1\cup\mathcal{S}_2=\emptyset\\ (y^{(0)}_i)^T\big(\prod_{S\in\mathcal{S}_1}S\big)^T K^T_{k,h}Q_{k,h}\big(\prod_{S\in\mathcal{S}_2}S\big)y^{(0)}_j & \mathcal{S}_1\cup\mathcal{S}_2\neq\emptyset\end{cases}$
This establishes cases (B) and (C) of the inductive base. The proof of cases (D) and (E) in the inductive base is analogous.
For the inductive step, the intuition is that each activation $y^{(l)}_i$ is a linear combination of vector parameters, with different sets of value matrices acting on them:
$y^{(l)}_i = \sum_{v\in V_O}\sum_{\mathcal{S}\subseteq\mathcal{P}}\lambda_{v,i,l,\mathcal{S}}\Big(\prod_{S\in\mathcal{S}}S\Big)v$    (58)
where the coefficients are determined by attention weights and the activations of MLP hidden units. Importantly, the attention weights and MLP activations turn out to be the same in the Limit Transformer as in the original transformer, provided we can prove that attention and MLPs are faithfully simulated (which indeed follows from cases (A) and (C) of the inductive claim). Hence, the same decomposition is valid in the Limit Transformer:
$\hat{y}^{(l)}_i = \sum_{v\in V_O}\sum_{\mathcal{S}\subseteq\mathcal{P}}\lambda_{v,i,l,\mathcal{S}}\Big(\prod_{S\in\mathcal{S}}\hat{S}\Big)\hat{v}$    (59)
with the same $\lambda_{v,i,l,\mathcal{S}}$ coefficients as in the original transformer. Then, case (A) of the inductive claim follows intuitively by the calculation:
$w^T y^{(l)}_i = \sum_{v\in V_O}\sum_{\mathcal{S}\subseteq\mathcal{P}}\lambda_{v,i,l,\mathcal{S}}\,w^T\Big(\prod_{S\in\mathcal{S}}S\Big)v = \sum_{v\in V_O}\sum_{\mathcal{S}\subseteq\mathcal{P}}\lambda_{v,i,l,\mathcal{S}}\,\hat{w}^T\Big(\prod_{S\in\mathcal{S}}\hat{S}\Big)\hat{v} = \hat{w}^T\sum_{v\in V_O}\sum_{\mathcal{S}\subseteq\mathcal{P}}\lambda_{v,i,l,\mathcal{S}}\Big(\prod_{S\in\mathcal{S}}\hat{S}\Big)\hat{v} = \hat{w}^T\hat{y}^{(l)}_i$
which is warranted provided that, whenever $v\in V_O$ and $w\in V_I$, $\hat{w}^T\big(\prod_{S\in\mathcal{S}}\hat{S}\big)\hat{v}$ equals $w^T\big(\prod_{S\in\mathcal{S}}S\big)v$; this is ensured by the way the vector parameters $\hat{v}$ and the value matrices $\widehat{V_{l,h}}$ are defined. The same idea establishes cases (D)–(E). A similar, though somewhat more complex (due to the bilinear nature of attention), calculation establishes cases (B)–(C). Formalizing this reasoning essentially amounts to inductively proving cases (A)–(E); it will not be necessary to keep track of an explicit decomposition using $\lambda_{\dots}$ coefficients; rather, one can mechanically verify these conditions inductively by plugging in definitions and applying the inductive hypothesis.
Formally proving the inductive step consists of mechanically expanding definitions and applying the inductive hypothesis. First, (C) applied to layer $l-1$ entails that the attention logits for attention heads operating at layer $l$ match those of the original transformer. We start by establishing the inductive step for the pre-MLP activations $Y^{(l)}_i$. Starting from:
$Y^{(l)}_i = y^{(l-1)}_i + \sum_{h=1}^{H}\sum_{j=1}^{i}\tilde{a}^{(l,h)}_{i,j}\,V_{l,h}\,y^{(l-1)}_j,\qquad \widehat{Y^{(l)}_i} = \hat{y}^{(l-1)}_i + \sum_{h=1}^{H}\sum_{j=1}^{i}\tilde{a}^{(l,h)}_{i,j}\,\widehat{V_{l,h}}\,\hat{y}^{(l-1)}_j$    (60)
we show the inductive step first for (A) in the case of the pre-MLP activation $Y^{(l)}_i$. For $\mathcal{S}$ satisfying
$\forall l'\in\{1,\dots,L\},\ \forall h:\ [(V_{l',h}\in\mathcal{S})\Rightarrow l'>l]$    (61)
we consider
$\Gamma_{\mathcal{S},w}\big(\widehat{Y^{(l)}_i}\big) = \Gamma_{\mathcal{S},w}\big(\hat{y}^{(l-1)}_i\big) + \sum_{h=1}^{H}\sum_{j=1}^{i}\tilde{a}^{(l,h)}_{i,j}\,\Gamma_{\mathcal{S},w}\big(\widehat{V_{l,h}}\,\hat{y}^{(l-1)}_j\big) = \Gamma_{\mathcal{S},w}\big(\hat{y}^{(l-1)}_i\big) + \sum_{h=1}^{H}\sum_{j=1}^{i}\tilde{a}^{(l,h)}_{i,j}\,\Gamma_{\mathcal{S}\cup\{V_{l,h}\},w}\big(\hat{y}^{(l-1)}_j\big)$
where $\tilde{a}^{(l,h)}_{i,j}$ denotes the attention weights. The claim now follows from the inductive hypothesis for (A):
$= w^T\Big(\prod_{S\in\mathcal{S}}S\Big)y^{(l-1)}_i + \sum_{h=1}^{H}\sum_{j=1}^{i}\tilde{a}^{(l,h)}_{i,j}\,w^T\Big(\prod_{S\in\mathcal{S}}S\Big)V_{l,h}\,y^{(l-1)}_j = w^T\Big(\prod_{S\in\mathcal{S}}S\Big)Y^{(l)}_i$
where the last step used (60). Next, we consider (B) and (C). Using (60) and the linearity of $\Lambda_{\dots}$ and $\Omega_{\dots}$, we first find (writing $h',h''$ and $r,r'$ for the head and position indices of the layer-$l$ attention sums, to keep them distinct from the fixed head $(k,h)$ and the vectors $v,w$):
$\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\widehat{Y^{(l)}_i}\big)\,\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\widehat{Y^{(l)}_j}\big)$
$=\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\Big(\hat{y}^{(l-1)}_i+\sum_{h'=1}^{H}\sum_{r=1}^{i}\tilde{a}^{(l,h')}_{i,r}\,\widehat{V_{l,h'}}\,\hat{y}^{(l-1)}_r\Big)\cdot\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\Big(\hat{y}^{(l-1)}_j+\sum_{h''=1}^{H}\sum_{r'=1}^{j}\tilde{a}^{(l,h'')}_{j,r'}\,\widehat{V_{l,h''}}\,\hat{y}^{(l-1)}_{r'}\Big)$
$=\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_i\big)\,\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_j\big)$
$\quad+\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_i\big)\sum_{h''=1}^{H}\sum_{r'=1}^{j}\tilde{a}^{(l,h'')}_{j,r'}\,\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\widehat{V_{l,h''}}\,\hat{y}^{(l-1)}_{r'}\big)$
$\quad+\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\sum_{h'=1}^{H}\sum_{r=1}^{i}\tilde{a}^{(l,h')}_{i,r}\,\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\widehat{V_{l,h'}}\,\hat{y}^{(l-1)}_r\big)\,\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_j\big)$
$\quad+\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\sum_{h'=1}^{H}\sum_{r=1}^{i}\tilde{a}^{(l,h')}_{i,r}\,\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\widehat{V_{l,h'}}\,\hat{y}^{(l-1)}_r\big)\cdot\sum_{h''=1}^{H}\sum_{r'=1}^{j}\tilde{a}^{(l,h'')}_{j,r'}\,\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\widehat{V_{l,h''}}\,\hat{y}^{(l-1)}_{r'}\big)$
Now the definition of $\widehat{V_{l,h}}$ allows us to rewrite this as:
$=\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_i\big)\,\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_j\big)$
$\quad+\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_i\big)\sum_{h''=1}^{H}\sum_{r'=1}^{j}\tilde{a}^{(l,h'')}_{j,r'}\,\Omega_{\mathcal{S}_2\cup\{V_{l,h''}\},\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_{r'}\big)$
$\quad+\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\sum_{h'=1}^{H}\sum_{r=1}^{i}\tilde{a}^{(l,h')}_{i,r}\,\Lambda_{\mathcal{S}_1\cup\{V_{l,h'}\},\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_r\big)\,\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_j\big)$
$\quad+\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\sum_{h'=1}^{H}\sum_{r=1}^{i}\tilde{a}^{(l,h')}_{i,r}\,\Lambda_{\mathcal{S}_1\cup\{V_{l,h'}\},\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_r\big)\cdot\sum_{h''=1}^{H}\sum_{r'=1}^{j}\tilde{a}^{(l,h'')}_{j,r'}\,\Omega_{\mathcal{S}_2\cup\{V_{l,h''}\},\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_{r'}\big)$
In order to directly apply the inductive hypothesis, we rearrange the summations:
$=\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_i\big)\,\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_j\big)$
$\quad+\sum_{h''=1}^{H}\sum_{r'=1}^{j}\tilde{a}^{(l,h'')}_{j,r'}\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_i\big)\,\Omega_{\mathcal{S}_2\cup\{V_{l,h''}\},\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_{r'}\big)$
$\quad+\sum_{h'=1}^{H}\sum_{r=1}^{i}\tilde{a}^{(l,h')}_{i,r}\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1\cup\{V_{l,h'}\},\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_r\big)\,\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_j\big)$
$\quad+\sum_{h'=1}^{H}\sum_{r=1}^{i}\tilde{a}^{(l,h')}_{i,r}\sum_{h''=1}^{H}\sum_{r'=1}^{j}\tilde{a}^{(l,h'')}_{j,r'}\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1\cup\{V_{l,h'}\},\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_r\big)\,\Omega_{\mathcal{S}_2\cup\{V_{l,h''}\},\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_{r'}\big)$
Note that, as above, $\mathcal{S}_1,\mathcal{S}_2$ are fixed and the sums run over $\mathcal{T}_1,\mathcal{T}_2$. We can rewrite the above as:
$=\sum_{v,w,\ \mathcal{T}_1\ge_{l-1}\mathcal{S}_1,\ \mathcal{T}_2\ge_{l-1}\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_i\big)\,\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_j\big)$
$\quad+\sum_{h''=1}^{H}\sum_{r'=1}^{j}\tilde{a}^{(l,h'')}_{j,r'}\sum_{v,w,\ \mathcal{T}_1\ge_{l-1}\mathcal{S}_1,\ \mathcal{T}_2\ge_{l-1}\mathcal{S}_2\cup\{V_{l,h''}\}}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_i\big)\,\Omega_{\mathcal{S}_2\cup\{V_{l,h''}\},\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_{r'}\big)$
$\quad+\sum_{h'=1}^{H}\sum_{r=1}^{i}\tilde{a}^{(l,h')}_{i,r}\sum_{v,w,\ \mathcal{T}_1\ge_{l-1}\mathcal{S}_1\cup\{V_{l,h'}\},\ \mathcal{T}_2\ge_{l-1}\mathcal{S}_2}\Lambda_{\mathcal{S}_1\cup\{V_{l,h'}\},\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_r\big)\,\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_j\big)$
$\quad+\sum_{h'=1}^{H}\sum_{r=1}^{i}\tilde{a}^{(l,h')}_{i,r}\sum_{h''=1}^{H}\sum_{r'=1}^{j}\tilde{a}^{(l,h'')}_{j,r'}\sum_{v,w,\ \mathcal{T}_1\ge_{l-1}\mathcal{S}_1\cup\{V_{l,h'}\},\ \mathcal{T}_2\ge_{l-1}\mathcal{S}_2\cup\{V_{l,h''}\}}\Lambda_{\mathcal{S}_1\cup\{V_{l,h'}\},\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_r\big)\,\Omega_{\mathcal{S}_2\cup\{V_{l,h''}\},\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_{r'}\big)$
We are now ready to apply the inductive hypothesis: directly plugging the inductive hypothesis for (B) into the second through fourth terms gives us:
$=\sum_{v,w,\ \mathcal{T}_1\ge_{l-1}\mathcal{S}_1,\ \mathcal{T}_2\ge_{l-1}\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_i\big)\,\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l-1)}_j\big)$
$\quad+\sum_{h''=1}^{H}\sum_{r'=1}^{j}\tilde{a}^{(l,h'')}_{j,r'}\,(y^{(l-1)}_i)^T\Big(\prod_{S\in\mathcal{S}_1}S\Big)^T K^T_{k,h}Q_{k,h}\Big(\prod_{S\in\mathcal{S}_2}S\Big)V_{l,h''}\,y^{(l-1)}_{r'}$
$\quad+\sum_{h'=1}^{H}\sum_{r=1}^{i}\tilde{a}^{(l,h')}_{i,r}\,(y^{(l-1)}_r)^T V_{l,h'}^T\Big(\prod_{S\in\mathcal{S}_1}S\Big)^T K^T_{k,h}Q_{k,h}\Big(\prod_{S\in\mathcal{S}_2}S\Big)y^{(l-1)}_j$
$\quad+\sum_{h'=1}^{H}\sum_{r=1}^{i}\tilde{a}^{(l,h')}_{i,r}\sum_{h''=1}^{H}\sum_{r'=1}^{j}\tilde{a}^{(l,h'')}_{j,r'}\,(y^{(l-1)}_r)^T V_{l,h'}^T\Big(\prod_{S\in\mathcal{S}_1}S\Big)^T K^T_{k,h}Q_{k,h}\Big(\prod_{S\in\mathcal{S}_2}S\Big)V_{l,h''}\,y^{(l-1)}_{r'}$
We now distinguish two cases, for proving (B) and (C). The first is that $\mathcal{S}_1=\mathcal{S}_2=\emptyset$. In this case, by case (C) of the inductive hypothesis:
$=(y^{(l-1)}_i)^T K^T_{k,h}Q_{k,h}\,y^{(l-1)}_j - \phi_{k,h}(i,j)$
$\quad+\sum_{h''=1}^{H}\sum_{r'=1}^{j}\tilde{a}^{(l,h'')}_{j,r'}\,(y^{(l-1)}_i)^T K^T_{k,h}Q_{k,h}\,V_{l,h''}\,y^{(l-1)}_{r'}$
$\quad+\sum_{h'=1}^{H}\sum_{r=1}^{i}\tilde{a}^{(l,h')}_{i,r}\,(y^{(l-1)}_r)^T V_{l,h'}^T K^T_{k,h}Q_{k,h}\,y^{(l-1)}_j$
$\quad+\sum_{h'=1}^{H}\sum_{r=1}^{i}\tilde{a}^{(l,h')}_{i,r}\sum_{h''=1}^{H}\sum_{r'=1}^{j}\tilde{a}^{(l,h'')}_{j,r'}\,(y^{(l-1)}_r)^T V_{l,h'}^T K^T_{k,h}Q_{k,h}\,V_{l,h''}\,y^{(l-1)}_{r'}$
Now, applying (60) again, we collect the sums to obtain the conclusion
$= (Y^{(l)}_i)^T K^T_{k,h}Q_{k,h}\,Y^{(l)}_j - \phi_{k,h}(i,j)$    (62)
proving (upon rearranging) the inductive step for (C) in the case of $Y^{(l)}_i$. In the second case, $\mathcal{S}_1\cup\mathcal{S}_2\neq\emptyset$; here, we use case (B) of the inductive hypothesis to instead rewrite the expression as
$=(y^{(l-1)}_i)^T\Big(\prod_{S\in\mathcal{S}_1}S\Big)^T K^T_{k,h}Q_{k,h}\Big(\prod_{S\in\mathcal{S}_2}S\Big)y^{(l-1)}_j$
$\quad+\sum_{h''=1}^{H}\sum_{r'=1}^{j}\tilde{a}^{(l,h'')}_{j,r'}\,(y^{(l-1)}_i)^T\Big(\prod_{S\in\mathcal{S}_1}S\Big)^T K^T_{k,h}Q_{k,h}\Big(\prod_{S\in\mathcal{S}_2}S\Big)V_{l,h''}\,y^{(l-1)}_{r'}$
$\quad+\sum_{h'=1}^{H}\sum_{r=1}^{i}\tilde{a}^{(l,h')}_{i,r}\,(y^{(l-1)}_r)^T V_{l,h'}^T\Big(\prod_{S\in\mathcal{S}_1}S\Big)^T K^T_{k,h}Q_{k,h}\Big(\prod_{S\in\mathcal{S}_2}S\Big)y^{(l-1)}_j$
$\quad+\sum_{h'=1}^{H}\sum_{r=1}^{i}\tilde{a}^{(l,h')}_{i,r}\sum_{h''=1}^{H}\sum_{r'=1}^{j}\tilde{a}^{(l,h'')}_{j,r'}\,(y^{(l-1)}_r)^T V_{l,h'}^T\Big(\prod_{S\in\mathcal{S}_1}S\Big)^T K^T_{k,h}Q_{k,h}\Big(\prod_{S\in\mathcal{S}_2}S\Big)V_{l,h''}\,y^{(l-1)}_{r'}$
Now, applying (60) again, we collect the sums to obtain the conclusion
$= (Y^{(l)}_i)^T\Big(\prod_{S\in\mathcal{S}_1}S\Big)^T K^T_{k,h}Q_{k,h}\Big(\prod_{S\in\mathcal{S}_2}S\Big)Y^{(l)}_j$    (63)
This proves the inductive step for (B) in the case of $Y^{(l)}_i$. We next address the inductive step for (D) in the case of the pre-MLP activation:
$\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big((\widehat{B_m})_{\cdot,s}\big)\cdot\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\widehat{Y^{(l)}_j}\big) = (B_m)_{\cdot,s}^T\Big(\prod_{S\in\mathcal{S}_1}S\Big)^T K^T_{k,h}Q_{k,h}\Big(\prod_{S\in\mathcal{S}_2}S\Big)Y^{(l)}_j$
By unfolding $\widehat{Y^{(l)}_j}$ using (60) and using the linearity of $\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}$, the claim follows directly from the inductive hypothesis for (D). The same reasoning applies to (E). Overall, we have proven the inductive step (A)–(E) for the pre-MLP activations $Y^{(l)}_i$.
We now need to show that the inductive step also holds for the post-MLP activations. Recall that the MLP acts as
$y^{(l)}_i = Y^{(l)}_i + \sum_{s=1}^{d_{MLP}}(B_l)_{\cdot,s}\cdot\psi_{l,s}\Big((A_l)_{s,\cdot}Y^{(l)}_i+(b_l)_s\Big),\qquad \hat{y}^{(l)}_i = \widehat{Y^{(l)}_i} + \sum_{s=1}^{d_{MLP}}(\widehat{B_l})_{\cdot,s}\cdot\psi_{l,s}\Big((\widehat{A_l})_{s,\cdot}\widehat{Y^{(l)}_i}+(\hat{b}_l)_s\Big)$    (64)
The proof proceeds by expanding this equation and reducing the claim to the already-proven inductive step for pre-MLP activations (for handling the direct contribution from the pre-MLP activation), and to cases (D) and (E) (for handling the contributions of the MLP units). First, we note that, for each $l,s$, by case (A) for the pre-MLP activation (established above) and by the definition of $\hat{b}_l$,
$\psi_{l,s}\Big((A_l)_{s,\cdot}Y^{(l)}_i+(b_l)_s\Big) = \psi_{l,s}\Big((\widehat{A_l})_{s,\cdot}\widehat{Y^{(l)}_i}+(\hat{b}_l)_s\Big)$    (65)
We will abbreviate this number as $\Xi_{l,s,i}\in\mathbb{R}$. We now prove case (A) of the inductive step for the post-MLP activation:
$\Gamma_{\mathcal{S},w}\big(\hat{y}^{(l)}_i\big) = \Gamma_{\mathcal{S},w}\Big(\widehat{Y^{(l)}_i}+\sum_{s=1}^{d_{MLP}}(\widehat{B_l})_{\cdot,s}\cdot\Xi_{l,s,i}\Big) = \Gamma_{\mathcal{S},w}\big(\widehat{Y^{(l)}_i}\big) + \sum_{s=1}^{d_{MLP}}\Gamma_{\mathcal{S},w}\big((\widehat{B_l})_{\cdot,s}\big)\cdot\Xi_{l,s,i} = w^T\Big(\prod_{S\in\mathcal{S}}S\Big)Y^{(l)}_i + \sum_{s=1}^{d_{MLP}}w^T\Big(\prod_{S\in\mathcal{S}}S\Big)(B_l)_{\cdot,s}\cdot\Xi_{l,s,i} = w^T\Big(\prod_{S\in\mathcal{S}}S\Big)y^{(l)}_i$
To prove cases (B) and (C) for the post-MLP activations $y^{(l)}_i$, we now consider
$\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l)}_i\big)\cdot\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\hat{y}^{(l)}_j\big)$
$=\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\Big(\widehat{Y^{(l)}_i}+\sum_{s=1}^{d_{MLP}}\Xi_{l,s,i}\,(\widehat{B_l})_{\cdot,s}\Big)\cdot\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\Big(\widehat{Y^{(l)}_j}+\sum_{t=1}^{d_{MLP}}\Xi_{l,t,j}\,(\widehat{B_l})_{\cdot,t}\Big)$
$=\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\widehat{Y^{(l)}_i}\big)\,\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\widehat{Y^{(l)}_j}\big)$
$\quad+\sum_{s=1}^{d_{MLP}}\Xi_{l,s,j}\cdot\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\widehat{Y^{(l)}_i}\big)\,\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big((\widehat{B_l})_{\cdot,s}\big)$
$\quad+\sum_{s=1}^{d_{MLP}}\Xi_{l,s,i}\cdot\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big((\widehat{B_l})_{\cdot,s}\big)\,\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\widehat{Y^{(l)}_j}\big)$
$\quad+\sum_{s=1}^{d_{MLP}}\Xi_{l,s,i}\cdot\sum_{t=1}^{d_{MLP}}\Xi_{l,t,j}\cdot\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big((\widehat{B_l})_{\cdot,s}\big)\,\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big((\widehat{B_l})_{\cdot,t}\big)$
We apply the inductive step for the pre-MLP activation in cases (D) and (E) to rewrite the second and third terms, and apply the definition of $\widehat{B_l}$ to rewrite the fourth term:
$=\sum_{v,w,\ \mathcal{T}_1\ge_l\mathcal{S}_1,\ \mathcal{T}_2\ge_l\mathcal{S}_2}\Lambda_{\mathcal{S}_1,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\widehat{Y^{(l)}_i}\big)\,\Omega_{\mathcal{S}_2,\mathcal{T}_1,\mathcal{T}_2,k,h,v,w}\big(\widehat{Y^{(l)}_j}\big)$
$\quad+\sum_{s=1}^{d_{MLP}}\Xi_{l,s,j}\cdot(Y^{(l)}_i)^T\Big(\prod_{S\in\mathcal{S}_1}S\Big)^T K^T_{k,h}Q_{k,h}\Big(\prod_{S\in\mathcal{S}_2}S\Big)(B_l)_{\cdot,s}$
$\quad+\sum_{s=1}^{d_{MLP}}\Xi_{l,s,i}\cdot(B_l)_{\cdot,s}^T\Big(\prod_{S\in\mathcal{S}_1}S\Big)^T K^T_{k,h}Q_{k,h}\Big(\prod_{S\in\mathcal{S}_2}S\Big)Y^{(l)}_j$
$\quad+\sum_{s=1}^{d_{MLP}}\Xi_{l,s,i}\cdot\sum_{t=1}^{d_{MLP}}\Xi_{l,t,j}\cdot(B_l)_{\cdot,s}^T\Big(\prod_{S\in\mathcal{S}_1}S\Big)^T K^T_{k,h}Q_{k,h}\Big(\prod_{S\in\mathcal{S}_2}S\Big)(B_l)_{\cdot,t}$
In the case where $\mathcal{S}_1=\mathcal{S}_2=\emptyset$, we obtain, using case (C) of the inductive step for the pre-MLP activation:
$=(Y^{(l)}_i)^T K^T_{k,h}Q_{k,h}\,Y^{(l)}_j - \phi_{k,h}(i,j)$
$\quad+\sum_{s=1}^{d_{MLP}}\Xi_{l,s,j}\cdot(Y^{(l)}_i)^T K^T_{k,h}Q_{k,h}\,(B_l)_{\cdot,s}$
$\quad+\sum_{s=1}^{d_{MLP}}\Xi_{l,s,i}\cdot(B_l)_{\cdot,s}^T K^T_{k,h}Q_{k,h}\,Y^{(l)}_j$
$\quad+\sum_{s=1}^{d_{MLP}}\Xi_{l,s,i}\cdot\sum_{t=1}^{d_{MLP}}\Xi_{l,t,j}\cdot(B_l)_{\cdot,s}^T K^T_{k,h}Q_{k,h}\,(B_l)_{\cdot,t}$
Using (64) and the definition of $\Xi_{l,s,i}$, this rewrites to
$=(y^{(l)}_i)^T K^T_{k,h}Q_{k,h}\,y^{(l)}_j - \phi_{k,h}(i,j)$
from which case (C) of the inductive step follows by rearranging. If instead $\mathcal{S}_1\cup\mathcal{S}_2\neq\emptyset$, the same reasoning leads to case (B) of the inductive step. Analogous reasoning establishes cases (D) and (E) for the post-MLP activations. Overall, we have established the inductive step for cases (A)–(E) for the post-MLP activations.
G ADDITIONAL SUPPORTING RESULTS
G.1 REGULARIZER AT INITIALIZATION
Here, we provide evidence that the additional regularizer (8) is bounded independently of $N$ under plausible initializations of parameters. Recall
$(8) = \sum_{l=1}^{L}\sum_{h=1}^{H}\sum_{j=1}^{N(T)}\left|p_1^T K^T_{l,h}Q_{l,h}\,p_j\right|^2$    (66)
Intuitively, and as formalized in Proposition 54, when independently initializing the positional encodings $p_i$, their inner products as mediated through $K^T_{l,h}Q_{l,h}$ will tend to be small. As long as the width grows linearly with $N$, the aggregate value of (8) will tend to be bounded independently of $N$. Note that (8) only includes products involving position 1, which, due to translation invariance for $T\in\Theta_n$, places a bound on all products. As standard training does not enforce translation invariance of the products $p_i^T K^T_{l,h}Q_{l,h}\,p_j$, one may also be interested in a variant that takes all pairs of positions into account, to the extent that they can enter causal attention:
$\frac{1}{N(T)}\sum_{l=1}^{L}\sum_{h=1}^{H}\sum_{j=1}^{N(T)}\sum_{i=1}^{j}\left|p_i^T K^T_{l,h}Q_{l,h}\,p_j\right|^2$    (67)
Here, the same conclusion holds. We describe it formally, using the second variant as an example, in Proposition 54.
Proposition 54. Assume $d=\Theta(N)$. Assume the entries of each $p_1,\dots,p_N\in\mathbb{R}^d$ and of $K_{l,h},Q_{l,h}\in\mathbb{R}^{d\times d}$ ($l=1,\dots,L$; $h=1,\dots,H$) are initialized i.i.d. from $\mathcal{N}(0,\frac{1}{d})$. The numbers of layers $L$ and heads $H$ are constant with respect to $N$. Then
$\mathbb{E}\left[\frac{1}{N}\sum_{l=1}^{L}\sum_{h=1}^{H}\sum_{1\le i\le j\le N}\left|p_i^T K^T_{l,h}Q_{l,h}\,p_j\right|^2\right] = O(1)$    (68)
Proof. We begin by showing that the expectation of each term in the sum is $O(1/d)$; since there are $O(N^2)$ terms, $d=\Theta(N)$, and the sum carries a prefactor $\frac{1}{N}$, this implies that the overall expectation is bounded by a constant. There are two cases for the expectation of a term: the first is $i\neq j$, when the vectors $p_i$ and $p_j$ are independent, and the second is $i=j$, when $p_i$ and $p_j$ coincide.
For this section, let $K,Q$ denote the matrices $K_{l,h},Q_{l,h}$ for any fixed $l,h$, and write $\sigma^2=\frac{1}{d}$ for the entry variance. Let $K_{ij}$ and $Q_{ij}$ denote the entry of the corresponding matrix in the $i$-th column and $j$-th row. Let $A=K^TQ\in\mathbb{R}^{d\times d}$. Note that the expectation of any entry of $A$ is
$\mathbb{E}[A_{ij}] = \mathbb{E}[K_i^TQ_j] = \mathbb{E}\Big[\sum_{k=1}^{d}K_{i,k}Q_{j,k}\Big]=0.$
Further,
$\mathbb{E}[A_{ij}^2] = \mathbb{E}\Big[\Big(\sum_{k=1}^{d}K_{i,k}Q_{j,k}\Big)^2\Big] = \mathbb{E}\Big[\sum_{k=1}^{d}K_{i,k}^2Q_{j,k}^2\Big] + 2\,\mathbb{E}\Big[\sum_{m=1}^{d-1}\sum_{n=m+1}^{d}K_{i,m}Q_{j,m}K_{i,n}Q_{j,n}\Big] = \mathbb{E}\Big[\sum_{k=1}^{d}K_{i,k}^2Q_{j,k}^2\Big] = d\sigma^4 = \sigma^2.$
For products of two different entries $A_{i,j}A_{m,n}$, note that $\mathbb{E}[A_{i,j}A_{m,n}] = \mathbb{E}\big[\sum_{u=1}^{d}\sum_{v=1}^{d}K_{i,u}Q_{j,u}K_{m,v}Q_{n,v}\big]$, which is $0$ when $i\neq m$ or $j\neq n$.
For each term $|p_i^TAp_j|^2$, we have
$\mathbb{E}[|p_i^TAp_j|^2] = \mathbb{E}\Big[\Big(\sum_{u=1}^{d}\sum_{v=1}^{d}p_{i,u}A_{u,v}p_{j,v}\Big)^2\Big] = \mathbb{E}\Big[\sum_{u=1}^{d}\sum_{v=1}^{d}(p_{i,u}A_{u,v}p_{j,v})^2\Big] + 2\,\mathbb{E}\Big[\sum_{(u,v)\neq(m,n)}p_{i,u}p_{i,m}A_{u,v}A_{m,n}p_{j,v}p_{j,n}\Big] = \mathbb{E}\Big[\sum_{u=1}^{d}\sum_{v=1}^{d}p_{i,u}^2A_{u,v}^2p_{j,v}^2\Big]$
since all terms of the form $\mathbb{E}[p_{i,u}p_{i,m}A_{u,v}A_{m,n}p_{j,v}p_{j,n}]$ with $(u,v)\neq(m,n)$ are $0$, using the independence of $A$ and $p$ together with $\mathbb{E}[A_{u,v}A_{m,n}]=0$.
For $i\neq j$, we have
$\mathbb{E}\Big[\sum_{u=1}^{d}\sum_{v=1}^{d}p_{i,u}^2A_{u,v}^2p_{j,v}^2\Big] = \sum_{u=1}^{d}\sum_{v=1}^{d}\mathbb{E}[p_{i,u}^2]\,\mathbb{E}[A_{u,v}^2]\,\mathbb{E}[p_{j,v}^2] = d^2\sigma^6 = \frac{1}{d}.$
For $i=j$, we have
$\mathbb{E}\Big[\sum_{u=1}^{d}\sum_{v=1}^{d}p_{i,u}^2A_{u,v}^2p_{i,v}^2\Big] = \mathbb{E}\Big[\sum_{u=1}^{d}p_{i,u}^4A_{u,u}^2\Big] + \mathbb{E}\Big[\sum_{u=1}^{d}\sum_{v\neq u}p_{i,u}^2A_{u,v}^2p_{i,v}^2\Big] = d(3\sigma^6)+(d^2-d)\sigma^6 < \frac{3}{d}.$
Since $d=\Theta(N)$ and each term is less than $\frac{3}{d}$, we conclude that the expectation in Eq. (68) is $O(1)$.
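The following numpy sketch (our own Monte Carlo check, with arbitrarily chosen small values of L, H, and trial counts) illustrates the statement of Proposition 54: with d = N and N(0, 1/d) initialization, the quantity inside the expectation stays roughly constant as N grows.

import numpy as np

def avg_logit_penalty(N, d, L=2, H=2, trials=20, seed=0):
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        P = rng.normal(0.0, np.sqrt(1.0 / d), size=(N, d))       # positional encodings p_1..p_N
        total = 0.0
        for _ in range(L * H):
            K = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, d))
            Q = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, d))
            G = P @ (K.T @ Q) @ P.T                               # G[i, j] = p_i^T K^T Q p_j
            iu = np.triu_indices(N)                               # pairs with i <= j, as in Eq. (68)
            total += np.sum(G[iu] ** 2)
        vals.append(total / N)                                    # the 1/N prefactor from Eq. (68)
    return np.mean(vals)

for N in [64, 128, 256]:
    print(N, avg_logit_penalty(N, d=N))   # stays roughly constant as N grows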
G.2 EMPIRICAL LENGTH GENERALIZATION OF POSITIONAL FUNCTIONS
Here, we show empirically that, when directly fitting parameters so that a product $p_i^TK^TQp_j$ reproduces some function $\phi(\cdot,\cdot)$ at smaller distances, the fit length-generalizes when the function is local or periodic, but under different conditions, matching the roles of local and periodic functions in our theory. Specifically, we show that such functions length-generalize well at large $d$ when they are LOCAL, whereas, when $d$ is smaller, length generalization works well when they are PERIODIC. Length generalization is poor on functions that are neither local nor periodic.
Experimental Setup. We randomly initialize 200 position embeddings $\{p_i\in\mathbb{R}^d : 1\le i\le 200\}$, as well as query and key matrices $Q,K\in\mathbb{R}^{d\times d}$. We experiment with $d\in\{32,256\}$. We optimize the mean squared error (MSE) between $p_i^TK^TQp_j$ and $\phi(\cdot,\cdot)$ at length 50, and test at lengths $\{50,100,150\}$. Concretely, during training, we sample random offsets $o$ from $[0,150]$ and take the sequence $p_{1+o},\dots,p_{50+o}$ to compute the loss. When testing at length $n$, we compute the average loss over all offsets in $[0,200-n]$. We ignore the loss on those entries where $j>i$ to mimic causal masking. The $\phi(\cdot,\cdot)$ we use in experiments (except for the one combined from two functions) only takes two values: 0 when the condition is false and $2\log 50$ when the condition is true. We thus use different conditions to describe different $\phi(\cdot,\cdot)$. For example, we use $\phi: j=i-c$ to denote the following function:
$\phi(j,i) = \begin{cases}2\log 50 & j=i-c\\ 0 & \text{else}\end{cases}$    (69)
where $c$ is a constant.
The embeddings and weight matrices are trained with the Adam optimizer, using a batch size of 64 and a learning rate of 1e-3, for 15k steps. Additionally, we add the mean squared weights (i.e., the squared Frobenius norm divided by the number of elements) to the loss to mimic the training regularizer, with a coefficient of 0.01.
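A minimal PyTorch sketch of this setup is given below. It is not the authors' code: the target function, the number of steps, the constant c, and the omission of batching are our own simplifications, but it shows the core loop of fitting $p_i^TK^TQp_j$ to a target $\phi$ on windows of length 50 and then measuring the MSE on longer windows.

import torch

d, n_pos, train_len, steps = 32, 200, 50, 2000
phi_val = 2 * torch.log(torch.tensor(50.0))

def phi(q, k, c=3):                        # example local target: 2 log 50 iff key = query - c
    return torch.where(k == q - c, phi_val, torch.tensor(0.0))

P = torch.nn.Parameter(torch.randn(n_pos, d) / d ** 0.5)
K = torch.nn.Parameter(torch.randn(d, d) / d ** 0.5)
Q = torch.nn.Parameter(torch.randn(d, d) / d ** 0.5)
opt = torch.optim.Adam([P, K, Q], lr=1e-3)

def window_loss(offset, length):
    idx = torch.arange(offset, offset + length)
    logits = P[idx] @ K.T @ Q @ P[idx].T               # logits[a, b] = p_q^T K^T Q p_k
    q = idx.view(-1, 1).expand(length, length)         # query positions
    k = idx.view(1, -1).expand(length, length)         # key positions
    mask = k <= q                                       # causal: ignore keys after the query
    return ((logits - phi(q, k)) ** 2)[mask].mean()

for step in range(steps):
    o = int(torch.randint(0, n_pos - train_len + 1, (1,)))
    loss = window_loss(o, train_len)
    loss = loss + 0.01 * sum((w ** 2).mean() for w in (P, K, Q))   # mean-squared-weight penalty
    opt.zero_grad(); loss.backward(); opt.step()

for test_len in (50, 100, 150):                         # average over all valid offsets
    with torch.no_grad():
        losses = [window_loss(o, test_len) for o in range(0, n_pos - test_len + 1)]
        print(test_len, float(sum(losses) / len(losses)))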
Results are shown in Figures 5 and 6. Note that in both figures, the y-axis uses a logarithmic scale above 1.0 and a linear scale below 1.0. In the last column of Figure 6, "combined" denotes functions that combine two functions as follows: $\phi=\phi_1+\phi_2$, where $\phi_1: j=i-c$ and $\phi_2: (i-j)\equiv c_2 \bmod c_1$.
Figure 5: Appendix G.2: MSE loss in fitting (length = 50) and generalizing (higher lengths) functions $\phi(\cdot,\cdot)$ with products $p_j^TK^TQp_i$. We show local functions testing whether $j=i-c$ (left) and whether $j>i-c$ (center), and periodic functions testing whether $i-j\equiv c_2\ (\mathrm{mod}\ c_1)$ (right). We show results at small (top, $d=32$) and high (bottom, $d=256$) dimensionality. Local functions length-generalize well when dimensionality is high (bottom left, bottom center); generalization is more successful for functions concentrated on few pairs (bottom left is nonzero at only one value of $j-i$; bottom center is nonzero at $c$ different values of $j-i$). Periodic functions length-generalize well when dimensionality is low (top right). The results match the distinct roles played by local and periodic functions in our theoretical constructions: periodic functions are mediated by bounded-rank products (Lemma 48), while local functions are mediated by the products $p^TK^T_{l,h}Qp$.
Figure 6: Appendix G.2: MSE loss in fitting (length = 50) and generalizing (higher lengths) functions $\phi(\cdot,\cdot)$ with products $p_j^TK^TQp_i$. We show functions that are neither local nor periodic, which test whether $j<i-c$ (left) and whether $i-j$ is a prime number (center), and a function created by adding a local and a periodic function (right). We show results at small (top, $d=32$) and high (bottom, $d=256$) dimensionality. Compared with the results in Figure 5, we can see that such functions, neither local nor periodic, length-generalize poorly.
G.3 BOUND FOR ENCODINGS NORM IN TERMS OF FUNCTION COMPLEXITY
Recall that our regularizer includes a penalty (8) on attention dot products. Here, we discuss a conjecture:
Conjecture 55. The term (8) can be removed from the regularizer while maintaining a (potentially weaker) length generalization guarantee for Limit Transformers.
To provide a heuristic argument for this, assume that for each upper bound $N$ on the input length, we have a configuration of positional encodings and a matrix $A$ such that the following property holds: for any indices $N\ge j>i>0$,
$p_i^TAp_j = F(j-i)$    (70)
where $F:\mathbb{N}\to\mathbb{R}$ is a function that maps to numbers representable in $p$-bit precision. The function $F$ and the precision $p$ are chosen globally, across the different $N$'s. Boundedness of $R(T_n)$ across $n$ entails $\|p_i\|_2,\|A\|<C$, for $C$ a global constant. We also know that $\sup_{x\in\mathbb{N}}|F(x)|<\infty$.
We conjecture that one can use these assumptions, and Lemma 56, to prove that $F$ cannot be "too complicated". Specifically, we conjecture that $F$ will be ultimately periodic: when $x$ exceeds some threshold, $F(x+\Delta)=F(x)$ for some period $\Delta$. For, if $F$ is not ultimately periodic, we hope to construct a matrix $G$ whose nuclear norm can be made arbitrarily large, so large as to give a superconstant lower bound on $\|A\|$, which is a contradiction. First note that Lemma 56 even holds if $G=Y^TAX$, where $X,Y$ are two different matrices with $n$ unit-norm columns. That is, we can consider a matrix $G_{ij}=p_{x_i}^TAp_{y_j}$, where we conjecture that one can choose $x_1,\dots,x_n$ and $y_1,\dots,y_n$ to give an arbitrarily large lower bound on $\|A\|$, under the assumption that $F$ is not ultimately periodic. Here, it is important that $F$ maps to fixed-precision outputs; otherwise, one could get functions that have irrational periods and are thus not periodic when restricted to $\mathbb{N}$.
Lemma 56. Let $x_1,\dots,x_n\in\mathbb{R}^d$ be vectors with $\|x_i\|_2=1$, and let $A\in\mathbb{R}^{d\times d}$ be arbitrary. Let $G\in\mathbb{R}^{n\times n}$ be such that $G_{ij}=x_i^TAx_j$. Then
$\|A\|\ \ge\ \frac{\|G\|_*}{n}$    (71)
where $\|A\|$ denotes the spectral norm and $\|G\|_*$ the nuclear norm.
Proof. First, note that for any matrix $B=U\Sigma V$ (with $U,V$ orthogonal), we have
$\|B\|_* = \mathrm{tr}(\Sigma) = \mathrm{tr}(\Sigma VV^TU^TU) = \mathrm{tr}(U\Sigma VV^TU^T) = \mathrm{tr}(BV^TU^T)$    (72)
where $V^TU^T$ has each singular value bounded by 1 (in fact, it is orthogonal). In other words, $\|B\|_* = \mathrm{tr}(BM)$ where $M$ is an orthogonal matrix.
For any two real-valued and possibly non-square matrices, we have
$\mathrm{Tr}(A^TB)\le\|A\|_F\,\|B\|_F$    (73)
and, by submultiplicativity, we have
$\|AB\|_F\le\|A\|\,\|B\|_F$    (74)
Writing $X\in\mathbb{R}^{d\times n}$ for the matrix with columns $x_1,\dots,x_n$ (so that $G=X^TAX$ and $\|X\|_F=\sqrt{n}$), and using Eq. (72) and the above two properties, it follows that
$\|G\|_* = \mathrm{Tr}(X^TAXM) = \mathrm{Tr}(AXMX^T) \le \|X^TA^T\|_F\,\|MX^T\|_F = \|AX\|_F\,\|MX^T\|_F \le \|A\|\,\|X\|_F\,\|M\|\,\|X\|_F = n\,\|A\|.$
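The small numpy check below (our own illustration, with arbitrary random dimensions) verifies the inequality of Lemma 56 numerically: for unit-norm columns $x_i$, the spectral norm of $A$ is at least the nuclear norm of $G=X^TAX$ divided by $n$.

import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 40
A = rng.normal(size=(d, d))
X = rng.normal(size=(d, n))
X /= np.linalg.norm(X, axis=0, keepdims=True)    # make every column unit-norm

G = X.T @ A @ X
spectral_A = np.linalg.norm(A, ord=2)
nuclear_G = np.linalg.norm(G, ord='nuc')
print(spectral_A, nuclear_G / n, spectral_A >= nuclear_G / n - 1e-9)   # True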