

Discrete-time signatures and randomness in reservoir computing

Christa Cuchiero, Lukas Gonon, Lyudmila Grigoryeva, Juan-Pablo Ortega, and Josef Teichmann

Abstract—A new explanation of the geometric nature of the reservoir computing phenomenon is presented. Reservoir computing is understood in the literature as the possibility of approximating input/output systems with randomly chosen recurrent neural systems and a trained linear readout layer. Light is shed on this phenomenon by constructing what are called strongly universal reservoir systems as random projections of a family of state-space systems that generate Volterra series expansions. This procedure yields a state-affine reservoir system with randomly generated coefficients in a dimension that is logarithmically reduced with respect to the original system. This reservoir system is able to approximate any element in the class of fading memory filters just by training a different linear readout for each different filter. Explicit expressions for the probability distributions needed in the generation of the projected reservoir system are stated, and bounds for the committed approximation error are provided.

Index Terms—Reservoir computing, recurrent neural network, state-affine system, SAS, signature state-affine system, SigSAS, echo state network, ESN, Johnson-Lindenstrauss Lemma, Volterra series, machine learning.

I. INTRODUCTION

Many dynamical problems in engineering, signal processing, forecasting, time series analysis, recurrent neural networks, or control theory can be described using input/output (IO) systems. These mathematical objects establish a functional link that describes the relation between the time evolution of one or several explanatory variables (the input) and a second collection of dependent or explained variables (the output).

A generic question in all those fields is to determine the IO system underlying an observed phenomenon. This is the so-called system identification problem. For this purpose, first principles coming from physics or chemistry can be invoked when either these are known or the setup is simple enough to apply them. In complex situations, in which access to all the variables that determine the behavior of the system is difficult or impossible, or when a precise mathematical relation between input and output is not known, it has proved more efficient to carry out the system identification using generic families of models with strong approximation abilities that are estimated using observed data. This approach, which we refer to as empirical system identification, has been developed using different techniques coming simultaneously from engineering, statistics, and computer science.

C. Cuchiero is affiliated with the University of Vienna, Austria. L. Gonon is with the Ludwig-Maximilians-Universität München, Germany. L. Grigoryeva is with the Universität Konstanz, Germany. J.-P. Ortega is with the Nanyang Technological University, Singapore. J. Teichmann's affiliation is ETH Zürich, Switzerland.

In this paper, we focus on a particularly promising strategy for empirical system identification known as reservoir computing (RC). Reservoir computing capitalizes on the revolutionary idea that there are learning systems that attain universal approximation properties without the need to estimate all their parameters using, for instance, supervised learning. More specifically, RC can be seen as a recurrent neural network approach to modeling IO systems using state-space representations in which
• the state equation is randomly generated, sometimes with sparsity features, and
• only the (usually very simple) functional form of the observation equation is tailored to the specific problem using observed data.
RC can be found in the literature under other denominations like Liquid State Machines [1]–[5] and is represented by various learning paradigms, with Echo State Networks (ESNs) [6]–[8] being a particularly important example.

RC has shown superior performance in many forecasting and classification engineering tasks (see [9]–[12] and references therein) and unprecedented abilities in the learning of the attractors of complex nonlinear infinite-dimensional dynamical systems [8], [13]–[15]. Additionally, RC implementations with dedicated hardware have been designed and built (see, for instance, [16]–[24]) that exhibit information processing speeds that largely outperform standard Turing-type computers.

The most far-reaching and radical innovation in the RC approach is the use of untrained, randomly generated, and sometimes sparse state maps. This circumvents well-known difficulties in the training of generic recurrent neural networks arising from bifurcation phenomena [25], which, despite recent progress in the regularization and training of deep RNN structures (see, for instance, [26]–[28] and references therein), render classical gradient descent methods non-convergent.

Randomization has already been successfully applied in a static setup using neural networks with randomized weights, in particular in seminal works on random feature models [29] and Extreme Learning Machines [30]. This built-in randomness makes reservoir models different from other conventional approaches where state-space systems appear. For instance, Kalman filtering [31] has been used for decades in signal processing and, in that case, both linear and nonlinear [32], [33] Kalman techniques hinge on the idea of designing the state map so that the a posteriori residual errors have minimal variance. This requires a significant computational effort related to recursive parameter estimation that is not needed for RC systems. In the dynamical systems context, an important result in [34] shows that randomly drawn ESNs can be trained, by exclusively optimizing a linear readout using generic one-dimensional observations of a given invertible and differentiable dynamical system, to produce dynamics that are topologically conjugate to that given system; in other words, randomly generated ESNs are capable of learning the attractors of invertible dynamical systems. More generally, the approximation capabilities of randomly generated ESNs have been established in [35] in the more general setup of IO systems. There, approximation bounds have been provided in terms of their architecture parameters.

In this paper, we provide additional insight into the randomization question for another family of RC systems, namely, the non-homogeneous state-affine systems (SAS). These systems were introduced and proved to be universal approximants in [36], [37]. We show here that they also have this universality property when they are randomly generated. The approach pursued in this work is considerably different from the one in the above-cited references and is based on the following steps. First, we consider causal and time-invariant analytic filters with semi-infinite inputs. The Taylor series expansions of these objects coincide with what is known as their Volterra series representations. Second, we show that the truncated Volterra series representation (whose associated truncation error can be quantified) admits a state-space representation with linear readouts in a (potentially) high-dimensional, adequately constructed tensor space. We refer to this system as the signature state-affine system (SigSAS): on the one hand, it belongs to the SAS family and, on the other hand, it shares fundamental properties with the so-called signature process from the (continuous-time) theory of rough paths, which inspired the title of the paper.

Rough paths theory, as introduced by T. Lyons in the seminal work [38], was initially developed to deal with controlled differential equations driven by rough signals in a pathwise way. These equations can be seen as a continuous-time analogue of time series models, where the rough signals play the role of the model innovations. The key object in this theory is the signature, which was first studied by K. Chen [39], [40] and consists in enhancing the rough input with additional curves (satisfying certain algebraic properties) mimicking what in the smooth case corresponds to iterated integrals of the curve with itself.

It is a deep mathematical fact that unique solutions of the rough differential equation exist and are a continuous map of the signature (in appropriately chosen topologies). Surprisingly, this nonlinear continuous map can be arbitrarily well approximated by linear maps of the signature. More generally, on compact sets of so-called "non tree-like" paths (see [41] for a precise definition), every continuous path functional (with respect to a certain p-variation norm) can be uniformly approximated by a linear function of the signature. Indeed, linear functionals of the signature form a point-separating algebra on sets of "non tree-like" paths, which by the Stone-Weierstrass Theorem then yields a universal approximation theorem for general path functionals (see, for instance, [42]). Rough path theory has been substantially extended by Martin Hairer [43] towards the theory of regularity structures and is nowadays the tool to analyze deep analytic properties of continuous-time IO systems.

From a machine learning perspective, the signature can be thought of as a feature map capturing all specific characteristics of a given path. More precisely, it serves as a linear regression basis and can thus be interpreted as an abstract reservoir (for the moment without random specifications) for solutions of rough differential equations. These appealing properties have made signature methods highly popular in machine learning applications, both for streamed data (in particular, in finance) and for complex classification tasks. For inspiring examples of the rapidly growing literature on machine learning using signature methods we refer to [44]–[51] and references therein.

Returning to the SAS family, we will show that the solutions of the SigSAS introduced in this paper share exactly the two crucial properties which make the signature central in rough path theory: first, the SigSAS solutions fully characterize the input sequences and, second, any (sufficiently regular) IO system can be written as a linear map of the SigSAS system. These properties have been exploited in the continuous-time setup in [52].

Finally, we use the Johnson-Lindenstrauss Lemma [53] to prove that a random projection of the SigSAS system yields a smaller-dimensional SAS system with random matrix coefficients (that can be chosen to be sparse) that approximates the original system. Moreover, this constructive procedure gives us full knowledge of the law that needs to be used to draw the entries of the low-dimensional approximating SAS system, without ever having to use the original large-dimensional SigSAS, which amounts to a form of information compression with efficient reconstruction in this setup [54]. An important feature of the dimension-reduced, randomly drawn SAS system is that it serves as a universal approximator for any reasonably behaved IO system and that only the linear output layer that is applied to it depends on the individual system that needs to be learnt. We refer to this feature as the strong universality property.

This approach to the approximation problem in recurrent neural networks using randomized systems provides a new explanation of the geometric nature of the reservoir computing phenomenon. The results in the following pages show that randomly generated SAS reservoir systems approximate well any sufficiently regular IO system just by tuning a linear readout, because they coincide with an error-controlled random projection of a higher-dimensional Volterra series expansion of that system.

II. TRUNCATED VOLTERRA REPRESENTATIONS OF ANALYTIC FILTERS

We start by describing the setup that we shall be working in, together with the main approximation tool that we will be using later on in the paper, namely, Volterra series expansions. Details on the concepts introduced in the following paragraphs can be found in, for instance, [55]–[57] and references therein.

All along the paper, the symbol $\mathbb{Z}$ denotes the set of all integers and $\mathbb{Z}_-$ stands for the set of negative integers with the zero element included. Let $D_d \subset \mathbb{R}^d$ and $D_m \subset \mathbb{R}^m$. We refer to maps of the type $U\colon (D_d)^{\mathbb{Z}} \longrightarrow (D_m)^{\mathbb{Z}}$ between infinite sequences with values in $D_d$ and $D_m$, respectively, as filters, operators, or discrete-time input/output systems, and to those like $H\colon (D_d)^{\mathbb{Z}} \longrightarrow D_m$ (or $H\colon (D_d)^{\mathbb{Z}_-} \longrightarrow D_m$) as $\mathbb{R}^m$-valued functionals. These definitions will sometimes be extended to accommodate situations where the domains and the targets of the filters are not necessarily product spaces but just arbitrary subsets of $(\mathbb{R}^d)^{\mathbb{Z}}$ and $(\mathbb{R}^m)^{\mathbb{Z}}$ like, for instance, $\ell^\infty(\mathbb{R}^d)$ and $\ell^\infty(\mathbb{R}^m)$.

A filter $U\colon (D_d)^{\mathbb{Z}} \longrightarrow (D_m)^{\mathbb{Z}}$ is called causal when for any two elements $\mathbf{z}, \mathbf{w} \in (D_d)^{\mathbb{Z}}$ that satisfy $z_\tau = w_\tau$ for any $\tau \le t$, for a given $t \in \mathbb{Z}$, we have that $U(\mathbf{z})_t = U(\mathbf{w})_t$. Let $T_\tau\colon (D_d)^{\mathbb{Z}} \longrightarrow (D_d)^{\mathbb{Z}}$ be the time delay operator defined by $T_\tau(\mathbf{z})_t := z_{t-\tau}$. The filter $U$ is called time-invariant (TI) when it commutes with the time delay operator, that is, $T_\tau \circ U = U \circ T_\tau$ for any $\tau \in \mathbb{Z}$ (in this expression, the two operators $T_\tau$ have to be understood as defined on the appropriate sequence spaces). There is a bijection between causal and time-invariant filters and functionals. We denote by $U_H\colon (D_d)^{\mathbb{Z}} \longrightarrow (D_m)^{\mathbb{Z}}$ (respectively, $H_U\colon (D_d)^{\mathbb{Z}_-} \longrightarrow D_m$) the filter (respectively, the functional) associated to the functional $H\colon (D_d)^{\mathbb{Z}_-} \longrightarrow D_m$ (respectively, the filter $U\colon (D_d)^{\mathbb{Z}} \longrightarrow (D_m)^{\mathbb{Z}}$). Causal and time-invariant filters are fully determined by their restriction to semi-infinite sequences, that is, $U\colon (D_d)^{\mathbb{Z}_-} \longrightarrow (D_m)^{\mathbb{Z}_-}$, which will be denoted using the same symbol.

In most cases, we work in the situation in which $D_d$ and $D_m$ are compact and the sequence spaces $(D_d)^{\mathbb{Z}_-}$ and $(D_m)^{\mathbb{Z}_-}$ are endowed with the product topology. It can be shown (see [55]) that this topology is equivalent to the norm topology induced by any weighted norm defined by $\|\mathbf{z}\|_w := \sup_{t \in \mathbb{Z}_-}\{\|z_t\| w_{-t}\}$, $\mathbf{z} \in (D_d)^{\mathbb{Z}_-}$, where $w\colon \mathbb{N} \longrightarrow (0,1]$ is an arbitrary strictly decreasing sequence (we call it a weighting sequence) with zero limit and such that $w_0 = 1$. Filters and functionals that are continuous with respect to this topology are said to have the fading memory property (FMP).

A particularly important class of IO systems is given by those generated by state-space systems, in which the output $\mathbf{y} \in (D_m)^{\mathbb{Z}_-}$ is obtained out of the input $\mathbf{z} \in (D_d)^{\mathbb{Z}_-}$ as the solution of the equations
$$x_t = F(x_{t-1}, z_t), \qquad (1)$$
$$y_t = h(x_t), \qquad (2)$$
where $F\colon D_N \times D_d \longrightarrow D_N$ is the so-called state map, for some $D_N \subset \mathbb{R}^N$, $N \in \mathbb{N}$, and $h\colon D_N \longrightarrow D_m$ is the readout or observation map. When for any input $\mathbf{z} \in (D_d)^{\mathbb{Z}_-}$ there is only one output $\mathbf{y} \in (D_m)^{\mathbb{Z}_-}$ that satisfies (1)-(2), we say that this state-space system has the echo state property (ESP), in which case it determines a unique filter $U^F_h\colon (D_d)^{\mathbb{Z}_-} \longrightarrow (D_m)^{\mathbb{Z}_-}$. When the ESP holds at the level of the state equation (1), then it determines another filter $U^F\colon (D_d)^{\mathbb{Z}_-} \longrightarrow (D_N)^{\mathbb{Z}_-}$ and then $U^F_h = h(U^F)$. The filters $U^F_h$ and $U^F$, when they exist, are automatically causal and TI (see [55]). The continuity and differentiability properties of the state and observation maps $F$ and $h$ imply continuity and differentiability for $U^F_h$ and $U^F$ under very general hypotheses; see [56] for an in-depth study of this question.
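The role of the contraction hypothesis behind the ESP can be illustrated numerically. The following sketch assumes, purely for illustration, a state map $F(x, z) = \tanh(Ax + cz)$ with $|||A||| < 1$ and a linear readout; it iterates the recursion (1)-(2) from two different initial conditions and checks that the output trajectories coincide after a washout period, i.e., that the input determines a unique solution.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 20, 200                          # state dimension and input length

# Illustrative state map F(x, z) = tanh(A x + c z); since tanh is 1-Lipschitz,
# F is a contraction in x whenever the spectral norm of A is below one.
A = rng.normal(size=(N, N))
A *= 0.9 / np.linalg.norm(A, 2)         # rescale so that |||A||| = 0.9
c = rng.normal(size=N)
W = rng.normal(size=N)                  # linear readout h(x) = W . x

def run_filter(z, x0):
    """Iterate x_t = F(x_{t-1}, z_t) and collect the outputs y_t = h(x_t)."""
    x, ys = x0, []
    for zt in z:
        x = np.tanh(A @ x + c * zt)
        ys.append(W @ x)
    return np.array(ys)

z = rng.uniform(-1, 1, size=T)
y1 = run_filter(z, x0=np.zeros(N))
y2 = run_filter(z, x0=rng.normal(size=N))

# The contraction forces the two state trajectories together, so the outputs
# agree after a washout: this is the echo state property in action.
print(abs(y1[-1] - y2[-1]))
```

The printed difference is of the order of the contraction constant raised to the trajectory length, which illustrates why washout periods are standard practice in RC implementations.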

We denote by $\|\cdot\|$ the Euclidean norm if not stated otherwise and use the symbol $|||\cdot|||$ for the operator norm with respect to the 2-norms in the target and the domain spaces. Additionally, for any $\mathbf{z} \in (\mathbb{R}^d)^{\mathbb{Z}_-}$ we define the $p$-norms $\|\mathbf{z}\|_p := \left(\sum_{t \in \mathbb{Z}_-} \|z_t\|^p\right)^{1/p}$, for $1 \le p < \infty$, and $\|\mathbf{z}\|_\infty := \sup_{t \in \mathbb{Z}_-}\{\|z_t\|\}$ for $p = \infty$. Given $M > 0$, we denote $K_M := \{\mathbf{z} \in (\mathbb{R}^d)^{\mathbb{Z}_-} \mid \|z_t\| \le M \text{ for all } t \in \mathbb{Z}_-\}$. It is easy to see that $K_M = \overline{B}_M \subset \ell^\infty_-(\mathbb{R}^d)$, with $B_M := B_{\|\cdot\|_\infty}(0, M)$ and $\ell^\infty_-(\mathbb{R}^d) := \{\mathbf{z} \in (\mathbb{R}^d)^{\mathbb{Z}_-} \mid \|\mathbf{z}\|_\infty < \infty\}$. We define $\widetilde{B}_M := B_M \cap \ell^1_-(\mathbb{R}^d)$ with $\ell^1_-(\mathbb{R}^d) := \{\mathbf{z} \in (\mathbb{R}^d)^{\mathbb{Z}_-} \mid \|\mathbf{z}\|_1 < \infty\}$ and use the same symbol $\widetilde{B}_M$ whenever $d = 1$. Additionally, we will write $L(V, W)$ to refer to the space of linear maps between the real vector spaces $V$ and $W$. The following statement is the main approximation result that will be used in the paper.

Theorem II.1. Let $M, L > 0$ and let $U\colon K_M \subset \ell^\infty_-(\mathbb{R}^d) \longrightarrow K_L \subset \ell^\infty_-(\mathbb{R}^m)$ be a causal and time-invariant fading memory filter whose restriction $U|_{B_M}$ is analytic as a map between open sets in the Banach spaces $\ell^\infty_-(\mathbb{R}^d)$ and $\ell^\infty_-(\mathbb{R}^m)$ and satisfies $U(\mathbf{0}) = \mathbf{0}$. Then, for any $\mathbf{z} \in \widetilde{B}_M$ there exists a Volterra series representation of $U$ given by
$$U(\mathbf{z})_t = \sum_{j=1}^{\infty} \sum_{m_1 = -\infty}^{0} \cdots \sum_{m_j = -\infty}^{0} g_j(m_1, \ldots, m_j)\left(z_{m_1 + t} \otimes \cdots \otimes z_{m_j + t}\right), \qquad (3)$$
with $t \in \mathbb{Z}_-$ and where the map $g_j\colon (\mathbb{Z}_-)^j \longrightarrow L(\mathbb{R}^d \otimes \cdots \otimes \mathbb{R}^d, \mathbb{R}^m)$ is given by
$$g_j(m_1, \ldots, m_j)\left(e_{i_1} \otimes \cdots \otimes e_{i_j}\right) = \frac{1}{j!}\, D^j H_U(\mathbf{0})\left(\mathbf{e}^{i_1}_{m_1}, \ldots, \mathbf{e}^{i_j}_{m_j}\right), \qquad (4)$$
where, for any $\mathbf{z}_0$ in some open subset of $\ell^\infty_-(\mathbb{R}^d)$, $D^j H_U(\mathbf{z}_0)$ with $j \ge 1$ denotes the $j$-th order Fréchet differential at $\mathbf{z}_0$ of the functional $H_U$ associated to the filter $U$, $\{e_1, \ldots, e_d\}$ is the canonical basis of $\mathbb{R}^d$, and the sequences $\mathbf{e}^{i_l}_{m_k} \in \ell^\infty_-(\mathbb{R}^d)$ are defined by:
$$\left(\mathbf{e}^{i_l}_{m_k}\right)_t := \begin{cases} e_{i_l} \in \mathbb{R}^d, & \text{if } t = m_k, \\ \mathbf{0}, & \text{otherwise.} \end{cases}$$
Moreover, there exists a monotonically decreasing sequence $w^U$ with zero limit such that, for any $p, l \in \mathbb{N}$,
$$\left\| U(\mathbf{z})_t - \sum_{j=1}^{p} \sum_{m_1 = -l}^{0} \cdots \sum_{m_j = -l}^{0} g_j(m_1, \ldots, m_j)\left(z_{m_1 + t} \otimes \cdots \otimes z_{m_j + t}\right) \right\| \le w^U_l + L\left(1 - \frac{\|\mathbf{z}\|_\infty}{M}\right)^{-1}\left(\frac{\|\mathbf{z}\|_\infty}{M}\right)^{p+1}. \qquad (5)$$
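The finite sum on the left-hand side of (5) can be evaluated directly for scalar inputs. The sketch below uses arbitrary randomly drawn kernels $g_j$ as placeholders (in the theorem these come from the Fréchet differentials of $H_U$ at $\mathbf{0}$) to compute the truncated Volterra sum of order $p$ with lags down to $-l$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
p, l = 2, 3                      # truncation order and memory length

# Placeholder scalar kernels g_j(m_1, ..., m_j), indexed by the lags |m_i|;
# in Theorem II.1 they come from the Frechet differentials of H_U at zero.
kernels = {j: rng.normal(size=(l + 1,) * j) for j in range(1, p + 1)}

def truncated_volterra(z, t):
    """Finite Volterra sum of order p with lags m_1, ..., m_j in {-l, ..., 0}."""
    total = 0.0
    for j in range(1, p + 1):
        for m in itertools.product(range(-l, 1), repeat=j):
            g = kernels[j][tuple(-mi for mi in m)]
            total += g * np.prod([z[mi + t] for mi in m])
    return total

# Semi-infinite inputs are indexed by t <= 0; negative array indices address
# the most recent input values, so t = -1 reads the last l + 1 entries.
z = rng.uniform(-0.5, 0.5, size=50)
print(truncated_volterra(z, t=-1))
```

The nested sums grow as $(l+1)^j$ per order $j$, which is precisely the dimensionality problem that the tensor-space representation of the next subsection organizes and that Section III compresses.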

A. The signature state-affine system (SigSAS)

We now show that the filter obtained out of the truncated Volterra series expansion in the expression (5) can be written down as the unique solution of a non-homogeneous state-affine system (SAS) with linear readouts that, as we shall show in Section II-B, has particularly strong universal approximation properties. We first briefly recall how the SAS family is constructed.


Let $\alpha = (\alpha_1, \ldots, \alpha_d)^\top \in \mathbb{N}^d$ and $\mathbf{z} = (z_1, \ldots, z_d)^\top \in \mathbb{R}^d$ and define the monomials $\mathbf{z}^\alpha := z_1^{\alpha_1} \cdots z_d^{\alpha_d}$. We denote by $\mathbb{M}_{N_1, N_2}$ the space of real $N_1 \times N_2$ matrices, with $N_1, N_2 \in \mathbb{N}$, and use $\mathbb{M}_{N_1, N_2}[\mathbf{z}]$ to refer to the space of polynomials in $\mathbf{z} \in \mathbb{R}^d$ with matrix coefficients in $\mathbb{M}_{N_1, N_2}$, that is, the set of elements $p$ of the form
$$p(\mathbf{z}) = \sum_{\alpha \in V_p} \mathbf{z}^\alpha A_\alpha,$$
with $V_p \subset \mathbb{N}^d$ a finite subset and $A_\alpha \in \mathbb{M}_{N_1, N_2}$ the matrix coefficients. A state-affine system (SAS) is given by

$$x_t = p(z_t)\, x_{t-1} + q(z_t), \qquad y_t = W x_t, \qquad (6)$$
where $p \in \mathbb{M}_{N,N}[\mathbf{z}]$ and $q \in \mathbb{M}_{N,1}[\mathbf{z}]$ are polynomials with matrix and vector coefficients, respectively, and $W \in \mathbb{M}_{m,N}$. If we consider inputs in the set $K_M$ and the $p$ and $q$ in the state-space system (6) are such that
$$M_p := \sup_{\mathbf{z} \in \overline{B}_{\|\cdot\|}(0,M)} \{|||p(\mathbf{z})|||\} < 1, \qquad M_q := \sup_{\mathbf{z} \in \overline{B}_{\|\cdot\|}(0,M)} \{|||q(\mathbf{z})|||\} < \infty,$$
where $\overline{B}_{\|\cdot\|}(0, M)$ denotes the closed ball in $\mathbb{R}^d$ of radius $M$ and center $\mathbf{0}$ with respect to the Euclidean norm, then a unique state filter $U^{p,q}\colon K_M \longrightarrow K_L$ can be associated to it, with $L := M_q/(1 - M_p)$. It has been shown in [36], [37] that SAS systems are universal approximants in the fading memory and in the $L^p$-integrable categories, in the sense that, given a filter in any of those two categories, there exists a SAS system of type (6) that approximates it uniformly or in the $L^p$ sense.
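A minimal numerical sketch of a SAS of type (6), with degree-one polynomial coefficients $p(z) = A_0 + zA_1$ and $q(z) = b_0 + zb_1$ chosen at random purely for illustration and rescaled so that $M_p < 1$, which makes the state sequence, and hence the output, bounded as described above:

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, M = 10, 1, 1.0             # state dimension, output dimension, input bound

# p(z) = A0 + z*A1 and q(z) = b0 + z*b1, rescaled so that
# M_p = sup_{|z| <= M} |||p(z)||| <= 0.9 < 1.
A0, A1 = rng.normal(size=(N, N)), rng.normal(size=(N, N))
scale = np.linalg.norm(A0, 2) + M * np.linalg.norm(A1, 2)
A0, A1 = 0.9 * A0 / scale, 0.9 * A1 / scale
b0, b1 = rng.normal(size=N), rng.normal(size=N)
W = rng.normal(size=(m, N))      # linear readout

def sas(z):
    """Run x_t = p(z_t) x_{t-1} + q(z_t) from x = 0 and return y_T = W x_T."""
    x = np.zeros(N)
    for zt in z:
        x = (A0 + zt * A1) @ x + (b0 + zt * b1)
    return W @ x

z = rng.uniform(-M, M, size=300)
# The contraction keeps the states inside the ball of radius L = M_q/(1 - M_p):
Mq = np.linalg.norm(b0) + M * np.linalg.norm(b1)
print(np.linalg.norm(sas(z)) <= np.linalg.norm(W, 2) * Mq / (1 - 0.9))
```

In a reservoir computing application, only the readout $W$ would be trained on data; the random matrices $A_0$, $A_1$, $b_0$, $b_1$ are left untouched.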

The signature state-affine system that we construct in this section exhibits what we call the strong universality property. This means that the state equation of this state-space representation is the same for any fading memory filter that is being approximated; it is only the linear readout that changes. In other words, we provide a result that yields the approximation (as accurate as desired) of any fading memory IO system as the linear readout of the solution of a fixed non-homogeneous SAS system that does not depend on the filter being approximated.

Since the important property that we just described is reminiscent of an analogous feature of the signature process in the context of the representation of the solutions of controlled stochastic differential equations [52], we shall refer to this state system as the signature SAS (SigSAS) system.

Before we proceed, we need to introduce some notation. First, for any $l, d \in \mathbb{N}$, we denote by $T^l(\mathbb{R}^d)$ the space of tensors of order $l$ on $\mathbb{R}^d$, that is,
$$T^l(\mathbb{R}^d) := \left\{ \sum_{i_1, \ldots, i_l = 1}^{d} a_{i_1, \ldots, i_l}\, e_{i_1} \otimes \cdots \otimes e_{i_l} \;\middle|\; a_{i_1, \ldots, i_l} \in \mathbb{R} \right\}.$$
The tensor space $T^l(\mathbb{R}^d)$ will be understood as a normed space with a crossnorm [58] that we shall leave unspecified for the time being. We shall be using an order-lowering map $\pi_l\colon T^{l+1}(\mathbb{R}^d) \longrightarrow T^l(\mathbb{R}^d)$ that, for any vector $v := \sum_{i_1, \ldots, i_{l+1} = 1}^{d} a_{i_1, \ldots, i_{l+1}}\, e_{i_1} \otimes \cdots \otimes e_{i_{l+1}} \in T^{l+1}(\mathbb{R}^d)$, is defined as
$$\pi_l(v) := \sum_{i_2, \ldots, i_{l+1} = 1}^{d} a_{1, i_2, \ldots, i_{l+1}}\, e_{i_2} \otimes \cdots \otimes e_{i_{l+1}} \in T^l(\mathbb{R}^d).$$
The order-lowering map is linear and its operator norm satisfies $|||\pi_l||| = 1$.

We shall restrict the presentation to one-dimensional inputs, that is, we consider input sequences $\mathbf{z} \in K_M \subset \ell^\infty_-(\mathbb{R})$. Now, for fixed $l, p \in \mathbb{N}$, we define, for any $\mathbf{z} \in K_M$ and $t \in \mathbb{Z}_-$,
$$\widetilde{z}_t := \sum_{i=1}^{p+1} z_t^{i-1} e_i \in \mathbb{R}^{p+1} \quad \text{and} \quad \widehat{z}_t := \widetilde{z}_{t-l} \otimes \cdots \otimes \widetilde{z}_t. \qquad (7)$$
Note that $\widetilde{z}_t$ is the Vandermonde vector [59] associated to $z_t$ and that $\widehat{z}_t$ is a tensor in $T^{l+1}(\mathbb{R}^{p+1})$ whose components in the canonical basis are all the monomials in the variables $z_t, \ldots, z_{t-l}$ that contain powers up to order $p$ in each of those variables, namely
$$\widehat{z}_t = \sum_{i_1, \ldots, i_{l+1} = 1}^{p+1} z_{t-l}^{i_1 - 1} \cdots z_t^{i_{l+1} - 1}\, e_{i_1} \otimes \cdots \otimes e_{i_{l+1}}.$$
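For concreteness, the objects in (7) and the order-lowering map $\pi_l$ can be realized with numpy arrays, mapping the basis vector $e_i$ to the 0-based index $i - 1$ so that index equals power (a sketch with illustrative values of $p$, $l$, and the inputs):

```python
import numpy as np

p, l = 2, 2                            # maximal power per variable and lag

def vander(z):
    """Vandermonde vector (1, z, ..., z^p) associated to a scalar input z."""
    return z ** np.arange(p + 1)

def z_hat(window):
    """Tensor z_hat_t in T^{l+1}(R^{p+1}) from the inputs (z_{t-l}, ..., z_t)."""
    out = vander(window[0])
    for z in window[1:]:
        out = np.multiply.outer(out, vander(z))
    return out

window = np.array([0.5, -1.0, 2.0])    # (z_{t-l}, ..., z_t) with l = 2
zh = z_hat(window)

# Component (i1, ..., i_{l+1}) stores the monomial z_{t-l}^{i1} ... z_t^{i_{l+1}},
# with the 0-based indices playing the role of the powers:
print(zh[1, 2, 1] == 0.5**1 * (-1.0)**2 * 2.0**1)

# The order-lowering map pi_l keeps the slice whose first basis index is e_1
# (index 0 in this convention), producing an order-l tensor:
pi_zh = zh[0]
print(pi_zh.shape)
```

The tensor `zh` has $(p+1)^{l+1}$ entries, which makes explicit the exponential growth in the lag $l$ that motivates the dimension reduction of Section III.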

Finally, given an arbitrarily chosen but fixed subset $I_0 \subset \{1, \ldots, p+1\}$ of cardinality larger than 1 that contains the element 1, we define:
$$\widehat{z}^0_t := \sum_{i \in I_0} z_t^{i-1}\, \underbrace{e_1 \otimes \cdots \otimes e_1}_{l\text{-times}} \otimes\, e_i \in T^{l+1}\left(\mathbb{R}^{p+1}\right). \qquad (8)$$

The next proposition introduces the SigSAS state system for fixed $l, p \in \mathbb{N}$, whose solution is used later on, in Theorem II.4, to represent the truncated Volterra series expansions in Theorem II.1 of polynomial degree $p$ and lag $-l$ (see expression (5)).

Proposition II.2 (The SigSAS system). Let $M > 0$ and let $l, p \in \mathbb{N}$. Let $0 < \lambda < \min\left\{1, 1/\sum_{j=0}^{p} M^j\right\}$. Consider the state system with uniformly bounded scalar inputs in $K_M = [-M, M]^{\mathbb{Z}_-}$ and states in $T^{l+1}(\mathbb{R}^{p+1})$ given by the recursion
$$x_t = \lambda\, \pi_l(x_{t-1}) \otimes \widetilde{z}_t + \widehat{z}^0_t. \qquad (9)$$
This state equation is induced by the state map $F^{\mathrm{SigSAS}}_{\lambda,l,p}\colon T^{l+1}(\mathbb{R}^{p+1}) \times \mathbb{R} \longrightarrow T^{l+1}(\mathbb{R}^{p+1})$ defined by
$$F^{\mathrm{SigSAS}}_{\lambda,l,p}(x, z) := \lambda\, \pi_l(x) \otimes \widetilde{z} + \widehat{z}^0, \qquad (10)$$
which is a contraction in the state variable with contraction constant
$$\lambda \widetilde{M} < 1, \quad \text{where } \widetilde{M} := \sum_{j=0}^{p} M^j, \qquad (11)$$
and hence restricts to a map $F^{\mathrm{SigSAS}}_{\lambda,l,p}\colon \overline{B}_{\|\cdot\|}(0, L) \times [-M, M] \longrightarrow \overline{B}_{\|\cdot\|}(0, L)$, with
$$L := \widetilde{M}/(1 - \lambda \widetilde{M}). \qquad (12)$$

This state system has the echo state and the fading memory properties, and its continuous, time-invariant, and causal associated filter $U^{\mathrm{SigSAS}}_{\lambda,l,p}\colon K_M \longrightarrow K_L \subset \left(T^{l+1}(\mathbb{R}^{p+1})\right)^{\mathbb{Z}_-}$ is given by:
$$U^{\mathrm{SigSAS}}_{\lambda,l,p}(\mathbf{z})_t = \frac{\lambda^{l+1}}{1-\lambda}\,\widehat{z}_t + \lambda^l\, \underbrace{\pi_l(\pi_l(\cdots(\pi_l}_{l\text{-times}}(\widehat{z}^0_{t-l}) \otimes \widetilde{z}_{t-(l-1)}) \otimes \cdots) \otimes \widetilde{z}_{t-1}) \otimes \widetilde{z}_t + \cdots + \lambda\,\pi_l(\widehat{z}^0_{t-1}) \otimes \widetilde{z}_t + \widehat{z}^0_t. \qquad (13)$$
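The recursion (9) is straightforward to iterate. The sketch below implements the state map (10) with numpy tensors, using the illustrative choices $M = 1$, $p = l = 2$, and $I_0 = \{1, 2\}$, and verifies that the state norms indeed remain inside the ball of radius $L$ from (12).

```python
import numpy as np

rng = np.random.default_rng(3)
M, p, l = 1.0, 2, 2
M_tilde = sum(M**j for j in range(p + 1))        # (11): M~ = 1 + M + ... + M^p
lam = 0.9 * min(1.0, 1.0 / M_tilde)              # 0 < lambda < min{1, 1/M~}

def vander(z):
    return z ** np.arange(p + 1)                 # Vandermonde vector of z

def z_hat0(z, I0=(0, 1)):
    """The term (8); the 0-based index i carries the power z^i (= z^{i-1} e_i)."""
    out = np.zeros((p + 1,) * (l + 1))
    for i in I0:
        out[(0,) * l + (i,)] = z**i
    return out

def step(x, z):
    """SigSAS state map (10): lambda * pi_l(x) (tensor) vander(z) + z_hat0."""
    return lam * np.multiply.outer(x[0], vander(z)) + z_hat0(z)

x = np.zeros((p + 1,) * (l + 1))
for z in rng.uniform(-M, M, size=200):
    x = step(x, z)

L = M_tilde / (1 - lam * M_tilde)                # the bound (12)
print(np.linalg.norm(x.ravel()), "<=", L)
```

Here `x[0]` realizes $\pi_l$ as the slice with first basis index $e_1$, and the tensor product with the Vandermonde vector restores the order $l+1$.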

Remark II.3. The state equation (9) is indeed a SAS with states defined in $T^{l+1}(\mathbb{R}^{p+1})$, as it has the same form as the first equality in (6). Indeed, this equation can be written as $x_t = p(z_t)\, x_{t-1} + q(z_t)$ with $p(z_t)$ and $q(z_t)$ the polynomials in $z_t$ with coefficients in $L(T^{l+1}(\mathbb{R}^{p+1}), T^{l+1}(\mathbb{R}^{p+1}))$ and $T^{l+1}(\mathbb{R}^{p+1})$, respectively, given by:
$$p(z_t)\, x_{t-1} := \lambda\, \pi_l(x_{t-1}) \otimes \widetilde{z}_t = \sum_{i=1}^{p+1} z_t^{i-1}\left(\lambda\, \pi_l(x_{t-1}) \otimes e_i\right),$$
$$q(z_t) := \widehat{z}^0_t = \sum_{i \in I_0} z_t^{i-1}\, e_1 \otimes \cdots \otimes e_1 \otimes e_i.$$

B. The SigSAS approximation theorem

As we already pointed out, $\widehat{z}_t$ is a vector in $T^{l+1}(\mathbb{R}^{p+1})$ whose components in the canonical basis are all the monomials in the variables $z_t, \ldots, z_{t-l}$ that contain powers up to order $p$ in each of those variables. Moreover, it is easy to see that all the other summands in the expression (13) of the filter $U^{\mathrm{SigSAS}}_{\lambda,l,p}$ are proportional (with a positive constant) to monomials already contained in $\widehat{z}_t$. This implies the existence of a linear map $A_{\lambda,l,p} \in L(T^{l+1}(\mathbb{R}^{p+1}), T^{l+1}(\mathbb{R}^{p+1}))$ with an invertible matrix representation with non-negative entries such that
$$U^{\mathrm{SigSAS}}_{\lambda,l,p}(\mathbf{z})_t = A_{\lambda,l,p}\, \widehat{z}_t. \qquad (14)$$
In the sequel, we will denote the matrix representation of $A_{\lambda,l,p}$ using the same symbol $A_{\lambda,l,p} \in \mathbb{M}_{N,N}$, $N := (p+1)^{l+1}$. This observation, together with Theorem II.1, can be used to prove the following result.

Theorem II.4. Let $M, L > 0$ and let $U\colon K_M \subset \ell^\infty_-(\mathbb{R}) \longrightarrow K_L \subset \ell^\infty_-(\mathbb{R}^m)$ be a causal and time-invariant fading memory filter whose restriction $U|_{B_M}$ is analytic as a map between open sets in the Banach spaces $\ell^\infty_-(\mathbb{R})$ and $\ell^\infty_-(\mathbb{R}^m)$ and satisfies $U(\mathbf{0}) = \mathbf{0}$. Then, there exists a monotonically decreasing sequence $w^U$ with zero limit such that, for any $p, l \in \mathbb{N}$ and any $0 < \lambda < \min\left\{1, 1/\sum_{j=0}^{p} M^j\right\}$, there exists a linear map $W \in L(T^{l+1}(\mathbb{R}^{p+1}), \mathbb{R}^m)$ such that, for any $\mathbf{z} \in \widetilde{B}_M$:
$$\left\| U(\mathbf{z})_t - W\, U^{\mathrm{SigSAS}}_{\lambda,l,p}(\mathbf{z})_t \right\| \le w^U_l + L\left(1 - \frac{\|\mathbf{z}\|_\infty}{M}\right)^{-1}\left(\frac{\|\mathbf{z}\|_\infty}{M}\right)^{p+1}. \qquad (15)$$

Remark II.5. Theorem II.4 establishes the strong universality of the SigSAS system in the sense that the state equation of this system is the same for any fading memory filter $U$ that is being approximated; it is only the linear readout that changes. Nevertheless, we emphasize that the quality of the approximation is not filter-independent, as the decreasing sequence $w^U$ in the bound (15) depends on how fast the filter $U$ "forgets" past inputs.

Remark II.6. The analyticity hypothesis in the statement of Theorem II.4 can be dropped by using the fact that finite-order and finite-memory Volterra series are universal approximators in the fading memory category (see [60] and [56, Theorem 31]). In that situation, the bound for the truncation error in (15) does not necessarily apply anymore, in particular its second summand, which is intrinsically linked to analyticity. A generalized bound can be formulated in that case using arguments along the lines of those found in [35].

III. JOHNSON-LINDENSTRAUSS REDUCTION OF THE SIGSAS REPRESENTATION

The price to pay for the strong universality property exhibited by the signature state-affine system constructed in the previous section is the potentially large dimension of the tensor space in which this state-space representation is defined. In this section we address this problem by proposing a dimension reduction strategy that consists in using the random projections in the Johnson-Lindenstrauss Lemma [53] in order to construct a smaller-dimensional SAS system with random matrix coefficients (that can be chosen to be sparse). The results contained in the next subsections quantify the increase in approximation error committed when applying this dimensionality reduction strategy.

We start by introducing the Johnson-Lindenstrauss (JL) Lemma [53] and some properties that are needed later on in the presentation. Following this, we spell out how to use it for the dimension reduction of state-space systems in general and of the SigSAS representation in particular.

A. The JL Lemma and approximate projections

Given an $N$-dimensional Hilbert space $(V, \langle \cdot, \cdot \rangle)$ and an $n$-point subset $Q$ of $V$, the Johnson-Lindenstrauss (JL) Lemma [53], [61] guarantees, for any $0 < \varepsilon < 1$, the existence of a linear map $f\colon V \longrightarrow \mathbb{R}^k$, with $k \in \mathbb{N}$ satisfying
$$k \ge \frac{24 \log n}{3\varepsilon^2 - 2\varepsilon^3}, \qquad (16)$$
that respects $\varepsilon$-approximately the distances between the points in the set $Q$. More specifically,
$$(1 - \varepsilon)\|v_1 - v_2\|^2 \le \|f(v_1) - f(v_2)\|^2 \le (1 + \varepsilon)\|v_1 - v_2\|^2, \qquad (17)$$
for any $v_1, v_2 \in Q$. The norm $\|\cdot\|$ in $\mathbb{R}^k$ comes from an inner product that makes it into a Hilbert space or, in other words, it satisfies the parallelogram identity. This remarkable result is even more striking in connection with further developments that guarantee that the linear map $f$ can be randomly chosen [61]–[63] and, moreover, within a family of sparse transformations [64], [65] (see also [66]).

In the developments in this paper, we use the original version of this result, in which the JL map $f$ is realized by a matrix $A \in \mathbb{M}_{k,N}$ whose entries satisfy
$$A_{ij} \sim \mathrm{N}(0, 1/k). \qquad (18)$$
It can be shown that, with this choice, the probability of the relation (17) holding for every pair of points in $Q$ is bounded below by $1/n$.
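The Gaussian construction (18) can be checked empirically: the sketch below draws $A_{ij} \sim N(0, 1/k)$ with $k$ given by the bound (16) and measures how well the pairwise squared distances of a random point set are preserved (the sizes $N$, $n$, and the value of $\varepsilon$ are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(4)
N, n, eps = 1000, 20, 0.5
k = int(np.ceil(24 * np.log(n) / (3 * eps**2 - 2 * eps**3)))  # bound (16)

Q = rng.normal(size=(n, N))                           # n points in R^N
A = rng.normal(scale=1.0 / np.sqrt(k), size=(k, N))   # A_ij ~ N(0, 1/k)

# Ratios ||A v1 - A v2||^2 / ||v1 - v2||^2 should lie in [1 - eps, 1 + eps]
# for all pairs, with probability bounded below as discussed above.
ratios = []
for i in range(n):
    for j in range(i + 1, n):
        d = Q[i] - Q[j]
        ratios.append(np.sum((A @ d) ** 2) / np.sum(d ** 2))
ratios = np.array(ratios)
print(ratios.min(), ratios.max())
```

Note that the reduced dimension $k$ depends only logarithmically on the number of points $n$ and not at all on the ambient dimension $N$, which is what drives the logarithmic dimension reduction claimed in the abstract.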

Lemma III.1. Let $(V, \|\cdot\|)$ be a normed space and let $Q$ be a (finite or countably infinite) subset of $V$. Define $\|\cdot\|_Q\colon \mathrm{span}\{Q\} \longrightarrow \mathbb{R}^+$ by
$$\|v\|_Q := \inf\left\{ \sum_{j=1}^{\mathrm{Card}\,Q} |\lambda_j| \;\middle|\; \sum_{j=1}^{\mathrm{Card}\,Q} \lambda_j v_j = v,\ v_j \in Q \right\}.$$
(i) $\|\cdot\|_Q$ defines a seminorm on $\mathrm{span}\{Q\}$. If
$$M_Q := \sup\{\|v_i\| \mid v_i \in Q\} \qquad (19)$$
is finite, then $\|\cdot\|_Q$ is a norm.
(ii) $\|v\| \le \|v\|_Q\, M_Q$, for any $v \in \mathrm{span}\{Q\}$.
(iii) Let $Q_1, Q_2$ be subsets of $V$ such that $Q_1 \subset Q_2$. Then $\|v\|_{Q_2} \le \|v\|_{Q_1}$ for any $v \in \mathrm{span}\{Q_1\}$.

Remark III.2. If the hypothesis $M_Q < \infty$ is dropped in part (i) of Lemma III.1, then $\|\cdot\|_Q$ is in general not a norm, as the following example shows. Take $V = \mathbb{R}$ and $v_i = i$, $i \in \mathbb{N}$. It is easy to see that, in this setup,
$$\|1\|_Q = \inf\left\{ \frac{1}{i} \;\middle|\; i \in \mathbb{N} \right\} = 0.$$

Proposition III.3. Let $Q$ be a set of points in the Hilbert space $(V, \langle \cdot, \cdot \rangle)$ with $M_Q := \sup\{\|v_i\| \mid v_i \in Q\} < \infty$ and such that $-Q := \{-v \mid v \in Q\} = Q$. Let $\varepsilon > 0$, let $f\colon V \longrightarrow \mathbb{R}^k$ be a linear map that satisfies the Johnson-Lindenstrauss property (17) with respect to $\varepsilon$, and let $f^*\colon \mathbb{R}^k \longrightarrow V$ be the adjoint map with respect to a fixed inner product $\langle \cdot, \cdot \rangle$ in $\mathbb{R}^k$. Then,
$$\left| \langle w_1, (I_V - f^* \circ f)(w_2) \rangle \right| \le \varepsilon\, M_Q^2\, \|w_1\|_Q \|w_2\|_Q, \qquad (20)$$
for any $w_1, w_2 \in \mathrm{span}\{Q\}$.

Corollary III.4. In the hypotheses of the previous proposition, let
$$C_Q := \inf\left\{ c \in \mathbb{R}^+ \mid \|v\|_Q \le c\|v\|, \text{ for all } v \in \mathrm{span}\{Q\} \right\}. \qquad (21)$$
Then, for any $v \in \mathrm{span}\{Q\}$ such that $(f^* \circ f)(v) \in \mathrm{span}\{Q\}$, we have
$$\|(I_V - f^* \circ f)(v)\| \le \varepsilon\, M_Q^2 C_Q^2\, \|v\|. \qquad (22)$$
This corollary is just a consequence of the inequality (20), which guarantees that
$$\|(I_V - f^* \circ f)(v)\|^2 \le \varepsilon\, M_Q^2\, \|(I_V - f^* \circ f)(v)\|_Q \|v\|_Q \le \varepsilon\, M_Q^2 C_Q^2\, \|(I_V - f^* \circ f)(v)\| \|v\|, \qquad (23)$$
which yields (22).

B. Johnson-Lindenstrauss projection of state-space dynamics

The next result shows that when the dimension $k$ of the target of the JL map $f$ determined by (16) is chosen so that this map is generically surjective, any contractive state-space system with states in the domain of $f$ can be projected onto another one with states in its smaller-dimensional image. This result also shows that if the original system has the ESP and the FMP, then so does the projected one. Additionally, it gives bounds that quantify the dynamical differences between the two systems.

Theorem III.5. Let $F_\rho\colon \mathbb{R}^N \times D_d \longrightarrow \mathbb{R}^N$ be a one-parameter family of continuous state maps, where $D_d \subset \mathbb{R}^d$ is a compact subset, $0 < \rho < 1$, and $F_\rho$ is a $\rho$-contraction in the first component. Let $Q$ be an $n$-point spanning subset of $\mathbb{R}^N$ satisfying $-Q = Q$. Let $f\colon \mathbb{R}^N \longrightarrow \mathbb{R}^k$ be a JL map that satisfies (17) with $0 < \varepsilon < 1$, where the dimension $k$ has been chosen so that $f$ is generically surjective. Then:

(i) Let $F^f_\rho\colon \mathbb{R}^k \times D_d \longrightarrow \mathbb{R}^k$ be the state map defined by
$$F^f_\rho(\mathbf{x}, \mathbf{z}) := f\left(F_\rho(f^*(\mathbf{x}), \mathbf{z})\right),$$
for any $\mathbf{x} \in \mathbb{R}^k$ and $\mathbf{z} \in D_d$. If the parameter $\rho$ is chosen so that
$$\rho < 1/|||f|||^2, \qquad (24)$$
then $F^f_\rho$ is a contraction in the first entry. The symbol $|||\cdot|||$ in (24) denotes the operator norm with respect to the 2-norms in $\mathbb{R}^N$ and $\mathbb{R}^k$.

(ii) Let $V_k := f^*(\mathbb{R}^k) \subset \mathbb{R}^N$ and let $\overline{F}^f_\rho\colon V_k \times D_d \longrightarrow V_k$ be the state map with states in the vector space $V_k$ defined by
$$\overline{F}^f_\rho(\mathbf{x}, \mathbf{z}) := f^*\left(F^f_\rho\left((f^*)^{-1}(\mathbf{x}), \mathbf{z}\right)\right) = f^* \circ f\left(F_\rho(\mathbf{x}, \mathbf{z})\right), \qquad (25)$$
for any $\mathbf{x} \in V_k$ and $\mathbf{z} \in D_d$. If the contraction parameter satisfies (24), then $\overline{F}^f_\rho$ is also a contraction in the first entry. Moreover, the restricted linear map $f^*\colon \mathbb{R}^k \longrightarrow V_k$ is a state-map equivariant linear isomorphism between $F^f_\rho$ and $\overline{F}^f_\rho$.

(iii) Suppose, additionally, that there exist two constants $C, C^f > 0$ such that the state spaces of the state maps $F_\rho$ and $F^f_\rho$ can be restricted as $F_\rho\colon \overline{B}_{\|\cdot\|}(0, C) \times D_d \longrightarrow \overline{B}_{\|\cdot\|}(0, C)$ and $F^f_\rho\colon \overline{B}_{\|\cdot\|}(0, C^f) \times D_d \longrightarrow \overline{B}_{\|\cdot\|}(0, C^f)$. Then, both $F_\rho$ and $F^f_\rho$ have the ESP and have unique FMP associated filters $U_\rho\colon (D_d)^{\mathbb{Z}_-} \longrightarrow K_C$ and $U^f_\rho\colon (D_d)^{\mathbb{Z}_-} \longrightarrow K_{C^f}$, respectively. The state map $\overline{F}^f_\rho\colon f^*\left(\overline{B}_{\|\cdot\|}(0, C^f)\right) \times D_d \longrightarrow f^*\left(\overline{B}_{\|\cdot\|}(0, C^f)\right)$ is isomorphic to the restricted version of $F^f_\rho$, also has the ESP, and has an FMP associated filter $\overline{U}^f_\rho\colon (D_d)^{\mathbb{Z}_-} \longrightarrow \left(f^*\left(\overline{B}_{\|\cdot\|}(0, C^f)\right)\right)^{\mathbb{Z}_-}$. The state map $\overline{F}^f_\rho$ and the filter $\overline{U}^f_\rho$ are called the JL projected versions of $F_\rho$ and $U_\rho$, respectively.

(iv) In the hypotheses of the previous point, for any $\mathbf{z} \in (D_d)^{\mathbb{Z}_-}$ and $t \in \mathbb{Z}_-$:
$$\left\| U_\rho(\mathbf{z})_t - \overline{U}^f_\rho(\mathbf{z})_t \right\| \le \varepsilon^{1/2}\, C M_Q C_Q\, \frac{\left(1 + |||f|||^2\right)^{1/2}}{1 - \rho}, \qquad (26)$$
where $M_Q$ and $C_Q$ are given by (19) and (21), respectively. Alternatively, it can also be shown that:
$$\left\| U_\rho(\mathbf{z})_t - \overline{U}^f_\rho(\mathbf{z})_t \right\| \le \frac{\varepsilon\, C M_Q^2 C_Q^2}{1 - \rho}. \qquad (27)$$

(v) Let $R > \max\{1/|||f|||^2, 1\}$ and set $\rho = 1/(R\,|||f|||^2)$. Then, the elements in the set $Q$ can be chosen so that the bounds in (26) and (27) reduce to
$$\varepsilon^{1/2} N^{3/4} C\left(1 + |||f|||^2\right)^{1/2} \frac{R\,|||f|||^2}{R\,|||f|||^2 - 1} \quad \text{and} \qquad (28)$$
$$\varepsilon\, N C\, \frac{R\,|||f|||^2}{R\,|||f|||^2 - 1}, \qquad (29)$$
respectively.

C. The Johnson-Lindenstrauss reduced SigSAS system

We now use the previous theorem to spell out the Johnson-Lindenstrauss projected version of SigSAS approximations and to establish error bounds analogous to those introduced in (28) and (29). Given that Theorem III.5 is formulated using the one- and two-norms in Euclidean spaces and that Proposition II.2 defines the SigSAS system on a tensor space endowed with an unspecified cross-norm, we notice that these two frameworks can be matched by using the norms $\|\cdot\|$ and $\|\cdot\|_1$ in $T^{l+1}(\mathbb{R}^{p+1})$ given by
$$\|\mathbf{v}\|^2 := \sum_{i_1, \ldots, i_{l+1} = 1}^{p+1} \lambda_{i_1, \ldots, i_{l+1}}^2, \qquad \|\mathbf{v}\|_1 := \sum_{i_1, \ldots, i_{l+1} = 1}^{p+1} \left|\lambda_{i_1, \ldots, i_{l+1}}\right|,$$
with $\mathbf{v} = \sum_{i_1, \ldots, i_{l+1} = 1}^{p+1} \lambda_{i_1, \ldots, i_{l+1}}\, \mathbf{e}_{i_1} \otimes \cdots \otimes \mathbf{e}_{i_{l+1}}$ and $\left\{\mathbf{e}_{i_1} \otimes \cdots \otimes \mathbf{e}_{i_{l+1}}\right\}_{i_1, \ldots, i_{l+1} \in \{1, \ldots, p+1\}}$ the canonical basis in $T^{l+1}(\mathbb{R}^{p+1})$. It is easy to check that these two norms are cross-norms and that $\|\cdot\|$ is the norm associated to the inner product defined by the extension by bilinearity of the assignment
$$\left\langle \mathbf{e}_{i_1} \otimes \cdots \otimes \mathbf{e}_{i_{l+1}},\, \mathbf{e}_{j_1} \otimes \cdots \otimes \mathbf{e}_{j_{l+1}} \right\rangle := \delta_{i_1 j_1} \cdots \delta_{i_{l+1} j_{l+1}},$$
which makes $\left(T^{l+1}(\mathbb{R}^{p+1}), \langle\cdot,\cdot\rangle\right)$ into a Hilbert space, a feature that is needed to use the Johnson-Lindenstrauss Lemma.
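Both norms can be checked to be cross-norms numerically: on an elementary tensor $\mathbf{v}_1 \otimes \cdots \otimes \mathbf{v}_{l+1}$ they factor as the product of the corresponding norms of the factors. A minimal sketch with arbitrary small sizes:

```python
import numpy as np

# Numerical check of the cross-norm property on an elementary tensor of
# T^{l+1}(R^{p+1}): ||v1 ⊗ ... ⊗ v_{l+1}|| = ||v1|| ... ||v_{l+1}|| for both
# the Euclidean coefficient norm ||.|| and the l^1 coefficient norm ||.||_1.
rng = np.random.default_rng(0)
p, l = 3, 2                      # tensors of order l+1 = 3 over R^{p+1} = R^4
factors = [rng.standard_normal(p + 1) for _ in range(l + 1)]

# Build the elementary tensor and flatten its coefficient array.
tensor = factors[0]
for v in factors[1:]:
    tensor = np.multiply.outer(tensor, v)
coeffs = tensor.ravel()

norm_2 = np.linalg.norm(coeffs)          # ||v|| as defined above
norm_1 = np.sum(np.abs(coeffs))          # ||v||_1 as defined above

prod_2 = np.prod([np.linalg.norm(v) for v in factors])
prod_1 = np.prod([np.sum(np.abs(v)) for v in factors])

assert np.isclose(norm_2, prod_2) and np.isclose(norm_1, prod_1)
```

The check relies only on the fact that the coefficient array of an elementary tensor is the outer product of the coefficient vectors of its factors.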

Corollary III.6. Let $M > 0$ and let $(F^{\mathrm{SigSAS}}_{\lambda,l,p}, W)$ be the SigSAS system that approximates a causal and TI filter $U: K_M \longrightarrow \ell^\infty_-(\mathbb{R}^m)$, as introduced in Theorem II.4. Let $N := (p+1)^{l+1}$, $\widetilde{M}$ as in (11), and let $0 < \epsilon < 1$. Let $f: \mathbb{R}^N \longrightarrow \mathbb{R}^k$ be a JL map that satisfies (17), where the dimension $k$ has been chosen to make $f$ generically surjective. Then, for any $R > \max\left\{1/|||f|||^2,\ 1/(\widetilde{M}|||f|||^2),\ 1\right\}$, $\lambda := 1/(R\widetilde{M}|||f|||^2)$, and $L$ as in (12), there exists a JL reduced version $\overline{F}^{\mathrm{SigSAS}}_{\lambda,l,p,f}: f^*\left(B_{\|\cdot\|}(0, L_f)\right) \times [-M, M] \longrightarrow f^*\left(B_{\|\cdot\|}(0, L_f)\right)$ of $F^{\mathrm{SigSAS}}_{\lambda,l,p}: B_{\|\cdot\|}(0, L) \times [-M, M] \longrightarrow B_{\|\cdot\|}(0, L)$, with $L_f := \widetilde{M}|||f|||\big/\left(1 - \lambda\widetilde{M}|||f|||^2\right)$, that has the ESP and a unique FMP associated filter $\overline{U}^{\mathrm{SigSAS}}_{\lambda,l,p,f}: K_M \longrightarrow \left(f^*\left(B_{\|\cdot\|}(0, L_f)\right)\right)^{\mathbb{Z}_-}$. Moreover, we have that
$$\left\| W U^{\mathrm{SigSAS}}_{\lambda,l,p}(z)_t - \overline{W}\,\overline{U}^{\mathrm{SigSAS}}_{\lambda,l,p,f}(z)_t \right\| \leq \epsilon^{1/2}\, |||W|||\, N^{3/4} \left(1 + |||f|||^2\right)^{1/2} \frac{\widetilde{M} R^2 |||f|||^4}{\left(R|||f|||^2 - 1\right)^2}, \qquad (30)$$
$$\left\| W U^{\mathrm{SigSAS}}_{\lambda,l,p}(z)_t - \overline{W}\,\overline{U}^{\mathrm{SigSAS}}_{\lambda,l,p,f}(z)_t \right\| \leq \epsilon\, |||W|||\, N\, \frac{\widetilde{M} R^2 |||f|||^4}{\left(R|||f|||^2 - 1\right)^2}, \qquad (31)$$
for any $z \in K_M$ and $t \in \mathbb{Z}_-$, and where $\overline{W} := W \circ i_k \in \mathbb{M}_{m,k}$, with $i_k: f^* \circ f\left(T^{l+1}(\mathbb{R}^{p+1})\right) \hookrightarrow T^{l+1}(\mathbb{R}^{p+1})$ the inclusion.

This result shows that causal and time-invariant filters can be approximated by JL reduced SigSAS systems. The goal in the following paragraphs is to show that such systems are just SAS systems with randomly drawn matrix coefficients and, additionally, to spell out precisely the law of their entries. These facts show that a large class of filters can be learnt just by randomly generating a SAS and by tuning a linear readout layer for each individual filter that needs to be approximated. We emphasize that the JL reduced randomly generated SigSAS system is the same for the entire class of FMP filters that are being approximated and that only the linear readout depends on the individual filter that needs to be learnt, which amounts to the strong universality property that we discussed in the Introduction and in Section II-A. As in Remark II.5, we recall that the quality of the approximation using a JL reduced random SigSAS system may change from filter to filter because of the dependence on the sequence $\mathbf{w}_U$ in the bound (15) and the presence of the linear readout $W$ in (30) and (31).

The next statement needs the following fact, known in the literature as Gordon's Theorem (see [67, Theorem 5.32] and references therein): given a random matrix $A \in \mathbb{M}_{n,m}$ with standard Gaussian IID entries, we have that
$$\mathbb{E}\left[|||A|||\right] \leq \sqrt{n} + \sqrt{m}. \qquad (32)$$
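A quick Monte Carlo experiment illustrates (32); the matrix sizes and number of trials below are arbitrary:

```python
import numpy as np

# Monte Carlo illustration of Gordon's bound (32): for A with IID N(0,1)
# entries, E[|||A|||] <= sqrt(n) + sqrt(m), where |||.||| is the spectral norm.
rng = np.random.default_rng(0)
n, m, trials = 100, 60, 200
norms = [np.linalg.norm(rng.standard_normal((n, m)), ord=2) for _ in range(trials)]
empirical_mean = np.mean(norms)
print(empirical_mean, np.sqrt(n) + np.sqrt(m))
assert empirical_mean <= np.sqrt(n) + np.sqrt(m)
```

With these sizes the empirical mean typically sits slightly below the bound, reflecting that (32) is sharp up to lower-order terms.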

Additionally, the element $\widehat{\mathbf{z}}_0 \in T^{l+1}(\mathbb{R}^{p+1})$ introduced in (8) for the construction of the SigSAS system will be chosen in a specific randomized way in this case. Indeed, this time around, we replace (8) by
$$\widehat{\mathbf{z}}_0 = r \sum_{i \in I_0} z^{i-1}\, \mathbf{e}_1 \otimes \cdots \otimes \mathbf{e}_1 \otimes \mathbf{e}_i, \qquad (33)$$
where $r$ is a Rademacher random variable chosen independently from all the other random variables that appear in the different constructions. If we take in $T^{l+1}(\mathbb{R}^{p+1})$ the canonical basis in lexicographic order, the element $\widehat{\mathbf{z}}_0$ can be written as the image of a linear map as
$$\widehat{\mathbf{z}}_0 = r\, C_{I_0}\, (1, z, \ldots, z^p)^\top, \quad \text{with} \qquad (34)$$
$$C_{I_0} := \begin{pmatrix} S^c \\ \mathbb{O}_{(p+1)\left((p+1)^l - 1\right),\, p+1} \end{pmatrix} \in \mathbb{M}_{(p+1)^{l+1},\, p+1},$$
and $S^c \in \mathbb{M}_{p+1}$ a diagonal selection matrix with elements given by $S^c_{ii} = 1$ if $i \in I_0$, and $S^c_{ii} = 0$ otherwise.

Theorem III.7. Let $M > 0$, let $\widetilde{M}$ be as in (11), $l, p, k \in \mathbb{N}$, and define $N := (p+1)^{l+1}$, $N_0 := (p+1)^l$. Consider a SigSAS state map $F^{\mathrm{SigSAS}}_{\lambda,l,p}: T^{l+1}(\mathbb{R}^{p+1}) \times [-M, M] \longrightarrow T^{l+1}(\mathbb{R}^{p+1})$ of the type introduced in (10) and defined by choosing the non-homogeneous term $\widehat{\mathbf{z}}_0$ as in (33). Let now $f: \mathbb{R}^N \longrightarrow \mathbb{R}^k$ be a JL projection randomly drawn according to (18). Let $\delta > 0$ be small enough so that
$$\lambda_0 := \frac{\delta}{2\widetilde{M}}\sqrt{\frac{k}{N_0}} < \min\left\{\frac{1}{\widetilde{M}},\ \frac{1}{\widetilde{M}|||f|||^2},\ 1\right\}. \qquad (35)$$
Then, the JL reduced version $\overline{F}^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}$ of $F^{\mathrm{SigSAS}}_{\lambda_0,l,p}$ has the ESP and the FMP with probability at least $1 - \delta$ and, in the limit $N_0 \to \infty$, it is isomorphic to the family of randomly generated SAS systems $F^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}$ with states in $\mathbb{R}^k$ and given by
$$F^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}(\mathbf{x}, z) := \left(\sum_{i=1}^{p+1} z^{i-1} A_i\right)\mathbf{x} + B\,(1, z, \cdots, z^p)^\top, \qquad (36)$$
where $A_1, \ldots, A_{p+1} \in \mathbb{M}_k$ and $B \in \mathbb{M}_{k,p+1}$ are random matrices whose entries are drawn according to:
$$(A_1)_{j,m}, \ldots, (A_{p+1})_{j,m} \sim \mathrm{N}\left(0, \frac{\delta^2}{4 k \widetilde{M}^2}\right), \qquad (37)$$
$$B_{j,m} \sim \mathrm{N}\left(0, \frac{1}{k}\right) \text{ if } m \in I_0, \qquad B_{j,m} = 0 \text{ otherwise}. \qquad (38)$$
All the entries in the matrices $A_1, \ldots, A_{p+1}$ are independent random variables. The entries in the matrix $B$ are independent from each other, and they are decorrelated and asymptotically independent (in the limit as $N_0 \to \infty$) from those in $A_1, \ldots, A_{p+1}$.
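As an illustration, the following sketch draws the coefficient matrices according to the laws (37) and (38) and checks empirically that the resulting state map (36) is contractive in the state variable for all inputs $z \in [-M, M]$; the sizes $p$ and $k$, the constant $\delta$, the input bound $M$, and the set $I_0$ are illustrative choices, not values prescribed by the theorem:

```python
import numpy as np

# Draw the random SAS coefficients following the laws (37)-(38) and check
# empirically that x -> (sum_i z^{i-1} A_i) x + B (1, z, ..., z^p)^T is a
# contraction in x over a grid of admissible inputs z in [-M, M].
rng = np.random.default_rng(1)
p, k = 4, 40
M = 1.0
M_tilde = sum(M ** i for i in range(p + 1))      # M_tilde = 1 + M + ... + M^p, as in (11)
delta = 0.5
I0 = {1, 2}                                       # illustrative index set for the forcing term

# (37): IID Gaussian entries with variance delta^2 / (4 k M_tilde^2)
A = rng.normal(0.0, delta / (2 * np.sqrt(k) * M_tilde), size=(p + 1, k, k))
# (38): N(0, 1/k) entries in the columns indexed by I0, zero elsewhere
B = np.zeros((k, p + 1))
for m in I0:
    B[:, m - 1] = rng.normal(0.0, 1.0 / np.sqrt(k), size=k)

# Largest spectral norm of sum_i z^{i-1} A_i over the input grid
worst = max(
    np.linalg.norm(np.tensordot(z ** np.arange(p + 1), A, axes=1), ord=2)
    for z in np.linspace(-M, M, 21)
)
print(worst)
assert worst < 1.0   # well below 1, so the state map contracts in x
```

The variance in (37) scales like $1/\widetilde{M}^2$ precisely so that the $z$-dependent state matrix stays contractive over the whole admissible input range.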

We conclude with a result that combines the SigSAS Approximation Theorem II.4 with its JL reduction in Corollary III.6, as well as its SAS characterization with random coefficients in Theorem III.7. This statement shows that in order to approximate a large class of sufficiently regular FMP filters with uniformly bounded inputs, it suffices to randomly generate a common SAS system for all of them and to tune a linear readout for each different filter in that class that needs to be approximated.

Theorem III.8. Let $M, L > 0$ and let $U: K_M \subset \ell^\infty_-(\mathbb{R}) \longrightarrow K_L \subset \ell^\infty_-(\mathbb{R}^m)$ be a causal and time-invariant fading memory filter that satisfies the hypotheses in Theorem II.4. Fix now $l, p, k \in \mathbb{N}$ and $\delta > 0$ small enough so that (35) holds. Construct now the SAS system with states in $\mathbb{R}^k$ given by
$$F^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}(\mathbf{x}, z) = \left(\sum_{i=1}^{p+1} z^{i-1} A_i\right)\mathbf{x} + B\,(1, z, \cdots, z^p)^\top, \qquad (39)$$
with matrix coefficients randomly generated according to the laws spelled out in (37) and (38).

If $p$ and $l$ are large enough, then the SAS system $F^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}$ has the ESP and the FMP with probability at least $1 - \delta$. In that case $F^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}$ has an associated filter $U^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}$ and there exists a monotonically decreasing sequence $\mathbf{w}_U$ with zero limit and a linear map $\overline{W} \in L(\mathbb{R}^k, \mathbb{R}^m)$ such that for any $\mathbf{z} \in \widetilde{B}_M$ it holds that
$$\left\| U(\mathbf{z})_t - \overline{W}\, U^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}(\mathbf{z})_t \right\| \leq w^U_l + L\left(1 - \frac{\|\mathbf{z}\|_\infty}{M}\right)^{-1}\left(\frac{\|\mathbf{z}\|_\infty}{M}\right)^{p+1} + I_{l,p}, \qquad (40)$$
where $I_{l,p}$ is either
$$I_{l,p} := \epsilon^{1/2}\, |||W|||\, N^{3/4}\, \widetilde{M} \left(1 + |||f|||^2\right)^{1/2} \frac{1}{\left(1 - \frac{\delta}{2}\sqrt{\frac{k}{N_0}}\right)^2} \quad \text{or} \quad I_{l,p} := \epsilon\, |||W|||\, N\, \widetilde{M}\, \frac{1}{\left(1 - \frac{\delta}{2}\sqrt{\frac{k}{N_0}}\right)^2}. \qquad (41)$$
In these expressions, $W \in L\left(T^{l+1}(\mathbb{R}^{p+1}), \mathbb{R}^m\right)$ is a linear map such that $\overline{W} = W \circ f^*$, $N = (p+1)^{l+1}$, $\widetilde{M}$ is defined in (11), and $0 < \epsilon < 1$ satisfies (16) with $n$ replaced by $N$.
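For orientation, the two variants of $I_{l,p}$ in (41) are straightforward to evaluate numerically; the sketch below codes them up with placeholder values of $\epsilon$, $|||W|||$, $|||f|||$, $\delta$, and $\widetilde{M}$ (none of which are prescribed by the statement) just to show how the bounds scale with $p$, $l$, and $k$:

```python
import numpy as np

# Illustrative evaluation of the two error-bound variants in (41).
# All numeric inputs below are placeholder values for the sketch.
def I_lp(l, p, k, eps, W_norm, f_norm, delta, M_tilde):
    N = (p + 1) ** (l + 1)
    N0 = (p + 1) ** l
    denom = (1.0 - 0.5 * delta * np.sqrt(k / N0)) ** 2
    first = np.sqrt(eps) * W_norm * N ** 0.75 * M_tilde * np.sqrt(1 + f_norm ** 2) / denom
    second = eps * W_norm * N * M_tilde / denom
    return first, second

b1, b2 = I_lp(l=3, p=8, k=100, eps=0.1, W_norm=1.0, f_norm=1.0, delta=0.2, M_tilde=2.0)
print(b1, b2)
```

The first variant trades a better dependence on $N$ ($N^{3/4}$ versus $N$) against a worse dependence on $\epsilon$ ($\epsilon^{1/2}$ versus $\epsilon$), so which bound is tighter depends on the chosen JL accuracy.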

IV. NUMERICAL ILLUSTRATION

In order to illustrate the main contributions of the paper, we consider an IO system given by the so-called generalized autoregressive conditional heteroskedastic (GARCH) model [68], [69]. GARCH is a popular discrete-time process in time series analysis which is used in the econometrics literature and by practitioners to model and forecast the dynamics of conditional volatilities in financial time series. More specifically, the GARCH(1,1) model is given by
$$\begin{cases} y_t = \sigma_t z_t, \quad z_t \sim \mathcal{N}(0, 1), \\ \sigma_t^2 = \omega + \alpha y_{t-1}^2 + \beta \sigma_{t-1}^2, \end{cases} \quad t \in \mathbb{Z}, \qquad (42)$$
where $\omega > 0$, $\alpha, \beta \geq 0$, $\alpha + \beta < 1$ (see [70] for a careful discussion of the properties of GARCH processes). The IO system is driven by the input innovations $\{z_t\}_{t \in \mathbb{Z}}$ and the observations $\{y_t\}_{t \in \mathbb{Z}}$ represent its output. In the experiment we use $\omega = 0.0001$, $\alpha = 0.1$, $\beta = 0.87$ and, in order to learn the corresponding IO system, we construct: (i) a SigSAS system as in Proposition II.2; (ii) a JL reduced SigSAS system as in Corollary III.6; (iii) a randomly generated SAS as in Theorem III.7. For all the systems, the corresponding readout maps are obtained by a linear regression. Figure 1 illustrates the result in Theorem II.4 and shows that the SigSAS approximation error decreases with $N$. Figure 2 shows that the approximation errors committed by both the JL reduced SigSAS and its randomly generated analogue decrease as the JL dimension $k$ increases. We emphasize that the mean errors are computed using 160 randomly drawn instances of these two reduced SigSAS systems and note that the errors reported in this figure for the two systems are visually indistinguishable. We recall that even though the result of Theorem III.7 is proved to hold in the limit as $N_0 = (p+1)^l \to \infty$, it is clear from this particular example that even for moderately small $N_0$ ($p = 8$ and $l = 3$) randomly generated small-dimensional SigSAS systems can excel in learning a given IO system.
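The following sketch reproduces the flavor of this experiment: it simulates a GARCH(1,1) path with the parameters above, runs a randomly generated SAS reservoir of the form (36), and trains the linear readout by ordinary least squares. The sizes $p$ and $k$ and the constants $\delta$ and $M$ are illustrative choices (not those of the paper), and the innovations are clipped to $[-M, M]$ so that the reservoir input is bounded:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- simulate a GARCH(1,1) path, eq. (42), with the parameters of the text ---
omega, alpha, beta = 1e-4, 0.1, 0.87
T = 3000
z = rng.standard_normal(T)                   # input innovations z_t
y = np.empty(T)
sigma2 = omega / (1.0 - alpha - beta)        # start at the stationary variance
y[0] = np.sqrt(sigma2) * z[0]
for t in range(1, T):
    sigma2 = omega + alpha * y[t - 1] ** 2 + beta * sigma2
    y[t] = np.sqrt(sigma2) * z[t]

# --- a randomly generated SAS reservoir of the form (36); illustrative sizes ---
p, k, delta, M = 4, 50, 0.5, 4.0
M_tilde = sum(M ** i for i in range(p + 1))  # M_tilde = 1 + M + ... + M^p
A = rng.normal(0.0, delta / (2 * np.sqrt(k) * M_tilde), size=(p + 1, k, k))
B = np.zeros((k, p + 1))
B[:, :2] = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, 2))   # I0 = {1, 2}

zc = np.clip(z, -M, M)                       # keep reservoir inputs in [-M, M]
x = np.zeros(k)
states = np.empty((T, k))
for t in range(T):
    monomials = zc[t] ** np.arange(p + 1)    # (1, z_t, ..., z_t^p)
    x = np.tensordot(monomials, A, axes=1) @ x + B @ monomials
    states[t] = x

# --- linear readout trained by ordinary least squares, after a burn-in ---
burn = 200
W, *_ = np.linalg.lstsq(states[burn:], y[burn:], rcond=None)
mse = np.mean((states[burn:] @ W - y[burn:]) ** 2)
print(f"training MSE: {mse:.3e}")
```

Note that only the readout `W` is fitted; the reservoir coefficients `A` and `B` are drawn once at random and never trained, which is exactly the strong universality mechanism described above.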

The implications of the strong universality features of the randomly generated SAS systems are far-reaching in terms of their empirical performance since, as we already emphasized several times, it is only the linear readout that is tuned for each individual IO system of interest. In particular, this opens the door to multi-task learning (when different components of the readout are trained for different tasks in parallel) and to new hardware implementations of these randomized SAS systems.


Fig. 1. Box plots for the training mean squared errors (all MSE values are multiplied by 1e+4 for convenience) committed by SigSAS systems in the modeling of GARCH realizations for increasing $N$, where each $N = (p+1)^{l+1}$ is computed using pairs $(p, l)$, $p \in \{1, \ldots, 8\}$, $l \in \{1, 2, 3\}$, in lexicographical order. The distribution of errors is constructed using 200 GARCH paths of length 10000 and $I_0 = \{1, 2\}$ in the SigSAS prescription. The seemingly slow decay of the MSE values with $N$ is due to linear regression problems which are ill-conditioned for large $N$ and which would require adequate regularization.


Fig. 2. Box plots for the distributions of training mean squared errors (all MSE values are multiplied by 1e+4 for convenience) committed by 160 instances of randomly JL reduced SigSAS systems and randomly generated SAS systems according to Theorem III.7. The MSEs are computed with respect to one given GARCH path of length 7000 for different values of $k$. For each $k$, the box plots corresponding to the two systems are plotted next to each other to ease comparison (JL SigSAS in blue and random SAS in magenta). The subplot in the upper right corner shows a comparison of a part of this GARCH path for $t = 1, \ldots, 100$ and its approximations using a JL SigSAS and a randomly generated SAS system with $k = 10$.

V. CONCLUSION

Reservoir computing capitalizes on the remarkable fact that there are learning systems that attain universal approximation properties without requiring that all their parameters be estimated using a supervised learning procedure. These untrained parameters are most of the time randomly generated, and it is only an output layer that needs to be estimated using a simple functional prescription. This phenomenon has been explained for static (extreme learning machines [30]) and dynamic (echo state networks [34], [35]) neural paradigms and its performance has been quantified using mostly probabilistic methods.

In this paper, we have concentrated on a different class of reservoir computing systems, namely the state-affine (SAS) family. The SAS class was introduced and proved universal in [36], and we have shown here that the possibility of randomly constructing these systems while preserving their approximation properties is of a geometric nature. The rationale behind our description relies on the following points:

• Any analytic filter can be represented as a Volterra series expansion. When this filter is additionally of fading memory type, the truncation error can be easily quantified.

• Truncated Volterra series admit a natural state-space representation with a linear observation equation in a conveniently chosen tensor space. The state equation of this representation has a strong universality property: its unique solution can be used to approximate any analytic fading memory filter just by modifying the linear observation equation. We refer to this strongly universal filter as the SigSAS system.

• The random projections of the SigSAS system yield SAS systems with randomly generated coefficients in a potentially much smaller dimension which approximately preserve the good properties of the original SigSAS system. The loss in performance that one incurs because of the projection mechanism can be quantified using the Johnson-Lindenstrauss Lemma.

These observations, together with the numerical experiment, collectively show that SAS reservoir systems with randomly chosen coefficients exhibit excellent empirical performance in the learning of fading memory input/output systems because they approximately correspond to very high-degree Volterra series expansions of those systems.

APPENDIX

A. Proof of Theorem II.1

The representation (3) is a straightforward multivariate generalization of Theorem 29 in [56]. For any $\mathbf{z} \in \widetilde{B}_M$ and any $p, l \in \mathbb{N}$ define
$$U_{l,p}(\mathbf{z})_t := \sum_{j=1}^{p} \sum_{m_1 = -l}^{0} \cdots \sum_{m_j = -l}^{0} g_j(m_1, \ldots, m_j)\left(\mathbf{z}_{m_1 + t} \otimes \cdots \otimes \mathbf{z}_{m_j + t}\right).$$
Now, for any $\mathbf{z} \in \widetilde{B}_M$ and $t_1, t_2 \in \mathbb{Z}_-$ such that $t_2 \leq t_1$, define the sequence $\mathbf{z}^{t_1}_{t_2} \in \widetilde{B}_M$ by $\mathbf{z}^{t_1}_{t_2} := (\ldots, \mathbf{0}, \mathbf{z}_{t_2}, \ldots, \mathbf{z}_{t_1})$. Additionally, for any $\mathbf{u} \in (\mathbb{R}^d)^{\mathbb{Z}_-}$ and any $\mathbf{z} \in (\mathbb{R}^d)^{\mathbb{N}_+}$, the symbol $\mathbf{u}\mathbf{z}^1_t \in (\mathbb{R}^d)^{\mathbb{Z}_-}$, $t \in \mathbb{N}_+$, denotes the concatenation of the left-shifted vector $\mathbf{u}$ with the truncated vector $\mathbf{z}^1_t := (\mathbf{z}_1, \ldots, \mathbf{z}_t)$ obtained out of $\mathbf{z}$. With this notation, we now show (5). By the triangle inequality and the time-invariance of $U$, for any $\mathbf{z} \in \widetilde{B}_M$ we have
$$\begin{aligned}
\left\|U(\mathbf{z})_t - U_{l,p}(\mathbf{z})_t\right\| &\leq \left\|U(\mathbf{z})_t - U_{l,\infty}(\mathbf{z})_t\right\| + \left\|U_{l,\infty}(\mathbf{z})_t - U_{l,p}(\mathbf{z})_t\right\| \\
&= \left\|\sum_{j=1}^{\infty}\sum_{m_1=-\infty}^{-l-1}\cdots\sum_{m_j=-\infty}^{-l-1} g_j(m_1,\ldots,m_j)\left(\mathbf{z}_{m_1+t}\otimes\cdots\otimes\mathbf{z}_{m_j+t}\right)\right\| + \left\|U(\mathbf{z}^t_{-l+t})_0 - U_{\infty,p}(\mathbf{z}^t_{-l+t})_0\right\| \\
&= \left\|U\left(\mathbf{z}^{-l-1+t}_{-\infty}\mathbf{0}_{l+1}\right)_0\right\| + \left\|U(\mathbf{z}^t_{-l+t})_0 - U_{\infty,p}(\mathbf{z}^t_{-l+t})_0\right\|, \qquad (43)
\end{aligned}$$
where the symbol $\mathbf{0}_{l+1}$ stands for an $(l+1)$-tuple of the element $\mathbf{0} \in \mathbb{R}^d$. The second summand of this expression can be bounded using the Taylor bound provided in [56, Theorem 29]. As to the first summand, we shall use the input forgetting property that the filter $U$ exhibits since, by hypothesis, it has the FMP. More specifically, if we apply Theorem 6 in [56] to the FMP filter $U: K_M \longrightarrow \ell^\infty_-(\mathbb{R}^m)$, we can conclude the existence of a monotonically decreasing sequence $\mathbf{w}_U$ with zero limit such that for any $l \in \mathbb{N}$
$$\left\|U\left(\mathbf{z}^{-l-1+t}_{-\infty}\mathbf{0}_{l+1}\right)_0\right\| = \left\|U\left(\mathbf{z}^{-l-1+t}_{-\infty}\mathbf{0}_{l+1}\right)_0 - U(\mathbf{0})_0\right\| \leq w^U_l.$$
These two arguments substituted in (43) yield the bound in (5).

B. Proof of Proposition II.2

The map $F^{\mathrm{SigSAS}}_{\lambda,l,p}: T^{l+1}(\mathbb{R}^{p+1}) \times [-M, M] \longrightarrow T^{l+1}(\mathbb{R}^{p+1})$ is clearly continuous and, additionally, it is a contraction in its first component. Indeed, let $\mathbf{x}_1, \mathbf{x}_2 \in T^{l+1}(\mathbb{R}^{p+1})$ and let $z \in [-M, M]$ be arbitrary. Notice first that
$$\|\widetilde{\mathbf{z}}\| = \left\|\sum_{i=1}^{p+1} z^{i-1}\mathbf{e}_i\right\| \leq 1 + M + \cdots + M^p =: \widetilde{M}. \qquad (44)$$
It is easy to see that $\widetilde{M} = \frac{1 - M^{p+1}}{1 - M}$ for $M \neq 1$ and $\widetilde{M} = (p+1)M$, otherwise. Now, since we are using a cross-norm in $T^{l+1}(\mathbb{R}^{p+1})$, we have that
$$\left\|F^{\mathrm{SigSAS}}_{\lambda,l,p}(\mathbf{x}_1, z) - F^{\mathrm{SigSAS}}_{\lambda,l,p}(\mathbf{x}_2, z)\right\| = \lambda\left\|\pi_l(\mathbf{x}_1 - \mathbf{x}_2)\otimes\widetilde{\mathbf{z}}\right\| = \lambda\left\|\pi_l(\mathbf{x}_1 - \mathbf{x}_2)\right\|\left\|\widetilde{\mathbf{z}}\right\|.$$
If we use in this equality the relation (44) and the fact that $|||\pi_l||| = 1$, we can conclude that
$$\left\|F^{\mathrm{SigSAS}}_{\lambda,l,p}(\mathbf{x}_1, z) - F^{\mathrm{SigSAS}}_{\lambda,l,p}(\mathbf{x}_2, z)\right\| \leq \lambda\widetilde{M}\left\|\mathbf{x}_1 - \mathbf{x}_2\right\|. \qquad (45)$$
The hypothesis $\lambda < 1/\widetilde{M}$ implies that $F^{\mathrm{SigSAS}}_{\lambda,l,p}$ is a contraction and establishes (11). Additionally, $\left\|\widehat{\mathbf{z}}_0\right\| = \left\|\sum_{i \in I_0} z^{i-1}\mathbf{e}_1\otimes\cdots\otimes\mathbf{e}_1\otimes\mathbf{e}_i\right\| \leq 1 + M + \cdots + M^p = \widetilde{M}$, which implies that
$$\left\|F^{\mathrm{SigSAS}}_{\lambda,l,p}(\mathbf{0}, z)\right\| \leq \widetilde{M}, \quad \text{for all } z \in [-M, M], \qquad (46)$$
and hence by [71, Remark 2] we can conclude that $F^{\mathrm{SigSAS}}_{\lambda,l,p}$ restricts to a map $F^{\mathrm{SigSAS}}_{\lambda,l,p}: B_{\|\cdot\|}(0, L) \times [-M, M] \longrightarrow B_{\|\cdot\|}(0, L)$, for any $L \geq \widetilde{M}/(1 - \lambda\widetilde{M})$. Finally, the contractivity condition established in (45) and [56, Theorem 12] imply that the corresponding state system has the ESP and the FMP. We now show that its unique solution is given by (13). First, it is easy to see that by iterating the recursion (9) twice and three times, one obtains:
$$\begin{aligned}
\mathbf{x}_t &= \lambda^2\pi_l\left(\pi_l(\mathbf{x}_{t-2})\otimes\widetilde{\mathbf{z}}_{t-1}\right)\otimes\widetilde{\mathbf{z}}_t + \lambda\pi_l\left((\widehat{\mathbf{z}}_0)_{t-1}\right)\otimes\widetilde{\mathbf{z}}_t + (\widehat{\mathbf{z}}_0)_t \\
&= \lambda^3\pi_l\left(\pi_l\left(\pi_l(\mathbf{x}_{t-3})\otimes\widetilde{\mathbf{z}}_{t-2}\right)\otimes\widetilde{\mathbf{z}}_{t-1}\right)\otimes\widetilde{\mathbf{z}}_t + \lambda^2\pi_l\left(\pi_l\left((\widehat{\mathbf{z}}_0)_{t-2}\right)\otimes\widetilde{\mathbf{z}}_{t-1}\right)\otimes\widetilde{\mathbf{z}}_t + \lambda\pi_l\left((\widehat{\mathbf{z}}_0)_{t-1}\right)\otimes\widetilde{\mathbf{z}}_t + (\widehat{\mathbf{z}}_0)_t.
\end{aligned}$$
More generally, after $l+1$ iterations one obtains
$$\mathbf{x}_t = \lambda^{l+1}\underbrace{\pi_l(\pi_l(\cdots(\pi_l(}_{(l+1)\text{-times}}\mathbf{x}_{t-(l+1)})\otimes\widetilde{\mathbf{z}}_{t-l})\otimes\cdots)\otimes\widetilde{\mathbf{z}}_{t-1})\otimes\widetilde{\mathbf{z}}_t + \lambda^l\underbrace{\pi_l(\pi_l(\cdots(\pi_l(}_{l\text{-times}}(\widehat{\mathbf{z}}_0)_{t-l})\otimes\widetilde{\mathbf{z}}_{t-(l-1)})\otimes\cdots)\otimes\widetilde{\mathbf{z}}_{t-1})\otimes\widetilde{\mathbf{z}}_t + \cdots + \lambda\pi_l\left((\widehat{\mathbf{z}}_0)_{t-1}\right)\otimes\widetilde{\mathbf{z}}_t + (\widehat{\mathbf{z}}_0)_t.$$
Consequently, in order to establish (13) it suffices to show that
$$\lambda^{l+1}\underbrace{\pi_l(\pi_l(\cdots(\pi_l(}_{(l+1)\text{-times}}\mathbf{x}_{t-(l+1)})\otimes\widetilde{\mathbf{z}}_{t-l})\otimes\cdots)\otimes\widetilde{\mathbf{z}}_{t-1})\otimes\widetilde{\mathbf{z}}_t = \frac{\lambda^{l+1}}{1-\lambda}\widehat{\mathbf{z}}_t. \qquad (47)$$
We show this equality by writing
$$\mathbf{x}_t = \sum_{i_1, \ldots, i_{l+1} = 1}^{p+1} a^t_{i_1, \ldots, i_{l+1}}\, \mathbf{e}_{i_1}\otimes\cdots\otimes\mathbf{e}_{i_{l+1}}, \qquad (48)$$
for some coefficients $a^t_{i_1, \ldots, i_{l+1}} \in \mathbb{R}$ that, by (9) and the assumption that $1 \in I_0$, satisfy
$$a^t_{1, \ldots, 1} = \lambda a^{t-1}_{1, \ldots, 1} + 1, \quad \text{for any } t \in \mathbb{Z}_-.$$
This recursion can be rewritten for any $r \in \mathbb{N}$ and $t \in \mathbb{Z}_-$ as
$$a^t_{1, \ldots, 1} = \sum_{j=0}^{r-1}\lambda^j + \lambda^r a^{t-r}_{1, \ldots, 1}.$$
Since by hypothesis the parameter $\lambda < 1$ and, additionally, we just showed that $\|\mathbf{x}_t\| \leq L$ for all $t \in \mathbb{Z}_-$, with $L \geq \widetilde{M}/(1 - \lambda\widetilde{M})$, this equation has a unique solution given by
$$a^t_{1, \ldots, 1} = \frac{1}{1-\lambda}, \quad \text{for all } t \in \mathbb{Z}_-. \qquad (49)$$
Now, notice that using (48), we can write:
$$\pi_l\left(\mathbf{x}_{t-(l+1)}\right)\otimes\widetilde{\mathbf{z}}_{t-l} = \sum_{i_2, \ldots, i_{l+1}, j_1 = 1}^{p+1} a^{t-(l+1)}_{1, i_2, \ldots, i_{l+1}}\, z^{j_1 - 1}_{t-l}\, \mathbf{e}_{i_2}\otimes\cdots\otimes\mathbf{e}_{i_{l+1}}\otimes\mathbf{e}_{j_1}.$$
If we repeat this procedure $l+1$ times, we obtain that
$$\underbrace{\pi_l(\pi_l(\cdots(\pi_l(}_{(l+1)\text{-times}}\mathbf{x}_{t-(l+1)})\otimes\widetilde{\mathbf{z}}_{t-l})\otimes\cdots)\otimes\widetilde{\mathbf{z}}_{t-1})\otimes\widetilde{\mathbf{z}}_t = \sum_{j_1, \ldots, j_{l+1} = 1}^{p+1} a^{t-(l+1)}_{1, \ldots, 1}\, z^{j_1 - 1}_{t-l} z^{j_2 - 1}_{t-(l-1)}\cdots z^{j_{l+1} - 1}_t\, \mathbf{e}_{j_1}\otimes\cdots\otimes\mathbf{e}_{j_{l+1}} = a^{t-(l+1)}_{1, \ldots, 1}\,\widehat{\mathbf{z}}_t = \frac{1}{1-\lambda}\widehat{\mathbf{z}}_t,$$
where the last equality is a consequence of (49). This identity proves (47).
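The fixed-point computation above can be checked numerically. Since the recursion (9) and the precise definition of $\pi_l$ are not reproduced in this excerpt, the sketch below assumes, consistently with the index computation in the proof, that $\pi_l$ contracts the first tensor factor against $\mathbf{e}_1$; it then iterates $\mathbf{x}_t = \lambda\,\pi_l(\mathbf{x}_{t-1})\otimes\widetilde{\mathbf{z}}_t + (\widehat{\mathbf{z}}_0)_t$ on random inputs and verifies that the coefficient $a^t_{1,\ldots,1}$ converges to $1/(1-\lambda)$ as in (49):

```python
import numpy as np

rng = np.random.default_rng(0)
p, l = 2, 2                    # states live in T^{l+1}(R^{p+1}), here shape (3, 3, 3)
lam, M = 0.1, 1.0
I0 = (1, 2)                    # index set for the forcing term, with 1 in I0

def z_tilde(z):
    return z ** np.arange(p + 1)               # (1, z, ..., z^p)

def z_hat0(z):
    # hat{z}_0 = sum_{i in I0} z^{i-1} e_1 ⊗ ... ⊗ e_1 ⊗ e_i (deterministic version of (33))
    out = np.zeros((p + 1,) * (l + 1))
    for i in I0:
        out[(0,) * l + (i - 1,)] = z ** (i - 1)
    return out

# Iterate x_t = lam * pi_l(x_{t-1}) ⊗ z_tilde_t + hat{z}_{0,t}, with pi_l
# implemented (assumption) as the slice contracting the first factor against e_1.
x = np.zeros((p + 1,) * (l + 1))
for _ in range(60):
    z = rng.uniform(-M, M)
    x = lam * np.multiply.outer(x[0], z_tilde(z)) + z_hat0(z)

a_11 = x[(0,) * (l + 1)]       # coefficient a^t_{1,...,1}
print(a_11, 1.0 / (1.0 - lam))
assert np.isclose(a_11, 1.0 / (1.0 - lam))
```

After 60 iterations the coefficient equals the partial geometric sum $\sum_{j<60}\lambda^j$, which agrees with $1/(1-\lambda)$ to machine precision, matching (49).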

C. Proof of Theorem II.4

It is a straightforward corollary of Theorem II.1 and of the expression (13) of the filter $U^{\mathrm{SigSAS}}_{\lambda,l,p}$. The linear map $W$ is constructed by matching the coefficients $g_j(m_1, \ldots, m_j)$ of the truncated Volterra series representation of $U$ up to polynomial degree $p$ with the terms of the filter $U^{\mathrm{SigSAS}}_{\lambda,l,p}(z)_t$ in the canonical basis of $T^{l+1}(\mathbb{R}^{p+1})$. More specifically, $W \in L\left(T^{l+1}(\mathbb{R}^{p+1}), \mathbb{R}^m\right)$ is the linear map that satisfies:
$$W U^{\mathrm{SigSAS}}_{\lambda,l,p}(z)_t = \sum_{j=1}^{p}\sum_{m_1=-l}^{0}\cdots\sum_{m_j=-l}^{0} g_j(m_1, \ldots, m_j)\, z_{m_1+t}\cdots z_{m_j+t}, \qquad (50)$$
for any $z \in K_M$, $t \in \mathbb{Z}_-$, where the right hand side of this equality is the truncated Volterra series expansion of $U$, available by Theorem II.1. The equality (50) does determine $W$ because, by (14), it is equivalent to:
$$\sum_{i_1, \ldots, i_{l+1} = 1}^{p+1} W A_{\lambda,l,p}\left(\mathbf{e}_{i_1}\otimes\cdots\otimes\mathbf{e}_{i_{l+1}}\right) z^{i_1 - 1}_{t-l}\cdots z^{i_{l+1} - 1}_t = \sum_{j=1}^{p}\sum_{m_1=-l}^{0}\cdots\sum_{m_j=-l}^{0} g_j(m_1, \ldots, m_j)\, z_{m_1+t}\cdots z_{m_j+t}.$$
Since this equality between polynomials has to hold for any $z \in K_M$ and $t \in \mathbb{Z}_-$, we can conclude that the matrix coefficients on both sides have to coincide. This implies that, in particular, for any $i_1, \ldots, i_{l+1} \in \{1, \ldots, p+1\}$,
$$W A_{\lambda,l,p}\left(\mathbf{e}_{i_1}\otimes\cdots\otimes\mathbf{e}_{i_{l+1}}\right) = \sum_{I_{i_1, \ldots, i_{l+1}}} g_j(m_1, \ldots, m_j) \in \mathbb{R}^m, \qquad (51)$$
where $I_{i_1, \ldots, i_{l+1}} = \{(j, m_1, \ldots, m_j)\}$ is the set of indices with $j \in \{1, \ldots, p\}$, $m_i \in \{-l, \ldots, 0\}$, and $z_{m_1}\cdots z_{m_j} = z^{i_1 - 1}_{-l}\cdots z^{i_{l+1} - 1}_0$. As (51) specifies the image of a basis by the map $W A_{\lambda,l,p}$ and $A_{\lambda,l,p}$ is invertible, (51) and consequently (50) fully determine $W$. The bound in (15) is then a consequence of (50) and (5) in Theorem II.1.

D. Proof of Lemma III.1

(i) It is obvious that if $\mathbf{v} = \mathbf{0}$ then $\|\mathbf{v}\|_Q = 0$ and that $\|\lambda\mathbf{v}\|_Q = |\lambda|\|\mathbf{v}\|_Q$, for all $\lambda \in \mathbb{R}$ and $\mathbf{v} \in \mathrm{span}\{Q\}$. Let now $\mathbf{w}_1, \mathbf{w}_2 \in \mathrm{span}\{Q\}$ and $C_q := \mathrm{Card}\,Q$. Given that
$$\inf\left\{\sum_{j=1}^{C_q}\left|\lambda^1_j + \lambda^2_j\right| \,\Bigg|\, \sum_{j=1}^{C_q}\lambda^1_j\mathbf{v}_j = \mathbf{w}_1,\ \sum_{j=1}^{C_q}\lambda^2_j\mathbf{v}_j = \mathbf{w}_2,\ \mathbf{v}_j \in Q\right\} \geq \inf\left\{\sum_{j=1}^{C_q}|\lambda_j| \,\Bigg|\, \sum_{j=1}^{C_q}\lambda_j\mathbf{v}_j = \mathbf{w}_1 + \mathbf{w}_2,\ \mathbf{v}_j \in Q\right\},$$
we can conclude that
$$\|\mathbf{w}_1 + \mathbf{w}_2\|_Q \leq \inf\left\{\sum_{j=1}^{C_q}\left|\lambda^1_j + \lambda^2_j\right| \,\Bigg|\, \sum_{j=1}^{C_q}\lambda^1_j\mathbf{v}_j = \mathbf{w}_1,\ \sum_{j=1}^{C_q}\lambda^2_j\mathbf{v}_j = \mathbf{w}_2,\ \mathbf{v}_j \in Q\right\} \leq \inf\left\{\sum_{j=1}^{C_q}\left(\left|\lambda^1_j\right| + \left|\lambda^2_j\right|\right) \,\Bigg|\, \sum_{j=1}^{C_q}\lambda^1_j\mathbf{v}_j = \mathbf{w}_1,\ \sum_{j=1}^{C_q}\lambda^2_j\mathbf{v}_j = \mathbf{w}_2,\ \mathbf{v}_j \in Q\right\} = \|\mathbf{w}_1\|_Q + \|\mathbf{w}_2\|_Q,$$
which establishes the triangle inequality and hence shows that $\|\cdot\|_Q$ is a seminorm. Suppose now that $M_Q < \infty$ and let $\mathbf{v} \in \mathrm{span}\{Q\}$ be such that $\|\mathbf{v}\|_Q = 0$. By the approximation property of the infimum, for any $\epsilon > 0$ there exist $\lambda_1, \ldots, \lambda_{C_q} \in \mathbb{R}$ such that $\sum_{j=1}^{C_q}\lambda_j\mathbf{v}_j = \mathbf{v}$ and $0 \leq \sum_{j=1}^{C_q}|\lambda_j| < \epsilon$. This inequality implies that
$$\|\mathbf{v}\| = \left\|\sum_{j=1}^{C_q}\lambda_j\mathbf{v}_j\right\| \leq M_Q\sum_{j=1}^{C_q}|\lambda_j| < M_Q\,\epsilon. \qquad (52)$$
Since $M_Q$ is finite and $\epsilon > 0$ can be made arbitrarily small, this inequality implies that $\|\mathbf{v}\| = 0$ and hence, necessarily, $\mathbf{v} = \mathbf{0}$, which proves that $\|\cdot\|_Q$ is a norm in this case.

Since the first inequality in (52) holds for any $\mathbf{v} \in \mathrm{span}\{Q\}$, the statement in part (ii) follows (when $M_Q$ is not finite we use the convention that $\infty \cdot 0 = \infty$). Part (iii) is obvious.

E. Proof of Proposition III.3

Since $V$ and $\mathbb{R}^k$ are Hilbert spaces, the parallelogram law holds for the associated norms and hence, for any $\mathbf{v}_1, \mathbf{v}_2 \in Q$,
$$\begin{aligned}
\left\langle\mathbf{v}_1, \mathbf{v}_2 - f^*\circ f(\mathbf{v}_2)\right\rangle &= \left\langle\mathbf{v}_1, \mathbf{v}_2\right\rangle - \left\langle f(\mathbf{v}_1), f(\mathbf{v}_2)\right\rangle \\
&= \tfrac{1}{4}\left(\|\mathbf{v}_1 + \mathbf{v}_2\|^2 - \|\mathbf{v}_1 - \mathbf{v}_2\|^2\right) - \tfrac{1}{4}\left(\|f(\mathbf{v}_1) + f(\mathbf{v}_2)\|^2 - \|f(\mathbf{v}_1) - f(\mathbf{v}_2)\|^2\right) \\
&= \tfrac{1}{4}\left(\|\mathbf{v}_1 - (-\mathbf{v}_2)\|^2 - \|\mathbf{v}_1 - \mathbf{v}_2\|^2\right) - \tfrac{1}{4}\left(\|f(\mathbf{v}_1) - f(-\mathbf{v}_2)\|^2 - \|f(\mathbf{v}_1) - f(\mathbf{v}_2)\|^2\right) \\
&\leq \tfrac{\epsilon}{4}\left(\|\mathbf{v}_1 - \mathbf{v}_2\|^2 + \|\mathbf{v}_1 + \mathbf{v}_2\|^2\right) = \tfrac{\epsilon}{2}\left(\|\mathbf{v}_1\|^2 + \|\mathbf{v}_2\|^2\right), \qquad (53)
\end{aligned}$$
where in the inequality in the last line we used the JL property (17) together with the hypothesis $-Q = Q$. Let now $\mathbf{w}_1 = \sum_{i=1}^{\mathrm{Card}\,Q}\lambda^1_i\mathbf{v}_i,\ \mathbf{w}_2 = \sum_{i=1}^{\mathrm{Card}\,Q}\lambda^2_i\mathbf{v}_i \in \mathrm{span}\{Q\}$. Then, by (53):
$$\left|\left\langle\mathbf{w}_1, \mathbf{w}_2 - f^*\circ f(\mathbf{w}_2)\right\rangle\right| = \left|\sum_{i,j=1}^{\mathrm{Card}\,Q}\lambda^1_i\lambda^2_j\left\langle\mathbf{v}_i, \mathbf{v}_j - f^*\circ f(\mathbf{v}_j)\right\rangle\right| \leq \sum_{i,j=1}^{\mathrm{Card}\,Q}\left|\lambda^1_i\right|\left|\lambda^2_j\right|\tfrac{\epsilon}{2}\left(\|\mathbf{v}_i\|^2 + \|\mathbf{v}_j\|^2\right) \leq \epsilon\sum_{i=1}^{\mathrm{Card}\,Q}\left|\lambda^1_i\right|\sum_{j=1}^{\mathrm{Card}\,Q}\left|\lambda^2_j\right|\, M_Q^2.$$
Since this inequality holds true for any linear decomposition of $\mathbf{w}_1, \mathbf{w}_2 \in \mathrm{span}\{Q\}$, we can take infima on its right hand side with respect to those decompositions, which clearly implies (20).

F. Proof of Theorem III.5

(i) We show that when condition (24) holds, $F^f_\rho$ is a contraction on the first entry. Let $\mathbf{x}_1, \mathbf{x}_2 \in \mathbb{R}^k$ and let $\mathbf{z} \in D_d$; then
$$\left\|F^f_\rho(\mathbf{x}_1, \mathbf{z}) - F^f_\rho(\mathbf{x}_2, \mathbf{z})\right\| = \left\|f\left(F_\rho(f^*(\mathbf{x}_1), \mathbf{z})\right) - f\left(F_\rho(f^*(\mathbf{x}_2), \mathbf{z})\right)\right\| \leq \rho\,|||f|||\,|||f^*|||\,\|\mathbf{x}_1 - \mathbf{x}_2\|.$$
The claim follows from this inequality, the equality $|||f||| = |||f^*|||$, and condition (24).

(ii) The proof is straightforward. The only point that needs to be emphasized is that $(f^*)^{-1}: V_k \longrightarrow \mathbb{R}^k$ is well-defined because, since $f$ is surjective, $f^*: \mathbb{R}^k \longrightarrow V_k$ is necessarily injective.

(iii) First of all, the existence of the restricted versions of $F_\rho$ and $F^f_\rho$ on compact state spaces and the fact that these maps are contractions on the first entry with contraction rates $\rho$ and $\rho|||f|||^2$, respectively, imply by [56, Theorem 7, part (i)] that they have the ESP and associated FMP filters $U_\rho$ and $U^f_\rho$. The statement about the JL-projected state map $\overline{F}^f_\rho$ and its associated filter $\overline{U}^f_\rho$ is a straightforward consequence of the fact that the restricted linear map $f^*: \mathbb{R}^k \longrightarrow V_k$ is a state-map equivariant linear isomorphism between $F^f_\rho$ and $\overline{F}^f_\rho$ and of the properties of this kind of maps (see, for instance, [72, Proposition 2.3]).

(iv) Let $\mathbf{z} \in (D_d)^{\mathbb{Z}_-}$ and $t \in \mathbb{Z}_-$ be arbitrary. Then, using (25), we have
$$\begin{aligned}
\left\|U_\rho(\mathbf{z})_t - \overline{U}^f_\rho(\mathbf{z})_t\right\| &= \left\|F_\rho\left(U_\rho(\mathbf{z})_{t-1}, \mathbf{z}_t\right) - \overline{F}^f_\rho\left(\overline{U}^f_\rho(\mathbf{z})_{t-1}, \mathbf{z}_t\right)\right\| \\
&= \left\|F_\rho\left(U_\rho(\mathbf{z})_{t-1}, \mathbf{z}_t\right) - F_\rho\left(\overline{U}^f_\rho(\mathbf{z})_{t-1}, \mathbf{z}_t\right) + F_\rho\left(\overline{U}^f_\rho(\mathbf{z})_{t-1}, \mathbf{z}_t\right) - \overline{F}^f_\rho\left(\overline{U}^f_\rho(\mathbf{z})_{t-1}, \mathbf{z}_t\right)\right\| \\
&\leq \rho\left\|U_\rho(\mathbf{z})_{t-1} - \overline{U}^f_\rho(\mathbf{z})_{t-1}\right\| + \left\|(I_N - f^*\circ f)\left(F_\rho\left(\overline{U}^f_\rho(\mathbf{z})_{t-1}, \mathbf{z}_t\right)\right)\right\|. \qquad (54)
\end{aligned}$$
The bounds in (26) and (27) are obtained by bounding the last expression in (54) in two different fashions. First, if we use (20) and the hypothesis that $Q$ is a spanning set of $\mathbb{R}^N$, we have that
$$\begin{aligned}
\rho\left\|U_\rho(\mathbf{z})_{t-1} - \overline{U}^f_\rho(\mathbf{z})_{t-1}\right\| &+ \left\|(I_N - f^*\circ f)\left(F_\rho\left(\overline{U}^f_\rho(\mathbf{z})_{t-1}, \mathbf{z}_t\right)\right)\right\| \\
&\leq \rho\left\|U_\rho(\mathbf{z})_{t-1} - \overline{U}^f_\rho(\mathbf{z})_{t-1}\right\| + \epsilon^{1/2} M_Q\left\|(I_N - f^*\circ f)\left(F_\rho\left(\overline{U}^f_\rho(\mathbf{z})_{t-1}, \mathbf{z}_t\right)\right)\right\|_Q^{1/2}\left\|F_\rho\left(\overline{U}^f_\rho(\mathbf{z})_{t-1}, \mathbf{z}_t\right)\right\|_Q^{1/2} \\
&\leq \rho\left\|U_\rho(\mathbf{z})_{t-1} - \overline{U}^f_\rho(\mathbf{z})_{t-1}\right\| + \epsilon^{1/2} C M_Q C_Q\left(1 + |||f|||^2\right)^{1/2}. \qquad (55)
\end{aligned}$$
If we now iterate the procedure in (54) on the first summand of this expression, we obtain
$$\begin{aligned}
\left\|U_\rho(\mathbf{z})_t - \overline{U}^f_\rho(\mathbf{z})_t\right\| &\leq \rho\left(\rho\left\|U_\rho(\mathbf{z})_{t-2} - \overline{U}^f_\rho(\mathbf{z})_{t-2}\right\| + \epsilon^{1/2} C M_Q C_Q\left(1 + |||f|||^2\right)^{1/2}\right) + \epsilon^{1/2} C M_Q C_Q\left(1 + |||f|||^2\right)^{1/2} \\
&= \rho^2\left\|U_\rho(\mathbf{z})_{t-2} - \overline{U}^f_\rho(\mathbf{z})_{t-2}\right\| + (1 + \rho)\,\epsilon^{1/2} C M_Q C_Q\left(1 + |||f|||^2\right)^{1/2} \\
&\leq \rho^j\left\|U_\rho(\mathbf{z})_{t-j} - \overline{U}^f_\rho(\mathbf{z})_{t-j}\right\| + \left(1 + \rho + \rho^2 + \cdots + \rho^{j-1}\right)\epsilon^{1/2} C M_Q C_Q\left(1 + |||f|||^2\right)^{1/2}. \qquad (56)
\end{aligned}$$
As by hypothesis $\rho < 1$, we can take the limit $j \to \infty$ in this expression, which yields (26). In order to obtain (27) it suffices to replace the use of (20) in (55) by that of (22).

(v) First of all, note that for any $R > \max\{1/|||f|||^2, 1\}$, the contraction parameter $\rho = 1/(R|||f|||^2)$ satisfies the condition (24). Set now $Q := \{\pm\mathbf{e}_1, \ldots, \pm\mathbf{e}_N\}$. It is easy to see that with this choice, the norm $\|\cdot\|_Q$ introduced in Lemma III.1 satisfies $\|\cdot\|_Q = \|\cdot\|_1$ and that $M_Q = 1$. If we now recall that $\|\cdot\| \leq \|\cdot\|_1 \leq \sqrt{N}\|\cdot\|$ and that $(1/\sqrt{N})|||\cdot||| \leq |||\cdot|||_1 \leq \sqrt{N}|||\cdot|||$, we can rewrite the inequality (55) as
$$\begin{aligned}
\left\|U_\rho(\mathbf{z})_t - \overline{U}^f_\rho(\mathbf{z})_t\right\| &\leq \rho\left\|U_\rho(\mathbf{z})_{t-1} - \overline{U}^f_\rho(\mathbf{z})_{t-1}\right\| + \epsilon^{1/2}|||I_N - f^*\circ f|||_1^{1/2}\left\|F_\rho\left(\overline{U}^f_\rho(\mathbf{z})_{t-1}, \mathbf{z}_t\right)\right\|_1 \\
&\leq \rho\left\|U_\rho(\mathbf{z})_{t-1} - \overline{U}^f_\rho(\mathbf{z})_{t-1}\right\| + \epsilon^{1/2} N^{3/4} C\,|||I_N - f^*\circ f|||^{1/2} \leq
\end{aligned}$$