Discrete-time signatures and randomness in
reservoir computing
Christa Cuchiero, Lukas Gonon, Lyudmila Grigoryeva, Juan-Pablo Ortega, and Josef Teichmann
Abstract—A new explanation of the geometric nature of the reservoir computing phenomenon is presented. Reservoir computing is understood in the literature as the possibility of approximating input/output systems with randomly chosen recurrent neural systems and a trained linear readout layer. Light is shed on this phenomenon by constructing what we call strongly universal reservoir systems, obtained as random projections of a family of state-space systems that generate Volterra series expansions. This procedure yields a state-affine reservoir system with randomly generated coefficients whose dimension is logarithmically reduced with respect to that of the original system. This reservoir system is able to approximate any element in the class of fading memory filters just by training a different linear readout for each filter. Explicit expressions for the probability distributions needed in the generation of the projected reservoir system are stated, and bounds for the approximation error committed are provided.
Index Terms—Reservoir computing, recurrent neural network, state-affine system, SAS, signature state-affine system, SigSAS, echo state network, ESN, Johnson-Lindenstrauss Lemma, Volterra series, machine learning.
I. INTRODUCTION
Many dynamical problems in engineering, signal processing, forecasting, time series analysis, recurrent neural networks, or control theory can be described using input/output (IO) systems. These mathematical objects establish a functional link that describes the relation between the time evolution of one or several explanatory variables (the input) and a second collection of dependent or explained variables (the output).
A generic question in all those fields is to determine the IO system underlying an observed phenomenon. This is the so-called system identification problem. For this purpose, first principles coming from physics or chemistry can be invoked when these are known or the setup is simple enough to apply them. In complex situations, in which access to all the variables that determine the behavior of the system is difficult or impossible, or when a precise mathematical relation between input and output is not known, it has proved more efficient to carry out the system identification using generic families of models with strong approximation abilities that are estimated using observed data. This approach, which we refer to as empirical system identification, has been developed using different techniques coming simultaneously from engineering, statistics, and computer science.
C. Cuchiero is affiliated with the University of Vienna, Austria. L. Gonon is with the Ludwig-Maximilians-Universität München, Germany. L. Grigoryeva is with the Universität Konstanz, Germany. J.-P. Ortega is with the Nanyang Technological University, Singapore. J. Teichmann's affiliation is ETH Zürich, Switzerland.
In this paper, we focus on a particularly promising strategy for empirical system identification known as reservoir computing (RC). Reservoir computing capitalizes on the revolutionary idea that there are learning systems that attain universal approximation properties without the need to estimate all their parameters using, for instance, supervised learning. More specifically, RC can be seen as a recurrent neural network approach to modeling IO systems using state-space representations in which
- the state equation is randomly generated, sometimes with sparsity features, and
- only the (usually very simple) functional form of the observation equation is tailored to the specific problem using observed data.
RC can be found in the literature under other denominations like Liquid State Machines [1]–[5] and is represented by various learning paradigms, with Echo State Networks (ESNs) [6]–[8] being a particularly important example.
RC has shown superior performance in many forecasting and classification engineering tasks (see [9]–[12] and references therein) and unprecedented abilities in the learning of the attractors of complex nonlinear infinite-dimensional dynamical systems [8], [13]–[15]. Additionally, RC implementations with dedicated hardware have been designed and built (see, for instance, [16]–[24]) that exhibit information processing speeds that largely outperform standard Turing-type computers.
The most far-reaching and radical innovation in the RC approach is the use of untrained, randomly generated, and sometimes sparse state maps. This circumvents well-known difficulties in the training of generic recurrent neural networks arising from bifurcation phenomena [25], which, despite recent progress in the regularization and training of deep RNN structures (see, for instance, [26]–[28], and references therein), render classical gradient descent methods non-convergent. Randomization has already been successfully applied in a static setup using neural networks with randomized weights, in particular in the seminal works on random feature models [29] and Extreme Learning Machines [30]. This built-in randomness makes reservoir models different from other conventional approaches where state-space systems appear. For instance, Kalman filtering [31] has been used for decades in signal processing and, in that case, both linear and nonlinear [32], [33] Kalman techniques hinge on the idea of designing the state map so that the a posteriori residual errors have minimal variance. This requires a significant computational effort related to recursive parameter estimation that is not
needed for RC systems. In the dynamical systems context, an important result in [34] shows that randomly drawn ESNs can be trained by exclusively optimizing a linear readout, using generic one-dimensional observations of a given invertible and differentiable dynamical system, to produce dynamics that are topologically conjugate to that given system; in other words, randomly generated ESNs are capable of learning the attractors of invertible dynamical systems. More generally, the approximation capabilities of randomly generated ESNs have been established in [35] in the setup of IO systems; there, approximation bounds are provided in terms of the architecture parameters.
In this paper, we provide additional insight into the randomization question for another family of RC systems, namely, the non-homogeneous state-affine systems (SAS). These systems were introduced and proved to be universal approximants in [36], [37]. We show here that they retain this universality property when they are randomly generated. The approach pursued in this work is considerably different from the one in the above-cited references and is based on the following steps. First, we consider causal and time-invariant analytic filters with semi-infinite inputs. The Taylor series expansion of these objects coincides with what is known as their Volterra series representation. Second, we show that the truncated Volterra series representation (whose associated truncation error can be quantified) admits a state-space representation with linear readouts in a (potentially) high-dimensional, adequately constructed tensor space. We refer to this system as the signature state-affine system (SigSAS): on the one hand, it belongs to the SAS family and, on the other hand, it shares fundamental properties with the so-called signature process from the (continuous-time) theory of rough paths, which inspired the title of the paper.
Rough paths theory, as introduced by T. Lyons in the seminal work [38], was initially developed to deal with controlled differential equations driven by rough signals in a pathwise way. These equations can be seen as a continuous-time analogue of time series models, where the rough signals play the role of the model innovations. The key object in this theory is the signature, which was first studied by K. Chen [39], [40] and consists in enhancing the rough input with additional curves (satisfying certain algebraic properties) mimicking what in the smooth case corresponds to iterated integrals of the curve with itself.
It is a deep mathematical fact that unique solutions of the rough differential equation exist and are a continuous map of the signature (in appropriately chosen topologies). Surprisingly, this non-linear continuous map can be arbitrarily well approximated by linear maps of the signature. More generally, on compact sets of so-called "non-tree-like" paths (see [41] for a precise definition), every continuous path functional (with respect to a certain p-variation norm) can be uniformly approximated by a linear function of the signature. Indeed, linear functionals of the signature form a point-separating algebra on sets of "non-tree-like" paths, which by the Stone-Weierstrass Theorem then yields a universal approximation theorem for general path functionals (see, for instance, [42]). Rough path theory has been substantially extended by Martin Hairer [43] towards the theory of regularity structures and is nowadays the tool to analyze deep analytic properties of continuous-time IO systems.
From a machine learning perspective, the signature can be thought of as a feature map capturing all specific characteristics of a given path. More precisely, it serves as a linear regression basis and can thus be interpreted as an abstract reservoir (for the moment without random specifications) for solutions of rough differential equations. These appealing properties have made signature methods highly popular for machine learning applications, both for streamed data (in particular, in finance) and for complex classification tasks. For inspiring examples of the rapidly growing literature on machine learning using signature methods we refer to [44]–[51] and references therein.
Returning to the SAS family, we will show that the solutions of the SigSAS system introduced in this paper share exactly the two crucial properties that make the signature central in rough path theory: first, the SigSAS solutions fully characterize the input sequences and, second, any (sufficiently regular) IO system can be written as a linear map of the SigSAS system. These properties have been exploited in the continuous-time setup in [52].
Finally, we use the Johnson-Lindenstrauss Lemma [53] to prove that a random projection of the SigSAS system yields a smaller-dimensional SAS system with random matrix coefficients (that can be chosen to be sparse) that approximates the original system. Moreover, this constructive procedure gives us full knowledge of the law that needs to be used to draw the entries of the low-dimensional approximating SAS system, without ever having to use the original large-dimensional SigSAS, which amounts to a form of information compression with efficient reconstruction in this setup [54]. An important feature of the dimension-reduced randomly drawn SAS system is that it serves as a universal approximator for any reasonably behaved IO system and that only the linear output layer that is applied to it depends on the individual system that needs to be learnt. We refer to this feature as the strong universality property.
This approach to the approximation problem in recurrent neural networks using randomized systems provides a new explanation of the geometric nature of the reservoir computing phenomenon. The results in the following pages show that randomly generated SAS reservoir systems approximate well any sufficiently regular IO system just by tuning a linear readout, because they coincide with an error-controlled random projection of a higher-dimensional Volterra series expansion of that system.
II. TRUNCATED VOLTERRA REPRESENTATIONS OF ANALYTIC FILTERS
We start by describing the setup that we shall be working on,
together with the main approximation tool which we will be
using later on in the paper, namely, Volterra series expansions.
Details on the concepts introduced in the following paragraphs
can be found in, for instance, [55]–[57], and references therein.
All along the paper, the symbol $\mathbb{Z}$ denotes the set of all integers and $\mathbb{Z}_-$ stands for the set of negative integers with the zero element included. Let $D_d \subset \mathbb{R}^d$ and $D_m \subset \mathbb{R}^m$. We refer to maps of the type $U: (D_d)^{\mathbb{Z}} \longrightarrow (D_m)^{\mathbb{Z}}$ between infinite sequences with values in $D_d$ and $D_m$, respectively, as filters, operators, or discrete-time input/output systems, and to those like $H: (D_d)^{\mathbb{Z}} \longrightarrow D_m$ (or $H: (D_d)^{\mathbb{Z}_-} \longrightarrow D_m$) as $\mathbb{R}^m$-valued functionals. These definitions will sometimes be extended to accommodate situations where the domains and the targets of the filters are not necessarily product spaces but just arbitrary subsets of $(\mathbb{R}^d)^{\mathbb{Z}}$ and $(\mathbb{R}^m)^{\mathbb{Z}}$ like, for instance, $\ell^\infty_-(\mathbb{R}^d)$ and $\ell^\infty_-(\mathbb{R}^m)$.
A filter $U: (D_d)^{\mathbb{Z}} \longrightarrow (D_m)^{\mathbb{Z}}$ is called causal when, for any two elements $z, w \in (D_d)^{\mathbb{Z}}$ that satisfy $z_\tau = w_\tau$ for any $\tau \le t$, for a given $t \in \mathbb{Z}$, we have that $U(z)_t = U(w)_t$. Let $T_\tau: (D_d)^{\mathbb{Z}} \longrightarrow (D_d)^{\mathbb{Z}}$ be the time delay operator defined by $T_\tau(z)_t := z_{t-\tau}$. The filter $U$ is called time-invariant (TI) when it commutes with the time delay operator, that is, $T_\tau \circ U = U \circ T_\tau$, for any $\tau \in \mathbb{Z}$ (in this expression, the two operators $T_\tau$ have to be understood as defined in the appropriate sequence spaces). There is a bijection between causal and time-invariant filters and functionals. We denote by $U_H: (D_d)^{\mathbb{Z}} \longrightarrow (D_m)^{\mathbb{Z}}$ (respectively, $H_U: (D_d)^{\mathbb{Z}_-} \longrightarrow D_m$) the filter (respectively, the functional) associated to the functional $H: (D_d)^{\mathbb{Z}_-} \longrightarrow D_m$ (respectively, the filter $U: (D_d)^{\mathbb{Z}} \longrightarrow (D_m)^{\mathbb{Z}}$). Causal and time-invariant filters are fully determined by their restriction to semi-infinite sequences, that is, $U: (D_d)^{\mathbb{Z}_-} \longrightarrow (D_m)^{\mathbb{Z}_-}$, which will be denoted using the same symbol.
In most cases, we work in the situation in which $D_d$ and $D_m$ are compact and the sequence spaces $(D_d)^{\mathbb{Z}_-}$ and $(D_m)^{\mathbb{Z}_-}$ are endowed with the product topology. It can be shown (see [55]) that this topology is equivalent to the norm topology induced by any weighted norm defined by $\|z\|_w := \sup_{t \in \mathbb{Z}_-}\{\|z_t\| w_{-t}\}$, $z \in (D_d)^{\mathbb{Z}_-}$, where $w: \mathbb{N} \longrightarrow (0, 1]$ is an arbitrary strictly decreasing sequence (we call it a weighting sequence) with zero limit and such that $w_0 = 1$. Filters and functionals that are continuous with respect to this topology are said to have the fading memory property (FMP).
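As a numerical illustration of this norm, the following sketch evaluates $\|z\|_w$ for sequences truncated to finitely many terms; the exponential weighting sequence $w_n = \lambda^n$ is our own illustrative choice, not one prescribed by the text.

```python
import numpy as np

def weighted_norm(z, lam=0.9):
    """Weighted norm ||z||_w = sup_t ||z_t|| w_{-t} for a (truncated)
    semi-infinite sequence z = (..., z_{-1}, z_0), stored row-wise with the
    last row playing the role of z_0.  The weighting sequence w_n = lam**n
    is strictly decreasing, has zero limit, and satisfies w_0 = 1."""
    z = np.atleast_2d(np.asarray(z, dtype=float))
    T = z.shape[0]
    # time indices t = -(T-1), ..., -1, 0, hence weight indices -t = T-1, ..., 0
    weights = lam ** np.arange(T - 1, -1, -1)
    return float(np.max(weights * np.linalg.norm(z, axis=1)))

# Two inputs that agree in the recent past are close in this norm even if
# they differ far in the past: this is the fading memory effect.
z1 = np.ones((50, 1))
z2 = np.ones((50, 1)); z2[:10] = -1.0   # differs only in the remote past
print(weighted_norm(z1 - z2))           # 2 * 0.9**40, a small number
```

The choice of weighting sequence is immaterial for the induced topology, which is why the definition above leaves it as a free parameter.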
A particularly important class of IO systems is that generated by state-space systems, in which the output $y \in (D_m)^{\mathbb{Z}_-}$ is obtained out of the input $z \in (D_d)^{\mathbb{Z}_-}$ as the solution of the equations

$$x_t = F(x_{t-1}, z_t), \qquad (1)$$
$$y_t = h(x_t), \qquad (2)$$

where $F: D_N \times D_d \longrightarrow D_N$ is the so-called state map, for some $D_N \subset \mathbb{R}^N$, $N \in \mathbb{N}$, and $h: D_N \longrightarrow D_m$ is the readout or observation map. When for any input $z \in (D_d)^{\mathbb{Z}_-}$ there is only one output $y \in (D_m)^{\mathbb{Z}_-}$ that satisfies (1)-(2), we say that this state-space system has the echo state property (ESP), in which case it determines a unique filter $U^F_h: (D_d)^{\mathbb{Z}_-} \longrightarrow (D_m)^{\mathbb{Z}_-}$. When the ESP holds at the level of the state equation (1), then it determines another filter $U^F: (D_d)^{\mathbb{Z}_-} \longrightarrow (D_N)^{\mathbb{Z}_-}$ and then $U^F_h = h \circ U^F$. The filters $U^F_h$ and $U^F$, when they exist, are automatically causal and TI (see [55]). The continuity and the differentiability properties of the state and observation maps $F$ and $h$ imply continuity and differentiability for $U^F_h$ and $U^F$ under very general hypotheses; see [56] for an in-depth study of this question.
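The echo state property of a contractive state map of the form (1) can be illustrated with a small simulation; the $\tanh$ state map, the dimensions, and the scaling below are illustrative choices of ours, not constructions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 20, 200

# A randomly generated state map F(x, z) = tanh(A x + C z); rescaling A to
# operator norm 0.7 < 1 makes F a contraction in x (tanh is 1-Lipschitz),
# which is sufficient for the echo state property (ESP).
A = rng.normal(size=(N, N)); A *= 0.7 / np.linalg.norm(A, 2)
C = rng.normal(size=N)
F = lambda x, z: np.tanh(A @ x + C * z)

z = rng.uniform(-1, 1, size=T)                   # a scalar input path
x, y = rng.normal(size=N), rng.normal(size=N)    # two different initial states
for t in range(T):
    x, y = F(x, z[t]), F(y, z[t])

# ESP in action: the influence of the initial state washes out, so the
# state trajectory is uniquely determined by the input.
print(np.linalg.norm(x - y))                     # bounded by 0.7**200 * const
```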
We denote by $\|\cdot\|$ the Euclidean norm if not stated otherwise and use the symbol $|||\cdot|||$ for the operator norm with respect to the 2-norms in the target and the domain spaces. Additionally, for any $z \in (\mathbb{R}^d)^{\mathbb{Z}_-}$ we define the $p$-norms $\|z\|_p := \left( \sum_{t \in \mathbb{Z}_-} \|z_t\|^p \right)^{1/p}$, for $1 \le p < \infty$, and $\|z\|_\infty := \sup_{t \in \mathbb{Z}_-}\{\|z_t\|\}$ for $p = \infty$. Given $M > 0$, we denote $K_M := \{z \in (\mathbb{R}^d)^{\mathbb{Z}_-} \mid \|z_t\| \le M$ for all $t \in \mathbb{Z}_-\}$. It is easy to see that $K_M = \overline{B}_M \subset \ell^\infty_-(\mathbb{R}^d)$, with $B_M := B_{\|\cdot\|_\infty}(0, M)$ and $\ell^\infty_-(\mathbb{R}^d) := \{z \in (\mathbb{R}^d)^{\mathbb{Z}_-} \mid \|z\|_\infty < \infty\}$. We define $\widetilde{B}_M := B_M \cap \ell^1_-(\mathbb{R}^d)$ with $\ell^1_-(\mathbb{R}^d) := \{z \in (\mathbb{R}^d)^{\mathbb{Z}_-} \mid \|z\|_1 < \infty\}$ and use the same symbol $\widetilde{B}_M$ whenever $d = 1$. Additionally, we will write $L(V, W)$ to refer to the space of linear maps between the real vector spaces $V$ and $W$. The following statement is the main approximation result that will be used in the paper.
Theorem II.1. Let $M, L > 0$ and let $U: K_M \subset \ell^\infty_-(\mathbb{R}^d) \longrightarrow K_L \subset \ell^\infty_-(\mathbb{R}^m)$ be a causal and time-invariant fading memory filter whose restriction $U|_{B_M}$ is analytic as a map between open sets in the Banach spaces $\ell^\infty_-(\mathbb{R}^d)$ and $\ell^\infty_-(\mathbb{R}^m)$ and satisfies $U(0) = 0$. Then, for any $z \in \widetilde{B}_M$ there exists a Volterra series representation of $U$ given by

$$U(z)_t = \sum_{j=1}^{\infty} \sum_{m_1=-\infty}^{0} \cdots \sum_{m_j=-\infty}^{0} g_j(m_1, \ldots, m_j)\left( z_{m_1+t} \otimes \cdots \otimes z_{m_j+t} \right), \qquad (3)$$

with $t \in \mathbb{Z}_-$ and where the map $g_j: (\mathbb{Z}_-)^j \longrightarrow L(\mathbb{R}^d \otimes \cdots \otimes \mathbb{R}^d, \mathbb{R}^m)$ is given by

$$g_j(m_1, \ldots, m_j)\left( e_{i_1} \otimes \cdots \otimes e_{i_j} \right) = \frac{1}{j!}\, D^j H_U(0)\left( e^{i_1}_{m_1}, \ldots, e^{i_j}_{m_j} \right), \qquad (4)$$

where, for any $z_0$ in some open subset of $\ell^\infty_-(\mathbb{R}^d)$, $D^j H_U(z_0)$ with $j \ge 1$ denotes the $j$-order Fréchet differential at $z_0$ of the functional $H_U$ associated to the filter $U$, $\{e_1, \ldots, e_d\}$ is the canonical basis of $\mathbb{R}^d$, and the sequences $e^{i_l}_{m_k} \in \ell^\infty_-(\mathbb{R}^d)$ are defined by:

$$\left( e^{i_l}_{m_k} \right)_t := \begin{cases} e_{i_l} \in \mathbb{R}^d, & \text{if } t = m_k, \\ 0, & \text{otherwise.} \end{cases}$$

Moreover, there exists a monotonically decreasing sequence $w^U$ with zero limit such that, for any $p, l \in \mathbb{N}$,

$$\left\| U(z)_t - \sum_{j=1}^{p} \sum_{m_1=-l}^{0} \cdots \sum_{m_j=-l}^{0} g_j(m_1, \ldots, m_j)\left( z_{m_1+t} \otimes \cdots \otimes z_{m_j+t} \right) \right\| \le w^U_l + L \left( 1 - \frac{\|z\|_\infty}{M} \right)^{-1} \left( \frac{\|z\|_\infty}{M} \right)^{p+1}. \qquad (5)$$
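To make the truncation in (5) concrete, the following sketch evaluates a truncated Volterra series for a hypothetical analytic scalar filter obtained by composing a linear fading memory filter with an analytic function; for such a filter the kernels factorize, $g_j(m_1, \ldots, m_j) = 2^{-j} a^{-(m_1 + \cdots + m_j)}$, so the truncated sum collapses to powers of a partial linear sum. The specific filter, kernels, and constants are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
a, M = 0.5, 0.5                         # geometric memory decay, input bound
T = 400
z = rng.uniform(-M, M, size=T)          # z[T-1] plays the role of z_0

def s(l):
    """Partial linear sum  sum_{m=-l}^{0} a^{-m} z_{t+m}  at t = 0."""
    m = np.arange(-l, 1)                # m = -l, ..., 0
    return float(np.sum(a ** (-m) * z[T - 1 + m]))

def U_truncated(p, l):
    """Truncated Volterra series with factorizing kernels
    g_j(m_1..m_j) = 2^{-j} a^{-(m_1+..+m_j)}:  sum_{j=1}^{p} (s_l / 2)^j."""
    return sum(2.0 ** (-j) * s(l) ** j for j in range(1, p + 1))

# Exact value: U(z)_0 = g(s_0) with the analytic function g(x) = x / (2 - x).
exact = s(T - 1) / (2.0 - s(T - 1))
for p, l in [(1, 5), (3, 20), (8, 60)]:
    print(p, l, abs(exact - U_truncated(p, l)))   # truncation error
```

The error has exactly the structure of the bound (5): a tail in the lag $l$ (here geometric in $a$) plus a tail in the polynomial degree $p$ (geometric in $\|z\|_\infty / M$).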
A. The signature state-affine system (SigSAS)
We now show that the filter obtained out of the truncated
Volterra series expansion in the expression (5) can be written
down as the unique solution of a non-homogeneous state-affine
system (SAS) with linear readouts that, as we shall show in
Section II-B, has particularly strong universal approximation
properties. We first briefly recall how the SAS family is
constructed.
Let $\alpha = (\alpha_1, \ldots, \alpha_d)^\top \in \mathbb{N}^d$ and $z = (z_1, \ldots, z_d)^\top \in \mathbb{R}^d$, and define the monomials $z^\alpha := z_1^{\alpha_1} \cdots z_d^{\alpha_d}$. We denote by $\mathbb{M}_{N_1,N_2}$ the space of real $N_1 \times N_2$ matrices with $N_1, N_2 \in \mathbb{N}$ and use $\mathbb{M}_{N_1,N_2}[z]$ to refer to the space of polynomials in $z \in \mathbb{R}^d$ with matrix coefficients in $\mathbb{M}_{N_1,N_2}$, that is, the set of elements $p$ of the form

$$p(z) = \sum_{\alpha \in V_p} z^\alpha A_\alpha,$$

with $V_p \subset \mathbb{N}^d$ a finite subset and $A_\alpha \in \mathbb{M}_{N_1,N_2}$ the matrix coefficients. A state-affine system (SAS) is given by

$$x_t = p(z_t)\, x_{t-1} + q(z_t), \qquad y_t = W x_t, \qquad (6)$$

where $p \in \mathbb{M}_{N,N}[z]$ and $q \in \mathbb{M}_{N,1}[z]$ are polynomials with matrix and vector coefficients, respectively, and $W \in \mathbb{M}_{m,N}$. If we consider inputs in the set $K_M$ and the $p$ and $q$ in the state-space system (6) are such that

$$M_p := \sup_{z \in B_{\|\cdot\|}(0, M)} \{|||p(z)|||\} < 1, \qquad M_q := \sup_{z \in B_{\|\cdot\|}(0, M)} \{|||q(z)|||\} < \infty,$$

where $B_{\|\cdot\|}(0, M)$ denotes the closed ball in $\mathbb{R}^d$ of radius $M$ and center $0$ with respect to the Euclidean norm, then a unique state-filter $U^{p,q}: K_M \longrightarrow K_L$ can be associated to it, with $L := M_q/(1 - M_p)$. It has been shown in [36], [37] that SAS systems are universal approximants in the fading memory and in the $L^p$-integrable categories, in the sense that given a filter in any of those two categories, there exists a SAS system of type (6) that uniformly or in the $L^p$-sense approximates it.
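A minimal sketch of the recursion in (6) for scalar inputs, with illustrative polynomial coefficients of our own choosing rescaled (crudely, via the triangle inequality) so that $M_p < 1$; the run checks that the state indeed stays in the ball of radius $L = M_q/(1 - M_p)$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, T = 10, 1.0, 300

# Coefficients of p(z) = z*A1 + z^2*A2 and q(z) = b0 + z*b1 (illustrative).
A1 = rng.normal(size=(N, N)); A2 = rng.normal(size=(N, N))
b0, b1 = rng.normal(size=N), rng.normal(size=N)

# Rescale so that M_p <= M*|||A1||| + M^2*|||A2||| = 0.9 < 1.
A1 *= 0.45 / (M * np.linalg.norm(A1, 2))
A2 *= 0.45 / (M ** 2 * np.linalg.norm(A2, 2))

p = lambda z: z * A1 + z ** 2 * A2
q = lambda z: b0 + z * b1

z = rng.uniform(-M, M, size=T)
x = np.zeros(N)
for t in range(T):
    x = p(z[t]) @ x + q(z[t])           # SAS state recursion (6)

# The state remains in the ball of radius L = M_q / (1 - M_p).
Mp = 0.9                                # by construction, see the rescaling above
Mq = np.linalg.norm(b0) + M * np.linalg.norm(b1)
print(np.linalg.norm(x) <= Mq / (1 - Mp))   # True
```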
The signature state-affine system that we construct in this section exhibits what we call the strong universality property. This means that the state equation for this state-space representation is the same for any fading memory filter that is being approximated, and it is only the linear readout that changes. In other words, we provide a result that yields the approximation (as accurate as desired) of any fading memory IO system as the linear readout of the solution of a fixed non-homogeneous SAS system that does not depend on the filter being approximated.
Since the important property that we just described is reminiscent of an analogous feature of the signature process in the context of the representation of the solutions of controlled stochastic differential equations [52], we shall refer to this state system as the signature SAS (SigSAS) system.
Before we proceed, we need to introduce some notation. First, for any $l, d \in \mathbb{N}$, we denote by $T^l(\mathbb{R}^d)$ the space of tensors of order $l$ on $\mathbb{R}^d$, that is,

$$T^l(\mathbb{R}^d) := \left\{ \sum_{i_1, \ldots, i_l=1}^{d} a_{i_1, \ldots, i_l}\, e_{i_1} \otimes \cdots \otimes e_{i_l} \;\middle|\; a_{i_1, \ldots, i_l} \in \mathbb{R} \right\}.$$

The tensor space $T^l(\mathbb{R}^d)$ will be understood as a normed space with a crossnorm [58] that we shall leave unspecified for the time being. We shall be using an order lowering map $\pi_l: T^{l+1}(\mathbb{R}^d) \longrightarrow T^l(\mathbb{R}^d)$ that, for any vector $v := \sum_{i_1, \ldots, i_{l+1}=1}^{d} a_{i_1, \ldots, i_{l+1}}\, e_{i_1} \otimes \cdots \otimes e_{i_{l+1}} \in T^{l+1}(\mathbb{R}^d)$, is defined as

$$\pi_l(v) := \sum_{i_2, \ldots, i_{l+1}=1}^{d} a_{1, i_2, \ldots, i_{l+1}}\, e_{i_2} \otimes \cdots \otimes e_{i_{l+1}} \in T^l(\mathbb{R}^d).$$

The order lowering map is linear and its operator norm satisfies $|||\pi_l||| = 1$.
We shall restrict the presentation to one-dimensional inputs, that is, we consider input sequences $z \in K_M \subset \ell^\infty_-(\mathbb{R})$. Now, for fixed $l, p \in \mathbb{N}$, we define for any $z \in K_M$ and $t \in \mathbb{Z}_-$,

$$\widetilde{z}_t := \sum_{i=1}^{p+1} z_t^{i-1} e_i \in \mathbb{R}^{p+1} \quad \text{and} \quad \widehat{z}_t := \widetilde{z}_{t-l} \otimes \cdots \otimes \widetilde{z}_t. \qquad (7)$$

Note that $\widetilde{z}_t$ is the Vandermonde vector [59] associated to $z_t$ and that $\widehat{z}_t$ is a tensor in $T^{l+1}(\mathbb{R}^{p+1})$ whose components in the canonical basis are all the monomials in the variables $z_t, \ldots, z_{t-l}$ that contain powers up to order $p$ in each of those variables, namely

$$\widehat{z}_t = \sum_{i_1, \ldots, i_{l+1}=1}^{p+1} z_{t-l}^{i_1-1} \cdots z_t^{i_{l+1}-1}\, e_{i_1} \otimes \cdots \otimes e_{i_{l+1}}.$$

Finally, given $I_0 \subset \{1, \ldots, p+1\}$ an arbitrarily chosen but fixed subset of cardinality higher than 1 that contains the element 1, we define:

$$\widehat{z}^0_t = \sum_{i \in I_0} z_t^{i-1}\, \underbrace{e_1 \otimes \cdots \otimes e_1}_{l\text{-times}} \otimes\, e_i \in T^{l+1}(\mathbb{R}^{p+1}). \qquad (8)$$
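The objects in (7) are straightforward to build numerically: the following sketch (with illustrative values of $p$, $l$, and the input window) constructs $\widehat{z}_t$ as an iterated outer product of Vandermonde vectors and checks one monomial component.

```python
import numpy as np

def vand(z_t, p):
    """Vandermonde vector (1, z_t, ..., z_t^p) in R^{p+1}, cf. (7)."""
    return z_t ** np.arange(p + 1)

def zhat(z_window, p):
    """Tensor z_hat_t = vand(z_{t-l}) x ... x vand(z_t) in T^{l+1}(R^{p+1}),
    where z_window = (z_{t-l}, ..., z_t)."""
    out = vand(z_window[0], p)
    for z_s in z_window[1:]:
        out = np.multiply.outer(out, vand(z_s, p))
    return out

p, window = 3, np.array([0.5, -0.2, 0.8])        # l = 2, so a tensor of order 3
Z = zhat(window, p)
print(Z.shape)                                   # (4, 4, 4)
# Component (i1, ..., i_{l+1}) is the monomial z_{t-l}^{i1} ... z_t^{i_{l+1}}
# with zero-based powers, e.g.:
print(np.isclose(Z[2, 0, 1], 0.5 ** 2 * (-0.2) ** 0 * 0.8 ** 1))   # True
```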
The next proposition introduces the SigSAS state system for fixed $l, p \in \mathbb{N}$, whose solution is used later on in Theorem II.4 to represent the truncated Volterra series expansions in Theorem II.1 of polynomial degree $p$ and lag $l$ (see expression (5)).
Proposition II.2 (The SigSAS system). Let $M > 0$ and let $l, p \in \mathbb{N}$. Let $0 < \lambda < \min\left\{ 1,\ 1/\sum_{j=0}^{p} M^j \right\}$. Consider the state system with uniformly bounded scalar inputs in $K_M = [-M, M]^{\mathbb{Z}_-}$ and states in $T^{l+1}(\mathbb{R}^{p+1})$ given by the recursion

$$x_t = \lambda\, \pi_l(x_{t-1}) \otimes \widetilde{z}_t + \widehat{z}^0_t. \qquad (9)$$

This state equation is induced by the state map $F^{\mathrm{SigSAS}}_{\lambda,l,p}: T^{l+1}(\mathbb{R}^{p+1}) \times \mathbb{R} \longrightarrow T^{l+1}(\mathbb{R}^{p+1})$ defined by

$$F^{\mathrm{SigSAS}}_{\lambda,l,p}(x, z) := \lambda\, \pi_l(x) \otimes \widetilde{z} + \widehat{z}^0, \qquad (10)$$

which is a contraction in the state variable with contraction constant

$$\lambda \widetilde{M} < 1, \quad \text{where} \quad \widetilde{M} := \sum_{j=0}^{p} M^j, \qquad (11)$$

and hence restricts to a map $F^{\mathrm{SigSAS}}_{\lambda,l,p}: B_{\|\cdot\|}(0, L) \times [-M, M] \longrightarrow B_{\|\cdot\|}(0, L)$, with

$$L := \widetilde{M}/(1 - \lambda \widetilde{M}). \qquad (12)$$

This state system has the echo state and the fading memory properties and its continuous, time-invariant, and causal associated filter $U^{\mathrm{SigSAS}}_{\lambda,l,p}: K_M \longrightarrow K_L \subset \left( T^{l+1}(\mathbb{R}^{p+1}) \right)^{\mathbb{Z}_-}$ is given by:

$$U^{\mathrm{SigSAS}}_{\lambda,l,p}(z)_t = \frac{\lambda^{l+1}}{1 - \lambda}\, \widehat{z}_t + \lambda^l\, \underbrace{\pi_l(\pi_l(\cdots(\pi_l(}_{l\text{-times}} \widehat{z}^0_{t-l}) \otimes \widetilde{z}_{t-(l-1)}) \otimes \cdots) \otimes \widetilde{z}_{t-1}) \otimes \widetilde{z}_t + \cdots + \lambda\, \pi_l(\widehat{z}^0_{t-1}) \otimes \widetilde{z}_t + \widehat{z}^0_t. \qquad (13)$$
Remark II.3. The state equation (9) is indeed a SAS with states defined in $T^{l+1}(\mathbb{R}^{p+1})$, as it has the same form as the first equality in (6). Indeed, this equation can be written as $x_t = p(z_t)\, x_{t-1} + q(z_t)$, with $p(z_t)$ and $q(z_t)$ the polynomials in $z_t$ with coefficients in $L(T^{l+1}(\mathbb{R}^{p+1}), T^{l+1}(\mathbb{R}^{p+1}))$ and $T^{l+1}(\mathbb{R}^{p+1})$, respectively, given by:

$$p(z_t)\, x_{t-1} := \lambda\, \pi_l(x_{t-1}) \otimes \widetilde{z}_t = \sum_{i=1}^{p+1} z_t^{i-1}\, \left( \lambda\, \pi_l(x_{t-1}) \otimes e_i \right),$$
$$q(z_t) := \widehat{z}^0_t = \sum_{i \in I_0} z_t^{i-1}\, e_1 \otimes \cdots \otimes e_1 \otimes e_i.$$
B. The SigSAS approximation theorem
As we already pointed out, $\widehat{z}_t$ is a vector in $T^{l+1}(\mathbb{R}^{p+1})$ whose components in the canonical basis are all the monomials in the variables $z_t, \ldots, z_{t-l}$ that contain powers up to order $p$ in each of those variables. Moreover, it is easy to see that all the other summands in the expression (13) of the filter $U^{\mathrm{SigSAS}}_{\lambda,l,p}$ are proportional (with a positive constant) to monomials already contained in $\widehat{z}_t$. This implies the existence of a linear map $A_{\lambda,l,p} \in L(T^{l+1}(\mathbb{R}^{p+1}), T^{l+1}(\mathbb{R}^{p+1}))$ with an invertible matrix representation with non-negative entries such that

$$U^{\mathrm{SigSAS}}_{\lambda,l,p}(z)_t = A_{\lambda,l,p}\, \widehat{z}_t. \qquad (14)$$

In the sequel we will denote the matrix representation of $A_{\lambda,l,p}$ using the same symbol, $A_{\lambda,l,p} \in \mathbb{M}_{N,N}$, $N := (p+1)^{l+1}$. This observation, together with Theorem II.1, can be used to prove the following result.
Theorem II.4. Let $M, L > 0$ and let $U: K_M \subset \ell^\infty_-(\mathbb{R}) \longrightarrow K_L \subset \ell^\infty_-(\mathbb{R}^m)$ be a causal and time-invariant fading memory filter whose restriction $U|_{B_M}$ is analytic as a map between open sets in the Banach spaces $\ell^\infty_-(\mathbb{R})$ and $\ell^\infty_-(\mathbb{R}^m)$ and satisfies $U(0) = 0$. Then, there exists a monotonically decreasing sequence $w^U$ with zero limit such that, for any $p, l \in \mathbb{N}$ and any $0 < \lambda < \min\left\{ 1,\ 1/\sum_{j=0}^{p} M^j \right\}$, there exists a linear map $W \in L(T^{l+1}(\mathbb{R}^{p+1}), \mathbb{R}^m)$ such that, for any $z \in \widetilde{B}_M$:

$$\left\| U(z)_t - W\, U^{\mathrm{SigSAS}}_{\lambda,l,p}(z)_t \right\| \le w^U_l + L \left( 1 - \frac{\|z\|_\infty}{M} \right)^{-1} \left( \frac{\|z\|_\infty}{M} \right)^{p+1}. \qquad (15)$$
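The practical content of this statement can be sketched in code: the state recursion below (the same SigSAS-style recursion as in (9), with illustrative parameters and $I_0 = \{1, 2\}$) is fixed once and for all, and only a linear readout $W$ is fitted by least squares to a given target filter. The target filter is an assumption of ours for illustration; the values of $p$ and $l$ used here are far smaller than what a prescribed accuracy in (15) would require, so only a modest fit should be expected.

```python
import numpy as np

rng = np.random.default_rng(4)
p, l = 3, 3
lam = 0.9 / (p + 1)                      # lambda < 1/(1 + M + ... + M^p) for M = 1
vand = lambda z: z ** np.arange(p + 1)

def step(x, z):
    # SigSAS recursion (9) with I_0 = {1, 2} in (8); pi_l is the slice x[0].
    out = lam * np.multiply.outer(x[0], vand(z))
    out[(0,) * l + (0,)] += 1.0
    out[(0,) * l + (1,)] += z
    return out

T, burn = 3000, 50
z = rng.uniform(-1, 1, size=T)
x = np.zeros((p + 1,) * (l + 1))
X, Y = [], []
for t in range(T):
    x = step(x, z[t])
    if t >= burn:
        X.append(x.ravel())
        # an assumed, illustrative fading memory target filter:
        Y.append(np.tanh(z[t] - 0.5 * z[t - 1] + 0.25 * z[t - 2] ** 2))
X, Y = np.array(X), np.array(Y)

# Strong universality in practice: only the linear readout W is trained
# (ordinary least squares); the state equation above is filter-independent.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(X.shape[1], np.max(np.abs(X @ W - Y)))   # state dimension (p+1)^(l+1), max error
```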
Remark II.5. Theorem II.4 establishes the strong universality of the SigSAS system in the sense that the state equation of this system is the same for any fading memory filter $U$ that is being approximated, and it is only the linear readout that changes. Nevertheless, we emphasize that the quality of the approximation is not filter-independent, as the decreasing sequence $w^U$ in the bound (15) depends on how fast the filter $U$ "forgets" past inputs.
Remark II.6. The analyticity hypothesis in the statement of Theorem II.4 can be dropped by using the fact that finite order and finite memory Volterra series are universal approximators in the fading memory category (see [60] and [56, Theorem 31]). In that situation, the bound for the truncation error in (15) does not necessarily apply anymore, in particular its second summand, which is intrinsically linked to analyticity. A generalized bound can be formulated in that case using arguments along the lines of those found in [35].
III. JOHNSON-LINDENSTRAUSS REDUCTION OF THE SIGSAS REPRESENTATION
The price to pay for the strong universality property exhibited by the signature state-affine system constructed in the previous section is the potentially large dimension of the tensor space in which this state-space representation is defined. In this section we address this problem by proposing a dimension reduction strategy that consists in using the random projections in the Johnson-Lindenstrauss Lemma [53] in order to construct a smaller-dimensional SAS system with random matrix coefficients (that can be chosen to be sparse). The results contained in the next subsections quantify the increase in approximation error committed when applying this dimensionality reduction strategy.
We start by introducing the Johnson-Lindenstrauss (JL) Lemma [53] and some properties that are needed later on in the presentation. Following this, we spell out how to use it in the dimension reduction of state-space systems in general and of the SigSAS representation in particular.
A. The JL Lemma and approximate projections
Given an $N$-dimensional Hilbert space $(V, \langle \cdot, \cdot \rangle)$ and $Q$ an $n$-point subset of $V$, the Johnson-Lindenstrauss (JL) Lemma [53], [61] guarantees, for any $0 < \epsilon < 1$, the existence of a linear map $f: V \longrightarrow \mathbb{R}^k$, with $k \in \mathbb{N}$ satisfying

$$k \ge \frac{24 \log n}{3\epsilon^2 - 2\epsilon^3}, \qquad (16)$$

that respects $\epsilon$-approximately the distances between the points in the set $Q$. More specifically,

$$(1 - \epsilon)\|v_1 - v_2\|^2 \le \|f(v_1) - f(v_2)\|^2 \le (1 + \epsilon)\|v_1 - v_2\|^2, \qquad (17)$$

for any $v_1, v_2 \in Q$. The norm $\|\cdot\|$ in $\mathbb{R}^k$ comes from an inner product that makes it into a Hilbert space or, in other words, it satisfies the parallelogram identity. This remarkable result is even more so in connection with further developments that guarantee that the linear map $f$ can be randomly chosen [61]–[63] and, moreover, within a family of sparse transformations [64], [65] (see also [66]).
In the developments in this paper, we use the original version of this result, in which the JL map $f$ is realized by a matrix $A \in \mathbb{M}_{k,N}$ whose entries are such that

$$A_{ij} \sim N(0, 1/k). \qquad (18)$$

It can be shown that with this choice, the probability of the relation (17) holding for every pair of points in $Q$ is bounded below by $1/n$.
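A minimal numerical sketch of (16)-(18): a Gaussian matrix with variance-$1/k$ entries is drawn, and the distortion of squared pairwise distances on a random point cloud is measured; all dimensions below are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(5)
N, n, eps = 3125, 40, 0.5               # e.g. N = (p+1)^(l+1) with p = l = 4

# Target dimension prescribed by (16): logarithmic in n, independent of N.
k = int(np.ceil(24 * np.log(n) / (3 * eps ** 2 - 2 * eps ** 3)))

Q = rng.normal(size=(n, N))             # an n-point subset of R^N
A = rng.normal(scale=1.0 / np.sqrt(k), size=(k, N))   # entries ~ N(0, 1/k), cf. (18)
FQ = Q @ A.T                            # images f(v) of the points of Q

# Empirical distortion of squared pairwise distances, to compare with (17).
worst = 0.0
for i in range(n):
    for j in range(i + 1, n):
        d2 = np.sum((Q[i] - Q[j]) ** 2)
        fd2 = np.sum((FQ[i] - FQ[j]) ** 2)
        worst = max(worst, abs(fd2 / d2 - 1.0))
print(k, worst)                         # k = 178 here, much smaller than N = 3125
```

The worst observed distortion is typically of the order of $\epsilon$; since the success probability is only bounded below, a draw can in principle fail, in which case one simply redraws $A$.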
Lemma III.1. Let $(V, \|\cdot\|)$ be a normed space and let $Q$ be a (finite or countably infinite) subset of $V$. Define $\|\cdot\|_Q: \mathrm{span}\{Q\} \longrightarrow \mathbb{R}^+$ by

$$\|v\|_Q := \inf\left\{ \sum_{j=1}^{\mathrm{Card}\, Q} |\lambda_j| \;\middle|\; \sum_{j=1}^{\mathrm{Card}\, Q} \lambda_j v_j = v,\ v_j \in Q \right\}.$$

(i) $\|\cdot\|_Q$ defines a seminorm on $\mathrm{span}\{Q\}$. If

$$M_Q := \sup\{\|v_i\| \mid v_i \in Q\} \qquad (19)$$

is finite, then $\|\cdot\|_Q$ is a norm.
(ii) $\|v\| \le \|v\|_Q\, M_Q$, for any $v \in \mathrm{span}\{Q\}$.
(iii) Let $Q_1, Q_2$ be subsets of $V$ such that $Q_1 \subset Q_2$. Then $\|v\|_{Q_2} \le \|v\|_{Q_1}$ for any $v \in \mathrm{span}\{Q_1\}$.

Remark III.2. If the hypothesis $M_Q < \infty$ is dropped in part (i) of Lemma III.1, then $\|\cdot\|_Q$ is in general not a norm, as the following example shows. Take $V = \mathbb{R}$ and $v_i = i$, $i \in \mathbb{N}$. It is easy to see that, in this setup,

$$\|1\|_Q = \inf\left\{ \frac{1}{i} \;\middle|\; i \in \mathbb{N} \right\} = 0.$$
Proposition III.3. Let $Q$ be a set of points in the Hilbert space $(V, \langle \cdot, \cdot \rangle)$ with $M_Q := \sup\{\|v_i\| \mid v_i \in Q\} < \infty$ and such that $-Q := \{-v \mid v \in Q\} = Q$. Let $\epsilon > 0$, let $f: V \longrightarrow \mathbb{R}^k$ be a linear map that satisfies the Johnson-Lindenstrauss property (17) with respect to $\epsilon$, and let $f^*: \mathbb{R}^k \longrightarrow V$ be the adjoint map with respect to a fixed inner product $\langle \cdot, \cdot \rangle$ in $\mathbb{R}^k$. Then,

$$|\langle w_1, (I_V - f^* \circ f)(w_2) \rangle| \le \epsilon\, M_Q^2\, \|w_1\|_Q \|w_2\|_Q, \qquad (20)$$

for any $w_1, w_2 \in \mathrm{span}\{Q\}$.

Corollary III.4. In the hypotheses of the previous proposition, let

$$C_Q := \inf_{c \in \mathbb{R}^+}\left\{ \|v\|_Q \le c\|v\|, \text{ for all } v \in \mathrm{span}\{Q\} \right\}. \qquad (21)$$

Then, for any $v \in \mathrm{span}\{Q\}$ such that $(f^* \circ f)(v) \in \mathrm{span}\{Q\}$, we have

$$\|(I_V - f^* \circ f)(v)\| \le \epsilon\, M_Q^2 C_Q^2\, \|v\|. \qquad (22)$$

This corollary is just a consequence of the inequality (20), which guarantees that

$$\|(I_V - f^* \circ f)(v)\|^2 \le \epsilon\, M_Q^2\, \|(I_V - f^* \circ f)(v)\|_Q \|v\|_Q \le \epsilon\, M_Q^2 C_Q^2\, \|(I_V - f^* \circ f)(v)\| \|v\|, \qquad (23)$$

which yields (22).
B. Johnson-Lindenstrauss projection of state-space dynamics
The next result shows how, when the dimension $k$ of the target of the JL map $f$ determined by (16) is chosen so that this map is generically surjective, any contractive state-space system with states in the domain of $f$ can be projected onto another one with states in its smaller-dimensional image. This result also shows that if the original system has the ESP and the FMP, then so does the projected one. Additionally, it gives bounds that quantify the dynamical differences between the two systems.
Theorem III.5. Let Fρ:RN×DdRNbe a one-
parameter family of continuous state maps, where DdRd
is a compact subset, 0<ρ<1, and Fρis a ρ-contraction
on the first component. Let Qbe a n-point spanning subset
of RNsatisfying Q=Q. Let f:RNRkbe a JL map
that satisfies (17)with 0<  < 1where the dimension khas
been chosen so that fis generically surjective. Then:
(i) Let Ff
ρ:Rk×DdRkbe the state map defined by:
Ff
ρ(x,z) := f(Fρ(f(x),z)) ,
for any xRkand zDd. If the parameter ρis chosen
so that
ρ < 1/|||f|||2,(24)
then Ff
ρis a contraction on the first entry. The symbol
|||·||| in (24)denotes the operator norm with respect to the
2-norms in RNand Rk.
(ii) Let Vk:= f(Rk)RNand let Ff
ρ:Vk×DdVkbe
the state map with states on the vector space Vk, defined
by:
Ff
ρ(x,z) := fFf
ρ((f)1(x),z)=ff(Fρ(x,z)),
(25)
for any xVkand zDd. If the contraction parameter
satisfies (24)then Ff
ρis also a contraction on the first
entry. Moreover, the restricted linear map f:RkVk
is a state-map equivariant linear isomorphism between Ff
ρ
and Ff
ρ.
(iii) Suppose, additionally, that there exist two constants
C, Cf>0such that the state spaces of the state maps
Fρand Ff
ρcan be restricted as Fρ:Bk·k(0, C)×Dd
Bk·k(0, C )and Ff
ρ:Bk·k(0, Cf)×DdBk·k(0, Cf).
Then, both Fρand Ff
ρhave the ESP and have unique
FMP associated filters Uρ: (Dd)ZKCand
Uf
ρ: (Dd)ZKCf, respectively. The state map
Ff
ρ:fBk·k(0, Cf)×DdfBk·k(0, Cf)is
isomorphic to the restricted version of Ff
ρ, also has the
ESP and a FMP associated filter Uf
ρ: (Dd)Z
fBk·k(0, Cf)Z. The state map Ff
ρand the filter
Uf
ρare called the JL projected versions of Fρand Uρ,
respectively.
(iv) Under the hypotheses of the previous point, for any $z\in(D_d)^{\mathbb{Z}}$ and $t\in\mathbb{Z}$:
$$\left\|U_\rho(z)_t - \widetilde{U}^f_\rho(z)_t\right\| \le \frac{\varepsilon^{1/2}\,C M_Q C_Q\left(1+|||f|||^2\right)^{1/2}}{1-\rho}, \qquad (26)$$
where $M_Q$ and $C_Q$ are given by (19) and (21), respectively. Alternatively, it can also be shown that
$$\left\|U_\rho(z)_t - \widetilde{U}^f_\rho(z)_t\right\| \le \frac{\varepsilon\,C M_Q^2 C_Q^2}{1-\rho}. \qquad (27)$$
(v) Let $R>\max\{1/|||f|||^2,1\}$ and set $\rho = 1/(R|||f|||^2)$. Then the elements in the set $Q$ can be chosen so that the bounds in (26) and (27) reduce to
$$\varepsilon^{1/2}N^{3/4}C\left(1+|||f|||^2\right)^{1/2}\frac{R|||f|||^2}{R|||f|||^2-1} \qquad (28)$$
and
$$\varepsilon N C\,\frac{R|||f|||^2}{R|||f|||^2-1}, \qquad (29)$$
respectively.
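The projection mechanism of Theorem III.5 can be sketched numerically. The snippet below is a minimal illustration under stated assumptions, not the paper's exact construction: it builds a $\rho$-contraction from a spectrally rescaled random matrix and a `tanh` nonlinearity, takes $f^*$ to be the transpose of a Gaussian JL matrix $f$, enforces $\rho < 1/|||f|||^2$ as in (24), and compares the state of the original system with that of the JL projected system lifted back to $\mathbb{R}^N$ through $f^*$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, T = 200, 60, 300

# JL map f: R^N -> R^k with i.i.d. N(0, 1/k) entries; f^* is taken as its transpose.
f = rng.standard_normal((k, N)) / np.sqrt(k)
rho = 0.5 / np.linalg.norm(f, 2) ** 2          # enforce rho < 1/|||f|||^2, cf. (24)

# A rho-contraction state map F_rho(x, z) on R^N (tanh is 1-Lipschitz).
W = rng.standard_normal((N, N))
W *= rho / np.linalg.norm(W, 2)
u = rng.standard_normal(N)
F = lambda x, z: np.tanh(W @ x + u * z)

# Projected state map F^f(x, z) = f(F(f^*(x), z)) with states in R^k.
Ff = lambda x, z: f @ F(f.T @ x, z)

# Run both systems on the same input and compare the resulting states.
z = rng.uniform(-1.0, 1.0, T)
x, y = np.zeros(N), np.zeros(k)
for zt in z:
    x, y = F(x, zt), Ff(y, zt)
print(np.linalg.norm(x - f.T @ y))
```

The printed quantity plays the role of the filter difference bounded in (26) and (27); with the chosen $\rho$, the projected map is itself a contraction, in line with part (i).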
C. The Johnson-Lindenstrauss reduced SigSAS system
We now use the previous theorem to spell out the Johnson-Lindenstrauss projected version of SigSAS approximations and to establish error bounds analogous to those introduced in (28) and (29). Given that Theorem III.5 is formulated using the one- and two-norms in Euclidean spaces, and that Proposition II.2 defines the SigSAS system on a tensor space endowed with an unspecified cross-norm, we notice that those two frameworks can be matched by using the norms $\|\cdot\|$ and $\|\cdot\|_1$ in $T^{l+1}(\mathbb{R}^{p+1})$ given by
$$\|v\|^2 := \sum_{i_1,\dots,i_{l+1}=1}^{p+1}\lambda_{i_1,\dots,i_{l+1}}^2, \qquad \|v\|_1 := \sum_{i_1,\dots,i_{l+1}=1}^{p+1}\left|\lambda_{i_1,\dots,i_{l+1}}\right|,$$
with $v = \sum_{i_1,\dots,i_{l+1}=1}^{p+1}\lambda_{i_1,\dots,i_{l+1}}\, e_{i_1}\otimes\cdots\otimes e_{i_{l+1}}$ and $\big\{e_{i_1}\otimes\cdots\otimes e_{i_{l+1}}\big\}_{i_1,\dots,i_{l+1}\in\{1,\dots,p+1\}}$ the canonical basis in $T^{l+1}(\mathbb{R}^{p+1})$. It is easy to check that these two norms are cross-norms and that $\|\cdot\|$ is the norm associated to the inner product defined by the extension by bilinearity of the assignment
$$\big\langle e_{i_1}\otimes\cdots\otimes e_{i_{l+1}},\, e_{j_1}\otimes\cdots\otimes e_{j_{l+1}}\big\rangle := \delta_{i_1 j_1}\cdots\delta_{i_{l+1} j_{l+1}},$$
which makes $\big(T^{l+1}(\mathbb{R}^{p+1}),\langle\cdot,\cdot\rangle\big)$ into a Hilbert space, a feature that is needed in order to use the Johnson-Lindenstrauss Lemma.
Corollary III.6. Let $M>0$ and let $(F^{\mathrm{SigSAS}}_{\lambda,l,p}, W)$ be the SigSAS system that approximates a causal and TI filter $U:K_M\to\ell^\infty(\mathbb{R}^m)$, as introduced in Theorem II.4. Let $N := (p+1)^{l+1}$, $\widetilde{M}$ as in (11), and let $0<\varepsilon<1$. Let $f:\mathbb{R}^N\to\mathbb{R}^k$ be a JL map that satisfies (17), where the dimension $k$ has been chosen to make $f$ generically surjective. Then, for any $R>\max\{1/|||f|||^2,\,1/(\widetilde{M}|||f|||^2),\,1\}$, $\lambda := 1/(R\widetilde{M}|||f|||^2)$, and $L$ as in (12), there exists a JL reduced version $\widetilde{F}^{\mathrm{SigSAS}}_{\lambda,l,p,f} : f^*\big(B_{\|\cdot\|}(0,L_f)\big)\times[-M,M]\to f^*\big(B_{\|\cdot\|}(0,L_f)\big)$ of $F^{\mathrm{SigSAS}}_{\lambda,l,p} : B_{\|\cdot\|}(0,L)\times[-M,M]\to B_{\|\cdot\|}(0,L)$, with $L_f := \widetilde{M}|||f|||/\big(1-\lambda\widetilde{M}|||f|||^2\big)$, that has the ESP and a unique FMP associated filter $\widetilde{U}^{\mathrm{SigSAS}}_{\lambda,l,p,f} : K_M \to \big(f^*\big(B_{\|\cdot\|}(0,L_f)\big)\big)^{\mathbb{Z}}$. Moreover, we have that
$$\left\|W\,U^{\mathrm{SigSAS}}_{\lambda,l,p}(z)_t - \widetilde{W}\,\widetilde{U}^{\mathrm{SigSAS}}_{\lambda,l,p,f}(z)_t\right\| \le |||W|||\,\varepsilon^{1/2}N^{3/4}\left(1+|||f|||^2\right)^{1/2}\frac{\widetilde{M}R^2|||f|||^4}{\left(R|||f|||^2-1\right)^2}, \qquad (30)$$
$$\left\|W\,U^{\mathrm{SigSAS}}_{\lambda,l,p}(z)_t - \widetilde{W}\,\widetilde{U}^{\mathrm{SigSAS}}_{\lambda,l,p,f}(z)_t\right\| \le |||W|||\,\varepsilon N\,\frac{\widetilde{M}R^2|||f|||^4}{\left(R|||f|||^2-1\right)^2}, \qquad (31)$$
for any $z\in K_M$ and $t\in\mathbb{Z}$, and where $\widetilde{W} := W\circ i_k\in M_{m,k}$, with $i_k : f^*f\big(T^{l+1}(\mathbb{R}^{p+1})\big)\to T^{l+1}(\mathbb{R}^{p+1})$ the inclusion.
This result shows that causal and time-invariant filters can be approximated by JL reduced SigSAS systems. The goal of the following paragraphs is to show that such systems are just SAS systems with randomly drawn matrix coefficients and, additionally, to spell out precisely the law of their entries. These facts show that a large class of filters can be learnt just by randomly generating a SAS system and by tuning a linear readout for each individual filter that needs to be approximated. We emphasize that the JL reduced randomly generated SigSAS system is the same for the entire class of FMP filters being approximated and that only the linear readout depends on the individual filter that needs to be learnt, which amounts to the strong universality property that we discussed in the Introduction and in Section II-A. As in Remark II.5, we recall that the quality of the approximation using a JL reduced random SigSAS system may change from filter to filter because of the dependence on the sequence $w^U$ in the bound (15) and the presence of the linear readout $W$ in (30) and (31).
The next statement needs the following fact, known in the literature as Gordon's Theorem (see [67, Theorem 5.32] and references therein): given a random matrix $A\in M_{n,m}$ with standard Gaussian IID entries, we have that
$$\mathbb{E}\left[|||A|||\right] \le \sqrt{n}+\sqrt{m}. \qquad (32)$$
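Gordon's bound (32) is easy to probe by Monte Carlo; the sketch below uses arbitrary illustrative sizes and the fact that `numpy`'s matrix 2-norm is the operator norm:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, trials = 80, 40, 200

# Monte Carlo check of Gordon's bound E[|||A|||] <= sqrt(n) + sqrt(m)
# for an n x m matrix with standard Gaussian i.i.d. entries.
norms = [np.linalg.norm(rng.standard_normal((n, m)), 2) for _ in range(trials)]
print(np.mean(norms), np.sqrt(n) + np.sqrt(m))
```

The empirical mean sits just below the bound $\sqrt{n}+\sqrt{m}$, which is in fact asymptotically sharp for Gaussian matrices.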
Additionally, the element $\widehat{z}_0\in T^{l+1}(\mathbb{R}^{p+1})$ introduced in (8) for the construction of the SigSAS system will be chosen in a specific randomized way in this case. Indeed, this time around, we replace (8) by
$$\widehat{z}_0 = r\sum_{i\in I_0} z^{i-1}\, e_1\otimes\cdots\otimes e_1\otimes e_i, \qquad (33)$$
where $r$ is a Rademacher random variable chosen independently of all the other random variables that will appear in the different constructions. If we take in $T^{l+1}(\mathbb{R}^{p+1})$ the canonical basis in lexicographic order, the element $\widehat{z}_0$ can be written as the image of a linear map as
$$\widehat{z}_0 = r\,C_{I_0}(1,z,\dots,z^p)^\top, \quad\text{with}\quad C_{I_0} := \begin{pmatrix} S_c \\ \mathbb{O}_{(p+1)((p+1)^l-1),\,p+1}\end{pmatrix} \in M_{(p+1)^{l+1},\,p+1}, \qquad (34)$$
and $S_c\in M_{p+1}$ a diagonal selection matrix with entries $(S_c)_{ii}=1$ if $i\in I_0$ and $(S_c)_{ii}=0$ otherwise.
Theorem III.7. Let $M>0$, let $\widetilde{M}$ be as in (11), $l,p,k\in\mathbb{N}$, and define $N := (p+1)^{l+1}$, $N_0 := (p+1)^l$. Consider a SigSAS state map $F^{\mathrm{SigSAS}}_{\lambda,l,p} : T^{l+1}(\mathbb{R}^{p+1})\times[-M,M]\to T^{l+1}(\mathbb{R}^{p+1})$ of the type introduced in (10) and defined by choosing the non-homogeneous term $\widehat{z}_0$ as in (33). Let now $f:\mathbb{R}^N\to\mathbb{R}^k$ be a JL projection randomly drawn according to (18). Let $\delta>0$ be small enough so that
$$\lambda_0 := \frac{\delta}{2\widetilde{M}}\sqrt{\frac{k}{N_0}} < \min\left\{\frac{1}{\widetilde{M}},\, \frac{1}{\widetilde{M}|||f|||^2},\, 1\right\}. \qquad (35)$$
Then the JL reduced version $\widetilde{F}^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}$ of $F^{\mathrm{SigSAS}}_{\lambda_0,l,p}$ has the ESP and the FMP with probability at least $1-\delta$ and, in the limit $N_0\to\infty$, it is isomorphic to the family of randomly generated SAS systems $F^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}$ with states in $\mathbb{R}^k$ and given by
$$F^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}(x,z) := \left(\sum_{i=1}^{p+1} z^{i-1}A_i\right)x + B\,(1,z,\dots,z^p)^\top, \qquad (36)$$
where $A_1,\dots,A_{p+1}\in M_k$ and $B\in M_{k,p+1}$ are random matrices whose entries are drawn according to
$$(A_1)_{j,m},\dots,(A_{p+1})_{j,m} \sim \mathcal{N}\!\left(0,\frac{\delta^2}{4k\widetilde{M}^2}\right), \qquad (37)$$
$$B_{j,m} \sim \begin{cases} \mathcal{N}(0,1/k) & \text{if } m\in I_0,\\ 0 & \text{otherwise.}\end{cases} \qquad (38)$$
All the entries in the matrices $A_1,\dots,A_{p+1}$ are independent random variables. The entries in the matrix $B$ are independent from each other, and they are decorrelated and asymptotically independent (in the limit $N_0\to\infty$) from those in $A_1,\dots,A_{p+1}$.
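A randomly generated SAS system of the form (36), with coefficient laws (37) and (38), can be sampled and run as follows. This is a sketch: the values of $M$, $p$, $l$, $k$, $\delta$, and $I_0$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
M, p, l, k, delta = 0.5, 2, 2, 20, 0.1
I0 = {1, 2}                                   # illustrative index set, as in the SigSAS prescription

M_tilde = sum(M ** i for i in range(p + 1))   # 1 + M + ... + M^p, cf. (11)

# Random SAS coefficients, drawn according to (37) and (38).
sigma_A = delta / (2 * np.sqrt(k) * M_tilde)
A = rng.normal(0.0, sigma_A, size=(p + 1, k, k))
B = np.zeros((k, p + 1))
for m in range(1, p + 2):                     # 1-based column index m
    if m in I0:
        B[:, m - 1] = rng.normal(0.0, 1.0 / np.sqrt(k), size=k)

def sas_step(x, z):
    """One step of the SAS recursion (36): x' = (sum_i z^{i-1} A_i) x + B (1, z, ..., z^p)^T."""
    powers = z ** np.arange(p + 1)
    return np.tensordot(powers, A, axes=1) @ x + B @ powers

# Drive the randomly generated SAS with a bounded input sequence.
x = np.zeros(k)
for z in rng.uniform(-M, M, 500):
    x = sas_step(x, z)
print(np.linalg.norm(x))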
We conclude with a result that combines the SigSAS Approximation Theorem II.4 with its JL reduction in Corollary III.6 and with its characterization as a SAS system with random coefficients in Theorem III.7. This statement shows that, in order to approximate a large class of sufficiently regular FMP filters with uniformly bounded inputs, it suffices to randomly generate a common SAS system for all of them and to tune a linear readout for each different filter in that class that needs to be approximated.
Theorem III.8. Let $M,L>0$ and let $U:K_M(\mathbb{R})\to K_L(\mathbb{R}^m)$ be a causal and time-invariant fading memory filter that satisfies the hypotheses in Theorem II.4. Fix now $l,p,k\in\mathbb{N}$ and $\delta>0$ small enough so that (35) holds. Construct now the SAS system with states in $\mathbb{R}^k$ given by
$$F^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}(x,z) = \left(\sum_{i=1}^{p+1} z^{i-1}A_i\right)x + B\,(1,z,\dots,z^p)^\top, \qquad (39)$$
with matrix coefficients randomly generated according to the laws spelled out in (37) and (38).

If $p$ and $l$ are large enough, then the SAS system $F^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}$ has the ESP and the FMP with probability at least $1-\delta$. In that case $F^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}$ has an associated filter $U^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}$, and there exist a monotonically decreasing sequence $w^U$ with zero limit and a linear map $W\in L(\mathbb{R}^k,\mathbb{R}^m)$ such that for any $z\in\widetilde{B}_M$ it holds that
$$\left\|U(z)_t - W\,U^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}(z)_t\right\| \le w^U_l + L\,\frac{\left(M^{-1}\|z\|_\infty\right)^{p+1}}{1-M^{-1}\|z\|_\infty} + I_{l,p}, \qquad (40)$$
where $I_{l,p}$ is either
$$I_{l,p} := |||\overline{W}|||\,\varepsilon^{1/2} N^{3/4}\,\widetilde{M}\left(1+|||f|||^2\right)^{1/2}\frac{1}{\left(1-\frac{\delta}{2}\sqrt{\frac{k}{N_0}}\right)^{2}} \quad\text{or}\quad I_{l,p} := |||\overline{W}|||\,\varepsilon N\,\widetilde{M}\,\frac{1}{\left(1-\frac{\delta}{2}\sqrt{\frac{k}{N_0}}\right)^{2}}. \qquad (41)$$
In these expressions $\overline{W}\in L(T^{l+1}(\mathbb{R}^{p+1}),\mathbb{R}^m)$ is a linear map such that $W=\overline{W}\circ f^*$, $N=(p+1)^{l+1}$, $\widetilde{M}$ is defined in (11), and $0<\varepsilon<1$ satisfies (16) with $n$ replaced by $N$.
IV. NUMERICAL ILLUSTRATION
In order to illustrate the main contributions of the paper, we consider an IO system given by the so-called generalized autoregressive conditional heteroskedastic (GARCH) model [68], [69]. GARCH is a popular discrete-time process in time series analysis that is used in the econometrics literature and by practitioners to model and forecast the dynamics of conditional volatilities in financial time series. More specifically, the GARCH(1,1) model is given by
$$y_t = \sigma_t z_t, \quad z_t \sim \mathcal{N}(0,1), \qquad \sigma_t^2 = \omega + \alpha y_{t-1}^2 + \beta\sigma_{t-1}^2, \quad t\in\mathbb{Z}, \qquad (42)$$
where $\omega>0$, $\alpha,\beta\ge 0$, $\alpha+\beta<1$ (see [70] for a careful discussion of the properties of GARCH processes). The IO system is driven by the input innovations $\{z_t\}_{t\in\mathbb{Z}}$, and the observations $\{y_t\}_{t\in\mathbb{Z}}$ represent its output. In the experiment we use $\omega=0.0001$, $\alpha=0.1$, $\beta=0.87$ and, in order to learn the corresponding IO system, we construct: (i) a SigSAS system as in Proposition II.2; (ii) a JL reduced SigSAS system as in Corollary III.6; (iii) a randomly generated SAS as in Theorem III.7. For all the systems, the corresponding readout maps are obtained by linear regression. Figure 1 illustrates the result in Theorem II.4 and shows that the SigSAS approximation error decreases with $N$. Figure 2 shows that the approximation errors committed by both the JL reduced SigSAS and its randomly generated analogue decrease as the JL dimension $k$ increases. We emphasize that the mean errors are computed using 160 randomly drawn instances of these two reduced SigSAS systems, and we note that the errors reported in this figure for the two systems are visually indistinguishable. We also recall that, even though the result of Theorem III.7 is proved to hold in the limit $N_0=(p+1)^l\to\infty$, it is clear from this particular example that even for moderately small $N_0$ ($p=8$ and $l=3$) randomly generated small-dimensional SigSAS systems can excel in learning a given IO system.
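The experiment can be reproduced in spirit with a few lines of code. The sketch below simulates a GARCH(1,1) path with the parameters above, drives a small randomly generated SAS reservoir with the innovations, and trains only the linear readout by least squares. The reservoir coefficient scales are illustrative rather than the exact laws (37) and (38), and the Gaussian innovations are unbounded, unlike the bounded-input setting of the theory.

```python
import numpy as np

rng = np.random.default_rng(4)
T, burn = 3000, 100
omega, alpha, beta = 1e-4, 0.1, 0.87          # GARCH(1,1) parameters from the experiment

# Simulate the GARCH(1,1) input/output pair: input z_t, output y_t, cf. (42).
z = rng.standard_normal(T)
y, sig2 = np.zeros(T), np.zeros(T)
sig2[0] = omega / (1 - alpha - beta)          # stationary variance as initial condition
y[0] = np.sqrt(sig2[0]) * z[0]
for t in range(1, T):
    sig2[t] = omega + alpha * y[t - 1] ** 2 + beta * sig2[t - 1]
    y[t] = np.sqrt(sig2[t]) * z[t]

# A small randomly generated SAS reservoir driven by the innovations z_t
# (illustrative coefficient scales, not the exact laws (37)-(38)).
p, k = 2, 50
A = rng.normal(0.0, 0.02 / np.sqrt(k), size=(p + 1, k, k))
B = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, p + 1))
X = np.zeros((T, k))
x = np.zeros(k)
for t in range(T):
    powers = z[t] ** np.arange(p + 1)
    x = np.tensordot(powers, A, axes=1) @ x + B @ powers
    X[t] = x

# Train only the linear readout W by least squares: W x_t ~ y_t.
W, *_ = np.linalg.lstsq(X[burn:], y[burn:], rcond=None)
mse = np.mean((X[burn:] @ W - y[burn:]) ** 2)
print(mse)
```

Only the readout `W` is fitted; the reservoir coefficients are drawn once and never trained, which is the strong universality mechanism at work.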
The implications of the strong universality features of the randomly generated SAS systems are far-reaching in terms of their empirical performance since, as we already emphasized several times, it is only the linear readout that is tuned for each individual IO system of interest. In particular, this opens the door to multi-task learning (where different components of the readout are trained for different tasks in parallel) and to new hardware implementations of these randomized SAS systems.
Fig. 1. Box plots for the training mean squared errors (all MSE values are multiplied by 1e+4 for convenience) committed by SigSAS systems in the modeling of GARCH realizations for increasing $N$, where each $N=(p+1)^{l+1}$ is computed using pairs $(p,l)$, $p\in\{1,\dots,8\}$, $l\in\{1,2,3\}$, in lexicographic order. The distribution of errors is constructed using 200 GARCH paths of length 10000 and $I_0=\{1,2\}$ in the SigSAS prescription. The seemingly slow decay of the MSE values with $N$ is due to linear regression problems that are ill-conditioned for large $N$ and that would require adequate regularization.
Fig. 2. Box plots for the distributions of training mean squared errors (all MSE values are multiplied by 1e+4 for convenience) committed by 160 instances of randomly JL reduced SigSAS systems and randomly generated SAS systems according to Theorem III.7. The MSEs are computed with respect to one given GARCH path of length 7000 for different values of $k$. For each $k$, the box plots corresponding to the two systems are plotted next to each other to ease comparison (JL SigSAS in blue and random SAS in magenta). The subplot in the upper right corner shows a comparison of a part of this GARCH path for $t=1,\dots,100$ and its approximations using a JL SigSAS and a randomly generated SAS system with $k=10$.
V. CONCLUSION
Reservoir computing capitalizes on the remarkable fact that
there are learning systems that attain universal approximation
properties without requiring that all their parameters are esti-
mated using a supervised learning procedure. These untrained
parameters are most of the time randomly generated and it
is only an output layer that needs to be estimated using a
simple functional prescription. This phenomenon has been
explained for static (extreme learning machines [30]) and
dynamic (echo state networks [34], [35]) neural paradigms and
its performance has been quantified using mostly probabilistic
methods.
In this paper, we have concentrated on a different class of
reservoir computing systems, namely the state-affine (SAS)
family. The SAS class was introduced and proved universal in
[36] and we have shown here that the possibility of randomly
constructing these systems and at the same time preserving
their approximation properties is of geometric nature. The
rationale behind our description relies on the following points:
- Any analytic filter can be represented as a Volterra series expansion. When this filter is additionally of fading memory type, the truncation error can be easily quantified.
- Truncated Volterra series admit a natural state-space representation with a linear observation equation in a conveniently chosen tensor space. The state equation of this representation has a strong universality property: its unique solution can be used to approximate any analytic fading memory filter just by modifying the linear observation equation. We refer to this strongly universal filter as the SigSAS system.
- The random projections of the SigSAS system yield SAS systems with randomly generated coefficients in a potentially much smaller dimension which approximately preserve the good properties of the original SigSAS system. The loss in performance that one incurs because of the projection mechanism can be quantified using the Johnson-Lindenstrauss Lemma.
These observations, together with the numerical experiment,
collectively show that SAS reservoir systems with randomly
chosen coefficients exhibit excellent empirical performances in
the learning of fading memory input/output systems because
they approximately correspond to very high-degree Volterra
series expansions of those systems.
APPENDIX
A. Proof of Theorem II.1
The representation (3) is a straightforward multivariate generalization of Theorem 29 in [56]. For any $z\in\widetilde{B}_M$ and any $p,l\in\mathbb{N}$ define
$$U_{l,p}(z)_t := \sum_{j=1}^{p}\,\sum_{m_1=-l}^{0}\cdots\sum_{m_j=-l}^{0} g_j(m_1,\dots,m_j)\left(z_{m_1+t}\otimes\cdots\otimes z_{m_j+t}\right).$$
Now, for any $z\in\widetilde{B}_M$ and $t_1,t_2\in\mathbb{Z}$ such that $t_2\le t_1$, define the sequence $z^{t_1}_{t_2}\in\widetilde{B}_M$ by $z^{t_1}_{t_2} := (\dots,0,z_{t_2},\dots,z_{t_1})$. Additionally, for any $u\in(\mathbb{R}^d)^{\mathbb{Z}_-}$ and any $z\in(\mathbb{R}^d)^{\mathbb{N}^+}$, the symbol $u\,z^1_t\in(\mathbb{R}^d)^{\mathbb{Z}_-}$, $t\in\mathbb{N}^+$, denotes the concatenation of the left-shifted vector $u$ with the truncated vector $z^1_t := (z_1,\dots,z_t)$ obtained out of $z$. With this notation, we now show (5). By the triangle inequality and the time-invariance of $U$, for any $z\in\widetilde{B}_M$ we have
$$\left\|U(z)_t - U_{l,p}(z)_t\right\| \le \left\|U(z)_t - U_{l,\infty}(z)_t\right\| + \left\|U_{l,\infty}(z)_t - U_{l,p}(z)_t\right\|$$
$$= \Bigg\|\sum_{j=1}^{\infty}\,\sum_{m_1=-\infty}^{-l-1}\cdots\sum_{m_j=-\infty}^{-l-1} g_j(m_1,\dots,m_j)\left(z_{m_1+t}\otimes\cdots\otimes z_{m_j+t}\right)\Bigg\| + \left\|U\big(z^t_{-l+t}\big)_0 - U_{\infty,p}\big(z^t_{-l+t}\big)_0\right\|$$
$$= \left\|U\big(z^{-l-1+t}_{-\infty}\,0_{l+1}\big)_0\right\| + \left\|U\big(z^t_{-l+t}\big)_0 - U_{\infty,p}\big(z^t_{-l+t}\big)_0\right\|, \qquad (43)$$
where the symbol $0_{l+1}$ stands for an $(l+1)$-tuple of the element $0\in\mathbb{R}^d$. The second summand of this expression can be bounded using the Taylor bound provided in [56, Theorem 29]. As to the first summand, we shall use the input forgetting property that the filter $U$ exhibits since, by hypothesis, it has the FMP. More specifically, if we apply Theorem 6 in [56] to the FMP filter $U:K_M\to\ell^\infty(\mathbb{R}^m)$, we can conclude the existence of a monotonically decreasing sequence $w^U$ with zero limit such that for any $l\in\mathbb{N}$
$$\left\|U\big(z^{-l-1+t}_{-\infty}\,0_{l+1}\big)_0\right\| = \left\|U\big(z^{-l-1+t}_{-\infty}\,0_{l+1}\big)_0 - U(\mathbf{0})_0\right\| \le w^U_l.$$
These two arguments substituted in (43) yield the bound in (5).
B. Proof of Proposition II.2
The map $F^{\mathrm{SigSAS}}_{\lambda,l,p} : T^{l+1}(\mathbb{R}^{p+1})\times[-M,M]\to T^{l+1}(\mathbb{R}^{p+1})$ is clearly continuous and, additionally, it is a contraction in its first component. Indeed, let $x_1,x_2\in T^{l+1}(\mathbb{R}^{p+1})$ and let $z\in[-M,M]$ be arbitrary. Notice first that
$$\|\widetilde{z}\| = \Bigg\|\sum_{i=1}^{p+1} z^{i-1}e_i\Bigg\| \le 1+M+\cdots+M^p =: \widetilde{M}. \qquad (44)$$
It is easy to see that $\widetilde{M} = \frac{1-M^{p+1}}{1-M}$ for $M\ne 1$ and $\widetilde{M}=(p+1)M$ otherwise. Now, since we are using a cross-norm in $T^{l+1}(\mathbb{R}^{p+1})$, we have that
$$\left\|F^{\mathrm{SigSAS}}_{\lambda,l,p}(x_1,z) - F^{\mathrm{SigSAS}}_{\lambda,l,p}(x_2,z)\right\| = \lambda\left\|\pi_l(x_1-x_2)\otimes\widetilde{z}\right\| = \lambda\left\|\pi_l(x_1-x_2)\right\|\left\|\widetilde{z}\right\|.$$
If we use in this equality the relation (44) and the fact that $|||\pi_l|||=1$, we can conclude that
$$\left\|F^{\mathrm{SigSAS}}_{\lambda,l,p}(x_1,z) - F^{\mathrm{SigSAS}}_{\lambda,l,p}(x_2,z)\right\| \le \lambda\widetilde{M}\,\|x_1-x_2\|. \qquad (45)$$
The hypothesis $\lambda<1/\widetilde{M}$ implies that $F^{\mathrm{SigSAS}}_{\lambda,l,p}$ is a contraction and establishes (11). Additionally, $\|\widehat{z}_0\| = \big\|\sum_{i\in I_0}z^{i-1}\,e_1\otimes\cdots\otimes e_1\otimes e_i\big\| \le 1+M+\cdots+M^p = \widetilde{M}$, which implies that
$$\left\|F^{\mathrm{SigSAS}}_{\lambda,l,p}(0,z)\right\| \le \widetilde{M}, \quad\text{for all } z\in[-M,M], \qquad (46)$$
and hence, by [71, Remark 2], we can conclude that $F^{\mathrm{SigSAS}}_{\lambda,l,p}$ restricts to a map $F^{\mathrm{SigSAS}}_{\lambda,l,p} : B_{\|\cdot\|}(0,L)\times[-M,M]\to B_{\|\cdot\|}(0,L)$, for any $L\ge\widetilde{M}/(1-\lambda\widetilde{M})$. Finally, the contractivity condition established in (45) and [56, Theorem 12] imply that the corresponding state system has the ESP and the FMP. We now show that its unique solution is given by (13). First, it is easy to see that by iterating the recursion (9) twice and three times, one obtains
$$x_t = \lambda^2\,\pi_l\big(\pi_l(x_{t-2})\otimes\widetilde{z}_{t-1}\big)\otimes\widetilde{z}_t + \lambda\,\pi_l\big(\widehat{z}^{\,0}_{t-1}\big)\otimes\widetilde{z}_t + \widehat{z}^{\,0}_t$$
$$= \lambda^3\,\pi_l\big(\pi_l\big(\pi_l(x_{t-3})\otimes\widetilde{z}_{t-2}\big)\otimes\widetilde{z}_{t-1}\big)\otimes\widetilde{z}_t + \lambda^2\,\pi_l\big(\pi_l\big(\widehat{z}^{\,0}_{t-2}\big)\otimes\widetilde{z}_{t-1}\big)\otimes\widetilde{z}_t + \lambda\,\pi_l\big(\widehat{z}^{\,0}_{t-1}\big)\otimes\widetilde{z}_t + \widehat{z}^{\,0}_t.$$
More generally, after $l+1$ iterations one obtains
$$x_t = \lambda^{l+1}\underbrace{\pi_l(\pi_l(\cdots(\pi_l(}_{(l+1)\text{-times}}x_{t-(l+1)})\otimes\widetilde{z}_{t-l})\otimes\cdots)\otimes\widetilde{z}_{t-1})\otimes\widetilde{z}_t + \lambda^{l}\underbrace{\pi_l(\pi_l(\cdots(\pi_l(}_{l\text{-times}}\widehat{z}^{\,0}_{t-l})\otimes\widetilde{z}_{t-(l-1)})\otimes\cdots)\otimes\widetilde{z}_{t-1})\otimes\widetilde{z}_t + \cdots + \lambda\,\pi_l\big(\widehat{z}^{\,0}_{t-1}\big)\otimes\widetilde{z}_t + \widehat{z}^{\,0}_t.$$
Consequently, in order to establish (13) it suffices to show that
$$\lambda^{l+1}\underbrace{\pi_l(\pi_l(\cdots(\pi_l(}_{(l+1)\text{-times}}x_{t-(l+1)})\otimes\widetilde{z}_{t-l})\otimes\cdots)\otimes\widetilde{z}_{t-1})\otimes\widetilde{z}_t = \frac{\lambda^{l+1}}{1-\lambda}\,\widehat{z}_t. \qquad (47)$$
We show this equality by writing
$$x_t = \sum_{i_1,\dots,i_{l+1}=1}^{p+1} a^{t}_{i_1,\dots,i_{l+1}}\, e_{i_1}\otimes\cdots\otimes e_{i_{l+1}}, \qquad (48)$$
for some coefficients $a^{t}_{i_1,\dots,i_{l+1}}\in\mathbb{R}$ that, by (9) and the assumption that $1\in I_0$, satisfy
$$a^{t}_{1,\dots,1} = \lambda\,a^{t-1}_{1,\dots,1} + 1, \quad\text{for any } t\in\mathbb{Z}.$$
This recursion can be rewritten for any $r\in\mathbb{N}$ and $t\in\mathbb{Z}$ as
$$a^{t}_{1,\dots,1} = \sum_{j=0}^{r-1}\lambda^{j} + \lambda^{r}a^{t-r}_{1,\dots,1}.$$
Since by hypothesis the parameter $\lambda<1$ and, additionally, we just showed that $\|x_t\|\le L$ for all $t\in\mathbb{Z}$, with $L\ge\widetilde{M}/(1-\lambda\widetilde{M})$, this equation has a unique solution given by
$$a^{t}_{1,\dots,1} = \frac{1}{1-\lambda}, \quad\text{for all } t\in\mathbb{Z}. \qquad (49)$$
Now, notice that using (48), we can write
$$\pi_l\big(x_{t-(l+1)}\big)\otimes\widetilde{z}_{t-l} = \sum_{i_2,\dots,i_{l+1},j_1=1}^{p+1} a^{t-(l+1)}_{1,i_2,\dots,i_{l+1}}\, z^{j_1-1}_{t-l}\, e_{i_2}\otimes\cdots\otimes e_{i_{l+1}}\otimes e_{j_1}.$$
If we repeat this procedure $l+1$ times, we obtain that
$$\underbrace{\pi_l(\pi_l(\cdots(\pi_l(}_{(l+1)\text{-times}}x_{t-(l+1)})\otimes\widetilde{z}_{t-l})\otimes\cdots)\otimes\widetilde{z}_{t-1})\otimes\widetilde{z}_t = \sum_{j_1,\dots,j_{l+1}=1}^{p+1} a^{t-(l+1)}_{1,\dots,1}\, z^{j_1-1}_{t-l}\,z^{j_2-1}_{t-(l-1)}\cdots z^{j_{l+1}-1}_{t}\, e_{j_1}\otimes\cdots\otimes e_{j_{l+1}} = a^{t-(l+1)}_{1,\dots,1}\,\widehat{z}_t = \frac{1}{1-\lambda}\,\widehat{z}_t,$$
where the last equality is a consequence of (49). This identity proves (47).
C. Proof of Theorem II.4
It is a straightforward corollary of Theorem II.1 and of the expression (13) of the filter $U^{\mathrm{SigSAS}}_{\lambda,l,p}$. The linear map $W$ is constructed by matching the coefficients $g_j(m_1,\dots,m_j)$ of the truncated Volterra series representation of $U$ up to polynomial degree $p$ with the terms of the filter $U^{\mathrm{SigSAS}}_{\lambda,l,p}(z)_t$ in the canonical basis of $T^{l+1}(\mathbb{R}^{p+1})$. More specifically, $W\in L(T^{l+1}(\mathbb{R}^{p+1}),\mathbb{R}^m)$ is the linear map that satisfies
$$W\,U^{\mathrm{SigSAS}}_{\lambda,l,p}(z)_t = \sum_{j=1}^{p}\,\sum_{m_1=-l}^{0}\cdots\sum_{m_j=-l}^{0} g_j(m_1,\dots,m_j)\, z_{m_1+t}\cdots z_{m_j+t}, \qquad (50)$$
for any $z\in K_M$ and $t\in\mathbb{Z}$, where the right-hand side of this equality is the truncated Volterra series expansion of $U$, available by Theorem II.1. The equality (50) does determine $W$ because, by (14), it is equivalent to
$$\sum_{i_1,\dots,i_{l+1}=1}^{p+1} W A_{\lambda,l,p}\left(e_{i_1}\otimes\cdots\otimes e_{i_{l+1}}\right) z^{i_1-1}_{t-l}\cdots z^{i_{l+1}-1}_{t} = \sum_{j=1}^{p}\,\sum_{m_1=-l}^{0}\cdots\sum_{m_j=-l}^{0} g_j(m_1,\dots,m_j)\, z_{m_1+t}\cdots z_{m_j+t}.$$
Since this equality between polynomials has to hold for any $z\in K_M$ and $t\in\mathbb{Z}$, we can conclude that the coefficients on both sides have to coincide. This implies that, in particular, for any $i_1,\dots,i_{l+1}\in\{1,\dots,p+1\}$,
$$W A_{\lambda,l,p}\left(e_{i_1}\otimes\cdots\otimes e_{i_{l+1}}\right) = \sum_{I_{i_1,\dots,i_{l+1}}} g_j(m_1,\dots,m_j) \in \mathbb{R}^m, \qquad (51)$$
where $I_{i_1,\dots,i_{l+1}} = \{(j,m_1,\dots,m_j)\}$ is the set of indices with $j\in\{1,\dots,p\}$, $m_i\in\{-l,\dots,0\}$, and $z_{m_1}\cdots z_{m_j} = z^{i_1-1}_{-l}\cdots z^{i_{l+1}-1}_{0}$. As (51) specifies the image of a basis by the map $W A_{\lambda,l,p}$ and $A_{\lambda,l,p}$ is invertible, (51), and consequently (50), fully determine $W$. The bound in (15) is then a consequence of (50) and (5) in Theorem II.1.
D. Proof of Lemma III.1
(i) It is obvious that if $v=0$ then $\|v\|_Q=0$ and that $\|\lambda v\|_Q = |\lambda|\,\|v\|_Q$ for all $\lambda\in\mathbb{R}$ and $v\in\operatorname{span}\{Q\}$. Let now $w_1,w_2\in\operatorname{span}\{Q\}$ and $C_q := \operatorname{Card}\,Q$. Given that
$$\inf\Bigg\{\sum_{j=1}^{C_q}\big|\lambda^1_j+\lambda^2_j\big| \;\Bigg|\; \sum_{j=1}^{C_q}\lambda^1_j v_j = w_1,\ \sum_{j=1}^{C_q}\lambda^2_j v_j = w_2,\ v_j\in Q\Bigg\} \ge \inf\Bigg\{\sum_{j=1}^{C_q}|\lambda_j| \;\Bigg|\; \sum_{j=1}^{C_q}\lambda_j v_j = w_1+w_2,\ v_j\in Q\Bigg\},$$
we can conclude that
$$\|w_1+w_2\|_Q \le \inf\Bigg\{\sum_{j=1}^{C_q}\big|\lambda^1_j+\lambda^2_j\big| \;\Bigg|\; \sum_{j=1}^{C_q}\lambda^1_j v_j = w_1,\ \sum_{j=1}^{C_q}\lambda^2_j v_j = w_2,\ v_j\in Q\Bigg\}$$
$$\le \inf\Bigg\{\sum_{j=1}^{C_q}\big(|\lambda^1_j|+|\lambda^2_j|\big) \;\Bigg|\; \sum_{j=1}^{C_q}\lambda^1_j v_j = w_1,\ \sum_{j=1}^{C_q}\lambda^2_j v_j = w_2,\ v_j\in Q\Bigg\} = \|w_1\|_Q+\|w_2\|_Q,$$
which establishes the triangle inequality and hence shows that $\|\cdot\|_Q$ is a seminorm. Suppose now that $M_Q<\infty$ and let $v\in\operatorname{span}\{Q\}$ be such that $\|v\|_Q=0$. By the approximation property of the infimum, for any $\epsilon>0$ there exist $\lambda_1,\dots,\lambda_{C_q}\in\mathbb{R}$ such that $\sum_{j=1}^{C_q}\lambda_j v_j = v$ and $0\le\sum_{j=1}^{C_q}|\lambda_j|<\epsilon$. This inequality implies that
$$\|v\| = \Bigg\|\sum_{j=1}^{C_q}\lambda_j v_j\Bigg\| \le M_Q\sum_{j=1}^{C_q}|\lambda_j| < M_Q\,\epsilon. \qquad (52)$$
Since $M_Q$ is finite and $\epsilon>0$ can be made arbitrarily small, this inequality implies that $\|v\|=0$ and hence, necessarily, $v=0$, which proves that $\|\cdot\|_Q$ is a norm in this case.

Since the first inequality in (52) holds for any $v\in\operatorname{span}\{Q\}$, the statement in part (ii) follows (when $M_Q$ is not finite we use the convention that $\infty\cdot 0=\infty$). Part (iii) is obvious.
E. Proof of Proposition III.3
Since $V$ and $\mathbb{R}^k$ are Hilbert spaces, the parallelogram law holds for the associated norms and hence, for any $v_1,v_2\in Q$,
$$\langle v_1, v_2 - f^*f(v_2)\rangle = \langle v_1,v_2\rangle - \langle f(v_1),f(v_2)\rangle$$
$$= \tfrac{1}{4}\big(\|v_1+v_2\|^2 - \|v_1-v_2\|^2\big) - \tfrac{1}{4}\big(\|f(v_1)+f(v_2)\|^2 - \|f(v_1)-f(v_2)\|^2\big)$$
$$= \tfrac{1}{4}\big(\|v_1-(-v_2)\|^2 - \|v_1-v_2\|^2\big) - \tfrac{1}{4}\big(\|f(v_1)-f(-v_2)\|^2 - \|f(v_1)-f(v_2)\|^2\big)$$
$$\le \tfrac{\varepsilon}{4}\big(\|v_1-v_2\|^2 + \|v_1+v_2\|^2\big) = \tfrac{\varepsilon}{2}\big(\|v_1\|^2+\|v_2\|^2\big), \qquad (53)$$
where in the inequality in the last line we used the JL property (17) together with the hypothesis $Q=-Q$. Let now $w_1 = \sum_{i=1}^{\operatorname{Card}Q}\lambda^1_i v_i,\ w_2 = \sum_{i=1}^{\operatorname{Card}Q}\lambda^2_i v_i \in \operatorname{span}\{Q\}$. Then, by (53),
$$\big|\langle w_1, w_2 - f^*f(w_2)\rangle\big| = \Bigg|\sum_{i,j=1}^{\operatorname{Card}Q}\lambda^1_i\lambda^2_j\big\langle v_i, v_j - f^*f(v_j)\big\rangle\Bigg| \le \sum_{i,j=1}^{\operatorname{Card}Q}|\lambda^1_i|\,|\lambda^2_j|\,\tfrac{\varepsilon}{2}\big(\|v_i\|^2+\|v_j\|^2\big) \le \varepsilon\,M_Q^2\sum_{i=1}^{\operatorname{Card}Q}|\lambda^1_i|\sum_{j=1}^{\operatorname{Card}Q}|\lambda^2_j|.$$
Since this inequality holds true for any linear decomposition of $w_1,w_2\in\operatorname{span}\{Q\}$, we can take infima on its right-hand side with respect to those decompositions, which clearly implies (20).
F. Proof of Theorem III.5
(i) We show that when condition (24) holds, $F^f_\rho$ is a contraction on the first entry. Let $x_1,x_2\in\mathbb{R}^k$ and let $z\in D_d$; then
$$\left\|F^f_\rho(x_1,z) - F^f_\rho(x_2,z)\right\| = \left\|f\big(F_\rho(f^*(x_1),z)\big) - f\big(F_\rho(f^*(x_2),z)\big)\right\| \le \rho\,|||f|||\,|||f^*|||\,\|x_1-x_2\|.$$
The claim follows from this inequality, the equality $|||f|||=|||f^*|||$, and condition (24).

(ii) The proof is straightforward. The only point that needs to be emphasized is that $(f^*)^{-1}:V_k\to\mathbb{R}^k$ is well defined because, since $f$ is surjective, $f^*:\mathbb{R}^k\to V_k$ is necessarily injective.

(iii) First of all, the existence of the restricted versions of $F_\rho$ and $F^f_\rho$ on compact state spaces, together with the fact that these maps are contractions on the first entry with contraction rates $\rho$ and $\rho|||f|||^2$, respectively, implies by [56, Theorem 7, part (i)] that they have the ESP and associated FMP filters $U_\rho$ and $U^f_\rho$. The statement about the JL-projected state map $\widetilde{F}^f_\rho$ and its associated filter $\widetilde{U}^f_\rho$ is a straightforward consequence of the fact that the restricted linear map $f^*:\mathbb{R}^k\to V_k$ is a state-map equivariant linear isomorphism between $F^f_\rho$ and $\widetilde{F}^f_\rho$, and of the properties of this kind of maps (see, for instance, [72, Proposition 2.3]).

(iv) Let $z\in(D_d)^{\mathbb{Z}}$ and $t\in\mathbb{Z}$ be arbitrary. Then, using (25), we have
$$\left\|U_\rho(z)_t-\widetilde{U}^f_\rho(z)_t\right\| = \left\|F_\rho\big(U_\rho(z)_{t-1},z_t\big) - \widetilde{F}^f_\rho\big(\widetilde{U}^f_\rho(z)_{t-1},z_t\big)\right\|$$
$$= \left\|F_\rho\big(U_\rho(z)_{t-1},z_t\big) - F_\rho\big(\widetilde{U}^f_\rho(z)_{t-1},z_t\big) + F_\rho\big(\widetilde{U}^f_\rho(z)_{t-1},z_t\big) - \widetilde{F}^f_\rho\big(\widetilde{U}^f_\rho(z)_{t-1},z_t\big)\right\|$$
$$\le \rho\left\|U_\rho(z)_{t-1}-\widetilde{U}^f_\rho(z)_{t-1}\right\| + \left\|\big(I_N-f^*f\big)\big(F_\rho(\widetilde{U}^f_\rho(z)_{t-1},z_t)\big)\right\|. \qquad (54)$$
The bounds in (26) and (27) are obtained by bounding the last expression in (54) in two different fashions. First, if we use (20) and the hypothesis that $Q$ is a spanning set of $\mathbb{R}^N$, we have that
$$\rho\left\|U_\rho(z)_{t-1}-\widetilde{U}^f_\rho(z)_{t-1}\right\| + \left\|\big(I_N-f^*f\big)\big(F_\rho(\widetilde{U}^f_\rho(z)_{t-1},z_t)\big)\right\|$$
$$\le \rho\left\|U_\rho(z)_{t-1}-\widetilde{U}^f_\rho(z)_{t-1}\right\| + \varepsilon^{1/2}M_Q\left\|\big(I_N-f^*f\big)\big(F_\rho(\widetilde{U}^f_\rho(z)_{t-1},z_t)\big)\right\|_Q^{1/2}\left\|F_\rho\big(\widetilde{U}^f_\rho(z)_{t-1},z_t\big)\right\|_Q^{1/2}$$
$$\le \rho\left\|U_\rho(z)_{t-1}-\widetilde{U}^f_\rho(z)_{t-1}\right\| + \varepsilon^{1/2}CM_QC_Q\left(1+|||f|||^2\right)^{1/2}. \qquad (55)$$
If we now iterate the procedure in (54) on the first summand of this expression, we obtain
$$\left\|U_\rho(z)_t-\widetilde{U}^f_\rho(z)_t\right\| \le \rho\Big(\rho\left\|U_\rho(z)_{t-2}-\widetilde{U}^f_\rho(z)_{t-2}\right\| + \varepsilon^{1/2}CM_QC_Q\big(1+|||f|||^2\big)^{1/2}\Big) + \varepsilon^{1/2}CM_QC_Q\big(1+|||f|||^2\big)^{1/2}$$
$$= \rho^2\left\|U_\rho(z)_{t-2}-\widetilde{U}^f_\rho(z)_{t-2}\right\| + (1+\rho)\,\varepsilon^{1/2}CM_QC_Q\big(1+|||f|||^2\big)^{1/2}$$
$$\le \cdots \le \rho^j\left\|U_\rho(z)_{t-j}-\widetilde{U}^f_\rho(z)_{t-j}\right\| + \big(1+\rho+\rho^2+\cdots+\rho^{j-1}\big)\,\varepsilon^{1/2}CM_QC_Q\big(1+|||f|||^2\big)^{1/2}. \qquad (56)$$
As by hypothesis $\rho<1$, we can take the limit $j\to\infty$ in this expression, which yields (26). In order to obtain (27) it suffices to replace the use of (20) in (55) by that of (22).

(v) First of all, note that for any $R>\max\{1/|||f|||^2,1\}$, the contraction parameter $\rho=1/(R|||f|||^2)$ satisfies condition (24). Set now $Q := \{\pm e_1,\dots,\pm e_N\}$. It is easy to see that with this choice the norm $\|\cdot\|_Q$ introduced in Lemma III.1 satisfies $\|\cdot\|_Q=\|\cdot\|_1$ and that $M_Q=1$. If we now recall that $\|\cdot\|\le\|\cdot\|_1\le\sqrt{N}\,\|\cdot\|$ and that $(1/\sqrt{N})\,|||\cdot|||\le|||\cdot|||_1\le\sqrt{N}\,|||\cdot|||$, we can rewrite the inequality (55) as
$$\left\|U_\rho(z)_t-\widetilde{U}^f_\rho(z)_t\right\| \le \rho\left\|U_\rho(z)_{t-1}-\widetilde{U}^f_\rho(z)_{t-1}\right\| + \varepsilon^{1/2}\,|||I_N-f^*f|||_1^{1/2}\left\|F_\rho\big(\widetilde{U}^f_\rho(z)_{t-1},z_t\big)\right\|_1$$
$$\le \rho\left\|U_\rho(z)_{t-1}-\widetilde{U}^f_\rho(z)_{t-1}\right\| + \varepsilon^{1/2}N^{3/4}C\,|||I_N-f^*f|||^{1/2}$$