Reservoir Computing Universality With Stochastic Inputs
Lukas Gonon and Juan-Pablo Ortega
Abstract—The universal approximation properties with respect to $L^p$-type criteria of three important families of reservoir computers with stochastic discrete-time semi-infinite inputs are shown. First, it is proved that linear reservoir systems with either polynomial or neural network readout maps are universal. More importantly, it is proved that the same property holds for two families with linear readouts, namely, trigonometric state-affine systems and echo state networks, which are the most widely used reservoir systems in applications. The linearity in the readouts is a key feature in supervised machine learning applications. It guarantees that these systems can be used in high-dimensional situations and in the presence of large datasets. The $L^p$ criteria used in this paper allow the formulation of universality results that do not necessarily impose almost sure uniform boundedness in the inputs or the fading memory property in the filter that needs to be approximated.
Index Terms—Reservoir computing, echo state network, ESN,
machine learning, uniform system approximation, stochastic
input, universality.
I. INTRODUCTION
A UNIVERSALITY statement in relation to a machine
learning paradigm refers to its versatility at the time of
reproducing a rich number of patterns obtained by modifying
only a limited number of hyperparameters. In the language
of learning theory, universality amounts to the possibility of
making approximation errors as small as one wants [1]–[3].
Well-known universality results are, for example, the uniform
approximation properties of feedforward neural networks es-
tablished in [4], [5] for deterministic inputs and, later on,
extended in [6] to accommodate random inputs.
This paper is a generalization of the universality statements
in [6] to a discrete-time dynamical context. More specifically,
we are interested in the learning not of functions but of
filters that transform semi-infinite random input sequences
parameterized by time into outputs that depend on those inputs
in a causal and time-invariant manner. The approximants
used are small subfamilies of reservoir computers (RC) [7],
[8] or reservoir systems. Reservoir computers (also referred
to in the literature as liquid state machines [9], [10]) are
filters generated by nonlinear state-space transformations that
constitute special types of recurrent neural networks. They are
determined by two maps, namely a reservoir map $F: \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N$, $n, N \in \mathbb{N}$, and a readout map $h: \mathbb{R}^N \to \mathbb{R}$ that under certain hypotheses transform (or filter) an infinite discrete-time
L. Gonon and J.-P. Ortega are with the Faculty of Mathematics and Statistics, Universität Sankt Gallen, Sankt Gallen, Switzerland. L. Gonon is also affiliated with the Department of Mathematics, ETH Zürich, Switzerland. J.-P. Ortega is also affiliated with the Centre National de la Recherche Scientifique (CNRS), France.
input $\mathbf{z} = (\ldots, \mathbf{z}_{-1}, \mathbf{z}_0, \mathbf{z}_1, \ldots) \in (\mathbb{R}^n)^{\mathbb{Z}}$ into an output signal $y \in \mathbb{R}^{\mathbb{Z}}$ of the same type using a state-space transformation given by:
$$\mathbf{x}_t = F(\mathbf{x}_{t-1}, \mathbf{z}_t), \quad (1)$$
$$y_t = h(\mathbf{x}_t), \quad (2)$$
where $t \in \mathbb{Z}$ and the dimension $N \in \mathbb{N}$ of the state vectors $\mathbf{x}_t \in \mathbb{R}^N$ is referred to as the number of virtual neurons
of the system. In supervised machine learning applications
the reservoir map is very often randomly generated and the
memoryless readout is trained so that the output matches a
given teaching signal. An important particular case of the RC systems in (1)-(2) is given by the echo state networks (ESN) introduced, in different contexts, in [8], [11], [12], which are built using the transformations
$$\mathbf{x}_t = \sigma(A\mathbf{x}_{t-1} + C\mathbf{z}_t + \boldsymbol{\zeta}), \qquad y_t = \mathbf{w}^\top \mathbf{x}_t, \quad (3)$$
with $A \in \mathbb{M}_N$, $C \in \mathbb{M}_{N,n}$, $\boldsymbol{\zeta} \in \mathbb{R}^N$, and $\mathbf{w} \in \mathbb{R}^N$. The map $\sigma: \mathbb{R}^N \to \mathbb{R}^N$ is obtained via the componentwise application of a given activation function $\sigma: \mathbb{R} \to \mathbb{R}$ that is denoted with the same symbol. ESNs have as an important feature the linearity of the readout specified by the vector $\mathbf{w} \in \mathbb{R}^N$ that is estimated using linear regression methods based on a training dataset. This is done once the other parameters in the model ($A$, $C$, and $\boldsymbol{\zeta}$) have been randomly generated and their scale has been adapted to the problem in question by tuning a limited number of hyperparameters (like the sparsity or the spectral radius of the matrix $A$).
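The following is a minimal numerical sketch, assuming a NumPy environment, of an ESN of the form (3); the dimensions, the spectral-radius rescaling, the $\tanh$ activation, the teaching signal, and the ridge regularization below are illustrative choices and not prescriptions from the paper. The reservoir parameters $A$, $C$, $\boldsymbol{\zeta}$ are drawn at random, the state recursion is iterated over an input sequence, and only the linear readout $\mathbf{w}$ is fitted by regularized linear regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, T = 1, 100, 2000            # input dimension, number of neurons, sample length

# Randomly generated reservoir parameters; A is rescaled so that its spectral
# radius is below 1 (a commonly used choice in practice).
A = rng.normal(size=(N, N))
A *= 0.9 / max(abs(np.linalg.eigvals(A)))
C = rng.normal(size=(N, n))
zeta = rng.normal(size=N)

# Input stream and an (illustrative) nonlinear causal teaching signal.
z = rng.normal(size=(T, n))
y_target = np.tanh(z[:, 0]) + 0.5 * np.roll(z[:, 0], 1) ** 2

# Iterate the state equation x_t = sigma(A x_{t-1} + C z_t + zeta).
X = np.zeros((T, N))
x = np.zeros(N)
for t in range(T):
    x = np.tanh(A @ x + C @ z[t] + zeta)
    X[t] = x

# Train only the linear readout w by ridge regression against the teaching signal.
lam = 1e-6
w = np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ y_target)
y_hat = X @ w
print("training MSE:", np.mean((y_hat - y_target) ** 2))
```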
Families of reservoir systems of the type (1)-(2) have
already been proved to be universal in different contexts. In the
continuous-time setup, it was shown in [13] that linear reser-
voir systems with polynomial readouts or bilinear reservoirs
with linear readouts are able to uniformly approximate any
fading memory filter with uniformly bounded and equicon-
tinuous inputs. The fading memory property is a continuity
feature exhibited by many filters encountered in applications.
See also [9], [10], [14], [15] for other contributions to the RC
universality problem in the continuous-time setup.
In the discrete-time setup, several universality statements
were already part of classical systems theory statements for
inputs defined on a finite number of time points [16]–[18].
In the more general context of semi-infinite inputs, various
universality results have been formulated for systems with
approximate finite memory [11], [12], [19]–[22]. More re-
cently, it has been shown in [23], [24], that RCs generated
by contractive reservoir maps (similar to the ESNs introduced
above) exhibit universality properties in the approximate finite
memory category.
These universality results have been recently extended to
the causal and fading memory category in [25], [26]. In those
works the universality of two important families of reservoir
systems with linear readouts has been established, namely,
the so-called state-affine systems (SAS) and the echo state
networks (ESN) that we just introduced in (3). Moreover, the
universality of the SAS family was established in [25] both for
uniformly bounded deterministic inputs, as well as for almost
surely uniformly bounded stochastic ones. This last statement
was shown to be a corollary of a general transfer theorem
that proves that very important features of causal and time-
invariant filters like the fading memory property or universality
are naturally inherited by reservoir systems with almost surely
uniformly bounded stochastic inputs from their counterparts
with deterministic inputs.
Unfortunately, almost surely bounded random inputs are
not always appropriate for many applications. For example,
most parametric time series models use as driving innova-
tions random variables whose distributions are not compactly
supported (Gaussian, for example) in order to ensure ade-
quate levels of performance. The main goal of this work
is formulating universality results in the stochastic context
that do not impose almost sure uniform boundedness in the
inputs. This is achieved by using a density criterion (which is
the mathematical characterization of universality) based not
on $L^\infty$-type norms, as in [25], [26], but on $L^p$ norms,
$p \in [1, \infty)$. This approach follows the pattern introduced in
the static case in [6].
This strategy allows us to cover a more general class of input
signals and filters, but it also creates some differences in
the type of approximation results that are obtained. More
specifically, in the stochastic universality statements in [25],
for example, universal families are presented that uniformly
approximate any given filter for any input in a given class of
stochastic processes. In contrast with this statement and like
in [6], we fix here first a discrete-time stochastic process that
models the data generating process (DGP) behind the system
inputs that are being considered. Subsequently, families of
reservoir filters are spelled out whose images of the DGP
are dense in the $L^p$ sense. Equivalently, the image of the
DGP by any measurable causal and time-invariant filter can
be approximated by the image of one of the members of the
universal family with respect to an $L^p$ norm defined using the
law of the prefixed DGP.
It is important to point out that this approach allows us to
formulate universality results for filters that do not necessarily
have the fading memory property since only measurability is
imposed as a hypothesis.
The paper contains three main universality statements. The
first one shows that linear reservoir systems with either poly-
nomial or neural network readout maps are universal in the
$L^p$ sense. More importantly, two other families with linear
readouts are shown to also have this property, namely, trigono-
metric state-affine systems and echo state networks, which
are the most widely used reservoir systems in applications.
The linearity of the readout is a key feature of these systems
since in supervised machine learning applications it reduces
the training task to the solution of a linear regression problem,
which can be implemented efficiently also in high-dimensional
situations and in the presence of large datasets.
We emphasize that, from a learning theoretical perspective,
the results in this paper only establish the possibility of
making the approximation error arbitrarily small when using
the proposed RC families in a specific learning task. We provide bounds neither for the approximation errors nor for the corresponding estimation errors obtained using finite random samples.
Even though some results in this direction already exist in the
literature [23], [24], we plan to address this important subject
in a forthcoming paper where the same degree of generality
as in the present paper will be adopted.
II. PRELIMINARIES
In this section we introduce some notation and collect
general facts about filters, reservoir systems, and stochastic
input signals.
A. Notation
We write $\mathbb{N} = \{0, 1, \ldots\}$ and $\mathbb{Z}_- = \{\ldots, -1, 0\}$. The elements of the Euclidean spaces $\mathbb{R}^n$ will be written as column vectors and will be denoted in bold. Given a vector $\mathbf{v} \in \mathbb{R}^n$, we denote its entries by $v_i$ or by $v^{(i)}$, with $i \in \{1, \ldots, n\}$. $(\mathbb{R}^n)^{\mathbb{Z}}$ and $(\mathbb{R}^n)^{\mathbb{Z}_-}$ denote the sets of infinite $\mathbb{R}^n$-valued sequences of the type $(\ldots, \mathbf{z}_{-1}, \mathbf{z}_0, \mathbf{z}_1, \ldots)$ and $(\ldots, \mathbf{z}_{-1}, \mathbf{z}_0)$ with $\mathbf{z}_i \in \mathbb{R}^n$ for $i \in \mathbb{Z}$ and $i \in \mathbb{Z}_-$, respectively. Additionally, we denote by $z^{(k)}_i$ the $k$-th component of $\mathbf{z}_i$. The elements in these sequence spaces will also be written in bold, for example, $\mathbf{z} := (\ldots, \mathbf{z}_{-1}, \mathbf{z}_0) \in (\mathbb{R}^n)^{\mathbb{Z}_-}$. We denote by $\mathbb{M}_{n,m}$ the space of real $n \times m$ matrices with $m, n \in \mathbb{N}$. When $n = m$, we use the symbol $\mathbb{M}_n$ to refer to the space of square matrices of order $n$. Random variables and stochastic processes will be denoted using upper case characters that will be bold when they are vector valued.
B. Filters and functionals
A filter is a map $U: (\mathbb{R}^n)^{\mathbb{Z}} \to \mathbb{R}^{\mathbb{Z}}$. It is called causal, if for any $\mathbf{z}, \mathbf{w} \in (\mathbb{R}^n)^{\mathbb{Z}}$ which satisfy $\mathbf{z}_\tau = \mathbf{w}_\tau$ for all $\tau \leq t$ for a given $t \in \mathbb{Z}$, one has that $U(\mathbf{z})_t = U(\mathbf{w})_t$. Denote by $T_\tau: (\mathbb{R}^n)^{\mathbb{Z}} \to (\mathbb{R}^n)^{\mathbb{Z}}$ the time delay operator defined by $T_\tau(\mathbf{z})_t := \mathbf{z}_{t+\tau}$, for any $\tau \in \mathbb{Z}$. A filter $U$ is called time-invariant, if $T_\tau \circ U = U \circ T_\tau$ for all $\tau \in \mathbb{Z}$.
Causal and time-invariant filters can be equivalently described using their naturally associated functionals. We refer to a map $H: (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}$ as a functional. Given a causal and time-invariant filter $U$, one defines the functional $H_U$ associated to it by setting $H_U(\mathbf{z}) := U(\mathbf{z}^e)_0$. Here $\mathbf{z}^e$ is an arbitrary extension of $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}_-}$ to $(\mathbb{R}^n)^{\mathbb{Z}}$. $H_U$ does not depend on the choice of this extension since $U$ is causal. Conversely, given a functional $H$ one may define a causal and time-invariant filter $U_H: (\mathbb{R}^n)^{\mathbb{Z}} \to \mathbb{R}^{\mathbb{Z}}$ by setting $U_H(\mathbf{z})_t := H(\pi_{\mathbb{Z}_-} \circ T_t(\mathbf{z}))$, where $\pi_{\mathbb{Z}_-}: (\mathbb{R}^n)^{\mathbb{Z}} \to (\mathbb{R}^n)^{\mathbb{Z}_-}$ is the natural projection. One may verify that any causal and time-invariant filter can be recovered from its associated functional and
conversely. Equivalently, $U = U_{H_U}$ and $H = H_{U_H}$. We refer to [13] for further details.
If $U$ is causal and time-invariant, then for any $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}}$ the sequence $U(\mathbf{z})$ restricted to $\mathbb{Z}_-$ only depends on $(\mathbf{z}_t)_{t \in \mathbb{Z}_-}$. Thus we may also consider $U$ as a map $U: (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}^{\mathbb{Z}_-}$, but when we do so this will always be clear from the context.
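As a hedged illustration of the filter/functional correspondence (assuming NumPy; the exponentially weighted target below is just an example, not one taken from the paper), consider a causal and time-invariant filter given by a discounted sum of past inputs: its functional is obtained by evaluating at time zero, and the filter is recovered from the functional by shifting and truncating the input as in (4).

```python
import numpy as np

lam = 0.8          # decay parameter of the example filter (illustrative)
M = 200            # truncation length approximating the semi-infinite sum

def functional_H(z_past):
    """H(z) for a semi-infinite input (..., z_{-1}, z_0), given as an array
    ordered from oldest to most recent entry, i.e. z_past[-1] plays the role of z_0."""
    recent = z_past[::-1][:M]                       # z_0, z_{-1}, ..., z_{-(M-1)}
    return float(np.sum(lam ** np.arange(len(recent)) * recent))

def filter_U(z, t):
    """U(z)_t obtained from the functional by shifting the input so that time t
    becomes time 0 and truncating to the past."""
    return functional_H(z[: t + 1])

rng = np.random.default_rng(1)
z = rng.normal(size=1000)

# Causality/time-invariance check: the output at time t only uses z[:t+1],
# and shifting the input shifts the output accordingly.
t = 700
print(filter_U(z, t))
print(filter_U(np.roll(z, 5), t + 5))   # equal: the last M inputs coincide
```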
C. Reservoir computing systems
A specific class of filters can be obtained using the reservoir computing systems or reservoir computers (RC) introduced in (1)-(2) when they satisfy the so-called echo state property (ESP) given by the following statement (see [27]–[29]): for any $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}}$ there exists a unique $\mathbf{x} \in (\mathbb{R}^N)^{\mathbb{Z}}$ such that (1) holds. In the presence of the ESP, the RC system gives rise to a well-defined filter $U^F_h$ that is constructed by associating to any $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}}$ the unique $\mathbf{x} \in (\mathbb{R}^N)^{\mathbb{Z}}$ satisfying (1) and by mapping $\mathbf{x}$ subsequently to the output in (2), that is, $U^F_h(\mathbf{z})_t := y_t$. Furthermore, it can be shown (see [26, Proposition 2.1]) that $U^F_h$ is necessarily causal and time-invariant and hence we may associate to $U^F_h$ a reservoir functional $H^F_h: (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}$ defined as $H^F_h(\mathbf{z}) := U^F_h(\mathbf{z})_0$.
As seen above, the causal and time-invariant filter $U^F_h$ is uniquely determined by the reservoir functional $H^F_h$. Since the latter is determined by the restriction of the RC system to $\mathbb{Z}_-$, we will sometimes consider the system (1)-(2) only for $t \in \mathbb{Z}_-$.
D. Deterministic filters with stochastic inputs
We are interested in feeding the filters and the systems
that we just introduced with stochastic processes as inputs.
More explicitly, given a causal and time-invariant filter $U$ that satisfies certain measurability hypotheses, any stochastic process $\mathbf{Z} = (\mathbf{Z}_t)_{t \in \mathbb{Z}_-}$ is mapped to a new stochastic process $(U(\mathbf{Z})_t)_{t \in \mathbb{Z}_-}$. The main contributions in this article address the question of approximating $U(\mathbf{Z})$ by reservoir filters in an $L^p$ sense. We now introduce the precise framework to achieve this goal.
1) Probabilistic framework: Consider a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ on which all random variables are defined. Recall that the sample space $\Omega$ is an arbitrary set representing possible outcomes, the $\sigma$-algebra $\mathcal{F}$ is a collection of subsets of $\Omega$ describing the set of events to be considered, and $\mathbb{P}: \mathcal{F} \to [0,1]$ is a probability measure that assigns a probability of occurrence to each event. The input signal is modeled as a discrete-time stochastic process $\mathbf{Z} = (\mathbf{Z}_t)_{t \in \mathbb{Z}_-}$ with values in $\mathbb{R}^n$. For each outcome $\omega \in \Omega$ we denote by $\mathbf{Z}(\omega) = (\mathbf{Z}_t(\omega))_{t \in \mathbb{Z}_-}$ the realization or sample path of $\mathbf{Z}$. Thus $\mathbf{Z}$ may be viewed as a random sequence in $\mathbb{R}^n$ and when dealing with stochastic processes we will make no distinction between the assignment $\mathbf{Z}: \mathbb{Z}_- \times \Omega \to \mathbb{R}^n$ and the corresponding map into path space $\mathbf{Z}: \Omega \to (\mathbb{R}^n)^{\mathbb{Z}_-}$. We recall that $\mathbf{Z}$ is a stochastic process when the corresponding map $\mathbf{Z}: \Omega \to (\mathbb{R}^n)^{\mathbb{Z}_-}$ is measurable. Here $(\mathbb{R}^n)^{\mathbb{Z}_-}$ is equipped with the product $\sigma$-algebra $\bigotimes_{t \in \mathbb{Z}_-} \mathcal{B}(\mathbb{R}^n)$ (which coincides with the Borel $\sigma$-algebra of $(\mathbb{R}^n)^{\mathbb{Z}_-}$ equipped with the product topology by [30, Lemma 1.2]), where $\mathcal{B}(\mathbb{R}^n)$ is the Borel $\sigma$-algebra on $\mathbb{R}^n$.
We denote by $\mathcal{F}_t := \sigma(\mathbf{Z}_0, \ldots, \mathbf{Z}_t)$, $t \in \mathbb{Z}_-$, the $\sigma$-algebra generated by $\{\mathbf{Z}_0, \ldots, \mathbf{Z}_t\}$ and write $\mathcal{F}_{-\infty} := \sigma(\mathbf{Z}_t : t \in \mathbb{Z}_-)$. Thus $\mathcal{F}_t$ models the information contained in the input stream at times $0, -1, \ldots, t$. For $p \in [1, \infty]$ we denote by $L^p(\Omega, \mathcal{F}, \mathbb{P})$ the Banach space formed by the real-valued random variables on $(\Omega, \mathcal{F}, \mathbb{P})$ that have a finite usual $L^p$ norm $\|\cdot\|_p$.
We say that the process $\mathbf{Z}$ is stationary when for any $\{t_1, \ldots, t_k\} \subset \mathbb{Z}_-$, $h \in \mathbb{Z}_-$, and $A_{t_1}, \ldots, A_{t_k} \in \mathcal{B}(\mathbb{R}^n)$, we have that
$$\mathbb{P}(\mathbf{Z}_{t_1} \in A_{t_1}, \ldots, \mathbf{Z}_{t_k} \in A_{t_k}) = \mathbb{P}(\mathbf{Z}_{t_1+h} \in A_{t_1}, \ldots, \mathbf{Z}_{t_k+h} \in A_{t_k}).$$
2) Measurable functionals and filters: We say that a functional $H$ is measurable when the map between measurable spaces $H: \big((\mathbb{R}^n)^{\mathbb{Z}_-}, \bigotimes_{t \in \mathbb{Z}_-} \mathcal{B}(\mathbb{R}^n)\big) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is measurable. When $H$ is measurable then so is $H(\mathbf{Z}): (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ since $H(\mathbf{Z}) = H \circ \mathbf{Z}$ is the composition of measurable maps and hence $H(\mathbf{Z})$ is a random variable on $(\Omega, \mathcal{F}, \mathbb{P})$.
Analogously, we will say that a causal, time-invariant filter $U$ is measurable when the map between measurable spaces $U: \big((\mathbb{R}^n)^{\mathbb{Z}}, \bigotimes_{t \in \mathbb{Z}} \mathcal{B}(\mathbb{R}^n)\big) \to \big(\mathbb{R}^{\mathbb{Z}}, \bigotimes_{t \in \mathbb{Z}} \mathcal{B}(\mathbb{R})\big)$ is measurable. In that case, also the restriction of $U$ to $\mathbb{Z}_-$ (see above) is measurable and so $U(\mathbf{Z})$ is a real-valued stochastic process.
As discussed above, causal, time-invariant filters and functionals are in a one-to-one correspondence. This relation is compatible with the measurability condition, that is, a causal and time-invariant filter is measurable if and only if the associated functional is measurable. In order to prove this statement we show first that the operator $\pi_{\mathbb{Z}_-} \circ T_t: \big((\mathbb{R}^n)^{\mathbb{Z}}, \bigotimes_{t \in \mathbb{Z}} \mathcal{B}(\mathbb{R}^n)\big) \to \big((\mathbb{R}^n)^{\mathbb{Z}_-}, \bigotimes_{t \in \mathbb{Z}_-} \mathcal{B}(\mathbb{R}^n)\big)$ is a measurable map, for any $t \in \mathbb{Z}$. Indeed, notice first that the projections $p_i: \big((\mathbb{R}^n)^{\mathbb{Z}}, \bigotimes_{t \in \mathbb{Z}} \mathcal{B}(\mathbb{R}^n)\big) \to (\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$, $i \in \mathbb{Z}$, given by $p_i(\mathbf{z}) = \mathbf{z}_i$ are measurable. Thus $\pi_{\mathbb{Z}_-} \circ T_t$ can be written as the Cartesian product of measurable maps, i.e. for each $k \in \mathbb{Z}_-$ one has that $(\pi_{\mathbb{Z}_-} \circ T_t)_k = p_{t+k}$ is measurable. This yields that $\pi_{\mathbb{Z}_-} \circ T_t$ is measurable [30, Lemma 1.8].
Now, if $H$ is a measurable functional, this implies that the associated filter $U_H$ is also measurable, since for each $t \in \mathbb{Z}$,
$$(U_H)_t = H \circ \pi_{\mathbb{Z}_-} \circ T_t, \quad (4)$$
is a composition of measurable functions and hence also measurable. Conversely, if $U$ is causal, time-invariant, and measurable, then so is the associated functional $H_U = p_0 \circ U$.
3) $L^p$-norm for functionals: Fix $p \in [1, \infty)$ and let $H$ be a measurable functional such that $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$. The functionals which satisfy that
$$\|H(\mathbf{Z})\|_p := \mathbb{E}[|H(\mathbf{Z})|^p]^{1/p} < \infty \quad (5)$$
will be referred to as $p$-integrable with respect to the input process $\mathbf{Z}$.
Let us now consider the expression (5) from an alternative point of view. Denote by $\mu_{\mathbf{Z}} := \mathbb{P} \circ \mathbf{Z}^{-1}$ the law of $\mathbf{Z}$ when viewed as an $(\mathbb{R}^n)^{\mathbb{Z}_-}$-valued random variable as above. Thus $\mu_{\mathbf{Z}}$ is a probability measure on $(\mathbb{R}^n)^{\mathbb{Z}_-}$ such that for any measurable set $A \subset (\mathbb{R}^n)^{\mathbb{Z}_-}$ one has $\mu_{\mathbf{Z}}(A) = \mathbb{P}(\mathbf{Z} \in A)$.
The requirement $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$ then translates to $H \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$ and (5) is equal [30, Lemma 1.22] to
$$\|H\|^{\mu_{\mathbf{Z}}}_p := \left[ \int_{(\mathbb{R}^n)^{\mathbb{Z}_-}} |H(\mathbf{z})|^p \, \mu_{\mathbf{Z}}(d\mathbf{z}) \right]^{1/p} = \|H(\mathbf{Z})\|_p.$$
Thus, the results formulated later on in the paper for functionals with random inputs can also be seen as statements for functionals with deterministic inputs in $(\mathbb{R}^n)^{\mathbb{Z}_-}$, where the closeness between them is measured using the norm in $L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$. Following the terminology used by [6] we will refer to $\mu_{\mathbf{Z}}$ as the input environment measure.
We emphasize that these two points of view are equivalent. Given any probability measure $\mu_{\mathbf{Z}}$ on $(\mathbb{R}^n)^{\mathbb{Z}_-}$ one may set $\Omega = (\mathbb{R}^n)^{\mathbb{Z}_-}$, $\mathcal{F} = \bigotimes_{t \in \mathbb{Z}_-} \mathcal{B}(\mathbb{R}^n)$, $\mathbb{P} = \mu_{\mathbf{Z}}$, and define $\mathbf{Z}_t(\mathbf{z}) := \mathbf{z}_t$ for all $\mathbf{z} \in \Omega$. We will switch between these two viewpoints throughout the paper without much warning to the reader.
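As a hedged illustration of the norm in (5) (assuming NumPy; the functional, the input law, and the truncation are toy choices, not taken from the paper), $\|H(\mathbf{Z})\|_p$ can be estimated by Monte Carlo, sampling truncated input paths from the input environment measure:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 2
n_paths, T = 100_000, 50       # number of Monte Carlo paths and truncation length

def H(z_past):
    """Toy functional of a semi-infinite input path (..., z_{-1}, z_0):
    an exponentially discounted sum of squares; z_past[..., -1] plays the role of z_0."""
    lam = 0.7
    weights = lam ** np.arange(z_past.shape[-1])[::-1]   # weight lam^j on z_{-j}
    return np.sum(weights * z_past ** 2, axis=-1)

# Sample truncated i.i.d. standard normal input paths (Z_{-T+1}, ..., Z_0).
Z = rng.normal(size=(n_paths, T))

# Monte Carlo estimate of ||H(Z)||_p = E[|H(Z)|^p]^{1/p}.
values = H(Z)
lp_norm_estimate = np.mean(np.abs(values) ** p) ** (1 / p)
print("estimated L^p norm:", lp_norm_estimate)
```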
4) $L^p$-norm for filters: Fix $p \in [1, \infty)$. A causal, time-invariant, measurable filter $U$ is said to be $p$-integrable, if
$$\|U(\mathbf{Z})\|_p := \sup_{t \in \mathbb{Z}_-} \big\{ \mathbb{E}[|U(\mathbf{Z})_t|^p]^{1/p} \big\} < \infty. \quad (6)$$
It is easy to see that if $U$ is $p$-integrable, then so is the corresponding functional $H_U$ due to the following inequality:
$$\|H_U(\mathbf{Z})\|_p = \mathbb{E}[|H_U(\mathbf{Z})|^p]^{1/p} = \mathbb{E}[|U(\mathbf{Z})_0|^p]^{1/p} \leq \sup_{t \in \mathbb{Z}_-} \big\{ \mathbb{E}[|U(\mathbf{Z})_t|^p]^{1/p} \big\} = \|U(\mathbf{Z})\|_p < \infty.$$
The converse implication holds true when the input process is stationary. In order to show this fact, notice first that if $\mu_t$ is the law of $\pi_{\mathbb{Z}_-} \circ T_t(\mathbf{Z})$, $t \in \mathbb{Z}_-$, and $\mathbf{Z}$ is by hypothesis stationary then, for any $\{t_1, \ldots, t_k\} \subset \mathbb{Z}_-$ and $A_{t_1}, \ldots, A_{t_k} \in \mathcal{B}(\mathbb{R}^n)$, we have that
$$\mathbb{P}\big( (\pi_{\mathbb{Z}_-} \circ T_t(\mathbf{Z}))_{t_1} \in A_{t_1}, \ldots, (\pi_{\mathbb{Z}_-} \circ T_t(\mathbf{Z}))_{t_k} \in A_{t_k} \big) = \mathbb{P}(\mathbf{Z}_{t_1+t} \in A_{t_1}, \ldots, \mathbf{Z}_{t_k+t} \in A_{t_k}) = \mathbb{P}(\mathbf{Z}_{t_1} \in A_{t_1}, \ldots, \mathbf{Z}_{t_k} \in A_{t_k}),$$
which proves that
$$\mu_{\mathbf{Z}} = \mu_t, \quad \text{for all } t \in \mathbb{Z}_-. \quad (7)$$
This identity, together with (4), implies that for any $p$-integrable functional $H$:
$$\|U_H(\mathbf{Z})\|_p = \sup_{t \in \mathbb{Z}_-} \big\{ \mathbb{E}[|U_H(\mathbf{Z})_t|^p]^{1/p} \big\} = \sup_{t \in \mathbb{Z}_-} \big\{ \mathbb{E}\big[ |H(\pi_{\mathbb{Z}_-} \circ T_t(\mathbf{Z}))|^p \big]^{1/p} \big\} = \sup_{t \in \mathbb{Z}_-} \left[ \int_{(\mathbb{R}^n)^{\mathbb{Z}_-}} |H(\mathbf{z})|^p \, \mu_t(d\mathbf{z}) \right]^{1/p} = \left[ \int_{(\mathbb{R}^n)^{\mathbb{Z}_-}} |H(\mathbf{z})|^p \, \mu_{\mathbf{Z}}(d\mathbf{z}) \right]^{1/p} = \|H(\mathbf{Z})\|_p < \infty, \quad (8)$$
which proves the $p$-integrability of the associated filter $U_H$.
III. $L^p$-UNIVERSALITY RESULTS
Fix $p \in [1, \infty)$, $\mathbf{Z}$ an input process, and a functional $H$ such that $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$. The goal of this section is finding simple families of reservoir systems that are able to approximate $H(\mathbf{Z})$ as accurately as needed in the $L^p$ sense. The first part contains a result that shows that linear reservoir maps with polynomial readouts are able to carry this out. As we already pointed out in the introduction, a result for the same type of reservoir systems has been proved in [25] in the $L^\infty$ setting for both deterministic and almost surely uniformly bounded stochastic inputs. The second part presents a family that is able to achieve universality using only linear readouts, which is of major importance for applications since in that case the training effort reduces to solving a linear regression. Finally, we prove the universality of echo state networks, which are the most widely used family of reservoir systems with linear readouts.
A. Linear reservoirs with nonlinear readouts
Consider a reservoir system with linear reservoir map and a polynomial readout. More precisely, given $A \in \mathbb{M}_N$, $c \in \mathbb{M}_{N,n}$, and $h \in \mathrm{Pol}_N$ a real-valued polynomial in $N$ variables, consider the system
$$\begin{cases} \mathbf{x}_t = A\mathbf{x}_{t-1} + c\mathbf{z}_t, & t \in \mathbb{Z}_-, \\ y_t = h(\mathbf{x}_t), & t \in \mathbb{Z}_-, \end{cases} \quad (9)$$
for any $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}_-}$. If the matrix $A$ is chosen so that $\sigma_{\max}(A) < 1$, then this system has the echo state property and the corresponding reservoir filter $U^{A,c}_h$ is causal and time-invariant [25]. We denote by $H^{A,c}_h$ the associated functional. We are interested in the approximation capabilities that can be achieved by using processes of the type $H^{A,c}_h(\mathbf{Z})$, where $\mathbf{Z}$ is a fixed input process and $H^{A,c}_h(\mathbf{Z}) = Y_0$, with $Y_0$ obviously determined by the stochastic reservoir system
$$\begin{cases} \mathbf{X}_t = A\mathbf{X}_{t-1} + c\mathbf{Z}_t, & t \in \mathbb{Z}_-, \\ Y_t = h(\mathbf{X}_t), & t \in \mathbb{Z}_-. \end{cases} \quad (10)$$
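The following is a minimal numerical sketch of a system of the form (9)-(10), assuming NumPy; the dimensions, the contraction factor, and the quadratic readout are illustrative assumptions and not choices made in the paper. A linear state recursion $\mathbf{X}_t = A\mathbf{X}_{t-1} + c\mathbf{Z}_t$ with $\sigma_{\max}(A) < 1$ is driven by a Gaussian input process and followed by a polynomial readout.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, T = 2, 20, 1000

# Linear reservoir map with largest singular value below 1, which yields
# the echo state property for this class of systems.
A = rng.normal(size=(N, N))
A *= 0.8 / np.linalg.norm(A, 2)          # sigma_max(A) = 0.8 < 1
c = rng.normal(size=(N, n))

# A simple quadratic readout h(x) = sum_i x_i + sum_i x_i^2 (an element of Pol_N).
def h(x):
    return np.sum(x) + np.sum(x ** 2)

# Gaussian input stream (this satisfies the exponential moment condition (11)).
Z = rng.normal(size=(T, n))

# Iterate X_t = A X_{t-1} + c Z_t and read out Y_t = h(X_t).
X = np.zeros(N)
Y = np.zeros(T)
for t in range(T):
    X = A @ X + c @ Z[t]
    Y[t] = h(X)

print("last reservoir output:", Y[-1])
```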
Proposition III.1. Fix $p \in [1, \infty)$, let $\mathbf{Z}$ be a fixed $\mathbb{R}^n$-valued input process, and let $H$ be a functional such that $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$. Suppose that for any $K \in \mathbb{N}$ there exists $\alpha > 0$ such that
$$\mathbb{E}\left[ \exp\left( \alpha \sum_{k=0}^{K} \sum_{i=1}^{n} |Z^{(i)}_{-k}| \right) \right] < \infty, \quad (11)$$
where $Z^{(i)}_{-k}$ denotes the $i$-th component of $\mathbf{Z}_{-k}$. Then, for any $\varepsilon > 0$ there exist $N \in \mathbb{N}$, $A \in \mathbb{M}_N$, $c \in \mathbb{M}_{N,n}$, and $h \in \mathrm{Pol}_N$ such that (9) has the echo state property, the corresponding filter is causal and time-invariant, the associated functional satisfies $H^{A,c}_h(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$, and
$$\|H(\mathbf{Z}) - H^{A,c}_h(\mathbf{Z})\|_p < \varepsilon. \quad (12)$$
If the input process $\mathbf{Z}$ is stationary then
$$\|U_H(\mathbf{Z}) - U^{A,c}_h(\mathbf{Z})\|_p < \varepsilon. \quad (13)$$
Proof. The proof consists of two steps: In the first one we use assumption (11) and classical results in the literature to establish that
$$\mathrm{Pol}_{n(K+1)} \text{ is dense in } L^p(\mathbb{R}^{n(K+1)}, \mu_K), \quad \text{for all } K \in \mathbb{N}, \quad (14)$$
where $\mu_K$ is the law of $(Z^{(1)}_0, Z^{(2)}_0, \ldots, Z^{(n-1)}_{-K}, Z^{(n)}_{-K})$ on $\mathbb{R}^{n(K+1)}$ under $\mathbb{P}$. In the second step we then use (14) to construct a linear RC system of the type in (9) that yields the approximation statement (12).
Step 1: Denote by $\mu_K$ the law of $(Z^{(1)}_0, Z^{(2)}_0, \ldots, Z^{(n-1)}_{-K}, Z^{(n)}_{-K})$ on $\mathbb{R}^N$ under $\mathbb{P}$, where $N := n(K+1)$. By (11) there exists $\alpha > 0$ such that $\int_{\mathbb{R}^N} \exp(\alpha \|\mathbf{z}\|_1)\, \mu_K(d\mathbf{z}) < \infty$, where here and in the rest of this proof $\|\cdot\|_1$ denotes the Euclidean 1-norm. Denoting by $\mu^j_K$ the $j$-th marginal distribution of $\mu_K$, this implies for $j = 1, \ldots, N$ that
$$\int_{\mathbb{R}} \exp(\alpha |z^{(j)}|)\, \mu^j_K(dz^{(j)}) \leq \int_{\mathbb{R}^N} \exp(\alpha \|\mathbf{z}\|_1)\, \mu_K(d\mathbf{z}) < \infty.$$
Consequently, by [31, Theorem 6], $\mathrm{Pol}_1$ is dense in $L^p(\mathbb{R}, \mu^j_K)$ for any $p \in [1, \infty)$, $j = 1, \ldots, N$. By [32, Proposition, page 364] this implies that $\mathrm{Pol}_N$ is dense in $L^p(\mathbb{R}^N, \mu_K)$, where we note that $\mu_K$ indeed satisfies the moment assumption in [32, page 361]: since for any $m \in \mathbb{N}$ there exists a constant $c > 0$ such that $x^{2m} \leq c\,\exp(\alpha x)$ for all $x \geq 0$, one has
$$\int_{\mathbb{R}^N} \|\mathbf{z}\|_2^{2m}\, \mu_K(d\mathbf{z}) \leq c \int_{\mathbb{R}^N} \exp(\alpha \|\mathbf{z}\|_2)\, \mu_K(d\mathbf{z}) \leq c \int_{\mathbb{R}^N} \exp(\alpha \|\mathbf{z}\|_1)\, \mu_K(d\mathbf{z}) < \infty.$$
Step 2: Let $\varepsilon > 0$. By Lemma A.1 in the appendix there exists $K \in \mathbb{N}$ such that
$$\|H(\mathbf{Z}) - \mathbb{E}[H(\mathbf{Z}) \mid \mathcal{F}_{-K}]\|_p < \frac{\varepsilon}{2} \quad (15)$$
where $\mathcal{F}_{-K} := \sigma(\mathbf{Z}_0, \ldots, \mathbf{Z}_{-K})$. In the following paragraphs we will establish the approximation statement (12) for $\mathbb{E}[H(\mathbf{Z}) \mid \mathcal{F}_{-K}]$ instead of $H(\mathbf{Z})$. Combining this with (15) will then yield (12).
Let $N := n(K+1)$. By definition, $\mathbb{E}[H(\mathbf{Z}) \mid \mathcal{F}_{-K}]$ is $\mathcal{F}_{-K}$-measurable and hence there exists [30, Lemma 1.13] a measurable function $g_K: \mathbb{R}^N \to \mathbb{R}$ such that $\mathbb{E}[H(\mathbf{Z}) \mid \mathcal{F}_{-K}] = g_K(\mathbf{Z}_0, \ldots, \mathbf{Z}_{-K})$. Furthermore,
$$\int_{\mathbb{R}^N} |g_K(\mathbf{z})|^p\, \mu_K(d\mathbf{z}) = \mathbb{E}\big[ |\mathbb{E}[H(\mathbf{Z}) \mid \mathcal{F}_{-K}]|^p \big] \leq \mathbb{E}[|H(\mathbf{Z})|^p] < \infty,$$
by standard properties of conditional expectations (see, for instance, [33, Theorem 5.1.4]) and the assumption that $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$. Thus, $g_K \in L^p(\mathbb{R}^N, \mu_K)$ and, using the statement (14) established in Step 1, there exists $h \in \mathrm{Pol}_N$ such that
$$\|\mathbb{E}[H(\mathbf{Z}) \mid \mathcal{F}_{-K}] - h(\mathbf{Z}_0^\top, \ldots, \mathbf{Z}_{-K}^\top)\|_p = \|g_K - h\|_{L^p(\mathbb{R}^N, \mu_K)} < \frac{\varepsilon}{2}. \quad (16)$$
Define now a reservoir system of the type (10) with inputs given by the random variables $\mathbf{Z}_t$, $t \in \mathbb{Z}_-$, and reservoir matrices $A \in \mathbb{M}_N$ and $c \in \mathbb{M}_{N,n}$ with all entries equal to $0$ except $A_{i,i-n} = 1$ for $i = n+1, \ldots, N$ and $c_{i,i} = 1$ for $i = 1, \ldots, n$, that is,
$$A = \begin{pmatrix} 0_{n,nK} & 0_{n,n} \\ I_{nK} & 0_{nK,n} \end{pmatrix}, \quad \text{and} \quad c = \begin{pmatrix} I_n \\ 0_{nK,n} \end{pmatrix}.$$
This system has the echo state property (all the eigenvalues of $A$ equal zero) and has a unique causal and time-invariant solution associated to the reservoir states $\mathbf{X}_t := \big( \mathbf{Z}_t^\top, \mathbf{Z}_{t-1}^\top, \ldots, \mathbf{Z}_{t-K}^\top \big)^\top$, $t \in \mathbb{Z}_-$. It is easy to verify that the corresponding reservoir functional is given by
$$H^{A,c}_h(\mathbf{Z}) = h(\mathbf{Z}_0^\top, \ldots, \mathbf{Z}_{-K}^\top). \quad (17)$$
Now the triangle inequality and (15), (16), and (17) allow us to conclude (12).
The statement in (13) in the presence of the stationarity hypothesis for $\mathbf{Z}$ is a straightforward consequence of (7) and the equality (8).
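The following hedged sketch (NumPy; the sizes are arbitrary) checks numerically the shift-register construction used in Step 2: with the nilpotent matrix $A$ and the matrix $c$ defined above, the reservoir state at time $t$ stacks the last $K+1$ inputs, so that $H^{A,c}_h(\mathbf{Z}) = h(\mathbf{Z}_0^\top, \ldots, \mathbf{Z}_{-K}^\top)$ as in (17).

```python
import numpy as np

rng = np.random.default_rng(4)
n, K = 2, 3
N = n * (K + 1)

# Shift-register reservoir matrices from Step 2 of the proof: A has identity
# blocks just below the block diagonal, and c injects the newest input on top.
A = np.zeros((N, N))
A[n:, :-n] = np.eye(n * K)
c = np.zeros((N, n))
c[:n, :] = np.eye(n)

T = 50
Z = rng.normal(size=(T, n))

x = np.zeros(N)
for t in range(T):
    x = A @ x + c @ Z[t]

# After at least K+1 steps the state stacks (Z_t, Z_{t-1}, ..., Z_{t-K}).
expected = np.concatenate([Z[T - 1 - j] for j in range(K + 1)])
print(np.allclose(x, expected))   # True
```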
Remark III.2. It is important to point out that the reservoir systems used in the proof of Proposition III.1 all have finite memory. Thus, this proof shows that it is possible to obtain universality in the $L^p$ sense with that type of finite memory systems and that, in particular, they can be used to approximate infinite memory filters. A key ingredient in this statement is, apart from the hypothesis (11), Lemma A.1 in the appendix. The other universal systems introduced later on in the paper (trigonometric state-affine systems and echo state networks) also share this feature. Similar statements have also been proved for linear reservoir systems with polynomial readouts and state-affine systems with linear readouts in the $L^\infty$ setup for both deterministic and almost surely uniformly bounded stochastic inputs (see, for instance, [25, Corollary 11, Theorem 19]). This phenomenon has also been observed in the context of the approximation of deterministic filters using Volterra series operators (see [13, Theorems 3 and 4]).
Remark III.3. A simple situation in which condition (11) is satisfied is when for any $t \in \mathbb{Z}_-$ the random variable $\mathbf{Z}_t$ is bounded, i.e. for any $t \in \mathbb{Z}_-$ there exists $C_t \geq 0$ such that $\|\mathbf{Z}_t\| \leq C_t$, $\mathbb{P}$-a.s. However, as the next remark shows, there are also practically relevant examples of input streams with unbounded support for which (11) is satisfied.
Remark III.4. A sufficient condition for (11) to hold is that the random variables $\{\mathbf{Z}_t : t \in \mathbb{Z}_-\}$ are independent and that for each $t$ there exists a constant $\alpha > 0$ such that $\mathbb{E}[\exp(\alpha \sum_{i=1}^{n} |Z^{(i)}_t|)] < \infty$. This last condition is satisfied, for instance, if $\mathbf{Z}_t$ is normally distributed. For input streams coming from more heavy-tailed distributions like Student's t-distribution, the condition is not satisfied and so one should use the reservoir systems considered below (see Corollary III.8, Theorem III.9, and Theorem III.10) instead if universality is needed.
Remark III.5. Assumption (11) can be replaced by alternative assumptions but it cannot be removed. Even if $n = 1$ and $\{Z_t : t \in \mathbb{Z}_-\}$ are independent and identically distributed with distribution $\nu$, a condition stronger than the existence of moments of all orders for $\nu$ is required. As a counterexample, one may take for $\nu$ a lognormal distribution. Then $\nu$ has moments of all orders, but (11) is not satisfied. Let us now argue that
the approximation result proved under assumption (11) fails in this case. The following argument relies on results for the classical moment problem (see, for example, the collection of references in [34]).
Indeed, by [35] $\nu$ is not determinate (there exist other probability measures with identical moments) and thus (see e.g. [36, Theorem 4.3]) $\mathrm{Pol}_1$ is not dense in $L^p(\mathbb{R}, \nu)$ for $p \geq 2$. In particular, there exist $g \in L^p(\mathbb{R}, \nu)$ and $\varepsilon > 0$ such that $\|g - \tilde{h}\|_p > \varepsilon$ for all $\tilde{h} \in \mathrm{Pol}_1$. Suppose that we are in the case $n = 1$, let $\{Z_t : t \in \mathbb{Z}_-\}$ be independent and identically distributed with distribution $\nu$, and let $H(\mathbf{z}) := g(z_0)$ for $\mathbf{z} \in \mathbb{R}^{\mathbb{Z}_-}$. Then, for any choice of $N$, $A$, $c$, and $h$ one has $\mathbb{E}[H^{A,c}_h(\mathbf{Z}) \mid \mathcal{F}_0] = \tilde{h}(Z_0)$, where $\tilde{h}(x) := \mathbb{E}[h(A\mathbf{X}_{-1} + cx)]$, $x \in \mathbb{R}$, is a polynomial. Thus one may use [33, Theorem 5.1.4] and the fact that by construction $H(\mathbf{Z})$ is $\mathcal{F}_0$-measurable to obtain
$$\|H(\mathbf{Z}) - H^{A,c}_h(\mathbf{Z})\|_p \geq \|\mathbb{E}[H(\mathbf{Z}) \mid \mathcal{F}_0] - \mathbb{E}[H^{A,c}_h(\mathbf{Z}) \mid \mathcal{F}_0]\|_p = \|g - \tilde{h}\|_p > \varepsilon.$$
Remark III.6. In previous reservoir computing universality results for both deterministic and stochastic inputs quoted in the introduction there was an important continuity hypothesis called the fading memory property that does not play a role here and that has been replaced by the integrability requirement $H \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$. In particular, the universality results that we just proved and those that come in the next section (see Theorem III.9) yield approximations for filters which do not necessarily have the fading memory property. Whether or not the approximation results apply depends on the integrability condition with respect to the input environment measure $\mu_{\mathbf{Z}}$. Consider, for example, the functional associated to the peak-hold operator [13]. In the discrete-time setting, the associated functional is
$$H(\mathbf{z}) = \sup_{t \leq 0} \{z_t\}, \quad \text{with } \mathbf{z} \in \mathbb{R}^{\mathbb{Z}_-}.$$
We now show that the two possibilities $H \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$ and $H \notin L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$ are feasible, depending on the choice of $\mu_{\mathbf{Z}}$:
• Let $\mathbf{Z} = (Z_t)_{t \in \mathbb{Z}_-}$ be one-dimensional independent and identically distributed (i.i.d.) random variables with unbounded support and denote by $\mu_{\mathbf{Z}}$ the law of $\mathbf{Z}$ on $\mathbb{R}^{\mathbb{Z}_-}$. Denoting by $F$ the distribution function of $Z_{-1}$ and using the i.i.d. assumption one calculates, for any $a \in \mathbb{R}$,
$$\mathbb{P}(H(\mathbf{Z}) > a) = 1 - \mathbb{P}\big( \cap_{t \leq 0} \{Z_t \leq a\} \big) = 1 - \lim_{n \to \infty} F(a)^n = 1.$$
Hence, we can conclude that $H(\mathbf{Z}) = \infty$, $\mu_{\mathbf{Z}}$-almost everywhere and therefore $H \notin L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$.
• Consider now the same setup, but assume this time that the random variables have bounded support, that is, for some $a_{\max} \in \mathbb{R}$ one has that $\mathbb{P}(Z_t \leq a_{\max}) = 1$ and $\mathbb{P}(Z_t > a_{\max} - \varepsilon) > 0$ for every $\varepsilon > 0$. Then, the same argument shows that $H(\mathbf{Z}) = a_{\max}$, $\mu_{\mathbf{Z}}$-almost everywhere and therefore $H \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$.
Remark III.7. From the proof of Proposition III.1 one sees that one could replace in its statement $\mathrm{Pol}_N$ by any other family $\{\mathcal{H}_N\}_{N \in \mathbb{N}}$ that satisfies the density statement (14). In particular, the following corollary shows that this result can be obtained with readouts made out of neural networks.
Denote by $\mathcal{H}_N$ the set of feedforward one-hidden-layer neural networks with inputs in $\mathbb{R}^N$ that are constructed with a fixed activation function $\sigma$. More specifically, $\mathcal{H}_N$ is made of functions $h: \mathbb{R}^N \to \mathbb{R}$ of the type
$$h(\mathbf{x}) = \sum_{j=1}^{k} \beta_j\, \sigma(\boldsymbol{\alpha}_j \cdot \mathbf{x} - \theta_j), \quad (18)$$
for some $k \in \mathbb{N}$, $\beta_j, \theta_j \in \mathbb{R}$, and $\boldsymbol{\alpha}_j \in \mathbb{R}^N$, for $j = 1, \ldots, k$.
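A hedged sketch (NumPy; the parameter values, dimensions, and the choice of $\tanh$ are arbitrary illustrations) of a readout of the form (18), i.e. a one-hidden-layer feedforward network with scalar output:

```python
import numpy as np

def nn_readout(x, alphas, betas, thetas, sigma=np.tanh):
    """Evaluate h(x) = sum_j beta_j * sigma(alpha_j . x - theta_j) as in (18).

    x      : input vector of dimension N
    alphas : array of shape (k, N), one row per hidden unit
    betas  : array of shape (k,)
    thetas : array of shape (k,)
    """
    return float(betas @ sigma(alphas @ x - thetas))

# Example with k = 3 hidden units and input dimension N = 4 (illustrative values).
rng = np.random.default_rng(5)
alphas = rng.normal(size=(3, 4))
betas = rng.normal(size=3)
thetas = rng.normal(size=3)
x = rng.normal(size=4)
print(nn_readout(x, alphas, betas, thetas))
```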
Corollary III.8. In the setup of Proposition III.1, consider the family of neural networks $h \in \mathcal{H}_N$ constructed with a fixed activation function $\sigma$ that is bounded and non-constant. Then, for any $\varepsilon > 0$ there exist $N \in \mathbb{N}$, $A \in \mathbb{M}_N$, $c \in \mathbb{M}_{N,n}$, and a neural network $h \in \mathcal{H}_N$ such that the corresponding reservoir system (9) has the echo state property and has a unique causal and time-invariant filter associated. Moreover, the corresponding functional satisfies $H^{A,c}_h(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$ and
$$\|H(\mathbf{Z}) - H^{A,c}_h(\mathbf{Z})\|_p < \varepsilon. \quad (19)$$
Proof. By [6, Theorem 1] the set $\mathcal{H}_N$ is dense in $L^p(\mathbb{R}^N, \mu)$ for any finite measure $\mu$ on $\mathbb{R}^N$. Thus, statement (14) holds with $\mathcal{H}_N$ replacing $\mathrm{Pol}_{n(K+1)}$. Mimicking line by line the proof of Step 2 in Proposition III.1 then proves the corollary.
B. Trigonometric state-affine systems with linear readouts
Fix $M, N \in \mathbb{N}$ and consider $R: \mathbb{R}^n \to \mathbb{M}_{N,M}$ defined by
$$R(\mathbf{z}) := \sum_{k=1}^{r} A_k \cos(\mathbf{u}_k \cdot \mathbf{z}) + B_k \sin(\mathbf{v}_k \cdot \mathbf{z}), \quad \mathbf{z} \in \mathbb{R}^n, \quad (20)$$
for some $r \in \mathbb{N}$, $A_k, B_k \in \mathbb{M}_{N,M}$, $\mathbf{u}_k, \mathbf{v}_k \in \mathbb{R}^n$, for $k = 1, \ldots, r$. The symbol $\mathrm{Trig}_{N,M}$ denotes the set of all functions of the type (20). We call the elements of $\mathrm{Trig}_{N,M}$ trigonometric polynomials.
We now introduce reservoir systems with linear readouts and reservoir maps constructed using trigonometric polynomials: let $N \in \mathbb{N}$, $\mathbf{w} \in \mathbb{R}^N$, $P \in \mathrm{Trig}_{N,N}$, $Q \in \mathrm{Trig}_{N,1}$ and define, for any $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}_-}$, the system:
$$\begin{cases} \mathbf{x}_t = P(\mathbf{z}_t)\mathbf{x}_{t-1} + Q(\mathbf{z}_t), & t \in \mathbb{Z}_-, \\ y_t = \mathbf{w}^\top \mathbf{x}_t, & t \in \mathbb{Z}_-. \end{cases} \quad (21)$$
We call the systems of this type trigonometric state-affine systems. When such a system has the echo state property and a unique causal and time-invariant solution for any input, we denote by $U^{P,Q}_{\mathbf{w}}$ the corresponding filter and by $H^{P,Q}_{\mathbf{w}}(\mathbf{z}) := y_0$ the associated functional. As in the previous section, we fix $p \in [1, \infty)$, $\mathbf{Z}$ an input process, and a functional $H$ such that $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$ and we are interested in approximating $H(\mathbf{Z})$ by systems of the form $H^{P,Q}_{\mathbf{w}}(\mathbf{Z})$. Again, we will write $H^{P,Q}_{\mathbf{w}}(\mathbf{Z}) = Y_0$, where $Y_0$ is uniquely determined by the reservoir system with stochastic inputs
$$\begin{cases} \mathbf{X}_t = P(\mathbf{Z}_t)\mathbf{X}_{t-1} + Q(\mathbf{Z}_t), & t \in \mathbb{Z}_-, \\ Y_t = \mathbf{w}^\top \mathbf{X}_t, & t \in \mathbb{Z}_-. \end{cases} \quad (22)$$
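A hedged numerical sketch of a trigonometric state-affine system (21)-(22), assuming NumPy; all parameter values below (dimensions, the small scale of the coefficients of $P$, the heavy-tailed input law) are arbitrary illustrations and not prescriptions from the paper. The matrix-valued trigonometric polynomials $P$ and $Q$ are evaluated at each input, the state-affine recursion is iterated, and a linear readout is applied.

```python
import numpy as np

rng = np.random.default_rng(6)
n, N, r, T = 2, 5, 3, 500

# Coefficients of the trigonometric polynomials P in Trig_{N,N} and Q in Trig_{N,1}.
A_P = 0.1 * rng.normal(size=(r, N, N)); B_P = 0.1 * rng.normal(size=(r, N, N))
u_P = rng.normal(size=(r, n));          v_P = rng.normal(size=(r, n))
A_Q = rng.normal(size=(r, N, 1));       B_Q = rng.normal(size=(r, N, 1))
u_Q = rng.normal(size=(r, n));          v_Q = rng.normal(size=(r, n))

def trig_poly(z, A, B, u, v):
    """R(z) = sum_k A_k cos(u_k . z) + B_k sin(v_k . z), as in (20)."""
    return sum(A[k] * np.cos(u[k] @ z) + B[k] * np.sin(v[k] @ z) for k in range(len(A)))

w = rng.normal(size=N)
Z = rng.standard_t(df=3, size=(T, n))   # heavy-tailed inputs are admissible here

x = np.zeros(N)
y = np.zeros(T)
for t in range(T):
    x = trig_poly(Z[t], A_P, B_P, u_P, v_P) @ x + trig_poly(Z[t], A_Q, B_Q, u_Q, v_Q).ravel()
    y[t] = w @ x
print("last output:", y[-1])
```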
Define $\mathcal{A}$ as the set of four-tuples $(N, \mathbf{w}, P, Q) \in \mathbb{N} \times \mathbb{R}^N \times \mathrm{Trig}_{N,N} \times \mathrm{Trig}_{N,1}$ whose associated systems (21) have the echo state property and whose unique solutions are causal and time-invariant. In particular, for such $(N, \mathbf{w}, P, Q)$ a reservoir functional $H^{P,Q}_{\mathbf{w}}$ associated to (21) exists.
Theorem III.9. Let $p \in [1, \infty)$ and let $\mathbf{Z}$ be a fixed $\mathbb{R}^n$-valued input process. Denote by $\mathcal{L}_{\mathbf{Z}}$ the set of reservoir functionals of the type (21) which are $p$-integrable, that is,
$$\mathcal{L}_{\mathbf{Z}} := \{ H^{P,Q}_{\mathbf{w}}(\mathbf{Z}) : (N, \mathbf{w}, P, Q) \in \mathcal{A} \} \cap L^p(\Omega, \mathcal{F}, \mathbb{P}).$$
Then $\mathcal{L}_{\mathbf{Z}}$ is dense in $L^p(\Omega, \mathcal{F}_{-\infty}, \mathbb{P})$.
In particular, for any functional $H$ such that $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$ and any $\varepsilon > 0$, there exist $N \in \mathbb{N}$, $\mathbf{w} \in \mathbb{R}^N$, $P \in \mathrm{Trig}_{N,N}$, and $Q \in \mathrm{Trig}_{N,1}$ such that the system (21) has the echo state property and causal and time-invariant solutions. Moreover, $H^{P,Q}_{\mathbf{w}}(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$ and
$$\|H(\mathbf{Z}) - H^{P,Q}_{\mathbf{w}}(\mathbf{Z})\|_p < \varepsilon. \quad (23)$$
If the input process $\mathbf{Z}$ is stationary then
$$\|U_H(\mathbf{Z}) - U^{P,Q}_{\mathbf{w}}(\mathbf{Z})\|_p < \varepsilon. \quad (24)$$
Proof. We first argue that $\mathcal{L}_{\mathbf{Z}}$ is a linear subspace of $L^p(\Omega, \mathcal{F}_{-\infty}, \mathbb{P})$. To do this we need to introduce some notation. Given $A \in \mathbb{M}_{N_1,M_1}$, $B \in \mathbb{M}_{N_2,M_2}$, we denote by $A \oplus B \in \mathbb{M}_{N_1+N_2,M_1+M_2}$ the direct sum. Given $R$ as in (20) we define $R \oplus A \in \mathrm{Trig}_{N+N_1,M+M_1}$ by
$$R \oplus A(\mathbf{z}) := \sum_{k=1}^{r} A_k \oplus A \cos(\mathbf{u}_k \cdot \mathbf{z}) + B_k \oplus A \sin(\mathbf{v}_k \cdot \mathbf{z}),$$
and (with the analogous definition for $B \oplus R$) for $R_i \in \mathrm{Trig}_{N_i,M_i}$, $i = 1, 2$, we set
$$R_1 \oplus R_2 = R_1 \oplus 0_{N_2,M_2} + 0_{N_1,M_1} \oplus R_2.$$
One easily verifies that for $\lambda \in \mathbb{R}$ and $(N_i, \mathbf{w}_i, P_i, Q_i) \in \mathcal{A}$, $i = 1, 2$, one has that
$$(N_1 + N_2, \mathbf{w}_1 \oplus \lambda\mathbf{w}_2, P_1 \oplus P_2, Q_1 \oplus Q_2) \in \mathcal{A},$$
$$H^{P_1,Q_1}_{\mathbf{w}_1}(\mathbf{Z}) + \lambda H^{P_2,Q_2}_{\mathbf{w}_2}(\mathbf{Z}) = H^{P_1 \oplus P_2, Q_1 \oplus Q_2}_{\mathbf{w}_1 \oplus \lambda\mathbf{w}_2}(\mathbf{Z}).$$
This shows that $\mathcal{L}_{\mathbf{Z}}$ is indeed a linear subspace of $L^p(\Omega, \mathcal{F}_{-\infty}, \mathbb{P})$.
Secondly, in order to show that $\mathcal{L}_{\mathbf{Z}}$ is dense in $L^p(\Omega, \mathcal{F}_{-\infty}, \mathbb{P})$, it suffices to prove that if $F \in L^q(\Omega, \mathcal{F}_{-\infty}, \mathbb{P})$ satisfies $\mathbb{E}[FH] = 0$ for all $H \in \mathcal{L}_{\mathbf{Z}}$, then $F = 0$, $\mathbb{P}$-almost surely. Here $q \in (1, \infty]$ is the Hölder conjugate exponent of $p$. This can be shown by contraposition. Suppose that $\mathcal{L}_{\mathbf{Z}}$ is not dense in $L^p(\Omega, \mathcal{F}_{-\infty}, \mathbb{P})$. Since $\mathcal{L}_{\mathbf{Z}}$ is a linear subspace, by the Hahn-Banach theorem there exists a bounded linear functional $\Lambda$ on $L^p(\Omega, \mathcal{F}_{-\infty}, \mathbb{P})$ such that $\Lambda(H) = 0$ for all $H \in \mathcal{L}_{\mathbf{Z}}$, but $\Lambda \neq 0$, see e.g. [37, Theorem 5.19]. Then by [37, Theorem 6.16] there exists $F \in L^q(\Omega, \mathcal{F}_{-\infty}, \mathbb{P})$ such that $\Lambda(H) = \mathbb{E}[FH]$ for all $H \in L^p(\Omega, \mathcal{F}_{-\infty}, \mathbb{P})$ and $F \neq 0$, since $\Lambda \neq 0$. In particular, there exists $F \in L^q(\Omega, \mathcal{F}_{-\infty}, \mathbb{P}) \setminus \{0\}$ such that $\mathbb{E}[FH] = 0$ for all $H \in \mathcal{L}_{\mathbf{Z}}$.
Thirdly, suppose that $F \in L^q(\Omega, \mathcal{F}_{-\infty}, \mathbb{P})$ satisfies
$$\mathbb{E}[FH] = 0 \quad \text{for all } H \in \mathcal{L}_{\mathbf{Z}}. \quad (25)$$
If we show that $F = 0$, $\mathbb{P}$-almost surely, then the statement in the theorem follows by the argument in the second step.
In order to prove that $F = 0$, $\mathbb{P}$-almost surely, we first show that (25) implies the following statement: for any $K \in \mathbb{N}$, any subset $I \subset \mathcal{I}_K := \{0, \ldots, K\}$, and any $\mathbf{u}_0, \ldots, \mathbf{u}_K \in \mathbb{R}^n$ it holds that
$$\mathbb{E}\Big[ F \prod_{j \in I} \sin(\mathbf{u}_j \cdot \mathbf{Z}_{-j}) \prod_{k \in \mathcal{I}_K \setminus I} \cos(\mathbf{u}_k \cdot \mathbf{Z}_{-k}) \Big] = 0. \quad (26)$$
We prove this claim by induction on $K \in \mathbb{N}$. For $K = 0$, one sets $Q_1(\mathbf{z}) := \cos(\mathbf{u}_0 \cdot \mathbf{z})$ and $Q_2(\mathbf{z}) := \sin(\mathbf{u}_0 \cdot \mathbf{z})$ and notices that $(1, 1, 0, Q_i) \in \mathcal{A}$. Moreover, since the sine and cosine functions are bounded, it is easy to see that $Q_i(\mathbf{Z}_0) = H^{0,Q_i}_1(\mathbf{Z}) \in \mathcal{L}_{\mathbf{Z}}$, for $i \in \{1, 2\}$. Thus (25) implies (26) and so the statement holds for $K = 0$. For the induction step, let $K \in \mathbb{N} \setminus \{0\}$ and assume the implication holds for $K - 1$. We now fix $I$ and $\mathbf{u}_0, \ldots, \mathbf{u}_K \in \mathbb{R}^n$ as above and prove (26). To simplify the notation we define for $k \in \{0, \ldots, K\}$ and $\mathbf{z} \in \mathbb{R}^n$ the function $g_k$ by
$$g_k(\mathbf{z}) := \begin{cases} \sin(\mathbf{u}_k \cdot \mathbf{z}), & \text{if } k \in I, \\ \cos(\mathbf{u}_k \cdot \mathbf{z}), & \text{if } k \in \mathcal{I}_K \setminus I. \end{cases}$$
To prove (26), we set $N := K + 1$ and, for $j \in \{1, \ldots, K\}$, define $A_j \in \mathbb{M}_N$ with all entries equal to $0$ except $(A_j)_{j+1,j} = 1$, that is, $(A_j)_{k,l} = \delta_{k,j+1}\,\delta_{l,j}$, $k, l \in \{1, \ldots, N\}$. Define now for $\mathbf{z} \in \mathbb{R}^n$
$$P(\mathbf{z}) := \sum_{j=0}^{K-1} A_{K-j}\, g_j(\mathbf{z}), \qquad Q(\mathbf{z}) := \mathbf{e}_1\, g_K(\mathbf{z}), \qquad \mathbf{w} := \mathbf{e}_{K+1}, \quad (27)$$
where $\mathbf{e}_j$ is the $j$-th unit vector in $\mathbb{R}^N$, that is, the only non-zero entry of $\mathbf{e}_j$ is a $1$ in the $j$-th coordinate. By Lemma A.2 in the appendix, one has $A_{j_L} \cdots A_{j_0} = 0$ for any $j_0, \ldots, j_L \in \{1, \ldots, K\}$ and $L \geq K$, since $j_L = j_0 + L$ cannot be satisfied. In other words, any product of more than $K$ factors of the matrices $A_j$ is equal to $0$ and thus for any $L \in \mathbb{N}$ with $L \geq K$ and any $\mathbf{z}_0, \ldots, \mathbf{z}_L \in \mathbb{R}^n$ one has $P(\mathbf{z}_0) \cdots P(\mathbf{z}_L) = 0$. Using this fact and iterating (21), one obtains that the trigonometric state-affine system defined by the elements in (27) has a unique solution given by
$$\mathbf{x}_t = Q(\mathbf{z}_t) + \sum_{j=1}^{K} P(\mathbf{z}_t) \cdots P(\mathbf{z}_{t-j+1})\, Q(\mathbf{z}_{t-j}). \quad (28)$$
In particular $(N, \mathbf{w}, P, Q) \in \mathcal{A}$ and
$$H^{P,Q}_{\mathbf{w}}(\mathbf{Z}) = \mathbf{w}^\top \mathbf{X}_0 = \mathbf{w}^\top \Big( Q(\mathbf{Z}_0) + \sum_{j=1}^{K} P(\mathbf{Z}_0) \cdots P(\mathbf{Z}_{-j+1})\, Q(\mathbf{Z}_{-j}) \Big). \quad (29)$$
The finiteness of the sum in (29) and the boundedness of the trigonometric polynomials imply that $H^{P,Q}_{\mathbf{w}}(\mathbf{Z}) \in \mathcal{L}_{\mathbf{Z}}$.
We conclude the proof of the induction step with the following chain of equalities that uses (25) in the first one,
the representation (29) in the second one, and the choice of the vector $\mathbf{w}$ and the induction hypothesis in the last step:
$$0 = \mathbb{E}[F H^{P,Q}_{\mathbf{w}}(\mathbf{Z})] = \mathbb{E}[F \mathbf{w}^\top Q(\mathbf{Z}_0)] + \mathbb{E}\Big[ F \mathbf{w}^\top \sum_{j=1}^{K} P(\mathbf{Z}_0) \cdots P(\mathbf{Z}_{-j+1})\, Q(\mathbf{Z}_{-j}) \Big] = \mathbb{E}[F \mathbf{w}^\top P(\mathbf{Z}_0) \cdots P(\mathbf{Z}_{-K+1})\, Q(\mathbf{Z}_{-K})]. \quad (30)$$
However, again by Lemma A.2 in the appendix, the only non-zero product of matrices $A_{j_{K-1}} \cdots A_{j_0}$ for $j_0, \ldots, j_{K-1} \in \{1, \ldots, K\}$ takes place when $j_k = k + 1$ for $k \in \{0, \ldots, K-1\}$. Therefore:
$$P(\mathbf{Z}_0) \cdots P(\mathbf{Z}_{-K+1}) = A_K\, g_0(\mathbf{Z}_0)\, A_{K-1}\, g_1(\mathbf{Z}_{-1}) \cdots A_1\, g_{K-1}(\mathbf{Z}_{-K+1}).$$
Combining this with (30) and using the identity (49) in Lemma A.2 in the appendix one obtains
$$0 = \mathbb{E}\Big[ F\, \mathbf{e}_{K+1}^\top A_K \cdots A_1 \mathbf{e}_1 \prod_{k=0}^{K} g_k(\mathbf{Z}_{-k}) \Big] = \mathbb{E}\Big[ F \prod_{k=0}^{K} g_k(\mathbf{Z}_{-k}) \Big],$$
which is the same as (26).
Fourthly, by standard trigonometric identities, the identity (26) established in the third step implies that for any $K \in \mathbb{N}$,
$$\mathbb{E}\Big[ F \exp\Big( i \sum_{j=0}^{K} \mathbf{u}_j \cdot \mathbf{Z}_{-j} \Big) \Big] = 0 \quad \text{for all } \mathbf{u}_0, \ldots, \mathbf{u}_K \in \mathbb{R}^n. \quad (31)$$
We claim that (31) implies $F = 0$, $\mathbb{P}$-almost surely, and hence the statement in the theorem follows. This fact is a consequence of the uniqueness theorem for characteristic functions (which is ultimately a consequence of the Stone-Weierstrass approximation theorem). See for instance [30, Theorem 4.3] and the text below that result. To prove $F = 0$, $\mathbb{P}$-almost surely, we denote by $F^+$ and $F^-$ the positive and negative parts of $F$. Then by (31) one has $\mathbb{E}[F] = 0$, necessarily. Thus, if it does not hold that $F = 0$, $\mathbb{P}$-almost surely, then $c := \mathbb{E}[F^+] = \mathbb{E}[F^-] > 0$ and one may define probability measures $\mathbb{Q}^+$ and $\mathbb{Q}^-$ on $(\Omega, \mathcal{F})$ by setting $\mathbb{Q}^+(A) := c^{-1}\mathbb{E}[F^+ \mathbf{1}_A]$ and $\mathbb{Q}^-(A) := c^{-1}\mathbb{E}[F^- \mathbf{1}_A]$ for $A \in \mathcal{F}$. Denote by $\mu^+_K$ and $\mu^-_K$ the laws on $\mathbb{R}^{n(K+1)}$ of the random variable
$$\mathbf{Z}^K := (\mathbf{Z}_0^\top, \mathbf{Z}_{-1}^\top, \ldots, \mathbf{Z}_{-K}^\top)^\top$$
under $\mathbb{Q}^+$ and $\mathbb{Q}^-$. Then, the statement (31) implies that for all $\mathbf{u} \in \mathbb{R}^{n(K+1)}$,
$$\int_{\mathbb{R}^{n(K+1)}} \exp(i\mathbf{u} \cdot \mathbf{z})\, \mu^+_K(d\mathbf{z}) = \int_{\mathbb{R}^{n(K+1)}} \exp(i\mathbf{u} \cdot \mathbf{z})\, \mu^-_K(d\mathbf{z}).$$
By the uniqueness theorem for characteristic functions (see e.g. [30, Theorem 4.3] and the text below) this implies that $\mu^+_K = \mu^-_K$. Translating this statement back to random variables, this means that for any bounded and measurable function $g: \mathbb{R}^{n(K+1)} \to \mathbb{R}$ one has
$$0 = c\,\mathbb{E}_{\mathbb{Q}^+}[g(\mathbf{Z}^K)] - c\,\mathbb{E}_{\mathbb{Q}^-}[g(\mathbf{Z}^K)] = \mathbb{E}[F g(\mathbf{Z}^K)],$$
which, by definition, means that $\mathbb{E}[F \mid \mathcal{F}_{-K}] = 0$, $\mathbb{P}$-almost surely. Since $K \in \mathbb{N}$ was arbitrary and $F \in L^1(\Omega, \mathcal{F}_{-\infty}, \mathbb{P})$, one may combine this with $\lim_{t \to -\infty} \mathbb{E}[F \mid \mathcal{F}_t] = F$, $\mathbb{P}$-almost surely (see Lemma A.1), to conclude $F = 0$, as desired.
The statement in (24) in the presence of the stationarity hypothesis for $\mathbf{Z}$ is a straightforward consequence of (7) and the equality (8).
We emphasize that the use in the proof of the theorem of nilpotent matrices of the type introduced in Lemma A.2 ensures that the echo state property is automatically satisfied (see (28)).
C. Echo state networks
We now turn to showing the universality in the $L^p$ sense of the most widely used reservoir systems with linear readouts, namely, echo state networks. An echo state network is an RC system determined by
$$\begin{cases} \mathbf{x}_t = \sigma(A\mathbf{x}_{t-1} + C\mathbf{z}_t + \boldsymbol{\zeta}), \\ y_t = \mathbf{w}^\top \mathbf{x}_t, \end{cases} \quad (32)$$
for $A \in \mathbb{M}_N$, $C \in \mathbb{M}_{N,n}$, $\boldsymbol{\zeta} \in \mathbb{R}^N$, and $\mathbf{w} \in \mathbb{R}^N$. As is customary in the neural networks literature, the map $\sigma: \mathbb{R}^N \to \mathbb{R}^N$ is obtained via the componentwise application of a given activation function $\sigma: \mathbb{R} \to \mathbb{R}$ that is denoted with the same symbol.
If this system has the echo state property and the resulting filter is causal and time-invariant, we write $H^{A,C,\zeta}_{\mathbf{w}}(\mathbf{z}) := y_0$ for the associated functional.
Theorem III.10. Fix $p \in [1, \infty)$, let $\mathbf{Z}$ be a fixed $\mathbb{R}^n$-valued input process, and let $H$ be a functional such that $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$. Suppose that the activation function $\sigma: \mathbb{R} \to \mathbb{R}$ is non-constant, continuous, and has a bounded image. Then for any $\varepsilon > 0$, there exist $N \in \mathbb{N}$, $C \in \mathbb{M}_{N,n}$, $\boldsymbol{\zeta} \in \mathbb{R}^N$, $A \in \mathbb{M}_N$, $\mathbf{w} \in \mathbb{R}^N$ such that (32) has the echo state property, the corresponding filter is causal and time-invariant, the associated functional satisfies $H^{A,C,\zeta}_{\mathbf{w}}(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$ and
$$\|H(\mathbf{Z}) - H^{A,C,\zeta}_{\mathbf{w}}(\mathbf{Z})\|_p < \varepsilon. \quad (33)$$
Proof. First, by Corollary III.8 and (17) there exist $K, N \in \mathbb{N}$, $\mathbf{w} \in \mathbb{R}^N$, $A \in \mathbb{M}_{N,n(K+1)}$, and $\boldsymbol{\zeta} \in \mathbb{R}^N$ such that the neural network
$$h(\mathbf{z}) = \mathbf{w}^\top \sigma(A\mathbf{z} + \boldsymbol{\zeta})$$
satisfies
$$\|H(\mathbf{Z}) - h(\mathbf{Z}_0^\top, \ldots, \mathbf{Z}_{-K}^\top)\|_p < \frac{\varepsilon}{2}. \quad (34)$$
Notice that we may rewrite $A$ as
$$A = [A^{(0)}\ A^{(1)}\ \cdots\ A^{(K)}]$$
with $A^{(j)} \in \mathbb{M}_{N,n}$ and
$$\bar{H}(\mathbf{Z}) := h(\mathbf{Z}_0^\top, \ldots, \mathbf{Z}_{-K}^\top) = \mathbf{w}^\top \sigma\Big( \sum_{j=0}^{K} A^{(j)} \mathbf{Z}_{-j} + \boldsymbol{\zeta} \Big). \quad (35)$$
Second, by the neural network approximation theorem for continuous functions [6, Theorem 2], for any $m \in \mathbb{N}$ there exists a neural network that uniformly approximates the identity mapping on the hypercube $B_m := \{\mathbf{x} \in \mathbb{R}^n : |x_i| \leq m \text{ for } i = 1, \ldots, n\}$. More specifically, [6, Theorem 2] is formulated for $\mathbb{R}$-valued mappings and we hence apply it componentwise: for any $m \in \mathbb{N}$ and $i = 1, \ldots, n$ there exist $N^{(m)}_i \in \mathbb{N}$, $\mathbf{w}^{(m)}_i \in \mathbb{R}^{N^{(m)}_i}$, $A^{(m)}_i \in \mathbb{M}_{N^{(m)}_i,n}$, and $\boldsymbol{\zeta}^{(m)}_i \in \mathbb{R}^{N^{(m)}_i}$, such that for all $i = 1, \ldots, n$ the neural network
$$h^{(m)}_i(\mathbf{x}) = \big( \mathbf{w}^{(m)}_i \big)^\top \sigma\big( A^{(m)}_i \mathbf{x} + \boldsymbol{\zeta}^{(m)}_i \big)$$
satisfies
$$\sup_{\mathbf{x} \in B_m} \big\{ |h^{(m)}_i(\mathbf{x}) - x_i| \big\} < \frac{1}{m}. \quad (36)$$
Write $h^{(m)}(\mathbf{x}) = \big( h^{(m)}_1(\mathbf{x}), \ldots, h^{(m)}_n(\mathbf{x}) \big)^\top$ and, for $j = 1, \ldots, K$, denote by $[h^{(m)}]^j = h^{(m)} \circ \cdots \circ h^{(m)}$ the $j$-th composition of $h^{(m)}$. We now claim that for all $j = 1, \ldots, K$ and $\mathbf{x} \in \mathbb{R}^n$ it holds that
$$\lim_{m \to \infty} [h^{(m)}]^j(\mathbf{x}) = \mathbf{x}. \quad (37)$$
Indeed, let us fix $\mathbf{x} \in \mathbb{R}^n$ and argue by induction on $j$. To prove (37) for $j = 1$, let $\varepsilon > 0$ be given and choose $m_0 \in \mathbb{N}$ satisfying $m_0 > \max\{|x_1|, \ldots, |x_n|, 1/\varepsilon\}$. Then, for any $m \geq m_0$ one has $\mathbf{x} \in B_m$ by definition and (36) implies that for $i = 1, \ldots, n$,
$$|h^{(m)}_i(\mathbf{x}) - x_i| < \frac{1}{m} < \varepsilon.$$
Hence (37) indeed holds for $j = 1$. Now let $j \geq 2$ and assume that (37) has been proved for $j - 1$. Define $\mathbf{x}^{(m)} := [h^{(m)}]^{j-1}(\mathbf{x})$. Then, by the induction hypothesis, for any given $\varepsilon > 0$ one finds $m_0 \in \mathbb{N}$ such that for all $m \geq m_0$ and $i = 1, \ldots, n$ it holds that
$$|x^{(m)}_i - x_i| < \frac{\varepsilon}{2}. \quad (38)$$
Hence, choosing $m_0' \in \mathbb{N}$ with $m_0' > \max\big( m_0, |x_1| + \frac{\varepsilon}{2}, \ldots, |x_n| + \frac{\varepsilon}{2}, \frac{2}{\varepsilon} \big)$ one obtains from the triangle inequality and (38) that $\mathbf{x}^{(m)} \in B_{m_0'}$ for all $m \geq m_0'$. In particular, for any $m \geq m_0'$ one may use the triangle inequality in the first step, $\mathbf{x}^{(m)} \in B_{m_0'} \subset B_m$ and (38) in the second step, and (36) in the last step to estimate
$$|[h^{(m)}]^j_i(\mathbf{x}) - x_i| \leq |h^{(m)}_i(\mathbf{x}^{(m)}) - x^{(m)}_i| + |x^{(m)}_i - x_i| \leq \sup_{\mathbf{y} \in B_m} \{ |h^{(m)}_i(\mathbf{y}) - y_i| \} + \frac{\varepsilon}{2} < \frac{1}{m} + \frac{\varepsilon}{2} < \varepsilon.$$
This proves (37) for all $j = 1, \ldots, K$.
Thirdly, define
$$H_m(\mathbf{Z}) := \mathbf{w}^\top \sigma\Big( \sum_{j=0}^{K} A^{(j)} [h^{(m)}]^j(\mathbf{Z}_{-j}) + \boldsymbol{\zeta} \Big)$$
with the convention $[h^{(m)}]^0(\mathbf{x}) = \mathbf{x}$.
Since $\sigma$ is continuous, (37) implies that $\lim_{m \to \infty} H_m(\mathbf{Z}) = \bar{H}(\mathbf{Z})$, $\mathbb{P}$-almost surely, where $\bar{H}$ was defined in (35). Furthermore, by assumption there exists $C > 0$ such that $|\sigma(x)| \leq C$ for all $x \in \mathbb{R}$. Hence one has $|\bar{H}(\mathbf{Z}) - H_m(\mathbf{Z})|^p \leq \big( 2C \sum_{i=1}^{N} |w_i| \big)^p$ for all $m \in \mathbb{N}$. Thus one may apply the dominated convergence theorem to obtain
$$\lim_{m \to \infty} \|\bar{H}(\mathbf{Z}) - H_m(\mathbf{Z})\|_p = \lim_{m \to \infty} \mathbb{E}[|\bar{H}(\mathbf{Z}) - H_m(\mathbf{Z})|^p]^{1/p} = 0.$$
In particular, for $m \in \mathbb{N}$ large enough one has $\|\bar{H}(\mathbf{Z}) - H_m(\mathbf{Z})\|_p < \frac{\varepsilon}{2}$ and combining this with the triangle inequality and (34) one obtains
$$\|H(\mathbf{Z}) - H_m(\mathbf{Z})\|_p \leq \|H(\mathbf{Z}) - \bar{H}(\mathbf{Z})\|_p + \|\bar{H}(\mathbf{Z}) - H_m(\mathbf{Z})\|_p < \varepsilon. \quad (39)$$
To conclude the proof we now fix $m \in \mathbb{N}$ large enough (so that (39) holds) and show that $H_m(\mathbf{Z}) = H^{A,C,\zeta}_{\mathbf{w}}(\mathbf{Z})$ for suitable choices of $A$, $C$, $\boldsymbol{\zeta}$ and $\mathbf{w}$. To do so, first define $N_J := N^{(m)}_1 + \cdots + N^{(m)}_n$ and the block matrices
$$W_J := \begin{pmatrix} (\mathbf{w}^{(m)}_1)^\top & & 0 \\ & \ddots & \\ 0 & & (\mathbf{w}^{(m)}_n)^\top \end{pmatrix} \in \mathbb{M}_{n,N_J}, \qquad \boldsymbol{\zeta}_J := \begin{pmatrix} \boldsymbol{\zeta}^{(m)}_1 \\ \vdots \\ \boldsymbol{\zeta}^{(m)}_n \end{pmatrix} \in \mathbb{R}^{N_J}, \quad \text{and} \quad A_J := \begin{pmatrix} A^{(m)}_1 \\ \vdots \\ A^{(m)}_n \end{pmatrix} \in \mathbb{M}_{N_J,n}.$$
Furthermore, to emphasize that $m$ is fixed and $h^{(m)}$ approximates the identity, set $J(\mathbf{x}) := h^{(m)}(\mathbf{x})$ and note that
$$J(\mathbf{x}) = W_J\, \sigma(A_J \mathbf{x} + \boldsymbol{\zeta}_J). \quad (40)$$
Now consider an echo state network with $KN_J + N$ neurons and define the block matrix $A \in \mathbb{M}_{KN_J+N}$ by
$$A = \begin{pmatrix} 0_{N_J,N_J} & & & & \\ A_J W_J & 0_{N_J,N_J} & & 0 & \\ & \ddots & \ddots & & \\ 0 & & A_J W_J & 0_{N_J,N_J} & \\ A^{(1)} W_J & A^{(2)} W_J & \cdots & A^{(K)} W_J & 0_{N,N} \end{pmatrix}$$
and $\boldsymbol{\zeta} \in \mathbb{R}^{KN_J+N}$, $C \in \mathbb{M}_{KN_J+N,n}$, and $\mathbf{w} \in \mathbb{R}^{KN_J+N}$ by
$$\boldsymbol{\zeta} := \begin{pmatrix} \boldsymbol{\zeta}_J \\ \vdots \\ \boldsymbol{\zeta}_J \\ \boldsymbol{\zeta} \end{pmatrix}, \qquad C := \begin{pmatrix} A_J \\ 0 \\ \vdots \\ 0 \\ A^{(0)} \end{pmatrix}, \quad \text{and} \quad \mathbf{w} := \begin{pmatrix} 0_{KN_J,1} \\ \mathbf{w} \end{pmatrix}.$$
Furthermore, we partition the reservoir states $\mathbf{x}_t$ of the corresponding echo state system as
$$\mathbf{x}_t := \begin{pmatrix} \mathbf{x}^{(1)}_t \\ \vdots \\ \mathbf{x}^{(K+1)}_t \end{pmatrix},$$
with $\mathbf{x}^{(j)}_t \in \mathbb{R}^{N_J}$, for $j \leq K$, and $\mathbf{x}^{(K+1)}_t \in \mathbb{R}^N$. With this notation for $\mathbf{x}_t$ and these choices of matrices, the recursions associated to the echo state reservoir map in (32) read as
$$\mathbf{x}^{(1)}_t = \sigma(A_J \mathbf{z}_t + \boldsymbol{\zeta}_J), \quad (41)$$
$$\mathbf{x}^{(j)}_t = \sigma(A_J W_J \mathbf{x}^{(j-1)}_{t-1} + \boldsymbol{\zeta}_J), \quad \text{for } j = 2, \ldots, K, \quad (42)$$
$$\mathbf{x}^{(K+1)}_t = \sigma\Big( \sum_{j=1}^{K} A^{(j)} W_J \mathbf{x}^{(j)}_{t-1} + A^{(0)} \mathbf{z}_t + \boldsymbol{\zeta} \Big). \quad (43)$$
By iteratively inserting (42) into itself and using (41) one obtains (recall the definition of $J$ in (40)) that the unique solution to (42) is given by
$$\mathbf{x}^{(j)}_t = \sigma\big( A_J [J]^{j-1}(\mathbf{z}_{t-j+1}) + \boldsymbol{\zeta}_J \big). \quad (44)$$
More formally, one uses induction on $j$: For $j = 1$ the two expressions (44) and (41) coincide. For $j = 2, \ldots, K$ one inserts (44) for $j - 1$ (which holds by the induction hypothesis) into (42) to obtain
$$\mathbf{x}^{(j)}_t = \sigma\big( A_J W_J \sigma(A_J [J]^{j-2}(\mathbf{z}_{t-j+1}) + \boldsymbol{\zeta}_J) + \boldsymbol{\zeta}_J \big) = \sigma\big( A_J [J]^{j-1}(\mathbf{z}_{t-j+1}) + \boldsymbol{\zeta}_J \big),$$
which is indeed (44). Finally, combining (44) and (43) one obtains
$$y_t = \mathbf{w}^\top \mathbf{x}^{(K+1)}_t = \mathbf{w}^\top \sigma\Big( \sum_{j=1}^{K} A^{(j)} W_J \mathbf{x}^{(j)}_{t-1} + A^{(0)} \mathbf{z}_t + \boldsymbol{\zeta} \Big) = \mathbf{w}^\top \sigma\Big( \sum_{j=1}^{K} A^{(j)} [J]^j(\mathbf{z}_{t-j}) + A^{(0)} \mathbf{z}_t + \boldsymbol{\zeta} \Big).$$
The statement (44) shows, in particular, that the echo state network associated to $A$, $C$, $\boldsymbol{\zeta}$ and $\mathbf{w}$ satisfies the echo state property. Moreover, inserting $t = 0$ in the previous equality and comparing with the definition of $H_m(\mathbf{Z})$ one sees that indeed $H_m(\mathbf{Z}) = H^{A,C,\zeta}_{\mathbf{w}}(\mathbf{Z})$. The approximation statement (33) therefore follows from (39).
Remark III.11. In this paper we measure closeness between filters and functionals in an $L^p$ sense. As we already pointed out in Remark III.6, this choice allows us to approximate with the systems used in this paper measurable filters that, unlike in the $L^\infty$ case, do not necessarily satisfy the fading memory property. Therefore, an interesting aspect of the universality results in Proposition III.1, Corollary III.8, Theorem III.9, and Theorem III.10 is that it is possible to approximately simulate any measurable filter that does not necessarily satisfy the fading memory property using the reservoir systems introduced in those results, which do satisfy the fading memory property.
Remark III.12. The results presented in this article address the approximation capabilities of echo state networks and other reservoir computing systems. When these systems are used in practice not all of their parameters are trained. For example, the recurrent connections of ESNs do not usually undergo a training process, that is, the architecture parameters $A$, $C$, $\boldsymbol{\zeta}$ are randomly drawn from a distribution and only the readout $\mathbf{w}$ is trained by linear regression so as to optimally fit the given teaching signal. Subsequently, an optimization over a few hyperparameters (for instance, the spectral radius of $A$) is carried out. In addition, in many situations the same reservoir matrix $A$ can be used for different input time series and different learning tasks and only the input-to-reservoir parameters $C$, $\boldsymbol{\zeta}$ and the readout $\mathbf{w}$ need to be modified (see,
for instance, the approach taken in [38], [39] to define time
series kernels). This feature is key in the implementation of
the notion of multi-tasking in the RC context (see [10]). Thus,
the empirically observed robustness of ESNs with respect
to these parameter choices is not entirely explained by the
universality results presented here. While in the static setting
of feedforward neural networks such questions have already
been tackled (see, for instance, [40]) for echo state networks
a full explanation is not available yet and these questions are
the subject of ongoing research.
D. An alternative viewpoint
So far all the universality results have been formulated for functionals and filters with random inputs. Equivalently, we may formulate them as $L^p$-approximation results on the sequence space $(\mathbb{R}^n)^{\mathbb{Z}_-}$ endowed with any measure $\mu$ that makes the filter that we want to approximate $p$-integrable.
Theorem III.13. Let $H: (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}$ be a measurable functional. Then, for any probability measure $\mu$ on $(\mathbb{R}^n)^{\mathbb{Z}_-}$ with $H \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu)$ and any $\varepsilon > 0$ there exists a reservoir system that has the echo state property and such that the corresponding filter is causal and time-invariant, the associated functional $H_{RC}$ satisfies $H_{RC} \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu)$, and
$$\|H - H_{RC}\|_{L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu)} < \varepsilon. \quad (45)$$
The reservoir functional $H_{RC}$ may be chosen as coming from any of the following systems:
• Linear reservoir with polynomial readout, that is, (9) for some $N \in \mathbb{N}$, $A \in \mathbb{M}_N$, $c \in \mathbb{M}_{N,n}$, and a polynomial $h \in \mathrm{Pol}_N$, if the measure $\mu$ satisfies the following condition: for any $K \in \mathbb{N}$ there exists $\alpha > 0$ such that
$$\int_{(\mathbb{R}^n)^{\mathbb{Z}_-}} \exp\Big( \alpha \sum_{k=0}^{K} \sum_{i=1}^{n} |z^{(i)}_{-k}| \Big)\, \mu(d\mathbf{z}) < \infty.$$
• Linear reservoir with neural network readout, that is, (9) for some $N \in \mathbb{N}$, $A \in \mathbb{M}_N$, $c \in \mathbb{M}_{N,n}$, and a neural network $h \in \mathcal{H}_N$.
• Trigonometric state-affine system with linear readout, that is, (21) for some $N \in \mathbb{N}$, $\mathbf{w} \in \mathbb{R}^N$, $P \in \mathrm{Trig}_{N,N}$, and $Q \in \mathrm{Trig}_{N,1}$.
• Echo state network with linear readout, that is, (32) for some $N \in \mathbb{N}$, $C \in \mathbb{M}_{N,n}$, $\boldsymbol{\zeta} \in \mathbb{R}^N$, $A \in \mathbb{M}_N$, $\mathbf{w} \in \mathbb{R}^N$, where we assume that the activation function $\sigma: \mathbb{R} \to \mathbb{R}$ employed in (32) is bounded, continuous, and non-constant.
Proof. Set $\Omega = (\mathbb{R}^n)^{\mathbb{Z}_-}$, $\mathcal{F} = \bigotimes_{t \in \mathbb{Z}_-} \mathcal{B}(\mathbb{R}^n)$, $\mathbb{P} = \mu$ and define $\mathbf{Z}_t(\mathbf{z}) := \mathbf{z}_t$ for all $\mathbf{z} \in \Omega$, $t \in \mathbb{Z}_-$. Then $\mathcal{F} = \sigma(\mathbf{Z}_t : t \in \mathbb{Z}_-) = \mathcal{F}_{-\infty}$ and $\mathbf{Z}$ is the identity mapping on $(\mathbb{R}^n)^{\mathbb{Z}_-}$. One may now apply Proposition III.1, Corollary III.8, Theorem III.9 and Theorem III.10 with this choice of probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and input process $\mathbf{Z}$. The statement of Theorem III.13 then precisely coincides with the statement of Proposition III.1, Corollary III.8, Theorem III.9 and Theorem III.10, respectively.
E. Approximation of stationary strong time series models
Most parametric time series models commonly used in financial, macroeconometric, and forecasting applications are specified by relations of the type
$$\mathbf{X}_t = G(\mathbf{X}_{t-1}, \mathbf{Z}_t, \boldsymbol{\theta}), \quad (46)$$
where $\boldsymbol{\theta} \in \mathbb{R}^k$ are the parameters of the model and the vector $\mathbf{X}_t \in \mathbb{R}^N$ is built so that it contains in its components the time series of interest and that, at the same time, it allows for a Markovian representation of the model as in (46). The model is driven by the innovations process $\mathbf{Z} = (\mathbf{Z}_t)_{t \in \mathbb{Z}_-} \in (\mathbb{R}^n)^{\mathbb{Z}_-}$. When the innovations are made out of independent and identically distributed random variables we say that the model is strong [41]. It is customary in the time series literature to impose constraints on the parameter vector $\boldsymbol{\theta}$ so that the relation (46) has a unique second-order stationary solution or, in the language of this paper, so that the system (46) satisfies the echo state property and the associated filter $U_G: (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^N)^{\mathbb{Z}_-}$ satisfies
$$\mathbb{E}[U_G(\mathbf{Z})_t] =: \boldsymbol{\mu} \quad \text{and} \quad \mathbb{E}\big[ U_G(\mathbf{Z})_t\, U_G(\mathbf{Z})_{t+h}^\top \big] =: \Sigma_h, \quad t, h \in \mathbb{Z}_-, \quad (47)$$
with $\boldsymbol{\mu} \in \mathbb{R}^N$ and $\Sigma_h \in \mathbb{M}_N$ constants that do not depend on $t \in \mathbb{Z}_-$. The Wold decomposition theorem [42, Theorem 5.7.1] shows that any such filter can be uniquely written as the sum of a linear and a deterministic process.
It is obvious that for strong models the stationarity condition (7) holds and that, moreover, the condition (47) implies that
$$\|U_G(\mathbf{Z})\|_2 = \sup_{t \in \mathbb{Z}_-} \big\{ \mathbb{E}\big[ \|U_G(\mathbf{Z})_t\|^2 \big]^{1/2} \big\} = \mathrm{trace}(\Sigma_0)^{1/2} < \infty. \quad (48)$$
This integrability condition guarantees that the approximation results in Proposition III.1, Corollary III.8, and Theorems III.9 and III.10 hold for second-order stationary strong time series models with $p = 2$. More specifically, the processes determined by this kind of models can be approximated in the $L^2$ sense by linear reservoir systems with polynomial or neural network readouts (when the condition in Remark III.4 is satisfied), by trigonometric state-affine systems with linear readouts, or by echo state networks.
Important families of models to which this approximation statement can be applied are, among many others (see the references for the meaning of the acronyms), GARCH [43], [44], VEC [45], BEKK [46], CCC [47], DCC [48], [49], GDC [50], and ARSV [51], [52].
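As a hedged numerical illustration of this section (assuming NumPy; the GARCH parameters, the reservoir size, and the ridge regularization below are arbitrary choices, and the experiment is not taken from the paper), one can simulate a strong GARCH(1,1) model driven by i.i.d. Gaussian innovations and fit only the linear readout of a randomly generated ESN to approximate the resulting process in the mean squared ($L^2$) sense.

```python
import numpy as np

rng = np.random.default_rng(7)
T = 5000

# Strong GARCH(1,1): sigma_t^2 = omega + alpha*x_{t-1}^2 + beta*sigma_{t-1}^2,
# x_t = sigma_t * z_t, with i.i.d. standard normal innovations z_t.
omega, alpha, beta = 0.05, 0.1, 0.85
z = rng.normal(size=T)
x = np.zeros(T)
sig2 = np.full(T, omega / (1 - alpha - beta))   # start at the stationary variance
for t in range(1, T):
    sig2[t] = omega + alpha * x[t - 1] ** 2 + beta * sig2[t - 1]
    x[t] = np.sqrt(sig2[t]) * z[t]

# Randomly generated ESN driven by the innovations z; only the readout is trained.
N = 200
A = rng.normal(size=(N, N)); A *= 0.9 / max(abs(np.linalg.eigvals(A)))
C = rng.normal(size=(N, 1)); zeta = 0.1 * rng.normal(size=N)
X = np.zeros((T, N)); s = np.zeros(N)
for t in range(T):
    s = np.tanh(A @ s + C[:, 0] * z[t] + zeta)
    X[t] = s

# Ridge-regressed linear readout approximating the GARCH filter z -> x.
burn, lam = 100, 1e-4
Xb, xb = X[burn:], x[burn:]
w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(N), Xb.T @ xb)
rmse = np.sqrt(np.mean((Xb @ w - xb) ** 2))
print("in-sample L^2 error of the ESN approximation:", rmse)
```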
IV. CONCLUSION
We have shown the universality of three different families
of reservoir computers with respect to the $L^p$ norm associated
to any given discrete-time semi-infinite input process.
On the one hand we proved that linear reservoir systems
with either neural network or polynomial readout maps (in
this case the input process needs to satisfy the exponential
moments condition (11)) are universal.
On the other hand we showed that the exponential moment
condition (11), which was required in the case of polynomial
readouts, can be dropped by considering two different reservoir
families with linear readouts, namely, trigonometric state-
affine systems and echo state networks. The latter are the most
widely used reservoir systems in applications. The linearity in
the readouts is a key feature in supervised machine learning
applications of these systems. It guarantees that they can be
used in high-dimensional situations and in the presence of
large datasets, since the training in that case is reduced to a
linear regression.
We emphasize that, unlike existing results in the literature
[25], [26] dealing with uniform universal approximation, the
$L^p$ criteria used in this paper allow us to formulate universality
statements that do not necessarily impose almost sure uniform
boundedness on the inputs or the fading memory property on
the filter that needs to be approximated.
APPENDIX
A. Auxiliary Lemmas
Lemma A.1. Let $\mathbf{Z}: \mathbb{Z}_- \times \Omega \to \mathbb{R}^n$ be a stochastic process and let $\mathcal{F}_t := \sigma(\mathbf{Z}_0, \ldots, \mathbf{Z}_t)$, $t \in \mathbb{Z}_-$, and $\mathcal{F}_{-\infty} := \sigma(\mathbf{Z}_t : t \in \mathbb{Z}_-)$. Let $F \in L^p(\Omega, \mathcal{F}_{-\infty}, \mathbb{P})$. Then $\mathbb{E}[F \mid \mathcal{F}_t]$ converges to $F$ as $t \to -\infty$, both $\mathbb{P}$-almost surely and in the norm $\|\cdot\|_p$, for any $p \in [1, \infty)$.
Proof. Since $\mathcal{F}_t \subset \mathcal{F}_{t-1} \subset \mathcal{F}_{-\infty}$, for all $t \in \mathbb{Z}_-$, and $F \in L^p(\Omega, \mathcal{F}_{-\infty}, \mathbb{P}) \subset L^1(\Omega, \mathcal{F}_{-\infty}, \mathbb{P})$, one has by Lévy's Upward Theorem (see, for instance, [53, II.50.3] or [33, Theorem 5.5.7]) that $F_t := \mathbb{E}[F \mid \mathcal{F}_t]$ converges for $t \to -\infty$ to $F$ in $\|\cdot\|_1$ and $\mathbb{P}$-almost surely. If $p = 1$ this already implies the claim. For $p > 1$ one has by standard properties of conditional expectations (see, for instance, [33, Theorem 5.1.4]) that $\sup_{t \in \mathbb{Z}_-}\{\mathbb{E}[|F_t|^p]\} \leq \mathbb{E}[|F|^p]$. Hence [33, Theorem 5.4.5] implies that $F_t$ converges for $t \to -\infty$ to some $\tilde{F} \in L^p(\Omega, \mathcal{F}_{-\infty}, \mathbb{P})$ both in $\|\cdot\|_p$ and $\mathbb{P}$-almost surely. But this identifies $\tilde{F} = \lim_{t \to -\infty} F_t = F$, $\mathbb{P}$-almost surely, and hence $F_t$ converges for $t \to -\infty$ to $F$ also in $\|\cdot\|_p$.
Lemma A.2. For $N \in \mathbb{N} \setminus \{0, 1\}$ and $j = 1, \ldots, N-1$ define $A_j \in \mathbb{M}_N$ by $(A_j)_{k,l} = \delta_{k,j+1}\,\delta_{l,j}$ for $k, l \in \{1, \ldots, N\}$. Then for $L \in \mathbb{N}$, $j_0, \ldots, j_L \in \{1, \ldots, N-1\}$ it holds that
$$(A_{j_L} \cdots A_{j_0})_{k,l} = \delta_{k,j_L+1}\,\delta_{l,j_0} \prod_{i=1}^{L} \delta_{j_i, j_{i-1}+1}. \quad (49)$$
In particular $A_{j_L} \cdots A_{j_0} \neq 0$ if and only if $j_i = j_0 + i$ for $i \in \{1, \ldots, L\}$.
Proof. The last statement directly follows from (49). To prove (49) we proceed by induction on $L$. Indeed, for $L = 0$ the
formula (49) is just the definition of $A_{j_0}$. For the induction step, one assumes that (49) holds for $L-1$ and calculates
$$(A_{j_L} \cdots A_{j_0})_{k,l} = \sum_{r=1}^{N} \delta_{k,j_L+1}\,\delta_{r,j_L}\, (A_{j_{L-1}} \cdots A_{j_0})_{r,l} = \sum_{r=1}^{N} \delta_{k,j_L+1}\,\delta_{r,j_L}\,\delta_{r,j_{L-1}+1}\,\delta_{l,j_0} \prod_{i=1}^{L-1} \delta_{j_i,j_{i-1}+1} = \delta_{k,j_L+1}\,\delta_{j_L,j_{L-1}+1}\,\delta_{l,j_0} \prod_{i=1}^{L-1} \delta_{j_i,j_{i-1}+1},$$
which is indeed (49).
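A hedged numerical check of Lemma A.2 (assuming NumPy; the dimension and the index sequences below are arbitrary): products of the matrices $A_j$ are non-zero exactly when the indices increase by one at each step.

```python
import numpy as np
from functools import reduce

N = 6

def A(j):
    """(A_j)_{k,l} = delta_{k,j+1} delta_{l,j} in 1-based indexing."""
    M = np.zeros((N, N))
    M[j, j - 1] = 1.0          # row j+1, column j in 1-based indexing
    return M

def product(indices):
    """A_{j_L} ... A_{j_0} for indices = (j_0, ..., j_L)."""
    return reduce(lambda acc, j: A(j) @ acc, indices, np.eye(N))

print(np.any(product([2, 3, 4]) != 0))   # True:  j_i = j_0 + i
print(np.any(product([2, 4, 3]) != 0))   # False: indices do not increase by one
```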
ACKNOWLEDGMENT
The authors thank Lyudmila Grigoryeva and Josef Teichmann for helpful discussions and remarks and acknowledge partial financial support coming from the Research Commission of the Universität Sankt Gallen, the Swiss National Science Foundation (grants number 175801/1 and 179114), and the French ANR "BIPHOPROC" project (ANR-14-OHRI-0018-02).
REFERENCES
[1] F. Cucker and S. Smale, “On the mathematical foundations of learning,
Bulletin of the American Mathematical Society, vol. 39, no. 1, pp. 1–49,
2002.
[2] S. Smale and D.-X. Zhou, “Estimating the approximation error in
learning theory,Analysis and Applications, vol. 01, no. 01, pp. 17–41,
jan 2003.
[3] F. Cucker and D.-X. Zhou, Learning Theory : An Approximation Theory
Viewpoint. Cambridge University Press, 2007.
[4] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, dec 1989.
[5] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward
networks are universal approximators,Neural Networks, vol. 2, no. 5,
pp. 359–366, 1989.
[6] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251–257, 1991.
[7] W. Maass, T. Natschläger, and H. Markram, “Real-time computing
without stable states: a new framework for neural computation based
on perturbations,” Neural Computation, vol. 14, pp. 2531–2560, 2002.
[8] H. Jaeger and H. Haas, “Harnessing Nonlinearity: Predicting Chaotic
Systems and Saving Energy in Wireless Communication,Science, vol.
304, no. 5667, pp. 78–80, 2004.
[9] W. Maass and H. Markram, “On the computational power of circuits of
spiking neurons,” Journal of Computer and System Sciences, vol. 69,
no. 4, pp. 593–616, 2004.
[10] W. Maass, “Liquid state machines: motivation, theory, and applications,”
in Computability In Context: Computation and Logic in the Real World,
S. S. Barry Cooper and A. Sorbi, Eds., 2011, ch. 8, pp. 275–296.
[11] M. B. Matthews, “On the Uniform Approximation of Nonlinear
Discrete-Time Fading-Memory Systems Using Neural Network Mod-
els,” Ph.D. dissertation, ETH Zürich, 1992.
[12] ——, “Approximating nonlinear fading-memory operators using neural
network models,” Circuits, Systems, and Signal Processing, vol. 12,
no. 2, pp. 279–307, jun 1993.
[13] S. Boyd and L. Chua, “Fading memory and the problem of approxi-
mating nonlinear operators with Volterra series,” IEEE Transactions on
Circuits and Systems, vol. 32, no. 11, pp. 1150–1161, 1985.
[14] K.-i. Funahashi and Y. Nakamura, “Approximation of dynamical systems
by continuous time recurrent neural networks,” Neural Networks, vol. 6,
no. 6, pp. 801–806, jan 1993.
[15] W. Maass, P. Joshi, and E. D. Sontag, “Computational aspects of
feedback in neural circuits,” PLoS Computational Biology, vol. 3, no. 1,
p. e165, 2007.
[16] E. Sontag, “Realization theory of discrete-time nonlinear systems: Part I-
The bounded case,” IEEE Transactions on Circuits and Systems, vol. 26,
no. 5, pp. 342–356, may 1979.
[17] E. D. Sontag, “Polynomial Response Maps,” in Lecture Notes Control
in Control and Information Sciences. Vol. 13. Springer Verlag, 1979.
[18] M. Fliess and D. Normand-Cyrot, “Vers une approche alg´
ebrique des
syst`
emes non lin´
eaires en temps discret,” in Analysis and Optimization
of Systems. Lecture Notes in Control and Information Sciences, vol. 28,
A. Bensoussan and J. Lions, Eds. Springer Berlin Heidelberg, 1980.
[19] I. W. Sandberg, “Approximation theorems for discrete-time systems,
IEEE Transactions on Circuits and Systems, vol. 38, no. 5, pp. 564–
566, 1991.
[20] ——, “Structure theorems for nonlinear systems,” Multidimensional
Systems and Signal Processing, vol. 2, pp. 267–286, 1991.
[21] P. C. Perryman, “Approximation Theory for Deterministic and Stochastic
Nonlinear Systems,” Ph.D. dissertation, University of California, Irvine,
1996.
[22] A. Stubberud and P. Perryman, “Current state of system approximation
for deterministic and stochastic systems,” in Conference Record of The
Thirtieth Asilomar Conference on Signals, Systems and Computers,
vol. 1. IEEE Comput. Soc. Press, 1997, pp. 141–145.
[23] B. Hammer and P. Tino, “Recurrent neural networks with small weights
implement definite memory machines,” Neural Computation, vol. 15,
no. 8, pp. 1897–1929, aug 2003.
[24] P. Tino, B. Hammer, and M. Bod´
en, “Markovian bias of neural-based
architectures with feedback connections,” in Perspectives of Neural-
Symbolic Integration. Studies in Computational Intelligence, vol 77.,
Hammer B. and Hitzler P., Eds. Springer, Berlin, Heidelberg, 2007,
pp. 95–133.
[25] L. Grigoryeva and J.-P. Ortega, “Universal discrete-time reservoir com-
puters with stochastic inputs and linear readouts using non-homogeneous
state-affine systems,Journal of Machine Learning Research, vol. 19,
no. 24, pp. 1–40, 2018.
[26] ——, “Echo state networks are universal,” Neural Networks, vol. 108,
pp. 495–508, 2018.
[27] H. Jaeger, “The ’echo state’ approach to analysing and training recurrent
neural networks with an erratum note,” German National Research
Center for Information Technology, 2010.
[28] I. B. Yildiz, H. Jaeger, and S. J. Kiebel, “Re-visiting the echo state
property.” Neural networks : the official journal of the International
Neural Network Society, vol. 35, pp. 1–9, nov 2012.
[29] G. Manjunath and H. Jaeger, “Echo state property linked to an input:
exploring a fundamental characteristic of recurrent neural networks,”
Neural Computation, vol. 25, no. 3, pp. 671–696, 2013.
[30] O. Kallenberg, Foundations of Modern Probability, ser. Probability and
Its Applications. Springer New York, 2002.
[31] C. Berg and J. P. R. Christensen, “Density questions in the classical
theory of moments,” Annales de l’Institut Fourier, vol. 31, no. 3, pp.
99–114, 1981.
[32] L. C. Petersen, “On the relation between the multidimensional moment
problem and the one-dimensional moment problem,” Mathematica Scan-
dinavica, vol. 51, no. 2, pp. 361–366, 1983.
[33] R. Durrett, Probability: Theory and Examples, 4th ed., ser. Cambridge
Series in Statistical and Probabilistic Mathematics. Cambridge: Cam-
bridge University Press, 2010.
[34] Ernst, Oliver G., Mugler, Antje, Starkloff, Hans-J¨
org, and Ullmann,
Elisabeth, “On the convergence of generalized polynomial chaos ex-
pansions,” ESAIM: M2AN, vol. 46, no. 2, pp. 317–339, 2012.
[35] C. C. Heyde, “On a property of the lognormal distribution,” The Journal
of the Royal Statistical Society Series B (Methodological), vol. 25, no. 2,
pp. 392–393, 1963.
[36] G. Freud, Orthogonal Polynomials. Pergamon Press, 1971.
[37] W. Rudin, Real and Complex Analysis, 3rd ed. McGraw-Hill, 1987.
[38] H. Chen, F. Tang, P. Tino, and X. Yao, “Model-based kernel for
efficient time series analysis,” in Proceedings of the 19th ACM SIGKDD
international conference on Knowledge discovery and data mining -
KDD ’13, 2013.
[39] H. Chen, P. Tino, A. Rodan, and X. Yao, “Learning in the model space
for cognitive fault diagnosis,IEEE Transactions on Neural Networks
and Learning Systems, 2014.
[40] G.-B. G.-B. Huang, Q.-Y. Q.-Y. Zhu, C.-k. C.-K. C.-K. Siew, G.-b. H. ˜
A,
Q.-Y. Q.-Y. Zhu, and C.-k. C.-K. C.-K. Siew, “Extreme learning machine
: Theory and applications,” Neurocomputing, 2006.
[41] C. Francq and J.-M. Zakoian, GARCH Models: Structure, Statistical
Inference and Financial Applications. Wiley, 2010.
[42] P. J. Brockwell and R. A. Davis, Time Series: Theory and Methods.
Springer-Verlag, 2006.
[43] R. F. Engle, “Autoregressive conditional heteroscedasticity with esti-
mates of the variance of United Kingdom inflation,Econometrica,
vol. 50, no. 4, pp. 987–1007, 1982.
[44] T. Bollerslev, “Generalized autoregressive conditional heteroskedastic-
ity,Journal of Econometrics, vol. 31, no. 3, pp. 307–327, 1986.
GONON AND ORTEGA: RESERVOIR COMPUTING UNIVERSALITY WITH STOCHASTIC INPUTS 13
[45] T. Bollerslev, R. F. Engle, and J. M. Wooldridge, “A capital asset pricing
model with time varying covariances,Journal of Political Economy,
vol. 96, pp. 116–131, 1988.
[46] R. F. Engle and F. K. Kroner, “Multivariate simultaneous generalized
ARCH,” Econometric Theory, vol. 11, pp. 122–150, 1995.
[47] T. Bollerslev, “Modelling the coherence in short-run nominal exchange
rates: A multivariate generalized ARCH model,Review of Economics
and Statistics, vol. 72, no. 3, pp. 498–505, 1990.
[48] Y. K. Tse and A. K. C. Tsui, “A multivariate GARCH with time-varying
correlations,” Journal of Business and Economic Statistics, vol. 20, pp.
351–362, 2002.
[49] R. F. Engle, “Dynamic conditional correlation: a simple class of multi-
variate GARCH models,Journal of Business and Economic Statistics,
vol. 20, pp. 339–350, 2002.
[50] F. K. Kroner and V. K. Ng, “Modelling asymmetric comovements of
asset returns,” The Review of Financial Studies, vol. 11, pp. 817–844,
1998.
[51] S. J. Taylor, “Financial returns modelled by the product of two stochastic
processes, a study of daily sugar prices,” in Time series analysis: theory
and practice I, B. D. Anderson, Ed., 1982, pp. 1961–1979.
[52] A. C. Harvey, E. Ruiz, and N. Shephard, “Multivariate stochastic
variance models,Review of Economic Studies, vol. 61, pp. 247–264,
1994.
[53] L. C. G. Rogers and D. Williams, Diffusions, Markov Processes, and
Martingales, 2nd ed. Cambridge University Press, 2000, vol. 1.