
GONON AND ORTEGA: RESERVOIR COMPUTING UNIVERSALITY WITH STOCHASTIC INPUTS 1

Reservoir Computing Universality With Stochastic Inputs

Lukas Gonon and Juan-Pablo Ortega

Abstract—The universal approximation properties with respect to $L^p$-type criteria of three important families of reservoir computers with stochastic discrete-time semi-infinite inputs are shown. First, it is proved that linear reservoir systems with either polynomial or neural network readout maps are universal. More importantly, it is proved that the same property holds for two families with linear readouts, namely, trigonometric state-affine systems and echo state networks, which are the most widely used reservoir systems in applications. The linearity in the readouts is a key feature in supervised machine learning applications. It guarantees that these systems can be used in high-dimensional situations and in the presence of large datasets. The $L^p$ criteria used in this paper allow the formulation of universality results that do not necessarily impose almost sure uniform boundedness in the inputs or the fading memory property in the filter that needs to be approximated.

Index Terms—Reservoir computing, echo state network, ESN, machine learning, uniform system approximation, stochastic input, universality.

I. INTRODUCTION

A UNIVERSALITY statement in relation to a machine learning paradigm refers to its ability to reproduce a rich variety of patterns by modifying only a limited number of hyperparameters. In the language of learning theory, universality amounts to the possibility of making approximation errors as small as one wants [1]–[3]. Well-known universality results are, for example, the uniform approximation properties of feedforward neural networks established in [4], [5] for deterministic inputs and, later on, extended in [6] to accommodate random inputs.

This paper is a generalization of the universality statements in [6] to a discrete-time dynamical context. More specifically, we are interested in the learning not of functions but of filters that transform semi-infinite random input sequences parameterized by time into outputs that depend on those inputs in a causal and time-invariant manner. The approximants used are small subfamilies of reservoir computers (RC) [7], [8] or reservoir systems. Reservoir computers (also referred to in the literature as liquid state machines [9], [10]) are filters generated by nonlinear state-space transformations that constitute special types of recurrent neural networks. They are

determined by two maps, namely a reservoir map $F: \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N$, $n, N \in \mathbb{N}$, and a readout map $h: \mathbb{R}^N \to \mathbb{R}$, that under certain hypotheses transform (or filter) an infinite discrete-time input $\mathbf{z} = (\ldots, \mathbf{z}_{-1}, \mathbf{z}_0, \mathbf{z}_1, \ldots) \in (\mathbb{R}^n)^{\mathbb{Z}}$ into an output signal $y \in \mathbb{R}^{\mathbb{Z}}$ of the same type using a state-space transformation given by:
$$\mathbf{x}_t = F(\mathbf{x}_{t-1}, \mathbf{z}_t), \qquad (1)$$
$$y_t = h(\mathbf{x}_t), \qquad (2)$$
where $t \in \mathbb{Z}$ and the dimension $N \in \mathbb{N}$ of the state vectors $\mathbf{x}_t \in \mathbb{R}^N$ is referred to as the number of virtual neurons of the system. In supervised machine learning applications the reservoir map is very often randomly generated and the memoryless readout is trained so that the output matches a given teaching signal. An important particular case of the RC systems in (1)-(2) are echo state networks (ESN), introduced in different contexts in [8], [11], [12], which are built using the transformations
$$\mathbf{x}_t = \sigma(A\mathbf{x}_{t-1} + C\mathbf{z}_t + \boldsymbol{\zeta}), \qquad y_t = \mathbf{w}^\top \mathbf{x}_t, \qquad (3)$$
with $A \in M_N$, $C \in M_{N,n}$, $\boldsymbol{\zeta} \in \mathbb{R}^N$, and $\mathbf{w} \in \mathbb{R}^N$. The map $\sigma: \mathbb{R}^N \to \mathbb{R}^N$ is obtained via the componentwise application of a given activation function $\sigma: \mathbb{R} \to \mathbb{R}$ that is denoted with the same symbol. ESNs have as an important feature the linearity of the readout specified by the vector $\mathbf{w} \in \mathbb{R}^N$, which is estimated using linear regression methods based on a training dataset. This is done once the other parameters in the model ($A$, $C$, and $\boldsymbol{\zeta}$) have been randomly generated and their scale has been adapted to the problem in question by tuning a limited number of hyperparameters (like the sparsity or the spectral radius of the matrix $A$).

(L. Gonon and J.-P. Ortega are with the Faculty of Mathematics and Statistics, Universität Sankt Gallen, Sankt Gallen, Switzerland. L. Gonon is also affiliated with the Department of Mathematics, ETH Zürich, Switzerland. J.-P. Ortega is also affiliated with the Centre National de la Recherche Scientifique (CNRS), France.)
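As a numerical illustration (not part of the paper), the following Python/numpy sketch builds an ESN of the form (3) with randomly generated $A$, $C$, $\zeta$, runs it on a random input stream, and trains only the linear readout by least squares on an illustrative one-step memory task; all dimensions and the teaching signal are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

N, n = 50, 2          # virtual neurons, input dimension
T = 200               # length of the (truncated) input stream

# Randomly generated reservoir parameters A, C, zeta, as in (3); A is
# rescaled so that its spectral radius is below 1 (a common hyperparameter
# choice in practice).
A = rng.normal(size=(N, N))
A *= 0.9 / max(abs(np.linalg.eigvals(A)))
C = rng.normal(size=(N, n))
zeta = rng.normal(size=N)

z = rng.normal(size=(T, n))   # stochastic input stream

# State recursion x_t = sigma(A x_{t-1} + C z_t + zeta) with sigma = tanh.
x = np.zeros(N)
states = []
for t in range(T):
    x = np.tanh(A @ x + C @ z[t] + zeta)
    states.append(x)
X = np.array(states)

# Only the linear readout w is trained, here by least squares against an
# illustrative teaching signal y_t = z_{t-1}^(1) (a one-step memory task).
y_target = np.roll(z[:, 0], 1)
w, *_ = np.linalg.lstsq(X[1:], y_target[1:], rcond=None)
y_pred = X[1:] @ w
print(X.shape, w.shape)
```

Only `w` depends on the teaching signal; the reservoir parameters are never trained, which is the defining feature of the RC paradigm described above.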

Families of reservoir systems of the type (1)-(2) have already been proved to be universal in different contexts. In the continuous-time setup, it was shown in [13] that linear reservoir systems with polynomial readouts or bilinear reservoirs with linear readouts are able to uniformly approximate any fading memory filter with uniformly bounded and equicontinuous inputs. The fading memory property is a continuity feature exhibited by many filters encountered in applications. See also [9], [10], [14], [15] for other contributions to the RC universality problem in the continuous-time setup.

In the discrete-time setup, several universality statements were already part of classical systems theory for inputs defined on a finite number of time points [16]–[18]. In the more general context of semi-infinite inputs, various universality results have been formulated for systems with approximate finite memory [11], [12], [19]–[22]. More recently, it has been shown in [23], [24] that RCs generated by contractive reservoir maps (similar to the ESNs introduced above) exhibit universality properties in the approximate finite memory category.

These universality results have been recently extended to the causal and fading memory category in [25], [26]. In those works the universality of two important families of reservoir systems with linear readouts has been established, namely, the so-called state-affine systems (SAS) and the echo state networks (ESN) that we just introduced in (3). Moreover, the universality of the SAS family was established in [25] both for uniformly bounded deterministic inputs and for almost surely uniformly bounded stochastic ones. This last statement was shown to be a corollary of a general transfer theorem proving that very important features of causal and time-invariant filters, like the fading memory property or universality, are naturally inherited by reservoir systems with almost surely uniformly bounded stochastic inputs from their counterparts with deterministic inputs.

Unfortunately, almost surely bounded random inputs are not appropriate for many applications. For example, most parametric time series models use as driving innovations random variables whose distributions are not compactly supported (Gaussian, for example) in order to ensure adequate levels of performance. The main goal of this work is formulating universality results in the stochastic context that do not impose almost sure uniform boundedness on the inputs. This is achieved by using a density criterion (which is the mathematical characterization of universality) based not on $L^\infty$-type norms, as in [25], [26], but on $L^p$ norms, $p \in [1, \infty)$. This approach follows the pattern introduced in the static case in [6].

This strategy allows us to cover a more general class of input signals and filters, but it also creates some differences in the type of approximation results that are obtained. More specifically, in the stochastic universality statements in [25], for example, universal families are presented that uniformly approximate any given filter for any input in a given class of stochastic processes. In contrast with this, and as in [6], here we first fix a discrete-time stochastic process that models the data generating process (DGP) behind the system inputs under consideration. Subsequently, families of reservoir filters are spelled out whose images of the DGP are dense in the $L^p$ sense. Equivalently, the image of the DGP by any measurable causal and time-invariant filter can be approximated by the image of one of the members of the universal family with respect to an $L^p$ norm defined using the law of the prefixed DGP.

It is important to point out that this approach allows us to formulate universality results for filters that do not necessarily have the fading memory property, since only measurability is imposed as a hypothesis.

The paper contains three main universality statements. The first one shows that linear reservoir systems with either polynomial or neural network readout maps are universal in the $L^p$ sense. More importantly, two other families with linear readouts are shown to also have this property, namely, trigonometric state-affine systems and echo state networks, which are the most widely used reservoir systems in applications. The linearity of the readout is a key feature of these systems since in supervised machine learning applications it reduces the training task to the solution of a linear regression problem, which can be implemented efficiently also in high-dimensional situations and in the presence of large datasets.

We emphasize that, from a learning theoretical perspective, the results in this paper only establish the possibility of making the approximation error arbitrarily small when using the proposed RC families in a specific learning task. We provide bounds neither for the approximation errors nor for the corresponding estimation errors based on finite random samples. Even though some results in this direction already exist in the literature [23], [24], we plan to address this important subject in a forthcoming paper that adopts the same degree of generality as the present one.

II. PRELIMINARIES

In this section we introduce some notation and collect general facts about filters, reservoir systems, and stochastic input signals.

A. Notation

We write $\mathbb{N} = \{0, 1, \ldots\}$ and $\mathbb{Z}_- = \{\ldots, -1, 0\}$. The elements of the Euclidean spaces $\mathbb{R}^n$ will be written as column vectors and will be denoted in bold. Given a vector $\mathbf{v} \in \mathbb{R}^n$, we denote its entries by $v_i$ or by $v^{(i)}$, with $i \in \{1, \ldots, n\}$. $(\mathbb{R}^n)^{\mathbb{Z}}$ and $(\mathbb{R}^n)^{\mathbb{Z}_-}$ denote the sets of infinite $\mathbb{R}^n$-valued sequences of the type $(\ldots, \mathbf{z}_{-1}, \mathbf{z}_0, \mathbf{z}_1, \ldots)$ and $(\ldots, \mathbf{z}_{-1}, \mathbf{z}_0)$ with $\mathbf{z}_i \in \mathbb{R}^n$ for $i \in \mathbb{Z}$ and $i \in \mathbb{Z}_-$, respectively. Additionally, we denote by $z_i^{(k)}$ the $k$-th component of $\mathbf{z}_i$. The elements in these sequence spaces will also be written in bold, for example, $\mathbf{z} := (\ldots, \mathbf{z}_{-1}, \mathbf{z}_0) \in (\mathbb{R}^n)^{\mathbb{Z}_-}$. We denote by $M_{n,m}$ the space of real $n \times m$ matrices with $m, n \in \mathbb{N}$. When $n = m$, we use the symbol $M_n$ to refer to the space of square matrices of order $n$. Random variables and stochastic processes will be denoted using upper case characters that will be bold when they are vector valued.

B. Filters and functionals

A filter is a map $U: (\mathbb{R}^n)^{\mathbb{Z}} \to \mathbb{R}^{\mathbb{Z}}$. It is called causal if, for any $\mathbf{z}, \mathbf{w} \in (\mathbb{R}^n)^{\mathbb{Z}}$ which satisfy $\mathbf{z}_\tau = \mathbf{w}_\tau$ for all $\tau \le t$ for a given $t \in \mathbb{Z}$, one has that $U(\mathbf{z})_t = U(\mathbf{w})_t$. Denote by $T_{-\tau}: (\mathbb{R}^n)^{\mathbb{Z}} \to (\mathbb{R}^n)^{\mathbb{Z}}$ the time delay operator defined by $T_{-\tau}(\mathbf{z})_t := \mathbf{z}_{t+\tau}$, for any $\tau \in \mathbb{Z}$. A filter $U$ is called time-invariant if $T_{-\tau} \circ U = U \circ T_{-\tau}$ for all $\tau \in \mathbb{Z}$.

Causal and time-invariant filters can be equivalently described using their naturally associated functionals. We refer to a map $H: (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}$ as a functional. Given a causal and time-invariant filter $U$, one defines the functional $H_U$ associated to it by setting $H_U(\mathbf{z}) := U(\mathbf{z}^e)_0$. Here $\mathbf{z}^e$ is an arbitrary extension of $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}_-}$ to $(\mathbb{R}^n)^{\mathbb{Z}}$; $H_U$ does not depend on the choice of this extension since $U$ is causal. Conversely, given a functional $H$, one may define a causal and time-invariant filter $U_H: (\mathbb{R}^n)^{\mathbb{Z}} \to \mathbb{R}^{\mathbb{Z}}$ by setting $U_H(\mathbf{z})_t := H(\pi_{\mathbb{Z}_-} \circ T_{-t}(\mathbf{z}))$, where $\pi_{\mathbb{Z}_-}: (\mathbb{R}^n)^{\mathbb{Z}} \to (\mathbb{R}^n)^{\mathbb{Z}_-}$ is the natural projection. One may verify that any causal and time-invariant filter can be recovered from its associated functional and conversely; equivalently, $U = U_{H_U}$ and $H = H_{U_H}$. We refer to [13] for further details.

If $U$ is causal and time-invariant, then for any $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}}$ the sequence $U(\mathbf{z})$ restricted to $\mathbb{Z}_-$ only depends on $(\mathbf{z}_t)_{t \in \mathbb{Z}_-}$. Thus we may also consider $U$ as a map $U: (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}^{\mathbb{Z}_-}$; when we do so, this will always be clear from the context.
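As a small numerical illustration (not from the paper), the exponential smoothing functional $H(\mathbf{z}) = \sum_{j \ge 0} \lambda^j z_{-j}$ induces a causal and time-invariant filter $U_H$ as described above. The sketch below truncates the left-infinite sequences to finite arrays (last entry playing the role of $z_0$) and checks causality numerically: perturbing the inputs strictly after time $t$ leaves $U_H(\mathbf{z})_t$ unchanged. The choice $\lambda = 0.5$ is arbitrary.

```python
import numpy as np

lam = 0.5

def H(z):
    # Functional on a truncated left-infinite sequence z = (..., z_-1, z_0),
    # stored as an array whose LAST entry is z_0:
    # H(z) = sum_{j >= 0} lam^j * z_{-j}  (exponential smoothing).
    j = np.arange(len(z))            # weights for z_0, z_-1, z_-2, ...
    return float(np.sum(lam ** j * z[::-1]))

def U_H(z):
    # Filter induced by H as in the text: U_H(z)_t = H(z restricted to times <= t),
    # evaluated for every t of a finite input window.
    return np.array([H(z[: t + 1]) for t in range(len(z))])

rng = np.random.default_rng(1)
z = rng.normal(size=100)
y = U_H(z)

# Causality: the output at time t only depends on inputs up to time t.
z2 = z.copy(); z2[60:] += 5.0        # perturb only inputs after t = 59
assert np.allclose(U_H(z2)[:60], y[:60])
print(y[-1])
```

The same construction with the shifted window `z[: t + 1]` is exactly the composition $H \circ \pi_{\mathbb{Z}_-} \circ T_{-t}$ in truncated form.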

C. Reservoir computing systems

A specific class of filters can be obtained using the reservoir computing systems or reservoir computers (RC) introduced in (1)-(2) when they satisfy the so-called echo state property (ESP) given by the following statement (see [27]–[29]): for any $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}}$ there exists a unique $\mathbf{x} \in (\mathbb{R}^N)^{\mathbb{Z}}$ such that (1) holds. In the presence of the ESP, the RC system gives rise to a well-defined filter $U_h^F$ that is constructed by associating to any $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}}$ the unique $\mathbf{x} \in (\mathbb{R}^N)^{\mathbb{Z}}$ satisfying (1) and by mapping $\mathbf{x}$ subsequently to the output in (2), that is, $U_h^F(\mathbf{z})_t := y_t$. Furthermore, it can be shown (see [26, Proposition 2.1]) that $U_h^F$ is necessarily causal and time-invariant, and hence we may associate to $U_h^F$ a reservoir functional $H_h^F: (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}$ defined as $H_h^F(\mathbf{z}) := U_h^F(\mathbf{z})_0$. As seen above, the causal and time-invariant filter $U_h^F$ is uniquely determined by the reservoir functional $H_h^F$. Since the latter is determined by the restriction of the RC system to $\mathbb{Z}_-$, we will sometimes consider the system (1)-(2) only for $t \in \mathbb{Z}_-$.
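A common way to see the ESP at work numerically (an illustration, not a construction from the paper) is to make the reservoir map a contraction: below, $A$ is rescaled by its operator norm so that $\mathbf{x} \mapsto \tanh(A\mathbf{x} + C\mathbf{z})$ is a $0.8$-contraction, and iterating from two very different initial states over the same input stream yields essentially the same final state, consistent with the uniqueness of the solution of (1). Contractivity is only a sufficient condition for the ESP; the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, T = 30, 1, 300

A = rng.normal(size=(N, N))
A *= 0.8 / np.linalg.norm(A, 2)   # operator norm 0.8 => 0.8-contraction
C = rng.normal(size=(N, n))

def run(z, x0):
    # Iterate x_t = tanh(A x_{t-1} + C z_t) from the initial state x0.
    x = x0.copy()
    for zt in z:
        x = np.tanh(A @ x + C @ zt)
    return x

z = rng.normal(size=(T, n))
xa = run(z, np.zeros(N))
xb = run(z, 10.0 * np.ones(N))
# The two trajectories "forget" their initializations: same final state.
print(np.linalg.norm(xa - xb))
```

The difference between the two final states is bounded by $0.8^{T-1}$ times the initial discrepancy, which is numerically zero here.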

D. Deterministic ﬁlters with stochastic inputs

We are interested in feeding the filters and the systems that we just introduced with stochastic processes as inputs. More explicitly, given a causal and time-invariant filter $U$ that satisfies certain measurability hypotheses, any stochastic process $\mathbf{Z} = (\mathbf{Z}_t)_{t \in \mathbb{Z}_-}$ is mapped to a new stochastic process $(U(\mathbf{Z})_t)_{t \in \mathbb{Z}_-}$. The main contributions in this article address the question of approximating $U(\mathbf{Z})$ by reservoir filters in an $L^p$ sense. We now introduce the precise framework needed to achieve this goal.

1) Probabilistic framework: Consider a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ on which all random variables are defined. Recall that the sample space $\Omega$ is an arbitrary set representing possible outcomes, the $\sigma$-algebra $\mathcal{F}$ is a collection of subsets of $\Omega$ describing the set of events to be considered, and $\mathbb{P}: \mathcal{F} \to [0,1]$ is a probability measure that assigns a probability of occurrence to each event. The input signal is modeled as a discrete-time stochastic process $\mathbf{Z} = (\mathbf{Z}_t)_{t \in \mathbb{Z}_-}$ with values in $\mathbb{R}^n$. For each outcome $\omega \in \Omega$ we denote by $\mathbf{Z}(\omega) = (\mathbf{Z}_t(\omega))_{t \in \mathbb{Z}_-}$ the realization or sample path of $\mathbf{Z}$. Thus $\mathbf{Z}$ may be viewed as a random sequence in $\mathbb{R}^n$ and, when dealing with stochastic processes, we will make no distinction between the assignment $\mathbf{Z}: \mathbb{Z}_- \times \Omega \to \mathbb{R}^n$ and the corresponding map into path space $\mathbf{Z}: \Omega \to (\mathbb{R}^n)^{\mathbb{Z}_-}$. We recall that $\mathbf{Z}$ is a stochastic process when the corresponding map $\mathbf{Z}: \Omega \to (\mathbb{R}^n)^{\mathbb{Z}_-}$ is measurable. Here $(\mathbb{R}^n)^{\mathbb{Z}_-}$ is equipped with the product $\sigma$-algebra $\otimes_{t \in \mathbb{Z}_-} \mathcal{B}(\mathbb{R}^n)$ (which coincides with the Borel $\sigma$-algebra of $(\mathbb{R}^n)^{\mathbb{Z}_-}$ equipped with the product topology by [30, Lemma 1.2]), where $\mathcal{B}(\mathbb{R}^n)$ is the Borel $\sigma$-algebra on $\mathbb{R}^n$.

We denote by $\mathcal{F}_t := \sigma(\mathbf{Z}_0, \ldots, \mathbf{Z}_t)$, $t \in \mathbb{Z}_-$, the $\sigma$-algebra generated by $\{\mathbf{Z}_0, \ldots, \mathbf{Z}_t\}$ and write $\mathcal{F}_{-\infty} := \sigma(\mathbf{Z}_t : t \in \mathbb{Z}_-)$. Thus $\mathcal{F}_t$ models the information contained in the input stream at times $0, -1, \ldots, t$. For $p \in [1, \infty]$ we denote by $L^p(\Omega, \mathcal{F}, \mathbb{P})$ the Banach space formed by the real-valued random variables on $(\Omega, \mathcal{F}, \mathbb{P})$ that have a finite usual $L^p$ norm $\|\cdot\|_p$.

We say that the process $\mathbf{Z}$ is stationary when for any $\{t_1, \ldots, t_k\} \subset \mathbb{Z}_-$, $h \in \mathbb{Z}_-$, and $A_{t_1}, \ldots, A_{t_k} \in \mathcal{B}(\mathbb{R}^n)$, we have that
$$\mathbb{P}(\mathbf{Z}_{t_1} \in A_{t_1}, \ldots, \mathbf{Z}_{t_k} \in A_{t_k}) = \mathbb{P}(\mathbf{Z}_{t_1+h} \in A_{t_1}, \ldots, \mathbf{Z}_{t_k+h} \in A_{t_k}).$$

2) Measurable functionals and filters: We say that a functional $H$ is measurable when the map between measurable spaces $H: \left((\mathbb{R}^n)^{\mathbb{Z}_-}, \otimes_{t \in \mathbb{Z}_-} \mathcal{B}(\mathbb{R}^n)\right) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is measurable. When $H$ is measurable then so is $H(\mathbf{Z}): (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, since $H(\mathbf{Z}) = H \circ \mathbf{Z}$ is the composition of measurable maps, and hence $H(\mathbf{Z})$ is a random variable on $(\Omega, \mathcal{F}, \mathbb{P})$.

Analogously, we will say that a causal, time-invariant filter $U$ is measurable when the map between measurable spaces $U: \left((\mathbb{R}^n)^{\mathbb{Z}}, \otimes_{t \in \mathbb{Z}} \mathcal{B}(\mathbb{R}^n)\right) \to \left(\mathbb{R}^{\mathbb{Z}}, \otimes_{t \in \mathbb{Z}} \mathcal{B}(\mathbb{R})\right)$ is measurable. In that case, the restriction of $U$ to $\mathbb{Z}_-$ (see above) is also measurable and so $U(\mathbf{Z})$ is a real-valued stochastic process.

As discussed above, causal, time-invariant filters and functionals are in a one-to-one correspondence. This relation is compatible with the measurability condition, that is, a causal and time-invariant filter is measurable if and only if the associated functional is measurable. In order to prove this statement we show first that the operator $\pi_{\mathbb{Z}_-} \circ T_{-t}: \left((\mathbb{R}^n)^{\mathbb{Z}}, \otimes_{t \in \mathbb{Z}} \mathcal{B}(\mathbb{R}^n)\right) \to \left((\mathbb{R}^n)^{\mathbb{Z}_-}, \otimes_{t \in \mathbb{Z}_-} \mathcal{B}(\mathbb{R}^n)\right)$ is a measurable map, for any $t \in \mathbb{Z}_-$. Indeed, notice first that the projections $p_i: \left((\mathbb{R}^n)^{\mathbb{Z}}, \otimes_{t \in \mathbb{Z}} \mathcal{B}(\mathbb{R}^n)\right) \to (\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$, $i \in \mathbb{Z}_-$, given by $p_i(\mathbf{z}) = \mathbf{z}_i$ are measurable. Thus $\pi_{\mathbb{Z}_-} \circ T_{-t}$ can be written as the Cartesian product of measurable maps, i.e., for each $k \in \mathbb{Z}_-$ one has that $(\pi_{\mathbb{Z}_-} \circ T_{-t})_k = p_{t+k}$ is measurable. This yields that $\pi_{\mathbb{Z}_-} \circ T_{-t}$ is measurable [30, Lemma 1.8].

Now, if $H$ is a measurable functional, then the associated filter $U_H$ is also measurable since, for each $t \in \mathbb{Z}_-$,
$$(U_H)_t = H \circ \pi_{\mathbb{Z}_-} \circ T_{-t} \qquad (4)$$
is a composition of measurable functions and hence also measurable. Conversely, if $U$ is causal, time-invariant, and measurable, then so is the associated functional $H_U = p_0 \circ U$.

3) $L^p$-norm for functionals: Fix $p \in [1, \infty)$ and let $H$ be a measurable functional such that $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$. The functionals which satisfy that
$$\|H(\mathbf{Z})\|_p := \mathbb{E}[|H(\mathbf{Z})|^p]^{1/p} < \infty \qquad (5)$$
will be referred to as $p$-integrable with respect to the input process $\mathbf{Z}$.

Let us now consider the expression (5) from an alternative point of view. Denote by $\mu_{\mathbf{Z}} := \mathbb{P} \circ \mathbf{Z}^{-1}$ the law of $\mathbf{Z}$ when viewed as an $(\mathbb{R}^n)^{\mathbb{Z}_-}$-valued random variable as above. Thus $\mu_{\mathbf{Z}}$ is a probability measure on $(\mathbb{R}^n)^{\mathbb{Z}_-}$ such that for any measurable set $A \subset (\mathbb{R}^n)^{\mathbb{Z}_-}$ one has $\mu_{\mathbf{Z}}(A) = \mathbb{P}(\mathbf{Z} \in A)$. The requirement $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$ then translates to $H \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$ and (5) is equal [30, Lemma 1.22] to
$$\|H\|_p^{\mu_{\mathbf{Z}}} := \left[ \int_{(\mathbb{R}^n)^{\mathbb{Z}_-}} |H(\mathbf{z})|^p \, \mu_{\mathbf{Z}}(d\mathbf{z}) \right]^{1/p} = \|H(\mathbf{Z})\|_p.$$

Thus, the results formulated later on in the paper for functionals with random inputs can also be seen as statements for functionals with deterministic inputs in $(\mathbb{R}^n)^{\mathbb{Z}_-}$, where the closeness between them is measured using the norm in $L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$. Following the terminology used in [6], we will refer to $\mu_{\mathbf{Z}}$ as the input environment measure.

We emphasize that these two points of view are equivalent. Given any probability measure $\mu_{\mathbf{Z}}$ on $(\mathbb{R}^n)^{\mathbb{Z}_-}$ one may set $\Omega = (\mathbb{R}^n)^{\mathbb{Z}_-}$, $\mathcal{F} = \otimes_{t \in \mathbb{Z}_-} \mathcal{B}(\mathbb{R}^n)$, $\mathbb{P} = \mu_{\mathbf{Z}}$ and define $Z_t(\mathbf{z}) := \mathbf{z}_t$ for all $\mathbf{z} \in \Omega$. We will switch between these two viewpoints throughout the paper without much warning to the reader.
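As a concrete check of the norm in (5) (a numerical sketch, not from the paper), take an i.i.d. standard Gaussian DGP and the exponential smoothing functional $H(\mathbf{z}) = \sum_{j \ge 0} \lambda^j z_{-j}$ with $\lambda = 0.5$. Then $H(\mathbf{Z})$ is centered Gaussian with variance $1/(1-\lambda^2)$, so for $p=2$ the norm (5) is $\sqrt{1/(1-\lambda^2)}$, which a Monte Carlo estimate over truncated sample paths reproduces:

```python
import numpy as np

rng = np.random.default_rng(3)
lam, p = 0.5, 2
K, M = 50, 200_000    # truncation of the past, Monte Carlo sample size

# DGP: i.i.d. standard normal inputs; each row of Z is one sample path,
# with column j holding Z_{-j}.
Z = rng.normal(size=(M, K + 1))

# Functional H(z) = sum_{j >= 0} lam^j z_{-j}, truncated at j = K.
w = lam ** np.arange(K + 1)
HZ = Z @ w

# Monte Carlo estimate of ||H(Z)||_p = E[|H(Z)|^p]^(1/p) ...
mc = np.mean(np.abs(HZ) ** p) ** (1 / p)

# ... which for p = 2 is the standard deviation sqrt(1/(1 - lam^2)).
exact = np.sqrt(1.0 / (1.0 - lam**2))
print(mc, exact)
```

The truncation at $K = 50$ terms changes the variance only by a factor $1 - \lambda^{2(K+1)}$, which is negligible here.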

4) $L^p$-norm for filters: Fix $p \in [1, \infty)$. A causal, time-invariant, measurable filter $U$ is said to be $p$-integrable if
$$\|U(\mathbf{Z})\|_p := \sup_{t \in \mathbb{Z}_-} \left\{ \mathbb{E}[|U(\mathbf{Z})_t|^p]^{1/p} \right\} < \infty. \qquad (6)$$
It is easy to see that if $U$ is $p$-integrable, then so is the corresponding functional $H_U$, due to the following inequality:
$$\|H_U(\mathbf{Z})\|_p = \mathbb{E}[|H_U(\mathbf{Z})|^p]^{1/p} = \mathbb{E}[|U(\mathbf{Z})_0|^p]^{1/p} \le \sup_{t \in \mathbb{Z}_-} \left\{ \mathbb{E}[|U(\mathbf{Z})_t|^p]^{1/p} \right\} = \|U(\mathbf{Z})\|_p < \infty.$$
The converse implication holds true when the input process is stationary. In order to show this fact, notice first that if $\mu_t$ is the law of $\pi_{\mathbb{Z}_-} \circ T_{-t}(\mathbf{Z})$, $t \in \mathbb{Z}_-$, and $\mathbf{Z}$ is by hypothesis stationary, then for any $\{t_1, \ldots, t_k\} \subset \mathbb{Z}_-$ and $A_{t_1}, \ldots, A_{t_k} \in \mathcal{B}(\mathbb{R}^n)$ we have that
$$\mathbb{P}\left((\pi_{\mathbb{Z}_-} \circ T_{-t}(\mathbf{Z}))_{t_1} \in A_{t_1}, \ldots, (\pi_{\mathbb{Z}_-} \circ T_{-t}(\mathbf{Z}))_{t_k} \in A_{t_k}\right) = \mathbb{P}(\mathbf{Z}_{t_1+t} \in A_{t_1}, \ldots, \mathbf{Z}_{t_k+t} \in A_{t_k}) = \mathbb{P}(\mathbf{Z}_{t_1} \in A_{t_1}, \ldots, \mathbf{Z}_{t_k} \in A_{t_k}),$$
which proves that
$$\mu_{\mathbf{Z}} = \mu_t, \quad \text{for all } t \in \mathbb{Z}_-. \qquad (7)$$
This identity, together with (4), implies that for any $p$-integrable functional $H$:
$$\|U_H(\mathbf{Z})\|_p = \sup_{t \in \mathbb{Z}_-} \left\{ \mathbb{E}[|U_H(\mathbf{Z})_t|^p]^{1/p} \right\} = \sup_{t \in \mathbb{Z}_-} \left\{ \mathbb{E}\left[|H(\pi_{\mathbb{Z}_-} \circ T_{-t}(\mathbf{Z}))|^p\right]^{1/p} \right\} = \sup_{t \in \mathbb{Z}_-} \left[ \int_{(\mathbb{R}^n)^{\mathbb{Z}_-}} |H(\mathbf{z})|^p \, \mu_t(d\mathbf{z}) \right]^{1/p} = \sup_{t \in \mathbb{Z}_-} \left[ \int_{(\mathbb{R}^n)^{\mathbb{Z}_-}} |H(\mathbf{z})|^p \, \mu_{\mathbf{Z}}(d\mathbf{z}) \right]^{1/p} = \|H(\mathbf{Z})\|_p < \infty, \qquad (8)$$
which proves the $p$-integrability of the associated filter $U_H$.

III. $L^p$-UNIVERSALITY RESULTS

Fix $p \in [1, \infty)$, an input process $\mathbf{Z}$, and a functional $H$ such that $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$. The goal of this section is finding simple families of reservoir systems that are able to approximate $H(\mathbf{Z})$ as accurately as needed in the $L^p$ sense. The first part contains a result that shows that linear reservoir maps with polynomial readouts are able to carry this out. As we already pointed out in the introduction, a result for the same type of reservoir systems has been proved in [25] in the $L^\infty$ setting for both deterministic and almost surely uniformly bounded stochastic inputs. The second part presents a family that is able to achieve universality using only linear readouts, which is of major importance for applications since in that case the training effort reduces to solving a linear regression. Finally, we prove the universality of echo state networks, which form the most widely used family of reservoir systems with linear readouts.

A. Linear reservoirs with nonlinear readouts

Consider a reservoir system with a linear reservoir map and a polynomial readout. More precisely, given $A \in M_N$, $c \in M_{N,n}$, and $h \in \mathrm{Pol}_N$ a real-valued polynomial in $N$ variables, consider the system
$$\mathbf{x}_t = A\mathbf{x}_{t-1} + c\mathbf{z}_t, \qquad y_t = h(\mathbf{x}_t), \qquad t \in \mathbb{Z}_-, \qquad (9)$$
for any $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}_-}$. If the matrix $A$ is chosen so that $\sigma_{\max}(A) < 1$, then this system has the echo state property and the corresponding reservoir filter $U_h^{A,c}$ is causal and time-invariant [25]. We denote by $H_h^{A,c}$ the associated functional. We are interested in the approximation capabilities that can be achieved by using processes of the type $H_h^{A,c}(\mathbf{Z})$, where $\mathbf{Z}$ is a fixed input process and $H_h^{A,c}(\mathbf{Z}) = Y_0$, with $Y_0$ determined by the stochastic reservoir system
$$\mathbf{X}_t = A\mathbf{X}_{t-1} + c\mathbf{Z}_t, \qquad Y_t = h(\mathbf{X}_t), \qquad t \in \mathbb{Z}_-. \qquad (10)$$

Proposition III.1. Fix $p \in [1, \infty)$, let $\mathbf{Z}$ be a fixed $\mathbb{R}^n$-valued input process, and let $H$ be a functional such that $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$. Suppose that for any $K \in \mathbb{N}$ there exists $\alpha > 0$ such that
$$\mathbb{E}\left[ \exp\left( \alpha \sum_{k=0}^{K} \sum_{i=1}^{n} |Z_{-k}^{(i)}| \right) \right] < \infty, \qquad (11)$$
where $Z_{-k}^{(i)}$ denotes the $i$-th component of $\mathbf{Z}_{-k}$. Then, for any $\varepsilon > 0$ there exist $N \in \mathbb{N}$, $A \in M_N$, $c \in M_{N,n}$, and $h \in \mathrm{Pol}_N$ such that (9) has the echo state property, the corresponding filter is causal and time-invariant, the associated functional satisfies $H_h^{A,c}(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$, and
$$\|H(\mathbf{Z}) - H_h^{A,c}(\mathbf{Z})\|_p < \varepsilon. \qquad (12)$$
If the input process $\mathbf{Z}$ is stationary, then
$$\|U_H(\mathbf{Z}) - U_h^{A,c}(\mathbf{Z})\|_p < \varepsilon. \qquad (13)$$


Proof. The proof consists of two steps. In the first one we use assumption (11) and classical results in the literature to establish that
$$\mathrm{Pol}_{n(K+1)} \text{ is dense in } L^p(\mathbb{R}^{n(K+1)}, \mu_K), \quad \text{for all } K \in \mathbb{N}, \qquad (14)$$
where $\mu_K$ is the law of $(Z_0^{(1)}, Z_0^{(2)}, \ldots, Z_{-K}^{(n-1)}, Z_{-K}^{(n)})$ on $\mathbb{R}^{n(K+1)}$ under $\mathbb{P}$. In the second step we then use (14) to construct a linear RC system of the type in (9) that yields the approximation statement (12).

Step 1: Denote by $\mu_K$ the law of $(Z_0^{(1)}, Z_0^{(2)}, \ldots, Z_{-K}^{(n-1)}, Z_{-K}^{(n)})$ on $\mathbb{R}^N$ under $\mathbb{P}$, where $N := n(K+1)$. By (11) there exists $\alpha > 0$ such that $\int_{\mathbb{R}^N} \exp(\alpha \|\mathbf{z}\|_1) \, \mu_K(d\mathbf{z}) < \infty$, where here and in the rest of this proof $\|\cdot\|_1$ denotes the Euclidean 1-norm. Denoting by $\mu_K^j$ the $j$-th marginal distribution of $\mu_K$, this implies for $j = 1, \ldots, N$ that
$$\int_{\mathbb{R}} \exp(\alpha |z^{(j)}|) \, \mu_K^j(dz^{(j)}) \le \int_{\mathbb{R}^N} \exp(\alpha \|\mathbf{z}\|_1) \, \mu_K(d\mathbf{z}) < \infty.$$
Consequently, by [31, Theorem 6], $\mathrm{Pol}_1$ is dense in $L^p(\mathbb{R}, \mu_K^j)$ for any $p \in [1, \infty)$, $j = 1, \ldots, N$. By [32, Proposition, page 364] this implies that $\mathrm{Pol}_N$ is dense in $L^p(\mathbb{R}^N, \mu_K)$, where we note that $\mu_K$ indeed satisfies the moment assumption in [32, page 361]: since $x^{2m} \le C_{m,\alpha} \exp(\alpha x)$ for all $x \ge 0$, $m \in \mathbb{N}$, and some constant $C_{m,\alpha} > 0$, one has
$$\int_{\mathbb{R}^N} \|\mathbf{z}\|_2^{2m} \, \mu_K(d\mathbf{z}) \le C_{m,\alpha} \int_{\mathbb{R}^N} \exp(\alpha \|\mathbf{z}\|_2) \, \mu_K(d\mathbf{z}) \le C_{m,\alpha} \int_{\mathbb{R}^N} \exp(\alpha \|\mathbf{z}\|_1) \, \mu_K(d\mathbf{z}) < \infty.$$

Step 2: Let $\varepsilon > 0$. By Lemma A.1 in the Appendix there exists $K \in \mathbb{N}$ such that
$$\|H(\mathbf{Z}) - \mathbb{E}[H(\mathbf{Z})|\mathcal{F}_{-K}]\|_p < \frac{\varepsilon}{2}, \qquad (15)$$
where $\mathcal{F}_{-K} := \sigma(\mathbf{Z}_0, \ldots, \mathbf{Z}_{-K})$. In the following paragraphs we will establish the approximation statement (12) for $\mathbb{E}[H(\mathbf{Z})|\mathcal{F}_{-K}]$ instead of $H(\mathbf{Z})$. Combining this with (15) will then yield (12).

Let $N := n(K+1)$. By definition, $\mathbb{E}[H(\mathbf{Z})|\mathcal{F}_{-K}]$ is $\mathcal{F}_{-K}$-measurable and hence there exists [30, Lemma 1.13] a measurable function $g_K: \mathbb{R}^N \to \mathbb{R}$ such that $\mathbb{E}[H(\mathbf{Z})|\mathcal{F}_{-K}] = g_K(\mathbf{Z}_0, \ldots, \mathbf{Z}_{-K})$. Furthermore,
$$\int_{\mathbb{R}^N} |g_K(\mathbf{z})|^p \, \mu_K(d\mathbf{z}) = \mathbb{E}\left[|\mathbb{E}[H(\mathbf{Z})|\mathcal{F}_{-K}]|^p\right] \le \mathbb{E}[|H(\mathbf{Z})|^p] < \infty,$$
by standard properties of conditional expectations (see, for instance, [33, Theorem 5.1.4]) and the assumption that $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$. Thus $g_K \in L^p(\mathbb{R}^N, \mu_K)$ and, using the statement (14) established in Step 1, there exists $h \in \mathrm{Pol}_N$ such that
$$\|\mathbb{E}[H(\mathbf{Z})|\mathcal{F}_{-K}] - h(\mathbf{Z}_0^\top, \ldots, \mathbf{Z}_{-K}^\top)\|_p = \|g_K - h\|_{L^p(\mathbb{R}^N, \mu_K)} < \frac{\varepsilon}{2}. \qquad (16)$$

Define now a reservoir system of the type (10) with inputs given by the random variables $\mathbf{Z}_t$, $t \in \mathbb{Z}_-$, and reservoir matrices $A \in M_N$ and $c \in M_{N,n}$ with all entries equal to $0$ except $A_{i,i-n} = 1$ for $i = n+1, \ldots, N$ and $c_{i,i} = 1$ for $i = 1, \ldots, n$, that is,
$$A = \begin{pmatrix} 0_{n,nK} & 0_{n,n} \\ I_{nK} & 0_{nK,n} \end{pmatrix}, \qquad c = \begin{pmatrix} I_n \\ 0_{nK,n} \end{pmatrix}.$$
This system has the echo state property (all the eigenvalues of $A$ equal zero) and has a unique causal and time-invariant solution associated to the reservoir states $\mathbf{X}_t := \left(\mathbf{Z}_t^\top, \mathbf{Z}_{t-1}^\top, \ldots, \mathbf{Z}_{t-K}^\top\right)^\top$, $t \in \mathbb{Z}_-$. It is easy to verify that the corresponding reservoir functional is given by
$$H_h^{A,c}(\mathbf{Z}) = h(\mathbf{Z}_0^\top, \ldots, \mathbf{Z}_{-K}^\top). \qquad (17)$$
Now the triangle inequality together with (15), (16), and (17) allows us to conclude (12).

The statement (13) in the presence of the stationarity hypothesis for $\mathbf{Z}$ is a straightforward consequence of (7) and the equality (8).
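The shift-register construction in Step 2 can be checked numerically (an illustration, not part of the proof; dimensions are arbitrary): the linear reservoir with the nilpotent matrix $A$ and injection matrix $c$ simply stacks the last $K+1$ inputs in the state vector.

```python
import numpy as np

n, K = 2, 3
N = n * (K + 1)

# Shift-register reservoir from Step 2: A moves the stored block of past
# inputs down by one slot of size n, c writes the new input on top.
A = np.zeros((N, N))
A[n:, :n * K] = np.eye(n * K)      # A_{i,i-n} = 1 for i = n+1, ..., N
c = np.zeros((N, n))
c[:n, :n] = np.eye(n)              # c_{i,i} = 1 for i = 1, ..., n

rng = np.random.default_rng(4)
T = 20
Z = rng.normal(size=(T, n))

x = np.zeros(N)
for t in range(T):
    x = A @ x + c @ Z[t]

# After at least K+1 steps the state equals (Z_t, Z_{t-1}, ..., Z_{t-K}),
# so any readout h(x_t) is a function of the last K+1 inputs, as in (17).
expected = np.concatenate([Z[T - 1 - j] for j in range(K + 1)])
print(np.allclose(x, expected))
```

Since $A$ is nilpotent, all its eigenvalues vanish and the echo state property holds trivially, matching the argument in the proof.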

Remark III.2. It is important to point out that the reservoir systems used in the proof of Proposition III.1 all have finite memory. Thus, this proof shows that it is possible to obtain universality in the $L^p$ sense with that type of finite memory systems and that, in particular, they can be used to approximate infinite memory filters. A key ingredient in this statement is, apart from the hypothesis (11), Lemma A.1 in the Appendix. The other universal systems introduced later on in the paper (trigonometric state-affine systems and echo state networks) also share this feature. Similar statements have also been proved for linear reservoir systems with polynomial readouts and state-affine systems with linear readouts in the $L^\infty$ setup for both deterministic and almost surely uniformly bounded stochastic inputs (see, for instance, [25, Corollary 11, Theorem 19]). This phenomenon has also been observed in the context of the approximation of deterministic filters using Volterra series operators (see [13, Theorems 3 and 4]).

Remark III.3. A simple situation in which condition (11) is satisfied is when for any $t \in \mathbb{Z}_-$ the random variable $\mathbf{Z}_t$ is bounded, i.e., for any $t \in \mathbb{Z}_-$ there exists $C_t \ge 0$ such that $\|\mathbf{Z}_t\| \le C_t$, $\mathbb{P}$-a.s. However, as the next remark shows, there are also practically relevant examples of input streams with unbounded support for which (11) is satisfied.

Remark III.4. A sufficient condition for (11) to hold is that the random variables $\{\mathbf{Z}_t : t \in \mathbb{Z}_-\}$ are independent and that for each $t$ there exists a constant $\alpha > 0$ such that $\mathbb{E}[\exp(\alpha \sum_{i=1}^{n} |Z_t^{(i)}|)] < \infty$. This last condition is satisfied, for instance, if $\mathbf{Z}_t$ is normally distributed. For input streams coming from more heavy-tailed distributions, like Student's $t$-distribution, the condition is not satisfied and so, if universality is needed, one should instead use the reservoir systems considered below (see Corollary III.8, Theorem III.9, and Theorem III.10).
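For the Gaussian case mentioned in Remark III.4, the exponential moment even has a closed form, $\mathbb{E}[\exp(\alpha |Z|)] = 2 e^{\alpha^2/2} \Phi(\alpha)$ for $Z \sim N(0,1)$, which a quick Monte Carlo sanity check reproduces (an illustration, not from the paper; $\alpha = 0.5$ is an arbitrary choice):

```python
import math
import numpy as np

rng = np.random.default_rng(5)
alpha = 0.5

def Phi(x):
    # Standard normal CDF via the error function (stdlib only).
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Closed form of E[exp(alpha*|Z|)] for Z ~ N(0,1): split the integral at 0
# and complete the square in each half to get 2*exp(alpha^2/2)*Phi(alpha).
exact = 2.0 * math.exp(alpha**2 / 2.0) * Phi(alpha)

# Monte Carlo estimate of the same exponential moment.
Z = rng.normal(size=1_000_000)
mc = np.mean(np.exp(alpha * np.abs(Z)))
print(mc, exact)
```

For a Student's $t$ input the analogous expectation is infinite for every $\alpha > 0$ (the tails decay only polynomially), which is why such inputs fall outside the scope of Proposition III.1.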

Remark III.5. Assumption (11) can be replaced by alternative assumptions, but it cannot be removed. Even if $n = 1$ and $\{Z_t : t \in \mathbb{Z}_-\}$ are independent and identically distributed with distribution $\nu$, a condition stronger than the existence of moments of all orders for $\nu$ is required. As a counterexample, one may take for $\nu$ a lognormal distribution. Then $\nu$ has moments of all orders, but (11) is not satisfied. Let us now argue that the approximation result proved under assumption (11) fails in this case. The following argument relies on results for the classical moment problem (see, for example, the collection of references in [34]).

Indeed, by [35] $\nu$ is not determinate (there exist other probability measures with identical moments) and thus (see e.g. [36, Theorem 4.3]) $\mathrm{Pol}_1$ is not dense in $L^p(\mathbb{R}, \nu)$ for $p \ge 2$. In particular, there exist $g \in L^p(\mathbb{R}, \nu)$ and $\varepsilon > 0$ such that $\|g - \tilde{h}\|_p > \varepsilon$ for all $\tilde{h} \in \mathrm{Pol}_1$. Suppose that we are in the case $n = 1$, let $\{Z_t : t \in \mathbb{Z}_-\}$ be independent and identically distributed with distribution $\nu$, and set $H(\mathbf{z}) := g(z_0)$ for $\mathbf{z} \in \mathbb{R}^{\mathbb{Z}_-}$. Then, for any choice of $N$, $A$, $c$, and $h$ one has $\mathbb{E}[H_h^{A,c}(\mathbf{Z})|\mathcal{F}_0] = \tilde{h}(Z_0)$, where $\tilde{h}(x) := \mathbb{E}[h(A\mathbf{X}_{-1} + cx)]$, $x \in \mathbb{R}$, is a polynomial. Thus one may use [33, Theorem 5.1.4] and the fact that by construction $H(\mathbf{Z})$ is $\mathcal{F}_0$-measurable to obtain
$$\|H(\mathbf{Z}) - H_h^{A,c}(\mathbf{Z})\|_p \ge \|\mathbb{E}[H(\mathbf{Z})|\mathcal{F}_0] - \mathbb{E}[H_h^{A,c}(\mathbf{Z})|\mathcal{F}_0]\|_p = \|g - \tilde{h}\|_p > \varepsilon.$$

Remark III.6. In the previous reservoir computing universality results for both deterministic and stochastic inputs quoted in the introduction, there was an important continuity hypothesis called the fading memory property that does not play a role here and that has been replaced by the integrability requirement $H \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$. In particular, the universality results that we just proved and those that come in the next section (see Theorem III.9) yield approximations for filters which do not necessarily have the fading memory property. Whether or not the approximation results apply depends on the integrability condition with respect to the input environment measure $\mu_{\mathbf{Z}}$. Consider, for example, the functional associated to the peak-hold operator [13]. In the discrete-time setting, the associated functional is
$$H(\mathbf{z}) = \sup_{t \le 0} \{z_t\}, \quad \text{with } \mathbf{z} \in \mathbb{R}^{\mathbb{Z}_-}.$$
We now show that the two possibilities $H \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$ and $H \notin L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$ are feasible, depending on the choice of $\mu_{\mathbf{Z}}$:

• Let $\mathbf{Z} = (Z_t)_{t \in \mathbb{Z}_-}$ be a sequence of one-dimensional independent and identically distributed (i.i.d.) random variables with unbounded support and denote by $\mu_{\mathbf{Z}}$ the law of $\mathbf{Z}$ on $\mathbb{R}^{\mathbb{Z}_-}$. Denoting by $F$ the common distribution function of the $Z_t$ and using the i.i.d. assumption, one calculates, for any $a \in \mathbb{R}$,
$$\mathbb{P}(H(\mathbf{Z}) > a) = 1 - \mathbb{P}\left(\cap_{t \le 0} \{Z_t \le a\}\right) = 1 - \lim_{n \to \infty} F(a)^n = 1.$$
Hence, we can conclude that $H(\mathbf{Z}) = \infty$, $\mu_{\mathbf{Z}}$-almost everywhere, and therefore $H \notin L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$.

• Consider now the same setup, but assume this time that the random variables have bounded support, that is, for some $a_{\max} \in \mathbb{R}$ one has $\mathbb{P}(Z_t \le a_{\max}) = 1$ and $\mathbb{P}(Z_t \le a) < 1$ for all $a < a_{\max}$ (that is, $a_{\max}$ is the essential supremum of $Z_t$). Then, the same argument shows that $H(\mathbf{Z}) = a_{\max}$, $\mu_{\mathbf{Z}}$-almost everywhere, and therefore $H \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$.
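The dichotomy in Remark III.6 can be glimpsed on truncated sample paths (an illustrative sketch, not a proof): the running maximum of an i.i.d. Gaussian stream never decreases as more of the past is included, consistent with $H(\mathbf{Z}) = \infty$ almost everywhere in the unbounded-support case, while clipping the same inputs to $[-1, 1]$ caps the running maximum at the essential supremum.

```python
import numpy as np

rng = np.random.default_rng(6)

# Unbounded support: the running maximum of i.i.d. N(0,1) samples is
# nondecreasing in the truncation length (and diverges as it grows).
z = rng.normal(size=100_000)
running_max = np.maximum.accumulate(z)
print(running_max[99], running_max[-1])

# Bounded support: clipping the inputs to [-1, 1] makes the running
# maximum saturate at the bound a_max = 1.
clipped_max = np.maximum.accumulate(np.clip(z, -1.0, 1.0))
print(clipped_max[-1])
```

This is only a finite-horizon picture; the almost-everywhere statements above concern the full left-infinite paths.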

Remark III.7. From the proof of Proposition III.1 one sees that one could replace $\mathrm{Pol}_N$ in its statement by any other family $\{\mathcal{H}_N\}_{N \in \mathbb{N}}$ that satisfies the density statement (14). In particular, the following corollary shows that this result can be obtained with readouts made out of neural networks.

Denote by $\mathcal{H}_N$ the set of feedforward neural networks with one hidden layer and inputs in $\mathbb{R}^N$ that are constructed with a fixed activation function $\sigma$. More specifically, $\mathcal{H}_N$ is made of functions $h: \mathbb{R}^N \to \mathbb{R}$ of the type
$$h(\mathbf{x}) = \sum_{j=1}^{k} \beta_j \sigma(\boldsymbol{\alpha}_j \cdot \mathbf{x} - \theta_j), \qquad (18)$$
for some $k \in \mathbb{N}$, $\beta_j, \theta_j \in \mathbb{R}$, and $\boldsymbol{\alpha}_j \in \mathbb{R}^N$, for $j = 1, \ldots, k$.
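A network of the form (18) is straightforward to evaluate; the sketch below (illustrative parameters, not from the paper) uses the logistic sigmoid, which is bounded and non-constant as required by Corollary III.8:

```python
import numpy as np

rng = np.random.default_rng(7)
N, k = 4, 16                     # input dimension, number of hidden units

# Random network of the form (18): h(x) = sum_j beta_j * sigma(alpha_j . x - theta_j).
alpha = rng.normal(size=(k, N))
beta = rng.normal(size=k)
theta = rng.normal(size=k)

def sigma(u):
    # Logistic sigmoid: bounded in (0, 1) and non-constant.
    return 1.0 / (1.0 + np.exp(-u))

def h(x):
    return float(beta @ sigma(alpha @ x - theta))

x = rng.normal(size=N)
print(h(x))
```

Since $\sigma$ takes values in $(0,1)$, any such network is bounded by $\sum_j |\beta_j|$, so composing it with the linear reservoir of Proposition III.1 always yields a $p$-integrable functional.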

Corollary III.8. In the setup of Proposition III.1, consider the family of neural networks $h \in \mathcal{H}_N$ constructed with a fixed activation function $\sigma$ that is bounded and non-constant. Then, for any $\varepsilon > 0$ there exist $N \in \mathbb{N}$, $A \in M_N$, $c \in M_{N,n}$, and a neural network $h \in \mathcal{H}_N$ such that the corresponding reservoir system (9) has the echo state property and has a unique causal and time-invariant filter associated to it. Moreover, the corresponding functional satisfies $H_h^{A,c}(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$ and
$$\|H(\mathbf{Z}) - H_h^{A,c}(\mathbf{Z})\|_p < \varepsilon. \qquad (19)$$

Proof. By [6, Theorem 1] the set $\mathcal{H}_N$ is dense in $L^p(\mathbb{R}^N, \mu)$ for any finite measure $\mu$ on $\mathbb{R}^N$. Thus, statement (14) holds with $\mathcal{H}_N$ replacing $\mathrm{Pol}_{n(K+1)}$. Mimicking line by line the proof of Step 2 in Proposition III.1 then proves the corollary.

B. Trigonometric state-afﬁne systems with linear readouts

Fix $M, N \in \mathbb{N}$ and consider $R: \mathbb{R}^n \to M_{N,M}$ defined by
$$R(\mathbf{z}) := \sum_{k=1}^{r} A_k \cos(\mathbf{u}_k \cdot \mathbf{z}) + B_k \sin(\mathbf{v}_k \cdot \mathbf{z}), \quad \mathbf{z} \in \mathbb{R}^n, \qquad (20)$$
for some $r \in \mathbb{N}$, $A_k, B_k \in M_{N,M}$, $\mathbf{u}_k, \mathbf{v}_k \in \mathbb{R}^n$, for $k = 1, \ldots, r$. The symbol $\mathrm{Trig}_{N,M}$ denotes the set of all functions of the type (20). We call the elements of $\mathrm{Trig}_{N,M}$ trigonometric polynomials.
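A matrix-valued trigonometric polynomial as in (20) can be evaluated directly (an illustrative sketch with arbitrary dimensions and coefficients):

```python
import numpy as np

rng = np.random.default_rng(8)
n, N, M, r = 3, 4, 4, 2          # input dim, matrix sizes, number of terms

# Coefficients of R(z) = sum_k A_k cos(u_k . z) + B_k sin(v_k . z).
A = rng.normal(size=(r, N, M))
B = rng.normal(size=(r, N, M))
u = rng.normal(size=(r, n))
v = rng.normal(size=(r, n))

def R(z):
    out = np.zeros((N, M))
    for k in range(r):
        out += A[k] * np.cos(u[k] @ z) + B[k] * np.sin(v[k] @ z)
    return out

z = rng.normal(size=n)
print(R(z).shape)
```

Note that, unlike ordinary matrix polynomials, $R$ is uniformly bounded in $\mathbf{z}$ (entrywise by $\sum_k |A_k| + |B_k|$), which is what makes these reservoir maps well suited to unbounded stochastic inputs.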

We now introduce reservoir systems with linear readouts and reservoir maps constructed using trigonometric polynomials: let $N \in \mathbb{N}$, $\mathbf{w} \in \mathbb{R}^N$, $P \in \mathrm{Trig}_{N,N}$, $Q \in \mathrm{Trig}_{N,1}$, and define, for any $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}_-}$, the system
$$\begin{cases} \mathbf{x}_t = P(\mathbf{z}_t)\mathbf{x}_{t-1} + Q(\mathbf{z}_t), & t \in \mathbb{Z}_-, \\ y_t = \mathbf{w}^\top \mathbf{x}_t, & t \in \mathbb{Z}_-. \end{cases} \qquad (21)$$
We call systems of this type trigonometric state-affine systems. When such a system has the echo state property and a unique causal and time-invariant solution for any input, we denote by $U_{\mathbf{w}}^{P,Q}$ the corresponding filter and by $H_{\mathbf{w}}^{P,Q}(\mathbf{z}) := y_0$ the associated functional. As in the previous section, we fix $p \in [1,\infty)$, an input process $Z$, and a functional $H$ such that $H(Z) \in L^p(\Omega, \mathcal{F}, P)$, and we are interested in approximating $H(Z)$ by systems of the form $H_{\mathbf{w}}^{P,Q}(Z)$. Again, we will write $H_{\mathbf{w}}^{P,Q}(Z) = Y_0$, where $Y_0$ is uniquely determined by the reservoir system with stochastic inputs
$$\begin{cases} \mathbf{X}_t = P(\mathbf{Z}_t)\mathbf{X}_{t-1} + Q(\mathbf{Z}_t), & t \in \mathbb{Z}_-, \\ Y_t = \mathbf{w}^\top \mathbf{X}_t, & t \in \mathbb{Z}_-. \end{cases} \qquad (22)$$
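A minimal sketch of how the recursion in (21) can be run on a finite input window follows. The state is initialized at zero, which the echo state property is what justifies in general; all dimensions and the randomly drawn coefficients below are arbitrary choices for illustration only.

```python
import numpy as np

def trig_poly(A_list, B_list, U, V):
    """Return z -> sum_k A_k cos(u_k . z) + B_k sin(v_k . z), cf. (20)."""
    def R(z):
        return sum(A * np.cos(u @ z) + B * np.sin(v @ z)
                   for A, B, u, v in zip(A_list, B_list, U, V))
    return R

rng = np.random.default_rng(1)
n, N, r = 2, 4, 3
P = trig_poly([rng.normal(size=(N, N)) for _ in range(r)],
              [rng.normal(size=(N, N)) for _ in range(r)],
              rng.normal(size=(r, n)), rng.normal(size=(r, n)))
Q = trig_poly([rng.normal(size=(N, 1)) for _ in range(r)],
              [rng.normal(size=(N, 1)) for _ in range(r)],
              rng.normal(size=(r, n)), rng.normal(size=(r, n)))
w = rng.normal(size=N)

# run x_t = P(z_t) x_{t-1} + Q(z_t), y_t = w^T x_t over a finite window
z = rng.normal(size=(50, n))     # z[0] is the oldest input, z[-1] plays z_0
x = np.zeros((N, 1))
for z_t in z:
    x = P(z_t) @ x + Q(z_t)
y0 = float(w @ x.flatten())      # the value of the associated functional
```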


Define $\mathcal{A}$ as the set of four-tuples $(N, \mathbf{w}, P, Q) \in \mathbb{N} \times \mathbb{R}^N \times \mathrm{Trig}_{N,N} \times \mathrm{Trig}_{N,1}$ whose associated systems (21) have the echo state property and whose unique solutions are causal and time-invariant. In particular, for such $(N, \mathbf{w}, P, Q)$ a reservoir functional $H_{\mathbf{w}}^{P,Q}$ associated to (21) exists.

Theorem III.9. Let $p \in [1,\infty)$ and let $Z$ be a fixed $\mathbb{R}^n$-valued input process. Denote by $\mathcal{L}_Z$ the set of reservoir functionals of the type (21) which are $p$-integrable, that is,
$$\mathcal{L}_Z := \{H_{\mathbf{w}}^{P,Q}(Z) : (N, \mathbf{w}, P, Q) \in \mathcal{A}\} \cap L^p(\Omega, \mathcal{F}, P).$$
Then $\mathcal{L}_Z$ is dense in $L^p(\Omega, \mathcal{F}_{-\infty}, P)$.

In particular, for any functional $H$ such that $H(Z) \in L^p(\Omega, \mathcal{F}, P)$ and any $\varepsilon > 0$, there exist $N \in \mathbb{N}$, $\mathbf{w} \in \mathbb{R}^N$, $P \in \mathrm{Trig}_{N,N}$, and $Q \in \mathrm{Trig}_{N,1}$ such that the system (21) has the echo state property and causal and time-invariant solutions. Moreover, $H_{\mathbf{w}}^{P,Q}(Z) \in L^p(\Omega, \mathcal{F}, P)$ and
$$\|H(Z) - H_{\mathbf{w}}^{P,Q}(Z)\|_p < \varepsilon. \qquad (23)$$
If the input process $Z$ is stationary, then
$$\|U_H(Z) - U_{\mathbf{w}}^{P,Q}(Z)\|_p < \varepsilon. \qquad (24)$$

Proof. We first argue that $\mathcal{L}_Z$ is a linear subspace of $L^p(\Omega, \mathcal{F}_{-\infty}, P)$. To do this we need to introduce some notation. Given $A \in \mathbb{M}_{N_1,M_1}$ and $B \in \mathbb{M}_{N_2,M_2}$, we denote by $A \oplus B \in \mathbb{M}_{N_1+N_2,M_1+M_2}$ their direct sum. Given $R$ as in (20), we define $R \oplus A \in \mathrm{Trig}_{N+N_1,M+M_1}$ by
$$R \oplus A(\mathbf{z}) := \sum_{k=1}^{r} A_k \oplus A \cos(\mathbf{u}_k \cdot \mathbf{z}) + B_k \oplus A \sin(\mathbf{v}_k \cdot \mathbf{z}),$$
and (with the analogous definition for $B \oplus R$) for $R_i \in \mathrm{Trig}_{N_i,M_i}$, $i = 1, 2$, we set
$$R_1 \oplus R_2 = R_1 \oplus 0_{N_2,M_2} + 0_{N_1,M_1} \oplus R_2.$$
One easily verifies that for $\lambda \in \mathbb{R}$ and $(N_i, \mathbf{w}_i, P_i, Q_i) \in \mathcal{A}$, $i = 1, 2$, one has
$$(N_1 + N_2, \mathbf{w}_1 \oplus \lambda\mathbf{w}_2, P_1 \oplus P_2, Q_1 \oplus Q_2) \in \mathcal{A},$$
$$H_{\mathbf{w}_1}^{P_1,Q_1}(Z) + \lambda H_{\mathbf{w}_2}^{P_2,Q_2}(Z) = H_{\mathbf{w}_1 \oplus \lambda\mathbf{w}_2}^{P_1 \oplus P_2, Q_1 \oplus Q_2}(Z).$$
This shows that $\mathcal{L}_Z$ is indeed a linear subspace of $L^p(\Omega, \mathcal{F}_{-\infty}, P)$.
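The linearity identity above can be sanity-checked numerically. The sketch below (not part of the original proof) uses two small state-affine systems with bounded trigonometric coefficient maps, runs all systems from a zero initial state on the same finite window, and compares the direct-sum system's output with the corresponding linear combination.

```python
import numpy as np

def run(P, Q, w, z_seq):
    # iterate x_t = P(z_t) x_{t-1} + Q(z_t) over the window, return w . x_0
    x = np.zeros(len(w))
    for z in z_seq:
        x = P(z) @ x + Q(z)
    return w @ x

rng = np.random.default_rng(2)
n, N1, N2 = 2, 3, 4
u1, u2 = rng.normal(size=n), rng.normal(size=n)
A1, A2 = rng.normal(size=(N1, N1)), rng.normal(size=(N2, N2))
b1, b2 = rng.normal(size=N1), rng.normal(size=N2)
P1 = lambda z: A1 * np.cos(u1 @ z); Q1 = lambda z: b1 * np.sin(u1 @ z)
P2 = lambda z: A2 * np.cos(u2 @ z); Q2 = lambda z: b2 * np.sin(u2 @ z)
w1, w2, lam = rng.normal(size=N1), rng.normal(size=N2), 0.7

# direct sums: block-diagonal P1 (+) P2 and stacked Q1 (+) Q2
P = lambda z: np.block([[P1(z), np.zeros((N1, N2))],
                        [np.zeros((N2, N1)), P2(z)]])
Q = lambda z: np.concatenate([Q1(z), Q2(z)])
w = np.concatenate([w1, lam * w2])

z_seq = rng.normal(size=(30, n))
lhs = run(P1, Q1, w1, z_seq) + lam * run(P2, Q2, w2, z_seq)
rhs = run(P, Q, w, z_seq)   # agrees with lhs up to rounding error
```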

Secondly, in order to show that $\mathcal{L}_Z$ is dense in $L^p(\Omega, \mathcal{F}_{-\infty}, P)$, it suffices to prove that if $F \in L^q(\Omega, \mathcal{F}_{-\infty}, P)$ satisfies $E[FH] = 0$ for all $H \in \mathcal{L}_Z$, then $F = 0$, $P$-almost surely. Here $q \in (1,\infty]$ is the Hölder conjugate exponent of $p$. This can be shown by contraposition. Suppose that $\mathcal{L}_Z$ is not dense in $L^p(\Omega, \mathcal{F}_{-\infty}, P)$. Since $\mathcal{L}_Z$ is a linear subspace, by the Hahn-Banach theorem there exists a bounded linear functional $\Lambda$ on $L^p(\Omega, \mathcal{F}_{-\infty}, P)$ such that $\Lambda(H) = 0$ for all $H \in \mathcal{L}_Z$ but $\Lambda \neq 0$; see, e.g., [37, Theorem 5.19]. Then by [37, Theorem 6.16] there exists $F \in L^q(\Omega, \mathcal{F}_{-\infty}, P)$ such that $\Lambda(H) = E[FH]$ for all $H \in L^p(\Omega, \mathcal{F}_{-\infty}, P)$ and $F \neq 0$, since $\Lambda \neq 0$. In particular, there exists $F \in L^q(\Omega, \mathcal{F}_{-\infty}, P) \setminus \{0\}$ such that $E[FH] = 0$ for all $H \in \mathcal{L}_Z$.

Thirdly, suppose that $F \in L^q(\Omega, \mathcal{F}_{-\infty}, P)$ satisfies
$$E[FH] = 0 \quad \text{for all } H \in \mathcal{L}_Z. \qquad (25)$$
If we show that $F = 0$, $P$-almost surely, then the statement in the theorem follows by the argument in the second step. In order to prove that $F = 0$, $P$-almost surely, we first show that (25) implies the following statement: for any $K \in \mathbb{N}$, any subset $I \subset I_K := \{0, \ldots, K\}$, and any $\mathbf{u}_0, \ldots, \mathbf{u}_K \in \mathbb{R}^n$ it holds that
$$E\Big[F \prod_{j \in I} \sin(\mathbf{u}_j \cdot \mathbf{Z}_{-j}) \prod_{k \in I_K \setminus I} \cos(\mathbf{u}_k \cdot \mathbf{Z}_{-k})\Big] = 0. \qquad (26)$$
We prove this claim by induction on $K \in \mathbb{N}$. For $K = 0$, one sets $Q_1(\mathbf{z}) := \cos(\mathbf{u}_0 \cdot \mathbf{z})$ and $Q_2(\mathbf{z}) := \sin(\mathbf{u}_0 \cdot \mathbf{z})$ and notices that $(1, 1, 0, Q_i) \in \mathcal{A}$. Moreover, since the sine and cosine functions are bounded, it is easy to see that $Q_i(\mathbf{Z}_0) = H_1^{0,Q_i}(Z) \in \mathcal{L}_Z$ for $i \in \{1, 2\}$. Thus (25) implies (26) and so the statement holds for $K = 0$. For the induction step, let $K \in \mathbb{N} \setminus \{0\}$ and assume the implication holds for $K - 1$. We now fix $I$ and $\mathbf{u}_0, \ldots, \mathbf{u}_K \in \mathbb{R}^n$ as above and prove (26). To simplify the notation we define, for $k \in \{0, \ldots, K\}$ and $\mathbf{z} \in \mathbb{R}^n$, the function $g_k$ by
$$g_k(\mathbf{z}) := \begin{cases} \sin(\mathbf{u}_k \cdot \mathbf{z}), & \text{if } k \in I, \\ \cos(\mathbf{u}_k \cdot \mathbf{z}), & \text{if } k \in I_K \setminus I. \end{cases}$$

To prove (26), we set $N := K + 1$ and, for $j \in \{1, \ldots, K\}$, define $A_j \in \mathbb{M}_N$ with all entries equal to $0$ except $(A_j)_{j+1,j} = 1$, that is, $(A_j)_{k,l} = \delta_{k,j+1}\delta_{l,j}$, $k, l \in \{1, \ldots, N\}$. Define now, for $\mathbf{z} \in \mathbb{R}^n$,
$$P(\mathbf{z}) := \sum_{j=0}^{K-1} A_{K-j}\, g_j(\mathbf{z}), \qquad Q(\mathbf{z}) := \mathbf{e}_1 g_K(\mathbf{z}), \qquad \mathbf{w} := \mathbf{e}_{K+1}, \qquad (27)$$
where $\mathbf{e}_j$ is the $j$-th unit vector in $\mathbb{R}^N$, that is, the only non-zero entry of $\mathbf{e}_j$ is a $1$ in the $j$-th coordinate. By Lemma A.2 in the appendix, one has $A_{j_L} \cdots A_{j_0} = 0$ for any $j_0, \ldots, j_L \in \{1, \ldots, K\}$ and $L \geq K$, since $j_L = j_0 + L$ cannot be satisfied. In other words, any product of more than $K$ factors of matrices $A_j$ is equal to $0$, and thus for any $L \in \mathbb{N}$ with $L \geq K$ and any $\mathbf{z}_0, \ldots, \mathbf{z}_L \in \mathbb{R}^n$ one has $P(\mathbf{z}_0) \cdots P(\mathbf{z}_L) = 0$. Using this fact and iterating (21), one obtains that the trigonometric state-affine system defined by the elements in (27) has a unique solution given by
$$\mathbf{x}_t = Q(\mathbf{z}_t) + \sum_{j=1}^{K} P(\mathbf{z}_t) \cdots P(\mathbf{z}_{t-j+1}) Q(\mathbf{z}_{t-j}). \qquad (28)$$
In particular, $(N, \mathbf{w}, P, Q) \in \mathcal{A}$ and
$$H_{\mathbf{w}}^{P,Q}(Z) = \mathbf{w}^\top \mathbf{X}_0 = \mathbf{w}^\top \Big( Q(\mathbf{Z}_0) + \sum_{j=1}^{K} P(\mathbf{Z}_0) \cdots P(\mathbf{Z}_{-j+1}) Q(\mathbf{Z}_{-j}) \Big). \qquad (29)$$
The finiteness of the sum in (29) and the boundedness of the trigonometric polynomials imply that $H_{\mathbf{w}}^{P,Q}(Z) \in \mathcal{L}_Z$.
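The nilpotent construction (27) can be verified numerically: with the shift matrices $A_j$, the functional (29) evaluates to exactly the product $\prod_{k=0}^{K} g_k(\mathbf{Z}_{-k})$ appearing in (26). The sketch below (arbitrary $K$, $\mathbf{u}_k$, and inputs, not values from the paper) checks this by running the recursion (21) from a zero state over a window of length $K+1$, which reproduces (28) since longer products of $P$'s vanish.

```python
import numpy as np

K = 3
N = K + 1
rng = np.random.default_rng(3)

# shift matrices (A_j)_{k,l} = delta_{k,j+1} delta_{l,j} (1-based indices)
A = {j: np.zeros((N, N)) for j in range(1, K + 1)}
for j in A:
    A[j][j, j - 1] = 1.0           # row j+1, column j in 1-based terms

# g_0, ..., g_K: bounded scalar functions (sines/cosines with random u_k)
u = rng.normal(size=(K + 1, 2))
g = [(lambda z, uk=u[k], s=(k % 2): np.sin(uk @ z) if s else np.cos(uk @ z))
     for k in range(K + 1)]

def P(z):
    return sum(A[K - j] * g[j](z) for j in range(K))

e1, eK1 = np.eye(N)[0], np.eye(N)[K]
Z = rng.normal(size=(K + 1, 2))    # Z[k] plays the role of Z_{-k}

# iterate x_t = P(z_t) x_{t-1} + Q(z_t), feeding Z_{-K}, ..., Z_0 in order
x = np.zeros(N)
for k in reversed(range(K + 1)):
    x = P(Z[k]) @ x + e1 * g[K](Z[k])   # Q(z) = e_1 g_K(z)
H = eK1 @ x                        # = w^T x_0 with w = e_{K+1}

target = np.prod([g[k](Z[k]) for k in range(K + 1)])
# H and target agree up to rounding error
```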

We conclude the proof of the induction step with the following chain of equalities, which uses (25) in the first one, the representation (29) in the second one, and the choice of the vector $\mathbf{w}$ and the induction hypothesis in the last step:
$$0 = E[F H_{\mathbf{w}}^{P,Q}(Z)] = E[F \mathbf{w}^\top Q(\mathbf{Z}_0)] + E\Big[F \mathbf{w}^\top \sum_{j=1}^{K} P(\mathbf{Z}_0) \cdots P(\mathbf{Z}_{-j+1}) Q(\mathbf{Z}_{-j})\Big] = E[F \mathbf{w}^\top P(\mathbf{Z}_0) \cdots P(\mathbf{Z}_{-K+1}) Q(\mathbf{Z}_{-K})]. \qquad (30)$$
However, again by Lemma A.2 in the appendix, the only non-zero product of matrices $A_{j_{K-1}} \cdots A_{j_0}$ for $j_0, \ldots, j_{K-1} \in \{1, \ldots, K\}$ occurs when $j_k = k + 1$ for $k \in \{0, \ldots, K-1\}$. Therefore,
$$P(\mathbf{Z}_0) \cdots P(\mathbf{Z}_{-K+1}) = A_K g_0(\mathbf{Z}_0) A_{K-1} g_1(\mathbf{Z}_{-1}) \cdots A_1 g_{K-1}(\mathbf{Z}_{-K+1}).$$
Combining this with (30) and using the identity (49) in Lemma A.2 in the appendix, one obtains
$$0 = E\Big[F \mathbf{e}_{K+1}^\top A_K \cdots A_1 \mathbf{e}_1 \prod_{k=0}^{K} g_k(\mathbf{Z}_{-k})\Big] = E\Big[F \prod_{k=0}^{K} g_k(\mathbf{Z}_{-k})\Big],$$
which is the same as (26).

Fourthly, by standard trigonometric identities, the identity (26) established in the third step implies that, for any $K \in \mathbb{N}$,
$$E\Big[F \exp\Big(i \sum_{j=0}^{K} \mathbf{u}_j \cdot \mathbf{Z}_{-j}\Big)\Big] = 0 \quad \text{for all } \mathbf{u}_0, \ldots, \mathbf{u}_K \in \mathbb{R}^n. \qquad (31)$$

We claim that (31) implies $F = 0$, $P$-almost surely, and hence the statement in the theorem follows. This fact is a consequence of the uniqueness theorem for characteristic functions (which is ultimately a consequence of the Stone-Weierstrass approximation theorem); see for instance [30, Theorem 4.3] and the text below that result. To prove $F = 0$, $P$-almost surely, we denote by $F^+$ and $F^-$ the positive and negative parts of $F$. By (31) one necessarily has $E[F] = 0$. Thus, if it does not hold that $F = 0$, $P$-almost surely, then $c := E[F^+] = E[F^-] > 0$ and one may define probability measures $Q^+$ and $Q^-$ on $(\Omega, \mathcal{F})$ by setting $Q^+(A) := c^{-1}E[F^+ \mathbb{1}_A]$ and $Q^-(A) := c^{-1}E[F^- \mathbb{1}_A]$ for $A \in \mathcal{F}$. Denote by $\mu_K^+$ and $\mu_K^-$ the laws in $\mathbb{R}^{n(K+1)}$ of the random variable
$$\mathbf{Z}^K := (\mathbf{Z}_0^\top, \mathbf{Z}_{-1}^\top, \ldots, \mathbf{Z}_{-K}^\top)^\top$$
under $Q^+$ and $Q^-$, respectively. Then, the statement (31) implies that for all $\mathbf{u} \in \mathbb{R}^{n(K+1)}$,
$$\int_{\mathbb{R}^{n(K+1)}} \exp(i\mathbf{u} \cdot \mathbf{z})\, \mu_K^+(d\mathbf{z}) = \int_{\mathbb{R}^{n(K+1)}} \exp(i\mathbf{u} \cdot \mathbf{z})\, \mu_K^-(d\mathbf{z}).$$
By the uniqueness theorem for characteristic functions (see, e.g., [30, Theorem 4.3] and the text below), this implies that $\mu_K^+ = \mu_K^-$. Translating this statement back to random variables, this means that for any bounded and measurable function $g: \mathbb{R}^{n(K+1)} \to \mathbb{R}$ one has
$$0 = cE_{Q^+}[g(\mathbf{Z}^K)] - cE_{Q^-}[g(\mathbf{Z}^K)] = E[F g(\mathbf{Z}^K)],$$
which, by definition, means that $E[F \mid \mathcal{F}_{-K}] = 0$, $P$-almost surely. Since $K \in \mathbb{N}$ was arbitrary and $F \in L^1(\Omega, \mathcal{F}_{-\infty}, P)$, one may combine this with $\lim_{t \to -\infty} E[F \mid \mathcal{F}_t] = F$, $P$-almost surely (see Lemma A.1), to conclude $F = 0$, as desired.

The statement (24) under the stationarity hypothesis for $Z$ is a straightforward consequence of (7) and the equality (8).

We emphasize that the use in the proof of the theorem of nilpotent matrices of the type introduced in Lemma A.2 ensures that the echo state property is automatically satisfied (see (28)).

C. Echo state networks

We now turn to showing the universality in the $L^p$ sense of the most widely used reservoir systems with linear readouts, namely, echo state networks. An echo state network is a RC system determined by
$$\begin{cases} \mathbf{x}_t = \sigma(A\mathbf{x}_{t-1} + C\mathbf{z}_t + \boldsymbol{\zeta}), \\ y_t = \mathbf{w}^\top \mathbf{x}_t, \end{cases} \qquad (32)$$
for $A \in \mathbb{M}_N$, $C \in \mathbb{M}_{N,n}$, $\boldsymbol{\zeta} \in \mathbb{R}^N$, and $\mathbf{w} \in \mathbb{R}^N$. As is customary in the neural networks literature, the map $\sigma: \mathbb{R}^N \to \mathbb{R}^N$ is obtained via the componentwise application of a given activation function $\sigma: \mathbb{R} \to \mathbb{R}$ that is denoted with the same symbol.

If this system has the echo state property and the resulting filter is causal and time-invariant, we denote by $H_{\mathbf{w}}^{A,C,\zeta}(\mathbf{z}) := y_0$ the associated functional.
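For illustration, the recursion (32) can be simulated on a finite input window as below. Starting from a zero state is an approximation that the echo state property renders harmless for long enough windows; scaling the spectral radius of $A$ below one is a common practical heuristic, not a condition taken from the results of this paper, and all sizes are arbitrary.

```python
import numpy as np

def esn_functional(A, C, zeta, w, z_window, sigma=np.tanh):
    """Run x_t = sigma(A x_{t-1} + C z_t + zeta) over a finite window
    and return y_0 = w . x_0, cf. (32).  z_window[0] is the oldest input."""
    x = np.zeros(len(w))
    for z_t in z_window:
        x = sigma(A @ x + C @ z_t + zeta)
    return w @ x

rng = np.random.default_rng(4)
N, n, T = 20, 2, 200
A = rng.normal(size=(N, N))
A *= 0.9 / max(abs(np.linalg.eigvals(A)))   # scale spectral radius to 0.9
C = rng.normal(size=(N, n))
zeta = rng.normal(size=N)
w = rng.normal(size=N)

z = rng.normal(size=(T, n))
y0 = esn_functional(A, C, zeta, w, z)
```

Since the state entries are bounded by one in absolute value for the `tanh` activation, the output is bounded by the $\ell^1$ norm of the readout vector.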

Theorem III.10. Fix $p \in [1,\infty)$, let $Z$ be a fixed $\mathbb{R}^n$-valued input process, and let $H$ be a functional such that $H(Z) \in L^p(\Omega, \mathcal{F}, P)$. Suppose that the activation function $\sigma: \mathbb{R} \to \mathbb{R}$ is non-constant, continuous, and has a bounded image. Then for any $\varepsilon > 0$ there exist $N \in \mathbb{N}$, $C \in \mathbb{M}_{N,n}$, $\boldsymbol{\zeta} \in \mathbb{R}^N$, $A \in \mathbb{M}_N$, and $\mathbf{w} \in \mathbb{R}^N$ such that (32) has the echo state property, the corresponding filter is causal and time-invariant, and the associated functional satisfies $H_{\mathbf{w}}^{A,C,\zeta}(Z) \in L^p(\Omega, \mathcal{F}, P)$ and
$$\|H(Z) - H_{\mathbf{w}}^{A,C,\zeta}(Z)\|_p < \varepsilon. \qquad (33)$$

Proof. First, by Corollary III.8 and (17) there exist $K, N \in \mathbb{N}$, $\mathbf{w} \in \mathbb{R}^N$, $A \in \mathbb{M}_{N,n(K+1)}$, and $\boldsymbol{\zeta} \in \mathbb{R}^N$ such that the neural network
$$h(\mathbf{z}) = \mathbf{w}^\top \sigma(A\mathbf{z} + \boldsymbol{\zeta})$$
satisfies
$$\|H(Z) - h(\mathbf{Z}_0^\top, \ldots, \mathbf{Z}_{-K}^\top)\|_p < \frac{\varepsilon}{2}. \qquad (34)$$
Notice that we may rewrite $A$ as
$$A = [A^{(0)}\ A^{(-1)}\ \cdots\ A^{(-K)}]$$
with $A^{(j)} \in \mathbb{M}_{N,n}$ and
$$H^\infty(Z) := h(\mathbf{Z}_0^\top, \ldots, \mathbf{Z}_{-K}^\top) = \mathbf{w}^\top \sigma\Big(\sum_{j=0}^{K} A^{(-j)}\mathbf{Z}_{-j} + \boldsymbol{\zeta}\Big). \qquad (35)$$

Second, by the neural network approximation theorem for continuous functions [6, Theorem 2], for any $m \in \mathbb{N}$ there exists a neural network that uniformly approximates the identity mapping on the hypercube $B_m := \{\mathbf{x} \in \mathbb{R}^n : |x_i| \leq m \text{ for } i = 1, \ldots, n\}$. More specifically, [6, Theorem 2] is formulated for $\mathbb{R}$-valued mappings and we hence apply it componentwise: for any $m \in \mathbb{N}$ and $i = 1, \ldots, n$ there exist $N_i^{(m)} \in \mathbb{N}$, $\mathbf{w}_i^{(m)} \in \mathbb{R}^{N_i^{(m)}}$, $A_i^{(m)} \in \mathbb{M}_{N_i^{(m)},n}$, and $\boldsymbol{\zeta}_i^{(m)} \in \mathbb{R}^{N_i^{(m)}}$ such that for all $i = 1, \ldots, n$ the neural network
$$h_i^{(m)}(\mathbf{x}) = (\mathbf{w}_i^{(m)})^\top \sigma\big(A_i^{(m)}\mathbf{x} + \boldsymbol{\zeta}_i^{(m)}\big)$$
satisfies
$$\sup_{\mathbf{x} \in B_m} \{|h_i^{(m)}(\mathbf{x}) - x_i|\} < \frac{1}{m}. \qquad (36)$$

Write $h^{(m)}(\mathbf{x}) = (h_1^{(m)}(\mathbf{x}), \ldots, h_n^{(m)}(\mathbf{x}))^\top$ and, for $j = 1, \ldots, K$, denote by $[h^{(m)}]^j = h^{(m)} \circ \cdots \circ h^{(m)}$ the $j$-fold composition of $h^{(m)}$. We now claim that for all $j = 1, \ldots, K$ and $\mathbf{x} \in \mathbb{R}^n$ it holds that
$$\lim_{m \to \infty} [h^{(m)}]^j(\mathbf{x}) = \mathbf{x}. \qquad (37)$$
Indeed, let us fix $\mathbf{x} \in \mathbb{R}^n$ and argue by induction on $j$. To prove (37) for $j = 1$, let $\varepsilon > 0$ be given and choose $m_0 \in \mathbb{N}$ satisfying $m_0 > \max\{|x_1|, \ldots, |x_n|, 1/\varepsilon\}$. Then, for any $m \geq m_0$ one has $\mathbf{x} \in B_m$ by definition and (36) implies that for $i = 1, \ldots, n$,
$$|h_i^{(m)}(\mathbf{x}) - x_i| < \frac{1}{m} < \varepsilon.$$
Hence (37) indeed holds for $j = 1$. Now let $j \geq 2$ and assume that (37) has been proved for $j - 1$. Define $\mathbf{x}^{(m)} := [h^{(m)}]^{j-1}(\mathbf{x})$. Then, by the induction hypothesis, for any given $\varepsilon > 0$ one finds $m_0 \in \mathbb{N}$ such that for all $m \geq m_0$ and $i = 1, \ldots, n$ it holds that
$$|x_i^{(m)} - x_i| < \frac{\varepsilon}{2}. \qquad (38)$$
Hence, choosing $\tilde{m}_0 \in \mathbb{N}$ with $\tilde{m}_0 > \max(m_0, |x_1| + \frac{\varepsilon}{2}, \ldots, |x_n| + \frac{\varepsilon}{2}, 2/\varepsilon)$, one obtains from the triangle inequality and (38) that $\mathbf{x}^{(m)} \in B_{\tilde{m}_0}$ for all $m \geq \tilde{m}_0$. In particular, for any $m \geq \tilde{m}_0$ one may use the triangle inequality in the first step, $\mathbf{x}^{(m)} \in B_{\tilde{m}_0} \subset B_m$ and (38) in the second step, and (36) in the last step to estimate
$$|[h^{(m)}]_i^j(\mathbf{x}) - x_i| \leq |h_i^{(m)}(\mathbf{x}^{(m)}) - x_i^{(m)}| + |x_i^{(m)} - x_i| \leq \sup_{\mathbf{y} \in B_m} \{|h_i^{(m)}(\mathbf{y}) - y_i|\} + \frac{\varepsilon}{2} < \frac{1}{m} + \frac{\varepsilon}{2} < \varepsilon.$$
This proves (37) for all $j = 1, \ldots, K$.

Thirdly, define
$$H_m(Z) := \mathbf{w}^\top \sigma\Big(\sum_{j=0}^{K} A^{(-j)}[h^{(m)}]^j(\mathbf{Z}_{-j}) + \boldsymbol{\zeta}\Big)$$
with the convention $[h^{(m)}]^0(\mathbf{x}) = \mathbf{x}$. Since $\sigma$ is continuous, (37) implies that $\lim_{m \to \infty} H_m(Z) = H^\infty(Z)$, $P$-almost surely, where $H^\infty$ was defined in (35). Furthermore, by assumption there exists $C > 0$ such that $|\sigma(x)| \leq C$ for all $x \in \mathbb{R}$. Hence one has $|H^\infty(Z) - H_m(Z)|^p \leq (2C\sum_{i=1}^{N}|w_i|)^p$ for all $m \in \mathbb{N}$. Thus one may apply the dominated convergence theorem to obtain
$$\lim_{m \to \infty} \|H^\infty(Z) - H_m(Z)\|_p = \lim_{m \to \infty} E[|H^\infty(Z) - H_m(Z)|^p]^{1/p} = 0.$$
In particular, for $m \in \mathbb{N}$ large enough one has $\|H^\infty(Z) - H_m(Z)\|_p < \frac{\varepsilon}{2}$ and, combining this with the triangle inequality and (34), one obtains
$$\|H(Z) - H_m(Z)\|_p \leq \|H(Z) - H^\infty(Z)\|_p + \|H^\infty(Z) - H_m(Z)\|_p < \varepsilon. \qquad (39)$$

To conclude the proof, we now fix $m \in \mathbb{N}$ large enough (so that (39) holds) and show that $H_m(Z) = H_{\mathbf{w}}^{A,C,\zeta}(Z)$ for suitable choices of $A$, $C$, $\boldsymbol{\zeta}$, and $\mathbf{w}$. To do so, first define $N_J := N_1^{(m)} + \cdots + N_n^{(m)}$ and the block matrices
$$W_J := \begin{pmatrix} (\mathbf{w}_1^{(m)})^\top & & 0 \\ & \ddots & \\ 0 & & (\mathbf{w}_n^{(m)})^\top \end{pmatrix} \in \mathbb{M}_{n,N_J}, \quad \boldsymbol{\zeta}_J := \begin{pmatrix} \boldsymbol{\zeta}_1^{(m)} \\ \vdots \\ \boldsymbol{\zeta}_n^{(m)} \end{pmatrix} \in \mathbb{R}^{N_J}, \quad A_J := \begin{pmatrix} A_1^{(m)} \\ \vdots \\ A_n^{(m)} \end{pmatrix} \in \mathbb{M}_{N_J,n}.$$
Furthermore, to emphasize that $m$ is fixed and $h^{(m)}$ approximates the identity, set $J(\mathbf{x}) := h^{(m)}(\mathbf{x})$ and note that
$$J(\mathbf{x}) = W_J \sigma(A_J \mathbf{x} + \boldsymbol{\zeta}_J). \qquad (40)$$

Now set $\bar{N} := KN_J + N$ and define the block matrix $A \in \mathbb{M}_{\bar{N}}$ by
$$A = \begin{pmatrix} 0_{N_J,N_J} & & & & \\ A_J W_J & 0_{N_J,N_J} & & & \\ & \ddots & \ddots & & \\ & & A_J W_J & 0_{N_J,N_J} & \\ A^{(-1)}W_J & A^{(-2)}W_J & \cdots & A^{(-K)}W_J & 0_{N,N} \end{pmatrix},$$
and $\boldsymbol{\zeta} \in \mathbb{R}^{\bar{N}}$, $C \in \mathbb{M}_{\bar{N},n}$, and $\mathbf{w} \in \mathbb{R}^{\bar{N}}$ by
$$\boldsymbol{\zeta} := \begin{pmatrix} \boldsymbol{\zeta}_J \\ \vdots \\ \boldsymbol{\zeta}_J \\ \boldsymbol{\zeta} \end{pmatrix}, \qquad C := \begin{pmatrix} A_J \\ 0 \\ \vdots \\ 0 \\ A^{(0)} \end{pmatrix}, \qquad \mathbf{w} := \begin{pmatrix} 0_{KN_J,1} \\ \mathbf{w} \end{pmatrix}.$$


Furthermore, we partition the reservoir states $\mathbf{x}_t$ of the corresponding echo state system as
$$\mathbf{x}_t := \begin{pmatrix} \mathbf{x}_t^{(1)} \\ \vdots \\ \mathbf{x}_t^{(K+1)} \end{pmatrix},$$
with $\mathbf{x}_t^{(j)} \in \mathbb{R}^{N_J}$ for $j \leq K$ and $\mathbf{x}_t^{(K+1)} \in \mathbb{R}^N$. With this notation for $\mathbf{x}_t$ and these choices of matrices, the recursions associated to the echo state reservoir map in (32) read as
$$\mathbf{x}_t^{(1)} = \sigma(A_J \mathbf{z}_t + \boldsymbol{\zeta}_J), \qquad (41)$$
$$\mathbf{x}_t^{(j)} = \sigma(A_J W_J \mathbf{x}_{t-1}^{(j-1)} + \boldsymbol{\zeta}_J), \quad \text{for } j = 2, \ldots, K, \qquad (42)$$
$$\mathbf{x}_t^{(K+1)} = \sigma\Big(\sum_{j=1}^{K} A^{(-j)} W_J \mathbf{x}_{t-1}^{(j)} + A^{(0)}\mathbf{z}_t + \boldsymbol{\zeta}\Big). \qquad (43)$$
By iteratively inserting (42) into itself and using (41), one obtains (recall the definition of $J$ in (40)) that the unique solution to (42) is given by
$$\mathbf{x}_t^{(j)} = \sigma(A_J [J]^{j-1}(\mathbf{z}_{t-j+1}) + \boldsymbol{\zeta}_J). \qquad (44)$$
More formally, one uses induction on $j$: for $j = 1$ the two expressions (44) and (41) coincide. For $j = 2, \ldots, K$ one inserts (44) for $j - 1$ (which holds by the induction hypothesis) into (42) to obtain
$$\mathbf{x}_t^{(j)} = \sigma(A_J W_J \sigma(A_J [J]^{j-2}(\mathbf{z}_{t-j+1}) + \boldsymbol{\zeta}_J) + \boldsymbol{\zeta}_J) = \sigma(A_J [J]^{j-1}(\mathbf{z}_{t-j+1}) + \boldsymbol{\zeta}_J),$$
which is indeed (44). Finally, combining (44) and (43), one obtains
$$y_t = \mathbf{w}^\top \mathbf{x}_t^{(K+1)} = \mathbf{w}^\top \sigma\Big(\sum_{j=1}^{K} A^{(-j)} W_J \mathbf{x}_{t-1}^{(j)} + A^{(0)}\mathbf{z}_t + \boldsymbol{\zeta}\Big) = \mathbf{w}^\top \sigma\Big(\sum_{j=1}^{K} A^{(-j)} [J]^j(\mathbf{z}_{t-j}) + A^{(0)}\mathbf{z}_t + \boldsymbol{\zeta}\Big).$$
The statement (44) shows, in particular, that the echo state network associated to $A$, $C$, $\boldsymbol{\zeta}$, and $\mathbf{w}$ satisfies the echo state property. Moreover, inserting $t = 0$ in the previous equality and comparing with the definition of $H_m(Z)$, one sees that indeed $H_m(Z) = H_{\mathbf{w}}^{A,C,\zeta}(Z)$. The approximation statement (33) therefore follows from (39).
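The cascade block construction above can be checked numerically. The algebraic identity between the block ESN and the direct evaluation of $H_m$ does not require $J$ to approximate the identity, so the sketch below uses arbitrary randomly drawn one-layer maps $J$ and, for simplicity, gives all per-coordinate blocks equal size (in the proof the sizes $N_i^{(m)}$ may differ). Everything here is illustrative, not part of the original argument.

```python
import numpy as np

rng = np.random.default_rng(5)
n, N, K, Nj = 2, 6, 3, 4          # Nj: hidden size per input coordinate
sigma = np.tanh

# per-coordinate one-layer maps; J(x) = W_J sigma(A_J x + zeta_J), cf. (40)
NJ = n * Nj
W_J, A_J = np.zeros((n, NJ)), np.zeros((NJ, n))
zeta_J = rng.normal(size=NJ)
for i in range(n):
    W_J[i, i * Nj:(i + 1) * Nj] = rng.normal(size=Nj)
    A_J[i * Nj:(i + 1) * Nj, :] = rng.normal(size=(Nj, n))
J = lambda x: W_J @ sigma(A_J @ x + zeta_J)

# outer network data from the first step of the proof: A^(0), ..., A^(-K)
Ablk = [rng.normal(size=(N, n)) for _ in range(K + 1)]
zeta_out, w_out = rng.normal(size=N), rng.normal(size=N)

# big block matrices of the cascade ESN, cf. the displayed definitions
Nbar = K * NJ + N
A = np.zeros((Nbar, Nbar))
for j in range(1, K):             # A_J W_J on the subdiagonal blocks, (42)
    A[j * NJ:(j + 1) * NJ, (j - 1) * NJ:j * NJ] = A_J @ W_J
for j in range(1, K + 1):         # last block row: A^(-j) W_J, cf. (43)
    A[K * NJ:, (j - 1) * NJ:j * NJ] = Ablk[j] @ W_J
C = np.vstack([A_J, np.zeros(((K - 1) * NJ, n)), Ablk[0]])
zeta = np.concatenate([np.tile(zeta_J, K), zeta_out])
w = np.concatenate([np.zeros(K * NJ), w_out])

# run the cascade ESN on a window long enough to wash out the zero start
z = rng.normal(size=(K + 6, n))   # z[-1] plays the role of z_0
x = np.zeros(Nbar)
for z_t in z:
    x = sigma(A @ x + C @ z_t + zeta)
y_esn = w @ x

# direct evaluation of H_m at time 0
acc = Ablk[0] @ z[-1] + zeta_out
v = z.copy()
for j in range(1, K + 1):
    v = np.array([J(vi) for vi in v])   # now v holds [J]^j of each input
    acc += Ablk[j] @ v[-1 - j]
y_direct = w_out @ sigma(acc)     # agrees with y_esn up to rounding error
```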

Remark III.11. In this paper we measure closeness between filters and functionals in an $L^p$ sense. As we already pointed out in Remark III.6, this choice allows us to approximate, with the systems used in this paper, measurable filters that, unlike in the $L^\infty$ case, do not necessarily satisfy the fading memory property. Therefore, an interesting aspect of the universality results in Proposition III.1, Corollary III.8, Theorem III.9, and Theorem III.10 is that it is possible to approximately simulate any measurable filter, even one without the fading memory property, using the reservoir systems introduced in those results, which do satisfy it.

Remark III.12. The results presented in this article address the approximation capabilities of echo state networks and other reservoir computing systems. When these systems are used in practice, not all of their parameters are trained. For example, the recurrent connections of ESNs do not usually undergo a training process; that is, the architecture parameters $A$, $C$, $\boldsymbol{\zeta}$ are randomly drawn from a distribution and only the readout $\mathbf{w}$ is trained by linear regression so as to optimally fit the given teaching signal. Subsequently, an optimization over a few hyperparameters (for instance, the spectral radius of $A$) is carried out. In addition, in many situations the same reservoir matrix $A$ can be used for different input time series and different learning tasks, and only the input-to-reservoir parameters $C$, $\boldsymbol{\zeta}$ and the readout $\mathbf{w}$ need to be modified (see, for instance, the approach taken in [38], [39] to define time series kernels). This feature is key in the implementation of the notion of multi-tasking in the RC context (see [10]). Thus, the empirically observed robustness of ESNs with respect to these parameter choices is not entirely explained by the universality results presented here. While in the static setting of feedforward neural networks such questions have already been tackled (see, for instance, [40]), for echo state networks a full explanation is not available yet and these questions are the subject of ongoing research.
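The practical workflow described in this remark (randomly drawn reservoir, readout fitted by regularized linear regression) can be sketched as follows. All sizes, the washout length, the ridge penalty, and the toy task are arbitrary illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def train_esn_readout(z, y_target, N=100, rho=0.9, ridge=1e-6, seed=0):
    """Randomly drawn reservoir (A, C, zeta); only the linear readout w
    is fitted, by ridge regression on the collected reservoir states."""
    rng = np.random.default_rng(seed)
    n = z.shape[1]
    A = rng.normal(size=(N, N))
    A *= rho / max(abs(np.linalg.eigvals(A)))   # spectral-radius hyperparameter
    C = rng.normal(size=(N, n))
    zeta = rng.normal(size=N)

    # collect reservoir states (an initial washout period is discarded)
    T, washout = len(z), 50
    X, x = np.zeros((T, N)), np.zeros(N)
    for t in range(T):
        x = np.tanh(A @ x + C @ z[t] + zeta)
        X[t] = x
    Xw, yw = X[washout:], y_target[washout:]

    # ridge regression: w = (X^T X + lambda I)^{-1} X^T y
    w = np.linalg.solve(Xw.T @ Xw + ridge * np.eye(N), Xw.T @ yw)
    return A, C, zeta, w, X

# toy task: one-step-ahead prediction of a scalar AR(2)-type signal
rng = np.random.default_rng(1)
T = 500
eps = rng.normal(size=T)
s = np.zeros(T)
for t in range(2, T):
    s[t] = 0.5 * s[t - 1] - 0.3 * s[t - 2] + eps[t]
z, y = s[:-1].reshape(-1, 1), s[1:]
A, C, zeta, w, X = train_esn_readout(z, y)
```

Because only $\mathbf{w}$ is optimized, training reduces to solving one linear system, which is what makes the scheme viable for large datasets.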

D. An alternative viewpoint

So far, all the universality results have been formulated for functionals and filters with random inputs. Equivalently, we may formulate them as $L^p$-approximation results on the sequence space $(\mathbb{R}^n)^{\mathbb{Z}_-}$ endowed with any measure $\mu$ that makes the filter that we want to approximate $p$-integrable.

Theorem III.13. Let $H: (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}$ be a measurable functional. Then, for any probability measure $\mu$ on $(\mathbb{R}^n)^{\mathbb{Z}_-}$ with $H \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu)$ and any $\varepsilon > 0$, there exists a reservoir system that has the echo state property and such that the corresponding filter is causal and time-invariant, the associated functional $H_{RC}$ satisfies $H_{RC} \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu)$, and
$$\|H - H_{RC}\|_{L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu)} < \varepsilon. \qquad (45)$$

The reservoir functional $H_{RC}$ may be chosen as coming from any of the following systems:
• Linear reservoir with polynomial readout, that is, (9) for some $N \in \mathbb{N}$, $A \in \mathbb{M}_N$, $c \in \mathbb{M}_{N,n}$, and a polynomial $h \in \mathrm{Pol}_N$, if the measure $\mu$ satisfies the following condition: for any $K \in \mathbb{N}$,
$$\int_{(\mathbb{R}^n)^{\mathbb{Z}_-}} \exp\Big(\alpha \sum_{k=0}^{K} \sum_{i=1}^{n} |z_{-k}^{(i)}|\Big)\, \mu(d\mathbf{z}) < \infty.$$
• Linear reservoir with neural network readout, that is, (9) for some $N \in \mathbb{N}$, $A \in \mathbb{M}_N$, $c \in \mathbb{M}_{N,n}$, and a neural network $h \in \mathcal{H}_N$.
• Trigonometric state-affine system with linear readout, that is, (21) for some $N \in \mathbb{N}$, $\mathbf{w} \in \mathbb{R}^N$, $P \in \mathrm{Trig}_{N,N}$, and $Q \in \mathrm{Trig}_{N,1}$.
• Echo state network with linear readout, that is, (32) for some $N \in \mathbb{N}$, $C \in \mathbb{M}_{N,n}$, $\boldsymbol{\zeta} \in \mathbb{R}^N$, $A \in \mathbb{M}_N$, $\mathbf{w} \in \mathbb{R}^N$, where we assume that the $\sigma: \mathbb{R} \to \mathbb{R}$ employed in (32) is bounded, continuous, and non-constant.


Proof. Set $\Omega = (\mathbb{R}^n)^{\mathbb{Z}_-}$, $\mathcal{F} = \otimes_{t \in \mathbb{Z}_-} \mathcal{B}(\mathbb{R}^n)$, $P = \mu$, and define $\mathbf{Z}_t(\mathbf{z}) := \mathbf{z}_t$ for all $\mathbf{z} \in \Omega$, $t \in \mathbb{Z}_-$. Then $\mathcal{F} = \sigma(\mathbf{Z}_t : t \in \mathbb{Z}_-) = \mathcal{F}_{-\infty}$ and $Z$ is the identity mapping on $(\mathbb{R}^n)^{\mathbb{Z}_-}$. One may now apply Proposition III.1, Corollary III.8, Theorem III.9, and Theorem III.10 with this choice of probability space $(\Omega, \mathcal{F}, P)$ and input process $Z$. The statement of Theorem III.13 then precisely coincides with the statements of Proposition III.1, Corollary III.8, Theorem III.9, and Theorem III.10, respectively.

E. Approximation of stationary strong time series models

Most parametric time series models commonly used in financial, macroeconometric, and forecasting applications are specified by relations of the type
$$\mathbf{X}_t = G(\mathbf{X}_{t-1}, \mathbf{Z}_t, \boldsymbol{\theta}), \qquad (46)$$
where $\boldsymbol{\theta} \in \mathbb{R}^k$ is the parameter vector of the model and the vector $\mathbf{X}_t \in \mathbb{R}^N$ is built so that it contains in its components the time series of interest and, at the same time, allows for a Markovian representation of the model as in (46). The model is driven by the innovations process $Z = (\mathbf{Z}_t)_{t \in \mathbb{Z}} \in (\mathbb{R}^n)^{\mathbb{Z}}$. When the innovations are made out of independent and identically distributed random variables, we say that the model is strong [41]. It is customary in the time series literature to impose constraints on the parameter vector $\boldsymbol{\theta}$ so that the relation (46) has a unique second-order stationary solution or, in the language of this paper, so that the system (46) satisfies the echo state property and the associated filter $U^G: (\mathbb{R}^n)^{\mathbb{Z}} \to (\mathbb{R}^N)^{\mathbb{Z}}$ satisfies
$$E[U^G(Z)_t] =: \boldsymbol{\mu} \quad \text{and} \quad E\big[U^G(Z)_t\, U^G(Z)_{t+h}^\top\big] =: \Sigma_h, \quad t, h \in \mathbb{Z}_-, \qquad (47)$$
with $\boldsymbol{\mu} \in \mathbb{R}^N$ and $\Sigma_h \in \mathbb{M}_N$ constants that do not depend on $t \in \mathbb{Z}_-$. The Wold decomposition theorem [42, Theorem 5.7.1] shows that any such filter can be uniquely written as the sum of a linear and a deterministic process.
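As a concrete illustration of the form (46), a GARCH(1,1) model admits a Markovian representation with state vector $\mathbf{X}_t = (\varepsilon_t, \sigma_t^2)$, where $\varepsilon_t$ is the observed return and $\sigma_t^2$ the conditional variance. The parameter values below are arbitrary and satisfy the usual second-order stationarity constraint $\alpha + \beta < 1$.

```python
import numpy as np

def G(X_prev, Z_t, theta):
    """One step of a GARCH(1,1) written in the Markovian form (46):
    sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1},
    eps_t = sqrt(sigma2_t) * Z_t with i.i.d. innovations Z_t."""
    omega, alpha, beta = theta
    eps_prev, sigma2_prev = X_prev
    sigma2_t = omega + alpha * eps_prev**2 + beta * sigma2_prev
    eps_t = np.sqrt(sigma2_t) * Z_t
    return np.array([eps_t, sigma2_t])

theta = (0.1, 0.1, 0.8)          # omega, alpha, beta with alpha + beta < 1
rng = np.random.default_rng(6)
X = np.array([0.0, theta[0] / (1 - theta[1] - theta[2])])  # stationary variance
path = []
for Z_t in rng.normal(size=1000):
    X = G(X, Z_t, theta)
    path.append(X)
path = np.array(path)            # columns: eps_t and sigma2_t
```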

It is obvious that for strong models the stationarity condition (7) holds and that, moreover, the condition (47) implies that
$$\|U^G(Z)\|_2 = \sup_{t \in \mathbb{Z}_-}\big\{E\big[|U^G(Z)_t|^2\big]^{1/2}\big\} = \mathrm{trace}(\Sigma_0)^{1/2} < \infty. \qquad (48)$$
This integrability condition guarantees that the approximation results in Proposition III.1, Corollary III.8, and Theorems III.9 and III.10 hold for second-order stationary strong time series models with $p = 2$. More specifically, the processes determined by this kind of model can be approximated in the $L^2$ sense by linear reservoir systems with polynomial or neural network readouts (when the condition in Remark III.4 is satisfied), by trigonometric state-affine systems with linear readouts, or by echo state networks.

Important families of models to which this approximation statement can be applied are, among many others (see the references for the meaning of the acronyms), GARCH [43], [44], VEC [45], BEKK [46], CCC [47], DCC [48], [49], GDC [50], and ARSV [51], [52].

IV. CONCLUSION

We have shown the universality of three different families of reservoir computers with respect to the $L^p$ norm associated to any given discrete-time semi-infinite input process.

On the one hand, we proved that linear reservoir systems with either neural network or polynomial readout maps (in the latter case the input process needs to satisfy the exponential moments condition (11)) are universal.

On the other hand, we showed that the exponential moments condition (11), which was required in the case of polynomial readouts, can be dropped by considering two different reservoir families with linear readouts, namely, trigonometric state-affine systems and echo state networks. The latter are the most widely used reservoir systems in applications. The linearity of the readouts is a key feature in supervised machine learning applications of these systems. It guarantees that they can be used in high-dimensional situations and in the presence of large datasets, since training in that case reduces to a linear regression.

We emphasize that, unlike existing results in the literature [25], [26] dealing with uniform universal approximation, the $L^p$ criteria used in this paper make it possible to formulate universality statements that do not necessarily impose almost sure uniform boundedness on the inputs or the fading memory property on the filter that needs to be approximated.

APPENDIX

A. Auxiliary Lemmas

Lemma A.1. Let $Z: \mathbb{Z}_- \times \Omega \to \mathbb{R}^n$ be a stochastic process and let $\mathcal{F}_t := \sigma(\mathbf{Z}_0, \ldots, \mathbf{Z}_t)$, $t \in \mathbb{Z}_-$, and $\mathcal{F}_{-\infty} := \sigma(\mathbf{Z}_t : t \in \mathbb{Z}_-)$. Let $F \in L^p(\Omega, \mathcal{F}_{-\infty}, P)$. Then $E[F \mid \mathcal{F}_t]$ converges to $F$ as $t \to -\infty$, both $P$-almost surely and in the norm $\|\cdot\|_p$, for any $p \in [1,\infty)$.

Proof. Since $\mathcal{F}_{-t} \subset \mathcal{F}_{-t-1} \subset \mathcal{F}_{-\infty}$ for all $t \in \mathbb{N}$, and $F \in L^p(\Omega, \mathcal{F}_{-\infty}, P) \subset L^1(\Omega, \mathcal{F}_{-\infty}, P)$, one has by Lévy's Upward Theorem (see, for instance, [53, II.50.3] or [33, Theorem 5.5.7]) that $F_t := E[F \mid \mathcal{F}_t]$ converges for $t \to -\infty$ to $F$ in $\|\cdot\|_1$ and $P$-almost surely. If $p = 1$ this already implies the claim. For $p > 1$ one has, by standard properties of conditional expectations (see, for instance, [33, Theorem 5.1.4]), that $\sup_{t \in \mathbb{N}}\{E[|F_{-t}|^p]\} \leq E[|F|^p]$. Hence [33, Theorem 5.4.5] implies that $F_t$ converges for $t \to -\infty$ to some $\tilde{F} \in L^p(\Omega, \mathcal{F}_{-\infty}, P)$ both in $\|\cdot\|_p$ and $P$-almost surely. But this identifies $\tilde{F} = \lim_{t \to -\infty} F_t = F$, $P$-almost surely, and hence $F_t$ converges for $t \to -\infty$ to $F$ also in $\|\cdot\|_p$.

Lemma A.2. For $N \in \mathbb{N} \setminus \{0, 1\}$ and $j = 1, \ldots, N-1$, define $A_j \in \mathbb{M}_N$ by $(A_j)_{k,l} = \delta_{k,j+1}\delta_{l,j}$ for $k, l \in \{1, \ldots, N\}$. Then for $L \in \mathbb{N}$ and $j_0, \ldots, j_L \in \{1, \ldots, N-1\}$ it holds that
$$(A_{j_L} \cdots A_{j_0})_{k,l} = \delta_{k,j_L+1}\delta_{l,j_0} \prod_{i=1}^{L} \delta_{j_i, j_{i-1}+1}. \qquad (49)$$
In particular, $A_{j_L} \cdots A_{j_0} \neq 0$ if and only if $j_i = j_0 + i$ for all $i \in \{1, \ldots, L\}$.
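Lemma A.2 is easy to confirm numerically by brute force. The sketch below checks (49) for all products of three shift matrices with $N = 5$ (an arbitrary choice for illustration).

```python
import numpy as np
from itertools import product

N = 5
A = [None] + [np.zeros((N, N)) for _ in range(N - 1)]
for j in range(1, N):
    A[j][j, j - 1] = 1.0             # (A_j)_{j+1, j} = 1 in 1-based indexing

# brute-force check of (49) for all products of length L + 1 = 3
for j0, j1, j2 in product(range(1, N), repeat=3):
    M = A[j2] @ A[j1] @ A[j0]
    if j1 == j0 + 1 and j2 == j1 + 1:    # the chain j_i = j_0 + i
        expected = np.zeros((N, N))
        expected[j2, j0 - 1] = 1.0       # entry (j_L + 1, j_0), 1-based
        assert np.array_equal(M, expected)
    else:
        assert not M.any()               # all other products vanish
```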

Proof. The last statement follows directly from (49). To prove (49) we proceed by induction on $L$. Indeed, for $L = 0$ the formula (49) is just the definition of $A_{j_0}$. For the induction step, one assumes that (49) holds for $L - 1$ and calculates
$$(A_{j_L} \cdots A_{j_0})_{k,l} = \sum_{r=1}^{N} \delta_{k,j_L+1}\delta_{r,j_L} (A_{j_{L-1}} \cdots A_{j_0})_{r,l} = \sum_{r=1}^{N} \delta_{k,j_L+1}\delta_{r,j_L}\delta_{r,j_{L-1}+1}\delta_{l,j_0} \prod_{i=1}^{L-1} \delta_{j_i,j_{i-1}+1},$$
which is indeed (49).

ACKNOWLEDGMENT

The authors thank Lyudmila Grigoryeva and Josef Teichmann for helpful discussions and remarks and acknowledge partial financial support from the Research Commission of the Universität Sankt Gallen, the Swiss National Science Foundation (grants number 175801/1 and 179114), and the French ANR "BIPHOPROC" project (ANR-14-OHRI-0018-02).

REFERENCES

[1] F. Cucker and S. Smale, "On the mathematical foundations of learning," Bulletin of the American Mathematical Society, vol. 39, no. 1, pp. 1–49, 2002.
[2] S. Smale and D.-X. Zhou, "Estimating the approximation error in learning theory," Analysis and Applications, vol. 1, no. 1, pp. 17–41, Jan. 2003.
[3] F. Cucker and D.-X. Zhou, Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, 2007.
[4] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, Dec. 1989.
[5] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
[6] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, no. 2, pp. 251–257, 1991.
[7] W. Maass, T. Natschläger, and H. Markram, "Real-time computing without stable states: a new framework for neural computation based on perturbations," Neural Computation, vol. 14, pp. 2531–2560, 2002.
[8] H. Jaeger and H. Haas, "Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication," Science, vol. 304, no. 5667, pp. 78–80, 2004.
[9] W. Maass and H. Markram, "On the computational power of circuits of spiking neurons," Journal of Computer and System Sciences, vol. 69, no. 4, pp. 593–616, 2004.
[10] W. Maass, "Liquid state machines: motivation, theory, and applications," in Computability in Context: Computation and Logic in the Real World, S. B. Cooper and A. Sorbi, Eds., 2011, ch. 8, pp. 275–296.
[11] M. B. Matthews, "On the uniform approximation of nonlinear discrete-time fading-memory systems using neural network models," Ph.D. dissertation, ETH Zürich, 1992.
[12] ——, "Approximating nonlinear fading-memory operators using neural network models," Circuits, Systems, and Signal Processing, vol. 12, no. 2, pp. 279–307, Jun. 1993.
[13] S. Boyd and L. Chua, "Fading memory and the problem of approximating nonlinear operators with Volterra series," IEEE Transactions on Circuits and Systems, vol. 32, no. 11, pp. 1150–1161, 1985.
[14] K.-I. Funahashi and Y. Nakamura, "Approximation of dynamical systems by continuous time recurrent neural networks," Neural Networks, vol. 6, no. 6, pp. 801–806, Jan. 1993.
[15] W. Maass, P. Joshi, and E. D. Sontag, "Computational aspects of feedback in neural circuits," PLoS Computational Biology, vol. 3, no. 1, p. e165, 2007.
[16] E. Sontag, "Realization theory of discrete-time nonlinear systems: Part I, the bounded case," IEEE Transactions on Circuits and Systems, vol. 26, no. 5, pp. 342–356, May 1979.
[17] E. D. Sontag, "Polynomial response maps," in Lecture Notes in Control and Information Sciences, vol. 13. Springer Verlag, 1979.
[18] M. Fliess and D. Normand-Cyrot, "Vers une approche algébrique des systèmes non linéaires en temps discret," in Analysis and Optimization of Systems, ser. Lecture Notes in Control and Information Sciences, vol. 28, A. Bensoussan and J. Lions, Eds. Springer Berlin Heidelberg, 1980.
[19] I. W. Sandberg, "Approximation theorems for discrete-time systems," IEEE Transactions on Circuits and Systems, vol. 38, no. 5, pp. 564–566, 1991.
[20] ——, "Structure theorems for nonlinear systems," Multidimensional Systems and Signal Processing, vol. 2, pp. 267–286, 1991.
[21] P. C. Perryman, "Approximation theory for deterministic and stochastic nonlinear systems," Ph.D. dissertation, University of California, Irvine, 1996.
[22] A. Stubberud and P. Perryman, "Current state of system approximation for deterministic and stochastic systems," in Conference Record of the Thirtieth Asilomar Conference on Signals, Systems and Computers, vol. 1. IEEE Comput. Soc. Press, 1997, pp. 141–145.
[23] B. Hammer and P. Tino, "Recurrent neural networks with small weights implement definite memory machines," Neural Computation, vol. 15, no. 8, pp. 1897–1929, Aug. 2003.
[24] P. Tino, B. Hammer, and M. Bodén, "Markovian bias of neural-based architectures with feedback connections," in Perspectives of Neural-Symbolic Integration, ser. Studies in Computational Intelligence, vol. 77, B. Hammer and P. Hitzler, Eds. Springer, Berlin, Heidelberg, 2007, pp. 95–133.
[25] L. Grigoryeva and J.-P. Ortega, "Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-affine systems," Journal of Machine Learning Research, vol. 19, no. 24, pp. 1–40, 2018.
[26] ——, "Echo state networks are universal," Neural Networks, vol. 108, pp. 495–508, 2018.
[27] H. Jaeger, "The 'echo state' approach to analysing and training recurrent neural networks with an erratum note," German National Research Center for Information Technology, 2010.
[28] I. B. Yildiz, H. Jaeger, and S. J. Kiebel, "Re-visiting the echo state property," Neural Networks, vol. 35, pp. 1–9, Nov. 2012.
[29] G. Manjunath and H. Jaeger, "Echo state property linked to an input: exploring a fundamental characteristic of recurrent neural networks," Neural Computation, vol. 25, no. 3, pp. 671–696, 2013.
[30] O. Kallenberg, Foundations of Modern Probability, ser. Probability and Its Applications. Springer New York, 2002.
[31] C. Berg and J. P. R. Christensen, "Density questions in the classical theory of moments," Annales de l'Institut Fourier, vol. 31, no. 3, pp. 99–114, 1981.
[32] L. C. Petersen, "On the relation between the multidimensional moment problem and the one-dimensional moment problem," Mathematica Scandinavica, vol. 51, no. 2, pp. 361–366, 1983.
[33] R. Durrett, Probability: Theory and Examples, 4th ed., ser. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press, 2010.
[34] O. G. Ernst, A. Mugler, H.-J. Starkloff, and E. Ullmann, "On the convergence of generalized polynomial chaos expansions," ESAIM: M2AN, vol. 46, no. 2, pp. 317–339, 2012.
[35] C. C. Heyde, "On a property of the lognormal distribution," Journal of the Royal Statistical Society, Series B (Methodological), vol. 25, no. 2, pp. 392–393, 1963.
[36] G. Freud, Orthogonal Polynomials. Pergamon Press, 1971.
[37] W. Rudin, Real and Complex Analysis, 3rd ed. McGraw-Hill, 1987.
[38] H. Chen, F. Tang, P. Tino, and X. Yao, "Model-based kernel for efficient time series analysis," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '13), 2013.
[39] H. Chen, P. Tino, A. Rodan, and X. Yao, "Learning in the model space for cognitive fault diagnosis," IEEE Transactions on Neural Networks and Learning Systems, 2014.
[40] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70, pp. 489–501, 2006.
[41] C. Francq and J.-M. Zakoian, GARCH Models: Structure, Statistical Inference and Financial Applications. Wiley, 2010.
[42] P. J. Brockwell and R. A. Davis, Time Series: Theory and Methods. Springer-Verlag, 2006.
[43] R. F. Engle, "Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation," Econometrica, vol. 50, no. 4, pp. 987–1007, 1982.
[44] T. Bollerslev, "Generalized autoregressive conditional heteroskedasticity," Journal of Econometrics, vol. 31, no. 3, pp. 307–327, 1986.
[45] T. Bollerslev, R. F. Engle, and J. M. Wooldridge, "A capital asset pricing model with time varying covariances," Journal of Political Economy, vol. 96, pp. 116–131, 1988.
[46] R. F. Engle and F. K. Kroner, "Multivariate simultaneous generalized ARCH," Econometric Theory, vol. 11, pp. 122–150, 1995.
[47] T. Bollerslev, "Modelling the coherence in short-run nominal exchange rates: a multivariate generalized ARCH model," Review of Economics and Statistics, vol. 72, no. 3, pp. 498–505, 1990.
[48] Y. K. Tse and A. K. C. Tsui, "A multivariate GARCH with time-varying correlations," Journal of Business and Economic Statistics, vol. 20, pp. 351–362, 2002.
[49] R. F. Engle, "Dynamic conditional correlation: a simple class of multivariate GARCH models," Journal of Business and Economic Statistics, vol. 20, pp. 339–350, 2002.
[50] F. K. Kroner and V. K. Ng, "Modelling asymmetric comovements of asset returns," The Review of Financial Studies, vol. 11, pp. 817–844, 1998.
[51] S. J. Taylor, "Financial returns modelled by the product of two stochastic processes, a study of daily sugar prices," in Time Series Analysis: Theory and Practice I, B. D. Anderson, Ed., 1982, pp. 1961–1979.
[52] A. C. Harvey, E. Ruiz, and N. Shephard, "Multivariate stochastic variance models," Review of Economic Studies, vol. 61, pp. 247–264, 1994.
[53] L. C. G. Rogers and D. Williams, Diffusions, Markov Processes, and Martingales, 2nd ed., vol. 1. Cambridge University Press, 2000.