
GONON AND ORTEGA: RESERVOIR COMPUTING UNIVERSALITY WITH STOCHASTIC INPUTS 1

Reservoir Computing Universality With Stochastic Inputs

Lukas Gonon and Juan-Pablo Ortega

Abstract—The universal approximation properties with respect to $L^p$-type criteria of three important families of reservoir computers with stochastic discrete-time semi-infinite inputs are shown. First, it is proved that linear reservoir systems with either polynomial or neural network readout maps are universal. More importantly, it is proved that the same property holds for two families with linear readouts, namely, trigonometric state-affine systems and echo state networks, which are the most widely used reservoir systems in applications. The linearity in the readouts is a key feature in supervised machine learning applications. It guarantees that these systems can be used in high-dimensional situations and in the presence of large datasets. The $L^p$ criteria used in this paper allow the formulation of universality results that do not necessarily impose almost sure uniform boundedness in the inputs or the fading memory property in the filter that needs to be approximated.

Index Terms—Reservoir computing, echo state network, ESN, machine learning, uniform system approximation, stochastic input, universality.

I. INTRODUCTION

A UNIVERSALITY statement in relation to a machine learning paradigm refers to its ability to reproduce a rich variety of patterns by modifying only a limited number of hyperparameters. In the language of learning theory, universality amounts to the possibility of making approximation errors as small as one wants [1]–[3]. Well-known universality results are, for example, the uniform approximation properties of feedforward neural networks established in [4], [5] for deterministic inputs and, later on, extended in [6] to accommodate random inputs.

This paper is a generalization of the universality statements in [6] to a discrete-time dynamical context. More specifically, we are interested in the learning not of functions but of filters that transform semi-infinite random input sequences parameterized by time into outputs that depend on those inputs in a causal and time-invariant manner. The approximants used are small subfamilies of reservoir computers (RC) [7], [8] or reservoir systems. Reservoir computers (also referred to in the literature as liquid state machines [9], [10]) are filters generated by nonlinear state-space transformations that constitute special types of recurrent neural networks. They are

determined by two maps, namely a reservoir map $F: \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N$, $n, N \in \mathbb{N}$, and a readout map $h: \mathbb{R}^N \to \mathbb{R}$, that under certain hypotheses transform (or filter) an infinite discrete-time input $\mathbf{z} = (\ldots, \mathbf{z}_{-1}, \mathbf{z}_0, \mathbf{z}_1, \ldots) \in (\mathbb{R}^n)^{\mathbb{Z}}$ into an output signal $y \in \mathbb{R}^{\mathbb{Z}}$ of the same type using a state-space transformation given by:
$$\mathbf{x}_t = F(\mathbf{x}_{t-1}, \mathbf{z}_t), \qquad (1)$$
$$y_t = h(\mathbf{x}_t), \qquad (2)$$
where $t \in \mathbb{Z}$ and the dimension $N \in \mathbb{N}$ of the state vectors $\mathbf{x}_t \in \mathbb{R}^N$ is referred to as the number of virtual neurons of the system. In supervised machine learning applications the reservoir map is very often randomly generated and the memoryless readout is trained so that the output matches a given teaching signal. An important particular case of the RC systems in (1)-(2) are echo state networks (ESN), introduced in different contexts in [8], [11], [12], which are built using the transformations
$$\mathbf{x}_t = \sigma(A\mathbf{x}_{t-1} + C\mathbf{z}_t + \boldsymbol{\zeta}), \qquad y_t = \mathbf{w}^\top \mathbf{x}_t, \qquad (3)$$
with $A \in M_N$, $C \in M_{N,n}$, $\boldsymbol{\zeta} \in \mathbb{R}^N$, and $\mathbf{w} \in \mathbb{R}^N$. The map $\sigma: \mathbb{R}^N \to \mathbb{R}^N$ is obtained via the componentwise application of a given activation function $\sigma: \mathbb{R} \to \mathbb{R}$ that is denoted with the same symbol. ESNs have as an important feature the linearity of the readout specified by the vector $\mathbf{w} \in \mathbb{R}^N$, which is estimated using linear regression methods based on a training dataset. This is done once the other parameters in the model ($A$, $C$, and $\boldsymbol{\zeta}$) have been randomly generated and their scale has been adapted to the problem in question by tuning a limited number of hyperparameters (like the sparsity or the spectral radius of the matrix $A$).

(L. Gonon and J.-P. Ortega are with the Faculty of Mathematics and Statistics, Universität Sankt Gallen, Sankt Gallen, Switzerland. L. Gonon is also affiliated with the Department of Mathematics, ETH Zürich, Switzerland. J.-P. Ortega is also affiliated with the Centre National de la Recherche Scientifique (CNRS), France.)
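As a numerical illustration (not part of the paper), the following Python/numpy sketch builds an ESN of the form (3) with randomly generated $A$, $C$, $\zeta$, runs it on a random input stream, and trains only the linear readout by least squares on an illustrative one-step memory task; all dimensions and the teaching signal are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

N, n = 50, 2          # virtual neurons, input dimension
T = 200               # length of the (truncated) input stream

# Randomly generated reservoir parameters A, C, zeta, as in (3); A is
# rescaled so that its spectral radius is below 1 (a common hyperparameter
# choice in practice).
A = rng.normal(size=(N, N))
A *= 0.9 / max(abs(np.linalg.eigvals(A)))
C = rng.normal(size=(N, n))
zeta = rng.normal(size=N)

z = rng.normal(size=(T, n))   # stochastic input stream

# State recursion x_t = sigma(A x_{t-1} + C z_t + zeta) with sigma = tanh.
x = np.zeros(N)
states = []
for t in range(T):
    x = np.tanh(A @ x + C @ z[t] + zeta)
    states.append(x)
X = np.array(states)

# Only the linear readout w is trained, here by least squares against an
# illustrative teaching signal y_t = z_{t-1}^(1) (a one-step memory task).
y_target = np.roll(z[:, 0], 1)
w, *_ = np.linalg.lstsq(X[1:], y_target[1:], rcond=None)
y_pred = X[1:] @ w
print(X.shape, w.shape)
```

Only `w` depends on the teaching signal; the reservoir parameters are never trained, which is the defining feature of the RC paradigm described above.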

Families of reservoir systems of the type (1)-(2) have already been proved to be universal in different contexts. In the continuous-time setup, it was shown in [13] that linear reservoir systems with polynomial readouts or bilinear reservoirs with linear readouts are able to uniformly approximate any fading memory filter with uniformly bounded and equicontinuous inputs. The fading memory property is a continuity feature exhibited by many filters encountered in applications. See also [9], [10], [14], [15] for other contributions to the RC universality problem in the continuous-time setup.

In the discrete-time setup, several universality statements were already part of classical systems theory for inputs defined on a finite number of time points [16]–[18]. In the more general context of semi-infinite inputs, various universality results have been formulated for systems with approximate finite memory [11], [12], [19]–[22]. More recently, it has been shown in [23], [24] that RCs generated by contractive reservoir maps (similar to the ESNs introduced above) exhibit universality properties in the approximate finite memory category.

These universality results have been recently extended to the causal and fading memory category in [25], [26]. In those works the universality of two important families of reservoir systems with linear readouts has been established, namely, the so-called state-affine systems (SAS) and the echo state networks (ESN) that we just introduced in (3). Moreover, the universality of the SAS family was established in [25] both for uniformly bounded deterministic inputs and for almost surely uniformly bounded stochastic ones. This last statement was shown to be a corollary of a general transfer theorem proving that very important features of causal and time-invariant filters, like the fading memory property or universality, are naturally inherited by reservoir systems with almost surely uniformly bounded stochastic inputs from their counterparts with deterministic inputs.

Unfortunately, almost surely bounded random inputs are not appropriate for many applications. For example, most parametric time series models use as driving innovations random variables whose distributions are not compactly supported (Gaussian, for example) in order to ensure adequate levels of performance. The main goal of this work is formulating universality results in the stochastic context that do not impose almost sure uniform boundedness on the inputs. This is achieved by using a density criterion (which is the mathematical characterization of universality) based not on $L^\infty$-type norms, as in [25], [26], but on $L^p$ norms, $p \in [1, \infty)$. This approach follows the pattern introduced in the static case in [6].

This strategy allows us to cover a more general class of input signals and filters, but it also creates some differences in the type of approximation results that are obtained. More specifically, in the stochastic universality statements in [25], for example, universal families are presented that uniformly approximate any given filter for any input in a given class of stochastic processes. In contrast with this, and as in [6], here we first fix a discrete-time stochastic process that models the data generating process (DGP) behind the system inputs under consideration. Subsequently, families of reservoir filters are spelled out whose images of the DGP are dense in the $L^p$ sense. Equivalently, the image of the DGP by any measurable causal and time-invariant filter can be approximated by the image of one of the members of the universal family with respect to an $L^p$ norm defined using the law of the prefixed DGP.

It is important to point out that this approach allows us to formulate universality results for filters that do not necessarily have the fading memory property, since only measurability is imposed as a hypothesis.

The paper contains three main universality statements. The first one shows that linear reservoir systems with either polynomial or neural network readout maps are universal in the $L^p$ sense. More importantly, two other families with linear readouts are shown to also have this property, namely, trigonometric state-affine systems and echo state networks, which are the most widely used reservoir systems in applications. The linearity of the readout is a key feature of these systems since in supervised machine learning applications it reduces the training task to the solution of a linear regression problem, which can be implemented efficiently also in high-dimensional situations and in the presence of large datasets.

We emphasize that, from a learning theoretical perspective, the results in this paper only establish the possibility of making the approximation error arbitrarily small when using the proposed RC families in a specific learning task. We provide bounds neither for the approximation errors nor for the corresponding estimation errors based on finite random samples. Even though some results in this direction already exist in the literature [23], [24], we plan to address this important subject in a forthcoming paper that adopts the same degree of generality as the present one.

II. PRELIMINARIES

In this section we introduce some notation and collect general facts about filters, reservoir systems, and stochastic input signals.

A. Notation

We write $\mathbb{N} = \{0, 1, \ldots\}$ and $\mathbb{Z}_- = \{\ldots, -1, 0\}$. The elements of the Euclidean spaces $\mathbb{R}^n$ will be written as column vectors and will be denoted in bold. Given a vector $\mathbf{v} \in \mathbb{R}^n$, we denote its entries by $v_i$ or by $v^{(i)}$, with $i \in \{1, \ldots, n\}$. $(\mathbb{R}^n)^{\mathbb{Z}}$ and $(\mathbb{R}^n)^{\mathbb{Z}_-}$ denote the sets of infinite $\mathbb{R}^n$-valued sequences of the type $(\ldots, \mathbf{z}_{-1}, \mathbf{z}_0, \mathbf{z}_1, \ldots)$ and $(\ldots, \mathbf{z}_{-1}, \mathbf{z}_0)$ with $\mathbf{z}_i \in \mathbb{R}^n$ for $i \in \mathbb{Z}$ and $i \in \mathbb{Z}_-$, respectively. Additionally, we denote by $z_i^{(k)}$ the $k$-th component of $\mathbf{z}_i$. The elements in these sequence spaces will also be written in bold, for example, $\mathbf{z} := (\ldots, \mathbf{z}_{-1}, \mathbf{z}_0) \in (\mathbb{R}^n)^{\mathbb{Z}_-}$. We denote by $M_{n,m}$ the space of real $n \times m$ matrices with $m, n \in \mathbb{N}$. When $n = m$, we use the symbol $M_n$ to refer to the space of square matrices of order $n$. Random variables and stochastic processes will be denoted using upper case characters that will be bold when they are vector valued.

B. Filters and functionals

A filter is a map $U: (\mathbb{R}^n)^{\mathbb{Z}} \to \mathbb{R}^{\mathbb{Z}}$. It is called causal if, for any $\mathbf{z}, \mathbf{w} \in (\mathbb{R}^n)^{\mathbb{Z}}$ which satisfy $\mathbf{z}_\tau = \mathbf{w}_\tau$ for all $\tau \le t$ for a given $t \in \mathbb{Z}$, one has that $U(\mathbf{z})_t = U(\mathbf{w})_t$. Denote by $T_{-\tau}: (\mathbb{R}^n)^{\mathbb{Z}} \to (\mathbb{R}^n)^{\mathbb{Z}}$ the time delay operator defined by $T_{-\tau}(\mathbf{z})_t := \mathbf{z}_{t+\tau}$, for any $\tau \in \mathbb{Z}$. A filter $U$ is called time-invariant if $T_{-\tau} \circ U = U \circ T_{-\tau}$ for all $\tau \in \mathbb{Z}$.

Causal and time-invariant filters can be equivalently described using their naturally associated functionals. We refer to a map $H: (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}$ as a functional. Given a causal and time-invariant filter $U$, one defines the functional $H_U$ associated to it by setting $H_U(\mathbf{z}) := U(\mathbf{z}^e)_0$. Here $\mathbf{z}^e$ is an arbitrary extension of $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}_-}$ to $(\mathbb{R}^n)^{\mathbb{Z}}$; $H_U$ does not depend on the choice of this extension since $U$ is causal. Conversely, given a functional $H$, one may define a causal and time-invariant filter $U_H: (\mathbb{R}^n)^{\mathbb{Z}} \to \mathbb{R}^{\mathbb{Z}}$ by setting $U_H(\mathbf{z})_t := H(\pi_{\mathbb{Z}_-} \circ T_{-t}(\mathbf{z}))$, where $\pi_{\mathbb{Z}_-}: (\mathbb{R}^n)^{\mathbb{Z}} \to (\mathbb{R}^n)^{\mathbb{Z}_-}$ is the natural projection. One may verify that any causal and time-invariant filter can be recovered from its associated functional and conversely; equivalently, $U = U_{H_U}$ and $H = H_{U_H}$. We refer to [13] for further details.

If $U$ is causal and time-invariant, then for any $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}}$ the sequence $U(\mathbf{z})$ restricted to $\mathbb{Z}_-$ only depends on $(\mathbf{z}_t)_{t \in \mathbb{Z}_-}$. Thus we may also consider $U$ as a map $U: (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}^{\mathbb{Z}_-}$; when we do so, this will always be clear from the context.
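As a small numerical illustration (not from the paper), the exponential smoothing functional $H(\mathbf{z}) = \sum_{j \ge 0} \lambda^j z_{-j}$ induces a causal and time-invariant filter $U_H$ as described above. The sketch below truncates the left-infinite sequences to finite arrays (last entry playing the role of $z_0$) and checks causality numerically: perturbing the inputs strictly after time $t$ leaves $U_H(\mathbf{z})_t$ unchanged. The choice $\lambda = 0.5$ is arbitrary.

```python
import numpy as np

lam = 0.5

def H(z):
    # Functional on a truncated left-infinite sequence z = (..., z_-1, z_0),
    # stored as an array whose LAST entry is z_0:
    # H(z) = sum_{j >= 0} lam^j * z_{-j}  (exponential smoothing).
    j = np.arange(len(z))            # weights for z_0, z_-1, z_-2, ...
    return float(np.sum(lam ** j * z[::-1]))

def U_H(z):
    # Filter induced by H as in the text: U_H(z)_t = H(z restricted to times <= t),
    # evaluated for every t of a finite input window.
    return np.array([H(z[: t + 1]) for t in range(len(z))])

rng = np.random.default_rng(1)
z = rng.normal(size=100)
y = U_H(z)

# Causality: the output at time t only depends on inputs up to time t.
z2 = z.copy(); z2[60:] += 5.0        # perturb only inputs after t = 59
assert np.allclose(U_H(z2)[:60], y[:60])
print(y[-1])
```

The same construction with the shifted window `z[: t + 1]` is exactly the composition $H \circ \pi_{\mathbb{Z}_-} \circ T_{-t}$ in truncated form.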

C. Reservoir computing systems

A specific class of filters can be obtained using the reservoir computing systems or reservoir computers (RC) introduced in (1)-(2) when they satisfy the so-called echo state property (ESP) given by the following statement (see [27]–[29]): for any $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}}$ there exists a unique $\mathbf{x} \in (\mathbb{R}^N)^{\mathbb{Z}}$ such that (1) holds. In the presence of the ESP, the RC system gives rise to a well-defined filter $U_h^F$ that is constructed by associating to any $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}}$ the unique $\mathbf{x} \in (\mathbb{R}^N)^{\mathbb{Z}}$ satisfying (1) and by mapping $\mathbf{x}$ subsequently to the output in (2), that is, $U_h^F(\mathbf{z})_t := y_t$. Furthermore, it can be shown (see [26, Proposition 2.1]) that $U_h^F$ is necessarily causal and time-invariant, and hence we may associate to $U_h^F$ a reservoir functional $H_h^F: (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}$ defined as $H_h^F(\mathbf{z}) := U_h^F(\mathbf{z})_0$. As seen above, the causal and time-invariant filter $U_h^F$ is uniquely determined by the reservoir functional $H_h^F$. Since the latter is determined by the restriction of the RC system to $\mathbb{Z}_-$, we will sometimes consider the system (1)-(2) only for $t \in \mathbb{Z}_-$.
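A common way to see the ESP at work numerically (an illustration, not a construction from the paper) is to make the reservoir map a contraction: below, $A$ is rescaled by its operator norm so that $\mathbf{x} \mapsto \tanh(A\mathbf{x} + C\mathbf{z})$ is a $0.8$-contraction, and iterating from two very different initial states over the same input stream yields essentially the same final state, consistent with the uniqueness of the solution of (1). Contractivity is only a sufficient condition for the ESP; the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, T = 30, 1, 300

A = rng.normal(size=(N, N))
A *= 0.8 / np.linalg.norm(A, 2)   # operator norm 0.8 => 0.8-contraction
C = rng.normal(size=(N, n))

def run(z, x0):
    # Iterate x_t = tanh(A x_{t-1} + C z_t) from the initial state x0.
    x = x0.copy()
    for zt in z:
        x = np.tanh(A @ x + C @ zt)
    return x

z = rng.normal(size=(T, n))
xa = run(z, np.zeros(N))
xb = run(z, 10.0 * np.ones(N))
# The two trajectories "forget" their initializations: same final state.
print(np.linalg.norm(xa - xb))
```

The difference between the two final states is bounded by $0.8^{T-1}$ times the initial discrepancy, which is numerically zero here.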

D. Deterministic ﬁlters with stochastic inputs

We are interested in feeding the filters and the systems that we just introduced with stochastic processes as inputs. More explicitly, given a causal and time-invariant filter $U$ that satisfies certain measurability hypotheses, any stochastic process $\mathbf{Z} = (\mathbf{Z}_t)_{t \in \mathbb{Z}_-}$ is mapped to a new stochastic process $(U(\mathbf{Z})_t)_{t \in \mathbb{Z}_-}$. The main contributions in this article address the question of approximating $U(\mathbf{Z})$ by reservoir filters in an $L^p$ sense. We now introduce the precise framework needed to achieve this goal.

1) Probabilistic framework: Consider a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ on which all random variables are defined. Recall that the sample space $\Omega$ is an arbitrary set representing possible outcomes, the $\sigma$-algebra $\mathcal{F}$ is a collection of subsets of $\Omega$ describing the set of events to be considered, and $\mathbb{P}: \mathcal{F} \to [0,1]$ is a probability measure that assigns a probability of occurrence to each event. The input signal is modeled as a discrete-time stochastic process $\mathbf{Z} = (\mathbf{Z}_t)_{t \in \mathbb{Z}_-}$ with values in $\mathbb{R}^n$. For each outcome $\omega \in \Omega$ we denote by $\mathbf{Z}(\omega) = (\mathbf{Z}_t(\omega))_{t \in \mathbb{Z}_-}$ the realization or sample path of $\mathbf{Z}$. Thus $\mathbf{Z}$ may be viewed as a random sequence in $\mathbb{R}^n$ and, when dealing with stochastic processes, we will make no distinction between the assignment $\mathbf{Z}: \mathbb{Z}_- \times \Omega \to \mathbb{R}^n$ and the corresponding map into path space $\mathbf{Z}: \Omega \to (\mathbb{R}^n)^{\mathbb{Z}_-}$. We recall that $\mathbf{Z}$ is a stochastic process when the corresponding map $\mathbf{Z}: \Omega \to (\mathbb{R}^n)^{\mathbb{Z}_-}$ is measurable. Here $(\mathbb{R}^n)^{\mathbb{Z}_-}$ is equipped with the product $\sigma$-algebra $\otimes_{t \in \mathbb{Z}_-} \mathcal{B}(\mathbb{R}^n)$ (which coincides with the Borel $\sigma$-algebra of $(\mathbb{R}^n)^{\mathbb{Z}_-}$ equipped with the product topology by [30, Lemma 1.2]), where $\mathcal{B}(\mathbb{R}^n)$ is the Borel $\sigma$-algebra on $\mathbb{R}^n$.

We denote by $\mathcal{F}_t := \sigma(\mathbf{Z}_0, \ldots, \mathbf{Z}_t)$, $t \in \mathbb{Z}_-$, the $\sigma$-algebra generated by $\{\mathbf{Z}_0, \ldots, \mathbf{Z}_t\}$ and write $\mathcal{F}_{-\infty} := \sigma(\mathbf{Z}_t : t \in \mathbb{Z}_-)$. Thus $\mathcal{F}_t$ models the information contained in the input stream at times $0, -1, \ldots, t$. For $p \in [1, \infty]$ we denote by $L^p(\Omega, \mathcal{F}, \mathbb{P})$ the Banach space formed by the real-valued random variables on $(\Omega, \mathcal{F}, \mathbb{P})$ that have a finite usual $L^p$ norm $\|\cdot\|_p$.

We say that the process $\mathbf{Z}$ is stationary when for any $\{t_1, \ldots, t_k\} \subset \mathbb{Z}_-$, $h \in \mathbb{Z}_-$, and $A_{t_1}, \ldots, A_{t_k} \in \mathcal{B}(\mathbb{R}^n)$, we have that
$$\mathbb{P}(\mathbf{Z}_{t_1} \in A_{t_1}, \ldots, \mathbf{Z}_{t_k} \in A_{t_k}) = \mathbb{P}(\mathbf{Z}_{t_1+h} \in A_{t_1}, \ldots, \mathbf{Z}_{t_k+h} \in A_{t_k}).$$

2) Measurable functionals and filters: We say that a functional $H$ is measurable when the map between measurable spaces $H: \left((\mathbb{R}^n)^{\mathbb{Z}_-}, \otimes_{t \in \mathbb{Z}_-} \mathcal{B}(\mathbb{R}^n)\right) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is measurable. When $H$ is measurable then so is $H(\mathbf{Z}): (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, since $H(\mathbf{Z}) = H \circ \mathbf{Z}$ is the composition of measurable maps, and hence $H(\mathbf{Z})$ is a random variable on $(\Omega, \mathcal{F}, \mathbb{P})$.

Analogously, we will say that a causal, time-invariant filter $U$ is measurable when the map between measurable spaces $U: \left((\mathbb{R}^n)^{\mathbb{Z}}, \otimes_{t \in \mathbb{Z}} \mathcal{B}(\mathbb{R}^n)\right) \to \left(\mathbb{R}^{\mathbb{Z}}, \otimes_{t \in \mathbb{Z}} \mathcal{B}(\mathbb{R})\right)$ is measurable. In that case, the restriction of $U$ to $\mathbb{Z}_-$ (see above) is also measurable and so $U(\mathbf{Z})$ is a real-valued stochastic process.

As discussed above, causal, time-invariant filters and functionals are in a one-to-one correspondence. This relation is compatible with the measurability condition, that is, a causal and time-invariant filter is measurable if and only if the associated functional is measurable. In order to prove this statement we show first that the operator $\pi_{\mathbb{Z}_-} \circ T_{-t}: \left((\mathbb{R}^n)^{\mathbb{Z}}, \otimes_{t \in \mathbb{Z}} \mathcal{B}(\mathbb{R}^n)\right) \to \left((\mathbb{R}^n)^{\mathbb{Z}_-}, \otimes_{t \in \mathbb{Z}_-} \mathcal{B}(\mathbb{R}^n)\right)$ is a measurable map, for any $t \in \mathbb{Z}_-$. Indeed, notice first that the projections $p_i: \left((\mathbb{R}^n)^{\mathbb{Z}}, \otimes_{t \in \mathbb{Z}} \mathcal{B}(\mathbb{R}^n)\right) \to (\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$, $i \in \mathbb{Z}_-$, given by $p_i(\mathbf{z}) = \mathbf{z}_i$ are measurable. Thus $\pi_{\mathbb{Z}_-} \circ T_{-t}$ can be written as the Cartesian product of measurable maps, i.e., for each $k \in \mathbb{Z}_-$ one has that $(\pi_{\mathbb{Z}_-} \circ T_{-t})_k = p_{t+k}$ is measurable. This yields that $\pi_{\mathbb{Z}_-} \circ T_{-t}$ is measurable [30, Lemma 1.8].

Now, if $H$ is a measurable functional, then the associated filter $U_H$ is also measurable since, for each $t \in \mathbb{Z}_-$,
$$(U_H)_t = H \circ \pi_{\mathbb{Z}_-} \circ T_{-t} \qquad (4)$$
is a composition of measurable functions and hence also measurable. Conversely, if $U$ is causal, time-invariant, and measurable, then so is the associated functional $H_U = p_0 \circ U$.

3) $L^p$-norm for functionals: Fix $p \in [1, \infty)$ and let $H$ be a measurable functional such that $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$. The functionals which satisfy that
$$\|H(\mathbf{Z})\|_p := \mathbb{E}[|H(\mathbf{Z})|^p]^{1/p} < \infty \qquad (5)$$
will be referred to as $p$-integrable with respect to the input process $\mathbf{Z}$.

Let us now consider the expression (5) from an alternative point of view. Denote by $\mu_{\mathbf{Z}} := \mathbb{P} \circ \mathbf{Z}^{-1}$ the law of $\mathbf{Z}$ when viewed as an $(\mathbb{R}^n)^{\mathbb{Z}_-}$-valued random variable as above. Thus $\mu_{\mathbf{Z}}$ is a probability measure on $(\mathbb{R}^n)^{\mathbb{Z}_-}$ such that for any measurable set $A \subset (\mathbb{R}^n)^{\mathbb{Z}_-}$ one has $\mu_{\mathbf{Z}}(A) = \mathbb{P}(\mathbf{Z} \in A)$. The requirement $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$ then translates to $H \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$ and (5) is equal [30, Lemma 1.22] to
$$\|H\|_p^{\mu_{\mathbf{Z}}} := \left[ \int_{(\mathbb{R}^n)^{\mathbb{Z}_-}} |H(\mathbf{z})|^p \, \mu_{\mathbf{Z}}(d\mathbf{z}) \right]^{1/p} = \|H(\mathbf{Z})\|_p.$$

Thus, the results formulated later on in the paper for functionals with random inputs can also be seen as statements for functionals with deterministic inputs in $(\mathbb{R}^n)^{\mathbb{Z}_-}$, where the closeness between them is measured using the norm in $L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$. Following the terminology used in [6], we will refer to $\mu_{\mathbf{Z}}$ as the input environment measure.

We emphasize that these two points of view are equivalent. Given any probability measure $\mu_{\mathbf{Z}}$ on $(\mathbb{R}^n)^{\mathbb{Z}_-}$ one may set $\Omega = (\mathbb{R}^n)^{\mathbb{Z}_-}$, $\mathcal{F} = \otimes_{t \in \mathbb{Z}_-} \mathcal{B}(\mathbb{R}^n)$, $\mathbb{P} = \mu_{\mathbf{Z}}$ and define $Z_t(\mathbf{z}) := \mathbf{z}_t$ for all $\mathbf{z} \in \Omega$. We will switch between these two viewpoints throughout the paper without much warning to the reader.
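As a concrete check of the norm in (5) (a numerical sketch, not from the paper), take an i.i.d. standard Gaussian DGP and the exponential smoothing functional $H(\mathbf{z}) = \sum_{j \ge 0} \lambda^j z_{-j}$ with $\lambda = 0.5$. Then $H(\mathbf{Z})$ is centered Gaussian with variance $1/(1-\lambda^2)$, so for $p=2$ the norm (5) is $\sqrt{1/(1-\lambda^2)}$, which a Monte Carlo estimate over truncated sample paths reproduces:

```python
import numpy as np

rng = np.random.default_rng(3)
lam, p = 0.5, 2
K, M = 50, 200_000    # truncation of the past, Monte Carlo sample size

# DGP: i.i.d. standard normal inputs; each row of Z is one sample path,
# with column j holding Z_{-j}.
Z = rng.normal(size=(M, K + 1))

# Functional H(z) = sum_{j >= 0} lam^j z_{-j}, truncated at j = K.
w = lam ** np.arange(K + 1)
HZ = Z @ w

# Monte Carlo estimate of ||H(Z)||_p = E[|H(Z)|^p]^(1/p) ...
mc = np.mean(np.abs(HZ) ** p) ** (1 / p)

# ... which for p = 2 is the standard deviation sqrt(1/(1 - lam^2)).
exact = np.sqrt(1.0 / (1.0 - lam**2))
print(mc, exact)
```

The truncation at $K = 50$ terms changes the variance only by a factor $1 - \lambda^{2(K+1)}$, which is negligible here.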

4) $L^p$-norm for filters: Fix $p \in [1, \infty)$. A causal, time-invariant, measurable filter $U$ is said to be $p$-integrable if
$$\|U(\mathbf{Z})\|_p := \sup_{t \in \mathbb{Z}_-} \left\{ \mathbb{E}[|U(\mathbf{Z})_t|^p]^{1/p} \right\} < \infty. \qquad (6)$$
It is easy to see that if $U$ is $p$-integrable, then so is the corresponding functional $H_U$, due to the following inequality:
$$\|H_U(\mathbf{Z})\|_p = \mathbb{E}[|H_U(\mathbf{Z})|^p]^{1/p} = \mathbb{E}[|U(\mathbf{Z})_0|^p]^{1/p} \le \sup_{t \in \mathbb{Z}_-} \left\{ \mathbb{E}[|U(\mathbf{Z})_t|^p]^{1/p} \right\} = \|U(\mathbf{Z})\|_p < \infty.$$
The converse implication holds true when the input process is stationary. In order to show this fact, notice first that if $\mu_t$ is the law of $\pi_{\mathbb{Z}_-} \circ T_{-t}(\mathbf{Z})$, $t \in \mathbb{Z}_-$, and $\mathbf{Z}$ is by hypothesis stationary, then for any $\{t_1, \ldots, t_k\} \subset \mathbb{Z}_-$ and $A_{t_1}, \ldots, A_{t_k} \in \mathcal{B}(\mathbb{R}^n)$ we have that
$$\mathbb{P}\left((\pi_{\mathbb{Z}_-} \circ T_{-t}(\mathbf{Z}))_{t_1} \in A_{t_1}, \ldots, (\pi_{\mathbb{Z}_-} \circ T_{-t}(\mathbf{Z}))_{t_k} \in A_{t_k}\right) = \mathbb{P}(\mathbf{Z}_{t_1+t} \in A_{t_1}, \ldots, \mathbf{Z}_{t_k+t} \in A_{t_k}) = \mathbb{P}(\mathbf{Z}_{t_1} \in A_{t_1}, \ldots, \mathbf{Z}_{t_k} \in A_{t_k}),$$
which proves that
$$\mu_{\mathbf{Z}} = \mu_t, \quad \text{for all } t \in \mathbb{Z}_-. \qquad (7)$$
This identity, together with (4), implies that for any $p$-integrable functional $H$:
$$\|U_H(\mathbf{Z})\|_p = \sup_{t \in \mathbb{Z}_-} \left\{ \mathbb{E}[|U_H(\mathbf{Z})_t|^p]^{1/p} \right\} = \sup_{t \in \mathbb{Z}_-} \left\{ \mathbb{E}\left[|H(\pi_{\mathbb{Z}_-} \circ T_{-t}(\mathbf{Z}))|^p\right]^{1/p} \right\} = \sup_{t \in \mathbb{Z}_-} \left[ \int_{(\mathbb{R}^n)^{\mathbb{Z}_-}} |H(\mathbf{z})|^p \, \mu_t(d\mathbf{z}) \right]^{1/p} = \sup_{t \in \mathbb{Z}_-} \left[ \int_{(\mathbb{R}^n)^{\mathbb{Z}_-}} |H(\mathbf{z})|^p \, \mu_{\mathbf{Z}}(d\mathbf{z}) \right]^{1/p} = \|H(\mathbf{Z})\|_p < \infty, \qquad (8)$$
which proves the $p$-integrability of the associated filter $U_H$.

III. $L^p$-UNIVERSALITY RESULTS

Fix $p \in [1, \infty)$, an input process $\mathbf{Z}$, and a functional $H$ such that $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$. The goal of this section is finding simple families of reservoir systems that are able to approximate $H(\mathbf{Z})$ as accurately as needed in the $L^p$ sense. The first part contains a result that shows that linear reservoir maps with polynomial readouts are able to carry this out. As we already pointed out in the introduction, a result for the same type of reservoir systems has been proved in [25] in the $L^\infty$ setting for both deterministic and almost surely uniformly bounded stochastic inputs. The second part presents a family that is able to achieve universality using only linear readouts, which is of major importance for applications since in that case the training effort reduces to solving a linear regression. Finally, we prove the universality of echo state networks, which form the most widely used family of reservoir systems with linear readouts.

A. Linear reservoirs with nonlinear readouts

Consider a reservoir system with a linear reservoir map and a polynomial readout. More precisely, given $A \in M_N$, $c \in M_{N,n}$, and $h \in \mathrm{Pol}_N$ a real-valued polynomial in $N$ variables, consider the system
$$\mathbf{x}_t = A\mathbf{x}_{t-1} + c\mathbf{z}_t, \qquad y_t = h(\mathbf{x}_t), \qquad t \in \mathbb{Z}_-, \qquad (9)$$
for any $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}_-}$. If the matrix $A$ is chosen so that $\sigma_{\max}(A) < 1$, then this system has the echo state property and the corresponding reservoir filter $U_h^{A,c}$ is causal and time-invariant [25]. We denote by $H_h^{A,c}$ the associated functional. We are interested in the approximation capabilities that can be achieved by using processes of the type $H_h^{A,c}(\mathbf{Z})$, where $\mathbf{Z}$ is a fixed input process and $H_h^{A,c}(\mathbf{Z}) = Y_0$, with $Y_0$ determined by the stochastic reservoir system
$$\mathbf{X}_t = A\mathbf{X}_{t-1} + c\mathbf{Z}_t, \qquad Y_t = h(\mathbf{X}_t), \qquad t \in \mathbb{Z}_-. \qquad (10)$$

Proposition III.1. Fix $p \in [1, \infty)$, let $\mathbf{Z}$ be a fixed $\mathbb{R}^n$-valued input process, and let $H$ be a functional such that $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$. Suppose that for any $K \in \mathbb{N}$ there exists $\alpha > 0$ such that
$$\mathbb{E}\left[ \exp\left( \alpha \sum_{k=0}^{K} \sum_{i=1}^{n} |Z_{-k}^{(i)}| \right) \right] < \infty, \qquad (11)$$
where $Z_{-k}^{(i)}$ denotes the $i$-th component of $\mathbf{Z}_{-k}$. Then, for any $\varepsilon > 0$ there exist $N \in \mathbb{N}$, $A \in M_N$, $c \in M_{N,n}$, and $h \in \mathrm{Pol}_N$ such that (9) has the echo state property, the corresponding filter is causal and time-invariant, the associated functional satisfies $H_h^{A,c}(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$, and
$$\|H(\mathbf{Z}) - H_h^{A,c}(\mathbf{Z})\|_p < \varepsilon. \qquad (12)$$
If the input process $\mathbf{Z}$ is stationary, then
$$\|U_H(\mathbf{Z}) - U_h^{A,c}(\mathbf{Z})\|_p < \varepsilon. \qquad (13)$$


Proof. The proof consists of two steps. In the first one we use assumption (11) and classical results in the literature to establish that
$$\mathrm{Pol}_{n(K+1)} \text{ is dense in } L^p(\mathbb{R}^{n(K+1)}, \mu_K), \quad \text{for all } K \in \mathbb{N}, \qquad (14)$$
where $\mu_K$ is the law of $(Z_0^{(1)}, Z_0^{(2)}, \ldots, Z_{-K}^{(n-1)}, Z_{-K}^{(n)})$ on $\mathbb{R}^{n(K+1)}$ under $\mathbb{P}$. In the second step we then use (14) to construct a linear RC system of the type in (9) that yields the approximation statement (12).

Step 1: Denote by $\mu_K$ the law of $(Z_0^{(1)}, Z_0^{(2)}, \ldots, Z_{-K}^{(n-1)}, Z_{-K}^{(n)})$ on $\mathbb{R}^N$ under $\mathbb{P}$, where $N := n(K+1)$. By (11) there exists $\alpha > 0$ such that $\int_{\mathbb{R}^N} \exp(\alpha \|\mathbf{z}\|_1) \, \mu_K(d\mathbf{z}) < \infty$, where here and in the rest of this proof $\|\cdot\|_1$ denotes the Euclidean 1-norm. Denoting by $\mu_K^j$ the $j$-th marginal distribution of $\mu_K$, this implies for $j = 1, \ldots, N$ that
$$\int_{\mathbb{R}} \exp(\alpha |z^{(j)}|) \, \mu_K^j(dz^{(j)}) \le \int_{\mathbb{R}^N} \exp(\alpha \|\mathbf{z}\|_1) \, \mu_K(d\mathbf{z}) < \infty.$$
Consequently, by [31, Theorem 6], $\mathrm{Pol}_1$ is dense in $L^p(\mathbb{R}, \mu_K^j)$ for any $p \in [1, \infty)$, $j = 1, \ldots, N$. By [32, Proposition, page 364] this implies that $\mathrm{Pol}_N$ is dense in $L^p(\mathbb{R}^N, \mu_K)$, where we note that $\mu_K$ indeed satisfies the moment assumption in [32, page 361]: since $x^{2m} \le C_{m,\alpha} \exp(\alpha x)$ for all $x \ge 0$, $m \in \mathbb{N}$, and some constant $C_{m,\alpha} > 0$, one has
$$\int_{\mathbb{R}^N} \|\mathbf{z}\|_2^{2m} \, \mu_K(d\mathbf{z}) \le C_{m,\alpha} \int_{\mathbb{R}^N} \exp(\alpha \|\mathbf{z}\|_2) \, \mu_K(d\mathbf{z}) \le C_{m,\alpha} \int_{\mathbb{R}^N} \exp(\alpha \|\mathbf{z}\|_1) \, \mu_K(d\mathbf{z}) < \infty.$$

Step 2: Let $\varepsilon > 0$. By Lemma A.1 in the Appendix there exists $K \in \mathbb{N}$ such that
$$\|H(\mathbf{Z}) - \mathbb{E}[H(\mathbf{Z})|\mathcal{F}_{-K}]\|_p < \frac{\varepsilon}{2}, \qquad (15)$$
where $\mathcal{F}_{-K} := \sigma(\mathbf{Z}_0, \ldots, \mathbf{Z}_{-K})$. In the following paragraphs we will establish the approximation statement (12) for $\mathbb{E}[H(\mathbf{Z})|\mathcal{F}_{-K}]$ instead of $H(\mathbf{Z})$. Combining this with (15) will then yield (12).

Let $N := n(K+1)$. By definition, $\mathbb{E}[H(\mathbf{Z})|\mathcal{F}_{-K}]$ is $\mathcal{F}_{-K}$-measurable and hence there exists [30, Lemma 1.13] a measurable function $g_K: \mathbb{R}^N \to \mathbb{R}$ such that $\mathbb{E}[H(\mathbf{Z})|\mathcal{F}_{-K}] = g_K(\mathbf{Z}_0, \ldots, \mathbf{Z}_{-K})$. Furthermore,
$$\int_{\mathbb{R}^N} |g_K(\mathbf{z})|^p \, \mu_K(d\mathbf{z}) = \mathbb{E}\left[|\mathbb{E}[H(\mathbf{Z})|\mathcal{F}_{-K}]|^p\right] \le \mathbb{E}[|H(\mathbf{Z})|^p] < \infty,$$
by standard properties of conditional expectations (see, for instance, [33, Theorem 5.1.4]) and the assumption that $H(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$. Thus $g_K \in L^p(\mathbb{R}^N, \mu_K)$ and, using the statement (14) established in Step 1, there exists $h \in \mathrm{Pol}_N$ such that
$$\|\mathbb{E}[H(\mathbf{Z})|\mathcal{F}_{-K}] - h(\mathbf{Z}_0^\top, \ldots, \mathbf{Z}_{-K}^\top)\|_p = \|g_K - h\|_{L^p(\mathbb{R}^N, \mu_K)} < \frac{\varepsilon}{2}. \qquad (16)$$

Define now a reservoir system of the type (10) with inputs given by the random variables $\mathbf{Z}_t$, $t \in \mathbb{Z}_-$, and reservoir matrices $A \in M_N$ and $c \in M_{N,n}$ with all entries equal to $0$ except $A_{i,i-n} = 1$ for $i = n+1, \ldots, N$ and $c_{i,i} = 1$ for $i = 1, \ldots, n$, that is,
$$A = \begin{pmatrix} 0_{n,nK} & 0_{n,n} \\ I_{nK} & 0_{nK,n} \end{pmatrix}, \qquad c = \begin{pmatrix} I_n \\ 0_{nK,n} \end{pmatrix}.$$
This system has the echo state property (all the eigenvalues of $A$ equal zero) and has a unique causal and time-invariant solution associated to the reservoir states $\mathbf{X}_t := \left(\mathbf{Z}_t^\top, \mathbf{Z}_{t-1}^\top, \ldots, \mathbf{Z}_{t-K}^\top\right)^\top$, $t \in \mathbb{Z}_-$. It is easy to verify that the corresponding reservoir functional is given by
$$H_h^{A,c}(\mathbf{Z}) = h(\mathbf{Z}_0^\top, \ldots, \mathbf{Z}_{-K}^\top). \qquad (17)$$
Now the triangle inequality together with (15), (16), and (17) allows us to conclude (12).

The statement (13) in the presence of the stationarity hypothesis for $\mathbf{Z}$ is a straightforward consequence of (7) and the equality (8).
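The shift-register construction in Step 2 can be checked numerically (an illustration, not part of the proof; dimensions are arbitrary): the linear reservoir with the nilpotent matrix $A$ and injection matrix $c$ simply stacks the last $K+1$ inputs in the state vector.

```python
import numpy as np

n, K = 2, 3
N = n * (K + 1)

# Shift-register reservoir from Step 2: A moves the stored block of past
# inputs down by one slot of size n, c writes the new input on top.
A = np.zeros((N, N))
A[n:, :n * K] = np.eye(n * K)      # A_{i,i-n} = 1 for i = n+1, ..., N
c = np.zeros((N, n))
c[:n, :n] = np.eye(n)              # c_{i,i} = 1 for i = 1, ..., n

rng = np.random.default_rng(4)
T = 20
Z = rng.normal(size=(T, n))

x = np.zeros(N)
for t in range(T):
    x = A @ x + c @ Z[t]

# After at least K+1 steps the state equals (Z_t, Z_{t-1}, ..., Z_{t-K}),
# so any readout h(x_t) is a function of the last K+1 inputs, as in (17).
expected = np.concatenate([Z[T - 1 - j] for j in range(K + 1)])
print(np.allclose(x, expected))
```

Since $A$ is nilpotent, all its eigenvalues vanish and the echo state property holds trivially, matching the argument in the proof.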

Remark III.2. It is important to point out that the reservoir systems used in the proof of Proposition III.1 all have finite memory. Thus, this proof shows that it is possible to obtain universality in the $L^p$ sense with that type of finite memory systems and that, in particular, they can be used to approximate infinite memory filters. A key ingredient in this statement is, apart from the hypothesis (11), Lemma A.1 in the Appendix. The other universal systems introduced later on in the paper (trigonometric state-affine systems and echo state networks) also share this feature. Similar statements have also been proved for linear reservoir systems with polynomial readouts and state-affine systems with linear readouts in the $L^\infty$ setup for both deterministic and almost surely uniformly bounded stochastic inputs (see, for instance, [25, Corollary 11, Theorem 19]). This phenomenon has also been observed in the context of the approximation of deterministic filters using Volterra series operators (see [13, Theorems 3 and 4]).

Remark III.3. A simple situation in which condition (11) is satisfied is when for any $t \in \mathbb{Z}_-$ the random variable $\mathbf{Z}_t$ is bounded, i.e., for any $t \in \mathbb{Z}_-$ there exists $C_t \ge 0$ such that $\|\mathbf{Z}_t\| \le C_t$, $\mathbb{P}$-a.s. However, as the next remark shows, there are also practically relevant examples of input streams with unbounded support for which (11) is satisfied.

Remark III.4. A sufficient condition for (11) to hold is that the random variables $\{\mathbf{Z}_t : t \in \mathbb{Z}_-\}$ are independent and that for each $t$ there exists a constant $\alpha > 0$ such that $\mathbb{E}[\exp(\alpha \sum_{i=1}^{n} |Z_t^{(i)}|)] < \infty$. This last condition is satisfied, for instance, if $\mathbf{Z}_t$ is normally distributed. For input streams coming from more heavy-tailed distributions, like Student's $t$-distribution, the condition is not satisfied and so, if universality is needed, one should instead use the reservoir systems considered below (see Corollary III.8, Theorem III.9, and Theorem III.10).
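For the Gaussian case mentioned in Remark III.4, the exponential moment even has a closed form, $\mathbb{E}[\exp(\alpha |Z|)] = 2 e^{\alpha^2/2} \Phi(\alpha)$ for $Z \sim N(0,1)$, which a quick Monte Carlo sanity check reproduces (an illustration, not from the paper; $\alpha = 0.5$ is an arbitrary choice):

```python
import math
import numpy as np

rng = np.random.default_rng(5)
alpha = 0.5

def Phi(x):
    # Standard normal CDF via the error function (stdlib only).
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Closed form of E[exp(alpha*|Z|)] for Z ~ N(0,1): split the integral at 0
# and complete the square in each half to get 2*exp(alpha^2/2)*Phi(alpha).
exact = 2.0 * math.exp(alpha**2 / 2.0) * Phi(alpha)

# Monte Carlo estimate of the same exponential moment.
Z = rng.normal(size=1_000_000)
mc = np.mean(np.exp(alpha * np.abs(Z)))
print(mc, exact)
```

For a Student's $t$ input the analogous expectation is infinite for every $\alpha > 0$ (the tails decay only polynomially), which is why such inputs fall outside the scope of Proposition III.1.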

Remark III.5. Assumption (11) can be replaced by alternative assumptions, but it cannot be removed. Even if $n = 1$ and $\{Z_t : t \in \mathbb{Z}_-\}$ are independent and identically distributed with distribution $\nu$, a condition stronger than the existence of moments of all orders for $\nu$ is required. As a counterexample, one may take for $\nu$ a lognormal distribution. Then $\nu$ has moments of all orders, but (11) is not satisfied. Let us now argue that the approximation result proved under assumption (11) fails in this case. The following argument relies on results for the classical moment problem (see, for example, the collection of references in [34]).

Indeed, by [35] $\nu$ is not determinate (there exist other probability measures with identical moments) and thus (see e.g. [36, Theorem 4.3]) $\mathrm{Pol}_1$ is not dense in $L^p(\mathbb{R}, \nu)$ for $p \ge 2$. In particular, there exist $g \in L^p(\mathbb{R}, \nu)$ and $\varepsilon > 0$ such that $\|g - \tilde{h}\|_p > \varepsilon$ for all $\tilde{h} \in \mathrm{Pol}_1$. Suppose that we are in the case $n = 1$, let $\{Z_t : t \in \mathbb{Z}_-\}$ be independent and identically distributed with distribution $\nu$, and set $H(\mathbf{z}) := g(z_0)$ for $\mathbf{z} \in \mathbb{R}^{\mathbb{Z}_-}$. Then, for any choice of $N$, $A$, $c$, and $h$ one has $\mathbb{E}[H_h^{A,c}(\mathbf{Z})|\mathcal{F}_0] = \tilde{h}(Z_0)$, where $\tilde{h}(x) := \mathbb{E}[h(A\mathbf{X}_{-1} + cx)]$, $x \in \mathbb{R}$, is a polynomial. Thus one may use [33, Theorem 5.1.4] and the fact that by construction $H(\mathbf{Z})$ is $\mathcal{F}_0$-measurable to obtain
$$\|H(\mathbf{Z}) - H_h^{A,c}(\mathbf{Z})\|_p \ge \|\mathbb{E}[H(\mathbf{Z})|\mathcal{F}_0] - \mathbb{E}[H_h^{A,c}(\mathbf{Z})|\mathcal{F}_0]\|_p = \|g - \tilde{h}\|_p > \varepsilon.$$

Remark III.6. In the previous reservoir computing universality results for both deterministic and stochastic inputs quoted in the introduction, there was an important continuity hypothesis called the fading memory property that does not play a role here and that has been replaced by the integrability requirement $H \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$. In particular, the universality results that we just proved and those that come in the next section (see Theorem III.9) yield approximations for filters which do not necessarily have the fading memory property. Whether or not the approximation results apply depends on the integrability condition with respect to the input environment measure $\mu_{\mathbf{Z}}$. Consider, for example, the functional associated to the peak-hold operator [13]. In the discrete-time setting, the associated functional is
$$H(\mathbf{z}) = \sup_{t \le 0} \{z_t\}, \quad \text{with } \mathbf{z} \in \mathbb{R}^{\mathbb{Z}_-}.$$
We now show that the two possibilities $H \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$ and $H \notin L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$ are feasible, depending on the choice of $\mu_{\mathbf{Z}}$:

• Let $\mathbf{Z} = (Z_t)_{t \in \mathbb{Z}_-}$ be a sequence of one-dimensional independent and identically distributed (i.i.d.) random variables with unbounded support and denote by $\mu_{\mathbf{Z}}$ the law of $\mathbf{Z}$ on $\mathbb{R}^{\mathbb{Z}_-}$. Denoting by $F$ the common distribution function of the $Z_t$ and using the i.i.d. assumption, one calculates, for any $a \in \mathbb{R}$,
$$\mathbb{P}(H(\mathbf{Z}) > a) = 1 - \mathbb{P}\left(\cap_{t \le 0} \{Z_t \le a\}\right) = 1 - \lim_{n \to \infty} F(a)^n = 1.$$
Hence, we can conclude that $H(\mathbf{Z}) = \infty$, $\mu_{\mathbf{Z}}$-almost everywhere, and therefore $H \notin L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$.

• Consider now the same setup, but assume this time that the random variables have bounded support, that is, for some $a_{\max} \in \mathbb{R}$ one has $\mathbb{P}(Z_t \le a_{\max}) = 1$ and $\mathbb{P}(Z_t \le a) < 1$ for all $a < a_{\max}$ (that is, $a_{\max}$ is the essential supremum of $Z_t$). Then, the same argument shows that $H(\mathbf{Z}) = a_{\max}$, $\mu_{\mathbf{Z}}$-almost everywhere, and therefore $H \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu_{\mathbf{Z}})$.
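The dichotomy in Remark III.6 can be glimpsed on truncated sample paths (an illustrative sketch, not a proof): the running maximum of an i.i.d. Gaussian stream never decreases as more of the past is included, consistent with $H(\mathbf{Z}) = \infty$ almost everywhere in the unbounded-support case, while clipping the same inputs to $[-1, 1]$ caps the running maximum at the essential supremum.

```python
import numpy as np

rng = np.random.default_rng(6)

# Unbounded support: the running maximum of i.i.d. N(0,1) samples is
# nondecreasing in the truncation length (and diverges as it grows).
z = rng.normal(size=100_000)
running_max = np.maximum.accumulate(z)
print(running_max[99], running_max[-1])

# Bounded support: clipping the inputs to [-1, 1] makes the running
# maximum saturate at the bound a_max = 1.
clipped_max = np.maximum.accumulate(np.clip(z, -1.0, 1.0))
print(clipped_max[-1])
```

This is only a finite-horizon picture; the almost-everywhere statements above concern the full left-infinite paths.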

Remark III.7. From the proof of Proposition III.1 one sees that one could replace $\mathrm{Pol}_N$ in its statement by any other family $\{\mathcal{H}_N\}_{N \in \mathbb{N}}$ that satisfies the density statement (14). In particular, the following corollary shows that this result can be obtained with readouts made out of neural networks.

Denote by $\mathcal{H}_N$ the set of feedforward neural networks with one hidden layer and inputs in $\mathbb{R}^N$ that are constructed with a fixed activation function $\sigma$. More specifically, $\mathcal{H}_N$ is made of functions $h: \mathbb{R}^N \to \mathbb{R}$ of the type
$$h(\mathbf{x}) = \sum_{j=1}^{k} \beta_j \sigma(\boldsymbol{\alpha}_j \cdot \mathbf{x} - \theta_j), \qquad (18)$$
for some $k \in \mathbb{N}$, $\beta_j, \theta_j \in \mathbb{R}$, and $\boldsymbol{\alpha}_j \in \mathbb{R}^N$, for $j = 1, \ldots, k$.
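A network of the form (18) is straightforward to evaluate; the sketch below (illustrative parameters, not from the paper) uses the logistic sigmoid, which is bounded and non-constant as required by Corollary III.8:

```python
import numpy as np

rng = np.random.default_rng(7)
N, k = 4, 16                     # input dimension, number of hidden units

# Random network of the form (18): h(x) = sum_j beta_j * sigma(alpha_j . x - theta_j).
alpha = rng.normal(size=(k, N))
beta = rng.normal(size=k)
theta = rng.normal(size=k)

def sigma(u):
    # Logistic sigmoid: bounded in (0, 1) and non-constant.
    return 1.0 / (1.0 + np.exp(-u))

def h(x):
    return float(beta @ sigma(alpha @ x - theta))

x = rng.normal(size=N)
print(h(x))
```

Since $\sigma$ takes values in $(0,1)$, any such network is bounded by $\sum_j |\beta_j|$, so composing it with the linear reservoir of Proposition III.1 always yields a $p$-integrable functional.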

Corollary III.8. In the setup of Proposition III.1, consider the family of neural networks $h \in \mathcal{H}_N$ constructed with a fixed activation function $\sigma$ that is bounded and non-constant. Then, for any $\varepsilon > 0$ there exist $N \in \mathbb{N}$, $A \in M_N$, $c \in M_{N,n}$, and a neural network $h \in \mathcal{H}_N$ such that the corresponding reservoir system (9) has the echo state property and has a unique causal and time-invariant filter associated to it. Moreover, the corresponding functional satisfies $H_h^{A,c}(\mathbf{Z}) \in L^p(\Omega, \mathcal{F}, \mathbb{P})$ and
$$\|H(\mathbf{Z}) - H_h^{A,c}(\mathbf{Z})\|_p < \varepsilon. \qquad (19)$$

Proof. By [6, Theorem 1] the set $\mathcal{H}_N$ is dense in $L^p(\mathbb{R}^N, \mu)$ for any finite measure $\mu$ on $\mathbb{R}^N$. Thus, statement (14) holds with $\mathcal{H}_N$ replacing $\mathrm{Pol}_{n(K+1)}$. Mimicking line by line the proof of Step 2 in Proposition III.1 then proves the corollary.

B. Trigonometric state-afﬁne systems with linear readouts

Fix $M, N \in \mathbb{N}$ and consider $R: \mathbb{R}^n \to M_{N,M}$ defined by
$$R(\mathbf{z}) := \sum_{k=1}^{r} A_k \cos(\mathbf{u}_k \cdot \mathbf{z}) + B_k \sin(\mathbf{v}_k \cdot \mathbf{z}), \quad \mathbf{z} \in \mathbb{R}^n, \qquad (20)$$
for some $r \in \mathbb{N}$, $A_k, B_k \in M_{N,M}$, $\mathbf{u}_k, \mathbf{v}_k \in \mathbb{R}^n$, for $k = 1, \ldots, r$. The symbol $\mathrm{Trig}_{N,M}$ denotes the set of all functions of the type (20). We call the elements of $\mathrm{Trig}_{N,M}$ trigonometric polynomials.
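A matrix-valued trigonometric polynomial as in (20) can be evaluated directly (an illustrative sketch with arbitrary dimensions and coefficients):

```python
import numpy as np

rng = np.random.default_rng(8)
n, N, M, r = 3, 4, 4, 2          # input dim, matrix sizes, number of terms

# Coefficients of R(z) = sum_k A_k cos(u_k . z) + B_k sin(v_k . z).
A = rng.normal(size=(r, N, M))
B = rng.normal(size=(r, N, M))
u = rng.normal(size=(r, n))
v = rng.normal(size=(r, n))

def R(z):
    out = np.zeros((N, M))
    for k in range(r):
        out += A[k] * np.cos(u[k] @ z) + B[k] * np.sin(v[k] @ z)
    return out

z = rng.normal(size=n)
print(R(z).shape)
```

Note that, unlike ordinary matrix polynomials, $R$ is uniformly bounded in $\mathbf{z}$ (entrywise by $\sum_k |A_k| + |B_k|$), which is what makes these reservoir maps well suited to unbounded stochastic inputs.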

We now introduce reservoir systems with linear readouts and reservoir maps constructed using trigonometric polynomials: let $N \in \mathbb{N}$, $\mathbf{w} \in \mathbb{R}^N$, $P \in \mathrm{Trig}_{N,N}$, $Q \in \mathrm{Trig}_{N,1}$, and define, for any $\mathbf{z} \in (\mathbb{R}^n)^{\mathbb{Z}_-}$, the system
$$\begin{cases} \mathbf{x}_t = P(\mathbf{z}_t)\mathbf{x}_{t-1} + Q(\mathbf{z}_t), & t \in \mathbb{Z}_-, \\ y_t = \mathbf{w}^\top \mathbf{x}_t, & t \in \mathbb{Z}_-. \end{cases} \qquad (21)$$
We call systems of this type trigonometric state-affine systems. When such a system has the echo state property and a unique causal and time-invariant solution for any input, we denote by $U_{\mathbf{w}}^{P,Q}$ the corresponding filter and by $H_{\mathbf{w}}^{P,Q}(\mathbf{z}) := y_0$ the associated functional. As in the previous section, we fix $p \in [1,\infty)$, an input process $Z$, and a functional $H$ such that $H(Z) \in L^p(\Omega, \mathcal{F}, P)$, and we are interested in approximating $H(Z)$ by systems of the form $H_{\mathbf{w}}^{P,Q}(Z)$. Again, we will write $H_{\mathbf{w}}^{P,Q}(Z) = Y_0$, where $Y_0$ is uniquely determined by the reservoir system with stochastic inputs
$$\begin{cases} \mathbf{X}_t = P(\mathbf{Z}_t)\mathbf{X}_{t-1} + Q(\mathbf{Z}_t), & t \in \mathbb{Z}_-, \\ Y_t = \mathbf{w}^\top \mathbf{X}_t, & t \in \mathbb{Z}_-. \end{cases} \qquad (22)$$
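A minimal sketch of how the recursion in (21) can be run on a finite input window follows. The state is initialized at zero, which the echo state property is what justifies in general; all dimensions and the randomly drawn coefficients below are arbitrary choices for illustration only.

```python
import numpy as np

def trig_poly(A_list, B_list, U, V):
    """Return z -> sum_k A_k cos(u_k . z) + B_k sin(v_k . z), cf. (20)."""
    def R(z):
        return sum(A * np.cos(u @ z) + B * np.sin(v @ z)
                   for A, B, u, v in zip(A_list, B_list, U, V))
    return R

rng = np.random.default_rng(1)
n, N, r = 2, 4, 3
P = trig_poly([rng.normal(size=(N, N)) for _ in range(r)],
              [rng.normal(size=(N, N)) for _ in range(r)],
              rng.normal(size=(r, n)), rng.normal(size=(r, n)))
Q = trig_poly([rng.normal(size=(N, 1)) for _ in range(r)],
              [rng.normal(size=(N, 1)) for _ in range(r)],
              rng.normal(size=(r, n)), rng.normal(size=(r, n)))
w = rng.normal(size=N)

# run x_t = P(z_t) x_{t-1} + Q(z_t), y_t = w^T x_t over a finite window
z = rng.normal(size=(50, n))     # z[0] is the oldest input, z[-1] plays z_0
x = np.zeros((N, 1))
for z_t in z:
    x = P(z_t) @ x + Q(z_t)
y0 = float(w @ x.flatten())      # the value of the associated functional
```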


Define $\mathcal{A}$ as the set of four-tuples $(N, \mathbf{w}, P, Q) \in \mathbb{N} \times \mathbb{R}^N \times \mathrm{Trig}_{N,N} \times \mathrm{Trig}_{N,1}$ whose associated systems (21) have the echo state property and whose unique solutions are causal and time-invariant. In particular, for such $(N, \mathbf{w}, P, Q)$ a reservoir functional $H_{\mathbf{w}}^{P,Q}$ associated to (21) exists.

Theorem III.9. Let $p \in [1,\infty)$ and let $Z$ be a fixed $\mathbb{R}^n$-valued input process. Denote by $\mathcal{L}_Z$ the set of reservoir functionals of the type (21) which are $p$-integrable, that is,
$$\mathcal{L}_Z := \{H_{\mathbf{w}}^{P,Q}(Z) : (N, \mathbf{w}, P, Q) \in \mathcal{A}\} \cap L^p(\Omega, \mathcal{F}, P).$$
Then $\mathcal{L}_Z$ is dense in $L^p(\Omega, \mathcal{F}_{-\infty}, P)$.

In particular, for any functional $H$ such that $H(Z) \in L^p(\Omega, \mathcal{F}, P)$ and any $\varepsilon > 0$, there exist $N \in \mathbb{N}$, $\mathbf{w} \in \mathbb{R}^N$, $P \in \mathrm{Trig}_{N,N}$, and $Q \in \mathrm{Trig}_{N,1}$ such that the system (21) has the echo state property and causal and time-invariant solutions. Moreover, $H_{\mathbf{w}}^{P,Q}(Z) \in L^p(\Omega, \mathcal{F}, P)$ and
$$\|H(Z) - H_{\mathbf{w}}^{P,Q}(Z)\|_p < \varepsilon. \qquad (23)$$
If the input process $Z$ is stationary, then
$$\|U_H(Z) - U_{\mathbf{w}}^{P,Q}(Z)\|_p < \varepsilon. \qquad (24)$$

Proof. We first argue that $\mathcal{L}_Z$ is a linear subspace of $L^p(\Omega, \mathcal{F}_{-\infty}, P)$. To do this we need to introduce some notation. Given $A \in \mathbb{M}_{N_1,M_1}$ and $B \in \mathbb{M}_{N_2,M_2}$, we denote by $A \oplus B \in \mathbb{M}_{N_1+N_2,M_1+M_2}$ their direct sum. Given $R$ as in (20), we define $R \oplus A \in \mathrm{Trig}_{N+N_1,M+M_1}$ by
$$R \oplus A(\mathbf{z}) := \sum_{k=1}^{r} A_k \oplus A \cos(\mathbf{u}_k \cdot \mathbf{z}) + B_k \oplus A \sin(\mathbf{v}_k \cdot \mathbf{z}),$$
and (with the analogous definition for $B \oplus R$) for $R_i \in \mathrm{Trig}_{N_i,M_i}$, $i = 1, 2$, we set
$$R_1 \oplus R_2 = R_1 \oplus 0_{N_2,M_2} + 0_{N_1,M_1} \oplus R_2.$$
One easily verifies that for $\lambda \in \mathbb{R}$ and $(N_i, \mathbf{w}_i, P_i, Q_i) \in \mathcal{A}$, $i = 1, 2$, one has
$$(N_1 + N_2, \mathbf{w}_1 \oplus \lambda\mathbf{w}_2, P_1 \oplus P_2, Q_1 \oplus Q_2) \in \mathcal{A},$$
$$H_{\mathbf{w}_1}^{P_1,Q_1}(Z) + \lambda H_{\mathbf{w}_2}^{P_2,Q_2}(Z) = H_{\mathbf{w}_1 \oplus \lambda\mathbf{w}_2}^{P_1 \oplus P_2, Q_1 \oplus Q_2}(Z).$$
This shows that $\mathcal{L}_Z$ is indeed a linear subspace of $L^p(\Omega, \mathcal{F}_{-\infty}, P)$.
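The linearity identity above can be sanity-checked numerically. The sketch below (not part of the original proof) uses two small state-affine systems with bounded trigonometric coefficient maps, runs all systems from a zero initial state on the same finite window, and compares the direct-sum system's output with the corresponding linear combination.

```python
import numpy as np

def run(P, Q, w, z_seq):
    # iterate x_t = P(z_t) x_{t-1} + Q(z_t) over the window, return w . x_0
    x = np.zeros(len(w))
    for z in z_seq:
        x = P(z) @ x + Q(z)
    return w @ x

rng = np.random.default_rng(2)
n, N1, N2 = 2, 3, 4
u1, u2 = rng.normal(size=n), rng.normal(size=n)
A1, A2 = rng.normal(size=(N1, N1)), rng.normal(size=(N2, N2))
b1, b2 = rng.normal(size=N1), rng.normal(size=N2)
P1 = lambda z: A1 * np.cos(u1 @ z); Q1 = lambda z: b1 * np.sin(u1 @ z)
P2 = lambda z: A2 * np.cos(u2 @ z); Q2 = lambda z: b2 * np.sin(u2 @ z)
w1, w2, lam = rng.normal(size=N1), rng.normal(size=N2), 0.7

# direct sums: block-diagonal P1 (+) P2 and stacked Q1 (+) Q2
P = lambda z: np.block([[P1(z), np.zeros((N1, N2))],
                        [np.zeros((N2, N1)), P2(z)]])
Q = lambda z: np.concatenate([Q1(z), Q2(z)])
w = np.concatenate([w1, lam * w2])

z_seq = rng.normal(size=(30, n))
lhs = run(P1, Q1, w1, z_seq) + lam * run(P2, Q2, w2, z_seq)
rhs = run(P, Q, w, z_seq)   # agrees with lhs up to rounding error
```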

Secondly, in order to show that $\mathcal{L}_Z$ is dense in $L^p(\Omega, \mathcal{F}_{-\infty}, P)$, it suffices to prove that if $F \in L^q(\Omega, \mathcal{F}_{-\infty}, P)$ satisfies $E[FH] = 0$ for all $H \in \mathcal{L}_Z$, then $F = 0$, $P$-almost surely. Here $q \in (1,\infty]$ is the Hölder conjugate exponent of $p$. This can be shown by contraposition. Suppose that $\mathcal{L}_Z$ is not dense in $L^p(\Omega, \mathcal{F}_{-\infty}, P)$. Since $\mathcal{L}_Z$ is a linear subspace, by the Hahn-Banach theorem there exists a bounded linear functional $\Lambda$ on $L^p(\Omega, \mathcal{F}_{-\infty}, P)$ such that $\Lambda(H) = 0$ for all $H \in \mathcal{L}_Z$ but $\Lambda \neq 0$; see, e.g., [37, Theorem 5.19]. Then by [37, Theorem 6.16] there exists $F \in L^q(\Omega, \mathcal{F}_{-\infty}, P)$ such that $\Lambda(H) = E[FH]$ for all $H \in L^p(\Omega, \mathcal{F}_{-\infty}, P)$ and $F \neq 0$, since $\Lambda \neq 0$. In particular, there exists $F \in L^q(\Omega, \mathcal{F}_{-\infty}, P) \setminus \{0\}$ such that $E[FH] = 0$ for all $H \in \mathcal{L}_Z$.

Thirdly, suppose that $F \in L^q(\Omega, \mathcal{F}_{-\infty}, P)$ satisfies
$$E[FH] = 0 \quad \text{for all } H \in \mathcal{L}_Z. \qquad (25)$$
If we show that $F = 0$, $P$-almost surely, then the statement in the theorem follows by the argument in the second step. In order to prove that $F = 0$, $P$-almost surely, we first show that (25) implies the following statement: for any $K \in \mathbb{N}$, any subset $I \subset I_K := \{0, \ldots, K\}$, and any $\mathbf{u}_0, \ldots, \mathbf{u}_K \in \mathbb{R}^n$ it holds that
$$E\Big[F \prod_{j \in I} \sin(\mathbf{u}_j \cdot \mathbf{Z}_{-j}) \prod_{k \in I_K \setminus I} \cos(\mathbf{u}_k \cdot \mathbf{Z}_{-k})\Big] = 0. \qquad (26)$$
We prove this claim by induction on $K \in \mathbb{N}$. For $K = 0$, one sets $Q_1(\mathbf{z}) := \cos(\mathbf{u}_0 \cdot \mathbf{z})$ and $Q_2(\mathbf{z}) := \sin(\mathbf{u}_0 \cdot \mathbf{z})$ and notices that $(1, 1, 0, Q_i) \in \mathcal{A}$. Moreover, since the sine and cosine functions are bounded, it is easy to see that $Q_i(\mathbf{Z}_0) = H_1^{0,Q_i}(Z) \in \mathcal{L}_Z$ for $i \in \{1, 2\}$. Thus (25) implies (26) and so the statement holds for $K = 0$. For the induction step, let $K \in \mathbb{N} \setminus \{0\}$ and assume the implication holds for $K - 1$. We now fix $I$ and $\mathbf{u}_0, \ldots, \mathbf{u}_K \in \mathbb{R}^n$ as above and prove (26). To simplify the notation we define, for $k \in \{0, \ldots, K\}$ and $\mathbf{z} \in \mathbb{R}^n$, the function $g_k$ by
$$g_k(\mathbf{z}) := \begin{cases} \sin(\mathbf{u}_k \cdot \mathbf{z}), & \text{if } k \in I, \\ \cos(\mathbf{u}_k \cdot \mathbf{z}), & \text{if } k \in I_K \setminus I. \end{cases}$$

To prove (26), we set $N := K + 1$ and, for $j \in \{1, \ldots, K\}$, define $A_j \in \mathbb{M}_N$ with all entries equal to $0$ except $(A_j)_{j+1,j} = 1$, that is, $(A_j)_{k,l} = \delta_{k,j+1}\delta_{l,j}$, $k, l \in \{1, \ldots, N\}$. Define now, for $\mathbf{z} \in \mathbb{R}^n$,
$$P(\mathbf{z}) := \sum_{j=0}^{K-1} A_{K-j}\, g_j(\mathbf{z}), \qquad Q(\mathbf{z}) := \mathbf{e}_1 g_K(\mathbf{z}), \qquad \mathbf{w} := \mathbf{e}_{K+1}, \qquad (27)$$
where $\mathbf{e}_j$ is the $j$-th unit vector in $\mathbb{R}^N$, that is, the only non-zero entry of $\mathbf{e}_j$ is a $1$ in the $j$-th coordinate. By Lemma A.2 in the appendix, one has $A_{j_L} \cdots A_{j_0} = 0$ for any $j_0, \ldots, j_L \in \{1, \ldots, K\}$ and $L \geq K$, since $j_L = j_0 + L$ cannot be satisfied. In other words, any product of more than $K$ factors of matrices $A_j$ is equal to $0$, and thus for any $L \in \mathbb{N}$ with $L \geq K$ and any $\mathbf{z}_0, \ldots, \mathbf{z}_L \in \mathbb{R}^n$ one has $P(\mathbf{z}_0) \cdots P(\mathbf{z}_L) = 0$. Using this fact and iterating (21), one obtains that the trigonometric state-affine system defined by the elements in (27) has a unique solution given by
$$\mathbf{x}_t = Q(\mathbf{z}_t) + \sum_{j=1}^{K} P(\mathbf{z}_t) \cdots P(\mathbf{z}_{t-j+1}) Q(\mathbf{z}_{t-j}). \qquad (28)$$
In particular, $(N, \mathbf{w}, P, Q) \in \mathcal{A}$ and
$$H_{\mathbf{w}}^{P,Q}(Z) = \mathbf{w}^\top \mathbf{X}_0 = \mathbf{w}^\top \Big( Q(\mathbf{Z}_0) + \sum_{j=1}^{K} P(\mathbf{Z}_0) \cdots P(\mathbf{Z}_{-j+1}) Q(\mathbf{Z}_{-j}) \Big). \qquad (29)$$
The finiteness of the sum in (29) and the boundedness of the trigonometric polynomials imply that $H_{\mathbf{w}}^{P,Q}(Z) \in \mathcal{L}_Z$.
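The nilpotent construction (27) can be verified numerically: with the shift matrices $A_j$, the functional (29) evaluates to exactly the product $\prod_{k=0}^{K} g_k(\mathbf{Z}_{-k})$ appearing in (26). The sketch below (arbitrary $K$, $\mathbf{u}_k$, and inputs, not values from the paper) checks this by running the recursion (21) from a zero state over a window of length $K+1$, which reproduces (28) since longer products of $P$'s vanish.

```python
import numpy as np

K = 3
N = K + 1
rng = np.random.default_rng(3)

# shift matrices (A_j)_{k,l} = delta_{k,j+1} delta_{l,j} (1-based indices)
A = {j: np.zeros((N, N)) for j in range(1, K + 1)}
for j in A:
    A[j][j, j - 1] = 1.0           # row j+1, column j in 1-based terms

# g_0, ..., g_K: bounded scalar functions (sines/cosines with random u_k)
u = rng.normal(size=(K + 1, 2))
g = [(lambda z, uk=u[k], s=(k % 2): np.sin(uk @ z) if s else np.cos(uk @ z))
     for k in range(K + 1)]

def P(z):
    return sum(A[K - j] * g[j](z) for j in range(K))

e1, eK1 = np.eye(N)[0], np.eye(N)[K]
Z = rng.normal(size=(K + 1, 2))    # Z[k] plays the role of Z_{-k}

# iterate x_t = P(z_t) x_{t-1} + Q(z_t), feeding Z_{-K}, ..., Z_0 in order
x = np.zeros(N)
for k in reversed(range(K + 1)):
    x = P(Z[k]) @ x + e1 * g[K](Z[k])   # Q(z) = e_1 g_K(z)
H = eK1 @ x                        # = w^T x_0 with w = e_{K+1}

target = np.prod([g[k](Z[k]) for k in range(K + 1)])
# H and target agree up to rounding error
```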

We conclude the proof of the induction step with the following chain of equalities, which uses (25) in the first one, the representation (29) in the second one, and the choice of the vector $\mathbf{w}$ and the induction hypothesis in the last step:
$$0 = E[F H_{\mathbf{w}}^{P,Q}(Z)] = E[F \mathbf{w}^\top Q(\mathbf{Z}_0)] + E\Big[F \mathbf{w}^\top \sum_{j=1}^{K} P(\mathbf{Z}_0) \cdots P(\mathbf{Z}_{-j+1}) Q(\mathbf{Z}_{-j})\Big] = E[F \mathbf{w}^\top P(\mathbf{Z}_0) \cdots P(\mathbf{Z}_{-K+1}) Q(\mathbf{Z}_{-K})]. \qquad (30)$$
However, again by Lemma A.2 in the appendix, the only non-zero product of matrices $A_{j_{K-1}} \cdots A_{j_0}$ for $j_0, \ldots, j_{K-1} \in \{1, \ldots, K\}$ occurs when $j_k = k + 1$ for $k \in \{0, \ldots, K-1\}$. Therefore,
$$P(\mathbf{Z}_0) \cdots P(\mathbf{Z}_{-K+1}) = A_K g_0(\mathbf{Z}_0) A_{K-1} g_1(\mathbf{Z}_{-1}) \cdots A_1 g_{K-1}(\mathbf{Z}_{-K+1}).$$
Combining this with (30) and using the identity (49) in Lemma A.2 in the appendix, one obtains
$$0 = E\Big[F \mathbf{e}_{K+1}^\top A_K \cdots A_1 \mathbf{e}_1 \prod_{k=0}^{K} g_k(\mathbf{Z}_{-k})\Big] = E\Big[F \prod_{k=0}^{K} g_k(\mathbf{Z}_{-k})\Big],$$
which is the same as (26).

Fourthly, by standard trigonometric identities, the identity (26) established in the third step implies that, for any $K \in \mathbb{N}$,
$$E\Big[F \exp\Big(i \sum_{j=0}^{K} \mathbf{u}_j \cdot \mathbf{Z}_{-j}\Big)\Big] = 0 \quad \text{for all } \mathbf{u}_0, \ldots, \mathbf{u}_K \in \mathbb{R}^n. \qquad (31)$$

We claim that (31) implies $F = 0$, $P$-almost surely, and hence the statement in the theorem follows. This fact is a consequence of the uniqueness theorem for characteristic functions (which is ultimately a consequence of the Stone-Weierstrass approximation theorem); see for instance [30, Theorem 4.3] and the text below that result. To prove $F = 0$, $P$-almost surely, we denote by $F^+$ and $F^-$ the positive and negative parts of $F$. By (31) one necessarily has $E[F] = 0$. Thus, if it does not hold that $F = 0$, $P$-almost surely, then $c := E[F^+] = E[F^-] > 0$ and one may define probability measures $Q^+$ and $Q^-$ on $(\Omega, \mathcal{F})$ by setting $Q^+(A) := c^{-1}E[F^+ \mathbb{1}_A]$ and $Q^-(A) := c^{-1}E[F^- \mathbb{1}_A]$ for $A \in \mathcal{F}$. Denote by $\mu_K^+$ and $\mu_K^-$ the laws in $\mathbb{R}^{n(K+1)}$ of the random variable
$$\mathbf{Z}^K := (\mathbf{Z}_0^\top, \mathbf{Z}_{-1}^\top, \ldots, \mathbf{Z}_{-K}^\top)^\top$$
under $Q^+$ and $Q^-$, respectively. Then, the statement (31) implies that for all $\mathbf{u} \in \mathbb{R}^{n(K+1)}$,
$$\int_{\mathbb{R}^{n(K+1)}} \exp(i\mathbf{u} \cdot \mathbf{z})\, \mu_K^+(d\mathbf{z}) = \int_{\mathbb{R}^{n(K+1)}} \exp(i\mathbf{u} \cdot \mathbf{z})\, \mu_K^-(d\mathbf{z}).$$
By the uniqueness theorem for characteristic functions (see, e.g., [30, Theorem 4.3] and the text below), this implies that $\mu_K^+ = \mu_K^-$. Translating this statement back to random variables, this means that for any bounded and measurable function $g: \mathbb{R}^{n(K+1)} \to \mathbb{R}$ one has
$$0 = cE_{Q^+}[g(\mathbf{Z}^K)] - cE_{Q^-}[g(\mathbf{Z}^K)] = E[F g(\mathbf{Z}^K)],$$
which, by definition, means that $E[F \mid \mathcal{F}_{-K}] = 0$, $P$-almost surely. Since $K \in \mathbb{N}$ was arbitrary and $F \in L^1(\Omega, \mathcal{F}_{-\infty}, P)$, one may combine this with $\lim_{t \to -\infty} E[F \mid \mathcal{F}_t] = F$, $P$-almost surely (see Lemma A.1), to conclude $F = 0$, as desired.

The statement (24) under the stationarity hypothesis for $Z$ is a straightforward consequence of (7) and the equality (8).

We emphasize that the use in the proof of the theorem of nilpotent matrices of the type introduced in Lemma A.2 ensures that the echo state property is automatically satisfied (see (28)).

C. Echo state networks

We now turn to showing the universality in the $L^p$ sense of the most widely used reservoir systems with linear readouts, namely, echo state networks. An echo state network is a RC system determined by
$$\begin{cases} \mathbf{x}_t = \sigma(A\mathbf{x}_{t-1} + C\mathbf{z}_t + \boldsymbol{\zeta}), \\ y_t = \mathbf{w}^\top \mathbf{x}_t, \end{cases} \qquad (32)$$
for $A \in \mathbb{M}_N$, $C \in \mathbb{M}_{N,n}$, $\boldsymbol{\zeta} \in \mathbb{R}^N$, and $\mathbf{w} \in \mathbb{R}^N$. As is customary in the neural networks literature, the map $\sigma: \mathbb{R}^N \to \mathbb{R}^N$ is obtained via the componentwise application of a given activation function $\sigma: \mathbb{R} \to \mathbb{R}$ that is denoted with the same symbol.

If this system has the echo state property and the resulting filter is causal and time-invariant, we denote by $H_{\mathbf{w}}^{A,C,\zeta}(\mathbf{z}) := y_0$ the associated functional.
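For illustration, the recursion (32) can be simulated on a finite input window as below. Starting from a zero state is an approximation that the echo state property renders harmless for long enough windows; scaling the spectral radius of $A$ below one is a common practical heuristic, not a condition taken from the results of this paper, and all sizes are arbitrary.

```python
import numpy as np

def esn_functional(A, C, zeta, w, z_window, sigma=np.tanh):
    """Run x_t = sigma(A x_{t-1} + C z_t + zeta) over a finite window
    and return y_0 = w . x_0, cf. (32).  z_window[0] is the oldest input."""
    x = np.zeros(len(w))
    for z_t in z_window:
        x = sigma(A @ x + C @ z_t + zeta)
    return w @ x

rng = np.random.default_rng(4)
N, n, T = 20, 2, 200
A = rng.normal(size=(N, N))
A *= 0.9 / max(abs(np.linalg.eigvals(A)))   # scale spectral radius to 0.9
C = rng.normal(size=(N, n))
zeta = rng.normal(size=N)
w = rng.normal(size=N)

z = rng.normal(size=(T, n))
y0 = esn_functional(A, C, zeta, w, z)
```

Since the state entries are bounded by one in absolute value for the `tanh` activation, the output is bounded by the $\ell^1$ norm of the readout vector.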

Theorem III.10. Fix $p \in [1,\infty)$, let $Z$ be a fixed $\mathbb{R}^n$-valued input process, and let $H$ be a functional such that $H(Z) \in L^p(\Omega, \mathcal{F}, P)$. Suppose that the activation function $\sigma: \mathbb{R} \to \mathbb{R}$ is non-constant, continuous, and has a bounded image. Then for any $\varepsilon > 0$ there exist $N \in \mathbb{N}$, $C \in \mathbb{M}_{N,n}$, $\boldsymbol{\zeta} \in \mathbb{R}^N$, $A \in \mathbb{M}_N$, and $\mathbf{w} \in \mathbb{R}^N$ such that (32) has the echo state property, the corresponding filter is causal and time-invariant, and the associated functional satisfies $H_{\mathbf{w}}^{A,C,\zeta}(Z) \in L^p(\Omega, \mathcal{F}, P)$ and
$$\|H(Z) - H_{\mathbf{w}}^{A,C,\zeta}(Z)\|_p < \varepsilon. \qquad (33)$$

Proof. First, by Corollary III.8 and (17) there exist $K, N \in \mathbb{N}$, $\mathbf{w} \in \mathbb{R}^N$, $A \in \mathbb{M}_{N,n(K+1)}$, and $\boldsymbol{\zeta} \in \mathbb{R}^N$ such that the neural network
$$h(\mathbf{z}) = \mathbf{w}^\top \sigma(A\mathbf{z} + \boldsymbol{\zeta})$$
satisfies
$$\|H(Z) - h(\mathbf{Z}_0^\top, \ldots, \mathbf{Z}_{-K}^\top)\|_p < \frac{\varepsilon}{2}. \qquad (34)$$
Notice that we may rewrite $A$ as
$$A = [A^{(0)}\ A^{(-1)}\ \cdots\ A^{(-K)}]$$
with $A^{(j)} \in \mathbb{M}_{N,n}$ and
$$H^\infty(Z) := h(\mathbf{Z}_0^\top, \ldots, \mathbf{Z}_{-K}^\top) = \mathbf{w}^\top \sigma\Big(\sum_{j=0}^{K} A^{(-j)}\mathbf{Z}_{-j} + \boldsymbol{\zeta}\Big). \qquad (35)$$

Second, by the neural network approximation theorem for continuous functions [6, Theorem 2], for any $m \in \mathbb{N}$ there exists a neural network that uniformly approximates the identity mapping on the hypercube $B_m := \{\mathbf{x} \in \mathbb{R}^n : |x_i| \leq m \text{ for } i = 1, \ldots, n\}$. More specifically, [6, Theorem 2] is formulated for $\mathbb{R}$-valued mappings and we hence apply it componentwise: for any $m \in \mathbb{N}$ and $i = 1, \ldots, n$ there exist $N_i^{(m)} \in \mathbb{N}$, $\mathbf{w}_i^{(m)} \in \mathbb{R}^{N_i^{(m)}}$, $A_i^{(m)} \in \mathbb{M}_{N_i^{(m)},n}$, and $\boldsymbol{\zeta}_i^{(m)} \in \mathbb{R}^{N_i^{(m)}}$ such that for all $i = 1, \ldots, n$ the neural network
$$h_i^{(m)}(\mathbf{x}) = (\mathbf{w}_i^{(m)})^\top \sigma\big(A_i^{(m)}\mathbf{x} + \boldsymbol{\zeta}_i^{(m)}\big)$$
satisfies
$$\sup_{\mathbf{x} \in B_m} \{|h_i^{(m)}(\mathbf{x}) - x_i|\} < \frac{1}{m}. \qquad (36)$$

Write $h^{(m)}(\mathbf{x}) = (h_1^{(m)}(\mathbf{x}), \ldots, h_n^{(m)}(\mathbf{x}))^\top$ and, for $j = 1, \ldots, K$, denote by $[h^{(m)}]^j = h^{(m)} \circ \cdots \circ h^{(m)}$ the $j$-fold composition of $h^{(m)}$. We now claim that for all $j = 1, \ldots, K$ and $\mathbf{x} \in \mathbb{R}^n$ it holds that
$$\lim_{m \to \infty} [h^{(m)}]^j(\mathbf{x}) = \mathbf{x}. \qquad (37)$$
Indeed, let us fix $\mathbf{x} \in \mathbb{R}^n$ and argue by induction on $j$. To prove (37) for $j = 1$, let $\varepsilon > 0$ be given and choose $m_0 \in \mathbb{N}$ satisfying $m_0 > \max\{|x_1|, \ldots, |x_n|, 1/\varepsilon\}$. Then, for any $m \geq m_0$ one has $\mathbf{x} \in B_m$ by definition and (36) implies that for $i = 1, \ldots, n$,
$$|h_i^{(m)}(\mathbf{x}) - x_i| < \frac{1}{m} < \varepsilon.$$
Hence (37) indeed holds for $j = 1$. Now let $j \geq 2$ and assume that (37) has been proved for $j - 1$. Define $\mathbf{x}^{(m)} := [h^{(m)}]^{j-1}(\mathbf{x})$. Then, by the induction hypothesis, for any given $\varepsilon > 0$ one finds $m_0 \in \mathbb{N}$ such that for all $m \geq m_0$ and $i = 1, \ldots, n$ it holds that
$$|x_i^{(m)} - x_i| < \frac{\varepsilon}{2}. \qquad (38)$$
Hence, choosing $\tilde{m}_0 \in \mathbb{N}$ with $\tilde{m}_0 > \max(m_0, |x_1| + \frac{\varepsilon}{2}, \ldots, |x_n| + \frac{\varepsilon}{2}, 2/\varepsilon)$, one obtains from the triangle inequality and (38) that $\mathbf{x}^{(m)} \in B_{\tilde{m}_0}$ for all $m \geq \tilde{m}_0$. In particular, for any $m \geq \tilde{m}_0$ one may use the triangle inequality in the first step, $\mathbf{x}^{(m)} \in B_{\tilde{m}_0} \subset B_m$ and (38) in the second step, and (36) in the last step to estimate
$$|[h^{(m)}]_i^j(\mathbf{x}) - x_i| \leq |h_i^{(m)}(\mathbf{x}^{(m)}) - x_i^{(m)}| + |x_i^{(m)} - x_i| \leq \sup_{\mathbf{y} \in B_m} \{|h_i^{(m)}(\mathbf{y}) - y_i|\} + \frac{\varepsilon}{2} < \frac{1}{m} + \frac{\varepsilon}{2} < \varepsilon.$$
This proves (37) for all $j = 1, \ldots, K$.

Thirdly, define
$$H_m(Z) := \mathbf{w}^\top \sigma\Big(\sum_{j=0}^{K} A^{(-j)}[h^{(m)}]^j(\mathbf{Z}_{-j}) + \boldsymbol{\zeta}\Big)$$
with the convention $[h^{(m)}]^0(\mathbf{x}) = \mathbf{x}$. Since $\sigma$ is continuous, (37) implies that $\lim_{m \to \infty} H_m(Z) = H^\infty(Z)$, $P$-almost surely, where $H^\infty$ was defined in (35). Furthermore, by assumption there exists $C > 0$ such that $|\sigma(x)| \leq C$ for all $x \in \mathbb{R}$. Hence one has $|H^\infty(Z) - H_m(Z)|^p \leq (2C\sum_{i=1}^{N}|w_i|)^p$ for all $m \in \mathbb{N}$. Thus one may apply the dominated convergence theorem to obtain
$$\lim_{m \to \infty} \|H^\infty(Z) - H_m(Z)\|_p = \lim_{m \to \infty} E[|H^\infty(Z) - H_m(Z)|^p]^{1/p} = 0.$$
In particular, for $m \in \mathbb{N}$ large enough one has $\|H^\infty(Z) - H_m(Z)\|_p < \frac{\varepsilon}{2}$ and, combining this with the triangle inequality and (34), one obtains
$$\|H(Z) - H_m(Z)\|_p \leq \|H(Z) - H^\infty(Z)\|_p + \|H^\infty(Z) - H_m(Z)\|_p < \varepsilon. \qquad (39)$$

To conclude the proof, we now fix $m \in \mathbb{N}$ large enough (so that (39) holds) and show that $H_m(Z) = H_{\mathbf{w}}^{A,C,\zeta}(Z)$ for suitable choices of $A$, $C$, $\boldsymbol{\zeta}$, and $\mathbf{w}$. To do so, first define $N_J := N_1^{(m)} + \cdots + N_n^{(m)}$ and the block matrices
$$W_J := \begin{pmatrix} (\mathbf{w}_1^{(m)})^\top & & 0 \\ & \ddots & \\ 0 & & (\mathbf{w}_n^{(m)})^\top \end{pmatrix} \in \mathbb{M}_{n,N_J}, \quad \boldsymbol{\zeta}_J := \begin{pmatrix} \boldsymbol{\zeta}_1^{(m)} \\ \vdots \\ \boldsymbol{\zeta}_n^{(m)} \end{pmatrix} \in \mathbb{R}^{N_J}, \quad A_J := \begin{pmatrix} A_1^{(m)} \\ \vdots \\ A_n^{(m)} \end{pmatrix} \in \mathbb{M}_{N_J,n}.$$
Furthermore, to emphasize that $m$ is fixed and $h^{(m)}$ approximates the identity, set $J(\mathbf{x}) := h^{(m)}(\mathbf{x})$ and note that
$$J(\mathbf{x}) = W_J \sigma(A_J \mathbf{x} + \boldsymbol{\zeta}_J). \qquad (40)$$

Now set $\bar{N} := KN_J + N$ and define the block matrix $A \in \mathbb{M}_{\bar{N}}$ by
$$A = \begin{pmatrix} 0_{N_J,N_J} & & & & \\ A_J W_J & 0_{N_J,N_J} & & & \\ & \ddots & \ddots & & \\ & & A_J W_J & 0_{N_J,N_J} & \\ A^{(-1)}W_J & A^{(-2)}W_J & \cdots & A^{(-K)}W_J & 0_{N,N} \end{pmatrix},$$
and $\boldsymbol{\zeta} \in \mathbb{R}^{\bar{N}}$, $C \in \mathbb{M}_{\bar{N},n}$, and $\mathbf{w} \in \mathbb{R}^{\bar{N}}$ by
$$\boldsymbol{\zeta} := \begin{pmatrix} \boldsymbol{\zeta}_J \\ \vdots \\ \boldsymbol{\zeta}_J \\ \boldsymbol{\zeta} \end{pmatrix}, \qquad C := \begin{pmatrix} A_J \\ 0 \\ \vdots \\ 0 \\ A^{(0)} \end{pmatrix}, \qquad \mathbf{w} := \begin{pmatrix} 0_{KN_J,1} \\ \mathbf{w} \end{pmatrix}.$$


Furthermore, we partition the reservoir states $\mathbf{x}_t$ of the corresponding echo state system as
$$\mathbf{x}_t := \begin{pmatrix} \mathbf{x}_t^{(1)} \\ \vdots \\ \mathbf{x}_t^{(K+1)} \end{pmatrix},$$
with $\mathbf{x}_t^{(j)} \in \mathbb{R}^{N_J}$ for $j \leq K$ and $\mathbf{x}_t^{(K+1)} \in \mathbb{R}^N$. With this notation for $\mathbf{x}_t$ and these choices of matrices, the recursions associated to the echo state reservoir map in (32) read as
$$\mathbf{x}_t^{(1)} = \sigma(A_J \mathbf{z}_t + \boldsymbol{\zeta}_J), \qquad (41)$$
$$\mathbf{x}_t^{(j)} = \sigma(A_J W_J \mathbf{x}_{t-1}^{(j-1)} + \boldsymbol{\zeta}_J), \quad \text{for } j = 2, \ldots, K, \qquad (42)$$
$$\mathbf{x}_t^{(K+1)} = \sigma\Big(\sum_{j=1}^{K} A^{(-j)} W_J \mathbf{x}_{t-1}^{(j)} + A^{(0)}\mathbf{z}_t + \boldsymbol{\zeta}\Big). \qquad (43)$$
By iteratively inserting (42) into itself and using (41), one obtains (recall the definition of $J$ in (40)) that the unique solution to (42) is given by
$$\mathbf{x}_t^{(j)} = \sigma(A_J [J]^{j-1}(\mathbf{z}_{t-j+1}) + \boldsymbol{\zeta}_J). \qquad (44)$$
More formally, one uses induction on $j$: for $j = 1$ the two expressions (44) and (41) coincide. For $j = 2, \ldots, K$ one inserts (44) for $j - 1$ (which holds by the induction hypothesis) into (42) to obtain
$$\mathbf{x}_t^{(j)} = \sigma(A_J W_J \sigma(A_J [J]^{j-2}(\mathbf{z}_{t-j+1}) + \boldsymbol{\zeta}_J) + \boldsymbol{\zeta}_J) = \sigma(A_J [J]^{j-1}(\mathbf{z}_{t-j+1}) + \boldsymbol{\zeta}_J),$$
which is indeed (44). Finally, combining (44) and (43), one obtains
$$y_t = \mathbf{w}^\top \mathbf{x}_t^{(K+1)} = \mathbf{w}^\top \sigma\Big(\sum_{j=1}^{K} A^{(-j)} W_J \mathbf{x}_{t-1}^{(j)} + A^{(0)}\mathbf{z}_t + \boldsymbol{\zeta}\Big) = \mathbf{w}^\top \sigma\Big(\sum_{j=1}^{K} A^{(-j)} [J]^j(\mathbf{z}_{t-j}) + A^{(0)}\mathbf{z}_t + \boldsymbol{\zeta}\Big).$$
The statement (44) shows, in particular, that the echo state network associated to $A$, $C$, $\boldsymbol{\zeta}$, and $\mathbf{w}$ satisfies the echo state property. Moreover, inserting $t = 0$ in the previous equality and comparing with the definition of $H_m(Z)$, one sees that indeed $H_m(Z) = H_{\mathbf{w}}^{A,C,\zeta}(Z)$. The approximation statement (33) therefore follows from (39).
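The cascade block construction above can be checked numerically. The algebraic identity between the block ESN and the direct evaluation of $H_m$ does not require $J$ to approximate the identity, so the sketch below uses arbitrary randomly drawn one-layer maps $J$ and, for simplicity, gives all per-coordinate blocks equal size (in the proof the sizes $N_i^{(m)}$ may differ). Everything here is illustrative, not part of the original argument.

```python
import numpy as np

rng = np.random.default_rng(5)
n, N, K, Nj = 2, 6, 3, 4          # Nj: hidden size per input coordinate
sigma = np.tanh

# per-coordinate one-layer maps; J(x) = W_J sigma(A_J x + zeta_J), cf. (40)
NJ = n * Nj
W_J, A_J = np.zeros((n, NJ)), np.zeros((NJ, n))
zeta_J = rng.normal(size=NJ)
for i in range(n):
    W_J[i, i * Nj:(i + 1) * Nj] = rng.normal(size=Nj)
    A_J[i * Nj:(i + 1) * Nj, :] = rng.normal(size=(Nj, n))
J = lambda x: W_J @ sigma(A_J @ x + zeta_J)

# outer network data from the first step of the proof: A^(0), ..., A^(-K)
Ablk = [rng.normal(size=(N, n)) for _ in range(K + 1)]
zeta_out, w_out = rng.normal(size=N), rng.normal(size=N)

# big block matrices of the cascade ESN, cf. the displayed definitions
Nbar = K * NJ + N
A = np.zeros((Nbar, Nbar))
for j in range(1, K):             # A_J W_J on the subdiagonal blocks, (42)
    A[j * NJ:(j + 1) * NJ, (j - 1) * NJ:j * NJ] = A_J @ W_J
for j in range(1, K + 1):         # last block row: A^(-j) W_J, cf. (43)
    A[K * NJ:, (j - 1) * NJ:j * NJ] = Ablk[j] @ W_J
C = np.vstack([A_J, np.zeros(((K - 1) * NJ, n)), Ablk[0]])
zeta = np.concatenate([np.tile(zeta_J, K), zeta_out])
w = np.concatenate([np.zeros(K * NJ), w_out])

# run the cascade ESN on a window long enough to wash out the zero start
z = rng.normal(size=(K + 6, n))   # z[-1] plays the role of z_0
x = np.zeros(Nbar)
for z_t in z:
    x = sigma(A @ x + C @ z_t + zeta)
y_esn = w @ x

# direct evaluation of H_m at time 0
acc = Ablk[0] @ z[-1] + zeta_out
v = z.copy()
for j in range(1, K + 1):
    v = np.array([J(vi) for vi in v])   # now v holds [J]^j of each input
    acc += Ablk[j] @ v[-1 - j]
y_direct = w_out @ sigma(acc)     # agrees with y_esn up to rounding error
```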

Remark III.11. In this paper we measure closeness between filters and functionals in an $L^p$ sense. As we already pointed out in Remark III.6, this choice allows us to approximate, with the systems used in this paper, measurable filters that, unlike in the $L^\infty$ case, do not necessarily satisfy the fading memory property. Therefore, an interesting aspect of the universality results in Proposition III.1, Corollary III.8, Theorem III.9, and Theorem III.10 is that it is possible to approximately simulate any measurable filter, even one without the fading memory property, using the reservoir systems introduced in those results, which do satisfy it.

Remark III.12. The results presented in this article address the approximation capabilities of echo state networks and other reservoir computing systems. When these systems are used in practice, not all of their parameters are trained. For example, the recurrent connections of ESNs do not usually undergo a training process; that is, the architecture parameters $A$, $C$, $\boldsymbol{\zeta}$ are randomly drawn from a distribution and only the readout $\mathbf{w}$ is trained by linear regression so as to optimally fit the given teaching signal. Subsequently, an optimization over a few hyperparameters (for instance, the spectral radius of $A$) is carried out. In addition, in many situations the same reservoir matrix $A$ can be used for different input time series and different learning tasks, and only the input-to-reservoir parameters $C$, $\boldsymbol{\zeta}$ and the readout $\mathbf{w}$ need to be modified (see, for instance, the approach taken in [38], [39] to define time series kernels). This feature is key in the implementation of the notion of multi-tasking in the RC context (see [10]). Thus, the empirically observed robustness of ESNs with respect to these parameter choices is not entirely explained by the universality results presented here. While in the static setting of feedforward neural networks such questions have already been tackled (see, for instance, [40]), for echo state networks a full explanation is not available yet and these questions are the subject of ongoing research.
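The practical workflow described in this remark (randomly drawn reservoir, readout fitted by regularized linear regression) can be sketched as follows. All sizes, the washout length, the ridge penalty, and the toy task are arbitrary illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def train_esn_readout(z, y_target, N=100, rho=0.9, ridge=1e-6, seed=0):
    """Randomly drawn reservoir (A, C, zeta); only the linear readout w
    is fitted, by ridge regression on the collected reservoir states."""
    rng = np.random.default_rng(seed)
    n = z.shape[1]
    A = rng.normal(size=(N, N))
    A *= rho / max(abs(np.linalg.eigvals(A)))   # spectral-radius hyperparameter
    C = rng.normal(size=(N, n))
    zeta = rng.normal(size=N)

    # collect reservoir states (an initial washout period is discarded)
    T, washout = len(z), 50
    X, x = np.zeros((T, N)), np.zeros(N)
    for t in range(T):
        x = np.tanh(A @ x + C @ z[t] + zeta)
        X[t] = x
    Xw, yw = X[washout:], y_target[washout:]

    # ridge regression: w = (X^T X + lambda I)^{-1} X^T y
    w = np.linalg.solve(Xw.T @ Xw + ridge * np.eye(N), Xw.T @ yw)
    return A, C, zeta, w, X

# toy task: one-step-ahead prediction of a scalar AR(2)-type signal
rng = np.random.default_rng(1)
T = 500
eps = rng.normal(size=T)
s = np.zeros(T)
for t in range(2, T):
    s[t] = 0.5 * s[t - 1] - 0.3 * s[t - 2] + eps[t]
z, y = s[:-1].reshape(-1, 1), s[1:]
A, C, zeta, w, X = train_esn_readout(z, y)
```

Because only $\mathbf{w}$ is optimized, training reduces to solving one linear system, which is what makes the scheme viable for large datasets.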

D. An alternative viewpoint

So far, all the universality results have been formulated for functionals and filters with random inputs. Equivalently, we may formulate them as $L^p$-approximation results on the sequence space $(\mathbb{R}^n)^{\mathbb{Z}_-}$ endowed with any measure $\mu$ that makes the filter that we want to approximate $p$-integrable.

Theorem III.13. Let $H: (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}$ be a measurable functional. Then, for any probability measure $\mu$ on $(\mathbb{R}^n)^{\mathbb{Z}_-}$ with $H \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu)$ and any $\varepsilon > 0$, there exists a reservoir system that has the echo state property and such that the corresponding filter is causal and time-invariant, the associated functional $H_{RC}$ satisfies $H_{RC} \in L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu)$, and
$$\|H - H_{RC}\|_{L^p((\mathbb{R}^n)^{\mathbb{Z}_-}, \mu)} < \varepsilon. \qquad (45)$$

The reservoir functional $H_{RC}$ may be chosen as coming from any of the following systems:
• Linear reservoir with polynomial readout, that is, (9) for some $N \in \mathbb{N}$, $A \in \mathbb{M}_N$, $c \in \mathbb{M}_{N,n}$, and a polynomial $h \in \mathrm{Pol}_N$, if the measure $\mu$ satisfies the following condition: for any $K \in \mathbb{N}$,
$$\int_{(\mathbb{R}^n)^{\mathbb{Z}_-}} \exp\Big(\alpha \sum_{k=0}^{K} \sum_{i=1}^{n} |z_{-k}^{(i)}|\Big)\, \mu(d\mathbf{z}) < \infty.$$
• Linear reservoir with neural network readout, that is, (9) for some $N \in \mathbb{N}$, $A \in \mathbb{M}_N$, $c \in \mathbb{M}_{N,n}$, and a neural network $h \in \mathcal{H}_N$.
• Trigonometric state-affine system with linear readout, that is, (21) for some $N \in \mathbb{N}$, $\mathbf{w} \in \mathbb{R}^N$, $P \in \mathrm{Trig}_{N,N}$, and $Q \in \mathrm{Trig}_{N,1}$.
• Echo state network with linear readout, that is, (32) for some $N \in \mathbb{N}$, $C \in \mathbb{M}_{N,n}$, $\boldsymbol{\zeta} \in \mathbb{R}^N$, $A \in \mathbb{M}_N$, $\mathbf{w} \in \mathbb{R}^N$, where we assume that the $\sigma: \mathbb{R} \to \mathbb{R}$ employed in (32) is bounded, continuous, and non-constant.


Proof. Set $\Omega = (\mathbb{R}^n)^{\mathbb{Z}_-}$, $\mathcal{F} = \otimes_{t \in \mathbb{Z}_-} \mathcal{B}(\mathbb{R}^n)$, $P = \mu$, and define $\mathbf{Z}_t(\mathbf{z}) := \mathbf{z}_t$ for all $\mathbf{z} \in \Omega$, $t \in \mathbb{Z}_-$. Then $\mathcal{F} = \sigma(\mathbf{Z}_t : t \in \mathbb{Z}_-) = \mathcal{F}_{-\infty}$ and $Z$ is the identity mapping on $(\mathbb{R}^n)^{\mathbb{Z}_-}$. One may now apply Proposition III.1, Corollary III.8, Theorem III.9, and Theorem III.10 with this choice of probability space $(\Omega, \mathcal{F}, P)$ and input process $Z$. The statement of Theorem III.13 then precisely coincides with the statements of Proposition III.1, Corollary III.8, Theorem III.9, and Theorem III.10, respectively.

E. Approximation of stationary strong time series models

Most parametric time series models commonly used in financial, macroeconometric, and forecasting applications are specified by relations of the type
$$\mathbf{X}_t = G(\mathbf{X}_{t-1}, \mathbf{Z}_t, \boldsymbol{\theta}), \qquad (46)$$
where $\boldsymbol{\theta} \in \mathbb{R}^k$ is the parameter vector of the model and the vector $\mathbf{X}_t \in \mathbb{R}^N$ is built so that it contains in its components the time series of interest and, at the same time, allows for a Markovian representation of the model as in (46). The model is driven by the innovations process $Z = (\mathbf{Z}_t)_{t \in \mathbb{Z}} \in (\mathbb{R}^n)^{\mathbb{Z}}$. When the innovations are made out of independent and identically distributed random variables, we say that the model is strong [41]. It is customary in the time series literature to impose constraints on the parameter vector $\boldsymbol{\theta}$ so that the relation (46) has a unique second-order stationary solution or, in the language of this paper, so that the system (46) satisfies the echo state property and the associated filter $U^G: (\mathbb{R}^n)^{\mathbb{Z}} \to (\mathbb{R}^N)^{\mathbb{Z}}$ satisfies
$$E[U^G(Z)_t] =: \boldsymbol{\mu} \quad \text{and} \quad E\big[U^G(Z)_t\, U^G(Z)_{t+h}^\top\big] =: \Sigma_h, \quad t, h \in \mathbb{Z}_-, \qquad (47)$$
with $\boldsymbol{\mu} \in \mathbb{R}^N$ and $\Sigma_h \in \mathbb{M}_N$ constants that do not depend on $t \in \mathbb{Z}_-$. The Wold decomposition theorem [42, Theorem 5.7.1] shows that any such filter can be uniquely written as the sum of a linear and a deterministic process.
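As a concrete illustration of the form (46), a GARCH(1,1) model admits a Markovian representation with state vector $\mathbf{X}_t = (\varepsilon_t, \sigma_t^2)$, where $\varepsilon_t$ is the observed return and $\sigma_t^2$ the conditional variance. The parameter values below are arbitrary and satisfy the usual second-order stationarity constraint $\alpha + \beta < 1$.

```python
import numpy as np

def G(X_prev, Z_t, theta):
    """One step of a GARCH(1,1) written in the Markovian form (46):
    sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1},
    eps_t = sqrt(sigma2_t) * Z_t with i.i.d. innovations Z_t."""
    omega, alpha, beta = theta
    eps_prev, sigma2_prev = X_prev
    sigma2_t = omega + alpha * eps_prev**2 + beta * sigma2_prev
    eps_t = np.sqrt(sigma2_t) * Z_t
    return np.array([eps_t, sigma2_t])

theta = (0.1, 0.1, 0.8)          # omega, alpha, beta with alpha + beta < 1
rng = np.random.default_rng(6)
X = np.array([0.0, theta[0] / (1 - theta[1] - theta[2])])  # stationary variance
path = []
for Z_t in rng.normal(size=1000):
    X = G(X, Z_t, theta)
    path.append(X)
path = np.array(path)            # columns: eps_t and sigma2_t
```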

It is obvious that for strong models the stationarity condition (7) holds and that, moreover, the condition (47) implies that
$$\|U^G(Z)\|_2 = \sup_{t \in \mathbb{Z}_-}\big\{E\big[|U^G(Z)_t|^2\big]^{1/2}\big\} = \mathrm{trace}(\Sigma_0)^{1/2} < \infty. \qquad (48)$$
This integrability condition guarantees that the approximation results in Proposition III.1, Corollary III.8, and Theorems III.9 and III.10 hold for second-order stationary strong time series models with $p = 2$. More specifically, the processes determined by this kind of model can be approximated in the $L^2$ sense by linear reservoir systems with polynomial or neural network readouts (when the condition in Remark III.4 is satisfied), by trigonometric state-affine systems with linear readouts, or by echo state networks.

Important families of models to which this approximation statement can be applied are, among many others (see the references for the meaning of the acronyms), GARCH [43], [44], VEC [45], BEKK [46], CCC [47], DCC [48], [49], GDC [50], and ARSV [51], [52].

IV. CONCLUSION

We have shown the universality of three different families of reservoir computers with respect to the $L^p$ norm associated to any given discrete-time semi-infinite input process.

On the one hand, we proved that linear reservoir systems with either neural network or polynomial readout maps (in the latter case the input process needs to satisfy the exponential moments condition (11)) are universal.

On the other hand, we showed that the exponential moments condition (11), which was required in the case of polynomial readouts, can be dropped by considering two different reservoir families with linear readouts, namely, trigonometric state-affine systems and echo state networks. The latter are the most widely used reservoir systems in applications. The linearity of the readouts is a key feature in supervised machine learning applications of these systems. It guarantees that they can be used in high-dimensional situations and in the presence of large datasets, since training in that case reduces to a linear regression.

We emphasize that, unlike existing results in the literature [25], [26] dealing with uniform universal approximation, the $L^p$ criteria used in this paper make it possible to formulate universality statements that do not necessarily impose almost sure uniform boundedness on the inputs or the fading memory property on the filter that needs to be approximated.

APPENDIX

A. Auxiliary Lemmas

Lemma A.1. Let $Z: \mathbb{Z}_- \times \Omega \to \mathbb{R}^n$ be a stochastic process and let $\mathcal{F}_t := \sigma(\mathbf{Z}_0, \ldots, \mathbf{Z}_t)$, $t \in \mathbb{Z}_-$, and $\mathcal{F}_{-\infty} := \sigma(\mathbf{Z}_t : t \in \mathbb{Z}_-)$. Let $F \in L^p(\Omega, \mathcal{F}_{-\infty}, P)$. Then $E[F \mid \mathcal{F}_t]$ converges to $F$ as $t \to -\infty$, both $P$-almost surely and in the norm $\|\cdot\|_p$, for any $p \in [1,\infty)$.

Proof. Since $\mathcal{F}_{-t} \subset \mathcal{F}_{-t-1} \subset \mathcal{F}_{-\infty}$ for all $t \in \mathbb{N}$, and $F \in L^p(\Omega, \mathcal{F}_{-\infty}, P) \subset L^1(\Omega, \mathcal{F}_{-\infty}, P)$, one has by Lévy's Upward Theorem (see, for instance, [53, II.50.3] or [33, Theorem 5.5.7]) that $F_t := E[F \mid \mathcal{F}_t]$ converges for $t \to -\infty$ to $F$ in $\|\cdot\|_1$ and $P$-almost surely. If $p = 1$ this already implies the claim. For $p > 1$ one has, by standard properties of conditional expectations (see, for instance, [33, Theorem 5.1.4]), that $\sup_{t \in \mathbb{N}}\{E[|F_{-t}|^p]\} \leq E[|F|^p]$. Hence [33, Theorem 5.4.5] implies that $F_t$ converges for $t \to -\infty$ to some $\tilde{F} \in L^p(\Omega, \mathcal{F}_{-\infty}, P)$ both in $\|\cdot\|_p$ and $P$-almost surely. But this identifies $\tilde{F} = \lim_{t \to -\infty} F_t = F$, $P$-almost surely, and hence $F_t$ converges for $t \to -\infty$ to $F$ also in $\|\cdot\|_p$.

Lemma A.2. For $N \in \mathbb{N} \setminus \{0, 1\}$ and $j = 1, \ldots, N-1$, define $A_j \in \mathbb{M}_N$ by $(A_j)_{k,l} = \delta_{k,j+1}\delta_{l,j}$ for $k, l \in \{1, \ldots, N\}$. Then for $L \in \mathbb{N}$ and $j_0, \ldots, j_L \in \{1, \ldots, N-1\}$ it holds that
$$(A_{j_L} \cdots A_{j_0})_{k,l} = \delta_{k,j_L+1}\delta_{l,j_0} \prod_{i=1}^{L} \delta_{j_i, j_{i-1}+1}. \qquad (49)$$
In particular, $A_{j_L} \cdots A_{j_0} \neq 0$ if and only if $j_i = j_0 + i$ for all $i \in \{1, \ldots, L\}$.
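Lemma A.2 is easy to confirm numerically by brute force. The sketch below checks (49) for all products of three shift matrices with $N = 5$ (an arbitrary choice for illustration).

```python
import numpy as np
from itertools import product

N = 5
A = [None] + [np.zeros((N, N)) for _ in range(N - 1)]
for j in range(1, N):
    A[j][j, j - 1] = 1.0             # (A_j)_{j+1, j} = 1 in 1-based indexing

# brute-force check of (49) for all products of length L + 1 = 3
for j0, j1, j2 in product(range(1, N), repeat=3):
    M = A[j2] @ A[j1] @ A[j0]
    if j1 == j0 + 1 and j2 == j1 + 1:    # the chain j_i = j_0 + i
        expected = np.zeros((N, N))
        expected[j2, j0 - 1] = 1.0       # entry (j_L + 1, j_0), 1-based
        assert np.array_equal(M, expected)
    else:
        assert not M.any()               # all other products vanish
```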

Proof. The last statement follows directly from (49). To prove (49) we proceed by induction on $L$. Indeed, for $L = 0$ the formula (49) is just the definition of $A_{j_0}$. For the induction step, one assumes that (49) holds for $L - 1$ and calculates
$$(A_{j_L} \cdots A_{j_0})_{k,l} = \sum_{r=1}^{N} \delta_{k,j_L+1}\delta_{r,j_L} (A_{j_{L-1}} \cdots A_{j_0})_{r,l} = \sum_{r=1}^{N} \delta_{k,j_L+1}\delta_{r,j_L}\delta_{r,j_{L-1}+1}\delta_{l,j_0} \prod_{i=1}^{L-1} \delta_{j_i,j_{i-1}+1},$$
which is indeed (49).

ACKNOWLEDGMENT

The authors thank Lyudmila Grigoryeva and Josef Teichmann for helpful discussions and remarks and acknowledge partial financial support from the Research Commission of the Universität Sankt Gallen, the Swiss National Science Foundation (grants number 175801/1 and 179114), and the French ANR "BIPHOPROC" project (ANR-14-OHRI-0018-02).

REFERENCES

[1] F. Cucker and S. Smale, "On the mathematical foundations of learning," Bulletin of the American Mathematical Society, vol. 39, no. 1, pp. 1–49, 2002.
[2] S. Smale and D.-X. Zhou, "Estimating the approximation error in learning theory," Analysis and Applications, vol. 1, no. 1, pp. 17–41, Jan. 2003.
[3] F. Cucker and D.-X. Zhou, Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, 2007.
[4] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, Dec. 1989.
[5] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
[6] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, no. 2, pp. 251–257, 1991.
[7] W. Maass, T. Natschläger, and H. Markram, "Real-time computing without stable states: a new framework for neural computation based on perturbations," Neural Computation, vol. 14, pp. 2531–2560, 2002.
[8] H. Jaeger and H. Haas, "Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication," Science, vol. 304, no. 5667, pp. 78–80, 2004.
[9] W. Maass and H. Markram, "On the computational power of circuits of spiking neurons," Journal of Computer and System Sciences, vol. 69, no. 4, pp. 593–616, 2004.
[10] W. Maass, "Liquid state machines: motivation, theory, and applications," in Computability in Context: Computation and Logic in the Real World, S. B. Cooper and A. Sorbi, Eds., 2011, ch. 8, pp. 275–296.
[11] M. B. Matthews, "On the uniform approximation of nonlinear discrete-time fading-memory systems using neural network models," Ph.D. dissertation, ETH Zürich, 1992.
[12] ——, "Approximating nonlinear fading-memory operators using neural network models," Circuits, Systems, and Signal Processing, vol. 12, no. 2, pp. 279–307, Jun. 1993.
[13] S. Boyd and L. Chua, "Fading memory and the problem of approximating nonlinear operators with Volterra series," IEEE Transactions on Circuits and Systems, vol. 32, no. 11, pp. 1150–1161, 1985.
[14] K.-I. Funahashi and Y. Nakamura, "Approximation of dynamical systems by continuous time recurrent neural networks," Neural Networks, vol. 6, no. 6, pp. 801–806, Jan. 1993.
[15] W. Maass, P. Joshi, and E. D. Sontag, "Computational aspects of feedback in neural circuits," PLoS Computational Biology, vol. 3, no. 1, p. e165, 2007.
[16] E. Sontag, "Realization theory of discrete-time nonlinear systems: Part I, the bounded case," IEEE Transactions on Circuits and Systems, vol. 26, no. 5, pp. 342–356, May 1979.
[17] E. D. Sontag, "Polynomial response maps," in Lecture Notes in Control and Information Sciences, vol. 13. Springer Verlag, 1979.
[18] M. Fliess and D. Normand-Cyrot, "Vers une approche algébrique des systèmes non linéaires en temps discret," in Analysis and Optimization of Systems, ser. Lecture Notes in Control and Information Sciences, vol. 28, A. Bensoussan and J. Lions, Eds. Springer Berlin Heidelberg, 1980.
[19] I. W. Sandberg, "Approximation theorems for discrete-time systems," IEEE Transactions on Circuits and Systems, vol. 38, no. 5, pp. 564–566, 1991.
[20] ——, "Structure theorems for nonlinear systems," Multidimensional Systems and Signal Processing, vol. 2, pp. 267–286, 1991.
[21] P. C. Perryman, "Approximation theory for deterministic and stochastic nonlinear systems," Ph.D. dissertation, University of California, Irvine, 1996.
[22] A. Stubberud and P. Perryman, "Current state of system approximation for deterministic and stochastic systems," in Conference Record of the Thirtieth Asilomar Conference on Signals, Systems and Computers, vol. 1. IEEE Comput. Soc. Press, 1997, pp. 141–145.
[23] B. Hammer and P. Tino, "Recurrent neural networks with small weights implement definite memory machines," Neural Computation, vol. 15, no. 8, pp. 1897–1929, Aug. 2003.
[24] P. Tino, B. Hammer, and M. Bodén, "Markovian bias of neural-based architectures with feedback connections," in Perspectives of Neural-Symbolic Integration, ser. Studies in Computational Intelligence, vol. 77, B. Hammer and P. Hitzler, Eds. Springer, Berlin, Heidelberg, 2007, pp. 95–133.
[25] L. Grigoryeva and J.-P. Ortega, "Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-affine systems," Journal of Machine Learning Research, vol. 19, no. 24, pp. 1–40, 2018.
[26] ——, "Echo state networks are universal," Neural Networks, vol. 108, pp. 495–508, 2018.
[27] H. Jaeger, "The 'echo state' approach to analysing and training recurrent neural networks with an erratum note," German National Research Center for Information Technology, 2010.
[28] I. B. Yildiz, H. Jaeger, and S. J. Kiebel, "Re-visiting the echo state property," Neural Networks, vol. 35, pp. 1–9, Nov. 2012.
[29] G. Manjunath and H. Jaeger, "Echo state property linked to an input: exploring a fundamental characteristic of recurrent neural networks," Neural Computation, vol. 25, no. 3, pp. 671–696, 2013.
[30] O. Kallenberg, Foundations of Modern Probability, ser. Probability and Its Applications. Springer New York, 2002.
[31] C. Berg and J. P. R. Christensen, "Density questions in the classical theory of moments," Annales de l'Institut Fourier, vol. 31, no. 3, pp. 99–114, 1981.
[32] L. C. Petersen, "On the relation between the multidimensional moment problem and the one-dimensional moment problem," Mathematica Scandinavica, vol. 51, no. 2, pp. 361–366, 1983.
[33] R. Durrett, Probability: Theory and Examples, 4th ed., ser. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press, 2010.
[34] O. G. Ernst, A. Mugler, H.-J. Starkloff, and E. Ullmann, "On the convergence of generalized polynomial chaos expansions," ESAIM: M2AN, vol. 46, no. 2, pp. 317–339, 2012.
[35] C. C. Heyde, "On a property of the lognormal distribution," Journal of the Royal Statistical Society, Series B (Methodological), vol. 25, no. 2, pp. 392–393, 1963.
[36] G. Freud, Orthogonal Polynomials. Pergamon Press, 1971.
[37] W. Rudin, Real and Complex Analysis, 3rd ed. McGraw-Hill, 1987.
[38] H. Chen, F. Tang, P. Tino, and X. Yao, "Model-based kernel for efficient time series analysis," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '13), 2013.
[39] H. Chen, P. Tino, A. Rodan, and X. Yao, "Learning in the model space for cognitive fault diagnosis," IEEE Transactions on Neural Networks and Learning Systems, 2014.
[40] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70, pp. 489–501, 2006.
[41] C. Francq and J.-M. Zakoian, GARCH Models: Structure, Statistical Inference and Financial Applications. Wiley, 2010.
[42] P. J. Brockwell and R. A. Davis, Time Series: Theory and Methods. Springer-Verlag, 2006.
[43] R. F. Engle, "Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation," Econometrica, vol. 50, no. 4, pp. 987–1007, 1982.
[44] T. Bollerslev, "Generalized autoregressive conditional heteroskedasticity," Journal of Econometrics, vol. 31, no. 3, pp. 307–327, 1986.
[45] T. Bollerslev, R. F. Engle, and J. M. Wooldridge, "A capital asset pricing model with time varying covariances," Journal of Political Economy, vol. 96, pp. 116–131, 1988.
[46] R. F. Engle and F. K. Kroner, "Multivariate simultaneous generalized ARCH," Econometric Theory, vol. 11, pp. 122–150, 1995.
[47] T. Bollerslev, "Modelling the coherence in short-run nominal exchange rates: a multivariate generalized ARCH model," Review of Economics and Statistics, vol. 72, no. 3, pp. 498–505, 1990.
[48] Y. K. Tse and A. K. C. Tsui, "A multivariate GARCH with time-varying correlations," Journal of Business and Economic Statistics, vol. 20, pp. 351–362, 2002.
[49] R. F. Engle, "Dynamic conditional correlation: a simple class of multivariate GARCH models," Journal of Business and Economic Statistics, vol. 20, pp. 339–350, 2002.
[50] F. K. Kroner and V. K. Ng, "Modelling asymmetric comovements of asset returns," The Review of Financial Studies, vol. 11, pp. 817–844, 1998.
[51] S. J. Taylor, "Financial returns modelled by the product of two stochastic processes, a study of daily sugar prices," in Time Series Analysis: Theory and Practice I, B. D. Anderson, Ed., 1982, pp. 1961–1979.
[52] A. C. Harvey, E. Ruiz, and N. Shephard, "Multivariate stochastic variance models," Review of Economic Studies, vol. 61, pp. 247–264, 1994.
[53] L. C. G. Rogers and D. Williams, Diffusions, Markov Processes, and Martingales, 2nd ed., vol. 1. Cambridge University Press, 2000.