
Fading memory echo state networks are universal

Lukas Gonon^1 and Juan-Pablo Ortega^{2,3}

Abstract

Echo state networks (ESNs) have recently been proved to be universal approximants for input/output systems with respect to various $L^p$-type criteria. When $1 \le p < \infty$, only $p$-integrability hypotheses need to be imposed, while in the case $p = \infty$ a uniform boundedness hypothesis on the inputs is required. This note shows that, in the latter case, a universal family of ESNs can be constructed that contains exclusively elements that have the echo state and the fading memory properties. This conclusion could not be drawn with the results and methods available so far in the literature.

Key Words: universality, recurrent neural network, reservoir computing, state-space system, echo

state network, ESN, machine learning, echo state property, fading memory property.

1 The problem

Main goal. The objective of this note is to show that the universal families of echo state networks (ESNs) devised in the main theorem of [Grig 18a] can be chosen to satisfy the echo state and the fading memory properties. This fact was not established in the original paper, and showing it requires a different strategy, presented in the proof of the main result in Section 2.

Context and motivation. Reservoir computing [Luko 09, Tana 19] in general and ESNs [Matt 93, Jaeg 04] in particular have exhibited remarkable success in the learning of the chaotic attractors of complex nonlinear infinite-dimensional dynamical systems [Jaeg 04, Path 17, Path 18, Lu 18, Hart 20, Grig 20] and in a great variety of empirical classification and forecasting applications [Wyff 10, Bute 13, Grig 14]. These findings have been an important motivation for the in-depth study of the approximation capabilities of these machine learning paradigms. The first results in this direction were obtained in the context of systems theory for input/output systems with either finite or approximately finite memory [Sand 91, Perr 96, Stub 97] in a forward-in-time framework. More recently, those universality statements were extended to ESNs with semi-infinite inputs from the past. Various $L^p$-type criteria have been used to measure the approximation error. The case $1 \le p < \infty$ was considered in [Gono 20c], where the universality of families of ESNs with a prescribed activation function and stochastic inputs was established with respect to the $L^p$ norm determined by the law of a fixed discrete-time input process defined for all infinite negative times. In this case, the universality is formulated in the category of all causal and time-invariant input/output systems with $p$-integrable outputs. Extensions of some of these results in the particularly relevant case $p = 2$ for randomly generated ESNs and, more importantly, corresponding approximation and generalization error bounds for regular filters/functionals have been derived in [Gono 20b, Gono 20a].

^1 Ludwig-Maximilians-Universität München. Mathematics Institute. Theresienstrasse 39. D-80333 Munich. Germany. gonon@math.lmu.de
^2 Universität Sankt Gallen. Faculty of Mathematics and Statistics. Bodanstrasse 6. CH-9000 Sankt Gallen. Switzerland. Juan-Pablo.Ortega@unisg.ch
^3 Centre National de la Recherche Scientifique (CNRS). France.

Universality with respect to uniform approximation, that is, the case $p = \infty$, has been studied in [Grig 18b, Grig 18a] for (almost surely) uniformly bounded inputs in the fading memory category. The universality of ESNs for uniformly bounded inputs in the fading memory category has been established in [Grig 18a, Theorem 4.1] using an internal approximation property [Grig 18a, Theorem 3.1] that allows one to conclude the uniform proximity of input/output systems generated by a state-space system out of the uniform closeness of the corresponding state maps. Using this observation, it can be shown that ESNs inherit universality properties from the universality of neural networks [Cybe 89, Horn 89].

The result that we just described, stated in Theorem 4.1 of [Grig 18a], does not, however, guarantee that the approximating ESNs have two important properties that we now briefly recall. The first one is the echo state property (ESP), which holds when every semi-infinite input has one and only one semi-infinite output associated to it. The ESP allows one, in passing, to associate a filter to the ESN. The second one is the fading memory property which, in the presence of uniformly bounded inputs, amounts to the continuity of the associated filter when the spaces of inputs and outputs are endowed with the product topologies. The main theorem in this note shows that, when the activation function in the universal family of ESNs is Lipschitz-continuous, this family can be chosen so that all its elements have the echo state and the fading memory properties. This fact was not established in the original paper, and showing it requires a different strategy than the one in [Grig 18a]. Indeed, instead of the internal approximation property mentioned above and used in [Grig 18a], the new proof first approximates fading memory filters by finite memory ones, to which a neural network approximation is then applied. The main result is obtained by showing that the neural network approximation of the finite memory system can in turn be approximated by an ESN functional that satisfies the required echo state and fading memory properties.
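To make these two properties concrete: a standard sufficient condition for both the echo state and the fading memory properties of the ESN state equation recalled in Section 2 is that the state map be a contraction in the state variable, for instance $\|A\|_2 L_\sigma < 1$. The following sketch (illustrative dimensions and scaling, not taken from the paper) drives the same input through two copies of such an ESN started at different initial states and checks numerically that the initial condition is forgotten:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 2, 30                          # illustrative input and state dimensions
A = rng.standard_normal((N, N))
A *= 0.9 / np.linalg.norm(A, 2)       # ||A||_2 = 0.9 < 1: contraction, since tanh is 1-Lipschitz
C = rng.standard_normal((N, d))
zeta = rng.standard_normal(N)

def run(z, x0):
    """Iterate x_t = tanh(A x_{t-1} + C z_t + zeta) and return the final state."""
    x = x0
    for z_t in z:
        x = np.tanh(A @ x + C @ z_t + zeta)
    return x

z = rng.uniform(-1, 1, size=(200, d))          # a common input trajectory
x_a = run(z, np.zeros(N))                      # initial state 1
x_b = run(z, 10 * rng.standard_normal(N))      # initial state 2
# Different initial conditions are forgotten at a geometric rate.
print(np.linalg.norm(x_a - x_b))  # essentially zero (bounded by ||A||_2^t times the initial gap)
```

The contraction makes the distance between the two state trajectories shrink by at least the factor $\|A\|_2$ at every step, which is exactly the state-forgetting behavior that the ESP and the fading memory property formalize.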

The echo state and fading memory properties enjoyed by the universal families of ESNs introduced in this note are highly relevant in modeling and provide a sound theoretical explanation of the excellent empirical performance of this dynamic machine learning paradigm that has been profusely documented in the literature.

2 The result

The statement and proof of the main theorem use notation similar to that in [Grig 18a]. In particular, for any $M > 0$, we denote by $B_{\|\cdot\|}(\mathbf{0}, M)$ the Euclidean ball of radius $M$ and by $\overline{B_{\|\cdot\|}(\mathbf{0}, M)}$ its closure. The set $K_M := \overline{B_{\|\cdot\|}(\mathbf{0}, M)}^{\mathbb{Z}_-}$ is the Cartesian product of $\mathbb{Z}_-$ copies of $\overline{B_{\|\cdot\|}(\mathbf{0}, M)}$ endowed with the product topology. Given $M, L > 0$ and a causal and time-invariant filter $U : K_M \subset (\mathbb{R}^d)^{\mathbb{Z}_-} \longrightarrow K_L \subset (\mathbb{R}^m)^{\mathbb{Z}_-}$, for some $d, m \in \mathbb{N}$, we denote by $H_U := p_0 \circ U : K_M \longrightarrow \mathbb{R}^m$ the associated functional, with $p_0 : (\mathbb{R}^m)^{\mathbb{Z}_-} \longrightarrow \mathbb{R}^m$ the projection onto the zero entry. The space of filters and functionals of the type of $U$ and $H_U$ can be endowed with a uniform norm $|||\cdot|||_{\infty}$ (see (2.16) and (2.17) in [Grig 18a] for definitions) with respect to which we shall prove the universality of the echo state family with the properties described above. We recall that an echo state network is determined by the state-space system

$$\begin{cases} \mathbf{x}_t = \sigma(A\mathbf{x}_{t-1} + C\mathbf{z}_t + \zeta), \\ \mathbf{y}_t = W\mathbf{x}_t, \end{cases} \tag{2.1}$$

where the states $\mathbf{x}_t \in \mathbb{R}^N$, for some $N \in \mathbb{N}$, $A \in \mathbb{M}_{N,N}$, $C \in \mathbb{M}_{N,d}$, $\zeta \in \mathbb{R}^N$, and $W \in \mathbb{M}_{m,N}$. The values $\mathbf{z}_t \in \mathbb{R}^d$ (respectively, $\mathbf{y}_t \in \mathbb{R}^m$) are the components of the input sequence $\mathbf{z} \in K_M$ (respectively, the output sequence $\mathbf{y} \in K_L$). The map $\sigma : \mathbb{R}^N \longrightarrow \mathbb{R}^N$ is obtained by the componentwise application of an activation function $\sigma : \mathbb{R} \longrightarrow \mathbb{R}$ that we assume $L_\sigma$-Lipschitz-continuous, bounded, and non-constant.
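A minimal forward simulation of the state-space system (2.1) can be sketched as follows (the dimensions, the spectral-radius scaling, and the choice $\sigma = \tanh$ are illustrative assumptions, not prescriptions from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: d-dimensional inputs, N reservoir neurons, m outputs.
d, N, m = 2, 50, 1

# Randomly drawn connectivity matrix A, input matrix C, bias zeta, readout W.
A = rng.standard_normal((N, N))
A *= 0.9 / max(abs(np.linalg.eigvals(A)))   # scale the spectral radius of A below 1
C = rng.standard_normal((N, d))
zeta = rng.standard_normal(N)
W = rng.standard_normal((m, N))

def esn_run(z, x0=None):
    """Iterate x_t = sigma(A x_{t-1} + C z_t + zeta), y_t = W x_t, with sigma = tanh."""
    x = np.zeros(N) if x0 is None else x0
    ys = []
    for z_t in z:
        x = np.tanh(A @ x + C @ z_t + zeta)
        ys.append(W @ x)
    return np.array(ys)

z = rng.uniform(-1.0, 1.0, size=(100, d))   # a finite input window
y = esn_run(z)
print(y.shape)  # (100, 1)
```

Only $W$ is trained in practice; $A$, $C$, and $\zeta$ are generated randomly, which is the setting of the numerical illustration in Section 3.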


Theorem 2.1 (Universality of the echo state family) Let $M, L > 0$, let $U : K_M \longrightarrow K_L$ be a causal and time-invariant filter that has the fading memory property, and let $\sigma : \mathbb{R} \longrightarrow \mathbb{R}$ be Lipschitz-continuous, non-constant, and bounded. Then, for any $\varepsilon > 0$ there exists an echo state network of the type (2.1) with activation function determined by $\sigma$ that has the echo state and the fading memory properties and whose associated filter $U_{\mathrm{ESN}} : K_M \longrightarrow (\mathbb{R}^m)^{\mathbb{Z}_-}$ satisfies

$$|||U - U_{\mathrm{ESN}}|||_{\infty} < \varepsilon.$$

Proof. Let $H_U : K_M \longrightarrow \mathbb{R}^m$ be the functional associated to the filter $U$ in the statement. There are various results in the literature (see, for instance, [Boyd 85, Theorems 3 and 4], [Grig 18b, Remark 12], or [Grig 19, Theorem 31]) that guarantee that $H_U$ can be uniformly approximated by a finite memory functional. This means that for any $\varepsilon > 0$ there exist $K \in \mathbb{N}$ and a continuous map $G : \overline{B_{\|\cdot\|}(\mathbf{0}, M)}^{K+1} \longrightarrow \mathbb{R}^m$ such that

$$\sup_{\mathbf{z} \in K_M} \left\{ \left\| H_U(\mathbf{z}) - G(\mathbf{z}_{-K}, \mathbf{z}_{-K+1}, \ldots, \mathbf{z}_0) \right\| \right\} < \frac{\varepsilon}{3}. \tag{2.2}$$

Let $D_d := \overline{B_{\|\cdot\|}(\mathbf{0}, M)} \subset \mathbb{R}^d$. As $G$ is continuous and defined on a compact set, the neural network approximation theorems ([Horn 91, Theorem 2]; see also [Cybe 89, Horn 89]) imply the existence of $N \in \mathbb{N}$, $W \in \mathbb{M}_{m,N}$, $A \in \mathbb{M}_{N,(K+1)d}$, and $\zeta \in \mathbb{R}^N$ such that, if we define $X := \overline{B_{\|\cdot\|}(\mathbf{0}, M)}^{K+1}$, we have

$$\sup_{\mathbf{u} \in X} \left\| G(\mathbf{u}) - W\sigma(A\mathbf{u} + \zeta) \right\| < \frac{\varepsilon}{3}. \tag{2.3}$$

Let us partition the matrix as $A = \left[ A^{(-K)} \; A^{(-K+1)} \; \cdots \; A^{(0)} \right]$ with $A^{(-j)} \in \mathbb{M}_{N,d}$ and define the constant $c = \|W\| L_\sigma \sum_{i=0}^{K} \|A^{(-i)}\| \, i$. Again by the universal approximation theorem [Horn 91], for each $j = 1, \ldots, K$ there exist $\widetilde{N}_j \in \mathbb{N}$, $\widetilde{W}_j \in \mathbb{M}_{d,\widetilde{N}_j}$, $\widetilde{A}_j \in \mathbb{M}_{\widetilde{N}_j,d}$, and $\widetilde{\zeta}_j \in \mathbb{R}^{\widetilde{N}_j}$ such that the neural network $\mathcal{I}_j(\mathbf{z}) := \widetilde{W}_j \sigma(\widetilde{A}_j \mathbf{z} + \widetilde{\zeta}_j)$ approximates the identity uniformly on $\overline{B_{\|\cdot\|}(\mathbf{0}, M + (j-1)\frac{\varepsilon}{3c})} \subset \mathbb{R}^d$, that is,

$$\sup_{\mathbf{z} \in \overline{B_{\|\cdot\|}(\mathbf{0}, M + (j-1)\frac{\varepsilon}{3c})}} \left\{ \left\| \mathcal{I}_j(\mathbf{z}) - \mathbf{z} \right\| \right\} < \frac{\varepsilon}{3c}. \tag{2.4}$$

Define $\mathcal{J}_j = \mathcal{I}_j \circ \cdots \circ \mathcal{I}_1$ and $\mathcal{J}_0(\mathbf{z}) = \mathbf{z}$. We now prove inductively that for each $j = 1, \ldots, K$

$$\sup_{\mathbf{z} \in \overline{B_{\|\cdot\|}(\mathbf{0}, M)}} \left\{ \left\| \mathcal{J}_j(\mathbf{z}) - \mathbf{z} \right\| \right\} < j \frac{\varepsilon}{3c}. \tag{2.5}$$

For $j = 1$ this follows from (2.4). For the induction step, we assume that (2.5) holds for indices up to $j-1$ and aim to prove it for $j$. First, we obtain from the induction hypothesis, for $i = 1, \ldots, j-1$,

$$\sup_{\mathbf{z} \in \overline{B_{\|\cdot\|}(\mathbf{0}, M)}} \left\{ \left\| \mathcal{J}_i(\mathbf{z}) \right\| \right\} \le \sup_{\mathbf{z} \in \overline{B_{\|\cdot\|}(\mathbf{0}, M)}} \left\{ \left\| \mathcal{J}_i(\mathbf{z}) - \mathbf{z} \right\| \right\} + M \le i \frac{\varepsilon}{3c} + M,$$

and hence $\mathcal{J}_i(\mathbf{z}) \in \overline{B_{\|\cdot\|}(\mathbf{0}, M + i\frac{\varepsilon}{3c})} \subset \mathbb{R}^d$ for $\mathbf{z} \in D_d$. This inclusion, the triangle inequality, and (2.4) imply that

$$\sup_{\mathbf{z} \in \overline{B_{\|\cdot\|}(\mathbf{0}, M)}} \left\{ \left\| \mathcal{J}_j(\mathbf{z}) - \mathbf{z} \right\| \right\} \le \sum_{i=1}^{j} \sup_{\mathbf{z} \in \overline{B_{\|\cdot\|}(\mathbf{0}, M)}} \left\{ \left\| \mathcal{I}_i(\mathcal{J}_{i-1}(\mathbf{z})) - \mathcal{J}_{i-1}(\mathbf{z}) \right\| \right\} \le \sum_{i=1}^{j} \sup_{\mathbf{z} \in \overline{B_{\|\cdot\|}(\mathbf{0}, M + (i-1)\frac{\varepsilon}{3c})}} \left\{ \left\| \mathcal{I}_i(\mathbf{z}) - \mathbf{z} \right\| \right\} < j \frac{\varepsilon}{3c},$$
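The error-accumulation argument behind (2.5) can be checked numerically with any maps that are uniformly within $\delta$ of the identity. The toy stand-ins below (simple bounded perturbations, not the trained networks of the proof) play the role of the $\mathcal{I}_j$, and the bound $\|\mathcal{J}_j(\mathbf{z}) - \mathbf{z}\| \le j\delta$ is verified on a sample of points:

```python
import numpy as np

delta = 1e-3   # plays the role of eps/(3c): each map is within delta of the identity
K, d = 5, 3
rng = np.random.default_rng(1)

# Toy stand-ins for the identity-approximating networks I_j: each one moves
# a point by at most delta, mimicking sup ||I_j(z) - z|| < delta.
def make_I(j):
    u = rng.standard_normal(d)
    u *= delta / np.linalg.norm(u)
    return lambda z: z + np.sin(j + z.sum()) * u   # perturbation norm <= delta

Is = [make_I(j) for j in range(1, K + 1)]

def J(j, z):
    """J_j = I_j o ... o I_1, with J_0 the identity."""
    for i in range(j):
        z = Is[i](z)
    return z

# Check ||J_j(z) - z|| <= j * delta on a sample of points, as in (2.5).
pts = rng.uniform(-1, 1, size=(200, d))
for j in range(K + 1):
    errs = [np.linalg.norm(J(j, z) - z) for z in pts]
    assert max(errs, default=0.0) <= j * delta + 1e-12
print("bound (2.5) holds on the sample")
```

Each composition step can add at most $\delta$ to the deviation from the identity, which is exactly the telescoping estimate used in the induction.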


which proves (2.5). The Lipschitz continuity of $\sigma$ and (2.5) thus allow us to estimate

$$\sup_{\mathbf{z} \in K_M} \left\| W\sigma\!\left( \sum_{j=0}^{K} A^{(-j)}\mathbf{z}_{-j} + \zeta \right) - W\sigma\!\left( \sum_{j=0}^{K} A^{(-j)}\mathcal{J}_j(\mathbf{z}_{-j}) + \zeta \right) \right\| \le \|W\| L_\sigma \sum_{j=0}^{K} \|A^{(-j)}\| \sup_{\mathbf{z} \in K_M} \left\{ \left\| \mathbf{z}_{-j} - \mathcal{J}_j(\mathbf{z}_{-j}) \right\| \right\} \le \|W\| L_\sigma \sum_{j=0}^{K} \|A^{(-j)}\| \, j \frac{\varepsilon}{3c} = \frac{\varepsilon}{3}. \tag{2.6}$$

Let $H_{\mathrm{ESN}}(\mathbf{z}) := W\sigma\!\left( \sum_{j=0}^{K} A^{(-j)}\mathcal{J}_j(\mathbf{z}_{-j}) + \zeta \right)$ and $H_{\mathrm{FNN}}(\mathbf{z}) := W\sigma\!\left( \sum_{j=0}^{K} A^{(-j)}\mathbf{z}_{-j} + \zeta \right)$. By the triangle inequality, (2.2), (2.3), and (2.6), we have

$$|||H_U - H_{\mathrm{ESN}}|||_{\infty} = \sup_{\mathbf{z} \in K_M} \left\{ \left\| H_U(\mathbf{z}) - H_{\mathrm{ESN}}(\mathbf{z}) \right\| \right\} \le |||H_U - G|||_{\infty} + |||G - H_{\mathrm{FNN}}|||_{\infty} + |||H_{\mathrm{FNN}} - H_{\mathrm{ESN}}|||_{\infty} < \frac{\varepsilon}{3} + \frac{\varepsilon}{3} + \frac{\varepsilon}{3} = \varepsilon.$$

In order to conclude the proof, it remains to be shown that $H_{\mathrm{ESN}}$ is indeed the functional associated to an echo state network. Let $\bar{N} = \widetilde{N}_1 + \cdots + \widetilde{N}_K + N$ and

$$\bar{A} = \begin{bmatrix}
\mathbb{O}_{\widetilde{N}_1,\widetilde{N}_1} & \mathbb{O}_{\widetilde{N}_1,\widetilde{N}_2} & \cdots & \cdots & \mathbb{O}_{\widetilde{N}_1,\widetilde{N}_K} & \mathbb{O}_{\widetilde{N}_1,N} \\
\widetilde{A}_2 \widetilde{W}_1 & \mathbb{O}_{\widetilde{N}_2,\widetilde{N}_2} & \cdots & \cdots & \mathbb{O}_{\widetilde{N}_2,\widetilde{N}_K} & \mathbb{O}_{\widetilde{N}_2,N} \\
\mathbb{O}_{\widetilde{N}_3,\widetilde{N}_1} & \widetilde{A}_3 \widetilde{W}_2 & \ddots & & \mathbb{O}_{\widetilde{N}_3,\widetilde{N}_K} & \mathbb{O}_{\widetilde{N}_3,N} \\
\vdots & & \ddots & \ddots & \vdots & \vdots \\
\mathbb{O}_{\widetilde{N}_K,\widetilde{N}_1} & \mathbb{O}_{\widetilde{N}_K,\widetilde{N}_2} & \cdots & \widetilde{A}_K \widetilde{W}_{K-1} & \mathbb{O}_{\widetilde{N}_K,\widetilde{N}_K} & \mathbb{O}_{\widetilde{N}_K,N} \\
A^{(-1)} \widetilde{W}_1 & A^{(-2)} \widetilde{W}_2 & \cdots & A^{(-(K-1))} \widetilde{W}_{K-1} & A^{(-K)} \widetilde{W}_K & \mathbb{O}_{N,N}
\end{bmatrix} \in \mathbb{M}_{\bar{N},\bar{N}},$$

$$\bar{C} = \begin{bmatrix} \widetilde{A}_1 \\ \mathbb{O}_{\widetilde{N}_2,d} \\ \vdots \\ \mathbb{O}_{\widetilde{N}_K,d} \\ A^{(0)} \end{bmatrix} \in \mathbb{M}_{\bar{N},d}, \qquad \bar{\zeta} = \begin{bmatrix} \widetilde{\zeta}_1 \\ \widetilde{\zeta}_2 \\ \vdots \\ \widetilde{\zeta}_K \\ \zeta \end{bmatrix} \in \mathbb{R}^{\bar{N}}.$$
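The block structure of $\bar{A}$, $\bar{C}$, and $\bar{\zeta}$ can be assembled programmatically. In the sketch below the sub-blocks $\widetilde{A}_j$, $\widetilde{W}_j$, $\widetilde{\zeta}_j$, $A^{(-j)}$, and $\zeta$ are randomly generated placeholders (in the proof they come from the approximation theorems), and a one-step check of the resulting cascade is included:

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, K = 2, 8, 3
Nt = [4, 5, 6]   # illustrative widths of the identity networks (N~_1, ..., N~_K)

# Illustrative sub-blocks (in the proof these come from approximation theorems).
At = [rng.standard_normal((Nt[j], d)) for j in range(K)]     # A~_{j+1}
Wt = [rng.standard_normal((d, Nt[j])) for j in range(K)]     # W~_{j+1}
zet = [rng.standard_normal(Nt[j]) for j in range(K)]         # zeta~_{j+1}
Aneg = [rng.standard_normal((N, d)) for j in range(K + 1)]   # A^{(0)}, ..., A^{(-K)}
zeta = rng.standard_normal(N)

Nbar = sum(Nt) + N
sizes = Nt + [N]                 # block sizes along rows and columns
offs = np.cumsum([0] + sizes)    # block offsets

Abar = np.zeros((Nbar, Nbar))
# Sub-diagonal blocks A~_{j+1} W~_j chain the identity networks together.
for j in range(1, K):
    Abar[offs[j]:offs[j+1], offs[j-1]:offs[j]] = At[j] @ Wt[j-1]
# Last block row: A^{(-j)} W~_j for j = 1, ..., K.
for j in range(1, K + 1):
    Abar[offs[K]:offs[K+1], offs[j-1]:offs[j]] = Aneg[j] @ Wt[j-1]

Cbar = np.zeros((Nbar, d))
Cbar[offs[0]:offs[1], :] = At[0]       # A~_1 feeds the raw input into the chain
Cbar[offs[K]:offs[K+1], :] = Aneg[0]   # A^{(0)} feeds the current input directly

zbar = np.concatenate(zet + [zeta])

# Quick check of the cascade structure: after one step from a zero state, the
# first block of the state equals sigma(A~_1 z_t + zeta~_1).
x = np.zeros(Nbar)
z_t = rng.uniform(-1, 1, d)
x = np.tanh(Abar @ x + Cbar @ z_t + zbar)
assert np.allclose(x[:Nt[0]], np.tanh(At[0] @ z_t + zet[0]))
print(Abar.shape, Cbar.shape, zbar.shape)
```

The nilpotent (strictly lower-triangular) block pattern of $\bar{A}$ is what makes the state recursion resolvable in finitely many steps, which is the mechanism behind the existence and uniqueness argument that follows.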

Consider now the state vectors

$$\mathbf{x}_t = \begin{bmatrix} \mathbf{x}_t^{(0)} \\ \mathbf{x}_t^{(1)} \\ \vdots \\ \mathbf{x}_t^{(K)} \end{bmatrix} \in \mathbb{R}^{\bar{N}}, \quad \text{with } \mathbf{x}_t^{(i)} \in \mathbb{R}^{\widetilde{N}_{i+1}}, \; i \in \{0, 1, \ldots, K-1\}, \quad \mathbf{x}_t^{(K)} \in \mathbb{R}^N,$$

and the state equation in $\mathbb{R}^{\bar{N}}$ determined by

$$\mathbf{x}_t = \sigma(\bar{A}\mathbf{x}_{t-1} + \bar{C}\mathbf{z}_t + \bar{\zeta}). \tag{2.7}$$

By our choice of matrices, the solutions $\mathbf{x}$ of (2.7) (if they exist) satisfy

$$\mathbf{x}_t^{(0)} = \sigma(\widetilde{A}_1 \mathbf{z}_t + \widetilde{\zeta}_1) \in \mathbb{R}^{\widetilde{N}_1}, \quad \mathbf{x}_t^{(j)} = \sigma\big(\widetilde{A}_{j+1} \widetilde{W}_j \mathbf{x}_{t-1}^{(j-1)} + \widetilde{\zeta}_{j+1}\big) \in \mathbb{R}^{\widetilde{N}_{j+1}}, \; j \in \{1, \ldots, K-1\},$$

and

$$\mathbf{x}_t^{(K)} = \sigma\!\left( \sum_{j=1}^{K} A^{(-j)} \widetilde{W}_j \mathbf{x}_{t-1}^{(j-1)} + A^{(0)} \mathbf{z}_t + \zeta \right) \in \mathbb{R}^N, \quad \text{for all } t \in \mathbb{Z}_-. \tag{2.8}$$


Iterating the expression (2.8) we obtain that this solution indeed exists, is unique, and is given by

$$\mathbf{x}_t^{(0)} = \sigma(\widetilde{A}_1 \mathbf{z}_t + \widetilde{\zeta}_1), \quad \mathbf{x}_t^{(1)} = \sigma(\widetilde{A}_2 \mathcal{I}_1(\mathbf{z}_{t-1}) + \widetilde{\zeta}_2), \; \ldots, \; \mathbf{x}_t^{(K-1)} = \sigma(\widetilde{A}_K \mathcal{J}_{K-1}(\mathbf{z}_{t-(K-1)}) + \widetilde{\zeta}_K),$$

and

$$\mathbf{x}_t^{(K)} = \sigma\!\left( \sum_{j=1}^{K} A^{(-j)} \widetilde{W}_j \sigma\big( \widetilde{A}_j \mathcal{J}_{j-1}(\mathbf{z}_{t-j}) + \widetilde{\zeta}_j \big) + A^{(0)} \mathbf{z}_t + \zeta \right) = \sigma\!\left( \sum_{j=0}^{K} A^{(-j)} \mathcal{J}_j(\mathbf{z}_{t-j}) + \zeta \right).$$

Consequently,

$$W \mathbf{x}_0^{(K)} = W\sigma\!\left( \sum_{j=0}^{K} A^{(-j)} \mathcal{J}_j(\mathbf{z}_{-j}) + \zeta \right) = H_{\mathrm{ESN}}(\mathbf{z}).$$

This equality shows that $H_{\mathrm{ESN}}$ is the functional associated to the echo state equation (2.7) and the linear readout given by $\overline{W} := \big[ \mathbb{O}_{m,\bar{N}-N} \,\big|\, W \big] \in \mathbb{M}_{m,\bar{N}}$, which concludes the proof.

3 A numerical illustration

This section provides a brief numerical experiment to illustrate the universality of the echo state family,

as proved in Theorem 2.1. The unknown input/output system is given in this case by the “stochastic

alpha, beta, rho” (SABR) model [Haga 02]. SABR is a stochastic volatility model frequently used in

the ﬁnancial industry determined by the two-dimensional stochastic diﬀerential equation:

(dFt=αtFβ

tdB1

t,

dαt=ναtdB2

t,

where $B^1$ and $B^2$ are two Wiener processes correlated by $\mathrm{d}B^1 \mathrm{d}B^2 = \rho \, \mathrm{d}t$. These equations determine an input/output system that maps the two-dimensional realizations of $(B^1, B^2)$ to the two-dimensional sample path $(F, \alpha)$ that is uniquely determined by the SABR model once initial conditions for these two variables have been chosen. For our experiment we use the SABR parameters $\nu = 0.5$, $\beta = 0.5$, $\rho = 0.7$ and the initial values $(1, 0.5)$. We generate 100 noise trajectories on the interval $[0, 1]$ and use the Euler discretization method with $10^5$ time steps to simulate the associated SABR trajectories. We then
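An Euler-Maruyama discretization of the SABR system along these lines might look as follows (fewer time steps than in the paper, for speed; the absolute value in the $F$ update is an illustrative guard against negative values under discretization, not part of the model):

```python
import numpy as np

def simulate_sabr(n_steps=10_000, T=1.0, nu=0.5, beta=0.5, rho=0.7,
                  F0=1.0, alpha0=0.5, seed=0):
    """Euler-Maruyama scheme for dF = alpha F^beta dB1, dalpha = nu alpha dB2."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    # Correlated Brownian increments: dB1 dB2 = rho dt.
    dW = rng.standard_normal((n_steps, 2)) * np.sqrt(dt)
    dB1 = dW[:, 0]
    dB2 = rho * dW[:, 0] + np.sqrt(1 - rho ** 2) * dW[:, 1]
    F = np.empty(n_steps + 1)
    alpha = np.empty(n_steps + 1)
    F[0], alpha[0] = F0, alpha0
    for k in range(n_steps):
        F[k + 1] = F[k] + alpha[k] * abs(F[k]) ** beta * dB1[k]
        alpha[k + 1] = alpha[k] + nu * alpha[k] * dB2[k]
    return F, alpha

F, alpha = simulate_sabr()
print(F.shape, alpha.shape)  # (10001,) (10001,)
```

Each pair of increment paths $(B^1, B^2)$ is one input trajectory of the unknown input/output system, and the simulated pair $(F, \alpha)$ is the corresponding target output.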

learn this input/output mapping using echo state networks (ESNs) with $\sigma(x) = \tanh(x)$ and an increasing number of neurons $N \in \{10, 20, \ldots, 90\}$. Moreover, for each $N$ we randomly generate 10 different choices of the ESN parameters $A$, $C$, $\zeta$. For each of these parameter choices we apply the reservoir functional and train $W$ by mean square error minimization, that is, by a linear regression. Finally, we choose the best-performing network among the 10 parameter draws. The entries of $A$, $C$, and $\zeta$ are drawn from a standard normal distribution, and $A$, $C$, $\zeta$ are subsequently scaled by $(\rho(A) + 0.1)^{-1} c$, where $\rho(A)$ is the spectral radius of $A$ and $c$ is a common factor determined by some preliminary experiments. For each $N$ we then create a box plot of the approximation errors for the 100 input trajectories. These box plots are collected in Figure 1. It is clearly visible from these figures that the approximation error decreases to 0 as $N$ increases. The maximum error for each $N$ is represented in the figure by the largest outlier (circled), and the figure indicates that the maximum error also tends to 0 as $N$ increases.
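The training scheme just described (randomly drawn $A$, $C$, $\zeta$ rescaled through the spectral radius, readout $W$ fitted by least squares) can be sketched as follows on a toy input/output pair; all dimensions, the scaling constant, and the toy target are illustrative assumptions:

```python
import numpy as np

def train_esn_readout(z, y, N=50, c=0.5, seed=0):
    """Randomly generate A, C, zeta, run the reservoir on the input z,
    and fit the linear readout W by ordinary least squares."""
    rng = np.random.default_rng(seed)
    d = z.shape[1]
    A = rng.standard_normal((N, N))
    C = rng.standard_normal((N, d))
    zeta = rng.standard_normal(N)
    # Rescale by (rho(A) + 0.1)^{-1} c, with rho(A) the spectral radius of A.
    s = c / (max(abs(np.linalg.eigvals(A))) + 0.1)
    A, C, zeta = s * A, s * C, s * zeta
    # Collect the reservoir states x_t = tanh(A x_{t-1} + C z_t + zeta).
    X = np.zeros((len(z), N))
    x = np.zeros(N)
    for t, z_t in enumerate(z):
        x = np.tanh(A @ x + C @ z_t + zeta)
        X[t] = x
    # Mean-square-error-optimal readout: W solves min ||X W - y||^2.
    W, *_ = np.linalg.lstsq(X, y, rcond=None)
    return (A, C, zeta, W.T), X @ W

rng = np.random.default_rng(1)
z = rng.uniform(-1, 1, size=(500, 2))                  # toy input trajectory
y = np.sin(z[:, :1] + 0.5 * np.roll(z[:, 1:], 1, 0))   # toy target with memory
params, y_hat = train_esn_readout(z, y)
print(float(np.mean((y_hat - y) ** 2)))  # training MSE
```

Repeating this over several random draws of $(A, C, \zeta)$ and keeping the best-performing network reproduces the selection procedure used in the experiment.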

Acknowledgments: The authors acknowledge partial financial support from the Research Commission of the Universität Sankt Gallen, the Swiss National Science Foundation (grant number 200021 175801/1), and the French ANR "BIPHOPROC" project (ANR-14-OHRI-0002-02). The authors gratefully acknowledge the hospitality and generosity of the Division of Mathematical Sciences of the Nanyang Technological University, Singapore, and the FIM at ETH Zurich, where a significant portion of the results in this paper was obtained.


Figure 1: Box plots of the approximation error for the SABR input/output system using an echo state network with $N$ neurons, for different values of $N$. The approximation error decreases to 0 as $N$ increases. The maximum error for each $N$ is represented by the largest outlier (circled) and also tends to 0 as $N$ increases.


References

[Boyd 85] S. Boyd and L. Chua. "Fading memory and the problem of approximating nonlinear operators with Volterra series". IEEE Transactions on Circuits and Systems, Vol. 32, No. 11, pp. 1150–1161, 1985.

[Bute 13] P. Buteneers, D. Verstraeten, B. V. Nieuwenhuyse, D. Stroobandt, R. Raedt, K. Vonck, P. Boon, and B. Schrauwen. "Real-time detection of epileptic seizures in animal models using reservoir computing". Epilepsy Research, Vol. 103, No. 2, pp. 124–134, 2013.

[Cybe 89] G. Cybenko. "Approximation by superpositions of a sigmoidal function". Mathematics of Control, Signals, and Systems, Vol. 2, No. 4, pp. 303–314, 1989.

[Gono 20a] L. Gonon, L. Grigoryeva, and J.-P. Ortega. "Approximation error estimates for random neural networks and reservoir systems". arXiv preprint 2002.05933, 2020.

[Gono 20b] L. Gonon, L. Grigoryeva, and J.-P. Ortega. "Risk bounds for reservoir computing". Journal of Machine Learning Research (to appear), 2020.

[Gono 20c] L. Gonon and J.-P. Ortega. "Reservoir computing universality with stochastic inputs". IEEE Transactions on Neural Networks and Learning Systems, Vol. 31, No. 1, pp. 100–112, 2020.

[Grig 14] L. Grigoryeva, J. Henriques, L. Larger, and J.-P. Ortega. "Stochastic time series forecasting using time-delay reservoir computers: performance and universality". Neural Networks, Vol. 55, pp. 59–71, 2014.

[Grig 18a] L. Grigoryeva and J.-P. Ortega. "Echo state networks are universal". Neural Networks, Vol. 108, pp. 495–508, 2018.

[Grig 18b] L. Grigoryeva and J.-P. Ortega. "Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-affine systems". Journal of Machine Learning Research, Vol. 19, No. 24, pp. 1–40, 2018.

[Grig 19] L. Grigoryeva and J.-P. Ortega. "Differentiable reservoir computing". Journal of Machine Learning Research, Vol. 20, No. 179, pp. 1–62, 2019.

[Grig 20] L. Grigoryeva, A. G. Hart, and J.-P. Ortega. "Chaos on compact manifolds: differentiable synchronizations beyond Takens". Preprint arXiv:2010.03218, 2020.

[Haga 02] P. S. Hagan, D. Kumar, A. S. Lesniewski, and D. E. Woodward. "Managing smile risk". The Best of Wilmott, Vol. 1, pp. 249–296, 2002.

[Hart 20] A. G. Hart, J. L. Hook, and J. H. P. Dawes. "Embedding and approximation theorems for echo state networks". Neural Networks, Vol. 128, pp. 234–247, 2020.

[Horn 89] K. Hornik, M. Stinchcombe, and H. White. "Multilayer feedforward networks are universal approximators". Neural Networks, Vol. 2, No. 5, pp. 359–366, 1989.

[Horn 91] K. Hornik. "Approximation capabilities of multilayer feedforward networks". Neural Networks, Vol. 4, No. 2, pp. 251–257, 1991.

[Jaeg 04] H. Jaeger and H. Haas. "Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication". Science, Vol. 304, No. 5667, pp. 78–80, 2004.

[Lu 18] Z. Lu, B. R. Hunt, and E. Ott. "Attractor reconstruction by machine learning". Chaos, Vol. 28, No. 6, 2018.

[Luko 09] M. Lukoševičius and H. Jaeger. "Reservoir computing approaches to recurrent neural network training". Computer Science Review, Vol. 3, No. 3, pp. 127–149, 2009.

[Matt 93] M. B. Matthews. "Approximating nonlinear fading-memory operators using neural network models". Circuits, Systems, and Signal Processing, Vol. 12, No. 2, pp. 279–307, 1993.

[Path 17] J. Pathak, Z. Lu, B. R. Hunt, M. Girvan, and E. Ott. "Using machine learning to replicate chaotic attractors and calculate Lyapunov exponents from data". Chaos, Vol. 27, No. 12, 2017.

[Path 18] J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott. "Model-free prediction of large spatiotemporally chaotic systems from data: a reservoir computing approach". Physical Review Letters, Vol. 120, No. 2, p. 024102, 2018.

[Perr 96] P. C. Perryman. Approximation Theory for Deterministic and Stochastic Nonlinear Systems. PhD thesis, University of California, Irvine, 1996.

[Sand 91] I. W. Sandberg. "Approximation theorems for discrete-time systems". IEEE Transactions on Circuits and Systems, Vol. 38, No. 5, pp. 564–566, 1991.

[Stub 97] A. Stubberud and P. Perryman. "Current state of system approximation for deterministic and stochastic systems". In: Conference Record of the Thirtieth Asilomar Conference on Signals, Systems and Computers, pp. 141–145, IEEE Computer Society Press, 1997.

[Tana 19] G. Tanaka, T. Yamane, J. B. Héroux, R. Nakane, N. Kanazawa, S. Takeda, H. Numata, D. Nakano, and A. Hirose. "Recent advances in physical reservoir computing: a review". Neural Networks, Vol. 115, pp. 100–123, 2019.

[Wyff 10] F. Wyffels and B. Schrauwen. "A comparative study of reservoir computing strategies for monthly time series prediction". Neurocomputing, Vol. 73, No. 10, pp. 1958–1964, 2010.