
Echo state networks are universal

Lyudmila Grigoryeva^1 and Juan-Pablo Ortega^2,3

Abstract

This paper shows that echo state networks are universal uniform approximants in the context of discrete-time fading memory filters with uniformly bounded inputs defined on negative infinite times. This result guarantees that any fading memory input/output system in discrete time can be realized as a simple finite-dimensional neural network-type state-space model with a static linear readout map. This approximation is valid for infinite time intervals. The proof of this statement is based on fundamental results, also presented in this work, about the topological nature of the fading memory property and about reservoir computing systems generated by continuous reservoir maps.

Key Words: reservoir computing, universality, echo state networks, ESN, state-affine systems, SAS, machine learning, fading memory property, echo state property, linear training, uniform system approximation.

1 Introduction

Many recently introduced machine learning techniques in the context of dynamical problems have much in common with the system identification procedures developed over the last decades for applications in signal treatment, circuit theory, and, more generally, systems theory. In these problems, system knowledge is available only in the form of input-output observations, and the task consists in finding or learning a model that approximates the system, mainly for forecasting or classification purposes. An important goal in that context is finding families of transformations that are both computationally feasible and versatile enough to reproduce a rich variety of patterns just by modifying a limited number of procedural parameters. The versatility or flexibility of a given machine learning paradigm is usually established by proving its universality. We say that a family of transformations is universal when its elements can approximate as accurately as one wants all the elements of a sufficiently rich class containing, for example, all continuous or even all measurable transformations. In the language of learning theory, this is equivalent to the possibility of making approximation errors arbitrarily small [Cuck 02, Smal 03, Cuck 07]. In more mathematical terms, the universality of a family amounts to its density in a rich class of the type mentioned above. Well-known universality results are, for example, the uniform approximation properties of feedforward neural networks established in [Cybe 89, Horn 89, Horn 91] in the context of static continuous and, more generally, measurable real functions.

A first solution to this problem in the dynamic context was pioneered in the works of Fréchet [Frec 10] and Volterra [Volt 30] one century ago, when they proved that finite Volterra series can be

^1 Department of Mathematics and Statistics. Universität Konstanz. Box 146. D-78457 Konstanz. Germany. Lyudmila.Grigoryeva@uni-konstanz.de

^2 Universität Sankt Gallen. Faculty of Mathematics and Statistics. Bodanstrasse 6. CH-9000 Sankt Gallen. Switzerland. Juan-Pablo.Ortega@unisg.ch

^3 Centre National de la Recherche Scientifique (CNRS). France.


used to uniformly approximate continuous functionals defined on compact sets of continuous functions. These results were further extended in the 1950s by the MIT school led by N. Wiener [Wien 58, Bril 58, Geor 59], but always under compactness assumptions on the input space and the time interval on which inputs are defined. A major breakthrough was the generalization to infinite time intervals carried out by Boyd and Chua in [Boyd 85], who formulated a uniform approximation theorem using Volterra series for operators endowed with the so-called fading memory property on continuous-time inputs. An input/output system is said to have fading memory when the outputs associated to inputs that are close in the recent past are close, even when those inputs may be very different in the distant past.

In this paper we address the universality, or uniform approximation, problem for transformations or filters of discrete-time signals of infinite length that have the fading memory property. The approximating set that we use is generated by nonlinear state-space transformations and is referred to as reservoir computers (RC) [Jaeg 10, Jaeg 04, Maas 02, Maas 11, Croo 07, Vers 07, Luko 09] or reservoir systems. These are special types of recurrent neural networks determined by two maps, namely a reservoir F: R^N × R^n → R^N (n, N ∈ N) and a readout map h: R^N → R^d that, under certain hypotheses, transform (or filter) an infinite discrete-time input z = (..., z_{-1}, z_0, z_1, ...) ∈ (R^n)^Z into an output signal y ∈ (R^d)^Z of the same type using the state-space transformation given by:

x_t = F(x_{t-1}, z_t),    (1.1)
y_t = h(x_t),    (1.2)

where t ∈ Z and the dimension N ∈ N of the state vectors x_t ∈ R^N is referred to as the number of virtual neurons of the system. When a RC system has a uniquely determined filter associated to it, we refer to it as the RC filter.
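As a toy numerical illustration of the state-space recursion (1.1)-(1.2), the sketch below iterates a reservoir map over a finite window of inputs and applies a readout at every step. The particular choices of F (a random contracting tanh map) and h (a linear functional) are illustrative assumptions of ours, not maps used in the paper:

```python
import numpy as np

def run_reservoir(F, h, z_window, x_init):
    """Iterate x_t = F(x_{t-1}, z_t) over a finite input window and
    apply the readout y_t = h(x_t) at every step."""
    x = x_init
    ys = []
    for z_t in z_window:
        x = F(x, z_t)          # state-space update (1.1)
        ys.append(h(x))        # static readout (1.2)
    return np.array(ys)

# Hypothetical reservoir and readout maps (illustrative only):
N, n = 5, 1
rng = np.random.default_rng(0)
A = 0.5 * rng.standard_normal((N, N)) / np.sqrt(N)
C = rng.standard_normal((N, n))
F = lambda x, z: np.tanh(A @ x + C @ z)   # a contracting nonlinear reservoir
h = lambda x: x.sum()                     # a linear functional as readout

z_window = rng.standard_normal((100, n))
y = run_reservoir(F, h, z_window, np.zeros(N))
print(y.shape)  # (100,): one output per input time step
```

The readout sees only the current state, so all the memory of past inputs must be carried by the recursion itself.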

An important advantage of the RC approach is that, under certain hypotheses, intrinsically infinite-dimensional problems regarding filters can be translated into analogous questions about the reservoir and readout maps that generate them, which are defined on much simpler finite-dimensional spaces. This strategy has already been used in the literature in relation to the universality question in, for instance, [Sand 91a, Sand 91b, Matt 92, Matt 93, Perr 96, Stub 97]. The universal approximation properties of feedforward neural networks [Kolm 56, Arno 57, Spre 65, Spre 96, Spre 97, Cybe 89, Horn 89, Horn 90, Horn 91, Horn 93, Rusc 98] were used in those works to find neural network-based families of filters that are dense in the set of approximately finite memory filters with inputs defined on the positive real half-line. Other works in connection with the universality problem in the dynamic context are [Maas 00, Maas 02, Maas 04, Maas 07], where RC is referred to as Liquid State Machines. In those references, and in the same vein as in [Boyd 85], universal families of RC systems with inputs defined on infinite continuous-time intervals were identified in the fading memory category as a corollary of the Stone-Weierstrass theorem. This approach required invoking the natural hypotheses associated to this result, like the pointwise separation property or the compactness of the input space, the latter obtained as a consequence of the fading memory property. Another strand of interesting literature that we will not explore in this work has to do with the Turing computability capabilities of the systems of the type that we just introduced; recent relevant works in this direction are [Kili 96, Sieg 97, Cabe 15, Cabe 16], and references therein.

The main contribution of this paper is showing that a particularly simple type of RC systems called echo state networks (ESNs) can be used as universal approximants in the context of discrete-time fading memory filters with uniformly bounded inputs defined on negative infinite times. ESNs are RC systems of the form (1.1)-(1.2) given by:

x_t = σ(A x_{t-1} + C z_t + ζ),    (1.3)
y_t = W x_t.    (1.4)

In these equations, C ∈ M_{N,n} is called the input mask, ζ ∈ R^N is the input shift, and A ∈ M_{N,N} is referred to as the reservoir matrix. The map σ in the state-space equation (1.3) is constructed by componentwise application of a sigmoid function (like the hyperbolic tangent or the logistic function) and is called the activation function. Finally, the readout map is linear in this case and implemented via the readout matrix W ∈ M_{d,N}. ESNs already appear in [Matt 92, Matt 93] under the name of recurrent networks, but it was only more recently, in the works of H. Jaeger [Jaeg 04], that their outstanding performance in machine learning applications was demonstrated.
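A minimal numerical sketch of the ESN equations (1.3)-(1.4): A, C, and ζ are drawn at random and only the linear readout W is trained, here by ordinary least squares on a one-step-recall task. The specific scalings and the training procedure are illustrative assumptions of ours, not prescriptions from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, T = 50, 1, 500

# Randomly drawn ESN parameters (illustrative scalings):
A = rng.uniform(-1, 1, (N, N))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))  # spectral radius < 1
C = rng.uniform(-1, 1, (N, n))                   # input mask
zeta = rng.uniform(-0.1, 0.1, N)                 # input shift

z = rng.uniform(-1, 1, (T, n))                   # uniformly bounded input
target = np.roll(z[:, 0], 1)                     # task: recall z_{t-1}

# Run the state equation (1.3):
X = np.zeros((T, N))
x = np.zeros(N)
for t in range(T):
    x = np.tanh(A @ x + C @ z[t] + zeta)
    X[t] = x

# Fit the linear readout (1.4) by least squares on a training stretch:
W = np.linalg.lstsq(X[100:400], target[100:400], rcond=None)[0]
pred = X[400:] @ W
err = np.sqrt(np.mean((pred - target[400:]) ** 2))
print(f"test RMSE: {err:.3f}")
```

The fact that only W is trained, by a linear regression, is what the keyword "linear training" refers to.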

The strategy that we follow to prove that statement is a combination of what the literature refers to as internal and external approximation. External approximation is the construction of a RC filter that approximates a given (not necessarily RC) filter. In the internal approximation problem, one is given a RC filter and builds another RC filter that approximates it by finding reservoir and readout maps that are close to those of the given one. In the external part of our proof we use a previous work [Grig 17] where we constructed a family of RC systems with linear readouts that we called non-homogeneous state-affine systems (SAS). We showed in that paper that the RC filters associated to SAS systems uniformly approximate any discrete-time fading memory filter with uniformly bounded inputs defined on negative infinite times. Regarding the internal approximation, we show that any RC filter, in particular SAS filters, can be approximated by ESN filters using the universal approximation property of neural networks. These two facts put together allow us to conclude that ESN filters are capable of uniformly approximating any discrete-time fading memory filter with uniformly bounded inputs. We emphasize that this result is shown exclusively for deterministic inputs using a uniform approximation criterion; an extension of this statement that accommodates stochastic inputs and L^p approximation criteria can be found in [Gono 18].

The paper is structured in three sections:

• Section 2 introduces the notation that we use all along the paper and, more importantly, specifies the topologies and Banach space structures that we need in order to talk about continuity in the context of discrete-time filters. It is worth mentioning that we characterize the fading memory property as a continuity condition of the filters that have it with respect to the product topology in the input space. In other words, the fading memory property is not a metric property, as it is usually presented in the literature, but a topological one. An important conceptual consequence of this fact is that the fading memory property does not contain any information about the rate at which the systems that have it "forget" inputs. Several corollaries that are very instrumental in the developments of the paper can be formulated as a consequence of this fact.

• Section 3 contains a collection of general results on the properties of the RC systems generated by continuous reservoir maps. In particular, we provide conditions that guarantee that a unique reservoir filter can be associated to them (the so-called echo state property) and we identify situations in which those filters are themselves continuous (they automatically satisfy the fading memory property). We also point out large classes of RC systems for which internal approximation is possible, that is, if the RC systems are close then so are the associated reservoir filters.

• Section 4 shows that echo state networks are universal uniform approximants in the category of discrete-time fading memory filters with uniformly bounded inputs.

2 Continuous and fading memory filters

This section introduces the notation of the paper as well as general facts about filters and functionals needed in the developments that follow. The new results are contained in Section 2.3, where we characterize the fading memory property as a continuity condition when the sequence spaces where inputs and outputs are defined are uniformly bounded and are endowed with the product topology. This feature makes this property independent of the weighting sequences that are usually introduced to define it.

2.1 Notation

Vectors and matrices. A column vector is denoted by a bold lower case symbol like r, and r^⊤ indicates its transpose. Given a vector v ∈ R^n, we denote its entries by v_i, with i ∈ {1, ..., n}; we also write v = (v_i)_{i ∈ {1,...,n}}. We denote by M_{n,m} the space of real n × m matrices with m, n ∈ N. When n = m, we use the symbol M_n to refer to the space of square matrices of order n. Given a matrix A ∈ M_{n,m}, we denote its components by A_{ij} and we write A = (A_{ij}), with i ∈ {1, ..., n}, j ∈ {1, ..., m}. Given a vector v ∈ R^n, the symbol ‖v‖ stands for any norm in R^n (they are all equivalent) and is not necessarily the Euclidean one, unless explicitly mentioned. The open ball with respect to a given norm ‖·‖, center v ∈ R^n, and radius r > 0 is denoted by B_{‖·‖}(v, r); its closure by B̄_{‖·‖}(v, r). For any A ∈ M_{n,m}, ‖A‖_2 denotes its matrix norm induced by the Euclidean norms in R^m and R^n, and it satisfies [Horn 13, Example 5.6.6] that ‖A‖_2 = σ_max(A), with σ_max(A) the largest singular value of A. ‖A‖_2 is sometimes referred to as the spectral norm of A. The symbol |||·||| is reserved for the norms of operators or functionals defined on infinite-dimensional spaces.

Sequence spaces. N denotes the set of natural numbers with the zero element included. Z (respectively, Z_+ and Z_-) denotes the integers (respectively, the positive and the negative integers). The symbol (R^n)^Z denotes the set of infinite real sequences of the form z = (..., z_{-1}, z_0, z_1, ...), z_i ∈ R^n, i ∈ Z; (R^n)^{Z_-} and (R^n)^{Z_+} are the subspaces consisting of, respectively, left and right infinite sequences: (R^n)^{Z_-} = {z = (..., z_{-2}, z_{-1}, z_0) | z_i ∈ R^n, i ∈ Z_-}, (R^n)^{Z_+} = {z = (z_0, z_1, z_2, ...) | z_i ∈ R^n, i ∈ Z_+}. Analogously, (D_n)^Z, (D_n)^{Z_-}, and (D_n)^{Z_+} stand for (semi-)infinite sequences with elements in the subset D_n ⊂ R^n. In most cases we endow these infinite product spaces with the Banach space structures associated to one of the following two norms:

• The supremum norm: define ‖z‖_∞ := sup_{t ∈ Z} {‖z_t‖}. The symbols ℓ^∞(R^n) and ℓ^∞_±(R^n) are used to denote the Banach spaces formed by the elements in the corresponding infinite product spaces that have a finite supremum norm.

• The weighted norm: let w: N → (0, 1] be a decreasing sequence with zero limit. We define the weighted norm ‖·‖_w on (R^n)^{Z_-} associated to the weighting sequence w as the map:

‖·‖_w : (R^n)^{Z_-} → R_+
z ↦ ‖z‖_w := sup_{t ∈ Z_-} {‖z_t‖ w_{-t}}.

Proposition 5.2 in Appendix 5.11 shows that the space

ℓ^w_-(R^n) := {z ∈ (R^n)^{Z_-} | ‖z‖_w < ∞},

endowed with the weighted norm ‖·‖_w, also forms a Banach space.

It is straightforward to show that ‖z‖_w ≤ ‖z‖_∞ for all z ∈ (R^n)^{Z_-}. This implies that ℓ^∞_-(R^n) ⊂ ℓ^w_-(R^n) and that the inclusion map (ℓ^∞_-(R^n), ‖·‖_∞) → (ℓ^w_-(R^n), ‖·‖_w) is continuous.
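The two norms can be compared numerically on truncated left-infinite sequences. The sketch below (an illustration of ours, under the assumption that a finite window indexed t = -T+1, ..., 0 stands in for a left-infinite sequence) evaluates ‖z‖_∞ and ‖z‖_w for the geometric weighting w_t = λ^t:

```python
import numpy as np

def sup_norm(z):
    """Supremum norm over a finite window of a sequence."""
    return np.max(np.abs(z))

def weighted_norm(z, lam):
    """Weighted norm sup_{t <= 0} |z_t| * w_{-t} with w_t = lam**t.
    z[-1] is the entry at time t = 0, z[-2] at t = -1, etc."""
    T = len(z)
    weights = lam ** np.arange(T - 1, -1, -1)   # w_{-t} = lam**(-t)
    return np.max(np.abs(z) * weights)

z = np.ones(200)                 # uniformly bounded sequence, sup norm = 1
print(sup_norm(z))               # 1.0
print(weighted_norm(z, 0.5))     # 1.0: the weight at t = 0 is w_0 = 1
# Entries in the distant past are discounted: a large perturbation there
# does not move the weighted norm at all.
z_far = z.copy()
z_far[0] = 100.0                 # perturbation at time t = -199
print(weighted_norm(z_far, 0.5) - weighted_norm(z, 0.5))  # 0.0
```

This also makes the inequality ‖z‖_w ≤ ‖z‖_∞ visible: every weight is at most 1, with equality attained at t = 0.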

2.2 Filters and systems

Filters. Let D_n ⊂ R^n and D_N ⊂ R^N. We refer to maps of the type U: (D_n)^Z → (D_N)^Z as filters or operators and to those like H: (D_n)^{Z_-} → D_N (or H: (D_n)^{Z_±} → D_N) as R^N-valued functionals. These definitions will sometimes be extended to accommodate situations where the domains and the targets of the filters are not necessarily product spaces but just arbitrary subsets of (R^n)^Z and (R^N)^Z like, for instance, ℓ^∞(R^n) and ℓ^∞(R^N).

A filter U: (D_n)^Z → (D_N)^Z is called causal when for any two elements z, w ∈ (D_n)^Z that satisfy z_τ = w_τ for any τ ≤ t, for a given t ∈ Z, we have that U(z)_t = U(w)_t. Let T_τ: (D_n)^Z → (D_n)^Z be the time delay operator defined by T_τ(z)_t := z_{t-τ}. The filter U is called time-invariant (TI) when it commutes with the time delay operator, that is, T_τ ∘ U = U ∘ T_τ, for any τ ∈ Z (in this expression, the two operators T_τ have to be understood as defined on the appropriate sequence spaces).

We recall (see for instance [Boyd 85]) that there is a bijection between causal time-invariant filters and functionals on (D_n)^{Z_-}. Indeed, consider the sets F_{(D_n)^{Z_-}} and H_{(D_n)^{Z_-}} defined by

F_{(D_n)^{Z_-}} := {U: (D_n)^Z → (R^N)^Z | U is causal and time-invariant},    (2.1)
H_{(D_n)^{Z_-}} := {H: (D_n)^{Z_-} → R^N}.    (2.2)

Then, given a causal time-invariant filter U: (D_n)^Z → (R^N)^Z, we can associate to it a functional H_U: (D_n)^{Z_-} → R^N via the assignment H_U(z) := U(z^e)_0, where z^e ∈ (D_n)^Z is an arbitrary extension of z ∈ (D_n)^{Z_-} to (D_n)^Z. Let Ψ: F_{(D_n)^{Z_-}} → H_{(D_n)^{Z_-}} be the map such that Ψ(U) := H_U. Conversely, for any functional H: (D_n)^{Z_-} → R^N, we can define a causal time-invariant filter U_H: (D_n)^Z → (R^N)^Z by U_H(z)_t := H((P_{Z_-} ∘ T_{-t})(z)), where T_{-t} is the (−t)-time delay operator and P_{Z_-}: (R^n)^Z → (R^n)^{Z_-} is the natural projection. Let Φ: H_{(D_n)^{Z_-}} → F_{(D_n)^{Z_-}} be the map such that Φ(H) := U_H. It is easy to verify that:

Ψ ∘ Φ = I_{H_{(D_n)^{Z_-}}} or, equivalently, H_{U_H} = H, for any functional H: (D_n)^{Z_-} → R^N,
Φ ∘ Ψ = I_{F_{(D_n)^{Z_-}}} or, equivalently, U_{H_U} = U, for any causal TI filter U: (D_n)^Z → (R^N)^Z,

that is, Ψ and Φ are inverses of each other and hence are both bijections. Additionally, we note that the sets F_{(D_n)^{Z_-}} and H_{(D_n)^{Z_-}} are vector spaces with naturally defined operations and that Ψ and Φ are linear maps between them, which allows us to conclude that F_{(D_n)^{Z_-}} and H_{(D_n)^{Z_-}} are linearly isomorphic.

When a filter is causal and time-invariant, we work in many situations just with the restriction U: (D_n)^{Z_-} → (D_N)^{Z_-} instead of the original filter U: (D_n)^Z → (D_N)^Z without making the distinction, since the former uniquely determines the latter. Indeed, by definition, for any z ∈ (D_n)^Z and t ∈ Z:

U(z)_t = (T_{-t}(U(z)))_0 = (U(T_{-t}(z)))_0,    (2.3)

where the second equality holds by the time-invariance of U, and the value on the right-hand side depends only on P_{Z_-}(T_{-t}(z)) ∈ (D_n)^{Z_-}, by causality.
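The correspondence Φ between functionals and causal time-invariant filters can be made concrete on finite windows. Below, a hypothetical functional H (a discounted sum of the recent past; purely an illustrative choice of ours) is turned into a filter by evaluating H on the shifted-and-truncated history of the input at every time t, mirroring U_H(z)_t := H((P_{Z_-} ∘ T_{-t})(z)):

```python
import numpy as np

def H(history):
    """A hypothetical functional: discounted sum of the past.
    history[-1] is the entry at time 0, history[-2] at time -1, etc."""
    lam = 0.5
    weights = lam ** np.arange(len(history) - 1, -1, -1)
    return float(np.sum(weights * history))

def U_H(z):
    """Filter induced by H: the output at step t is H applied to the
    history of z up to and including step t."""
    return np.array([H(z[: t + 1]) for t in range(len(z))])

z = np.ones(10)
y = U_H(z)
# Causality: y_t only depends on z_0, ..., z_t, so changing a future
# entry leaves all earlier outputs untouched.
z2 = z.copy(); z2[-1] = 42.0
assert np.allclose(U_H(z2)[:-1], y[:-1])
print(y[-1])  # geometric series 1 + 1/2 + ... + (1/2)^9 ≈ 1.998
```

Time-invariance holds exactly for the bi-infinite filter; on finite windows it is only approximate because the truncation discards the distant past.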

Reservoir systems and filters. Consider now the RC system determined by (1.1)–(1.2) with reservoir map defined on subsets D_N, D'_N ⊂ R^N and D_n ⊂ R^n, that is, F: D_N × D_n → D'_N and h: D'_N → R^d. There are two properties of reservoir systems that will be crucial in what follows:

• Existence of solutions property: this property holds when for each z ∈ (D_n)^Z there exists an element x ∈ (D_N)^Z that satisfies the relation (1.1) for each t ∈ Z.

• Uniqueness of solutions or echo state property (ESP): it holds when the system has the existence of solutions property and, additionally, these solutions are unique.
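As a hedged numerical illustration (the scalar example below is ours, not the paper's), the linear reservoir x_t = a x_{t-1} + z_t with |a| < 1 behaves as the ESP requires on bounded inputs: iterating from two different initial conditions produces state trajectories that converge to each other, so the initial state is "washed out" and the bounded solution is unique; with |a| > 1 the discrepancy between initial states grows, so distinct solutions never merge:

```python
import numpy as np

def iterate(a, z, x0):
    """Run the scalar linear reservoir x_t = a * x_{t-1} + z_t."""
    xs = []
    x = x0
    for z_t in z:
        x = a * x + z_t
        xs.append(x)
    return np.array(xs)

rng = np.random.default_rng(2)
z = rng.uniform(-1, 1, 200)

# Contractive case: two arbitrary initial states are washed out.
gap_stable = np.abs(iterate(0.8, z, 5.0) - iterate(0.8, z, -5.0))
print(gap_stable[-1])    # ≈ 10 * 0.8**200, numerically zero

# Expanding case: the gap between initial states blows up instead.
gap_unstable = np.abs(iterate(1.2, z[:50], 5.0) - iterate(1.2, z[:50], -5.0))
print(gap_unstable[-1])  # ≈ 10 * 1.2**50 ≈ 9.1e4
```

In the linear case the gap between the two trajectories evolves autonomously as a^t times the initial gap, which is why the contraction condition decides between the two behaviors.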

The echo state property has received much attention in the context of echo state networks [Jaeg 10, Jaeg 04, Bueh 06, Yild 12, Bai 12, Wain 16, Manj 13, Gall 17]. We emphasize that these two properties are genuine conditions that are not automatically satisfied by all RC systems. Later on in the paper, Theorem 3.1 specifies sufficient conditions for them to hold.

The combination of the existence of solutions with the axiom of choice allows us to associate a filter U^F: (D_n)^Z → (D_N)^Z to each RC system with that property via the reservoir map and (1.1), that is, U^F(z)_t := x_t ∈ R^N, for all t ∈ Z. We will denote by U^F_h: (D_n)^Z → (D_d)^Z the corresponding filter determined by the entire reservoir system, that is, U^F_h(z)_t = h(U^F(z)_t) := y_t ∈ R^d. U^F_h is said to be a reservoir filter or a response map associated to the RC system (1.1)–(1.2). The filters U^F and U^F_h are causal by construction. A unique reservoir filter can be associated to a reservoir system when the echo state property holds. We warn the reader that reservoir filters appear in the literature only in the presence of the ESP; that is why we sometimes make the distinction between those that come from reservoir systems that do and do not satisfy the ESP by referring to them as reservoir filters and generalized reservoir filters, respectively.

In the systems theory literature, the RC equations (1.1)–(1.2) are referred to as the state-variable or the internal representation point of view, and the associated filters as the external representation of the system.

The next proposition shows that in the presence of the ESP, reservoir filters are not only causal but also time-invariant. In that situation we can hence associate to U^F_h a reservoir functional H^F_h: (D_n)^{Z_-} → R^d determined by H^F_h := H_{U^F_h}.

Proposition 2.1 Let D_N ⊂ R^N, D_n ⊂ R^n, and let F: D_N × D_n → D_N be a reservoir map that satisfies the echo state property for all the elements in (D_n)^Z. Then, the corresponding filter U^F: (D_n)^Z → (D_N)^Z is causal and time-invariant.

We emphasize that, as can be seen in the proof in the appendix, it is the autonomous character of the reservoir map that guarantees time-invariance in the previous proposition. An explicit dependence on time in that map would spoil that conclusion.

Reservoir system morphisms. Let N_1, N_2, n, d ∈ N and let F_1: D_{N_1} × D_n → D_{N_1}, h_1: D_{N_1} → R^d and F_2: D_{N_2} × D_n → D_{N_2}, h_2: D_{N_2} → R^d be two reservoir systems. We say that a map f: D_{N_1} → D_{N_2} is a morphism between the two systems when it satisfies the following two properties:

(i) Reservoir equivariance: f(F_1(x_1, z)) = F_2(f(x_1), z), for all x_1 ∈ D_{N_1} and z ∈ D_n.

(ii) Readout invariance: h_1(x_1) = h_2(f(x_1)), for all x_1 ∈ D_{N_1}.

When the map f has an inverse that is also a morphism between the systems determined by the pairs (F_2, h_2) and (F_1, h_1), we say that f is a system isomorphism and that the systems (F_1, h_1) and (F_2, h_2) are isomorphic. Given a system F_1: D_{N_1} × D_n → D_{N_1}, h_1: D_{N_1} → R^d and a bijection f: D_{N_1} → D_{N_2}, the map f is a system isomorphism with respect to the system F_2: D_{N_2} × D_n → D_{N_2}, h_2: D_{N_2} → R^d defined by

F_2(x_2, z) := f(F_1(f^{-1}(x_2), z)), for all x_2 ∈ D_{N_2}, z ∈ D_n,    (2.4)
h_2(x_2) := h_1(f^{-1}(x_2)), for all x_2 ∈ D_{N_2}.    (2.5)

The proof of the following statement is a straightforward consequence of the definitions.

Proposition 2.2 Let F_1: D_{N_1} × D_n → D_{N_1}, h_1: D_{N_1} → R^d and F_2: D_{N_2} × D_n → D_{N_2}, h_2: D_{N_2} → R^d be two reservoir systems. Let f: D_{N_1} → D_{N_2} be a morphism between them. Then:

(i) If x^1 ∈ (D_{N_1})^Z is a solution for the reservoir map F_1 associated to the input z ∈ (D_n)^Z, then the sequence x^2 ∈ (D_{N_2})^Z defined by x^2_t := f(x^1_t), t ∈ Z, is a solution for the reservoir map F_2 associated to the same input.

(ii) If U^{F_1}_{h_1} is a generalized reservoir filter for the system determined by the pair (F_1, h_1), then it is also a reservoir filter for the system (F_2, h_2). Equivalently, given a generalized reservoir filter U^{F_1}_{h_1} determined by (F_1, h_1), there exists a generalized reservoir filter U^{F_2}_{h_2} determined by (F_2, h_2) such that U^{F_1}_{h_1} = U^{F_2}_{h_2}.

(iii) If f is a system isomorphism, then the implications in the previous two points are reversible.

2.3 Continuity and the fading memory property

In agreement with the notation introduced in the previous section, in the following paragraphs the symbol U: (D_n)^{Z_-} → (D_N)^{Z_-} stands for a causal and time-invariant filter or, strictly speaking, for the restriction of U: (D_n)^Z → (D_N)^Z to Z_-, see (2.3); H_U: (D_n)^{Z_-} → D_N is the associated functional, for some D_N ⊂ R^N and D_n ⊂ R^n. Analogously, U_H is the filter associated to a given functional H.

Definition 2.3 (Continuous filters and functionals) Let D_N ⊂ R^N and D_n ⊂ R^n be bounded subsets such that (D_n)^{Z_-} ⊂ ℓ^∞_-(R^n) and (D_N)^{Z_-} ⊂ ℓ^∞_-(R^N). A causal and time-invariant filter U: (D_n)^{Z_-} → (D_N)^{Z_-} is called continuous when it is a continuous map between the metric spaces ((D_n)^{Z_-}, ‖·‖_∞) and ((D_N)^{Z_-}, ‖·‖_∞). An analogous prescription can be used to define continuous functionals H: ((D_n)^{Z_-}, ‖·‖_∞) → (D_N, ‖·‖).

The following proposition shows that when filters are causal and time-invariant, their continuity can be read out of their corresponding functionals and vice versa.

Proposition 2.4 Let D_n ⊂ R^n and D_N ⊂ R^N be such that (D_n)^{Z_-} ⊂ ℓ^∞_-(R^n) and (D_N)^{Z_-} ⊂ ℓ^∞_-(R^N). Let U: (D_n)^{Z_-} → (D_N)^{Z_-} be a causal and time-invariant filter, H: (D_n)^{Z_-} → D_N a functional, and let Φ and Ψ be the maps defined in the previous section. Then, if the filter U is continuous, so is the associated functional Ψ(U) =: H_U. Conversely, if H is continuous, then so is Φ(H) =: U_H.

Define now the vector spaces

F^∞_{(D_n)^{Z_-}} := {U: (D_n)^{Z_-} → ℓ^∞_-(R^N) | U is causal, time-invariant, and continuous},    (2.6)
H^∞_{(D_n)^{Z_-}} := {H: (D_n)^{Z_-} → R^N | H is continuous}.    (2.7)

The previous statements guarantee that the maps Ψ and Φ restrict to maps (that we denote with the same symbols) Ψ: F^∞_{(D_n)^{Z_-}} → H^∞_{(D_n)^{Z_-}} and Φ: H^∞_{(D_n)^{Z_-}} → F^∞_{(D_n)^{Z_-}} that are linear isomorphisms and are inverses of each other.

Definition 2.5 (Fading memory filters and functionals) Let w: N → (0, 1] be a weighting sequence and let D_N ⊂ R^N and D_n ⊂ R^n be such that (D_n)^{Z_-} ⊂ ℓ^w_-(R^n) and (D_N)^{Z_-} ⊂ ℓ^w_-(R^N). We say that a causal and time-invariant filter U: (D_n)^{Z_-} → (D_N)^{Z_-} (respectively, a functional H: (D_n)^{Z_-} → D_N) satisfies the fading memory property (FMP) with respect to the sequence w when it is a continuous map between the metric spaces ((D_n)^{Z_-}, ‖·‖_w) and ((D_N)^{Z_-}, ‖·‖_w) (respectively, ((D_n)^{Z_-}, ‖·‖_w) and (D_N, ‖·‖)). If the weighting sequence w is such that w_t = λ^t, for some λ ∈ (0, 1) and all t ∈ N, then U is said to have the λ-exponential fading memory property. We define the sets

F^w_{(D_n)^{Z_-},(D_N)^{Z_-}} := {U: (D_n)^{Z_-} → (D_N)^{Z_-} | U causal, time-invariant, and FMP w.r.t. w},    (2.8)
H^w_{(D_n)^{Z_-},(D_N)^{Z_-}} := {H: (D_n)^{Z_-} → D_N | H is FMP with respect to w}.    (2.9)

These definitions can be extended by replacing the product set (D_N)^{Z_-} by any subset of ℓ^w_-(R^N) that is not necessarily a product space. In particular, we define the sets

F^w_{(D_n)^{Z_-}} := {U: (D_n)^{Z_-} → ℓ^w_-(R^N) | U is causal, time-invariant, and FMP w.r.t. w},    (2.10)
H^w_{(D_n)^{Z_-}} := {H: (D_n)^{Z_-} → R^N | H is FMP with respect to w}.    (2.11)

Definitions 2.3 and 2.5 can be easily reformulated in terms of more familiar ε-δ-type criteria, as they were introduced in [Boyd 85]. For example, the continuity of the functional H: (D_n)^{Z_-} → D_N is equivalent to stating that for any z ∈ (D_n)^{Z_-} and any ε > 0, there exists a δ(ε) > 0 such that for any s ∈ (D_n)^{Z_-} that satisfies

‖z − s‖_∞ = sup_{t ∈ Z_-} {‖z_t − s_t‖} < δ(ε), then ‖H(z) − H(s)‖ < ε.    (2.12)

Regarding the fading memory property, it suffices to replace the implication in (2.12) by

‖z − s‖_w = sup_{t ∈ Z_-} {‖z_t − s_t‖ w_{-t}} < δ(ε), then ‖H(z) − H(s)‖ < ε.    (2.13)
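To see condition (2.13) at work numerically, consider again a discounted-sum functional (a hypothetical example of ours, not taken from the paper). Two inputs that differ arbitrarily in the distant past but agree recently are close in ‖·‖_w for a geometric weighting, and the functional's outputs are correspondingly close:

```python
import numpy as np

LAM = 0.5   # both the functional's discount and the weighting w_t = LAM**t

def H(z):
    """Discounted sum of the past; z[-1] is the entry at time 0."""
    weights = LAM ** np.arange(len(z) - 1, -1, -1)
    return float(np.sum(weights * z))

def weighted_dist(z, s):
    """Weighted distance over a finite window, with w_{-t} = LAM**(-t)."""
    weights = LAM ** np.arange(len(z) - 1, -1, -1)
    return np.max(np.abs(z - s) * weights)

rng = np.random.default_rng(3)
z = rng.uniform(-1, 1, 300)
s = z.copy()
s[:100] = rng.uniform(-1, 1, 100)   # wildly different distant past

print(weighted_dist(z, s))          # tiny: all differences sit at t <= -200
print(abs(H(z) - H(s)))             # tiny as well: the memory has faded
```

Here both quantities are bounded by a multiple of LAM**200, so inputs that agree in the recent past are indistinguishable to H up to that factor.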

A very important part of the results that follow concerns uniformly bounded families of sequences, that is, subsets of (R^n)^{Z_-} of the form

K_M := {z ∈ (R^n)^{Z_-} | ‖z_t‖ ≤ M for all t ∈ Z_-}, for some M > 0.    (2.14)

It is straightforward to show that K_M ⊂ ℓ^∞_-(R^n) ⊂ ℓ^w_-(R^n), for all M > 0 and any weighting sequence w. A very useful fact is that the relative topology induced by (ℓ^w_-(R^n), ‖·‖_w) on K_M coincides with the one induced by the product topology on (R^n)^{Z_-}. This is a consequence of the following result, which is a slight generalization of [Munk 14, Theorem 20.5]. A proof is provided in Appendix 5.3 for the sake of completeness.

Theorem 2.6 Let ‖·‖: R^n → [0, ∞) be a norm in R^n, M > 0, and let w: N → (0, 1] be a weighting sequence. Let d_M(a, b) := min{‖a − b‖, M}, a, b ∈ R^n, be a bounded metric on R^n and define the w-weighted metric D^M_w on (R^n)^{Z_-} as

D^M_w(x, y) := sup_{t ∈ Z_-} {d_M(x_t, y_t) w_{-t}}, x, y ∈ (R^n)^{Z_-}.    (2.15)

Then D^M_w is a metric that induces the product topology on (R^n)^{Z_-}. The space (R^n)^{Z_-} is complete relative to this metric.

An important consequence that can be drawn from this theorem is that all the weighted norms induce the same topology on the subspaces formed by uniformly bounded sequences. An obvious consequence of this fact is that continuity with respect to this topology can be defined without the help of weighting sequences or, equivalently, filters or functionals with uniformly bounded inputs that have the fading memory property with respect to one weighting sequence have the same feature with respect to any other weighting sequence. We make this more specific in the following statements.

Corollary 2.7 Let M > 0 and let K_M := {z ∈ (R^n)^{Z_-} | ‖z_t‖ ≤ M for all t ∈ Z_-} be the subset of (R^n)^{Z_-} formed by uniformly bounded sequences. Let w: N → (0, 1] be an arbitrary weighting sequence. Then, the metric induced by the weighted norm ‖·‖_w on K_M coincides with D^{2M}_w. Moreover, since D^{2M}_w induces the product topology on K_M = B̄_{‖·‖}(0, M)^{Z_-}, we can conclude that all the weighted norms induce the same topology on K_M. We recall that B̄_{‖·‖}(0, M) is the closure of the ball with radius M centered at the origin, with respect to the norm ‖·‖ in R^n. The same conclusion holds when instead of K_M we consider the set (D_n)^{Z_-}, with D_n a compact subset of R^n.

Theorem 2.6 can also be used to give a quick alternative proof, in discrete time, of an important compactness result originally formulated by Boyd and Chua in [Boyd 85, Lemma 1] for continuous time and, later on, in [Grig 17] for discrete time. The next corollary contains an additional completeness statement.

Corollary 2.8 Let K_M be the set of uniformly bounded sequences defined as in (2.14), and let w: N → (0, 1] be a weighting sequence. Then, (K_M, ‖·‖_w) is a compact, complete, and convex subset of the Banach space (ℓ^w_-(R^n), ‖·‖_w). The compactness and completeness statements also hold when instead of K_M we consider the set (D_n)^{Z_-}, with D_n a compact subset of R^n; if D_n is additionally convex, then the convexity of (D_n)^{Z_-} is also guaranteed.

It is important to point out that the coincidence between the product topology and the topologies induced by weighted norms that we described in Corollary 2.7 only occurs for uniformly bounded sets of the type introduced in (2.14). As we state in the next result, the norm topology on ℓ^w_-(R^n) is strictly finer than the one induced by the product topology on (R^n)^{Z_-}.

Proposition 2.9 Let w: N → (0, 1] be a weighting sequence and let (ℓ^w_-(R^n), ‖·‖_w) be the Banach space constructed using the corresponding weighted norm on the space of left infinite sequences with elements in R^n. The norm topology on ℓ^w_-(R^n) is strictly finer than the subspace topology induced by the product topology of (R^n)^{Z_-} on ℓ^w_-(R^n) ⊂ (R^n)^{Z_-}.

The results that we just proved imply an elementary property of the sets that we defined in (2.8)-(2.9) and (2.10)-(2.11), which we state in the following lemma.

Lemma 2.10 Let M > 0 and let w be a weighting sequence. Let U: K_M → ℓ^w_-(R^N) (respectively, H: K_M → R^N) be an element of F^w_{K_M} (respectively, H^w_{K_M}). Then there exists L > 0 such that U(K_M) ⊂ K_L (respectively, H(K_M) ⊂ B̄_{‖·‖}(0, L)) and we can hence conclude that U ∈ F^w_{K_M,K_L} (respectively, H ∈ H^w_{K_M,K_L}). Conversely, the inclusion F^w_{K_M,K_L} ⊂ F^w_{K_M} (respectively, H^w_{K_M,K_L} ⊂ H^w_{K_M}) holds true for any M > 0. The sets F^w_{K_M} and H^w_{K_M} are vector spaces.

The next proposition spells out how the fading memory property is independent of the weighting sequence that is used to define it, which shows its intrinsically topological nature. A conceptual consequence of this fact is that the fading memory property does not contain any information about the rate at which the systems that have it "forget" inputs. A similar statement in the continuous time setup has been formulated in [Sand 03]. Additionally, there is a bijection between FMP filters and functionals.

Proposition 2.11 Let K_M ⊂ (R^n)^{Z_-} and K_L ⊂ (R^N)^{Z_-} be subsets of uniformly bounded sequences defined as in (2.14) and let w: N → (0, 1] be a weighting sequence. Let U: K_M → K_L be a causal and time-invariant filter and let H: K_M → B̄_{‖·‖}(0, L) be a functional. Then:

(i) If U (respectively H) has the fading memory property with respect to the weighting sequence w, then it has the same property with respect to any other weighting sequence. In particular, this implies that

F^w_{K_M,K_L} = F^{w'}_{K_M,K_L} and H^w_{K_M,K_L} = H^{w'}_{K_M,K_L}, for any weighting sequence w'.

In what follows we just say that U (respectively H) has the fading memory property and denote

F^FMP_{K_M,K_L} := F^w_{K_M,K_L} and H^FMP_{K_M,K_L} := H^w_{K_M,K_L}, for any weighting sequence w.

The same statement holds true for the vector spaces F^w_{K_M} and H^w_{K_M}, which will be denoted in the sequel by F^FMP_{K_M} and H^FMP_{K_M}, respectively.

(ii) Let Φ and Ψ be the maps defined in the previous section. Then, if the filter U has the fading memory property, so does the associated functional Ψ(U) =: H_U. Analogously, if H has the fading memory property, then so does Φ(H) =: U_H. This implies that the maps Ψ and Φ restrict to maps (that we denote with the same symbols) Ψ: F^FMP_{K_M,K_L} → H^FMP_{K_M,K_L} and Φ: H^FMP_{K_M,K_L} → F^FMP_{K_M,K_L} that are inverses of each other. The same applies to Ψ: F^FMP_{K_M} → H^FMP_{K_M} and Φ: H^FMP_{K_M} → F^FMP_{K_M} that, in this case, are linear isomorphisms.

The same statements can be formulated when instead of K_M and K_L we consider the sets (D_n)^{Z_-} and (D_N)^{Z_-}, with D_n and D_N compact subsets of R^n and R^N, respectively.

In the conditions of the previous proposition, the vector spaces F^FMP_{K_M} and H^FMP_{K_M} can be endowed with a norm. More specifically, let U: K_M → ℓ^w_-(R^N) be a filter and let H: K_M → R^N be a functional that have the FMP. Define:

|||U|||_∞ := sup_{z ∈ K_M} {‖U(z)‖_∞} = sup_{z ∈ K_M} {sup_{t ∈ Z_-} {‖U(z)_t‖}},    (2.16)
|||H|||_∞ := sup_{z ∈ K_M} {‖H(z)‖}.    (2.17)

The compactness of (K_M, ‖·‖_w) guaranteed by Corollary 2.8 and the fact that, by Lemma 2.10, U and H map into uniformly bounded sequences and a compact subspace of R^N, respectively, ensure that the values in (2.16) and (2.17) are finite, which makes (F^FMP_{K_M}, |||·|||_∞) and (H^FMP_{K_M}, |||·|||_∞) into normed spaces that, as we will see in the next result, are linearly homeomorphic. For any L > 0 these norms restrict to the spaces F^FMP_{K_M,K_L} and H^FMP_{K_M,K_L}, which are in general not linear spaces but become nevertheless metric spaces.

Proposition 2.12 The linear isomorphism $\Psi: \left(F^{FMP}_{K_M}, |||\cdot|||_\infty\right) \to \left(H^{FMP}_{K_M}, |||\cdot|||_\infty\right)$ and its inverse $\Phi$ satisfy
\[
|||\Psi(U)|||_\infty \le |||U|||_\infty, \quad \text{for any } U \in F^{FMP}_{K_M}, \tag{2.18}
\]
\[
|||\Phi(H)|||_\infty \le |||H|||_\infty, \quad \text{for any } H \in H^{FMP}_{K_M}. \tag{2.19}
\]
These inequalities imply that the two maps are continuous linear bijections and hence that the spaces $\left(F^{FMP}_{K_M}, |||\cdot|||_\infty\right)$ and $\left(H^{FMP}_{K_M}, |||\cdot|||_\infty\right)$ are linearly homeomorphic. Equivalently, the following diagram commutes and all the maps in it are linear and continuous:
\[
\begin{array}{ccc}
\left(F^{FMP}_{K_M}, |||\cdot|||_\infty\right) & \overset{\Psi}{\longrightarrow} & \left(H^{FMP}_{K_M}, |||\cdot|||_\infty\right)\\[1mm]
{\scriptstyle \mathrm{Id}_{F^{FMP}_{K_M}}}\Big\uparrow & & \Big\downarrow{\scriptstyle \mathrm{Id}_{H^{FMP}_{K_M}}}\\[1mm]
\left(F^{FMP}_{K_M}, |||\cdot|||_\infty\right) & \overset{\Phi}{\longleftarrow} & \left(H^{FMP}_{K_M}, |||\cdot|||_\infty\right).
\end{array}
\]
For any $L > 0$, the inclusions $\left(F^{FMP}_{K_M,K_L}, |||\cdot|||_\infty\right) \hookrightarrow \left(F^{FMP}_{K_M}, |||\cdot|||_\infty\right)$ and $\left(H^{FMP}_{K_M,K_L}, |||\cdot|||_\infty\right) \hookrightarrow \left(H^{FMP}_{K_M}, |||\cdot|||_\infty\right)$ (see Lemma 2.10) are continuous, and so are the restricted bijections (that we denote with the same symbols) $\Psi: \left(F^{FMP}_{K_M,K_L}, |||\cdot|||_\infty\right) \to \left(H^{FMP}_{K_M,K_L}, |||\cdot|||_\infty\right)$ and $\Phi: \left(H^{FMP}_{K_M,K_L}, |||\cdot|||_\infty\right) \to \left(F^{FMP}_{K_M,K_L}, |||\cdot|||_\infty\right)$, which are inverses of each other. The last statement is a consequence of the following inequalities:
\[
|||\Psi(U_1) - \Psi(U_2)|||_\infty \le |||U_1 - U_2|||_\infty, \quad \text{for any } U_1, U_2 \in F^{FMP}_{K_M,K_L}, \tag{2.20}
\]
\[
|||\Phi(H_1) - \Phi(H_2)|||_\infty \le |||H_1 - H_2|||_\infty, \quad \text{for any } H_1, H_2 \in H^{FMP}_{K_M,K_L}. \tag{2.21}
\]
The same statements can be formulated when, instead of $K_M$ and $K_L$, we consider the sets $(D_n)^{\mathbb{Z}_-}$ and $(D_N)^{\mathbb{Z}_-}$, with $D_n$ and $D_N$ compact subsets of $\mathbb{R}^n$ and $\mathbb{R}^N$, respectively.


3 Internal approximation of reservoir filters

This section characterizes situations under which reservoir filters can be uniformly approximated by finding uniform approximants for the corresponding reservoir systems. Such a statement is part of the next theorem, which also identifies criteria for the availability of the echo state and the fading memory properties (recall that we use the acronyms ESP and FMP, respectively). As already mentioned, a reservoir system has the ESP when it has a unique semi-infinite solution for each semi-infinite input. We also recall that in the presence of uniformly bounded inputs, as shown in Section 2.3, the FMP amounts to the continuity of a reservoir filter with respect to the product topologies on the input and output spaces. The completeness and compactness of those spaces established in Corollary 2.8 allow us to use various fixed point theorems to show that solutions of reservoir systems exist under very weak hypotheses and that, for contracting and continuous reservoir maps (defined below), these solutions are unique and depend continuously on the inputs. Said differently, contracting continuous reservoir maps induce reservoir filters that automatically have the echo state and the fading memory properties.

Theorem 3.1 Let $K_M \subset (\mathbb{R}^n)^{\mathbb{Z}_-}$ and $K_L \subset (\mathbb{R}^N)^{\mathbb{Z}_-}$ be subsets of uniformly bounded sequences defined as in (2.14) and let $F: \overline{B_{\|\cdot\|}}(0,L) \times \overline{B_{\|\cdot\|}}(0,M) \to \overline{B_{\|\cdot\|}}(0,L)$ be a continuous reservoir map.

(i) Existence of solutions: for each $z \in K_M$ there exists an $x \in K_L$ (not necessarily unique) that solves the reservoir equation associated to $F$, that is,
\[
x_t = F(x_{t-1}, z_t), \quad \text{for all } t \in \mathbb{Z}_-.
\]

(ii) Uniqueness and continuity of solutions (ESP and FMP): suppose that the reservoir map $F$ is a contraction, that is, there exists $0 < r < 1$ such that for all $u, v \in \overline{B_{\|\cdot\|}}(0,L)$ and $z \in \overline{B_{\|\cdot\|}}(0,M)$, one has
\[
\|F(u,z) - F(v,z)\| \le r\|u - v\|.
\]
Then the reservoir system associated to $F$ has the echo state property. Moreover, this system has a unique associated causal and time-invariant filter $U_F: K_M \to K_L$ that has the fading memory property, that is, $U_F \in F^{FMP}_{K_M,K_L}$. The set $U_F(K_M)$ of accessible states of the filter $U_F$ is compact.

(iii) Internal approximation property: let $F_1, F_2: \overline{B_{\|\cdot\|}}(0,L) \times \overline{B_{\|\cdot\|}}(0,M) \to \overline{B_{\|\cdot\|}}(0,L)$ be two continuous reservoir maps such that $F_1$ is a contraction with constant $0 < r < 1$ and $F_2$ has the existence of solutions property. Let $U_{F_1}, U_{F_2}: K_M \to K_L$ be the corresponding filters (if $F_2$ does not have the ESP, then $U_{F_2}$ is just a generalized filter). Then, for any $\epsilon > 0$,
\[
\|F_1 - F_2\|_\infty < \delta(\epsilon) := \epsilon(1-r) \quad \text{implies that} \quad |||U_{F_1} - U_{F_2}|||_\infty < \epsilon. \tag{3.1}
\]

Part (i) also holds true when, instead of $K_M$ and $K_L$, we consider the sets $(D_n)^{\mathbb{Z}_-}$ and $(D_N)^{\mathbb{Z}_-}$, with $D_n$ and $D_N$ compact and convex subsets of $\mathbb{R}^n$ and $\mathbb{R}^N$, respectively, that replace the closed balls $\overline{B_{\|\cdot\|}}(0,M)$ and $\overline{B_{\|\cdot\|}}(0,L)$. The same applies to parts (ii) and (iii) but, this time, the convexity hypothesis is not needed.

Define the set $\mathcal{K}_{K_M,K_L} := \left\{F: \overline{B_{\|\cdot\|}}(0,L) \times \overline{B_{\|\cdot\|}}(0,M) \to \overline{B_{\|\cdot\|}}(0,L) \mid F \text{ is a continuous contraction}\right\}$. Using the notation introduced in the previous section, the statement in (3.1) and part (ii) of the theorem automatically imply that the map
\[
\begin{array}{cccc}
\Xi: & \left(\mathcal{K}_{K_M,K_L}, \|\cdot\|_\infty\right) & \longrightarrow & \left(F^{FMP}_{K_M,K_L}, |||\cdot|||_\infty\right)\\
& F & \longmapsto & U_F
\end{array}
\]
is continuous and, by Proposition 2.12, so is the map that associates to each $F \in \mathcal{K}_{K_M,K_L}$ the corresponding functional $H_F$, that is,
\[
\begin{array}{cccc}
\Psi \circ \Xi: & \left(\mathcal{K}_{K_M,K_L}, \|\cdot\|_\infty\right) & \longrightarrow & \left(H^{FMP}_{K_M,K_L}, |||\cdot|||_\infty\right)\\
& F & \longmapsto & H_F.
\end{array}
\]
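Although not part of the formal development, the internal approximation bound (3.1) is easy to verify numerically. The following Python sketch (an illustration with arbitrary choices: one-dimensional states, $F_1(x,z) = r\tanh(x+z)$, and a small perturbation of it as $F_2$) approximates the semi-infinite filter values by iterating from a remote past, which the contraction justifies, and checks that the filter distance stays below $\delta/(1-r)$.

```python
import numpy as np

# Illustration of the bound (3.1): if ||F1 - F2||_inf < delta = (1 - r) * eps,
# then the induced filters satisfy |||U_F1 - U_F2|||_inf <= eps.
rng = np.random.default_rng(0)
r = 0.5                      # contraction constant of F1
delta = 0.01                 # uniform distance between the two reservoir maps

def F1(x, z):
    return r * np.tanh(x + z)            # contraction in x with constant <= r

def F2(x, z):
    return F1(x, z) + delta * np.cos(x)  # hence ||F1 - F2||_inf <= delta

# Approximate the semi-infinite solutions by iterating from a remote past;
# the contraction washes out the (arbitrary) initial condition.
z = rng.uniform(-1.0, 1.0, size=5000)
x1 = x2 = 0.0
gap = 0.0
for t in range(len(z)):
    x1, x2 = F1(x1, z[t]), F2(x2, z[t])
    if t > 1000:                         # discard the transient
        gap = max(gap, abs(x1 - x2))

eps = delta / (1 - r)                    # bound predicted by (3.1)
print(gap, "<=", eps)
```

The observed gap is strictly below $\delta/(1-r) = 0.02$, in agreement with the geometric-series argument used in the proof of part (iii) below.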

Proof of the theorem. (i) We start by defining, for each $z \in K_M$, the map
\[
\begin{array}{cccc}
F_z: & K_L & \longrightarrow & K_L\\
& x & \longmapsto & (F_z(x))_t := F(x_{t-1}, z_t).
\end{array}
\]
We first show that $F_z$ can be written as a product of continuous functions. Indeed,
\[
F_z = \prod_{t \in \mathbb{Z}_-} F(\cdot, z_t) \circ p_{t-1}, \tag{3.2}
\]
where the projections $p_t: K_L \to \overline{B_{\|\cdot\|}}(0,L)$ are given by $p_t(x) = x_t$. These projections are continuous when we consider on $K_L$ the product topology. Additionally, the continuity of the reservoir map $F$ implies that $F_z$ is a product of continuous functions, which ensures that $F_z$ is itself continuous [Munk 14, Theorem 19.6]. Moreover, by Corollaries 2.7 and 2.8, the space $K_L$ is a compact and convex subset of the Banach space $\left(\ell^w_-(\mathbb{R}^N), \|\cdot\|_w\right)$ (see Proposition 5.2), for any weighting sequence $w$. Schauder's Fixed Point Theorem (see [Shap 16, Theorem 7.1, page 75]) then guarantees that $F_z$ has at least one fixed point, that is, a point $x \in K_L$ that satisfies $F_z(x) = x$ or, equivalently,
\[
x_t = F(x_{t-1}, z_t), \quad \text{for all } t \in \mathbb{Z}_-,
\]
which implies that $x$ is a solution of $F$ for $z$, as required.

Proof of part (ii). The main tool in the proof of this part is a parameter-dependent version of the Contraction Fixed Point Theorem, which we include here for the sake of completeness and whose proof can be found in [Ster 10, Theorem 6.4.1, page 137].

Lemma Let $(X, d_X)$ be a complete metric space and let $Z$ be a metric space. Let $K: X \times Z \to X$ be a continuous map such that for each $z \in Z$, the map $K_z: X \to X$ given by $K_z(x) := K(x,z)$ is a contraction with constant $0 < r < 1$ (independent of $z$), that is, $d_X(K(x,z), K(y,z)) \le r\,d_X(x,y)$, for all $x, y \in X$ and all $z \in Z$. Then:

(i) For each $z \in Z$, the map $K_z$ has a unique fixed point in $X$.

(ii) The map $U_K: Z \to X$ that associates to each point $z \in Z$ the unique fixed point of $K_z$ is continuous.

Consider now the map
\[
\begin{array}{cccc}
\mathcal{F}: & K_L \times K_M & \longrightarrow & K_L\\
& (x, z) & \longmapsto & (\mathcal{F}(x,z))_t := F(x_{t-1}, z_t).
\end{array}
\]
First, as we did in (3.2), it is easy to show that $\mathcal{F}$ is continuous with respect to the product topologies on $K_M$ and $K_L$ by writing it as a product of compositions of continuous functions. Second, we show that the map $\mathcal{F}$ is a contraction. Indeed, since by Corollary 2.7 we can choose an arbitrary weighting sequence to generate the product topologies on $K_M$ and $K_L$, we select $w: \mathbb{N} \to (0,1]$ given by $w_t := \lambda^t$, with $t \in \mathbb{N}$ and $\lambda > 0$ satisfying $0 < r < \lambda < 1$. Then, for any $x, y \in K_L$ and any $z \in K_M$, we have
\[
\|\mathcal{F}(x,z) - \mathcal{F}(y,z)\|_w = \sup_{t \in \mathbb{Z}_-} \|F(x_{t-1}, z_t) - F(y_{t-1}, z_t)\|\lambda^{-t} \le \sup_{t \in \mathbb{Z}_-} \|x_{t-1} - y_{t-1}\| r \lambda^{-t},
\]
where we used that $F$ is a contraction. Now, since $0 < r < \lambda < 1$ and hence $r/\lambda < 1$, we have
\[
\sup_{t \in \mathbb{Z}_-} \|x_{t-1} - y_{t-1}\| r \lambda^{-t} = \sup_{t \in \mathbb{Z}_-} \left\{ \|x_{t-1} - y_{t-1}\| \lambda^{-(t-1)} \frac{r}{\lambda} \right\} \le \frac{r}{\lambda} \|x - y\|_w.
\]
This shows that $\mathcal{F}$ is a family of contractions with constant $r/\lambda < 1$ that is continuously parametrized by the elements in $K_M$. The lemma above implies the existence of a continuous map $U_F: (K_M, \|\cdot\|_w) \to (K_L, \|\cdot\|_w)$ that is uniquely determined by the identity
\[
\mathcal{F}(U_F(z), z) = U_F(z), \quad \text{for all } z \in K_M.
\]
Proposition 2.1 implies that $U_F$ is causal and time-invariant. The set $U_F(K_M)$ of accessible states of the filter $U_F$ is compact because it is the image of a compact set (see Corollary 2.8) under a continuous map (see [Munk 14, Theorem 26.5, page 166]).

Proof of part (iii). Let $z \in K_M$ and let $U_{F_1}(z)$ be the unique solution for $z$ of the reservoir system associated to $F_1$, available by part (ii) of the theorem that we just proved. Additionally, let $U_{F_2}(z)$ be the value of a generalized filter associated to $F_2$, which exists by hypothesis. Then, for any $t \in \mathbb{Z}_-$, we have:
\begin{align*}
\|U_{F_1}(z)_t - U_{F_2}(z)_t\| &= \|F_1(U_{F_1}(z)_{t-1}, z_t) - F_2(U_{F_2}(z)_{t-1}, z_t)\|\\
&= \|F_1(U_{F_1}(z)_{t-1}, z_t) - F_1(U_{F_2}(z)_{t-1}, z_t) + F_1(U_{F_2}(z)_{t-1}, z_t) - F_2(U_{F_2}(z)_{t-1}, z_t)\|\\
&\le \|F_1(U_{F_1}(z)_{t-1}, z_t) - F_1(U_{F_2}(z)_{t-1}, z_t)\| + \|F_1(U_{F_2}(z)_{t-1}, z_t) - F_2(U_{F_2}(z)_{t-1}, z_t)\|\\
&\le r\|U_{F_1}(z)_{t-1} - U_{F_2}(z)_{t-1}\| + \|F_1(U_{F_2}(z)_{t-1}, z_t) - F_2(U_{F_2}(z)_{t-1}, z_t)\|.
\end{align*}
If we now recursively apply the same procedure $n$ times to the first summand of this expression, we obtain
\begin{multline}
\|U_{F_1}(z)_t - U_{F_2}(z)_t\| \le r^n \|U_{F_1}(z)_{t-n} - U_{F_2}(z)_{t-n}\| + \|F_1(U_{F_2}(z)_{t-1}, z_t) - F_2(U_{F_2}(z)_{t-1}, z_t)\|\\
+ r\|F_1(U_{F_2}(z)_{t-2}, z_{t-1}) - F_2(U_{F_2}(z)_{t-2}, z_{t-1})\|\\
+ \cdots + r^{n-1}\left\|F_1(U_{F_2}(z)_{t-n}, z_{t-(n-1)}) - F_2(U_{F_2}(z)_{t-n}, z_{t-(n-1)})\right\|. \tag{3.3}
\end{multline}
If we combine inequality (3.3) with the hypothesis
\[
\|F_1 - F_2\|_\infty = \sup_{x \in \overline{B_{\|\cdot\|}}(0,L),\, z \in \overline{B_{\|\cdot\|}}(0,M)} \{\|F_1(x,z) - F_2(x,z)\|\} < \delta(\epsilon) := \epsilon(1-r),
\]
we obtain
\[
\|U_{F_1}(z) - U_{F_2}(z)\|_\infty = \sup_{t \in \mathbb{Z}_-} \{\|U_{F_1}(z)_t - U_{F_2}(z)_t\|\} \le 2Lr^n + (1 + \cdots + r^{n-1})\delta(\epsilon) = 2Lr^n + \frac{1 - r^n}{1 - r}\delta(\epsilon). \tag{3.4}
\]
Since this inequality is valid for any $n \in \mathbb{N}$, we can take the limit $n \to \infty$ and obtain
\[
\|U_{F_1}(z) - U_{F_2}(z)\|_\infty \le \frac{\delta(\epsilon)}{1 - r} = \epsilon.
\]
Additionally, as this relation is valid for any $z \in K_M$, we can conclude that
\[
|||U_{F_1} - U_{F_2}|||_\infty = \sup_{z \in K_M} \{\|U_{F_1}(z) - U_{F_2}(z)\|_\infty\} \le \epsilon,
\]
as required. $\blacksquare$

As a straightforward corollary of the first part of the previous theorem, it is easy to show that echo state networks always have (generalized) reservoir filters associated to them, as well as to formulate conditions that ensure simultaneously the echo state and the fading memory properties.

We recall that a map $\sigma: \mathbb{R} \to [-1,1]$ is a squashing function if it is non-decreasing, $\lim_{x \to -\infty} \sigma(x) = -1$, and $\lim_{x \to \infty} \sigma(x) = 1$.

Corollary 3.2 Consider the echo state network given by
\begin{align}
x_t &= \sigma(Ax_{t-1} + Cz_t + \zeta), \tag{3.5}\\
y_t &= Wx_t, \tag{3.6}
\end{align}
where $C \in \mathbb{M}_{N,n}$ for some $N \in \mathbb{N}$, $\zeta \in \mathbb{R}^N$, $A \in \mathbb{M}_{N,N}$, $W \in \mathbb{M}_{d,N}$, and the input signal $z \in (D_n)^{\mathbb{Z}}$, with $D_n \subset \mathbb{R}^n$ a compact and convex subset. The function $\sigma: \mathbb{R}^N \to [-1,1]^N$ in (3.5) is constructed by componentwise application of a squashing function that we also denote by $\sigma$. Then:

(i) If the squashing function $\sigma$ is continuous, then the reservoir equation (3.5) has the existence of solutions property and we can hence associate to the system (3.5)-(3.6) a generalized reservoir filter.

(ii) If the squashing function $\sigma$ is differentiable with Lipschitz constant $L_\sigma := \sup_{x \in \mathbb{R}} \{|\sigma'(x)|\} < \infty$ and the matrix $A$ is such that $\|A\|_2 L_\sigma = \sigma_{\max}(A) L_\sigma < 1$, then the reservoir system (3.5)-(3.6) has the echo state and the fading memory properties and we can hence associate to it a unique time-invariant reservoir filter.

The statement in part (i) remains valid when $[-1,1]^N$ is replaced by a compact and convex subset $D_N \subset [-1,1]^N$ that is left invariant by the reservoir equation (3.5), that is, $\sigma(Ax + Cz + \zeta) \in D_N$ for any $x \in D_N$ and any $z \in D_n$. The same applies to part (ii), but then only the compactness hypothesis is necessary.

Remark 3.3 The hypothesis $\|A\|_2 L_\sigma < 1$ appears in the literature as a sufficient condition that ensures the echo state property, which has been extensively studied in the ESN literature [Jaeg 10, Jaeg 04, Bueh 06, Bai 12, Yild 12, Wain 16, Manj 13]. Our result shows that this condition automatically implies the fading memory property as well. Nevertheless, that condition is far from sharp and has been significantly improved in [Bueh 06, Yild 12]. We point out that the enhanced sufficient conditions for the echo state property contained in those references also imply the fading memory property via part (ii) of Theorem 3.1.
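The sufficient condition $\|A\|_2 L_\sigma < 1$ of Corollary 3.2(ii) is straightforward to enforce in practice by rescaling a randomly drawn connectivity matrix. The following Python sketch (illustrative; the dimensions and distributions are arbitrary choices, not prescriptions of the paper) builds such an ESN with $\sigma = \tanh$ (so $L_\sigma = 1$) and observes the echo state property numerically: two different initial states are driven together geometrically fast by a common input.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 50, 2                       # reservoir and input dimensions (arbitrary)

A = rng.standard_normal((N, N))
A *= 0.9 / np.linalg.norm(A, 2)    # enforce ||A||_2 * L_sigma = 0.9 < 1 (tanh has L_sigma = 1)
C = rng.standard_normal((N, n))
zeta = rng.standard_normal(N)

def step(x, z):
    # reservoir equation (3.5) with sigma = tanh applied componentwise
    return np.tanh(A @ x + C @ z + zeta)

# Echo state property in action: a common input washes out two
# different initial conditions with rate at most 0.9 per step.
x, y = np.ones(N), -np.ones(N)
for _ in range(300):
    z = rng.uniform(-1.0, 1.0, size=n)
    x, y = step(x, z), step(y, z)

print(np.linalg.norm(x - y))       # at most 0.9**300 * ||x0 - y0||
```

By Corollary 3.2(ii), the same rescaling also guarantees the fading memory property of the induced filter.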

4 Echo state networks as universal uniform approximants

The internal approximation property that we introduced in part (iii) of Theorem 3.1 tells us that we can approximate any reservoir filter by finding an approximant for the reservoir system that generates it. This reduces the problem of proving a density statement in a space of operators between infinite-dimensional spaces to proving one in a space of functions with finite-dimensional domains and targets, a topic that is the subject of many results in approximation theory, some of which we mentioned in the introduction. This strategy allows one to find simple approximating reservoir filters for any reservoir system that has the fading memory property. In the next result we use as approximating family the echo state networks that we presented in the introduction and that, as we will see later on, are the natural generalizations of neural networks in a dynamic learning setup, with the important added feature that they are constructed using linear readouts. The combination of this approach with a previously obtained result [Grig 17] on the density of reservoir filters in the fading memory category allows us to prove in the next theorem that echo state networks can approximate any fading memory filter. In other words, echo state networks are universal.

All along this section, we use the Euclidean norm on the finite-dimensional spaces, that is, for each $x \in \mathbb{R}^n$ we write $\|x\| := \left( \sum_{i=1}^n x_i^2 \right)^{1/2}$. For any $M > 0$, the symbol $B_{\|\cdot\|}(0,M)$ (respectively, $\overline{B_{\|\cdot\|}}(0,M)$) denotes here the open (respectively, closed) ball with respect to that norm. Additionally, we set $I_n := \overline{B_{\|\cdot\|}}(0,1)$.

Theorem 4.1 Let $U: I_n^{\mathbb{Z}_-} \to (\mathbb{R}^d)^{\mathbb{Z}_-}$ be a causal and time-invariant filter that has the fading memory property. Then, for any $\epsilon > 0$ and any weighting sequence $w$, there is an echo state network
\begin{align}
x_t &= \sigma(Ax_{t-1} + Cz_t + \zeta), \tag{4.1}\\
y_t &= Wx_t, \tag{4.2}
\end{align}
whose associated generalized filters $U_{ESN}: I_n^{\mathbb{Z}_-} \to (\mathbb{R}^d)^{\mathbb{Z}_-}$ satisfy
\[
|||U - U_{ESN}|||_\infty < \epsilon. \tag{4.3}
\]
In these expressions, $C \in \mathbb{M}_{N,n}$ for some $N \in \mathbb{N}$, $\zeta \in \mathbb{R}^N$, $A \in \mathbb{M}_{N,N}$, and $W \in \mathbb{M}_{d,N}$. The function $\sigma: \mathbb{R}^N \to [-1,1]^N$ in (4.1) is constructed by componentwise application of a continuous squashing function $\sigma: \mathbb{R} \to [-1,1]$ that we denote with the same symbol.

When the approximating echo state network (4.1)-(4.2) satisfies the echo state property, it has a unique associated filter $U_{ESN}$, which is necessarily time-invariant. The corresponding reservoir functional $H_{ESN}: I_n^{\mathbb{Z}_-} \to \mathbb{R}^d$ satisfies
\[
|||H_U - H_{ESN}|||_\infty < \epsilon. \tag{4.4}
\]

Remark 4.2 Echo state networks are generally used in practice in the following way: the architecture parameters $A$, $C$, and $\zeta$ are drawn at random from a given distribution and only the readout matrix $W$ is trained using a teaching signal by solving a linear regression problem. It is important to emphasize that the universality theorem that we just stated does not completely explain the empirically observed robustness of ESNs with respect to the choice of those parameters. In the context of standard feedforward neural networks this feature has been addressed using, for example, the so-called extreme learning machines [Huan 06]. In dynamical setups and for ESNs this question remains an open problem that will be addressed in future works.
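As a concrete illustration of the workflow described in this remark, the following Python sketch draws $A$, $C$, and $\zeta$ at random, collects the reservoir states over a teacher signal, and fits only $W$ by a ridge-regularized linear regression. All sizes, scalings, and the target task (a lag-one memory of the input) are illustrative choices of ours, not prescriptions of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 100, 2000                       # reservoir size and sample length (arbitrary)

# Randomly drawn architecture parameters; only W is trained.
A = rng.standard_normal((N, N))
A *= 0.9 / np.linalg.norm(A, 2)        # contraction condition of Corollary 3.2 (ii)
c = rng.standard_normal(N)             # input weights (scalar input)
zeta = 0.1 * rng.standard_normal(N)

z = rng.uniform(-1.0, 1.0, size=T)     # teacher input
target = np.roll(z, 1)                 # teacher output: lag-one memory (illustrative)
target[0] = 0.0

X = np.zeros((T, N))                   # collected reservoir states
x = np.zeros(N)
for t in range(T):
    x = np.tanh(A @ x + c * z[t] + zeta)
    X[t] = x

# Linear readout: ridge regression for W in y_t = W x_t.
lam = 1e-8
W = np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ target)

mse = np.mean((X[200:] @ W - target[200:]) ** 2)   # skip the initial transient
print(mse)
```

Note that the linear readout makes training a convex problem, which is the practical appeal of the ESN architecture that Theorem 4.1 backs with a universality guarantee.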

Proof of the theorem. As we already explained, we proceed by first approximating the filter $U$ by one of the non-homogeneous state-affine system (SAS) reservoir filters introduced in [Grig 17], and we later show that we can approximate that reservoir filter by an echo state network of the form (4.1)-(4.2).

We start by recalling that a non-homogeneous state-affine system is a reservoir system determined by the state-space transformation:
\begin{align}
x_t &= p(z_t)x_{t-1} + q(z_t), \tag{4.5}\\
y_t &= W_1 x_t, \tag{4.6}
\end{align}
where the inputs $z_t \in I_n := \overline{B_{\|\cdot\|}}(0,1)$, the states $x_t \in \mathbb{R}^{N_1}$, for some $N_1 \in \mathbb{N}$, and $W_1 \in \mathbb{M}_{d,N_1}$. The symbols $p(z_t)$ and $q(z_t)$ stand for polynomials with matrix coefficients of degrees $r$ and $s$, respectively, of the form
\[
p(z) = \sum_{\substack{i_1,\ldots,i_n \in \{0,\ldots,r\}\\ i_1+\cdots+i_n \le r}} z_1^{i_1} \cdots z_n^{i_n} A_{i_1,\ldots,i_n}, \quad A_{i_1,\ldots,i_n} \in \mathbb{M}_{N_1}, \quad z \in I_n,
\]
\[
q(z) = \sum_{\substack{i_1,\ldots,i_n \in \{0,\ldots,s\}\\ i_1+\cdots+i_n \le s}} z_1^{i_1} \cdots z_n^{i_n} B_{i_1,\ldots,i_n}, \quad B_{i_1,\ldots,i_n} \in \mathbb{M}_{N_1,1}, \quad z \in I_n.
\]
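For intuition, the SAS state equation (4.5) can be sketched in a few lines of code. The example below is illustrative only: a univariate input ($n = 1$), arbitrary dimensions, and random coefficients scaled down so that the state stays bounded, a crude stand-in for the bound $K$ imposed below.

```python
import numpy as np

rng = np.random.default_rng(3)
N1, deg_r, deg_s = 4, 2, 2            # state dimension and polynomial degrees (arbitrary)

# Matrix coefficients A_i of p and column coefficients B_i of q, scaled so
# that sigma_max(p(z)) and ||q(z)|| stay well below 1 for |z| <= 1.
A_coeffs = [0.05 * rng.standard_normal((N1, N1)) for _ in range(deg_r + 1)]
B_coeffs = [0.05 * rng.standard_normal((N1, 1)) for _ in range(deg_s + 1)]

def p(z):   # p(z) = sum_i z^i A_i  (univariate input, n = 1)
    return sum(z**i * Ai for i, Ai in enumerate(A_coeffs))

def q(z):   # q(z) = sum_i z^i B_i
    return sum(z**i * Bi for i, Bi in enumerate(B_coeffs))

# Iterate the SAS state equation (4.5): x_t = p(z_t) x_{t-1} + q(z_t).
x = np.zeros((N1, 1))
for z in rng.uniform(-1.0, 1.0, size=100):
    x = p(z) @ x + q(z)

print(x.flatten())
```

The state depends on the input in an affine way at each step, but the composition over time produces the polynomial-in-the-inputs filters whose density is exploited next.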

Let $L > 0$ and choose a real number $K$ such that
\[
0 < K < \frac{L}{L+1} < 1. \tag{4.7}
\]
Consider now SAS filters that satisfy $\max_{z \in I_n} \sigma_{\max}(p(z)) < K$ and $\max_{z \in I_n} \sigma_{\max}(q(z)) < K$. It can be shown [Grig 17, Proposition 3.7] that under these hypotheses the reservoir system (4.5)-(4.6) has the echo state property and defines a unique causal, time-invariant, and fading memory filter $U^{p,q}_{W_1}: I_n^{\mathbb{Z}_-} \to (\mathbb{R}^d)^{\mathbb{Z}_-}$. Moreover, Theorem 3.12 in [Grig 17] shows that for any $\epsilon_1 > 0$ there exists a SAS filter $U^{p,q}_{W_1}$ satisfying the hypotheses that we just discussed for which
\[
\left|\left|\left|H_U - H^{p,q}_{W_1}\right|\right|\right|_\infty < \epsilon_1, \tag{4.8}
\]
where $H_U$ and $H^{p,q}_{W_1}$ are the reservoir functionals associated to $U$ and $U^{p,q}_{W_1}$, respectively. Proposition 2.12 together with this inequality implies that
\[
\left|\left|\left|U - U^{p,q}_{W_1}\right|\right|\right|_\infty < \epsilon_1. \tag{4.9}
\]
We now show that the SAS filter $U^{p,q}_{W_1}$ can be approximated by the filters generated by an echo state network. Define the map
\[
\begin{array}{cccc}
F_{SAS}: & \overline{B_{\|\cdot\|}}(0,L) \times I_n & \longrightarrow & \mathbb{R}^{N_1}\\
& (x, z) & \longmapsto & p(z)x + q(z),
\end{array} \tag{4.10}
\]
with $\overline{B_{\|\cdot\|}}(0,L) \subset \mathbb{R}^{N_1}$ and $p$ and $q$ the polynomials associated to the approximating SAS filter $U^{p,q}_{W_1}$ in (4.9).

The prescription (4.7) on the choice of the constant $K$ has two main consequences. First, the map $F_{SAS}$ is a contraction. Indeed, for any $(x,z), (y,z) \in \overline{B_{\|\cdot\|}}(0,L) \times I_n$:
\[
\|F_{SAS}(x,z) - F_{SAS}(y,z)\| \le \|p(z)x - p(z)y\| \le \|p(z)\|_2 \|x - y\| \le K\|x - y\|. \tag{4.11}
\]
The map $F_{SAS}$ is hence a contraction since $K < 1$ by hypothesis. Second, $\|F_{SAS}\|_\infty < L$ because, by (4.7),
\[
\|F_{SAS}\|_\infty = \sup_{(x,z) \in \overline{B_{\|\cdot\|}}(0,L) \times I_n} \{\|p(z)x + q(z)\|\} \le \sup_{(x,z) \in \overline{B_{\|\cdot\|}}(0,L) \times I_n} \{\|p(z)\|_2 \|x\| + \|q(z)\|\} \le KL + K < L.
\]
This implies, in particular, that $F_{SAS}$ maps into $\overline{B_{\|\cdot\|}}(0,L)$ and hence (4.10) can be rewritten as
\[
F_{SAS}: \overline{B_{\|\cdot\|}}(0,L) \times I_n \longrightarrow \overline{B_{\|\cdot\|}}(0,L).
\]
Additionally, we set
\[
L_1 := \|F_{SAS}\|_\infty < L. \tag{4.12}
\]


The uniform density on compacta of the family of feedforward neural networks with one hidden layer proved in [Cybe 89, Horn 89] guarantees that for any $\epsilon_2 > 0$ there exist $N \in \mathbb{N}$, $G \in \mathbb{M}_{N,N_1}$, $C \in \mathbb{M}_{N,n}$, $E \in \mathbb{M}_{N_1,N}$, and $\zeta \in \mathbb{R}^N$ such that the map defined by
\[
\begin{array}{cccc}
F_{NN}: & \overline{B_{\|\cdot\|}}(0,L) \times I_n & \longrightarrow & \mathbb{R}^{N_1}\\
& (x, z) & \longmapsto & E\sigma(Gx + Cz + \zeta),
\end{array} \tag{4.13}
\]
satisfies
\[
\|F_{NN} - F_{SAS}\|_\infty = \sup_{x \in \overline{B_{\|\cdot\|}}(0,L),\, z \in I_n} \{\|F_{NN}(x,z) - F_{SAS}(x,z)\|\} < \epsilon_2. \tag{4.14}
\]
The combination of (4.14) with the reverse triangle inequality implies that $\|F_{NN}\|_\infty - \|F_{SAS}\|_\infty < \epsilon_2$ or, equivalently,
\[
\|F_{NN}\|_\infty < \|F_{SAS}\|_\infty + \epsilon_2. \tag{4.15}
\]
Given that $\|F_{SAS}\|_\infty = L_1 < L$, if we choose $\epsilon_2 > 0$ small enough so that $L_1 + \epsilon_2 < L$ or, equivalently,
\[
\epsilon_2 < L - L_1, \tag{4.16}
\]
then (4.15) guarantees that $\|F_{NN}\|_\infty < L$, which shows that $F_{NN}$ maps into $\overline{B_{\|\cdot\|}}(0,L)$, that is, we can write
\[
F_{NN}: \overline{B_{\|\cdot\|}}(0,L) \times I_n \longrightarrow \overline{B_{\|\cdot\|}}(0,L). \tag{4.17}
\]
The continuity of the map $F_{NN}$ and the first part of Theorem 3.1 imply that the corresponding reservoir equation has the existence of solutions property and that we can hence associate to it a (generalized) filter $U_{F_{NN}}$. At the same time, as we proved in (4.11), the map $F_{SAS}$ is a contraction with constant $K < 1$. These facts, together with (4.14) and the internal approximation property in Theorem 3.1, allow us to conclude that the (unique) reservoir filter $U_{F_{SAS}}$ associated to the reservoir map $F_{SAS}$ is such that
\[
|||U_{F_{NN}} - U_{F_{SAS}}|||_\infty < \epsilon_2/(1 - K). \tag{4.18}
\]

Consider now the readout map $h_{W_1}: \mathbb{R}^{N_1} \to \mathbb{R}^d$ given by $h_{W_1}(x) := W_1 x$ and let $U^{h_{W_1}}_{F_{NN}}: (I_n)^{\mathbb{Z}_-} \to (\mathbb{R}^d)^{\mathbb{Z}_-}$ be the filter given by $U^{h_{W_1}}_{F_{NN}}(z)_t := W_1 U_{F_{NN}}(z)_t$, $t \in \mathbb{Z}_-$. Analogously, define $U^{h_{W_1}}_{F_{SAS}}: (I_n)^{\mathbb{Z}_-} \to (\mathbb{R}^d)^{\mathbb{Z}_-}$ and notice that $U^{h_{W_1}}_{F_{SAS}} = U^{p,q}_{W_1}$. Using these observations and (4.18), we have proved that for any $\epsilon_2 > 0$ we can find a filter of the type $U^{h_{W_1}}_{F_{NN}}$ that satisfies
\[
\left|\left|\left|U^{p,q}_{W_1} - U^{h_{W_1}}_{F_{NN}}\right|\right|\right|_\infty \le \|W_1\|_2 |||U_{F_{SAS}} - U_{F_{NN}}|||_\infty < \|W_1\|_2 \epsilon_2/(1 - K). \tag{4.19}
\]
Consequently, for any $\epsilon > 0$, if we first set $\epsilon_1 = \epsilon/2$ in (4.8) and then choose
\[
\epsilon_2 := \min\left\{ \frac{\epsilon(1-K)}{2\|W_1\|_2}, \frac{L - L_1}{2} \right\}, \tag{4.20}
\]
in view of (4.16) and (4.19) we can guarantee, using (4.9) and (4.19), that
\[
\left|\left|\left|U - U^{h_{W_1}}_{F_{NN}}\right|\right|\right|_\infty \le \left|\left|\left|U - U^{p,q}_{W_1}\right|\right|\right|_\infty + \left|\left|\left|U^{p,q}_{W_1} - U^{h_{W_1}}_{F_{NN}}\right|\right|\right|_\infty \le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon. \tag{4.21}
\]

In order to conclude the proof, it suffices to show that the filter $U^{h_{W_1}}_{F_{NN}}$ can be realized as the reservoir filter associated to an echo state network of the type presented in the statement. We carry this out by using the elements that appeared in the construction of the reservoir map $F_{NN}$ in (4.13) to define a new reservoir map $F_{ESN}$ with the architecture of an echo state network. Let $A := GE \in \mathbb{M}_N$ and define
\[
\begin{array}{cccc}
F_{ESN}: & D_N \times I_n & \longrightarrow & \mathbb{R}^N\\
& (x, z) & \longmapsto & \sigma(Ax + Cz + \zeta).
\end{array} \tag{4.22}
\]
The set $D_N$ in the domain of $F_{ESN}$ is given by
\[
D_N := [-1,1]^N \cap E^{-1}\left(\overline{B_{\|\cdot\|}}(0,L)\right), \tag{4.23}
\]
where $E^{-1}(\overline{B_{\|\cdot\|}}(0,L))$ denotes the preimage of the set $\overline{B_{\|\cdot\|}}(0,L) \subset \mathbb{R}^{N_1}$ under the linear map $E: \mathbb{R}^N \to \mathbb{R}^{N_1}$ associated to the matrix $E \in \mathbb{M}_{N_1,N}$. This set is compact: $E^{-1}(\overline{B_{\|\cdot\|}}(0,L))$ is closed and $[-1,1]^N$ is compact, so $D_N$ is a closed subspace of a compact space, which is always compact [Munk 14, Theorem 26.2]. Additionally, $D_N$ is convex because $[-1,1]^N$ is convex and $E^{-1}(\overline{B_{\|\cdot\|}}(0,L))$ is convex as well, being the preimage of a convex set under a linear map.

We now note that the image of $F_{ESN}$ is contained in $D_N$. First, as the squashing function maps into the interval $[-1,1]$, it is clear that
\[
F_{ESN}(D_N, I_n) \subset [-1,1]^N. \tag{4.24}
\]
Second, for any $x \in D_N$ we have by construction that $x \in E^{-1}(\overline{B_{\|\cdot\|}}(0,L))$ and hence $Ex \in \overline{B_{\|\cdot\|}}(0,L)$. Since by (4.17) $F_{NN}$ maps into $\overline{B_{\|\cdot\|}}(0,L)$, we can ensure that for any $z \in I_n$, the image $F_{NN}(Ex, z) = E\sigma(GEx + Cz + \zeta) = E\sigma(Ax + Cz + \zeta) \in \overline{B_{\|\cdot\|}}(0,L)$ or, equivalently,
\[
F_{ESN}(x,z) = \sigma(Ax + Cz + \zeta) \in E^{-1}\left(\overline{B_{\|\cdot\|}}(0,L)\right). \tag{4.25}
\]
The relations (4.24) and (4.25) imply that
\[
F_{ESN}(D_N, I_n) \subset D_N, \tag{4.26}
\]
and hence we can rewrite (4.22) as
\[
F_{ESN}: D_N \times I_n \longrightarrow D_N.
\]
The continuity of the map $F_{ESN}$ and the compactness and convexity of the set $D_N \subset \mathbb{R}^N$ that we established above allow us to use the first part of Theorem 3.1 to conclude that the corresponding reservoir equation has the existence of solutions property and that we can hence associate to it a (generalized) filter $U_{F_{ESN}}$. Let $W := W_1 E \in \mathbb{M}_{d,N}$ and define the readout map $h_{ESN}: D_N \to \mathbb{R}^d$ by $h_{ESN}(x) := Wx = W_1 Ex$. Denote by $U_{ESN}$ any generalized reservoir filter associated to the echo state network system $(F_{ESN}, h_{ESN})$ that, by construction, satisfies $U_{ESN}(z)_t := h_{ESN}(U_{F_{ESN}}(z)_t) = W U_{F_{ESN}}(z)_t$, for any $z \in I_n^{\mathbb{Z}_-}$ and $t \in \mathbb{Z}_-$.

We next show that the map $f: D_N = [-1,1]^N \cap E^{-1}(\overline{B_{\|\cdot\|}}(0,L)) \to \overline{B_{\|\cdot\|}}(0,L)$ given by $f(x) := Ex$ is a morphism between the echo state network system $(F_{ESN}, h_{ESN})$ and the reservoir system $(F_{NN}, h_{W_1})$. Indeed, the reservoir equivariance property holds because, for any $(x,z) \in D_N \times I_n$, the definitions (4.13) and (4.22) ensure that
\[
f(F_{ESN}(x,z)) = E\sigma(Ax + Cz + \zeta) = E\sigma(GEx + Cz + \zeta) = F_{NN}(Ex, z) = F_{NN}(f(x), z).
\]
The readout invariance is obvious. This fact and the second part of Proposition 2.2 imply that all the generalized filters $U_{ESN}$ associated to the echo state network are actually filters generated by the system $(F_{NN}, h_{W_1})$. This means that for each generalized filter $U_{ESN}$ there exists a generalized filter of the type $U^{h_{W_1}}_{F_{NN}}$ such that $U_{ESN} = U^{h_{W_1}}_{F_{NN}}$. The inequality (4.21) then proves (4.3) in the statement of the theorem.

The last claim in the theorem is a straightforward consequence of Propositions 2.1 and 2.12. $\blacksquare$


5 Appendices

5.1 Proof of Proposition 2.1

Let $\tau \in \mathbb{N}$ and let $T^n_\tau: (D_n)^{\mathbb{Z}} \to (D_n)^{\mathbb{Z}}$ and $T^N_\tau: (D_N)^{\mathbb{Z}} \to (D_N)^{\mathbb{Z}}$ be the corresponding time delay operators. For any $z \in (D_n)^{\mathbb{Z}}$, let $x \in (D_N)^{\mathbb{Z}}$ be the unique solution of the reservoir system determined by $F$, that is,
\[
x := U_F(z). \tag{5.1}
\]
Then, for any $t \in \mathbb{Z}$,
\[
\left(T^N_\tau \circ U_F(z)\right)_t = x_{t-\tau}. \tag{5.2}
\]
Analogously, let $\widetilde{x} \in (D_N)^{\mathbb{Z}}$ be the unique solution of $F$ associated to the input $T^n_\tau(z)$, that is,
\[
\widetilde{x}_t = \left(U_F \circ T^n_\tau(z)\right)_t, \quad \text{for any } t \in \mathbb{Z}. \tag{5.3}
\]
By construction, the sequence $\widetilde{x}$ satisfies
\[
\widetilde{x}_t = F(\widetilde{x}_{t-1}, T^n_\tau(z)_t) = F(\widetilde{x}_{t-1}, z_{t-\tau}), \quad \text{for any } t \in \mathbb{Z}.
\]
If we set $s := t - \tau$, this expression can be rewritten as
\[
\widetilde{x}_{s+\tau} = F(\widetilde{x}_{s+\tau-1}, z_s), \quad \text{for any } s \in \mathbb{Z}, \tag{5.4}
\]
and if we define $\widehat{x}_s := \widetilde{x}_{s+\tau}$, the equality (5.4) becomes
\[
\widehat{x}_s = F(\widehat{x}_{s-1}, z_s), \quad \text{for any } s \in \mathbb{Z},
\]
which shows that $\widehat{x} \in (D_N)^{\mathbb{Z}}$ is a solution of $F$ determined by the input $z \in (D_n)^{\mathbb{Z}}$. Since the sequence $x \in (D_N)^{\mathbb{Z}}$ in (5.1) is also a solution of $F$ for the same input, the echo state property hypothesis on the systems determined by $F$ implies that necessarily $x = \widehat{x}$. This implies that $x_{t-\tau} = \widehat{x}_{t-\tau}$ for all $t \in \mathbb{Z}$, which is equivalent to $\widetilde{x}_t = x_{t-\tau}$. This equality guarantees that (5.2) and (5.3) are equal and, since $z \in (D_n)^{\mathbb{Z}}$ is arbitrary, we have that
\[
T^N_\tau \circ U_F = U_F \circ T^n_\tau,
\]
as required.

5.2 Proof of Proposition 2.4

Suppose first that $U$ is continuous. This implies the existence of a positive function $\delta_U(\epsilon)$ such that if $u, v \in (D_n)^{\mathbb{Z}_-}$ satisfy $\|u - v\|_\infty < \delta_U(\epsilon)$, then $\|U(u) - U(v)\|_\infty < \epsilon$. Under that hypothesis, it is clear that
\[
\|H_U(u) - H_U(v)\| = \|U(u)_0 - U(v)_0\| \le \sup_{t \in \mathbb{Z}_-} \{\|U(u)_t - U(v)_t\|\} = \|U(u) - U(v)\|_\infty < \epsilon,
\]
which shows the continuity of $H_U: \left((D_n)^{\mathbb{Z}_-}, \|\cdot\|_\infty\right) \to (D_N, \|\cdot\|)$.

Conversely, suppose that $H: \left((D_n)^{\mathbb{Z}_-}, \|\cdot\|_\infty\right) \to (D_N, \|\cdot\|)$ is continuous and let $\delta_H(\epsilon) > 0$ be such that if $\|u - v\|_\infty < \delta_H(\epsilon)$ then $\|H(u) - H(v)\| < \epsilon$. Then, for any $t \in \mathbb{Z}_-$,
\[
\|U_H(u)_t - U_H(v)_t\| = \left\|H\left((P_{\mathbb{Z}_-} \circ T_{-t})(u)\right) - H\left((P_{\mathbb{Z}_-} \circ T_{-t})(v)\right)\right\| < \epsilon, \tag{5.5}
\]
which proves the continuity of $U_H$. The inequality follows from the fact that for any $u \in (D_n)^{\mathbb{Z}_-}$, the components of the sequence $(P_{\mathbb{Z}_-} \circ T_{-t})(u)$ are included in those of $u$ and hence $\sup_{s \in \mathbb{Z}_-} \left\|\left((P_{\mathbb{Z}_-} \circ T_{-t})(u)\right)_s\right\| \le \sup_{s \in \mathbb{Z}_-} \{\|u_s\|\}$ or, equivalently, $\left\|(P_{\mathbb{Z}_-} \circ T_{-t})(u)\right\|_\infty \le \|u\|_\infty$. This implies that if $\|u - v\|_\infty < \delta_H(\epsilon)$ then $\left\|(P_{\mathbb{Z}_-} \circ T_{-t})(u) - (P_{\mathbb{Z}_-} \circ T_{-t})(v)\right\|_\infty < \delta_H(\epsilon)$ and hence (5.5) holds.


5.3 Proof of Theorem 2.6

We first show that the map $D^M_w: (\mathbb{R}^n)^{\mathbb{Z}_-} \times (\mathbb{R}^n)^{\mathbb{Z}_-} \to [0,\infty)$ defined in (2.15) is indeed a metric. It is clear that $D^M_w(x,y) \ge 0$ and that $D^M_w(x,x) = 0$, for any $x, y \in (\mathbb{R}^n)^{\mathbb{Z}_-}$. Conversely, if $D^M_w(x,y) = 0$ then $d_M(x_t, y_t)w_{-t} \le \sup_{t \in \mathbb{Z}_-} d_M(x_t, y_t)w_{-t} = D^M_w(x,y) = 0$, which ensures that $d_M(x_t, y_t) = 0$ for any $t \in \mathbb{Z}_-$, and hence necessarily $x = y$ since the map $d_M$ is a metric on $\mathbb{R}^n$ [Munk 14, Chapter 2, §20]. It is also obvious that $D^M_w(x,y) = D^M_w(y,x)$. Regarding the triangle inequality, notice that for any $x, y, z \in (\mathbb{R}^n)^{\mathbb{Z}_-}$ and $t \in \mathbb{Z}_-$:
\[
d_M(x_t, z_t)w_{-t} \le d_M(x_t, y_t)w_{-t} + d_M(y_t, z_t)w_{-t} \le D^M_w(x,y) + D^M_w(y,z),
\]
which implies that
\[
D^M_w(x,z) = \sup_{t \in \mathbb{Z}_-} d_M(x_t, z_t)w_{-t} \le D^M_w(x,y) + D^M_w(y,z).
\]
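Since $d_M \le M$ and the weighting sequence decays to zero, the supremum defining $D^M_w$ can be approximated in practice by truncating the left-infinite sequences; the tail contributes at most $M \sup_{t \ge T} w_t$. The following Python sketch (illustrative; $w_t = \lambda^t$ and the sizes are our choices) implements the truncated metric and checks the properties verified above.

```python
import numpy as np

lam, M = 0.5, 1.0          # weighting sequence w_t = lam**t and bound M (illustrative)

def d_M(a, b):
    # the bounded ("clipped") metric on R^n used in the text
    return min(np.linalg.norm(a - b), M)

def D_w(x, y):
    # x, y: arrays of shape (T, n); row k holds time t = k - (T - 1), so the
    # last row is t = 0. Truncation of D^M_w(x, y) = sup_{t<=0} d_M(x_t, y_t) w_{-t}.
    T = len(x)
    return max(d_M(x[k], y[k]) * lam ** (T - 1 - k) for k in range(T))

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=(50, 2))
y = rng.uniform(-1, 1, size=(50, 2))
z = rng.uniform(-1, 1, size=(50, 2))

print(D_w(x, y))
# sanity checks: symmetry and the triangle inequality shown above
assert np.isclose(D_w(x, y), D_w(y, x))
assert D_w(x, z) <= D_w(x, y) + D_w(y, z) + 1e-12
```

The geometric decay of the weights is also what makes remote-past discrepancies negligible, which is the intuition behind the equivalence with the product topology established next.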

We now show that the metric topology on $(\mathbb{R}^n)^{\mathbb{Z}_-}$ associated to $D^M_w$ coincides with the product topology. Let $x \in (\mathbb{R}^n)^{\mathbb{Z}_-}$ and let $B_{D^M_w}(x, \epsilon)$ be an $\epsilon$-ball around it with respect to the metric $D^M_w$. Let now $N \in \mathbb{N}$ be large enough so that $w_N < \epsilon/M$. We then show that the basis element $V$ for the product topology on $(\mathbb{R}^n)^{\mathbb{Z}_-}$ given by
\[
V := \cdots \times \mathbb{R}^n \times \mathbb{R}^n \times B_{d_M}(x_{-N}, \epsilon) \times \cdots \times B_{d_M}(x_{-1}, \epsilon) \times B_{d_M}(x_0, \epsilon),
\]
which obviously contains the element $x \in (\mathbb{R}^n)^{\mathbb{Z}_-}$, is such that $V \subset B_{D^M_w}(x, \epsilon)$. Indeed, since for any $y \in (\mathbb{R}^n)^{\mathbb{Z}_-}$ and any $t \in \mathbb{Z}_-$ we have that $d_M(x_t, y_t) \le M$, we can conclude that
\[
d_M(x_t, y_t)w_{-t} \le Mw_N, \quad \text{for all } t \le -N.
\]
Therefore, $D^M_w(x,y) \le \max\left\{Mw_N, d_M(x_{-N}, y_{-N})w_N, \ldots, d_M(x_{-1}, y_{-1})w_1, d_M(x_0, y_0)w_0\right\}$ and hence, if $y \in V$, this expression is smaller than $\epsilon$, which allows us to conclude the desired inclusion $V \subset B_{D^M_w}(x, \epsilon)$.

Conversely, consider a basis element of the product topology given by $U = \prod_{t \in \mathbb{Z}_-} U_t$, where $U_t = B_{d_M}(x_t, \epsilon_t)$ for a finite set of indices $t \in \{\alpha_1, \ldots, \alpha_r\}$, with $\epsilon_t \le 1$, and $U_t = \mathbb{R}^n$ for the rest. Let $\epsilon := \min_{t \in \{\alpha_1,\ldots,\alpha_r\}}\{\epsilon_t w_{-t}\}$. We now show that $B_{D^M_w}(x, \epsilon) \subset U$. Indeed, if $y \in B_{D^M_w}(x, \epsilon)$ then $d_M(x_t, y_t)w_{-t} \le D^M_w(x,y) < \epsilon$ for all $t \in \mathbb{Z}_-$. If $t \in \{\alpha_1, \ldots, \alpha_r\}$ then $\epsilon \le \epsilon_t w_{-t}$ and hence $d_M(x_t, y_t)w_{-t} < \epsilon_t w_{-t}$, which ensures that $d_M(x_t, y_t) < \epsilon_t$ and hence $y \in U$, as desired.

We conclude by showing that $\left((\mathbb{R}^n)^{\mathbb{Z}_-}, D^M_w\right)$ is a complete metric space. First, notice that since for any $x, y \in (\mathbb{R}^n)^{\mathbb{Z}_-}$ and any given $t \in \mathbb{Z}_-$ we have
\[
d_M(x_t, y_t) \le \frac{D^M_w(x,y)}{w_{-t}},
\]
we can conclude that if $\{x(i)\}_{i \in \mathbb{N}}$ is a Cauchy sequence in $(\mathbb{R}^n)^{\mathbb{Z}_-}$, then so are the sequences $\{x_t(i)\}_{i \in \mathbb{N}}$ in $\mathbb{R}^n$, for any $t \in \mathbb{Z}_-$, with respect to the bounded metric $d_M$. Since completeness with respect to the bounded metric $d_M$ and with respect to the Euclidean metric are equivalent [Munk 14, Chapter 7, §43], we can ensure that $\{x_t(i)\}_{i \in \mathbb{N}}$ converges to an element $a_t \in \mathbb{R}^n$ with respect to the Euclidean metric, for any $t \in \mathbb{Z}_-$. We now show that $\{x(i)\}_{i \in \mathbb{N}}$ converges to $a := (a_t)_{t \in \mathbb{Z}_-} \in (\mathbb{R}^n)^{\mathbb{Z}_-}$ with respect to the metric $D^M_w$, which proves the completeness statement.

Indeed, since the metric $D^M_w$ generates the product topology, let $U = \prod_{t \in \mathbb{Z}_-} U_t$ be a basis element such that $a \in U$ and, as before, $U_t = B_{d_M}(a_t, \epsilon_t)$ for a finite set of indices $t \in \{\alpha_1, \ldots, \alpha_r\}$, with $\epsilon_t \le 1$, and $U_t = \mathbb{R}^n$ for the rest. Let $\epsilon := \min\{\epsilon_{\alpha_1}, \ldots, \epsilon_{\alpha_r}\}$. Since for each $t \in \mathbb{Z}_-$ the sequence $x_t(i) \xrightarrow{i \to \infty} a_t$, there exists $N_t \in \mathbb{N}$ such that for any $k > N_t$ we have $\|x_t(k) - a_t\| < \epsilon$. If we take $N = \max\{N_{\alpha_1}, \ldots, N_{\alpha_r}\}$, then it is clear that $x(i) \in U$ for all $i > N$, as required.


5.4 Proof of Corollary 2.7

Notice first that for any $x, y \in K_M$ we have $\|x_t - y_t\| < 2M$, $t \in \mathbb{Z}_-$, and hence
\[
D^{2M}_w(x,y) := \sup_{t \in \mathbb{Z}_-} d_{2M}(x_t, y_t)w_{-t} = \sup_{t \in \mathbb{Z}_-} \{\|x_t - y_t\|w_{-t}\} = \|x - y\|_w.
\]
Hence, the topology induced by the weighted norm $\|\cdot\|_w$ on $K_M$ coincides with the metric topology induced by the restricted metric $D^{2M}_w|_{K_M \times K_M}$ which, by Theorem 2.6, is the subspace topology induced on $K_M$ by the product topology on $(\mathbb{R}^n)^{\mathbb{Z}_-}$ (see [Munk 14, Exercise 1, page 133]), as well as the product topology on the product $K_M = \overline{B_{\|\cdot\|}}(0,M)^{\mathbb{Z}_-}$ (see [Munk 14, Theorem 19.3, page 116]).

5.5 Proof of Corollary 2.8

First, since $K_M = \overline{B_{\|\cdot\|}}(0,M)^{\mathbb{Z}_-}$, it is clearly a product of compact spaces. By Tychonoff's Theorem ([Munk 14, Chapter 5]), $K_M$ is compact when endowed with the product topology which, by Corollary 2.7, coincides with the topology associated to the restriction of the norm $\|\cdot\|_w$ to $K_M$, as well as with the metric topology given by $D^{2M}_w|_{K_M \times K_M}$.

Second, since $(K_M, \|\cdot\|_w)$ is metrizable, it is a Hausdorff space. This implies (see [Munk 14, Theorem 26.3]) that, as $K_M$ is a compact subspace of the Banach space $\left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right)$ (see Proposition 5.2), it is necessarily closed. This in turn implies ([Simm 63, Theorem B, page 72]) that $(K_M, \|\cdot\|_w)$ is complete.

Finally, the convexity statement follows from the fact that a product of convex sets is always convex.

5.6 Proof of Proposition 2.9

Let $d_w$ be the metric on $\ell^w_-(\mathbb{R}^n)$ induced by the weighted norm $\|\cdot\|_w$ and let $D_w := D^1_w$ be the $w$-weighted metric on $(\mathbb{R}^n)^{\mathbb{Z}_-}$ with constant $M = 1$ introduced in Theorem 2.6, defined using the same underlying norm on $\mathbb{R}^n$ as the one associated to $\|\cdot\|_w$. As we saw in that theorem, the metric $D_w$ induces the product topology on $(\mathbb{R}^n)^{\mathbb{Z}_-}$.

Let now $u \in \ell^w_-(\mathbb{R}^n)$ and let $\epsilon > 0$. Let $v \in \ell^w_-(\mathbb{R}^n)$ be such that $d_w(u,v) < \epsilon$. By definition, we have that
\[
D_w(u,v) = \sup_{t \in \mathbb{Z}_-} d_1(u_t, v_t)w_{-t} = \sup_{t \in \mathbb{Z}_-} \{(\min\{\|u_t - v_t\|, 1\})w_{-t}\} \le \sup_{t \in \mathbb{Z}_-} \{\|u_t - v_t\|w_{-t}\} = d_w(u,v) < \epsilon,
\]
which shows that $B_{d_w}(u, \epsilon) \subset B_{D_w}(u, \epsilon)$ and allows us to conclude that the norm topology on $\ell^w_-(\mathbb{R}^n)$ is finer than the subspace topology induced by the product topology on $(\mathbb{R}^n)^{\mathbb{Z}_-}$.

We now show that this inclusion is strict. Since the weighting sequence $w$ converges to zero, there exists an element $t_0 \in \mathbb{Z}_-$ such that $w_{-t_0} < \epsilon/2$. Let $\lambda > 0$ be arbitrary and define the element $v^\lambda \in (\mathbb{R}^n)^{\mathbb{Z}_-}$ by setting $v^\lambda_{t_0} := \lambda u_{t_0}$ and $v^\lambda_t := u_t$ when $t \ne t_0$. We now show that $v^\lambda \in B_{D_w}(u, \epsilon)$ for any $\lambda > 0$. Indeed,
\[
D_w(u, v^\lambda) = \min\{|\lambda - 1|\|u_{t_0}\|, 1\}w_{-t_0} \le 1 \cdot w_{-t_0} < \epsilon/2 < \epsilon.
\]
At the same time, by definition,
\[
d_w(u, v^\lambda) = |\lambda - 1|\|u_{t_0}\|w_{-t_0} < \infty,
\]
which shows that $v^\lambda \in \ell^w_-(\mathbb{R}^n)$. However, since $|\lambda - 1|\|u_{t_0}\|w_{-t_0}$ can be made as large as desired by choosing $\lambda$ big enough, we have proved that for any ball $B_{d_w}(u, \epsilon_0)$, with $\epsilon_0 > 0$ arbitrary, the ball $B_{D_w}(u, \epsilon)$ always contains an element of $\ell^w_-(\mathbb{R}^n)$ that is not included in $B_{d_w}(u, \epsilon_0)$. This argument allows us to conclude that the norm topology on $\ell^w_-(\mathbb{R}^n)$ is strictly finer than the subspace topology induced by the product topology.

5.7 Proof of Lemma 2.10

The proof requires the following preparatory lemma that will also be used later on in the proof of

Proposition 2.11.

Lemma 5.1 Let M > 0and let wbe a weighting sequence. Then:

(i) The operator PZ−◦T−t: (KM,k·kw)−→ (KM,k·kw)is a continuous map, for any t∈Z−.

(ii) The projections pi: (w

−(Rn),k·kw)−→ (Rn,k·k),i∈Z−, given by pi(z) = zi, are continuous.

Proof of the lemma. (i) We show that this statement is true by characterizing $P_{\mathbb{Z}_-} \circ T_{-t}$ as a Cartesian product of continuous maps between two product spaces endowed with the product topologies and by using Corollary 2.7. Indeed, notice first that the projections $p_i : (K_M, \|\cdot\|_w) \longrightarrow B_{\|\cdot\|}(0, M)$ are continuous since by Corollary 2.7 the topology induced on $K_M$ by the weighted norm $\|\cdot\|_w$ is the product topology. Since $P_{\mathbb{Z}_-} \circ T_{-t}$ can be written as the infinite Cartesian product of continuous maps $P_{\mathbb{Z}_-} \circ T_{-t} = \prod_{i=t}^{-\infty} p_i = (\ldots, p_{t-2}, p_{t-1}, p_t)$, it is hence continuous when using the product topology induced by $\|\cdot\|_w$ (see [Munk 14, Theorem 19.6]).

(ii) Notice first that the projections $p_i : (\ell^w_-(\mathbb{R}^n), \|\cdot\|_w) \longrightarrow (\mathbb{R}^n, \|\cdot\|)$ are obviously continuous when we consider in $\ell^w_-(\mathbb{R}^n)$ the subspace topology induced by the product topology in $(\mathbb{R}^n)^{\mathbb{Z}_-}$. The continuity of $p_i : (\ell^w_-(\mathbb{R}^n), \|\cdot\|_w) \longrightarrow (\mathbb{R}^n, \|\cdot\|)$ then follows directly from Proposition 2.9. $\blacksquare$
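As a supplementary remark (this step is not spelled out in the original argument), the continuity of each $p_i$ in part (ii) can also be obtained from an explicit Lipschitz estimate: for any $u, v \in \ell^w_-(\mathbb{R}^n)$ and any $i \in \mathbb{Z}_-$,
$$\|p_i(u) - p_i(v)\| = \|u_i - v_i\| = \frac{1}{w_{-i}} \|u_i - v_i\| w_{-i} \leq \frac{1}{w_{-i}} \sup_{t \in \mathbb{Z}_-} \{\|u_t - v_t\| w_{-t}\} = \frac{\|u - v\|_w}{w_{-i}},$$
so each $p_i$ is Lipschitz with constant $1/w_{-i}$ and hence continuous.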

We now proceed with the proof of Lemma 2.10. Let first $H \in \mathcal{H}^w_{K_M}$. The FMP hypothesis implies that the map $H : (K_M, \|\cdot\|_w) \longrightarrow (\mathbb{R}^N, \|\cdot\|)$ is continuous. Given that $K_M$ is compact by Corollary 2.8, then so is $H(K_M) \subset \mathbb{R}^N$. This in turn implies that $H(K_M)$ is closed and bounded [Munk 14, Theorem 27.3], which guarantees the existence of $L > 0$ such that $H(K_M) \subset B_{\|\cdot\|}(0, L)$. The map obtained out of $H$ by restriction of its target to $B_{\|\cdot\|}(0, L)$ (that we denote with the same symbol) is also continuous and hence $H \in \mathcal{H}^w_{K_M, K_L}$.

Let now $U : K_M \longrightarrow \ell^w_-(\mathbb{R}^N)$ in $\mathcal{F}^w_{K_M}$ and consider the composition $p_0 \circ U : K_M \longrightarrow \mathbb{R}^N$. The FMP hypothesis on $U$ and the continuity of $p_0 : (\ell^w_-(\mathbb{R}^N), \|\cdot\|_w) \longrightarrow (\mathbb{R}^N, \|\cdot\|)$ that we established in the second part of Lemma 5.1 imply that $p_0 \circ U$ is continuous. This implies, together with the compactness of $K_M$ that we proved in Corollary 2.8, the existence of $L > 0$ such that $p_0 \circ U(K_M) \subset B_{\|\cdot\|}(0, L)$. Equivalently, for any $z \in K_M$, we have that $U(z)_0 \in B_{\|\cdot\|}(0, L)$. Now, since $U$ is by hypothesis time-invariant, we have by (2.3) that
$$U(z)_t = (T_{-t}(U(z)))_0 = U(T_{-t}(z))_0 \in B_{\|\cdot\|}(0, L), \quad t \in \mathbb{Z}_-, \text{ since } T_{-t}(z) \in K_M,$$
which proves that $U(K_M) \subset K_L$. The map obtained out of $U$ by restriction of its target to $K_L$ (that we denote with the same symbol) is also continuous since $(K_L, \|\cdot\|_w)$ is a topological subspace of $(\ell^w_-(\mathbb{R}^N), \|\cdot\|_w)$, and hence $U \in \mathcal{F}^w_{K_M, K_L}$, as required.
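The time-invariance identity $U(z)_t = (T_{-t}(U(z)))_0 = U(T_{-t}(z))_0$ used in the display above can be checked numerically. The following sketch is hypothetical: a simple two-tap causal filter stands in for $U$, truncated arrays stand in for left-infinite sequences (the last entry is time $0$), and $T_{-t}$ acts by dropping the last $|t|$ entries:

```python
import numpy as np

def U(z):
    # A causal, time-invariant filter: U(z)_t = z_t + 0.5 * z_{t-1}.
    # The output is one entry shorter than the input because the
    # earliest time has no predecessor in the truncated array.
    return z[1:] + 0.5 * z[:-1]

T = 50
rng = np.random.default_rng(0)
z = rng.standard_normal(T + 1)        # z[-1] is the coordinate at time 0

for k in range(1, 10):                # compare at times t = -k
    lhs = U(z)[-1 - k]                # U(z)_t, read off the full output
    rhs = U(z[: len(z) - k])[-1]      # shift the input by T_{-t}, read time 0
    assert np.isclose(lhs, rhs)
```

The assertions pass for every shift tested, which is exactly the property that lets the proof transport the time-0 bound $U(z)_0 \in B_{\|\cdot\|}(0, L)$ to all negative times.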

The inclusion $\mathcal{F}^w_{K_M, K_L} \subset \mathcal{F}^w_{K_M}$ (respectively, $\mathcal{H}^w_{K_M, K_L} \subset \mathcal{H}^w_{K_M}$) is a consequence of the continuity of the inclusion map $(K_L, \|\cdot\|_w) \hookrightarrow (\ell^w_-(\mathbb{R}^N), \|\cdot\|_w)$ (respectively, $(B_{\|\cdot\|}(0, L), \|\cdot\|) \hookrightarrow (\mathbb{R}^N, \|\cdot\|)$).


5.8 Proof of Proposition 2.11

Proof of part (i). The FMP of $U$ with respect to the sequence $w$ is, by definition, equivalent to the continuity of the map $U : (K_M, \|\cdot\|_w) \longrightarrow (K_L, \|\cdot\|_w)$ (respectively, $H : (K_M, \|\cdot\|_w) \longrightarrow (B_{\|\cdot\|}(0, L), \|\cdot\|)$). By Corollary 2.8, this is equivalent to the continuity of these maps when $K_M$ and $K_L$ are endowed with the product topology, which is, by the same result, generated by any arbitrary weighting sequence.
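The fact that any weighting sequence generates the same topology on a uniformly bounded set can be illustrated numerically. In this hypothetical sketch, a bounded sequence is perturbed at a single far-past coordinate (staying inside $K_M$), and the perturbation becomes small in two weighted norms with very different decay rates:

```python
import numpy as np

M, T = 1.0, 200
u = np.zeros(T + 1)                       # u[i] is the coordinate at time -i
w1 = 0.5 ** np.arange(T + 1)              # a geometric weighting sequence
w2 = 1.0 / (1.0 + np.arange(T + 1))**2    # a polynomial one

def dist(a, b, w):
    # ||a - b||_w = sup_i ||a_i - b_i|| w_i on the truncated sequences
    return np.max(np.abs(a - b) * w)

for k in [10, 50, 200]:
    v = u.copy()
    v[k] = M                              # perturb the time -k slot, inside K_M
    # both weighted distances shrink as the perturbation recedes into the past
    print(f"k={k}: ||u-v||_w1={dist(u, v, w1):.2e}, ||u-v||_w2={dist(u, v, w2):.2e}")
```

Because $w \to 0$ along any weighting sequence, a bounded coordinatewise perturbation placed further and further in the past vanishes in every weighted norm, which is the mechanism behind the equivalence stated above.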

Consider now $U : (K_M, \|\cdot\|_w) \longrightarrow (\ell^w_-(\mathbb{R}^N), \|\cdot\|_w)$ in $\mathcal{F}^w_{K_M}$ (respectively, $H : (K_M, \|\cdot\|_w) \longrightarrow (\mathbb{R}^N, \|\cdot\|)$ in $\mathcal{H}^w_{K_M}$). By Lemma 2.10 there exists an $L > 0$ such that $U$ (respectively, $H$) can be considered an element of $\mathcal{F}^w_{K_M, K_L}$ (respectively, $\mathcal{H}^w_{K_M, K_L}$) by restriction of the target. Using the statement that we just proved about the space $\mathcal{F}^w_{K_M, K_L}$ (respectively, $\mathcal{H}^w_{K_M, K_L}$), we can conclude that $U$ (respectively, $H$) has the FMP with respect to any weighting sequence. Since, again by Lemma 2.10, the inclusion $\mathcal{F}^w_{K_M, K_L} \subset \mathcal{F}^w_{K_M}$ (respectively, $\mathcal{H}^w_{K_M, K_L} \subset \mathcal{H}^w_{K_M}$) holds true for any $M > 0$ and any weighting sequence $w$, we can conclude that $U$ (respectively,