
On Neural Learnability of Chaotic Dynamics

Ziwei Li* and Sai Ravela

Earth Signals and Systems Group, Department of Earth, Atmospheric, and Planetary Sciences,
Massachusetts Institute of Technology, Cambridge, MA 02139, USA

(Dated: December 10, 2019)

* ziweili@mit.edu

In modeling nonlinear dynamics, neural networks are of interest for prediction and uncertainty quantification. The "learnability" of chaotic dynamics by neural networks, however, remains poorly understood. In this work, we show that a parsimonious network trained on a few data points suffices for accurate prediction of local divergence rates on the whole attractor. To understand neural learnability, we decompose the mappings in the neural network into a series of geometric stretching and compressing operations that indicate topological mixing and, therefore, chaos. This reveals that neural networks and chaotic dynamical systems are structurally similar, which yields excellent reproduction of local divergence rates. To build parsimonious networks, we employ an approach that matches the spectral features of the dynamics of deep learning to those of polynomial regression.

Introduction – Chaotic systems are ubiquitous [24]. For these systems, there usually exists a set of continuous nonlinear governing equations, but finding the exact solutions is often impossible, in part due to a common characteristic of chaos wherein two close-by trajectories diverge exponentially. In practice, modelers use discretization to solve the nonlinear equations, often from multiple initial conditions to quantify errors. Doing so, however, raises difficult challenges in the form of non-linearity, high dimensionality, and non-Gaussian uncertainty [21]. As a result, the search for simple-yet-effective models of chaotic dynamics remains a crucial pursuit in the physical sciences.

Recently, there has been a surge of interest in using neural networks (NNs) to emulate chaotic dynamics [2, 3, 7, 9, 19, 29, 30, 32], showing neural networks to be promising models. We follow this line of investigation, first showing that neural models with only a few neurons, trained using a small number of data points, reconstruct the entire attractor object of the classic Lorenz-63 (L63) system [18]. NNs can "extrapolate" from partial knowledge of the attractor, rendering a uniform distribution of the training data unnecessary. The neural models are also, it appears, as chaotic as the L63 system they are trained on.

This success is much like other efforts seeking to emulate chaotic dynamics. By way of explanation, one typically resorts to the universal approximation theorem (UAP) [10, 16, 17, 23]. However, the UAP is an existence statement: it explains neither the emergence of chaos nor the efficacy with which the attractor is reconstructed. Inspired by the validity of a geometric interpretation of the L63 system [26], we show that a geometric perspective helps to explain the neural efficacy. Neural mappings alternately rotate, stretch, and compress, which are the defining characteristics of chaotic dynamics [6]. Whilst the question of optimizing neural computation for prediction remains open, our work suggests that possession of the geometrical properties required by chaos theory enables neural networks to efficiently match the structure of the L63 attractor object and its predictability. To the best of our knowledge, this explanation for the neural learnability of chaos is new [20]. A key step in this process is to show that NNs are compact and do not over-fit relative to the attractor object. We achieve this by noting that L63 is a polynomial, allowing us to impose bounds on the size of the neural network [22, 25].

Methods – The L63 model was originally used to describe 2-D Rayleigh-Bénard convection, in which the parameters of the streamfunction and temperature fields are written in a set of ordinary differential equations [18]:

dX/dt = σ(Y − X),
dY/dt = ρX − Y − XZ,          (1)
dZ/dt = −βZ + XY,

where X and Y are the strengths of the streamfunction and temperature modes, and Z represents the deviation of the vertical temperature profile from linearity. Consistent with Lorenz's original paper, we set σ = 10, β = 8/3, and ρ = 28. The solutions of L63 are known to be dissipative (volume in phase space contracts rapidly) and chaotic (sensitive to initial perturbations).

We define L63 as a discrete mapping from the current state of the system x_n = (X, Y, Z)^T to the state at the next timestep x_{n+1}:

Φ_L63(x_n) = x_{n+1}.   (2)

We choose this discrete form for both the L63 and NN maps because it provides a straightforward connection between the geometric L63 mapping and the dynamics in the neural network. Since the exact form of Eq. (2) for L63 is unknown, the discrete map is obtained by numerically integrating Eq. (1) and sampling at increment dt.

The discrete map implemented by a single-hidden-layer feedforward NN is

Φ_NN(x_n) = W_2 g(W_1 x_n + b_1) + b_2,   (3)

in which the 3×1 input vector x_n is first left-multiplied by an L×3 weight matrix W_1 and added to an L×1 bias term b_1, where L is the number of neurons. The resulting vector is then element-wise "compressed" by a sigmoid function g, which takes the form of tanh in our setup. Left-multiplication by a 3×L matrix W_2, followed by addition of another bias term b_2, finishes one mapping iteration of the NN.
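A minimal sketch of the map in Eq. (3) in Python/NumPy may help fix notation (the function name phi_nn is ours; W1, b1, W2, b2 are the weights and biases of Eq. (3)):

import numpy as np

def phi_nn(x, W1, b1, W2, b2):
    # One iteration of the NN map, Eq. (3): x_{n+1} = W2 tanh(W1 x_n + b1) + b2
    y = np.tanh(W1 @ x + b1)   # L-dimensional neuron vector, cf. Eq. (5)
    return W2 @ y + b2         # back to the 3-D phase space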

We use the Matlab function ode45 to numerically solve for the discrete mappings of L63 as training data. To obtain data on the attractor, we randomly initialize 1000 trajectories from the region [−20, 20] × [−20, 20] × [0, 50] with a uniform distribution. Each trajectory is integrated for 2500 timesteps with dt = 0.01. We discard the first 2000 timesteps to remove the transient parts, which are much shorter than 2000 steps. The remaining 500 timesteps of the 1000 trajectories are then aggregated as pairs (x, x′) that satisfy x′ = Φ_L63(x), forming the training data pool.
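A minimal sketch of this data-generation step, using SciPy's RK45 integrator in place of Matlab's ode45 (the helper generate_pairs and all variable names are ours):

import numpy as np
from scipy.integrate import solve_ivp

SIGMA, BETA, RHO = 10.0, 8.0 / 3.0, 28.0

def lorenz63(t, s):
    # Right-hand side of Eq. (1)
    X, Y, Z = s
    return [SIGMA * (Y - X), RHO * X - Y - X * Z, -BETA * Z + X * Y]

def generate_pairs(n_traj=1000, n_steps=2500, n_discard=2000, dt=0.01, seed=0):
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(n_traj):
        x0 = rng.uniform([-20.0, -20.0, 0.0], [20.0, 20.0, 50.0])  # random initial condition
        t_eval = np.arange(n_steps + 1) * dt
        sol = solve_ivp(lorenz63, (0.0, t_eval[-1]), x0, t_eval=t_eval,
                        method="RK45", rtol=1e-9, atol=1e-9)
        states = sol.y.T[n_discard:]                                # drop the transient part
        pairs.append(np.stack([states[:-1], states[1:]], axis=1))   # consecutive (x, x') pairs
    return np.concatenate(pairs)                                    # training data pool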

The prior locations of the training data pairs (x) can be seen as a representation of the L63 attractor (A_L63), and each consecutive location pair provides information about the discrete L63 flow. We then randomly sample a specified number of training pairs from the data pool to train NNs. Each NN is trained for 10^3 epochs with Bayesian regularization [8], where an epoch is one full sweep through the sampled training data.
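The paper trains with Matlab's Bayesian-regularization routine [8]; as an illustrative stand-in only, the sketch below fits the same single-hidden-layer architecture by full-batch gradient descent with an L2 penalty (all hyperparameters here are ours, not the paper's):

import numpy as np

def train_nn(pairs, L=4, epochs=1000, lr=1e-3, l2=1e-4, seed=0):
    # pairs: array of shape (N, 2, 3) holding (x, x') samples from the training data pool
    rng = np.random.default_rng(seed)
    X, Y = pairs[:, 0], pairs[:, 1]
    W1, b1 = 0.1 * rng.standard_normal((L, 3)), np.zeros(L)
    W2, b2 = 0.1 * rng.standard_normal((3, L)), np.zeros(3)
    for _ in range(epochs):                        # one epoch = full sweep of the samples
        H = np.tanh(X @ W1.T + b1)                 # hidden activations, cf. Eq. (5)
        P = H @ W2.T + b2                          # predictions, Eq. (3)
        E = P - Y                                  # residuals
        gW2, gb2 = E.T @ H / len(X) + l2 * W2, E.mean(axis=0)
        dH = (E @ W2) * (1.0 - H**2)               # back-propagate through tanh
        gW1, gb1 = dH.T @ X / len(X) + l2 * W1, dH.mean(axis=0)
        W1, b1 = W1 - lr * gW1, b1 - lr * gb1
        W2, b2 = W2 - lr * gW2, b2 - lr * gb2
    return W1, b1, W2, b2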

We use the finite-time Lyapunov exponent (FTLE) [13] to compare the local divergence rates of the NN and L63 and to quantify their similarity on the basis of predictability. The FTLE is computed by forward-propagating two nearby trajectories that originate from the vicinity of the A_L63 attractor. Formally, it is defined as

λ_max := (1/N_t) ln [ max_{δx_0} ( |δx_{N_t}| / |δx_0| ) ] = (1/N_t) ln √σ_max,   (4)

where λ_max denotes the maximum FTLE, δx_0 is the initial perturbation between the two trajectories, and δx_{N_t} denotes their difference after N_t steps. The FTLE relates to the classical Lyapunov exponent in the limit N_t → ∞ and δx_0 → 0 [4]. To obtain λ_max, we select the direction of δx_0 such that |δx_{N_t}| is maximized; in practice, we calculate λ_max using the largest eigenvalue (σ_max) of J^T J, where J is the Jacobian matrix evaluated using perturbations around x_0.
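A minimal sketch of this FTLE estimate for a generic discrete map Φ (the helper ftle is ours; it builds the Jacobian of the N_t-fold map by finite differences and takes the largest eigenvalue of J^T J, as in Eq. (4)):

import numpy as np

def ftle(phi, x0, n_steps, eps=1e-6):
    x0 = np.asarray(x0, dtype=float)
    def propagate(x):                    # apply the map n_steps times
        for _ in range(n_steps):
            x = phi(x)
        return x
    ref = propagate(x0)
    J = np.empty((x0.size, x0.size))     # finite-difference Jacobian around x0
    for i in range(x0.size):
        dx = np.zeros(x0.size)
        dx[i] = eps
        J[:, i] = (propagate(x0 + dx) - ref) / eps
    sigma_max = np.linalg.eigvalsh(J.T @ J).max()   # largest eigenvalue of J^T J
    # Eq. (4) gives a per-timestep rate; divide by n_steps*dt instead for a rate per unit time
    return np.log(np.sqrt(sigma_max)) / n_steps

Passing phi = lambda x: phi_nn(x, W1, b1, W2, b2), or one dt-increment of the L63 integration, gives, in spirit, the two quantities compared in Fig. 2.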

Results – We first show that NNs can learn the chaotic dynamics of L63 efficiently with a small number of data points and neurons. The quadratic prediction error is reported in Ref. [31] and will not be the main focus of this paper. We instead compare the short-term and long-term behaviors of the two systems. Specifically, we show that the dynamics represented by the NN possess predictability similar to L63 as quantified by the FTLE, and that NNs are able to extrapolate to regions that are unknown in the training data.

FIG. 1. Two trajectories produced by L63 (blue) and the 4-neuron NN trained on 40 data points sampled from the whole attractor (red). Both trajectories are 2000 timesteps long and start from the same location on the Lorenz attractor.

We analyze a 4-neuron network trained on 40 data points randomly sampled from the training data pool (Table I shows the learnt parameters of this network). Fig. 1 depicts two trajectories that follow the L63 flow and the flow of the trained NN, respectively. They interlace with each other, tracing out the well-known Lorenz attractor. Trajectories starting from other locations on the attractor roughly trace out the same structure (not shown). The close resemblance between the two attractors indicates that the dynamics of this parsimonious NN are similar to those of L63, i.e., NNs are able to learn chaotic dynamics efficiently.

FIG. 2. One-to-one scatter plot of the FTLEs of the NN and L63. The NN used in this plot has 4 neurons and is trained on only 40 data points. The panels correspond to four integration steps N_t, with time increment dt = 0.01.

To calculate the FTLE, we generate points following the NN flow using the same generation process as in Methods. The generated points in phase space represent the NN attractor (A_NN). We then randomly initialize 2000 trajectories on A_L63. Every trajectory from A_L63 is paired with another trajectory that starts from the closest point on A_NN; in each pair, the former follows the L63 flow while the latter follows the NN flow.


FIG. 3. The root-mean-square error in FTLE of neural networks for each neuron-count and training-data-count configuration. The FTLE is calculated with N_t = 50 and averaged over 2000 trajectories randomly initialized on the attractor. The red dot represents the example configuration in Figs. 1 and 2. The red surface is located at z = 0.05.

The FTLEs of the trajectory pairs are compared for different numbers of timesteps: N_t = 5, 50, 100, 500 (Fig. 2). When N_t = 5 and 50, the NN accurately reproduces local divergence rates over the whole attractor, indicating that the short-term predictability of the two systems agrees. As N_t increases, the correspondence diverges (N_t = 100), and then converges again (N_t = 500) to the classical largest Lyapunov exponent of L63, which is roughly 0.91 as in Ref. [28]. The convergence of the FTLE at large timesteps implies that the long-term behavior of the two systems is also similar.

TABLE I. Parameters of the 4-neuron NN flow.

W1 (4 × 3):
   0.0034     0.0030    −0.0050
   0.0115     0.0072    −0.0015
  −0.0067    −0.0009    −0.0064
  −0.0075    −0.0005     0.0001

b1^T (1 × 4):
  −0.1131    −0.6111    −0.0266    −0.1395

W2 (3 × 4):
    6.0807     5.2861     −7.9178   −107.1371
  370.0114   −26.1875   −270.5582    366.2765
 −169.8626    95.5298    −40.6654     62.2099

b2^T (1 × 3):
  −11.5557    71.0935     40.1989

diag(S) (singular values of W* = W1 W2, defined below Eq. (6)):
    4.3852     1.2087      0.7184      0.0000

The agreement in FTLE generally improves with increasing numbers of neurons and data points (Fig. 3). This trend is expected if we invoke the bias-variance trade-off [11]: increased complexity in learning models such as neural networks generally translates into better prediction accuracy (lower bias), provided that regularization techniques prevent the learning algorithm from entering the high-variance regime.

FIG. 4. Similar to Fig. 1, but the red trajectory is produced by a 5-neuron NN trained on 100 data points sampled from the X > −5 part of the attractor. The region to the right of the grey partition is the training data range, and the region to the left is unknown to the NN.

NNs can extrapolate from an incomplete training dataset. Fig. 4 shows a comparison of two trajectories predicted by the NN and L63 that originate from the same location (red dot). The NN in this case has 5 neurons and is trained on 100 data points sampled from the X > −5 part of A_L63, which amounts to knowing 73% of the attractor structure. The two trajectories stay close for the first 100 timesteps, and then bifurcate onto the two branches of the attractor. Despite starting from an unknown region, the NN still predicts a well-behaved attractor that closely resembles the original attractor in the extrapolated region X ≤ −5. The one-to-one correspondence of FTLEs between L63 and the NN trained on the incomplete dataset is similar to Fig. 2 (not shown).

Geometric perspective of the NN flow – We showed in the previous section that the neural learnability of the L63 dynamics is very good. However, a theoretical approach to understanding this learnability has been lacking. Although the UAP states that the mapping Φ_L63 can be approximated by an NN arbitrarily well, it explains neither the NN's ability to reconstruct the strange attractor efficiently nor its skill at extrapolation. Inspired by the exact mathematical correspondence between the geometric Lorenz flow and L63 [12, 26], we give our geometric understanding of the NN flow.

The dynamics of the NN (Eq. 3) can be seen as a mapping in a multi-dimensional Riemannian space (this interpretation is also used in classification problems [14]). In the discrete map of the simple 4-neuron network discussed above, the input vector x in the 3-D phase space is mapped into a 4-D neuron space, and then mapped back to the phase space. We write an N_t-step trajectory (N_t ≥ 2) as L_0^{N_t} = {x_0, x_1, ..., x_{N_t}}. In each mapping from step n → n+1, n ∈ {0, 1, ..., N_t − 1}, there exists a 4-D intermediate vector y in the neuron space:

y_{n+1} = g(W_1 x_n + b_1),   n = 0, 1, ..., N_t − 1.   (5)

We refer to y as the neuron vector. The recurrence relation of the neuron vector is then

y_{n+1} = g(W* y_n + b*),   n = 1, 2, ..., N_t − 1,   (6)


where W* = W_1 W_2 is a 4-by-4 matrix and b* = W_1 b_2 + b_1 is a 4-by-1 vector. W* can be decomposed as W* = U S V^T using the singular-value decomposition. U and V are both 4-D orthonormal matrices, and S is a diagonal matrix of rank 3. We then rewrite Eq. (6) more explicitly as

y_{n+1} = g(U S V^T y_n + b*),   (7)

which we call the neuron map. Equation (7) encodes the entire dynamics learnt by the NN, because it differs from Eq. (3) only by the homomorphism of Eq. (5). Therefore, understanding the neuron map is equivalent to understanding the dynamics of the NN.

The neuron map comprises four sub-steps: rotation, stretch, rotation, and compression. Rotations in this paper are taken in the generalized sense of orthogonal transformations, and they are carried out by the matrices V^T and U in the neuron map. Since the sigmoid function only has a compressing effect, its gradient being smaller than or equal to 1, the diagonal matrix S must have at least one diagonal element larger than 1 in order to satisfy the stretching requirement of chaotic dynamics. For the 4-neuron NN in question, S imposes an expanding effect in two dimensions, since two of its diagonal elements are greater than 1 (see Table I).
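Assuming the (rounded) Table I values for W1, b1, W2, and b2, the stretch factors of the neuron map can be checked directly; in the short sketch below, the recovered singular values should approximately reproduce diag(S) in Table I, with two of them exceeding 1:

import numpy as np

# Table I parameters of the 4-neuron network
W1 = np.array([[ 0.0034,  0.0030, -0.0050],
               [ 0.0115,  0.0072, -0.0015],
               [-0.0067, -0.0009, -0.0064],
               [-0.0075, -0.0005,  0.0001]])
b1 = np.array([-0.1131, -0.6111, -0.0266, -0.1395])
W2 = np.array([[   6.0807,   5.2861,   -7.9178, -107.1371],
               [ 370.0114, -26.1875, -270.5582,  366.2765],
               [-169.8626,  95.5298,  -40.6654,   62.2099]])
b2 = np.array([-11.5557, 71.0935, 40.1989])

W_star = W1 @ W2                  # 4x4 matrix of the neuron map, Eq. (6)
b_star = W1 @ b2 + b1             # 4x1 offset of the neuron map
U, S, Vt = np.linalg.svd(W_star)  # rotation-stretch-rotation factors of Eq. (7)
print(S)                          # expected to be close to diag(S) in Table I (rank 3)

def neuron_map(y):
    # One neuron-map step, Eq. (7): rotate (V^T), stretch (S), rotate (U), compress (tanh)
    return np.tanh(U @ (S * (Vt @ y)) + b_star)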

The effects of compression and expansion exerted by the NN are seen more clearly through the error growth between timesteps. For a small perturbation δy between two initial points near y, its value at the next timestep is

δy′ = g′(W* y + b*) ⊙ (W* δy),   (8)

where we neglected second- and higher-order terms, and ⊙ denotes the element-wise product. Letting G_jj = g′(Σ_{i=1}^{L} W*_{ji} y_i + b*_j), we have g′(W* y + b*) ⊙ (W* δy) = G W* δy, where G = diag{G_11, G_22, ...}. The squared error is then

|δy′|^2 = (W* δy)^T G^2 (W* δy).   (9)

From Eq. (9), it is clear that singular values of W* that are larger than 1 expand the perturbation, while G compresses it, since g′(x) ∈ (0, 1] for all x ∈ R. Given y, G controls the degree of compression in each direction of the neuron space. U and V in the decomposition of W* control the orientations of compression and expansion, so that they take place in different directions.
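The compression matrix G and the error-growth relation of Eqs. (8)-(9) can also be checked numerically; a small sketch, continuing from the Table I variables defined above:

def perturbation_growth(y, dy):
    # Linearized one-step growth of a perturbation dy about y, Eqs. (8)-(9)
    G = np.diag(1.0 - np.tanh(W_star @ y + b_star) ** 2)   # g'(x) = 1 - tanh(x)^2 lies in (0, 1]
    dy_next = G @ (W_star @ dy)                            # delta-y' = G W* delta-y
    return dy_next, np.linalg.norm(dy_next) ** 2           # |delta-y'|^2 as in Eq. (9)

# Consistency check against the full nonlinear neuron map:
y = np.array([0.1, -0.2, 0.05, 0.0])
dy = 1e-6 * np.array([1.0, 0.0, 0.0, 0.0])
lin, _ = perturbation_growth(y, dy)
nonlin = neuron_map(y + dy) - neuron_map(y)
# lin and nonlin should agree to first order in |dy|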

The stretch and compression sub-steps in neuron maps are frequently regarded as the standard way to create topological mixing, an indicator of chaos. The ability to learn these geometric operations through training makes NNs well suited to approximating discrete chaotic mappings. As another, perhaps more concrete, example, a 2-neuron NN map can be trained to faithfully recreate the Hénon map (not shown), a 2-D chaotic map defined such that trajectories are stretched in one direction and compressed in the other [15].

Generalization to multi-layer networks is straightforward in the above framework. Since "the dynamics of the neuron vector" would be ambiguous when there are multiple layers of neurons, we apply the same argument to perturbations in the phase space. For a perturbation δx around x, its squared length at the next timestep is

|δx′|^2 = |W_{N+1} G_N W_N ... G_1 W_1 δx|^2,   (10)

where G_i = diag{g′(W_i y_{i−1} + b_i)}, and y_{i−1} is the neuron vector of the i-th layer for i > 1 (y_0 = x). The weight and gradient matrices thus consecutively parameterize multiple stretching and compressing operations in a single NN map.

Lower-bounding the number of neurons – Since the Euler-forward scheme of Eq. (1) is a 3-D (n = 3) polynomial with degree at most d = 2, we use previous theoretical results on learning polynomials with NNs [1, 5] to establish lower bounds on the necessary number of neurons. In effect, we assume that the dynamics are polynomial but that the learning system does not know the exact governing equations. The number of neurons L for learning a polynomial with root-mean-square error target ε is bounded by L = Ω(n^{6d}/ε^3) [1]. This is a rather coarse estimate, as more than 5 × 10^5 nodes are needed when ε ∼ 1. In stark contrast, exactly two neurons in a single hidden layer of a PolyNet reproduce the sparse L63 polynomial to numerical precision [25]. Matching the equilibrium norms of neural and polynomial regression asymptotically, a full polynomial (n, d) needs L = C(n+d, d) − (n + 1) hidden nodes [22, 25] for an exact match, where C(·,·) denotes the binomial coefficient. If a polynomial were instead represented by a network with direct input-output connections for the linear part, plus a single tanh-activated hidden layer for the residual nonlinear part, then matching the equilibrium norm yields the bound L ∝ [n/(2n + 1)] [C(n+d, d) − (n + 1)] hidden-layer units [22]. Eliminating constants using random bounded-input, bounded-weight networks reveals that an n = 3, d = 2 polynomial matches networks of 3 to 8 nodes with 95% confidence. Note that the standard network with just a single hidden tanh layer and no input-output bypass is sub-optimal, asymptotically yielding the bound L ∝ [n/(2n + 1)] [C(n+d, d) − 1] hidden-layer units [22].
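As a worked check of the counts quoted above for the L63 case (n = 3, d = 2), a few lines suffice (variable names are ours):

from math import comb

n, d = 3, 2
full_poly_nodes = comb(n + d, d) - (n + 1)                 # exact match, full polynomial: 10 - 4 = 6
bypass_bound = n / (2 * n + 1) * full_poly_nodes           # hidden layer with input-output bypass: ~2.6
no_bypass_bound = n / (2 * n + 1) * (comb(n + d, d) - 1)   # no bypass (sub-optimal): ~3.9
coarse_bound = n ** (6 * d)                                # Omega(n^{6d}/eps^3) at eps ~ 1: 531441
print(full_poly_nodes, bypass_bound, no_bypass_bound, coarse_bound)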

A Taylor expansion of the sigmoid function to third order, tanh(x) = x − x^3/3 + O(x^5), allows Eq. (3) to be modeled as a polynomial of degree 3 (the NN polynomial). We further require all coefficients of the NN polynomial to equal those in Eq. (1). Then, for an NN with L hidden nodes, biases, and n-dimensional input/output, a total of 2nL + n + L parameters should satisfy 3 C(n+3, 3) constraining equations. For a good fit, the parameters should be at least as numerous as the constraints, hence at least L = ⌈(3 C(n+3, 3) − n)/(2n + 1)⌉ = 9 hidden nodes are needed. To obtain an error estimate, we assume that the prediction errors between the NN and its truncated polynomial dominate over the errors between the NN and L63; the truncation errors of the NN polynomial therefore provide an upper bound on the prediction error of the NN. We estimate the truncation error by substituting Table I into the NN polynomial to obtain Φ_{NN-poly}, and by calculating the expected error over data sampled from the A_NN attractor: ε^2 = ⟨(Φ_{NN-poly} − Φ_NN)^2⟩_{A_NN}. 5000 random samples on A_NN give a normalized error of ε ∼ 0.12.
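The parameter-versus-constraint counting behind the estimate L = 9 can likewise be reproduced in a few lines; this is a sketch of the arithmetic only, not of the coefficient matching itself:

from math import comb, ceil

n = 3                                    # input/output dimension
constraints = 3 * comb(n + 3, 3)         # coefficients of a degree-3 polynomial in n = 3 variables,
                                         # matched for each of the 3 output components: 3 * 20 = 60
def n_params(L):
    return 2 * n * L + n + L             # entries of W1 and W2 plus biases b1, b2

L_min = ceil((constraints - n) / (2 * n + 1))   # smallest L with n_params(L) >= constraints
print(L_min, n_params(L_min) >= constraints)    # -> 9 True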

Discussion – Our work suggests that NNs may be good candidates to learn from data and to represent a broad range of chaotic dynamics with good generalization skill. With their flow-like dynamics and gradient-descent training, they may serve as non-parametric models for chaotic systems without explicit expressions. Conversely, neural networks can be seen as a much more general class of chaotic systems. Apart from the compression and expansion operations that are necessary for chaos, the higher-dimensional rotations are also vital in creating the flow-like dynamics. We may further posit that neural networks could be a unifying formulation for modeling chaotic dynamics, because they reproduce the Hénon map and the discrete Lorenz map under the same mathematical framework.

On the other hand, the compression operation represented by the sigmoid function makes NNs preferable for modeling simple dissipative systems; their ability to model conservative dynamics and systems of much higher dimension is yet to be tested. More work is also needed, possibly with the aid of Riemannian geometry, to fundamentally understand the geometric operations in the high-dimensional neuron space.

Acknowledgments – Ziwei Li was advised by Sai Ravela. Support from ONR grant N00014-19-1-2273, the MIT Environmental Solutions Initiative, the John S. and Maryann Montrym Fund, and the MIT Lincoln Laboratory is gratefully acknowledged.

[1] Andoni, A., Panigrahy, R., Valiant, G., and Zhang, L., in Proceedings of the 31st International Conference on Machine Learning (2014).
[2] Bahi, J. M., Couchot, J. F., Guyeux, C., and Salomon, M., Chaos 22, 013122 (2012).
[3] Bakker, R., Schouten, J. C., Lee Giles, C., Takens, F., and Van den Bleek, C. M., Neural Computation 12, 2355 (2000).
[4] Barreira, L. and Pesin, Y., in University Lecture Series, Vol. 23 (American Mathematical Society, Providence, RI, 2002) p. 151.
[5] Barron, A. R., IEEE Transactions on Information Theory 39, 930 (1993).
[6] Berge, P., Pomeau, Y., and Vidal, C., Order within Chaos (Wiley, 1987) p. 329.
[7] Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D., in 32nd Conference on Neural Information Processing Systems (2018).
[8] Dan Foresee, F. and Hagan, M. T., in Proceedings of International Conference on Neural Networks (1997).
[9] Dudul, S. V., Applied Soft Computing 5, 333 (2005).
[10] Funahashi, K. and Nakamura, Y., Neural Networks 6, 801 (1993).
[11] Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning (MIT Press, Cambridge, MA, 2016).
[12] Guckenheimer, J. and Williams, R. F., Publ. Math. IHES 50, 307 (1979).
[13] Haller, G., Physica D: Nonlinear Phenomena 149, 248 (2001).
[14] Hauser, M. and Ray, A., in 31st Conference on Neural Information Processing Systems (2017) p. 10.
[15] Hénon, M., Communications in Mathematical Physics 50, 69 (1976).
[16] Hornik, K., Neural Networks 4, 251 (1991).
[17] Hornik, K., Stinchcombe, M., and White, H., Neural Networks 2, 359 (1989).
[18] Lorenz, E. N., Journal of the Atmospheric Sciences 20, 130 (1963).
[19] Madondo, M. and Gibbons, T., in Proceedings of the Midwest Instruction and Computing Symposium (2018).
[20] The term learnability is used here in the sense of a neural system's fidelity to specified properties of a dynamical system, e.g., the predictability of the dynamical system quantified by the finite-time Lyapunov exponent. This is somewhat different from Valiant's definition [27].
[21] Ravela, S., "Tractable non-Gaussian representations in dynamic data driven coherent fluid mapping," in Handbook of Dynamic Data Driven Applications Systems, edited by E. Blasch, S. Ravela, and A. Aved (Springer International Publishing, Cham, 2018) pp. 29–46.
[22] Ravela, S., Li, Z., Trautner, M., and Reilly, S., Preprint (2019).
[23] Seidl, D. R. and Lorenz, R. D., in Proceedings of the International Joint Conference on Neural Networks 1991, Vol. 2 (IEEE, 1991) pp. 709–714.
[24] Strogatz, S. H., Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering, 2nd ed. (CRC Press, Boca Raton, FL, 2015) p. 531.
[25] Trautner, M. and Ravela, S., "Neural integration of continuous dynamics," (2019), arXiv:1911.10309 [cs.LG].
[26] Tucker, W., Foundations of Computational Mathematics 2, 53 (2002).
[27] Valiant, L. G., Commun. ACM 27, 1134 (1984).
[28] Viswanath, D., Lyapunov Exponents from Random Fibonacci Sequences to the Lorenz Equations, Ph.D. thesis, Cornell University, Ithaca, NY, USA (1998).
[29] Yu, R., Zheng, S., and Liu, Y., in Proceedings of the ICML 17 Workshop on Deep Structured Prediction (2017).
[30] Zerroug, A., Terrissa, L., and Faure, A., Annual Review of Chaos Theory, Bifurcations and Dynamical Systems 4, 55 (2013).
[31] Zhang, L., in 2017 IEEE 30th Canadian Conference on Electrical and Computer Engineering, 2 (2017) pp. 30–33.
[32] Zhang, L., in 2017 IEEE Life Sciences Conference (2017) pp. 39–42.