
Space of Functions Computed by Deep-Layered Machines

Alexander Mozeika,1,∗ Bo Li,2,† and David Saad2,‡

1London Institute for Mathematical Sciences, London, W1K 2XF, United Kingdom

2Nonlinearity and Complexity Research Group, Aston University, Birmingham, B4 7ET, United Kingdom

We study the space of functions computed by random-layered machines, including deep neural networks and Boolean circuits. Investigating the distribution of Boolean functions computed on the recurrent and layer-dependent architectures, we find that it is the same in both models. Depending on the initial conditions and computing elements used, we characterize the space of functions computed at the large depth limit and show that the macroscopic entropy of Boolean functions is either monotonically increasing or decreasing with the growing depth.

Deep-layered machines comprise multiple consecutive layers of basic computing elements aimed at representing an arbitrary function, where the first and final layers represent its input and output arguments, respectively. Notable examples include deep neural networks (DNNs) composed of perceptrons [1] and Boolean circuits constructed from logical gates [2]. Being universal approximators [3, 4], DNNs have been successfully employed in different machine learning applications [1]. Similarly, Boolean circuits can compute any Boolean function even when constructed from a single gate [5].

While the majority of DNN research focuses on their application in carrying out various learning tasks, it is equally important to establish the space of functions they typically represent for a given architecture and the computing elements used. One way to address such a generic study is to consider a random ensemble of DNNs. The study of random neural networks using methods of statistical physics has played an important role in understanding their typical properties for storage capacity and generalization ability [6, 7] and the properties of energy-based [8–12] and associative memory models [13, 14], as well as the links between energy-based models and feed-forward layered machines [15]. In parallel, there have been theoretical studies within the computer science community of the range of Boolean functions generated by random Boolean circuits [16, 17]. DNNs and Boolean circuits share common basic properties.

Characterizing the space of functions computed by random-layered machines is of great importance, since it sheds light on their approximation and generalization properties. However, it is also highly challenging due to the inherent recursiveness of computation and the randomness in their architecture and/or computing elements. Existing theoretical studies of the function space of deep-layered machines are mostly based on the mean-field approach, which allows for a sensitivity analysis of the functions realized by deep-layered machines due to input or parameter perturbations [4, 18–20].

To gain a complete and detailed understanding of the function space, we develop a path-integral formalism that directly examines the individual functions computed. This is carried out by processing all possible input configurations simultaneously, together with the corresponding outputs. For simplicity, we always consider Boolean functions with binary input and output variables.

The main contribution of this Letter is in providing a detailed understanding of the distribution of Boolean functions computed at each layer. It points to the equivalence between recurrent and layer-dependent architectures, and consequently to a potentially significant reduction in the number of trained free variables. Additionally, the complexity of the Boolean functions implemented, measured by their entropy, which depends on the number of layers and the computing elements used, exhibits a rapid simplification when rectified linear unit (ReLU) components are employed, which arguably explains their generalization successes.

Framework.––The layered machines considered consist of $L+1$ layers, each with $N$ nodes. Node $i$ at layer $l$ is connected to the set of nodes $\{i_1, i_2, \dots, i_k\}$ of layer $l-1$; its activity is determined by the gate $\alpha^l_i$, computing a function of $k$ inputs, according to the propagation rule

P(S^l_i|\vec{S}^{l-1}) = \delta\big[S^l_i, \alpha^l_i(S^{l-1}_{i_1}, S^{l-1}_{i_2}, \dots, S^{l-1}_{i_k})\big],   (1)

where $\delta$ is the Dirac or Kronecker delta function, depending on the domain of $S^l_i$. The probabilistic form of Eq. (1) adopted here is convenient for the generating functional analysis and the inclusion of noise [19, 21]. We primarily consider two structures here: (i) densely connected models, where $k=N$ and node $i$ is connected to all nodes of the previous layer; one such example is the fully connected neural network with $S^l_i = \alpha^l(H^l_i)$, where $H^l_i = \sum_{j=1}^{N} W^l_{ij} S^{l-1}_j/\sqrt{N} + b^l_i$ is the preactivation field and $\alpha^l$ is the activation function at layer $l$ (we will mainly focus on the case $b^l_i = 0$; the effect of nonzero bias is discussed in [22]); (ii) sparsely connected models, where $k \in O(N^0)$; examples include sparse neural networks and layered Boolean circuits where $\alpha^l_i$ is a Boolean gate with $k$ inputs, e.g., a majority gate.
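As a concrete illustration of the propagation rule in Eq. (1), the following sketch (our own illustration, not the authors' code) propagates the node states through one densely connected layer and one sparsely connected majority-gate layer; the sizes, seed, and choice of sign activation are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 1000, 3                                  # width and in-degree of the sparse layer

def dense_layer(S_prev, W, alpha):
    """Densely connected layer: S^l_i = alpha(H^l_i), H^l_i = sum_j W_ij S^{l-1}_j / sqrt(N)."""
    return alpha(W @ S_prev / np.sqrt(len(S_prev)))

def sparse_majority_layer(S_prev, conn):
    """Sparsely connected layer of k-input majority gates; conn[i] lists the k parents of node i."""
    return np.sign(S_prev[conn].sum(axis=1))

S0 = rng.choice([-1, 1], size=N)                # states of layer l-1
W = rng.normal(0.0, 1.0, size=(N, N))           # Gaussian weights of a layer-dependent DNN
conn = rng.integers(0, N, size=(N, k))          # random connectivity of a Boolean circuit layer
S1_dense = dense_layer(S0, W, np.sign)          # one DNN layer with sign activation
S1_circuit = sparse_majority_layer(S0, conn)    # one layer of MAJ-k gates (k odd, so no ties)
```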

Consider a binary input vector $\vec{s} = (s_1,\dots,s_n) \in \{-1,1\}^n$, which is fed to the initial layer $l=0$. To accommodate a broader set of functions, we also consider an augmented input vector, e.g., (i) $\vec{S}^{\rm I} = (\vec{s}, 1)$, which is equivalent to adding a bias variable in the context of neural networks; (ii) $\vec{S}^{\rm I} = (\vec{s}, -\vec{s}, 1, -1)$, which has been used to construct all Boolean functions [16]. Each node $i$ at layer 0 points to a randomly chosen element of $\vec{S}^{\rm I}$ such that

P_0(\vec{S}^0|\vec{s}) = \prod_{i=1}^{N} P_0\big(S^0_i\big|S^{\rm I}_{n_i}(\vec{s})\big) = \prod_{i=1}^{N} \delta\big[S^0_i, S^{\rm I}_{n_i}(\vec{s})\big],   (2)

where $n_i = 1,\dots,|\vec{S}^{\rm I}|$ is an index chosen from the flat distribution $P(n_i) = 1/|\vec{S}^{\rm I}|$.

Figure 1. A deep-layered machine computing all possible $2^n$ inputs. The direction of computation is from bottom to top. The binary string $\boldsymbol{S}^L \in \{-1,1\}^{2^n}$ represents the Boolean function computed on the blue nodes of the output layer $L$. The augmented vector $\vec{S}^{\rm I} = (\vec{s}, 1)$ is used as an example of input here. The constant 1 is represented by the dashed circle.

The computation of the layered machine is governed by the propagator $P(\vec{S}^L|\vec{s}) = \sum_{\vec{S}^{L-1}\cdots\vec{S}^0} P(\vec{S}^0|\vec{s}) \prod_{l=1}^{L} P(\vec{S}^l|\vec{S}^{l-1})$, where each node at layer $L$ computes a Boolean function $\{-1,1\}^n \to \{-1,1\}$. When the gates $\alpha^l_i$ or the network topology are random, the layered machine can be viewed as a disordered dynamical system with quenched disorder [19, 21]. To probe the functions being computed, we consider the simultaneous layer propagation of all possible inputs $\vec{s}_\gamma \in \{-1,1\}^n$, labeled by $\gamma = 1,\dots,2^n$ and governed by the product propagator $\prod_{\gamma=1}^{2^n} P(\vec{S}^L_\gamma|\vec{s}_\gamma)$. The binary string $\boldsymbol{S}^L_i \in \{-1,1\}^{2^n}$ represents the Boolean function computed at node $i$ of layer $L$, as illustrated in Fig. 1. Note that we use the vector notations $\vec{S}^l = (S^l_1,\dots,S^l_i,\dots,S^l_N)$ and $\boldsymbol{S}^l_i = (S^l_{i,1},\dots,S^l_{i,\gamma},\dots,S^l_{i,2^n})$ to represent the states and functions, respectively. Using the above formalism, the distribution of Boolean functions $\boldsymbol{f}$ computed on the final layer is given by

P^L_N(\boldsymbol{f}) = \frac{1}{N} \sum_{i=1}^{N} \Big\langle \prod_{\gamma=1}^{2^n} \delta\big[f_\gamma, S^L_{i,\gamma}\big] \Big\rangle,   (3)

where the components of $\boldsymbol{f}$ satisfy $f_\gamma = f(\vec{s}_\gamma)$, and the angular brackets represent the average generated by $\prod_{\gamma=1}^{2^n} P(\vec{S}^L_\gamma|\vec{s}_\gamma)$. To compute $P^L_N(\boldsymbol{f})$ and averages of other macroscopic observables, which are expected to be self-averaging for $N\to\infty$ [23], we introduce the disorder-averaged generating functional (GF) $\Gamma[\{\psi^l_{i,\gamma}\}] = \overline{\sum_{\{\vec{S}^l_\gamma\}} \prod_\gamma P(\vec{S}^0_\gamma|\vec{S}^{\rm I}_\gamma) \prod_l P(\vec{S}^l_\gamma|\vec{S}^{l-1}_\gamma)\, e^{-{\rm i}\sum_{l,i,\gamma}\psi^l_{i,\gamma} S^l_{i,\gamma}}}$, where the overline denotes an average over the quenched disorder. To keep the presentation concise, we outline the GF formalism only for DNNs in the following and refer the reader to [22] for the details of the derivation used in Boolean circuits.
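To make the object $P^L_N(\boldsymbol{f})$ concrete, the following numerical sketch (our own illustration, not from the paper) feeds all $2^n$ inputs through a random layer-dependent sign-activation DNN and histograms the Boolean functions (truth tables of length $2^n$) computed by the output nodes; the choice $\vec{S}^{\rm I}=(\vec{s},1)$, the small sizes, and the seed are assumptions for the example.

```python
import itertools
from collections import Counter
import numpy as np

rng = np.random.default_rng(1)
n, N, L = 2, 500, 5                                   # input bits, width, depth

# All 2^n inputs, augmented with a constant +1 component: S^I = (s, 1).
inputs = np.array(list(itertools.product([-1, 1], repeat=n)))          # shape (2^n, n)
S_I = np.hstack([inputs, np.ones((2**n, 1))])                          # shape (2^n, n+1)

# Layer 0: each node copies a randomly chosen component of S^I, as in Eq. (2).
S = S_I[:, rng.integers(0, n + 1, size=N)]                             # shape (2^n, N)

# Layer-dependent Gaussian weights; sign activation at every layer.
for l in range(L):
    W = rng.normal(0.0, 1.0, size=(N, N))
    S = np.sign(S @ W.T / np.sqrt(N))                                  # propagate all inputs at once

# Column i of S is the truth table (Boolean function) computed by output node i.
counts = Counter(tuple(col) for col in S.T)
P_emp = {f: c / N for f, c in counts.items()}                          # empirical P^L_N(f)
```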

Layer-dependent and recurrent architectures.––We focus on two different architectures: layer-dependent architectures, where the gates and/or connections are different from layer to layer, and recurrent architectures, where the gates and connections are shared across all layers. Both architectures represent feed-forward machines that implement input-output mappings.

Specifically, we assume that the weights $W^l_{ij}$ in fully connected DNNs with layer-dependent architectures are independent Gaussian random variables sampled from $\mathcal{N}(0,\sigma^2)$. In DNNs with recurrent architectures, the weights are sampled once and shared among layers, i.e., $W^{l+1}_{ij} = W^l_{ij}$. We apply the sign activation function in the final layer, i.e., $\alpha^L(h^L_i) = {\rm sgn}(h^L_i)$, to ensure that the output of the DNN is Boolean.

We first outline the derivation for fully connected recurrent architectures. It is sufficient to characterize the disorder-averaged GF by introducing the cross-layer overlaps $q^{l,l'}_{\gamma\gamma'} = (1/N)\sum_i \langle S^l_{i,\gamma} S^{l'}_{i,\gamma'}\rangle$ as order parameters and the corresponding conjugate order parameters $Q^{l,l'}_{\gamma\gamma'}$, which leads to a saddle-point integral $\Gamma = \int\{{\rm d}q\,{\rm d}Q\}\, e^{N\Psi[q,Q]}$ with the potential [22]

\Psi = {\rm i}\,{\rm Tr}\{qQ\} + \sum_{m=1}^{|\vec{S}^{\rm I}|} P(m) \ln \sum_{\boldsymbol{S}} \int {\rm d}\boldsymbol{H}\, M_m[\boldsymbol{H},\boldsymbol{S}],   (4)

where $M_m[\boldsymbol{H},\boldsymbol{S}]$ is an effective single-site measure

M_m = e^{-{\rm i}\sum_{l,\gamma}\psi^l_\gamma S^l_\gamma - {\rm i}\sum_{ll',\gamma\gamma'} Q^{l,l'}_{\gamma\gamma'} S^l_\gamma S^{l'}_{\gamma'}} \times \mathcal{N}(\boldsymbol{H}|0,\boldsymbol{C}) \prod_{\gamma=1}^{2^n} P_0(S^0_\gamma|S^{\rm I}_{m,\gamma}) \prod_{l=1}^{L} \delta\big[S^l_\gamma, \alpha^l(h^l_\gamma)\big].   (5)

Due to weight sharing, the preactivation fields $\boldsymbol{H} = (\boldsymbol{h}^1,\dots,\boldsymbol{h}^L)$, where $\boldsymbol{h}^l \in \mathbb{R}^{2^n}$, are governed by the Gaussian distribution $\mathcal{N}(\boldsymbol{H}|0,\boldsymbol{C})$ and correlated across layers with covariance $[\boldsymbol{C}]^{l,l'}_{\gamma\gamma'} = \sigma^2 q^{l-1,l'-1}_{\gamma\gamma'}$. Setting $\psi^l_\gamma$ to zero and differentiating $\Psi$ with respect to $\{q^{l,l'}_{\gamma\gamma'}, Q^{l,l'}_{\gamma\gamma'}\}$ yields the saddle point of the potential $\Psi$ dominating $\Gamma$ for $N\to\infty$, at which the conjugate order parameters $Q^{l,l'}_{\gamma\gamma'}$ vanish [22], leading to

q^{l,l'}_{\gamma\gamma'} = \begin{cases} \sum_m P(m)\,\langle S^l_\gamma S^0_{\gamma'}\rangle_{M_m}, & l'=0,\\ \int {\rm d}\boldsymbol{H}\, \alpha^l(h^l_\gamma)\, \alpha^{l'}(h^{l'}_{\gamma'})\, \mathcal{N}(\boldsymbol{H}|0,\boldsymbol{C}), & l'>0. \end{cases}   (6)

Notice that in the above Gaussian average, all preactivation fields but the pair $\{h^l_\gamma, h^{l'}_{\gamma'}\}$ can be integrated out, reducing it to a tractable two-dimensional integral.
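A minimal numerical sketch of this two-dimensional reduction (ours, not the authors' code) is given below: it evaluates the equal-layer update of Eq. (6), $q^{l}_{\gamma\gamma'} = \mathbb{E}[\alpha(h_\gamma)\alpha(h_{\gamma'})]$ with $(h_\gamma,h_{\gamma'})\sim\mathcal{N}(0,\sigma^2 q^{l-1})$, by two-dimensional Gauss-Hermite quadrature; the quadrature order, jitter, and `alpha` are assumptions for illustration.

```python
import numpy as np

def overlap_update(q_prev, alpha, sigma2=1.0, deg=40):
    """One equal-layer step of Eq. (6): q^l = E[alpha(h) alpha(h')] under the
    bivariate Gaussian of each pair of preactivation fields."""
    z, w = np.polynomial.hermite_e.hermegauss(deg)   # nodes/weights for exp(-z^2/2)
    w = w / np.sqrt(2.0 * np.pi)                     # normalize to the standard normal
    M = q_prev.shape[0]
    q_new = np.empty_like(q_prev)
    Z1, Z2 = np.meshgrid(z, z, indexing="ij")
    for g in range(M):
        for gp in range(M):
            cov = sigma2 * np.array([[q_prev[g, g],  q_prev[g, gp]],
                                     [q_prev[gp, g], q_prev[gp, gp]]])
            L = np.linalg.cholesky(cov + 1e-12 * np.eye(2))
            H = np.tensordot(L, np.stack([Z1, Z2]), axes=1)   # correlated pair of fields
            q_new[g, gp] = np.einsum("i,j,ij->", w, w, alpha(H[0]) * alpha(H[1]))
    return q_new
```

For instance, `overlap_update(q0, np.sign)` iterated from an initial overlap matrix `q0` reproduces the forward dynamics of the overlaps used below.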

The GF analysis can be performed similarly for layer-dependent architectures. Here the result has the same form as Eq. (6) with $q^{l,l'}_{\gamma\gamma'} = \delta_{l,l'} q^{l,l'}_{\gamma\gamma'}$, i.e., the overlaps between different layers are absent [22], implying $[\boldsymbol{C}]^{l,l'}_{\gamma\gamma'} = \sigma^2 \delta_{l-1,l'-1} q^{l-1,l'-1}_{\gamma\gamma'}$ for the covariances of the preactivation fields. In this case, we denote the equal-layer covariance matrix as $\boldsymbol{c}^l := \boldsymbol{C}^{l,l}$.

We remark that the behavior of DNNs with layer-dependent architectures in the limit $N\to\infty$ can also be studied by mapping to Gaussian processes [4, 18, 24]. However, it is not clear if such an analysis is possible in the highly correlated recurrent case, while the GF or path-integral framework is still applicable [25–27].

Marginalizing the effective single-site measure in Eq. (5) gives rise to the distribution of Boolean functions $\boldsymbol{f} \in \{-1,1\}^{2^n}$ computed at layer $L$ of DNNs with recurrent architectures,

P^L(\boldsymbol{f}) = \int {\rm d}\boldsymbol{h}\, \mathcal{N}(\boldsymbol{h}|0,\boldsymbol{c}^L) \prod_{\gamma=1}^{2^n} \delta\big[f_\gamma, \alpha^L(h_\gamma)\big],   (7)

where the element of the covariance matrix is $c^L_{\gamma\gamma'} = C^{L,L}_{\gamma\gamma'} = \sigma^2 q^{L-1,L-1}_{\gamma\gamma'}$. Note that the physical meaning of $P^L(\boldsymbol{f})$ is the distribution of Boolean functions defined in Eq. (3) averaged over the disorder, $P^L(\boldsymbol{f}) = \lim_{N\to\infty} P^L_N(\boldsymbol{f})$.

Moreover, Eq. (7) also applies to layer-dependent architectures, since the equal-layer covariance matrix $\boldsymbol{c}^L$ is the same in the two scenarios. We therefore arrive at the first important conclusion, that the typical sets of Boolean functions computed at the output layer $L$ by the layer-dependent and recurrent architectures are identical. Furthermore, if the gate functions $\alpha^l$ are odd, then it can be shown that all the cross-layer overlaps $q^{l,l'}_{\gamma\gamma'}$ of the recurrent architectures vanish, implying the statistical equivalence of the hidden-layer activities to those of the layered architectures as well [22].

Figure 2. Test accuracy of trained fully connected DNNs applied to the MNIST dataset. Images have been downsampled by a factor of 2 to reduce training time, and each hidden layer has 128 nodes. Each data point is averaged over 5 random initializations. The accuracies of recurrent architectures, with weight sharing between hidden layers, are comparable to those of layer-dependent architectures.

A similar GF analysis can be applied to sparsely connected Boolean circuits constructed from a single Boolean gate $\alpha$, keeping in mind that distributions of gates can be easily accommodated. In such models, the source of disorder is the random connections. In layer-dependent architectures, a gate is connected randomly to exactly $k \in O(N^0)$ gates from the previous layer, and this connectivity pattern changes from layer to layer. In recurrent architectures, on the other hand, the random connections are sampled once and the connectivity pattern is shared among layers. Note that in Boolean circuits, the activities $S^l_i$ at every layer always represent a Boolean function. For layer-dependent architectures, investigating the distribution of activities gives rise to

P^{l+1}(\boldsymbol{f}) = \sum_{\boldsymbol{f}_1,\dots,\boldsymbol{f}_k} \prod_{j=1}^{k} P^l(\boldsymbol{f}_j) \prod_{\gamma=1}^{2^n} \delta\big[f_\gamma, \alpha(f_{1,\gamma},\dots,f_{k,\gamma})\big],   (8)

which describes how the probability of the Boolean function $\boldsymbol{f} \in \{-1,1\}^{2^n}$ evolves from layer to layer [28] [22].
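Equation (8) can be iterated exactly for small $n$. The sketch below (our own illustration, not the authors' code) represents $P^l$ as a dictionary over truth tables and applies a $k$-input gate componentwise; the choice of a 3-input majority gate and the initial condition $\vec{S}^{\rm I} = (\vec{s},-\vec{s},1,-1)$ are taken from the examples discussed later in the text.

```python
import itertools
import numpy as np

def layer_update(P, gate, k):
    """One step of Eq. (8): build P^{l+1} from P^l for a k-input gate applied
    componentwise to the 2^n-component truth tables f_1, ..., f_k."""
    P_next = {}
    funcs, probs = zip(*P.items())
    for combo in itertools.product(range(len(funcs)), repeat=k):
        p = np.prod([probs[j] for j in combo])
        f_new = tuple(gate(*cols) for cols in zip(*(funcs[j] for j in combo)))
        P_next[f_new] = P_next.get(f_new, 0.0) + p
    return P_next

# Initial distribution P^0: uniform over the components of S^I = (s, -s, 1, -1) for n = 2.
inputs = list(itertools.product([-1, 1], repeat=2))             # the 2^n = 4 input patterns
components = [tuple(s[m] for s in inputs) for m in range(2)]    # truth tables of s_1 and s_2
components += [tuple(-v for v in c) for c in components]        # -s_1 and -s_2
components += [(1, 1, 1, 1), (-1, -1, -1, -1)]                  # the constants +1 and -1
P = {}
for c in components:
    P[c] = P.get(c, 0.0) + 1.0 / len(components)

maj3 = lambda a, b, c: int(np.sign(a + b + c))                  # balanced 3-input majority gate
for _ in range(10):                                             # ten layers of MAJ3 gates
    P = layer_update(P, maj3, k=3)
```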

We note that for recurrent architectures the equation for the probability of the computed Boolean functions is exactly the same as above [22], suggesting that in random Boolean circuits the typical sets of Boolean functions computed on the layers of the layer-dependent and recurrent architectures are identical. Note that the coupling asymmetry plays a crucial role in this equivalence property [22, 29, 30].

The equivalence between the two architectures points to a potential reduction in the number of free parameters in layered machines by weight sharing or connectivity sharing among layers, useful in devices with limited computational resources [31]. For illustration, we consider the image recognition task of the Modified National Institute of Standards and Technology (MNIST) handwritten digit data [32] using DNNs with both layer-dependent and recurrent architectures (weights shared from hidden to hidden layers only; for details see [22]). The experiment shown in Fig. 2 demonstrates the feasibility of using recurrent architectures to perform image classification tasks with a slightly lower accuracy but a significant saving in the number of trained parameters.

Boolean functions computed at large depth.––We consider the typical Boolean functions computed in random-layered machines by examining $P^L(\boldsymbol{f})$ in the large depth limit $L\to\infty$ for specific gates in the following examples. In DNNs using the ReLU activation function $\alpha^l(x) = \max(x,0)$ in the hidden layers (the sign activation function is always used in the output layer), which is commonly used in applications, all covariance matrix elements $c^L_{\gamma\gamma'}$ in Eq. (7) converge to the same value in the limit $L\to\infty$, implying that all components of the preactivation field vector $\boldsymbol{h}$ are also the same and hence the components of $\boldsymbol{f}$ are identical. Therefore, random deep ReLU networks compute only constant Boolean functions in the infinite depth limit, echoing recent findings of a bias toward simple functions in random DNNs constructed from ReLUs [22], which arguably plays a role in their generalization ability [33, 34].

In DNNs using the sign activation function also in the hidden layers, i.e., where Eq. (1) enforces the rule $S^l_i = {\rm sgn}(\sum_j W^l_{ij} S^{l-1}_j/\sqrt{N})$, the cross-pattern overlaps $q^l_{\gamma\gamma'} = (1/N)\sum_i \langle S^l_{i,\gamma} S^l_{i,\gamma'}\rangle$ satisfying $|q^l_{\gamma\gamma'}|<1$ monotonically decrease with an increasing number of layers and vanish as $l\to\infty$; such a "chaotic" nature of the dynamics also holds in random DNNs with other sigmoidal activation functions such as the error and hyperbolic tangent functions [4, 24]. The consequence of this behavior is that for the input vector $\vec{S}^{\rm I} = \vec{s}$, $P^L(\boldsymbol{f})$ is uniform on the set of all odd functions [22], i.e., functions satisfying $f(-\vec{s}) = -f(\vec{s})$. Furthermore, for $\vec{S}^{\rm I} = (\vec{s},1)$, $P^L(\boldsymbol{f})$ is uniform on the set of all Boolean functions [22].

For Boolean circuits, there are also scenarios where the distribution $P^L(\boldsymbol{f})$ has a single Boolean function in its support or is uniform over some set of functions [16, 17, 35]. The latter depends on the gates $\alpha$ used in Eq. (1) and on the input vector $\vec{S}^{\rm I}$. For example, the AND gate with $\alpha(S_1,S_2) = {\rm sgn}(S_1+S_2+1)$ and the OR gate with $\alpha(S_1,S_2) = {\rm sgn}(S_1+S_2-1)$ [22] have outputs biased toward $+1$ and $-1$, respectively [16, 22, 35]. The consequence of this bias is that the distribution $P^L(\boldsymbol{f})$ has only a single Boolean function in its support [22, 35]. On the other hand, when the majority gate $\alpha(S_1,\dots,S_k) = {\rm sgn}(\sum_{j=1}^{k} S_j)$, which is balanced, $\sum_{S_1,\dots,S_k}\alpha(S_1,\dots,S_k) = 0$, and nonlinear [36], is used with the input vector $\vec{S}^{\rm I} = (\vec{s},-\vec{s},1,-1)$, then the distribution $P^L(\boldsymbol{f})$ is uniform over all Boolean functions [35], which is consistent with the result of [16].

Entropy of Boolean functions.––Having considered the distribution of Boolean functions for a few different examples, we observed that random-layered machines either reduce to a single Boolean function or compute all (or a subset of) functions with uniform probability at layer $L$, as $L\to\infty$. We note that for the Shannon entropy over Boolean functions, $H^L = -\sum_{\boldsymbol{f}} P^L(\boldsymbol{f}) \log P^L(\boldsymbol{f})$, these two scenarios saturate its lower and upper bounds, respectively, given by 0 and $2^n \log 2$. Thus, the entropy $H^L$ can be seen, at least intuitively, as a measure of function-space complexity.
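As a rough numerical illustration of this quantity (our own sketch, not from the paper), the snippet below tracks $H^L$ for a sign-activation DNN with $n=2$ and $\vec{S}^{\rm I}=(\vec{s},1)$: it iterates the overlap recursion with the closed-form arcsine map for sign activation (given in the Supplemental Material) and estimates $P^L(\boldsymbol{f})$ of Eq. (7) by Monte Carlo sampling of the Gaussian preactivation field; the sample size and jitter are assumptions for the example.

```python
import itertools
from collections import Counter
import numpy as np

rng = np.random.default_rng(2)
n = 2
inputs = np.array(list(itertools.product([-1, 1], repeat=n)))        # the 2^n = 4 inputs
S_I = np.hstack([inputs, np.ones((2**n, 1))])                        # augmented inputs (s, 1)
q = (S_I @ S_I.T) / (n + 1)                                          # initial overlaps q^0

for L in range(1, 11):
    q = (2.0 / np.pi) * np.arcsin(np.clip(q, -1.0, 1.0))             # sign-activation overlap map
    cov = q + 1e-9 * np.eye(2**n)                                    # covariance c^L with sigma = 1
    h = rng.multivariate_normal(np.zeros(2**n), cov, size=200000)    # sample preactivation fields
    counts = Counter(tuple(row) for row in np.sign(h).astype(int))   # sampled Boolean functions
    P = np.array(list(counts.values())) / h.shape[0]
    H = -(P * np.log(P)).sum()
    print(L, H / (2**n * np.log(2)))                                 # normalized entropy H^L / (2^n log 2)
```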

In Fig. 3, we study the entropy $H^L$, computed using Eqs. (7) and (8), as a function of the depth $L$ in random-layered machines constructed from different activation functions or gates and computing different inputs. The initial increase in entropy after layer $L=0$, seen in Figs. 3(a) and 3(b), can be explained by the properties of the gates used and the initial set of (simple) Boolean functions at layer $L=0$; functions from layer $L=0$ are "copied" onto layer $L=1$, while new functions are also created, as illustrated in Fig. 3(c), (d). Note that the minimal depth for ReLU networks to produce a Boolean function is $L=2$. The dependence of the entropy $H^L$ on $L$ after the initial increase depends on the specific gate functions used. For the ReLU activation function in DNNs and the AND gate in Boolean circuits, the entropies $H^L$ monotonically decrease with $L$, suggesting that the sizes of the sets of typical Boolean functions computed decrease with an increasing number of layers $L$. Random initialization of layered machines with such gates/activation functions serves as a biasing prior toward a more restricted set of functions [33, 34]. On the other hand, for balanced gates with appropriate initial conditions, e.g., sign in DNNs and majority vote in Boolean circuits, the entropy $H^L$ increases monotonically with the depth $L$, indicating that the sizes of the sets of typical Boolean functions computed are increasing.

In summary, we present an analytical framework to examine the Boolean functions represented by random deep-layered machines, by considering all possible inputs simultaneously and applying the generating functional analysis to compute various relevant macroscopic quantities. We derive the probability of Boolean functions computed on the output nodes. Surprisingly, we discover that the typical sets of Boolean functions computed by the layer-dependent and recurrent architectures are identical. This points to the possibility of computing complex functions with a reduced number of parameters by weight or connection sharing, as showcased in an image classification experiment. We also study the Boolean functions computed by specific random-layered machines. Biased activation functions (e.g., ReLU) or biased Boolean gates (e.g., AND/OR) can lead to more restricted typical sets of Boolean functions found at deeper layers, which may explain their generalization ability. On the other hand, balanced activation functions (e.g., sign) or Boolean gates (e.g., majority), complemented with appropriate initial conditions, lead to a uniform distribution over all Boolean functions in the infinite depth limit. It will be interesting to investigate the functions realized by different DNN architectures with structured data and by different learning algorithms [7, 37–40].

We also showed the monotonic behavior of the entropy of Boolean functions as a function of depth, which is of interest in the field of computer science. We envisage that the insights gained and the methods developed will facilitate further study of deep-layered machines.

Figure 3. Normalized entropy and distribution of functions of deep-layered machines. (a) Normalized entropy $H^L/2^n$ of Boolean functions computed by DNNs with sign or ReLU activation in the hidden layers as a function of the network depth $L$; the initial condition is set as $\vec{S}^{\rm I} = (\vec{s},1)$. (b) $H^L/2^n$ vs $L$ for Boolean circuits constructed from MAJ3 or AND gates with initial condition $\vec{S}^{\rm I} = (\vec{s},-\vec{s},1,-1)$. (c) The distribution of Boolean functions $P^L(\boldsymbol{f})$ computed by Boolean circuits with two inputs, $n=2$ (the number of all possible functions is 16), represented by the sizes of circles on a 4×4 grid. Upper panel: MAJ3-gate-based circuits, in which more functions are created at larger depth $L$ and $P^L(\boldsymbol{f})$ converges to a uniform distribution. Lower panel: AND-gate-based circuits, in which new functions are created from $L=0$ to $L=1$, while $P^L(\boldsymbol{f})$ converges to a distribution with support on a single Boolean function as the network depth increases.

B.L. and D.S. acknowledge support from the Leverhulme Trust (RPG-2018-092) and the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 835913. D.S. acknowledges support from the EPSRC program grant TRANSNET (EP/R035342/1).

∗alexander.mozeika@kcl.ac.uk

†b.li10@aston.ac.uk

‡d.saad@aston.ac.uk

[1] Yann LeCun, Yoshua Bengio, and Geoﬀrey Hinton. Deep

learning. Nature, 521(7553):436–444, 2015.

[2] Ryan O’Donnell. Analysis of Boolean Functions. Cam-

bridge University Press, New York, 2014.

[3] Kurt Hornik, Maxwell Stinchcombe, and Halbert White.

Multilayer feedforward networks are universal approxi-

mators. Neural Networks, 2(5):359 – 366, 1989.

[4] Ben Poole, Subhaneil Lahiri, Maithreyi Raghu, Jascha

Sohl-Dickstein, and Surya Ganguli. Exponential expres-

sivity in deep neural networks through transient chaos.

In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon,

and R. Garnett, editors, Advances in Neural Information

Processing Systems 29, pages 3360–3368. Curran Asso-

ciates, Inc., New York, 2016.

[5] N. Nisan and S. Schocken. The Elements of Computing

Systems: Building a Modern Computer from First Prin-

ciples. MIT Press, 2008.

[6] Andreas Engel and Christian Van den Broeck. Statisti-

cal Mechanics of Learning. Cambridge University Press,

New York, 2001.

[7] David Saad, editor. On-Line Learning in Neural Net-

works. Cambridge University Press, New York, 1998.

[8] Elena Agliari, Adriano Barra, Andrea Galluzzi,

Francesco Guerra, and Francesco Moauro. Multitasking

associative networks. Phys. Rev. Lett., 109:268101, 2012.

[9] Haiping Huang and Taro Toyoizumi. Advanced mean-

ﬁeld theory of the restricted boltzmann machine. Phys.

Rev. E, 91:050101, 2015.

[10] Marylou Gabrié, Eric W Tramel, and Florent Krza-

kala. Training restricted boltzmann machine via the

thouless-anderson-palmer free energy. In C. Cortes, N. D.

Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, edi-

tors, Advances in Neural Information Processing Systems

28, pages 640–648. Curran Associates, Inc., 2015.

[11] Marc Mézard. Mean-ﬁeld message-passing equations in

the hopﬁeld model and its generalizations. Phys. Rev. E,

95:022117, 2017.

[12] Adriano Barra, Giuseppe Genovese, Peter Sollich, and

Daniele Tantari. Phase diagram of restricted boltzmann

machines and generalized hopﬁeld networks with arbi-

trary priors. Phys. Rev. E, 97:022310, 2018.

[13] J J Hopﬁeld. Neural networks and physical systems with

emergent collective computational abilities. Proceedings

of the National Academy of Sciences, 79(8):2554–2558,

1982.

[14] J.A. Hertz, A.S. Krogh, and R.G. Palmer. Introduction

To The Theory Of Neural Computation. Addison-Wesley,

1991.

[15] J. J. Hopﬁeld. Learning algorithms and probability distri-

butions in feed-forward and feed-back networks. Proceed-

ings of the National Academy of Sciences, 84(23):8429–

8433, 1987.

[16] Petr Savický. Random boolean formulas representing any

boolean function with asymptotically equal probability.

Discrete Mathematics, 83(1):95 – 103, 1990.

[17] Alex Brodsky and Nicholas Pippenger. The boolean func-

tions computed by random boolean formulas or how to

grow the right function. Random Structures & Algo-

rithms, 27(4):490–519, 2005.

[18] Jaehoon Lee, Jascha Sohl-dickstein, Jeﬀrey Pennington,

Roman Novak, Sam Schoenholz, and Yasaman Bahri.

Deep neural networks as gaussian processes. In Pro-

ceedings of the 6th International Conference on Learning

Representations, 2018.

[19] Bo Li and David Saad. Exploring the function space of

deep-learning machines. Phys. Rev. Lett., 120:248301,

2018.

[20] Bo Li and David Saad. Large deviation analysis of

function sensitivity in random deep neural networks.

Journal of Physics A: Mathematical and Theoretical,

53(10):104002, 2020.

[21] Alexander Mozeika, David Saad, and Jack Raymond.

Computing with noise: Phase transitions in boolean for-

mulas. Phys. Rev. Lett., 103:248701, 2009.

[22] See Supplemental Material for details, which includes

Refs. [41–43].

[23] Marc Mézard, Giorgio Parisi, and Miguel Virasoro. Spin

glass theory and beyond: An Introduction to the Replica


Method and Its Applications, volume 9. World Scientiﬁc

Publishing Co Inc, 1987.

[24] Greg Yang and Hadi Salman. A ﬁne-grained spectral

perspective on neural networks. arXiv:1907.10599, 2019.

[25] A.C.C. Coolen. Chapter 15 statistical mechanics of re-

current neural networks ii - dynamics. In F. Moss and

S. Gielen, editors, Neuro-Informatics and Neural Mod-

elling, volume 4 of Handbook of Biological Physics, pages

619 – 684. North-Holland, 2001.

[26] Taro Toyoizumi and Haiping Huang. Structure of at-

tractors in randomly connected networks. Phys. Rev. E,

91:032802, 2015.

[27] A. Crisanti and H. Sompolinsky. Path integral approach

to random neural networks. Phys. Rev. E, 98:062120,

2018.

[28] Viewing the layers as time steps, the functions can be

seen as molecules of gas undergoing k-body collisions.

[29] B. Cessac. Increase in complexity in random neural net-

works. J. Phys. I France, 5(3):409–432, 1995.

[30] J P L Hatchett, B Wemmenhove, I Pérez Castillo,

T Nikoletopoulos, N S Skantzos, and A C C Coolen. Par-

allel dynamics of disordered ising spin systems on ﬁnitely

connected random graphs. Journal of Physics A: Math-

ematical and General, 37(24):6201–6220, 2004.

[31] Y. Cheng, D. Wang, P. Zhou, and T. Zhang. Model

compression and acceleration for deep neural networks:

The principles, progress, and challenges. IEEE Signal

Processing Magazine, 35(1):126–136, 2018.

[32] Yann LeCun, Corinna Cortes, and Christopher J. Burges.

The MNIST Database of Handwritten Digits (1998),

http://yann.lecun.com/exdb/mnist/.

[33] Guillermo Valle-Perez, Chico Q. Camargo, and Ard A.

Louis. Deep learning generalizes because the parameter-

function map is biased towards simple functions. In Pro-

ceedings of the 7th International Conference on Learning

Representations. 2019.

[34] Giacomo De Palma, Bobak Kiani, and Seth Lloyd. Ran-

dom deep neural networks are biased towards simple

functions. In H. Wallach, H. Larochelle, A. Beygelz-

imer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,

Advances in Neural Information Processing Systems 32,

pages 1962–1974. Curran Associates, Inc., 2019.

[35] Alexander Mozeika, David Saad, and Jack Raymond.

Noisy random boolean formulae: A statistical physics

perspective. Phys. Rev. E, 82:041112, 2010.

[36] Since {Sj}are binary variables in the context of Boolean

circuits, linearity is deﬁned in the ﬁnite ﬁeld GF (2) [16,

17].

[37] Sebastian Goldt, Marc Mézard, Florent Krzakala, and

Lenka Zdeborová. Modelling the inﬂuence of data struc-

ture on learning in neural networks: the hidden manifold

model. arXiv:1909.11500, 2019.

[38] Pietro Rotondo, Mauro Pastore, and Marco Gherardi.

Beyond the storage capacity: Data-driven satisﬁability

transition. Phys. Rev. Lett., 125:120601, Sep 2020.

[39] Mauro Pastore, Pietro Rotondo, Vittorio Erba, and

Marco Gherardi. Statistical learning theory of structured

data. Phys. Rev. E, 102:032119, Sep 2020.

[40] Lenka Zdeborová. Understanding deep learning is also a

job for physicists. Nature Physics, 16(6):602–604, 2020.

[41] B Derrida, E Gardner, and A Zippelius. An exactly solv-

able asymmetric neural network model. Europhysics Let-

ters (EPL), 4(2):167–173, 1987.

[42] R. Kree and A. Zippelius. Continuous-time dynamics of

asymmetrically diluted neural networks. Phys. Rev. A,

36:4421–4427, 1987.

[43] Diederik P. Kingma and Jimmy Ba. Adam: A method for

stochastic optimization. In Proceedings of the 3rd Inter-

national Conference on Learning Representations. 2015.


Space of Functions Computed by Deep-Layered Machines

Supplemental Material

Alexander Mozeika,1 Bo Li,2 and David Saad2

1London Institute for Mathematical Sciences, London, W1K 2XF, United Kingdom

2Nonlinearity and Complexity Research Group, Aston University, Birmingham, B4 7ET, United Kingdom

I. CONVENTION OF NOTATION

We denote variables with overarrows as vectors with site indices (e.g., $i, j$), which can be of size $k$, $n$ or $N$. On the other hand, we denote bold-symbol variables as vectors of size $2^n$ with pattern indices (e.g., $\gamma, \gamma'$), or matrices of size $2^n\times 2^n$. For convenience, we define $M := 2^n$.

The function $\delta(\cdot,\cdot)$ stands for the Kronecker delta, $\delta(i,j) = \delta_{i,j}$, if the arguments $i, j$ are integer variables, while it stands for the Dirac delta function, $\delta(x,y) = \delta(x-y)$, if the arguments $x, y$ are continuous variables; in the latter case, the summation operation should be interpreted as integration, such that $\sum_y \delta(x,y) f(y) := \int {\rm d}y\, \delta(x-y) f(y)$.

The binary variables $S \in \{+1,-1\}$ in this work are mapped onto the conventional Boolean variables $z \in \{0,1\}$ ($z=0$ represents False, $z=1$ represents True) through $S = 1-2z$ ($S=+1$ represents False, $S=-1$ represents True). This choice of notation has the advantage that Boolean addition ($0+0=0$, $0+1=1$, $1+1=0$) can be represented as integer multiplication ($1\times 1=1$, $1\times(-1)=-1$, $(-1)\times(-1)=1$). Under this convention, the AND gate is defined as ${\rm sgn}(S_i+S_j+1)$, the OR gate is defined as ${\rm sgn}(S_i+S_j-1)$, while the majority vote gate is ${\rm sgn}(\sum_{j=1}^{k} S_j)$.
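A small self-contained check of this convention (our own sketch, not part of the paper) verifies the AND and OR gate definitions and the multiplication-as-XOR property against the conventional {0,1} Boolean variables:

```python
import itertools
import numpy as np

# Truth tables in the +/-1 convention of the paper: S = +1 is False, S = -1 is True.
AND = lambda S1, S2: int(np.sign(S1 + S2 + 1))    # -1 (True) only if both inputs are -1
OR = lambda S1, S2: int(np.sign(S1 + S2 - 1))     # +1 (False) only if both inputs are +1
MAJ3 = lambda S1, S2, S3: int(np.sign(S1 + S2 + S3))

to_bool = lambda S: (1 - S) // 2                  # map S in {+1, -1} to z in {0, 1}

for S1, S2 in itertools.product([+1, -1], repeat=2):
    z1, z2 = to_bool(S1), to_bool(S2)
    assert to_bool(AND(S1, S2)) == (z1 and z2)    # AND gate
    assert to_bool(OR(S1, S2)) == (z1 or z2)      # OR gate
    assert to_bool(S1 * S2) == (z1 ^ z2)          # Boolean addition (XOR) as multiplication
```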

II. GENERATING FUNCTIONAL ANALYSIS OF FULLY CONNECTED NEURAL NETWORKS

To probe the functions being computed by neural networks, we need to consider the layer propagation of all $2^n$ input patterns as $\prod_{\gamma=1}^{2^n} P(\vec{S}^L_\gamma|\vec{S}^{\rm I}_\gamma(\vec{s}_\gamma))$. We introduce the disorder-averaged generating functional in order to compute the macroscopic quantities

\Gamma[\{\psi^l_{i,\gamma}\}] = \overline{\sum_{\{S^l_{i,\gamma}\}\,\forall l,i,\gamma} \prod_{\gamma=1}^{2^n} P(\vec{S}^0_\gamma|\vec{S}^{\rm I}_\gamma) \prod_{l=1}^{L} P(\vec{S}^l_\gamma|\vec{S}^{l-1}_\gamma)\, e^{-{\rm i}\sum_{l,i,\gamma}\psi^l_{i,\gamma} S^l_{i,\gamma}}}
= \mathbb{E}_W \sum_{\{S^l_{i,\gamma}\}\,\forall l,i,\gamma} \int \prod_{l=1}^{L}\prod_{i,\gamma} \frac{{\rm d}h^l_{i,\gamma}\,{\rm d}x^l_{i,\gamma}}{2\pi}\, \prod_{\gamma=1}^{2^n} P(\vec{S}^0_\gamma|\vec{S}^{\rm I}_\gamma) \prod_{l=1}^{L} P(\vec{S}^l_\gamma|\vec{h}^l_\gamma)\, e^{-{\rm i}\sum_{l,i,\gamma}\psi^l_{i,\gamma} S^l_{i,\gamma}}
\quad\times \exp\Big(\sum_{l,i,\gamma} {\rm i}x^l_{i,\gamma} h^l_{i,\gamma} - \sum_{l,\gamma}\sum_{ij} \frac{\rm i}{\sqrt{N}} W^l_{ij} x^l_{i,\gamma} S^{l-1}_{j,\gamma}\Big),   (S1)

where we have introduced the notation $P(\vec{S}^l_\gamma|\vec{h}^l_\gamma) = \prod_{i=1}^{N} P(S^l_{i,\gamma}|h^l_{i,\gamma}) = \prod_{i=1}^{N} \delta[S^l_{i,\gamma}, \alpha^l(h^l_{i,\gamma})]$ and inserted the Fourier representation of unity $1 = \int \frac{{\rm d}h^l_{i,\gamma}\,{\rm d}x^l_{i,\gamma}}{2\pi} \exp[{\rm i}x^l_{i,\gamma}(h^l_{i,\gamma} - \sum_j W^l_{ij} S^{l-1}_{j,\gamma}/\sqrt{N})]$, $\forall l,i,\gamma$. Noisy computation can be easily accommodated in such a probabilistic formalism.

A. Layer-dependent Architectures

We first consider layer-dependent weights, where each element follows the Gaussian distribution $W^l_{ij} \sim \mathcal{N}(0,\sigma_w^2)$. Assuming self-averaging, averaging over the weight disorder in the last line of Eq. (S1) yields

\mathbb{E}_W \exp\Big(-\sum_{l=1}^{L}\sum_\gamma\sum_{ij} \frac{\rm i}{\sqrt{N}} W^l_{ij} x^l_{i,\gamma} S^{l-1}_{j,\gamma}\Big) = \exp\Big(-\frac{\sigma_w^2}{2}\sum_{l=1}^{L}\sum_{\gamma,\gamma'}\sum_i x^l_{i,\gamma} x^l_{i,\gamma'}\, \frac{1}{N}\sum_j S^{l-1}_{j,\gamma} S^{l-1}_{j,\gamma'}\Big).   (S2)

By introducing the overlap order parameters $\{q^l_{\gamma\gamma'}\}_{l=0}^{L}$ through the Fourier representation of unity

1 = \int \frac{{\rm d}Q^l_{\gamma\gamma'}\,{\rm d}q^l_{\gamma\gamma'}}{2\pi/N} \exp\Big[{\rm i}N Q^l_{\gamma\gamma'}\Big(q^l_{\gamma\gamma'} - \frac{1}{N}\sum_j S^l_{j,\gamma} S^l_{j,\gamma'}\Big)\Big],   (S3)

the generating functional can be factorized over sites as follows

\Gamma[\{\psi^l_{i,\gamma}\}] = \int \prod_{l=0}^{L}\prod_{\gamma\gamma'} \frac{{\rm d}Q^l_{\gamma\gamma'}\,{\rm d}q^l_{\gamma\gamma'}}{2\pi/N} \exp\Big({\rm i}N\sum_{l,\gamma\gamma'} Q^l_{\gamma\gamma'} q^l_{\gamma\gamma'}\Big) \times \exp\Big(\sum_{i=1}^{N} \log \int \prod_{l=1}^{L}\prod_\gamma {\rm d}h^l_{i,\gamma} \sum_{\{S^l_{i,\gamma}\}\,\forall l,\gamma} M_{n_i}(\boldsymbol{h}_i,\boldsymbol{S}_i)\Big)
= \int \prod_{l=0}^{L}\prod_{\gamma\gamma'} \frac{{\rm d}Q^l_{\gamma\gamma'}\,{\rm d}q^l_{\gamma\gamma'}}{2\pi/N} \exp\Big({\rm i}N\sum_{l,\gamma\gamma'} Q^l_{\gamma\gamma'} q^l_{\gamma\gamma'}\Big) \times \exp\Big(N\sum_{m=1}^{|\vec{S}^{\rm I}|} \frac{1}{N}\sum_{i=1}^{N}\delta(m,n_i) \log \int \prod_{l=1}^{L}\prod_\gamma {\rm d}h^l_\gamma \sum_{\{S^l_\gamma\}\,\forall l,\gamma} M_m(\boldsymbol{h},\boldsymbol{S})\Big),

where $\boldsymbol{h}_i,\boldsymbol{S}_i$ are shorthand notations for $\{\boldsymbol{h}^l_i\},\{\boldsymbol{S}^l_i\}$ with $\boldsymbol{h}^l_i := (h^l_{i,1},\dots,h^l_{i,\gamma},\dots,h^l_{i,2^n})$ and $\boldsymbol{S}^l_i := (S^l_{i,1},\dots,S^l_{i,\gamma},\dots,S^l_{i,2^n})$.

The single-site measure $M_m$ in the above expression is defined as

M_m(\boldsymbol{h}_i,\boldsymbol{S}_i) = e^{-{\rm i}\sum_{l,\gamma}\psi^l_{i,\gamma} S^l_{i,\gamma}} \prod_{\gamma=1}^{2^n} P(S^0_{i,\gamma}|S^{\rm I}_{m,\gamma}) \prod_{l=1}^{L} P(S^l_{i,\gamma}|h^l_{i,\gamma}) \exp\Big(-\sum_{l,\gamma\gamma'} {\rm i}Q^l_{\gamma\gamma'} S^l_{i,\gamma} S^l_{i,\gamma'}\Big)
\quad\times \prod_{l=1}^{L} \frac{1}{\sqrt{(2\pi)^{2^n}|\boldsymbol{c}^l|}} \exp\Big(-\frac{1}{2}\sum_{\gamma\gamma'} h^l_{i,\gamma} (\boldsymbol{c}^l)^{-1}_{\gamma\gamma'} h^l_{i,\gamma'}\Big).   (S4)

In Eq. (S4), $\boldsymbol{c}^l$ is a $2^n\times 2^n$ covariance matrix with elements $c^l_{\gamma\gamma'} = \sigma_w^2 q^{l-1}_{\gamma\gamma'}$ and $m$ is a random index following the empirical distribution $\frac{1}{N}\sum_{i=1}^{N}\delta(m,n_i)$.

Setting $\psi^l_{i,\gamma} = 0$ and considering $\lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N}\delta(m,n_i) \to P(m) = 1/|\vec{S}^{\rm I}|$, we arrive at

\Gamma = \int \{{\rm d}Q\,{\rm d}q\}\, e^{N\Psi(Q,q)},   (S5)

\Psi(Q,q) = \sum_{l=0}^{L}\sum_{\gamma\gamma'} {\rm i}Q^l_{\gamma\gamma'} q^l_{\gamma\gamma'} + \sum_{m=1}^{|\vec{S}^{\rm I}|} P(m) \log \int \prod_{l=1}^{L}\prod_\gamma {\rm d}h^l_\gamma \sum_{\{S^l_\gamma\}\,\forall l,\gamma} M_m(\boldsymbol{h},\boldsymbol{S}).   (S6)

The saddle point equations are obtained by computing $\partial\Psi/\partial q^l_{\gamma\gamma'} = 0$ and $\partial\Psi/\partial Q^l_{\gamma\gamma'} = 0$:

{\rm i}Q^{l-1}_{\gamma\gamma'} = -\sum_m P(m)\, \frac{\int {\rm d}\boldsymbol{h}\sum_{\boldsymbol{S}} \frac{\partial}{\partial q^{l-1}_{\gamma\gamma'}} M_m(\boldsymbol{h},\boldsymbol{S})}{\int {\rm d}\boldsymbol{h}\sum_{\boldsymbol{S}} M_m(\boldsymbol{h},\boldsymbol{S})}, \quad 1\le l\le L,   (S7)
{\rm i}Q^{L}_{\gamma\gamma'} = 0,   (S8)
q^l_{\gamma\gamma'} = \sum_{m=1}^{|\vec{S}^{\rm I}|} P(m)\, \langle S^l_\gamma S^l_{\gamma'}\rangle_{M_m}, \quad 0\le l\le L.   (S9)

Back-propagating the boundary condition ${\rm i}Q^{L}_{\gamma\gamma'} = 0$ results in ${\rm i}Q^{l}_{\gamma\gamma'} = 0$ for all $l$ [1].

The measure $M_m$ becomes

M_m(\boldsymbol{h},\boldsymbol{S}) = \prod_{\gamma=1}^{2^n} P(S^0_\gamma|S^{\rm I}_{m,\gamma}) \prod_{l=1}^{L} P(S^l_\gamma|h^l_\gamma) \times \prod_{l=1}^{L} \frac{1}{\sqrt{(2\pi)^{2^n}|\boldsymbol{c}^l|}} \exp\Big(-\frac{1}{2}\sum_{\gamma\gamma'} h^l_\gamma (\boldsymbol{c}^l)^{-1}_{\gamma\gamma'} h^l_{\gamma'}\Big),   (S10)

while the saddle point equations of the overlaps take the form

q^0_{\gamma\gamma'} = \sum_m P(m)\, S^{\rm I}_{m,\gamma} S^{\rm I}_{m,\gamma'},   (S11)
q^l_{\gamma\gamma'} = \int {\rm d}h^l_\gamma\,{\rm d}h^l_{\gamma'}\, \frac{\phi^l(h^l_\gamma)\,\phi^l(h^l_{\gamma'})}{\sqrt{(2\pi)^2|\boldsymbol{\Sigma}^l_{\gamma\gamma'}|}} \exp\Big(-\frac{1}{2}[h^l_\gamma, h^l_{\gamma'}] \cdot (\boldsymbol{\Sigma}^l_{\gamma\gamma'})^{-1} \cdot [h^l_\gamma, h^l_{\gamma'}]^\top\Big),   (S12)

where the $2\times 2$ covariance matrix $\boldsymbol{\Sigma}^l_{\gamma\gamma'}$ is defined as

\boldsymbol{\Sigma}^l_{\gamma\gamma'} := \sigma_w^2 \begin{pmatrix} q^{l-1}_{\gamma\gamma} & q^{l-1}_{\gamma\gamma'} \\ q^{l-1}_{\gamma'\gamma} & q^{l-1}_{\gamma'\gamma'} \end{pmatrix}.   (S13)

B. Recurrent Architectures

In this section, we consider a recurrent topology where the weights are independent of the layer, $W^l_{ij} = W_{ij} \sim \mathcal{N}(0,\sigma_w^2)$. The calculation resembles the case of layer-dependent weights, except that the disorder average yields cross-layer overlaps

\mathbb{E}_W \exp\Big(-\sum_{l=1}^{L}\sum_\gamma\sum_{ij} \frac{\rm i}{\sqrt{N}} W_{ij} x^l_{i,\gamma} S^{l-1}_{j,\gamma}\Big) = \exp\Big(-\frac{\sigma_w^2}{2}\sum_{l,l'=1}^{L}\sum_{\gamma,\gamma'}\sum_i x^l_{i,\gamma} x^{l'}_{i,\gamma'}\, \frac{1}{N}\sum_j S^{l-1}_{j,\gamma} S^{l'-1}_{j,\gamma'}\Big).   (S14)

Introducing the order parameters $q^{l,l'}_{\gamma\gamma'} := \frac{1}{N}\sum_j S^l_{j,\gamma} S^{l'}_{j,\gamma'}$ and setting $\psi^l_{i,\gamma} = 0$, we eventually obtain

\Gamma = \int \{{\rm d}Q\,{\rm d}q\}\, e^{N\Psi(Q,q)},   (S15)
\Psi(Q,q) = {\rm i}\,{\rm Tr}\{qQ\} + \sum_{m=1}^{|\vec{S}^{\rm I}|} P(m) \log \int \prod_{l=1}^{L}\prod_\gamma {\rm d}h^l_\gamma \sum_{\{S^l_\gamma\}\,\forall l,\gamma} M_m(\boldsymbol{h},\boldsymbol{S}),   (S16)
M_m(\boldsymbol{h},\boldsymbol{S}) = \prod_{\gamma=1}^{2^n} P(S^0_\gamma|S^{\rm I}_{m,\gamma}) \prod_{l=1}^{L} P(S^l_\gamma|h^l_\gamma) \exp\Big(-\sum_{ll',\gamma\gamma'} {\rm i}Q^{l,l'}_{\gamma\gamma'} S^l_\gamma S^{l'}_{\gamma'}\Big) \times \frac{1}{\sqrt{(2\pi)^{2^nL}|\boldsymbol{C}|}} \exp\Big(-\frac{1}{2}\boldsymbol{H}^\top \boldsymbol{C}^{-1} \boldsymbol{H}\Big),   (S17)

where ${\rm i}\,{\rm Tr}\{qQ\} = {\rm i}\sum_{l,l'=0}^{L}\sum_{\gamma\gamma'} Q^{l,l'}_{\gamma\gamma'} q^{l,l'}_{\gamma\gamma'}$ and $\boldsymbol{H} = (\boldsymbol{h}^1,\dots,\boldsymbol{h}^L) \in \mathbb{R}^{2^nL}$ expresses the preactivation fields of all patterns and all layers, while $\boldsymbol{C}$ is a $2^nL\times 2^nL$ covariance matrix.

The corresponding saddle point equations are

{\rm i}Q^{l-1,l'-1}_{\gamma\gamma'} = -\sum_m P(m)\, \frac{\int {\rm d}\boldsymbol{h}\sum_{\boldsymbol{S}} \frac{\partial}{\partial q^{l-1,l'-1}_{\gamma\gamma'}} M_m(\boldsymbol{h},\boldsymbol{S})}{\int {\rm d}\boldsymbol{h}\sum_{\boldsymbol{S}} M_m(\boldsymbol{h},\boldsymbol{S})}, \quad 1\le l,l'\le L,   (S18)
{\rm i}Q^{L,l}_{\gamma\gamma'} = 0, \quad \forall l,   (S19)
q^{l,l'}_{\gamma\gamma'} = \sum_m P(m)\, \langle S^l_\gamma S^{l'}_{\gamma'}\rangle_{M_m}, \quad 0\le l,l'\le L.   (S20)

All conjugate order parameters $\{{\rm i}Q^{l,l'}_{\gamma\gamma'}\}$ vanish identically, similar to the previous case, such that the effective single-site measure becomes

M_m(\boldsymbol{h},\boldsymbol{S}) = \prod_{\gamma=1}^{2^n} P(S^0_\gamma|S^{\rm I}_{m,\gamma}) \prod_{l=1}^{L} P(S^l_\gamma|h^l_\gamma) \times \frac{1}{\sqrt{(2\pi)^{2^nL}|\boldsymbol{C}|}} \exp\Big(-\frac{1}{2}\boldsymbol{H}^\top \boldsymbol{C}^{-1} \boldsymbol{H}\Big),   (S21)

and the saddle point equations of the order parameters follow

q^{0,0}_{\gamma\gamma'} = \sum_m P(m)\, S^{\rm I}_{m,\gamma} S^{\rm I}_{m,\gamma'},   (S22)
q^{l,0}_{\gamma\gamma'} = \sum_m P(m)\, \langle S^l_\gamma S^0_{\gamma'}\rangle_{M_m} = \sum_m P(m)\, S^{\rm I}_{m,\gamma'} \int {\rm d}h^l_\gamma\, \frac{\phi^l(h^l_\gamma)}{\sqrt{2\pi\sigma_w^2}} \exp\Big(-\frac{1}{2\sigma_w^2}(h^l_\gamma)^2\Big),   (S23)
q^{l,l'}_{\gamma\gamma'} = \int {\rm d}h^l_\gamma\,{\rm d}h^{l'}_{\gamma'}\, \frac{\phi^l(h^l_\gamma)\,\phi^{l'}(h^{l'}_{\gamma'})}{\sqrt{(2\pi)^2|\boldsymbol{\Sigma}^{l,l'}_{\gamma\gamma'}|}} \exp\Big(-\frac{1}{2}[h^l_\gamma, h^{l'}_{\gamma'}] \cdot (\boldsymbol{\Sigma}^{l,l'}_{\gamma\gamma'})^{-1} \cdot [h^l_\gamma, h^{l'}_{\gamma'}]^\top\Big),   (S24)

where the $2\times 2$ covariance matrix $\boldsymbol{\Sigma}^{l,l'}_{\gamma\gamma'}$ is defined as

\boldsymbol{\Sigma}^{l,l'}_{\gamma\gamma'} := \sigma_w^2 \begin{pmatrix} q^{l-1,l-1}_{\gamma\gamma} & q^{l-1,l'-1}_{\gamma\gamma'} \\ q^{l'-1,l-1}_{\gamma'\gamma} & q^{l'-1,l'-1}_{\gamma'\gamma'} \end{pmatrix}.   (S25)

A similar formalism was derived in the context of dynamical recurrent neural networks to study the autocorrelation of spin/neural dynamics [2].

C. Strong Equivalence Between Layer-dependent and Recurrent Architectures for Odd Activation Functions

In general, the statistical properties of the activities of machines with layer-dependent architectures and recurrent architectures are different, since the fields $\{\boldsymbol{h}^l\}$ of different layers are directly correlated in the latter case. However, one can observe that the equal-layer overlaps $q^{l,l}_{\gamma\gamma'}$ in the recurrent architectures are identical to $q^l_{\gamma\gamma'}$ in the layer-dependent architectures, by noticing the same initial condition in Eq. (S22) and Eq. (S11) and the same forward propagation rules in Eq. (S24) (with $l'=l$) and Eq. (S12).

If the cross-layer overlaps $\{q^{l,l'}_{\gamma\gamma'}\,|\,l\ne l'\}$ vanish, then the direct correlations between the fields $\boldsymbol{h}^l$ of different layers also vanish, such that

\frac{1}{\sqrt{(2\pi)^{2^nL}|\boldsymbol{C}|}} \exp\Big(-\frac{1}{2}\boldsymbol{H}^\top \boldsymbol{C}^{-1} \boldsymbol{H}\Big) = \prod_l \frac{1}{\sqrt{(2\pi)^{2^n}|\boldsymbol{c}^l|}} \exp\Big(-\frac{1}{2}(\boldsymbol{h}^l)^\top (\boldsymbol{c}^l)^{-1} \boldsymbol{h}^l\Big).   (S26)

In this case, the distributions of the macroscopic trajectories $\{\boldsymbol{h}^l,\boldsymbol{S}^l\}$ of the two architectures are equivalent. One sufficient condition for this to hold is that the activation functions $\phi^l(\cdot)$ are odd functions satisfying $\phi^l(-x) = -\phi^l(x)$. Firstly, this condition implies that $q^{l,0}_{\gamma\gamma'} = 0$ for all $l$ by Eq. (S23); secondly, $q^{l,0}_{\gamma\gamma'} = 0$ and the fact that $\phi^l(\cdot)$ is odd imply $q^{l+1,1}_{\gamma\gamma'} = 0$, which leads to $q^{l,l'}_{\gamma\gamma'} = 0$ for all $l\ne l'$ by induction.

D. Weak Equivalence Between Layer-dependent and Recurrent Architectures for General Activation Functions

As shown above, in general the trajectories $\{\boldsymbol{h}^l,\boldsymbol{S}^l\}$ of layer-dependent architectures follow a different distribution from the case of recurrent architectures with shared weights, except for some specific cases such as DNNs with odd activation functions. Here we focus on the distribution of activities in the output layer.

For layer-dependent weights, the joint distribution of the local fields and activations at layer $L$ is obtained by marginalizing the variables of the initial and hidden layers

P(\boldsymbol{h}^L,\boldsymbol{S}^L) = \int \prod_\gamma \frac{{\rm d}x^L_\gamma}{2\pi} \int \prod_{l=1}^{L-1}\prod_\gamma {\rm d}h^l_\gamma \sum_m P(m) \sum_{\{S^l_\gamma\}_{\forall\gamma,\,l<L}} M_m(\boldsymbol{h},\boldsymbol{S})
= \int \prod_{l=1}^{L-1} {\rm d}\boldsymbol{h}^l \sum_{\{S^l_\gamma\}_{\forall\gamma,\,l<L}} \prod_{l=1}^{L}\prod_\gamma P(S^l_\gamma|h^l_\gamma) \prod_{l=1}^{L} \frac{1}{\sqrt{(2\pi)^{2^n}|\boldsymbol{c}^l|}} \exp\Big(-\frac{1}{2}(\boldsymbol{h}^l)^\top (\boldsymbol{c}^l)^{-1} \boldsymbol{h}^l\Big)
= \mathcal{N}\big(\boldsymbol{h}^L\big|0,\boldsymbol{c}^L(q^{L-1})\big) \prod_{\gamma=1}^{2^n} P(S^L_\gamma|h^L_\gamma),   (S27)

where $\mathcal{N}(\boldsymbol{h}^L|0,\boldsymbol{c}^L(q^{L-1}))$ is a $2^n$-dimensional multivariate Gaussian distribution. The distribution of Boolean functions $f(\cdot)$ computed at layer $L$ is

P^L(\boldsymbol{f}) = \int {\rm d}\boldsymbol{h}^L \sum_{\boldsymbol{S}^L} P(\boldsymbol{h}^L,\boldsymbol{S}^L) \prod_{\gamma=1}^{2^n} \delta\big(S^L_\gamma, f(\vec{s}_\gamma)\big) = \int {\rm d}\boldsymbol{h}\, \mathcal{N}(\boldsymbol{h}|0,\boldsymbol{c}^L) \prod_{\gamma=1}^{2^n} \delta\big(f_\gamma, \alpha^L(h_\gamma)\big),   (S28)

where the binary string $\boldsymbol{f}$ of size $2^n$ represents the Boolean function $f(\cdot)$ with $f_\gamma = f(\vec{s}_\gamma)$.

For shared weights, the fields of all layers $\boldsymbol{H} = (\boldsymbol{h}^1,\dots,\boldsymbol{h}^l,\dots,\boldsymbol{h}^L) \in \mathbb{R}^{2^nL}$ are coupled with covariance $\boldsymbol{C}$,

P(\boldsymbol{h}^L,\boldsymbol{S}^L) = \int \prod_{l=1}^{L-1} {\rm d}\boldsymbol{h}^l \sum_{\{S^l_\gamma\}_{\forall\gamma,\,l<L}} \prod_{l=1}^{L}\prod_\gamma P(S^l_\gamma|h^l_\gamma)\, \frac{1}{\sqrt{(2\pi)^{2^nL}|\boldsymbol{C}|}} \exp\Big(-\frac{1}{2}\boldsymbol{H}^\top \boldsymbol{C}^{-1} \boldsymbol{H}\Big)
= \prod_\gamma P(S^L_\gamma|h^L_\gamma) \int \prod_{l=1}^{L-1} {\rm d}\boldsymbol{h}^l\, \frac{1}{\sqrt{(2\pi)^{2^nL}|\boldsymbol{C}|}} \exp\Big(-\frac{1}{2}\boldsymbol{H}^\top \boldsymbol{C}^{-1} \boldsymbol{H}\Big)
= \prod_\gamma P(S^L_\gamma|h^L_\gamma)\, \frac{1}{\sqrt{(2\pi)^{2^n}|\boldsymbol{C}^{L,L}|}} \exp\Big(-\frac{1}{2}(\boldsymbol{h}^L)^\top (\boldsymbol{C}^{L,L})^{-1} \boldsymbol{h}^L\Big)
= \mathcal{N}\big(\boldsymbol{h}^L\big|0,\boldsymbol{C}^{L,L}(q^{L-1,L-1})\big) \prod_{\gamma=1}^{2^n} P(S^L_\gamma|h^L_\gamma).   (S29)

Since the equal-layer overlap follows the same dynamical rule as in the case of layer-dependent weights, such that $\boldsymbol{C}^{L,L} = \boldsymbol{c}^L$, the distributions $P(\boldsymbol{h}^L,\boldsymbol{S}^L)$ of the two scenarios are equivalent. This suggests that if only the input-output mapping is of interest (but not the hidden-layer activity), the distributions of the Boolean functions $P^L(\boldsymbol{f})$ computed at the final layer of the two architectures are equivalent.

III. REMARK ON THE EQUIVALENCE PROPERTY

We observed that although auto-correlations generally exist in recurrent architectures, they do not complicate the single-layer macroscopic behaviors of the system studied. This is due to the fact that the weights/couplings being used are asymmetric (i.e., $W_{ij}$ and $W_{ji}$ are independent of each other), such that there are no intricate feedback interactions of a node with its state at previous time steps, which renders a Markovian single-layer macroscopic dynamics [3]. If symmetric couplings are present, then the whole history of the network is needed to characterize the dynamics [3]. A similar effect of coupling asymmetry was also observed in sparsely connected networks [4]. Early works investigating asymmetric coupling in neural networks include [5, 6].

Although we only consider finite input dimension, we expect that the equivalence property also holds in the cases where $n$ has the same order as $N$, as long as the couplings are asymmetric.

IV. EXTENSION OF THE THEORY ON DENSELY-CONNECTED NEURAL NETWORKS

While the theory on densely-connected neural networks was developed in the infinite width limit $N\to\infty$, we expect that it applies to large but finite systems (some properties investigated below require $N\gg n$). In [7], it is found that the order parameters in such systems satisfy the large deviation principle, which implies an exponential convergence rate to the typical behaviors as the width $N$ grows. Typically, order parameters of systems with $N\sim 10^3$ are well concentrated around their typical values as predicted by the theory.

Accommodating the cases where $n$ has the same order as $N$ (both tending to infinity) in the current framework is a bit subtle, as it requires an infinite number of order parameters and may result in the loss of the self-averaging properties.

While Boolean input variables are of primary interest here, the input domain can be generalized to any countable set; this is relevant for adapting the theory to real input variables with defined numerical precision, e.g., if a real input variable $s\in[0,1]$ is subject to a precision of 0.01, then only a finite number of possible values $s\in\{0, 0.01, 0.02, \dots, 1\}$ need to be considered. On the other hand, real input variables with arbitrary precision are difficult to deal with, as there is an uncountably infinite number of input patterns and the product $\prod_\gamma\cdots$ is ill-defined. It is worthwhile noting [8] that random neural networks on real spherical data share similar properties with those on Boolean data in high dimension; therefore we expect that the equivalence between layer-dependent and recurrent architectures also applies to real data.

Other than being able to treat the highly correlated recurrent architectures, the GF or path-integral framework also facilitates the characterization of fluctuations around the typical behaviors [9], and the computation of large deviations in finite-size systems [7].

V. TRAINING EXPERIMENTS

Figure 2 of the main text demonstrates the feasibility of using DNNs with recurrent architectures to perform an image recognition task on the MNIST handwritten digit data. In this section, we describe the details of the training experiment. The objective is not to achieve state-of-the-art performance, but to showcase the potential of using recurrent architectures for parameter reduction. Therefore, we preprocess the image data by downsampling with a factor of 2 through average pooling, which saves training runtime by reducing the size of each image from 28×28 to 14×14. See Fig. S1(a) for an example.

We consider DNNs of both architectures, layer-dependent and recurrent, where the input $\vec{s}$ is directly copied onto the initial layer $\vec{S}^0$, and a softmax function is applied to the final layer. We remark that the theory developed in this work is applicable to random-weight DNNs implementing Boolean functions, while it is not directly applicable to trained networks. For recurrent architectures, since the dimensions of the input and output layers are fixed, only the weights $W^{\rm hid}$ between hidden layers are shared, i.e.,

\vec{S}^0 \xrightarrow{W^{\rm in}} \vec{S}^1 \xrightarrow{W^{\rm hid}} \vec{S}^2 \xrightarrow{W^{\rm hid}} \cdots\, \vec{S}^l \xrightarrow{W^{\rm hid}} \vec{S}^{l+1} \xrightarrow{W^{\rm hid}} \cdots \xrightarrow{W^{\rm hid}} \vec{S}^{L-1} \xrightarrow{W^{\rm out}} \vec{S}^L,   (S30)

where all the hidden layers have the same width. The corresponding DNNs of both architectures are trained by the ADAM algorithm with back-propagation [10]. In Fig. S1(b), we demonstrate that for different widths of the hidden layers, DNNs with recurrent architectures can achieve performance that is comparable to those with layer-dependent architectures.
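A minimal PyTorch sketch of the weight-sharing scheme in Eq. (S30) is given below; it is our own illustration under the stated setup (downsampled 14×14 inputs, 128-node hidden layers, Adam), and the choice of ReLU hidden activations and the hyperparameters are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class SharedHiddenMLP(nn.Module):
    """Recurrent-architecture MLP: a single hidden-to-hidden weight matrix W_hid
    is reused at every hidden-to-hidden step, as in Eq. (S30)."""
    def __init__(self, in_dim=14 * 14, hidden=128, out_dim=10, depth=6):
        super().__init__()
        self.w_in = nn.Linear(in_dim, hidden)
        self.w_hid = nn.Linear(hidden, hidden)    # shared across all hidden layers
        self.w_out = nn.Linear(hidden, out_dim)
        self.depth = depth

    def forward(self, x):
        h = torch.relu(self.w_in(x))
        for _ in range(self.depth - 2):           # reuse the same weights at every hidden layer
            h = torch.relu(self.w_hid(h))
        return self.w_out(h)                      # logits; softmax is applied inside the loss

model = SharedHiddenMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
# Training loop over the (downsampled) MNIST batches would go here, e.g.:
# loss = loss_fn(model(x_batch.view(x_batch.size(0), -1)), y_batch); loss.backward(); opt.step()
```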

VI. BOOLEAN FUNCTIONS COMPUTED BY RANDOM DNNS

To examine the distribution of Boolean functions computed at layer $L$ (we always apply the sign activation function in the final layer), notice that nodes at layer $L$ are not coupled together, so it is sufficient to consider a particular node in the final layer, which follows the distribution of the effective single-site measure established before.

Further notice that the local field $\boldsymbol{h}^L \in \mathbb{R}^{2^n}$ in the final layer follows a multivariate Gaussian distribution with zero mean and covariance $c^L_{\gamma\gamma'} = \sigma_w^2 q^{L-1}_{\gamma\gamma'}$. Essentially, the local field $\boldsymbol{h}^L$ is a Gaussian process with a dot-product kernel (in the limit $N\to\infty$) [8, 11]

k(\vec{x},\vec{x}') = k\Big(\frac{\vec{x}\cdot\vec{x}'}{n}\Big) = \sigma_w^2\, q^{L-1}_{x,x'},   (S31)

where $\vec{x},\vec{x}'$ are $n$-dimensional vectors.

Figure S1. (a) The MNIST data are preprocessed by average pooling to downsample the images, in order to reduce training time. (b) Test accuracy of trained fully connected DNNs with 6 hidden layers applied to the MNIST dataset. For different widths of the hidden layers, DNNs with recurrent architectures can achieve performance comparable to those with layer-dependent architectures.

The probability of a Boolean function $f(s_{1,\gamma},\dots,s_{n,\gamma})$ being computed by the fully connected neural network is

P^L(\boldsymbol{f}) = \int {\rm d}\boldsymbol{h}\, \mathcal{N}(\boldsymbol{h}|0,\boldsymbol{c}^L(q)) \prod_{\gamma=1}^{2^n} \delta\big[{\rm sgn}(h^L_\gamma), f_\gamma\big].   (S32)

We focus on systems of layer-dependent architectures, where the overlap $q^l_{\gamma\gamma'}$ is governed by the forward dynamics

q^l_{\gamma\gamma'} = \int {\rm d}h^l_\gamma\,{\rm d}h^l_{\gamma'}\, \frac{\phi(h^l_\gamma)\,\phi(h^l_{\gamma'})}{\sqrt{(2\pi)^2|\boldsymbol{\Sigma}^l_{\gamma\gamma'}|}} \exp\Big(-\frac{1}{2}[h^l_\gamma, h^l_{\gamma'}] \cdot \big[\boldsymbol{\Sigma}^l_{\gamma\gamma'}(q^{l-1})\big]^{-1} \cdot [h^l_\gamma, h^l_{\gamma'}]^\top\Big).   (S33)

• For the sign activation function, choosing $\sigma_w = 1$ yields

\boldsymbol{\Sigma}^l_{\gamma\gamma'} = \begin{pmatrix} 1 & q^{l-1}_{\gamma\gamma'} \\ q^{l-1}_{\gamma\gamma'} & 1 \end{pmatrix},   (S34)
q^l_{\gamma\gamma'} = \frac{2}{\pi}\sin^{-1} q^{l-1}_{\gamma\gamma'}, \quad \forall l>0,   (S35)
q^0_{\gamma\gamma'} = \begin{cases} \frac{1}{n}\sum_{m=1}^{n} s_{m,\gamma} s_{m,\gamma'}, & \vec{S}^{\rm I} = \vec{s},\\ \frac{1}{n+1}\big(1 + \sum_{m=1}^{n} s_{m,\gamma} s_{m,\gamma'}\big), & \vec{S}^{\rm I} = (\vec{s},1). \end{cases}   (S36)

• For the ReLU activation function, choosing $\sigma_w = \sqrt{2}$ yields

\boldsymbol{\Sigma}^l_{\gamma\gamma'} = 2\begin{pmatrix} q^{l-1}_{\gamma\gamma} & q^{l-1}_{\gamma\gamma'} \\ q^{l-1}_{\gamma\gamma'} & q^{l-1}_{\gamma'\gamma'} \end{pmatrix},   (S37)
q^l_{\gamma\gamma} = \frac{1}{2}\Sigma^l_{\gamma\gamma,11} = 1, \quad \forall l,\gamma,   (S38)
q^l_{\gamma\gamma'} = \frac{1}{2\pi}\Big[\sqrt{|\boldsymbol{\Sigma}^l_{\gamma\gamma'}|} + \frac{\pi}{2}\Sigma^l_{\gamma\gamma',12} + \Sigma^l_{\gamma\gamma',12}\tan^{-1}\Big(\frac{\Sigma^l_{\gamma\gamma',12}}{\sqrt{|\boldsymbol{\Sigma}^l_{\gamma\gamma'}|}}\Big)\Big]
\qquad = \frac{1}{\pi}\Big[\sqrt{1-\big(q^{l-1}_{\gamma\gamma'}\big)^2} + q^{l-1}_{\gamma\gamma'}\Big(\frac{\pi}{2} + \sin^{-1} q^{l-1}_{\gamma\gamma'}\Big)\Big], \quad \forall l>0,   (S39)
q^0_{\gamma\gamma'} = \begin{cases} \frac{1}{n}\sum_{m=1}^{n} s_{m,\gamma} s_{m,\gamma'}, & \vec{S}^{\rm I} = \vec{s},\\ \frac{1}{n+1}\big(1 + \sum_{m=1}^{n} s_{m,\gamma} s_{m,\gamma'}\big), & \vec{S}^{\rm I} = (\vec{s},1). \end{cases}   (S40)

Figure S2. Iteration mappings of the overlaps of fully connected neural networks in the absence of bias variables. (a) Sign activation function: $q^l = 0$ is a stable fixed point, while $q^l = 1, -1$ are two unstable fixed points. (b) ReLU activation function with $\sigma_w = \sqrt{2}$: $q^l = 1$ is a stable fixed point.

The iteration mappings of the overlaps for the two activation functions considered are depicted in Fig. S2.
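These one-dimensional maps can be iterated directly; the short sketch below (ours, not the paper's code) confirms the fixed-point structure shown in Fig. S2, with an arbitrary initial overlap chosen for illustration.

```python
import numpy as np

def sign_map(q):
    """Overlap update for sign activation, Eq. (S35)."""
    return (2.0 / np.pi) * np.arcsin(q)

def relu_map(q):
    """Overlap update for ReLU activation with sigma_w = sqrt(2), Eq. (S39)."""
    return (np.sqrt(1.0 - q**2) + q * (np.pi / 2.0 + np.arcsin(q))) / np.pi

q_sign, q_relu = 0.5, 0.5          # an arbitrary initial off-diagonal overlap
for l in range(50):
    q_sign, q_relu = sign_map(q_sign), relu_map(q_relu)
print(q_sign, q_relu)              # sign -> 0 (overlaps vanish), ReLU -> 1 (fields align)
```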

A. ReLU Networks in the Large L Limit

In this section, we focus on the large depth limit $L\to\infty$. For the ReLU activation function, all the matrix elements of $\boldsymbol{c}^L$ become identical in the large $L$ limit, leading to $\boldsymbol{c}^L(q) \propto \boldsymbol{J}$ (where $\boldsymbol{J}$ is the all-one matrix) and a degenerate Gaussian distribution of the vector $\boldsymbol{h}^L$ enforcing all its components to be the same. To make this explicit, we consider the distribution of $\boldsymbol{h}^L$ as follows

P(\boldsymbol{h}^L) = \frac{1}{\sqrt{(2\pi)^M|\boldsymbol{c}^L|}} \exp\Big(-\frac{1}{2}(\boldsymbol{h}^L)^\top (\boldsymbol{c}^L)^{-1} \boldsymbol{h}^L\Big)
= \int \frac{{\rm d}\boldsymbol{x}^L}{(2\pi)^M} \exp\Big({\rm i}\boldsymbol{x}^L\cdot\boldsymbol{h}^L - \frac{\sigma_w^2}{2}(\boldsymbol{x}^L)^\top \boldsymbol{J} \boldsymbol{x}^L\Big)   (S41)
= \lim_{\kappa\to1} \int \frac{{\rm d}\boldsymbol{x}^L}{(2\pi)^M} \exp\Big({\rm i}\boldsymbol{x}^L\cdot\boldsymbol{h}^L - \frac{\sigma_w^2}{2}\Big[\sum_\gamma (x^L_\gamma)^2 + \kappa\sum_{\gamma\ne\gamma'} x^L_\gamma x^L_{\gamma'}\Big]\Big).   (S42)

Now define $\boldsymbol{c}(\kappa) = \sigma_w^2\big[(1-\kappa)\boldsymbol{I} + \kappa\boldsymbol{J}\big]$ and notice that

[\boldsymbol{c}(\kappa)]^{-1} = \frac{1}{\sigma_w^2(1-\kappa)}\Big[\boldsymbol{I} - \frac{\kappa}{M\kappa + (1-\kappa)}\boldsymbol{J}\Big] \approx \frac{1}{\sigma_w^2(1-\kappa)}\Big[\boldsymbol{I} - \frac{1}{M}\boldsymbol{J} + \frac{1-\kappa}{M^2\kappa}\boldsymbol{J}\Big],   (S43)

P(\boldsymbol{h}^L) = \lim_{\kappa\to1} \frac{1}{\sqrt{(2\pi)^M|\boldsymbol{c}(\kappa)|}} \exp\Big(-\frac{1}{2}(\boldsymbol{h}^L)^\top [\boldsymbol{c}(\kappa)]^{-1} \boldsymbol{h}^L\Big)