Space of Functions Computed by Deep-Layered Machines
Alexander Mozeika,1 Bo Li,2 and David Saad2
1London Institute for Mathematical Sciences, London, W1K 2XF, United Kingdom
2Nonlinearity and Complexity Research Group, Aston University, Birmingham, B4 7ET, United Kingdom
We study the space of functions computed by random-layered machines, including deep neural networks and Boolean circuits. Investigating the distribution of Boolean functions computed on recurrent and layer-dependent architectures, we find that it is the same in both models. Depending on the initial conditions and computing elements used, we characterize the space of functions computed in the large-depth limit and show that the macroscopic entropy of Boolean functions is either monotonically increasing or decreasing with growing depth.
Deep-layered machines comprise multiple consecutive layers of basic computing elements aimed at representing an arbitrary function, where the first and final layers represent its input and output arguments, respectively. Notable examples include deep neural networks (DNNs) composed of perceptrons [1] and Boolean circuits constructed from logical gates [2]. Being universal approximators [3, 4], DNNs have been successfully employed in different machine learning applications [1]. Similarly, Boolean circuits can compute any Boolean function even when constructed from a single gate [5].
While the majority of DNN research focuses on their application in carrying out various learning tasks, it is equally important to establish the space of functions they typically represent for a given architecture and the computing elements used. One way to address such a generic study is to consider a random ensemble of DNNs. The study of random neural networks using methods of statistical physics has played an important role in understanding their typical properties for storage capacity and generalization ability [6, 7] and the properties of energy-based [8–12] and associative memory models [13, 14], as well as the links between energy-based models and feed-forward layered machines [15]. In parallel, there have been theoretical studies within the computer science community of the range of Boolean functions generated by random Boolean circuits [16, 17]. Both DNNs and Boolean circuits share common basic properties.
Characterizing the space of functions computed by random-layered machines is of great importance, since it sheds light on their approximation and generalization properties. However, it is also highly challenging, due to the inherent recursiveness of the computation and the randomness in their architecture and/or computing elements. Existing theoretical studies of the function space of deep-layered machines are mostly based on the mean-field approach, which allows for a sensitivity analysis of the functions realized by deep-layered machines under input or parameter perturbations [4, 18–20].

To gain a complete and detailed understanding of the function space, we develop a path-integral formalism that directly examines the individual functions computed. This is carried out by processing all possible input configurations simultaneously, together with the corresponding outputs. For simplicity, we always consider Boolean functions with binary input and output variables.
The main contribution of this Letter is in providing a detailed understanding of the distribution of Boolean functions computed at each layer. It points to the equivalence between recurrent and layer-dependent architectures and, consequently, to a potentially significant reduction in the number of trained free variables. Additionally, the complexity of the Boolean functions implemented, measured by their entropy, depends on the number of layers and the computing elements used; it decreases rapidly when rectified linear unit (ReLU) components are employed, which arguably explains their generalization successes.
Framework.––The layered machines considered consist of $L+1$ layers, each with $N$ nodes. Node $i$ at layer $l$ is connected to the set of nodes $\{i_1, i_2, \ldots, i_k\}$ of layer $l-1$; its activity is determined by the gate $\alpha^l_i$, computing a function of $k$ inputs, according to the propagation rule

$$P(S^l_i|\vec{S}^{l-1}) = \delta\big(S^l_i,\, \alpha^l_i(S^{l-1}_{i_1}, S^{l-1}_{i_2}, \ldots, S^{l-1}_{i_k})\big), \qquad (1)$$

where $\delta$ is the Dirac or Kronecker delta function, depending on the domain of $S^l_i$. The probabilistic form of Eq. (1) adopted here is convenient for the generating-functional analysis and for the inclusion of noise [19, 21]. We primarily consider two structures here: (i) densely connected models, where $k=N$ and node $i$ is connected to all nodes of the previous layer––one such example is the fully connected neural network with $S^l_i = \alpha^l(H^l_i)$, where $H^l_i = \sum_{j=1}^N W^l_{ij} S^{l-1}_j/\sqrt{N} + b^l_i$ is the preactivation field and $\alpha^l$ is the activation function at layer $l$ (we mainly focus on the case $b^l_i = 0$; the effect of nonzero bias is discussed in [22]); (ii) sparsely connected models, where $k \sim O(N^0)$––examples include sparse neural networks and layered Boolean circuits, where $\alpha^l_i$ is a Boolean gate with $k$ inputs, e.g., the majority gate.

Consider a binary input vector $\vec{s} = (s_1, \ldots, s_n) \in \{-1,1\}^n$, which is fed to the initial layer $l=0$. To accommodate a broader set of functions, we also consider an augmented input vector, e.g., (i) $\vec{S}^I = (\vec{s}, 1)$, which is equivalent to adding a bias variable in the context of neural networks; (ii) $\vec{S}^I = (\vec{s}, -\vec{s}, 1, -1)$, which has been used to construct all Boolean functions [16]. Each node $i$ at layer $0$ points to a randomly chosen element of $\vec{S}^I$, such that

$$P_0(\vec{S}^0|\vec{s}) = \prod_{i=1}^N P_0\big(S^0_i\,\big|\,S^I_{n_i}(\vec{s})\big) = \prod_{i=1}^N \delta\big(S^0_i,\, S^I_{n_i}(\vec{s})\big), \qquad (2)$$

where $n_i = 1, \ldots, |\vec{S}^I|$ is an index chosen from the flat distribution $P(n_i) = 1/|\vec{S}^I|$.

Figure 1. A deep-layered machine computing all possible $2^n$ inputs. The direction of computation is from bottom to top. The binary string $\boldsymbol{S}^L \in \{-1,1\}^{2^n}$ represents the Boolean function computed on the blue nodes of the output layer $L$. The augmented vector $\vec{S}^I = (\vec{s}, 1)$ is used as an example of input here. The constant $1$ is represented by the dashed circle.
The computation of the layered machine is governed by the propagator $P(\vec{S}^L|\vec{s}) = \sum_{\vec{S}^{L-1}\cdots\vec{S}^0} P(\vec{S}^0|\vec{s}) \prod_{l=1}^L P(\vec{S}^l|\vec{S}^{l-1})$, where each node at layer $L$ computes a Boolean function $\{-1,1\}^n \to \{-1,1\}$. When the gates $\alpha^l_i$ or the network topology are random, the layered machine can be viewed as a disordered dynamical system with quenched disorder [19, 21]. To probe the functions being computed, we consider the simultaneous layer propagation of all possible inputs $\vec{s}_\gamma \in \{-1,1\}^n$, labeled by $\gamma = 1, \ldots, 2^n$ and governed by the product propagator $\prod_{\gamma=1}^{2^n} P(\vec{S}^L_\gamma|\vec{s}_\gamma)$. The binary string $\boldsymbol{S}^L_i \in \{-1,1\}^{2^n}$ represents the Boolean function computed at node $i$ of layer $L$, as illustrated in Fig. 1. Note that we use the vector notation $\vec{S}^l = (S^l_1, \ldots, S^l_i, \ldots, S^l_N)$ and $\boldsymbol{S}^l_i = (S^l_{i,1}, \ldots, S^l_{i,\gamma}, \ldots, S^l_{i,2^n})$ to represent the states and functions, respectively. Using the above formalism, the distribution of Boolean functions $\boldsymbol{f}$ computed on the final layer is given by

$$P^L_N(\boldsymbol{f}) = \frac{1}{N}\sum_{i=1}^N \Big\langle \prod_{\gamma=1}^{2^n} \delta\big(f_\gamma,\, S^L_{i,\gamma}\big) \Big\rangle, \qquad (3)$$

where the components of $\boldsymbol{f}$ satisfy $f_\gamma = f(\vec{s}_\gamma)$, and the angular brackets represent the average generated by $\prod_{\gamma=1}^{2^n} P(\vec{S}^L_\gamma|\vec{s}_\gamma)$. To compute $P^L_N(\boldsymbol{f})$ and averages of other macroscopic observables, which are expected to be self-averaging for $N\to\infty$ [23], we introduce the disorder-averaged generating functional (GF) $\Gamma[\{\psi^l_{i,\gamma}\}] = \overline{\sum_{\{\vec{S}^l_\gamma\}} \prod_\gamma P(\vec{S}^0_\gamma|\vec{S}^I_\gamma) \prod_l P(\vec{S}^l_\gamma|\vec{S}^{l-1}_\gamma)\, e^{\mathrm{i}\sum_{l,i,\gamma}\psi^l_{i,\gamma} S^l_{i,\gamma}}}$, where the overline denotes an average over the quenched disorder. To keep the presentation concise, we outline the GF formalism only for DNNs in the following and refer the reader to [22] for the details of the derivation used for Boolean circuits.
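As an illustration of Eq. (3) (our addition, not part of the original text), the empirical distribution $P^L_N(\boldsymbol{f})$ of a single disorder realization can be estimated by propagating all $2^n$ inputs simultaneously through one random network and reading off the sign pattern at each output node. The sketch below assumes a layer-dependent architecture with sign activations at every layer, weights drawn from $\mathcal{N}(0,1)$, and the augmented input $\vec{S}^I = (\vec{s}, 1)$; all function and variable names are ours.

```python
import itertools
from collections import Counter
import numpy as np

def boolean_functions_of_random_dnn(n=3, N=500, L=5, sigma=1.0, seed=0):
    """Propagate all 2^n inputs through a random sign-activation DNN and
    return the empirical distribution of Boolean functions at layer L."""
    rng = np.random.default_rng(seed)
    # All 2^n input patterns as the columns of an (n, 2^n) matrix with entries +-1.
    S = np.array(list(itertools.product([-1, 1], repeat=n)), dtype=float).T
    # Initial layer: each of the N nodes copies a random component of (s, 1), cf. Eq. (2).
    S_aug = np.vstack([S, np.ones((1, S.shape[1]))])
    S_layer = S_aug[rng.integers(0, n + 1, size=N), :]
    for _ in range(L):
        W = rng.normal(0.0, sigma, size=(N, N))      # layer-dependent weights
        S_layer = np.sign(W @ S_layer / np.sqrt(N))  # sign activation, Eq. (1)
    # Row i is the binary string S_i^L, i.e. the Boolean function computed at node i.
    counts = Counter(tuple(row.astype(int)) for row in S_layer)
    return {f: c / N for f, c in counts.items()}

dist = boolean_functions_of_random_dnn()
print(len(dist), "distinct Boolean functions computed among the N output nodes")
```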
Layer-dependent and recurrent architectures.––We focus on two different architectures: layer-dependent architectures, where the gates and/or connections are different from layer to layer, and recurrent architectures, where the gates and connections are shared across all layers. Both architectures represent feed-forward machines that implement input-output mappings.
Specifically, we assume that the weights $W^l_{ij}$ in fully connected DNNs with layer-dependent architectures are independent Gaussian random variables sampled from $\mathcal{N}(0, \sigma^2)$. In DNNs with recurrent architectures, the weights are sampled once and are shared among layers, i.e., $W^{l+1}_{ij} = W^l_{ij}$. We apply the sign activation function in the final layer, i.e., $\alpha^L(h^L_i) = \mathrm{sgn}(h^L_i)$, to ensure that the output of the DNN is Boolean.
We first outline the derivation for fully connected recurrent architectures. It is sufficient to characterize the disorder-averaged GF by introducing the cross-layer overlaps $q^{l,l'}_{\gamma\gamma'} = (1/N)\sum_i \langle S^l_{i,\gamma} S^{l'}_{i,\gamma'}\rangle$ as order parameters and the corresponding conjugate order parameters $Q^{l,l'}_{\gamma\gamma'}$, which leads to a saddle-point integral $\Gamma = \int\{\mathrm{d}q\,\mathrm{d}Q\}\, e^{N\Psi[q,Q]}$ with the potential [22]

$$
\Psi = \mathrm{i}\,\mathrm{Tr}\{q Q\} + \sum_{m=1}^{|\vec{S}^I|} P(m)\, \ln \sum_{\boldsymbol{S}} \int \mathrm{d}\boldsymbol{H}\, M_m[\boldsymbol{H}, \boldsymbol{S}], \qquad (4)
$$

where $M_m[\boldsymbol{H},\boldsymbol{S}]$ is an effective single-site measure

$$
M_m = e^{\mathrm{i}\sum_{l,\gamma}\psi^l_\gamma S^l_\gamma - \mathrm{i}\sum_{ll',\gamma\gamma'} Q^{l,l'}_{\gamma\gamma'} S^l_\gamma S^{l'}_{\gamma'}}\, \mathcal{N}(\boldsymbol{H}|0,\boldsymbol{C}) \prod_{\gamma=1}^{2^n} P_0(S^0_\gamma|S^I_{m,\gamma}) \prod_{l=1}^L \delta\big(S^l_\gamma,\, \alpha^l(h^l_\gamma)\big). \qquad (5)
$$

Due to weight sharing, the preactivation fields $\boldsymbol{H} = (\boldsymbol{h}^1, \ldots, \boldsymbol{h}^L)$, where $\boldsymbol{h}^l \in \mathbb{R}^{2^n}$, are governed by the Gaussian distribution $\mathcal{N}(\boldsymbol{H}|0,\boldsymbol{C})$ and are correlated across layers, with covariance $[\boldsymbol{C}]^{l,l'}_{\gamma\gamma'} = \sigma^2 q^{l-1,l'-1}_{\gamma\gamma'}$. Setting $\psi^l_\gamma$ to zero and differentiating $\Psi$ with respect to $\{q^{l,l'}_{\gamma\gamma'}, Q^{l,l'}_{\gamma\gamma'}\}$ yields the saddle point of the potential $\Psi$ dominating $\Gamma$ for $N\to\infty$, at which the conjugate order parameters $Q^{l,l'}_{\gamma\gamma'}$ vanish [22], leading to

$$
q^{l,l'}_{\gamma\gamma'} = \begin{cases} \sum_m P(m)\,\langle S^l_\gamma S^0_{\gamma'}\rangle_{M_m}, & l' = 0, \\ \int \mathrm{d}\boldsymbol{H}\, \alpha^l(h^l_\gamma)\,\alpha^{l'}(h^{l'}_{\gamma'})\, \mathcal{N}(\boldsymbol{H}|0,\boldsymbol{C}), & l' > 0. \end{cases} \qquad (6)
$$

Notice that in the above Gaussian average, all preactivation fields but the pair $\{h^l_\gamma, h^{l'}_{\gamma'}\}$ can be integrated out, reducing it to a tractable two-dimensional integral.
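As a practical note (ours), the equal-layer case of Eq. (6) can be evaluated for an arbitrary activation function by a straightforward Monte Carlo estimate of this two-dimensional Gaussian integral; the sketch below uses illustrative names and, for the sign activation with $\sigma^2 = 1$, reproduces the arcsine map of Eq. (S35) in the Supplemental Material.

```python
import numpy as np

def next_overlaps(q_diag, q_off, activation, sigma2, n_samples=200_000, seed=0):
    """One step of the equal-layer case of Eq. (6):
    q^l_{gg'} = E[alpha(h_g) alpha(h_g')], with (h_g, h_g') drawn from a bivariate
    Gaussian of covariance sigma2 * [[q_diag, q_off], [q_off, q_diag]], where
    q_diag, q_off are entries of the previous-layer overlap matrix."""
    rng = np.random.default_rng(seed)
    cov = sigma2 * np.array([[q_diag, q_off], [q_off, q_diag]])
    h = rng.multivariate_normal([0.0, 0.0], cov, size=n_samples)
    a = activation(h)
    return np.mean(a[:, 0] ** 2), np.mean(a[:, 0] * a[:, 1])

# Sign activation, sigma^2 = 1: the off-diagonal overlap follows q -> (2/pi) arcsin(q).
q_diag, q_off = next_overlaps(1.0, 0.5, np.sign, sigma2=1.0)
print(q_diag, q_off, 2 / np.pi * np.arcsin(0.5))
```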
The GF analysis can be performed similarly for layer-dependent architectures. Here the result has the same form as Eq. (6), with $q^{l,l'}_{\gamma\gamma'} = \delta_{l,l'}\, q^{l,l}_{\gamma\gamma'}$, i.e., the overlaps between different layers are absent [22], implying $[\boldsymbol{C}]^{l,l'}_{\gamma\gamma'} = \sigma^2 \delta_{l-1,l'-1}\, q^{l-1,l-1}_{\gamma\gamma'}$ for the covariances of the preactivation fields. In this case, we denote the equal-layer covariance matrix as $\boldsymbol{c}^l := \boldsymbol{C}^{l,l}$.

We remark that the behavior of DNNs with layer-dependent architectures in the limit $N\to\infty$ can also be studied by mapping to Gaussian processes [4, 18, 24]. However, it is not clear whether such an analysis is possible in the highly correlated recurrent case, while the GF or path-integral framework is still applicable [25–27].
Marginalizing the effective single-site measure in Eq. (5) gives rise to the distribution of Boolean functions $\boldsymbol{f} \in \{-1,1\}^{2^n}$ computed at layer $L$ of DNNs with recurrent architectures,

$$
P^L(\boldsymbol{f}) = \int \mathrm{d}\boldsymbol{h}\, \mathcal{N}(\boldsymbol{h}|0,\boldsymbol{c}^L) \prod_{\gamma=1}^{2^n} \delta\big(f_\gamma,\, \alpha^L(h_\gamma)\big), \qquad (7)
$$

where the elements of the covariance matrix are $[\boldsymbol{c}^L]_{\gamma\gamma'} = [\boldsymbol{C}]^{L,L}_{\gamma\gamma'} = \sigma^2 q^{L-1,L-1}_{\gamma\gamma'}$. Note that the physical meaning of $P^L(\boldsymbol{f})$ is the distribution of Boolean functions defined in Eq. (3), averaged over the disorder: $P^L(\boldsymbol{f}) = \lim_{N\to\infty} \overline{P^L_N(\boldsymbol{f})}$.
Moreover, Eq. (7) also applies to layer-dependent architectures, since the equal-layer covariance matrix $\boldsymbol{c}^L$ is the same in the two scenarios. We therefore arrive at the first important conclusion: the typical sets of Boolean functions computed at the output layer $L$ by the layer-dependent and recurrent architectures are identical. Furthermore, if the gate functions $\alpha^l$ are odd, it can be shown that all cross-layer overlaps $q^{l,l'}_{\gamma\gamma'}$ of the recurrent architectures vanish, implying the statistical equivalence of the hidden-layer activities to those of the layer-dependent architectures as well [22].
A similar GF analysis can be applied to sparsely connected Boolean circuits constructed from a single Boolean gate $\alpha$, keeping in mind that distributions of gates can easily be accommodated. In such models, the source of disorder is the random connections. In layer-dependent architectures, a gate is connected randomly to exactly $k \sim O(N^0)$ gates of the previous layer, and this connectivity pattern changes from layer to layer. In recurrent architectures, on the other hand, the random connections are sampled once and the connectivity pattern is shared among layers. Note that in Boolean circuits the activities $\boldsymbol{S}^l_i$ at every layer always represent a Boolean function. For layer-dependent architectures, investigating the distribution of activities gives rise to

$$
P^{l+1}(\boldsymbol{f}) = \sum_{\boldsymbol{f}_1,\ldots,\boldsymbol{f}_k} \prod_{j=1}^k P^l(\boldsymbol{f}_j) \prod_{\gamma=1}^{2^n} \delta\big(f_\gamma,\, \alpha(f_{1,\gamma},\ldots,f_{k,\gamma})\big), \qquad (8)
$$

which describes how the probability of the Boolean function $\boldsymbol{f} \in \{-1,1\}^{2^n}$ evolves from layer to layer [28] [22]. We note that for recurrent architectures the equation for the probability of the Boolean functions computed is exactly the same as above [22], suggesting that in random Boolean circuits the typical sets of Boolean functions computed at the layers of the layer-dependent and recurrent architectures are identical. Note that the coupling asymmetry plays a crucial role in this equivalence property [22, 29, 30].
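To make the layer-to-layer map of Eq. (8) concrete, the following sketch (our addition) iterates it exactly for $n = 2$ inputs, where a Boolean function is a string $\boldsymbol{f} \in \{-1,1\}^4$, using the 3-input majority gate defined in the Supplemental Material and the initial condition $\vec{S}^I = (\vec{s}, -\vec{s}, 1, -1)$; it also tracks the normalized entropy discussed below and reproduces the qualitative behavior shown in Fig. 3. Replacing the majority gate with the two-input AND gate, $\alpha(S_1, S_2) = \mathrm{sgn}(S_1 + S_2 + 1)$, instead drives the entropy toward zero.

```python
import itertools
import numpy as np

n = 2
M = 2 ** n                                              # number of input patterns
patterns = list(itertools.product([-1, 1], repeat=n))
functions = list(itertools.product([-1, 1], repeat=M))  # all 2^(2^n) = 16 Boolean functions

def maj3(f1, f2, f3):
    """3-input majority gate applied component-wise (pattern by pattern)."""
    return tuple(1 if a + b + c > 0 else -1 for a, b, c in zip(f1, f2, f3))

def layer_update(P, gate, k):
    """One application of Eq. (8): combine k functions drawn from P^l through the gate."""
    P_next = {f: 0.0 for f in functions}
    for combo in itertools.product(functions, repeat=k):
        weight = np.prod([P[f] for f in combo])
        P_next[gate(*combo)] += weight
    return P_next

# Initial layer: each node copies a component of S^I = (s, -s, 1, -1), cf. Eq. (2).
S_I = [tuple(p[i] for p in patterns) for i in range(n)]
S_I += [tuple(-x for x in f) for f in S_I] + [(1,) * M, (-1,) * M]
P = {f: S_I.count(f) / len(S_I) for f in functions}

for layer in range(8):
    H = -sum(p * np.log(p) for p in P.values() if p > 0)
    print(f"layer {layer}: H^L / (2^n log 2) = {H / (M * np.log(2)):.3f}")
    P = layer_update(P, maj3, 3)
```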
Figure 2. Test accuracy of trained fully connected DNNs applied to the MNIST dataset. Images have been downsampled by a factor of 2 to reduce training time, and each hidden layer has 128 nodes. Each data point is averaged over 5 random initializations. The accuracies of recurrent architectures, with weight sharing between hidden layers, are comparable to those of layer-dependent architectures.

The equivalence between the two architectures points to a potential reduction in the number of free parameters of layered machines by weight sharing or connectivity sharing among layers, which is useful in devices with limited computational resources [31]. For illustration, we consider the image recognition task on the Modified National Institute of Standards and Technology (MNIST) handwritten digit data [32], using DNNs with both layer-dependent and recurrent architectures (weights shared between hidden layers only; for details see [22]). The experiment shown in Fig. 2 demonstrates the feasibility of using recurrent architectures to perform image classification tasks with a slightly lower accuracy but a significant saving in the number of trained parameters.
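For readers who wish to reproduce this kind of experiment, the sketch below shows the weight-sharing scheme of Eq. (S30) in the Supplemental Material as a PyTorch module (our illustration; the authors' training code and framework are not specified in the text, and PyTorch here is an assumption): a single hidden-to-hidden layer is reused at every depth, so the number of trained parameters no longer grows with the number of hidden layers.

```python
import torch
import torch.nn as nn

class RecurrentMLP(nn.Module):
    """Fully connected classifier whose hidden-to-hidden weights are shared across
    layers (the recurrent architecture); the input and output projections remain
    separate, as in Eq. (S30) of the Supplemental Material."""
    def __init__(self, n_in=14 * 14, n_hidden=128, n_out=10, depth=6):
        super().__init__()
        self.w_in = nn.Linear(n_in, n_hidden)
        self.w_hid = nn.Linear(n_hidden, n_hidden)   # one module, reused (depth - 1) times
        self.w_out = nn.Linear(n_hidden, n_out)
        self.depth = depth

    def forward(self, x):
        h = torch.relu(self.w_in(x))
        for _ in range(self.depth - 1):
            h = torch.relu(self.w_hid(h))            # same weights at every hidden layer
        return self.w_out(h)                          # softmax is applied inside the loss

model = RecurrentMLP()
print("trained parameters with weight sharing:",
      sum(p.numel() for p in model.parameters()))
```

Training with ADAM then proceeds as for any feed-forward network; an unshared (layer-dependent) baseline is obtained by instantiating depth - 1 separate hidden layers instead.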
Boolean functions computed at large depth.––We consider the typical Boolean functions computed by random-layered machines by examining $P^L(\boldsymbol{f})$ in the large-depth limit $L\to\infty$ for specific gates in the following examples.

In DNNs using the ReLU activation function $\alpha^l(x) = \max(x, 0)$ in the hidden layers (the sign activation function is always used in the output layer), which is commonly used in applications, all covariance matrix elements $[\boldsymbol{c}^L]_{\gamma\gamma'}$ in Eq. (7) converge to the same value in the limit $L\to\infty$, implying that all components of the preactivation field vector $\boldsymbol{h}$ are also the same and hence that the components of $\boldsymbol{f}$ are identical. Therefore, random deep ReLU networks compute only constant Boolean functions in the infinite-depth limit, echoing recent findings of a bias toward simple functions in random DNNs constructed from ReLUs [22], which arguably plays a role in their generalization ability [33, 34].
In DNNs using the sign activation function also in the hidden layers, i.e., where Eq. (1) enforces the rule $S^l_i = \mathrm{sgn}\big(\sum_j W^l_{ij} S^{l-1}_j/\sqrt{N}\big)$, the cross-pattern overlaps $q^l_{\gamma\gamma'} = (1/N)\sum_i \langle S^l_{i,\gamma} S^l_{i,\gamma'}\rangle$ satisfying $|q^l_{\gamma\gamma'}| < 1$ decrease monotonically with an increasing number of layers and vanish as $l\to\infty$; such a "chaotic" nature of the dynamics also holds in random DNNs with other sigmoidal activation functions, such as the error and hyperbolic tangent functions [4, 24]. The consequence of this behavior is that for the input vector $\vec{S}^I = \vec{s}$, $P^L(\boldsymbol{f})$ is uniform on the set of all odd functions [22], i.e., functions satisfying $f(-\vec{s}) = -f(\vec{s})$. Furthermore, for $\vec{S}^I = (\vec{s}, 1)$, $P^L(\boldsymbol{f})$ is uniform on the set of all Boolean functions [22].
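The contrasting fixed points behind these two results can be seen by iterating the explicit overlap recursions derived in the Supplemental Material, Eqs. (S35) and (S39); the short sketch below (our addition) does this starting from a generic initial overlap.

```python
import numpy as np

def sign_map(q):
    """Overlap recursion for the sign activation, SM Eq. (S35)."""
    return 2.0 / np.pi * np.arcsin(q)

def relu_map(q):
    """Overlap recursion for the ReLU activation with sigma_w^2 = 2, SM Eq. (S39)."""
    return (np.sqrt(1.0 - q ** 2) + q * (np.pi / 2.0 + np.arcsin(q))) / np.pi

q_sign = q_relu = 0.5
for _ in range(30):
    q_sign, q_relu = sign_map(q_sign), relu_map(q_relu)
print(q_sign, q_relu)   # sign: overlaps decay toward 0; ReLU: overlaps saturate at 1
```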
For Boolean circuits, there are also scenarios where the distribution $P^L(\boldsymbol{f})$ has a single Boolean function in its support or is uniform over some set of functions [16, 17, 35]. The outcome depends on the gate $\alpha$ used in Eq. (1) and on the input vector $\vec{S}^I$. For example, the AND gate, $\alpha(S_1, S_2) = \mathrm{sgn}(S_1 + S_2 + 1)$, and the OR gate, $\alpha(S_1, S_2) = \mathrm{sgn}(S_1 + S_2 - 1)$ [22], have outputs biased toward $+1$ and $-1$, respectively [16, 22, 35]. The consequence of the latter is that the distribution $P^L(\boldsymbol{f})$ has only a single Boolean function in its support [22, 35]. On the other hand, when the majority gate $\alpha(S_1, \ldots, S_k) = \mathrm{sgn}\big(\sum_{j=1}^k S_j\big)$, which is balanced, $\sum_{S_1,\ldots,S_k} \alpha(S_1, \ldots, S_k) = 0$, and nonlinear [36], is used with the input vector $\vec{S}^I = (\vec{s}, -\vec{s}, 1, -1)$, the distribution $P^L(\boldsymbol{f})$ is uniform over all Boolean functions [35], which is consistent with the result of [16].
Entropy of Boolean functions.––Having considered the distribution of Boolean functions for a few different examples, we observed that random-layered machines either reduce to a single Boolean function or compute all (or a subset of) functions with uniform probability at layer $L$ as $L\to\infty$. We note that for the Shannon entropy over Boolean functions, $H^L = -\sum_{\boldsymbol{f}} P^L(\boldsymbol{f}) \log P^L(\boldsymbol{f})$, these two scenarios saturate its lower and upper bounds, given by $0$ and $2^n \log 2$, respectively. Thus, the entropy $H^L$ can be seen, at least intuitively, as a measure of function-space complexity.
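For concreteness, the upper bound is attained by the uniform distribution over all $2^{2^n}$ Boolean functions of $n$ inputs,

$$
H^L_{\max} = -\sum_{\boldsymbol{f}} \frac{1}{2^{2^n}} \log \frac{1}{2^{2^n}} = \log 2^{2^n} = 2^n \log 2,
$$

while a distribution supported on a single function gives $H^L = 0$.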
In Fig. 3, we study the entropy $H^L$, computed using Eqs. (7) and (8), as a function of the depth $L$ in random-layered machines constructed from different activation functions or gates and computing different inputs. The initial increase in entropy after layer $L = 0$, seen in Figs. 3(a) and 3(b), can be explained by the properties of the gates used and by the initial set of (simple) Boolean functions at layer $L = 0$: functions from layer $L = 0$ are "copied" onto layer $L = 1$, while new functions are also created, as illustrated in Figs. 3(c) and 3(d). Note that the minimal depth at which a ReLU network produces a Boolean function is $L = 2$. The behavior of the entropy $H^L$ after the initial increase depends on the specific gate functions used. For the ReLU activation function in DNNs and the AND gate in Boolean circuits, the entropy $H^L$ decreases monotonically with $L$, suggesting that the sizes of the sets of typical Boolean functions computed decrease with an increasing number of layers $L$. Random initialization of layered machines with such gates/activation functions therefore serves as a prior biased toward a more restricted set of functions [33, 34]. On the other hand, for balanced gates with appropriate initial conditions, e.g., the sign activation in DNNs and the majority vote in Boolean circuits, the entropy $H^L$ increases monotonically with the depth $L$, indicating that the sizes of the sets of typical Boolean functions computed are increasing.
In summary, we presented an analytical framework to examine the Boolean functions represented by random deep-layered machines, by considering all possible inputs simultaneously and applying the generating-functional analysis to compute various relevant macroscopic quantities. We derived the probability of the Boolean functions computed on the output nodes. Surprisingly, we discovered that the typical sets of Boolean functions computed by the layer-dependent and recurrent architectures are identical. This points to the possibility of computing complex functions with a reduced number of parameters by weight or connection sharing, as showcased in an image classification experiment. We also studied the Boolean functions computed by specific random-layered machines. Biased activation functions (e.g., ReLU) or biased Boolean gates (e.g., AND/OR) can lead to more restricted typical sets of Boolean functions found at deeper layers, which may explain their generalization ability. On the other hand, balanced activation functions (e.g., sign) or Boolean gates (e.g., majority), complemented with appropriate initial conditions, lead to a uniform distribution over all Boolean functions in the infinite-depth limit. It will be interesting to investigate the functions realized by different DNN architectures with structured data and by different learning algorithms [7, 37–40].
We also showed the monotonic behavior of the entropy of Boolean functions as a function of depth, which is of interest in the field of computer science. We envisage that the insights gained and the methods developed will facilitate further study of deep-layered machines.
B.L. and D.S. acknowledge support from the Leverhulme Trust (RPG-2018-092) and the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 835913. D.S. acknowledges support from the EPSRC program grant TRANSNET (EP/R035342/1).
Figure 3. Normalized entropy and distribution of functions of deep-layered machines. (a) Normalized entropy $H^L/2^n$ of Boolean functions computed by DNNs with sign or ReLU activation in the hidden layers, as a function of the network depth $L$; the initial condition is set as $\vec{S}^I = (\vec{s}, 1)$. (b) $H^L/2^n$ vs $L$ for Boolean circuits constructed from the MAJ3 or the AND gate, with initial condition $\vec{S}^I = (\vec{s}, -\vec{s}, 1, -1)$. (c) The distribution of Boolean functions $P^L(\boldsymbol{f})$ computed by Boolean circuits with two inputs, $n = 2$ (the number of all possible functions is 16), represented by the sizes of the circles on a $4\times 4$ grid. Upper panel: MAJ3-gate-based circuits, in which more functions are created at larger depth $L$ and $P^L(\boldsymbol{f})$ converges to a uniform distribution. Lower panel: AND-gate-based circuits, in which new functions are created from $L = 0$ to $L = 1$, while $P^L(\boldsymbol{f})$ converges to a distribution supported on a single Boolean function as the network depth increases.
alexander.mozeika@kcl.ac.uk
b.li10@aston.ac.uk
d.saad@aston.ac.uk
[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep
learning. Nature, 521(7553):436–444, 2015.
[2] Ryan O’Donnell. Analysis of Boolean Functions. Cam-
bridge University Press, New York, 2014.
[3] Kurt Hornik, Maxwell Stinchcombe, and Halbert White.
Multilayer feedforward networks are universal approxi-
mators. Neural Networks, 2(5):359 – 366, 1989.
[4] Ben Poole, Subhaneil Lahiri, Maithreyi Raghu, Jascha
Sohl-Dickstein, and Surya Ganguli. Exponential expres-
sivity in deep neural networks through transient chaos.
In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon,
and R. Garnett, editors, Advances in Neural Information
Processing Systems 29, pages 3360–3368. Curran Asso-
ciates, Inc., New York, 2016.
[5] N. Nisan and S. Schocken. The Elements of Computing
Systems: Building a Modern Computer from First Prin-
ciples. MIT Press, 2008.
[6] Andreas Engel and Christian Van den Broeck. Statisti-
cal Mechanics of Learning. Cambridge University Press,
New York, 2001.
[7] David Saad, editor. On-Line Learning in Neural Net-
works. Cambridge University Press, New York, 1998.
[8] Elena Agliari, Adriano Barra, Andrea Galluzzi,
Francesco Guerra, and Francesco Moauro. Multitasking
associative networks. Phys. Rev. Lett., 109:268101, 2012.
[9] Haiping Huang and Taro Toyoizumi. Advanced mean-field theory of the restricted Boltzmann machine. Phys. Rev. E, 91:050101, 2015.
[10] Marylou Gabrié, Eric W. Tramel, and Florent Krzakala. Training restricted Boltzmann machines via the Thouless-Anderson-Palmer free energy. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 640–648. Curran Associates, Inc., 2015.
[11] Marc Mézard. Mean-field message-passing equations in the Hopfield model and its generalizations. Phys. Rev. E, 95:022117, 2017.
[12] Adriano Barra, Giuseppe Genovese, Peter Sollich, and Daniele Tantari. Phase diagram of restricted Boltzmann machines and generalized Hopfield networks with arbitrary priors. Phys. Rev. E, 97:022310, 2018.
[13] J J Hopfield. Neural networks and physical systems with
emergent collective computational abilities. Proceedings
of the National Academy of Sciences, 79(8):2554–2558,
1982.
[14] J.A. Hertz, A.S. Krogh, and R.G. Palmer. Introduction
To The Theory Of Neural Computation. Addison-Wesley,
1991.
[15] J. J. Hopfield. Learning algorithms and probability distri-
butions in feed-forward and feed-back networks. Proceed-
ings of the National Academy of Sciences, 84(23):8429–
8433, 1987.
[16] Petr Savický. Random Boolean formulas representing any Boolean function with asymptotically equal probability. Discrete Mathematics, 83(1):95–103, 1990.
[17] Alex Brodsky and Nicholas Pippenger. The Boolean functions computed by random Boolean formulas or how to grow the right function. Random Structures & Algorithms, 27(4):490–519, 2005.
[18] Jaehoon Lee, Jascha Sohl-Dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep neural networks as Gaussian processes. In Proceedings of the 6th International Conference on Learning Representations, 2018.
[19] Bo Li and David Saad. Exploring the function space of
deep-learning machines. Phys. Rev. Lett., 120:248301,
2018.
[20] Bo Li and David Saad. Large deviation analysis of
function sensitivity in random deep neural networks.
Journal of Physics A: Mathematical and Theoretical,
53(10):104002, 2020.
[21] Alexander Mozeika, David Saad, and Jack Raymond. Computing with noise: Phase transitions in Boolean formulas. Phys. Rev. Lett., 103:248701, 2009.
[22] See Supplemental Material for details, which includes
Refs. [41–43].
[23] Marc Mézard, Giorgio Parisi, and Miguel Virasoro. Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, volume 9. World Scientific Publishing Co Inc, 1987.
[24] Greg Yang and Hadi Salman. A fine-grained spectral
perspective on neural networks. arXiv:1907.10599, 2019.
[25] A. C. C. Coolen. Statistical mechanics of recurrent neural networks II: Dynamics. In F. Moss and S. Gielen, editors, Neuro-Informatics and Neural Modelling, volume 4 of Handbook of Biological Physics, pages 619–684. North-Holland, 2001.
[26] Taro Toyoizumi and Haiping Huang. Structure of at-
tractors in randomly connected networks. Phys. Rev. E,
91:032802, 2015.
[27] A. Crisanti and H. Sompolinsky. Path integral approach
to random neural networks. Phys. Rev. E, 98:062120,
2018.
[28] Viewing the layers as time steps, the functions can be
seen as molecules of gas undergoing k-body collisions.
[29] B. Cessac. Increase in complexity in random neural net-
works. J. Phys. I France, 5(3):409–432, 1995.
[30] J P L Hatchett, B Wemmenhove, I Pérez Castillo,
T Nikoletopoulos, N S Skantzos, and A C C Coolen. Par-
allel dynamics of disordered ising spin systems on finitely
connected random graphs. Journal of Physics A: Math-
ematical and General, 37(24):6201–6220, 2004.
[31] Y. Cheng, D. Wang, P. Zhou, and T. Zhang. Model
compression and acceleration for deep neural networks:
The principles, progress, and challenges. IEEE Signal
Processing Magazine, 35(1):126–136, 2018.
[32] Yann LeCun, Corinna Cortes, and Christopher J. Burges.
The MNIST Database of Handwritten Digits (1998),
http://yann.lecun.com/exdb/mnist/.
[33] Guillermo Valle-Perez, Chico Q. Camargo, and Ard A.
Louis. Deep learning generalizes because the parameter-
function map is biased towards simple functions. In Pro-
ceedings of the 7th International Conference on Learning
Representations. 2019.
[34] Giacomo De Palma, Bobak Kiani, and Seth Lloyd. Random deep neural networks are biased towards simple functions. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 1962–1974. Curran Associates, Inc., 2019.
[35] Alexander Mozeika, David Saad, and Jack Raymond. Noisy random Boolean formulae: A statistical physics perspective. Phys. Rev. E, 82:041112, 2010.
[36] Since $\{S_j\}$ are binary variables in the context of Boolean circuits, linearity is defined in the finite field GF(2) [16, 17].
[37] Sebastian Goldt, Marc Mézard, Florent Krzakala, and
Lenka Zdeborová. Modelling the influence of data struc-
ture on learning in neural networks: the hidden manifold
model. arXiv:1909.11500, 2019.
[38] Pietro Rotondo, Mauro Pastore, and Marco Gherardi.
Beyond the storage capacity: Data-driven satisfiability
transition. Phys. Rev. Lett., 125:120601, Sep 2020.
[39] Mauro Pastore, Pietro Rotondo, Vittorio Erba, and
Marco Gherardi. Statistical learning theory of structured
data. Phys. Rev. E, 102:032119, Sep 2020.
[40] Lenka Zdeborová. Understanding deep learning is also a
job for physicists. Nature Physics, 16(6):602–604, 2020.
[41] B Derrida, E Gardner, and A Zippelius. An exactly solv-
able asymmetric neural network model. Europhysics Let-
ters (EPL), 4(2):167–173, 1987.
[42] R. Kree and A. Zippelius. Continuous-time dynamics of
asymmetrically diluted neural networks. Phys. Rev. A,
36:4421–4427, 1987.
[43] Diederik P. Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. In Proceedings of the 3rd Inter-
national Conference on Learning Representations. 2015.
Space of Functions Computed by Deep-Layered Machines
Supplemental Material
Alexander Mozeika,1 Bo Li,2 and David Saad2
1London Institute for Mathematical Sciences, London, W1K 2XF, United Kingdom
2Nonlinearity and Complexity Research Group, Aston University, Birmingham, B4 7ET, United Kingdom
I. CONVENTION OF NOTATION

We denote variables with overarrows as vectors with site indices (e.g., $i, j$), which can be of size $k$, $n$, or $N$. On the other hand, we denote bold-symbol variables as vectors of size $2^n$ with pattern indices (e.g., $\gamma, \gamma'$), or as matrices of size $2^n \times 2^n$. For convenience, we define $M := 2^n$.

The function $\delta(\cdot,\cdot)$ stands for the Kronecker delta, $\delta(i,j) = \delta_{i,j}$, if the arguments $i, j$ are integer variables, while it stands for the Dirac delta function, $\delta(x,y) = \delta(x-y)$, if the arguments $x, y$ are continuous variables; in the latter case, the summation operation should be interpreted as integration, such that $\sum_y \delta(x,y) f(y) := \int \mathrm{d}y\, \delta(x-y) f(y)$.

The binary variables $S \in \{+1,-1\}$ in this work are mapped onto the conventional Boolean variables $z \in \{0,1\}$ ($z = 0$ represents False, $z = 1$ represents True) through $S = 1 - 2z$ ($S = +1$ represents False, $S = -1$ represents True). This choice of notation has the advantage that Boolean addition ($0 + 0 = 0$, $0 + 1 = 1$, $1 + 1 = 0$) can be represented as integer multiplication ($1\times 1 = 1$, $1\times(-1) = -1$, $(-1)\times(-1) = 1$). Under this convention, the AND gate is defined as $\mathrm{sgn}(S_i + S_j + 1)$, the OR gate is defined as $\mathrm{sgn}(S_i + S_j - 1)$, while the majority vote gate is $\mathrm{sgn}\big(\sum_{j=1}^k S_j\big)$.
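As a quick sanity check of this convention (our addition, not part of the original text), the gate definitions can be verified against the usual truth tables:

```python
import itertools

def sgn(x):
    return 1 if x >= 0 else -1   # the argument never equals 0 for the gates below

AND = lambda si, sj: sgn(si + sj + 1)
OR = lambda si, sj: sgn(si + sj - 1)
MAJ = lambda *s: sgn(sum(s))     # majority vote gate, used with an odd number of inputs

# S = 1 - 2z: +1 encodes False, -1 encodes True.
to_bool = {1: False, -1: True}
for si, sj in itertools.product([1, -1], repeat=2):
    assert to_bool[AND(si, sj)] == (to_bool[si] and to_bool[sj])
    assert to_bool[OR(si, sj)] == (to_bool[si] or to_bool[sj])
print("AND/OR definitions consistent with the S = 1 - 2z mapping")
```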
II. GENERATING FUNCTIONAL ANALYSIS OF FULLY CONNECTED NEURAL NETWORKS

To probe the functions being computed by neural networks, we need to consider the layer propagation of all $2^n$ input patterns, $\prod_{\gamma=1}^{2^n} P(\vec{S}^L_\gamma|\vec{S}^I_\gamma(\vec{s}_\gamma))$. We introduce the disorder-averaged generating functional in order to compute the macroscopic quantities,

$$
\Gamma[\{\psi^l_{i,\gamma}\}] = \overline{\sum_{\{S^l_{i,\gamma}\}_{l,i,\gamma}} \prod_{\gamma=1}^{2^n} P(\vec{S}^0_\gamma|\vec{S}^I_\gamma) \prod_{l=1}^L P(\vec{S}^l_\gamma|\vec{S}^{l-1}_\gamma)\, e^{\mathrm{i}\sum_{l,i,\gamma}\psi^l_{i,\gamma} S^l_{i,\gamma}}}
$$
$$
= \mathbb{E}_W \sum_{\{S^l_{i,\gamma}\}_{l,i,\gamma}} \int \prod_{l=1}^L \prod_{i,\gamma} \frac{\mathrm{d}h^l_{i,\gamma}\,\mathrm{d}x^l_{i,\gamma}}{2\pi}\, \prod_{\gamma=1}^{2^n} P(\vec{S}^0_\gamma|\vec{S}^I_\gamma) \prod_{l=1}^L P(\vec{S}^l_\gamma|\vec{h}^l_\gamma)\, e^{\mathrm{i}\sum_{l,i,\gamma}\psi^l_{i,\gamma} S^l_{i,\gamma}}
$$
$$
\quad\times \exp\bigg(\sum_{l,i,\gamma}\mathrm{i}\, x^l_{i,\gamma} h^l_{i,\gamma} - \sum_{l,\gamma}\sum_{ij}\frac{\mathrm{i}}{\sqrt{N}}\, W^l_{ij}\, x^l_{i,\gamma} S^{l-1}_{j,\gamma}\bigg), \qquad \text{(S1)}
$$

where we have introduced the notation $P(\vec{S}^l_\gamma|\vec{h}^l_\gamma) = \prod_{i=1}^N P(S^l_{i,\gamma}|h^l_{i,\gamma}) = \prod_{i=1}^N \delta\big(S^l_{i,\gamma},\,\alpha^l(h^l_{i,\gamma})\big)$ and inserted the Fourier representation of unity $1 = \int \frac{\mathrm{d}h^l_{i,\gamma}\,\mathrm{d}x^l_{i,\gamma}}{2\pi} \exp\big[\mathrm{i}\, x^l_{i,\gamma}\big(h^l_{i,\gamma} - \frac{1}{\sqrt{N}}\sum_j W^l_{ij} S^{l-1}_{j,\gamma}\big)\big]$, $\forall\, l, i, \gamma$. Noisy computation can easily be accommodated in such a probabilistic formalism.
A. Layer-dependent Architectures

We first consider layer-dependent weights, where each element follows the Gaussian distribution $W^l_{ij} \sim \mathcal{N}(0, \sigma^2_w)$. Assuming self-averaging, averaging over the weight disorder in the last line of Eq. (S1) yields

$$
\mathbb{E}_W \exp\bigg(-\sum_{l=1}^L\sum_\gamma\sum_{ij}\frac{\mathrm{i}}{\sqrt{N}}\, W^l_{ij}\, x^l_{i,\gamma} S^{l-1}_{j,\gamma}\bigg) = \exp\bigg(-\frac{\sigma^2_w}{2}\sum_{l=1}^L\sum_{\gamma,\gamma'}\sum_i x^l_{i,\gamma} x^l_{i,\gamma'}\,\frac{1}{N}\sum_j S^{l-1}_{j,\gamma} S^{l-1}_{j,\gamma'}\bigg). \qquad \text{(S2)}
$$

By introducing the overlap order parameters $\{q^l_{\gamma\gamma'}\}_{l=0}^L$ through the Fourier representation of unity,

$$
1 = \int \frac{\mathrm{d}Q^l_{\gamma\gamma'}\,\mathrm{d}q^l_{\gamma\gamma'}}{2\pi/N} \exp\bigg[\mathrm{i} N Q^l_{\gamma\gamma'}\bigg(q^l_{\gamma\gamma'} - \frac{1}{N}\sum_j S^l_{j,\gamma} S^l_{j,\gamma'}\bigg)\bigg], \qquad \text{(S3)}
$$

the generating functional can be factorized over sites as follows:

$$
\Gamma[\{\psi^l_{i,\gamma}\}] = \int \prod_{l=0}^L\prod_{\gamma\gamma'} \frac{\mathrm{d}Q^l_{\gamma\gamma'}\,\mathrm{d}q^l_{\gamma\gamma'}}{2\pi/N}\, \exp\bigg(\mathrm{i} N\sum_{l,\gamma\gamma'} Q^l_{\gamma\gamma'} q^l_{\gamma\gamma'}\bigg) \exp\bigg(\sum_{i=1}^N \log \int \prod_{l=1}^L\prod_\gamma \mathrm{d}h^l_{i,\gamma} \sum_{\{S^l_{i,\gamma}\}_{l,\gamma}} M_{n_i}(\boldsymbol{h}_i, \boldsymbol{S}_i)\bigg)
$$
$$
= \int \prod_{l=0}^L\prod_{\gamma\gamma'} \frac{\mathrm{d}Q^l_{\gamma\gamma'}\,\mathrm{d}q^l_{\gamma\gamma'}}{2\pi/N}\, \exp\bigg(\mathrm{i} N\sum_{l,\gamma\gamma'} Q^l_{\gamma\gamma'} q^l_{\gamma\gamma'}\bigg) \exp\bigg(N \sum_{m=1}^{|\vec{S}^I|} \bigg[\frac{1}{N}\sum_{i=1}^N \delta(m, n_i)\bigg] \log \int \prod_{l=1}^L\prod_\gamma \mathrm{d}h^l_\gamma \sum_{\{S^l_\gamma\}_{l,\gamma}} M_m(\boldsymbol{h}, \boldsymbol{S})\bigg),
$$

where $\boldsymbol{h}_i, \boldsymbol{S}_i$ are shorthand notations for $\{\boldsymbol{h}^l_i\}, \{\boldsymbol{S}^l_i\}$, with $\boldsymbol{h}^l_i := (h^l_{i,1}, \ldots, h^l_{i,\gamma}, \ldots, h^l_{i,2^n})$ and $\boldsymbol{S}^l_i := (S^l_{i,1}, \ldots, S^l_{i,\gamma}, \ldots, S^l_{i,2^n})$. The single-site measure $M_m$ in the above expression is defined as

$$
M_m(\boldsymbol{h}_i, \boldsymbol{S}_i) = \prod_{\gamma=1}^{2^n}\bigg[e^{\mathrm{i}\sum_l \psi^l_{i,\gamma} S^l_{i,\gamma}}\, P(S^0_{i,\gamma}|S^I_{m,\gamma}) \prod_{l=1}^L P(S^l_{i,\gamma}|h^l_{i,\gamma})\bigg] \exp\bigg(-\sum_{l,\gamma\gamma'} \mathrm{i}\, Q^l_{\gamma\gamma'} S^l_{i,\gamma} S^l_{i,\gamma'}\bigg)
$$
$$
\quad\times \prod_{l=1}^L \frac{1}{\sqrt{(2\pi)^{2^n}|\boldsymbol{c}^l|}} \exp\bigg(-\frac{1}{2}\sum_{\gamma\gamma'} h^l_{i,\gamma}\,(\boldsymbol{c}^l)^{-1}_{\gamma\gamma'}\, h^l_{i,\gamma'}\bigg). \qquad \text{(S4)}
$$

In Eq. (S4), $\boldsymbol{c}^l$ is a $2^n\times 2^n$ covariance matrix with elements $c^l_{\gamma\gamma'} = \sigma^2_w q^{l-1}_{\gamma\gamma'}$, and $m$ is a random index following the empirical distribution $\frac{1}{N}\sum_{i=1}^N \delta(m, n_i)$.

Setting $\psi^l_{i,\gamma} = 0$ and considering $\lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^N \delta(m, n_i) \to P(m) = 1/|\vec{S}^I|$, we arrive at

$$
\Gamma = \int \{\mathrm{d}Q\,\mathrm{d}q\}\, e^{N\Psi(Q,q)}, \qquad \text{(S5)}
$$
$$
\Psi(Q,q) = \sum_{l=0}^L\sum_{\gamma\gamma'} \mathrm{i}\, Q^l_{\gamma\gamma'} q^l_{\gamma\gamma'} + \sum_{m=1}^{|\vec{S}^I|} P(m) \log \int \prod_{l=1}^L\prod_\gamma \mathrm{d}h^l_\gamma \sum_{\{S^l_\gamma\}_{l,\gamma}} M_m(\boldsymbol{h}, \boldsymbol{S}). \qquad \text{(S6)}
$$

The saddle-point equations are obtained by setting $\partial\Psi/\partial q^l_{\gamma\gamma'} = 0$ and $\partial\Psi/\partial Q^l_{\gamma\gamma'} = 0$:

$$
\mathrm{i}\, Q^{l-1}_{\gamma\gamma'} = -\sum_m P(m)\, \frac{\int\mathrm{d}\boldsymbol{h}\sum_{\boldsymbol{S}} \frac{\partial}{\partial q^{l-1}_{\gamma\gamma'}} M_m(\boldsymbol{h},\boldsymbol{S})}{\int\mathrm{d}\boldsymbol{h}\sum_{\boldsymbol{S}} M_m(\boldsymbol{h},\boldsymbol{S})}, \quad 1 \le l \le L, \qquad \text{(S7)}
$$
$$
\mathrm{i}\, Q^L_{\gamma\gamma'} = 0, \qquad \text{(S8)}
$$
$$
q^l_{\gamma\gamma'} = \sum_{m=1}^{|\vec{S}^I|} P(m)\, \langle S^l_\gamma S^l_{\gamma'}\rangle_{M_m}, \quad 0 \le l \le L. \qquad \text{(S9)}
$$

Back-propagating the boundary condition $\mathrm{i}\, Q^L_{\gamma\gamma'} = 0$ results in $\mathrm{i}\, Q^l_{\gamma\gamma'} = 0$, $\forall l$ [1].

The measure $M_m$ then becomes

$$
M_m(\boldsymbol{h}, \boldsymbol{S}) = \prod_{\gamma=1}^{2^n}\bigg[P(S^0_\gamma|S^I_{m,\gamma}) \prod_{l=1}^L P(S^l_\gamma|h^l_\gamma)\bigg] \times \prod_{l=1}^L \frac{1}{\sqrt{(2\pi)^{2^n}|\boldsymbol{c}^l|}} \exp\bigg(-\frac{1}{2}\sum_{\gamma\gamma'} h^l_\gamma\,(\boldsymbol{c}^l)^{-1}_{\gamma\gamma'}\, h^l_{\gamma'}\bigg), \qquad \text{(S10)}
$$

while the saddle-point equations for the overlaps take the form

$$
q^0_{\gamma\gamma'} = \sum_m P(m)\, S^I_{m,\gamma} S^I_{m,\gamma'}, \qquad \text{(S11)}
$$
$$
q^l_{\gamma\gamma'} = \int \mathrm{d}h^l_\gamma\,\mathrm{d}h^l_{\gamma'}\, \frac{\phi^l(h^l_\gamma)\,\phi^l(h^l_{\gamma'})}{\sqrt{(2\pi)^2 |\Sigma^l_{\gamma\gamma'}|}} \exp\bigg(-\frac{1}{2}\,[h^l_\gamma, h^l_{\gamma'}]\cdot(\Sigma^l_{\gamma\gamma'})^{-1}\cdot[h^l_\gamma, h^l_{\gamma'}]^\top\bigg), \qquad \text{(S12)}
$$

where the $2\times 2$ covariance matrix $\Sigma^l_{\gamma\gamma'}$ is defined as

$$
\Sigma^l_{\gamma\gamma'} := \sigma^2_w \begin{pmatrix} q^{l-1}_{\gamma\gamma} & q^{l-1}_{\gamma\gamma'} \\ q^{l-1}_{\gamma'\gamma} & q^{l-1}_{\gamma'\gamma'} \end{pmatrix}. \qquad \text{(S13)}
$$
B. Recurrent Architectures

In this section, we consider the recurrent topology where the weights are independent of layers, $W^l_{ij} = W_{ij} \sim \mathcal{N}(0, \sigma^2_w)$. The calculation resembles the case of layer-dependent weights, except that the disorder average yields cross-layer overlaps,

$$
\mathbb{E}_W \exp\bigg(-\sum_{l=1}^L\sum_\gamma\sum_{ij}\frac{\mathrm{i}}{\sqrt{N}}\, W_{ij}\, x^l_{i,\gamma} S^{l-1}_{j,\gamma}\bigg) = \exp\bigg(-\frac{\sigma^2_w}{2}\sum_{l,l'=1}^L\sum_{\gamma,\gamma'}\sum_i x^l_{i,\gamma} x^{l'}_{i,\gamma'}\,\frac{1}{N}\sum_j S^{l-1}_{j,\gamma} S^{l'-1}_{j,\gamma'}\bigg). \qquad \text{(S14)}
$$

Introducing the order parameters $q^{l,l'}_{\gamma\gamma'} := \frac{1}{N}\sum_j S^l_{j,\gamma} S^{l'}_{j,\gamma'}$ and setting $\psi^l_{i,\gamma} = 0$, we eventually obtain

$$
\Gamma = \int \{\mathrm{d}Q\,\mathrm{d}q\}\, e^{N\Psi(Q,q)}, \qquad \text{(S15)}
$$
$$
\Psi(Q,q) = \mathrm{i}\,\mathrm{Tr}\{qQ\} + \sum_{m=1}^{|\vec{S}^I|} P(m) \log \int \prod_{l=1}^L\prod_\gamma \mathrm{d}h^l_\gamma \sum_{\{S^l_\gamma\}_{l,\gamma}} M_m(\boldsymbol{h},\boldsymbol{S}), \qquad \text{(S16)}
$$
$$
M_m(\boldsymbol{h},\boldsymbol{S}) = \prod_{\gamma=1}^{2^n}\bigg[P(S^0_\gamma|S^I_{m,\gamma}) \prod_{l=1}^L P(S^l_\gamma|h^l_\gamma)\bigg] \exp\bigg(-\sum_{ll',\gamma\gamma'} \mathrm{i}\, Q^{l,l'}_{\gamma\gamma'} S^l_\gamma S^{l'}_{\gamma'}\bigg) \times \frac{1}{\sqrt{(2\pi)^{2^n L}|\boldsymbol{C}|}} \exp\bigg(-\frac{1}{2}\boldsymbol{H}^\top \boldsymbol{C}^{-1}\boldsymbol{H}\bigg), \qquad \text{(S17)}
$$

where $\mathrm{i}\,\mathrm{Tr}\{qQ\} = \mathrm{i}\sum_{l,l'=0}^L\sum_{\gamma\gamma'} Q^{l,l'}_{\gamma\gamma'} q^{l,l'}_{\gamma\gamma'}$ and $\boldsymbol{H} = (\boldsymbol{h}^1, \ldots, \boldsymbol{h}^L) \in \mathbb{R}^{2^n L}$ collects the preactivation fields of all patterns and all layers, while $\boldsymbol{C}$ is a $2^n L\times 2^n L$ covariance matrix.

The corresponding saddle-point equations are

$$
\mathrm{i}\, Q^{l-1,l'-1}_{\gamma\gamma'} = -\sum_m P(m)\, \frac{\int\mathrm{d}\boldsymbol{h}\sum_{\boldsymbol{S}} \frac{\partial}{\partial q^{l-1,l'-1}_{\gamma\gamma'}} M_m(\boldsymbol{h},\boldsymbol{S})}{\int\mathrm{d}\boldsymbol{h}\sum_{\boldsymbol{S}} M_m(\boldsymbol{h},\boldsymbol{S})}, \quad 1 \le l, l' \le L, \qquad \text{(S18)}
$$
$$
\mathrm{i}\, Q^{L,l}_{\gamma\gamma'} = 0, \quad \forall l, \qquad \text{(S19)}
$$
$$
q^{l,l'}_{\gamma\gamma'} = \sum_m P(m)\, \langle S^l_\gamma S^{l'}_{\gamma'}\rangle_{M_m}, \quad 0 \le l, l' \le L. \qquad \text{(S20)}
$$

All conjugate order parameters $\{\mathrm{i}\, Q^{l,l'}_{\gamma\gamma'}\}$ vanish identically, as in the previous case, such that the effective single-site measure becomes

$$
M_m(\boldsymbol{h},\boldsymbol{S}) = \prod_{\gamma=1}^{2^n}\bigg[P(S^0_\gamma|S^I_{m,\gamma}) \prod_{l=1}^L P(S^l_\gamma|h^l_\gamma)\bigg] \times \frac{1}{\sqrt{(2\pi)^{2^n L}|\boldsymbol{C}|}} \exp\bigg(-\frac{1}{2}\boldsymbol{H}^\top \boldsymbol{C}^{-1}\boldsymbol{H}\bigg), \qquad \text{(S21)}
$$

and the saddle-point equations of the order parameters follow as

$$
q^{0,0}_{\gamma\gamma'} = \sum_m P(m)\, S^I_{m,\gamma} S^I_{m,\gamma'}, \qquad \text{(S22)}
$$
$$
q^{l,0}_{\gamma\gamma'} = \sum_m P(m)\, \langle S^l_\gamma S^0_{\gamma'}\rangle_{M_m} = \sum_m P(m)\, S^I_{m,\gamma'} \int \mathrm{d}h^l_\gamma\, \frac{\phi^l(h^l_\gamma)}{\sqrt{2\pi\sigma^2_w}} \exp\bigg(-\frac{1}{2\sigma^2_w}\big(h^l_\gamma\big)^2\bigg), \qquad \text{(S23)}
$$
$$
q^{l,l'}_{\gamma\gamma'} = \int \mathrm{d}h^l_\gamma\,\mathrm{d}h^{l'}_{\gamma'}\, \frac{\phi^l(h^l_\gamma)\,\phi^{l'}(h^{l'}_{\gamma'})}{\sqrt{(2\pi)^2 |\Sigma^{l,l'}_{\gamma\gamma'}|}} \exp\bigg(-\frac{1}{2}\,[h^l_\gamma, h^{l'}_{\gamma'}]\cdot(\Sigma^{l,l'}_{\gamma\gamma'})^{-1}\cdot[h^l_\gamma, h^{l'}_{\gamma'}]^\top\bigg), \qquad \text{(S24)}
$$

where the $2\times 2$ covariance matrix $\Sigma^{l,l'}_{\gamma\gamma'}$ is defined as

$$
\Sigma^{l,l'}_{\gamma\gamma'} := \sigma^2_w \begin{pmatrix} q^{l-1,l-1}_{\gamma\gamma} & q^{l-1,l'-1}_{\gamma\gamma'} \\ q^{l'-1,l-1}_{\gamma'\gamma} & q^{l'-1,l'-1}_{\gamma'\gamma'} \end{pmatrix}. \qquad \text{(S25)}
$$

A similar formalism was derived in the context of dynamical recurrent neural networks to study the autocorrelation of spin/neural dynamics [2].
C. Strong Equivalence Between Layer-dependent and Recurrent Architectures for Odd Activation Functions

In general, the statistical properties of the activities of machines with layer-dependent and recurrent architectures are different, since the fields $\{\boldsymbol{h}^l\}$ of different layers are directly correlated in the latter case. However, one can observe that the equal-layer overlaps $q^{l,l}_{\gamma\gamma'}$ of the recurrent architectures are identical to the $q^l_{\gamma\gamma'}$ of the layer-dependent architectures, by noticing the same initial condition in Eqs. (S22) and (S11) and the same forward propagation rules in Eq. (S24) (with $l = l'$) and Eq. (S12).

If the cross-layer overlaps $\{q^{l,l'}_{\gamma\gamma'}\,|\, l \neq l'\}$ vanish, then the direct correlations between the $\boldsymbol{h}^l$ of different layers also vanish, such that

$$
\frac{1}{\sqrt{(2\pi)^{2^n L}|\boldsymbol{C}|}} \exp\bigg(-\frac{1}{2}\boldsymbol{H}^\top \boldsymbol{C}^{-1}\boldsymbol{H}\bigg) = \prod_l \frac{1}{\sqrt{(2\pi)^{2^n}|\boldsymbol{c}^l|}} \exp\bigg(-\frac{1}{2}(\boldsymbol{h}^l)^\top(\boldsymbol{c}^l)^{-1}\boldsymbol{h}^l\bigg). \qquad \text{(S26)}
$$

In this case, the distributions of the macroscopic trajectories $\{\boldsymbol{h}^l, \boldsymbol{S}^l\}$ of the two architectures are equivalent. One sufficient condition for this to hold is that the activation functions $\phi^l(\cdot)$ are odd functions satisfying $\phi^l(-x) = -\phi^l(x)$. First, this condition implies that $q^{l,0}_{\gamma\gamma'} = 0$, $\forall l$, by Eq. (S23); second, $q^{l,0}_{\gamma\gamma'} = 0$ and the fact that $\phi^l(\cdot)$ is odd imply $q^{l+1,1}_{\gamma\gamma'} = 0$, which leads to $q^{l,l'}_{\gamma\gamma'} = 0$, $\forall l \neq l'$, by induction.
D. Weak Equivalence Between Layer-dependent and Recurrent Architectures for General Activation Functions

As shown above, the trajectories $\{\boldsymbol{h}^l, \boldsymbol{S}^l\}$ of layer-dependent architectures generally follow a different distribution from those of recurrent architectures with shared weights, except in specific cases such as DNNs with odd activation functions. Here we focus on the distribution of activities at the output layer.

For layer-dependent weights, the joint distribution of the local fields and activations at layer $L$ is obtained by marginalizing the variables of the initial and hidden layers:

$$
P(\boldsymbol{h}^L, \boldsymbol{S}^L) = \int \prod_{l=1}^{L-1}\prod_\gamma \mathrm{d}h^l_\gamma \sum_m P(m) \sum_{\{S^l_\gamma\}_{\gamma, l<L}} M_m(\boldsymbol{h},\boldsymbol{S})
$$
$$
= \int \prod_{l=1}^{L-1}\mathrm{d}\boldsymbol{h}^l \sum_{\{S^l_\gamma\}_{\gamma, l<L}} \prod_{l=1}^L\prod_\gamma P(S^l_\gamma|h^l_\gamma) \prod_{l=1}^L \frac{1}{\sqrt{(2\pi)^{2^n}|\boldsymbol{c}^l|}} \exp\bigg(-\frac{1}{2}(\boldsymbol{h}^l)^\top(\boldsymbol{c}^l)^{-1}\boldsymbol{h}^l\bigg)
$$
$$
= \mathcal{N}\big(\boldsymbol{h}^L|0, \boldsymbol{c}^L(\boldsymbol{q}^{L-1})\big) \prod_{\gamma=1}^{2^n} P(S^L_\gamma|h^L_\gamma), \qquad \text{(S27)}
$$

where $\mathcal{N}(\boldsymbol{h}^L|0, \boldsymbol{c}^L(\boldsymbol{q}^{L-1}))$ is a $2^n$-dimensional multivariate Gaussian distribution. The distribution of Boolean functions $f(\cdot)$ computed at layer $L$ is

$$
P^L(\boldsymbol{f}) = \int \mathrm{d}\boldsymbol{h}^L \sum_{\boldsymbol{S}^L} P(\boldsymbol{h}^L, \boldsymbol{S}^L) \prod_{\gamma=1}^{2^n} \delta\big(S^L_\gamma, f(\vec{s}_\gamma)\big) = \int \mathrm{d}\boldsymbol{h}\, \mathcal{N}(\boldsymbol{h}|0,\boldsymbol{c}^L) \prod_{\gamma=1}^{2^n} \delta\big(f_\gamma, \alpha^L(h_\gamma)\big), \qquad \text{(S28)}
$$

where the binary string $\boldsymbol{f}$ of size $2^n$ represents the Boolean function $f(\cdot)$ with $f_\gamma = f(\vec{s}_\gamma)$.

For shared weights, the fields of all layers $\boldsymbol{H} = (\boldsymbol{h}^1, \ldots, \boldsymbol{h}^l, \ldots, \boldsymbol{h}^L) \in \mathbb{R}^{2^n L}$ are coupled through the covariance $\boldsymbol{C}$:

$$
P(\boldsymbol{h}^L, \boldsymbol{S}^L) = \int \prod_{l=1}^{L-1}\mathrm{d}\boldsymbol{h}^l \sum_{\{S^l_\gamma\}_{\gamma, l<L}} \prod_{l=1}^L\prod_\gamma P(S^l_\gamma|h^l_\gamma)\, \frac{1}{\sqrt{(2\pi)^{2^n L}|\boldsymbol{C}|}} \exp\bigg(-\frac{1}{2}\boldsymbol{H}^\top \boldsymbol{C}^{-1}\boldsymbol{H}\bigg)
$$
$$
= \prod_\gamma P(S^L_\gamma|h^L_\gamma) \int \prod_{l=1}^{L-1}\mathrm{d}\boldsymbol{h}^l\, \frac{1}{\sqrt{(2\pi)^{2^n L}|\boldsymbol{C}|}} \exp\bigg(-\frac{1}{2}\boldsymbol{H}^\top \boldsymbol{C}^{-1}\boldsymbol{H}\bigg)
$$
$$
= \prod_\gamma P(S^L_\gamma|h^L_\gamma)\, \frac{1}{\sqrt{(2\pi)^{2^n}|\boldsymbol{C}^{L,L}|}} \exp\bigg(-\frac{1}{2}(\boldsymbol{h}^L)^\top(\boldsymbol{C}^{L,L})^{-1}\boldsymbol{h}^L\bigg) = \mathcal{N}\big(\boldsymbol{h}^L|0, \boldsymbol{C}^{L,L}(\boldsymbol{q}^{L-1,L-1})\big) \prod_{\gamma=1}^{2^n} P(S^L_\gamma|h^L_\gamma). \qquad \text{(S29)}
$$

Since the equal-layer overlap follows the same dynamical rule as in the case of layer-dependent weights, such that $\boldsymbol{C}^{L,L} = \boldsymbol{c}^L$, the distributions $P(\boldsymbol{h}^L, \boldsymbol{S}^L)$ of the two scenarios are equivalent. This suggests that if only the input-output mapping is of interest (and not the hidden-layer activity), the distributions of Boolean functions $P^L(\boldsymbol{f})$ computed at the final layer of the two architectures are equivalent.
III. REMARK ON THE EQUIVALENCE PROPERTY

We observed that, although autocorrelations generally exist in recurrent architectures, they do not complicate the single-layer macroscopic behavior of the systems studied. This is due to the fact that the weights/couplings used are asymmetric (i.e., $W_{ij}$ and $W_{ji}$ are independent of each other), such that there are no intricate feedback interactions of a node with its state at previous time steps, which renders the single-layer macroscopic dynamics Markovian [3]. If symmetric couplings are present, then the whole history of the network is needed to characterize the dynamics [3]. A similar effect of coupling asymmetry was also observed in sparsely connected networks [4]. Early works investigating asymmetric couplings in neural networks include [5, 6].

Although we only consider finite input dimension, we expect that the equivalence property also holds in cases where $n$ is of the same order as $N$, as long as the couplings are asymmetric.
IV. EXTENSION OF THE THEORY ON DENSELY-CONNECTED NEURAL NETWORKS

While the theory of densely connected neural networks was developed in the infinite-width limit $N\to\infty$, we expect that it applies to large but finite systems (some of the properties investigated below require $N \gg n$). In [7], it is found that the order parameters in such systems satisfy a large deviation principle, which implies an exponential convergence rate to the typical behavior as the width $N$ grows. Typically, the order parameters of systems with $N \sim 10^3$ are well concentrated around the typical values predicted by the theory.

Accommodating the cases where $n$ is of the same order as $N$ (both tending to infinity) in the current framework is more subtle, as it requires an infinite number of order parameters and may result in the loss of the self-averaging property.

While Boolean input variables are of primary interest here, the input domain can be generalized to any countable set; this is relevant for adapting the theory to real input variables with a defined numerical precision, e.g., if a real input variable $s \in [0,1]$ is subject to a precision of 0.01, then only the finite number of possible values $s \in \{0, 0.01, 0.02, \ldots, 1\}$ needs to be considered. On the other hand, real input variables with arbitrary precision are difficult to deal with, as there is an uncountably infinite number of input patterns and the product $\prod_\gamma \cdots$ is ill defined. It is worth noting [8] that random neural networks on real spherical data share similar properties with those on Boolean data in high dimension; therefore, we expect that the equivalence between layer-dependent and recurrent architectures also applies to real data.

Besides being able to treat the highly correlated recurrent architectures, the GF or path-integral framework also facilitates the characterization of fluctuations around the typical behavior [9] and the computation of large deviations in finite-size systems [7].
V. TRAINING EXPERIMENTS

Figure 2 of the main text demonstrates the feasibility of using DNNs with recurrent architectures to perform an image recognition task on the MNIST handwritten digit data. In this section, we describe the details of the training experiment. The objective is not to achieve state-of-the-art performance, but to showcase the potential of using recurrent architectures for parameter reduction. Therefore, we preprocess the image data by downsampling with a factor of 2 through average pooling, which saves training runtime by reducing the size of each image from $28\times 28$ to $14\times 14$. See Fig. S1(a) for an example.

We consider DNNs of both architectures, layer-dependent and recurrent, where the input $\vec{s}$ is directly copied onto the initial layer $\vec{S}^0$ and a softmax function is applied to the final layer. We remark that the theory developed in this work is applicable to random-weight DNNs implementing Boolean functions, while it is not directly applicable to trained networks. For recurrent architectures, since the dimensions of the input and output layers are fixed, only the weights $W^{\mathrm{hid}}$ between hidden layers are shared, i.e.,

$$
\vec{S}^0 \xrightarrow{W^{\mathrm{in}}} \vec{S}^1 \xrightarrow{W^{\mathrm{hid}}} \vec{S}^2 \xrightarrow{W^{\mathrm{hid}}} \cdots\, \vec{S}^l \xrightarrow{W^{\mathrm{hid}}} \vec{S}^{l+1} \xrightarrow{W^{\mathrm{hid}}} \cdots \xrightarrow{W^{\mathrm{hid}}} \vec{S}^{L-1} \xrightarrow{W^{\mathrm{out}}} \vec{S}^L, \qquad \text{(S30)}
$$

where all the hidden layers have the same width. The DNNs of both architectures are trained by the ADAM algorithm with back-propagation [10]. In Fig. S1(b), we demonstrate that for different widths of the hidden layers, DNNs with recurrent architectures can achieve performance comparable to those with layer-dependent architectures.

Figure S1. (a) The MNIST data are preprocessed by average pooling to downsample the images, in order to reduce training time. (b) Test accuracy of trained fully connected DNNs with 6 hidden layers applied to the MNIST dataset. For different widths of the hidden layers, DNNs with recurrent architectures can achieve performance comparable to those with layer-dependent architectures.
VI. BOOLEAN FUNCTIONS COMPUTED BY RANDOM DNNS

To examine the distribution of Boolean functions computed at layer $L$ (we always apply the sign activation function in the final layer), notice that the nodes at layer $L$ are not coupled together, so it is sufficient to consider a particular node in the final layer, which follows the distribution of the effective single-site measure established before.

Further notice that the local field $\boldsymbol{h}^L \in \mathbb{R}^{2^n}$ in the final layer follows a multivariate Gaussian distribution with zero mean and covariance $c^L_{\gamma\gamma'} = \sigma^2_w q^{L-1}_{\gamma\gamma'}$. Essentially, the local field $\boldsymbol{h}^L$ is a Gaussian process with a dot-product kernel (in the limit $N\to\infty$) [8, 11],

$$
k(\vec{x}, \vec{x}') = k\Big(\frac{\vec{x}\cdot\vec{x}'}{n}\Big) = \sigma^2_w q^{L-1}_{x,x'}, \qquad \text{(S31)}
$$

where $\vec{x}, \vec{x}'$ are $n$-dimensional vectors.

The probability that a Boolean function $f(s_1, \ldots, s_n)$ is computed by the fully connected neural network is

$$
P^L(\boldsymbol{f}) = \int \mathrm{d}\boldsymbol{h}\, \mathcal{N}\big(\boldsymbol{h}|0, \boldsymbol{c}^L(\boldsymbol{q})\big) \prod_{\gamma=1}^{2^n} \delta\big(\mathrm{sgn}(h^L_\gamma),\, f_\gamma\big). \qquad \text{(S32)}
$$

We focus on systems with layer-dependent architectures, where the overlap $q^l_{\gamma\gamma'}$ is governed by the forward dynamics

$$
q^l_{\gamma\gamma'} = \int \mathrm{d}h^l_\gamma\,\mathrm{d}h^l_{\gamma'}\, \frac{\phi(h^l_\gamma)\,\phi(h^l_{\gamma'})}{\sqrt{(2\pi)^2|\Sigma^l|}} \exp\bigg(-\frac{1}{2}\,[h^l_\gamma, h^l_{\gamma'}]\cdot\big(\Sigma^l_{\gamma\gamma'}(\boldsymbol{q}^{l-1})\big)^{-1}\cdot[h^l_\gamma, h^l_{\gamma'}]^\top\bigg). \qquad \text{(S33)}
$$

For the sign activation function, choosing $\sigma_w = 1$ yields

$$
\Sigma^l_{\gamma\gamma'} = \begin{pmatrix} 1 & q^{l-1}_{\gamma\gamma'} \\ q^{l-1}_{\gamma\gamma'} & 1 \end{pmatrix}, \qquad \text{(S34)}
$$
$$
q^l_{\gamma\gamma'} = \frac{2}{\pi}\sin^{-1} q^{l-1}_{\gamma\gamma'}, \quad l > 0, \qquad \text{(S35)}
$$
$$
q^0_{\gamma\gamma'} = \begin{cases} \frac{1}{n}\sum_{m=1}^n s_{m,\gamma} s_{m,\gamma'}, & \vec{S}^I = \vec{s}, \\ \frac{1}{n+1}\big(\sum_{m=1}^n s_{m,\gamma} s_{m,\gamma'} + 1\big), & \vec{S}^I = (\vec{s}, 1). \end{cases} \qquad \text{(S36)}
$$

For the ReLU activation function, choosing $\sigma_w = \sqrt{2}$ yields

$$
\Sigma^l_{\gamma\gamma'} = 2\begin{pmatrix} q^{l-1}_{\gamma\gamma} & q^{l-1}_{\gamma\gamma'} \\ q^{l-1}_{\gamma\gamma'} & q^{l-1}_{\gamma'\gamma'} \end{pmatrix}, \qquad \text{(S37)}
$$
$$
q^l_{\gamma\gamma} = \frac{1}{2}\big[\Sigma^l_{\gamma\gamma}\big]_{11} = 1, \quad \forall l, \gamma, \qquad \text{(S38)}
$$
$$
q^l_{\gamma\gamma'} = \frac{1}{2\pi}\bigg[\sqrt{|\Sigma^l_{\gamma\gamma'}|} + \frac{\pi}{2}\big[\Sigma^l_{\gamma\gamma'}\big]_{12} + \big[\Sigma^l_{\gamma\gamma'}\big]_{12}\tan^{-1}\frac{\big[\Sigma^l_{\gamma\gamma'}\big]_{12}}{\sqrt{|\Sigma^l_{\gamma\gamma'}|}}\bigg] = \frac{1}{\pi}\bigg[\sqrt{1 - \big(q^{l-1}_{\gamma\gamma'}\big)^2} + q^{l-1}_{\gamma\gamma'}\Big(\frac{\pi}{2} + \sin^{-1} q^{l-1}_{\gamma\gamma'}\Big)\bigg], \quad l > 0, \qquad \text{(S39)}
$$
$$
q^0_{\gamma\gamma'} = \begin{cases} \frac{1}{n}\sum_{m=1}^n s_{m,\gamma} s_{m,\gamma'}, & \vec{S}^I = \vec{s}, \\ \frac{1}{n+1}\big(\sum_{m=1}^n s_{m,\gamma} s_{m,\gamma'} + 1\big), & \vec{S}^I = (\vec{s}, 1). \end{cases} \qquad \text{(S40)}
$$
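As an independent numerical check (ours, not part of the original text), the closed form of Eq. (S39) can be compared against a direct Monte Carlo estimate of the Gaussian integral of Eq. (S33) for the ReLU activation:

```python
import numpy as np

def relu_overlap_closed_form(q):
    """Eq. (S39): equal-layer overlap map for ReLU with sigma_w^2 = 2."""
    return (np.sqrt(1.0 - q ** 2) + q * (np.pi / 2.0 + np.arcsin(q))) / np.pi

def relu_overlap_monte_carlo(q, sigma2=2.0, n_samples=2_000_000, seed=0):
    """Direct estimate of Eq. (S33): E[ReLU(h) ReLU(h')] with
    (h, h') ~ N(0, sigma2 * [[1, q], [q, 1]]) (unit previous-layer self-overlap)."""
    rng = np.random.default_rng(seed)
    cov = sigma2 * np.array([[1.0, q], [q, 1.0]])
    h = rng.multivariate_normal([0.0, 0.0], cov, size=n_samples)
    relu = np.maximum(h, 0.0)
    return np.mean(relu[:, 0] * relu[:, 1])

q = 0.3
print(relu_overlap_closed_form(q), relu_overlap_monte_carlo(q))   # the two values agree
```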
The iteration mappings of the overlaps for the two activation functions considered are depicted in Fig. S2.

Figure S2. Iteration mappings of the overlaps of fully connected neural networks in the absence of bias variables. (a) Sign activation function: $q^l = 0$ is a stable fixed point, while $q^l = 1$ and $q^l = -1$ are two unstable fixed points. (b) ReLU activation function with $\sigma_w = \sqrt{2}$: $q^l = 1$ is a stable fixed point.
A. ReLU Networks in the Large L Limit

In this section, we focus on the large-depth limit $L\to\infty$. For the ReLU activation function, all the matrix elements of $\boldsymbol{c}^L$ become identical in the large-$L$ limit, leading to $\boldsymbol{c}^L(\boldsymbol{q}) \to \sigma^2_w\boldsymbol{J}$ (where $\boldsymbol{J}$ is the all-one matrix) and a degenerate Gaussian distribution of the vector $\boldsymbol{h}^L$, enforcing all its components to be the same. To make this explicit, we consider the distribution of $\boldsymbol{h}^L$ as follows:

$$
P(\boldsymbol{h}^L) = \frac{1}{\sqrt{(2\pi)^M|\boldsymbol{c}^L|}} \exp\bigg(-\frac{1}{2}(\boldsymbol{h}^L)^\top(\boldsymbol{c}^L)^{-1}\boldsymbol{h}^L\bigg) = \int \frac{\mathrm{d}\boldsymbol{x}^L}{(2\pi)^M} \exp\bigg(\mathrm{i}\,\boldsymbol{x}^L\cdot\boldsymbol{h}^L - \frac{\sigma^2_w}{2}(\boldsymbol{x}^L)^\top\boldsymbol{J}\boldsymbol{x}^L\bigg) \qquad \text{(S41)}
$$
$$
= \lim_{\kappa\to 1} \int \frac{\mathrm{d}\boldsymbol{x}^L}{(2\pi)^M} \exp\bigg(\mathrm{i}\,\boldsymbol{x}^L\cdot\boldsymbol{h}^L - \frac{\sigma^2_w}{2}\Big[\sum_\gamma \big(x^L_\gamma\big)^2 + \kappa\sum_{\gamma\neq\gamma'} x^L_\gamma x^L_{\gamma'}\Big]\bigg). \qquad \text{(S42)}
$$

Now define $\boldsymbol{c}(\kappa) = \sigma^2_w\big[(1-\kappa)\boldsymbol{I} + \kappa\boldsymbol{J}\big]$ and notice that

$$
[\boldsymbol{c}(\kappa)]^{-1} = \frac{1}{\sigma^2_w(1-\kappa)}\bigg[\boldsymbol{I} - \frac{\kappa}{M\kappa + (1-\kappa)}\boldsymbol{J}\bigg] \approx \frac{1}{\sigma^2_w(1-\kappa)}\bigg[\boldsymbol{I} - \frac{1}{M}\boldsymbol{J} + \frac{1-\kappa}{M^2\kappa}\boldsymbol{J}\bigg], \qquad \text{(S43)}
$$

$$
P(\boldsymbol{h}^L) = \lim_{\kappa\to 1} \frac{1}{\sqrt{(2\pi)^M|\boldsymbol{c}(\kappa)|}} \exp\bigg(-\frac{1}{2}(\boldsymbol{h}^L)^\top[\boldsymbol{c}(\kappa)]^{-1}\boldsymbol{h}^L\bigg)
$$