On the number of response regions of deep
feedforward networks with piecewise linear
activations
Razvan Pascanu
Université de Montréal
Montréal QC H3C 3J7 Canada
r.pascanu@gmail.com
Guido Montúfar
Max Planck Institute for Mathematics in the Sciences
Inselstraße 22, 04103 Leipzig, Germany
montufar@mis.mpg.de
Yoshua Bengio
Université de Montréal
Montréal QC H3C 3J7 Canada
yoshua.bengio@umontreal.ca
Abstract
This paper explores the complexity of deep feedforward networks with linear pre-
synaptic couplings and rectified linear activations. This is a contribution to the
growing body of work contrasting the representational power of deep and shallow
network architectures. In particular, we offer a framework for comparing deep
and shallow models that belong to the family of piecewise linear functions based
on computational geometry. We look at a deep rectifier multi-layer perceptron
(MLP) with linear output units and compare it with a single layer version of the
model. In the asymptotic regime, when the number of inputs stays constant, if
the shallow model has $kn$ hidden units and $n_0$ inputs, then the number of linear
regions is $O(k^{n_0} n^{n_0})$. For a $k$-layer model with $n$ hidden units on each layer it is
$\Omega(\lfloor n/n_0 \rfloor^{k-1} n^{n_0})$. The number $\lfloor n/n_0 \rfloor^{k-1}$ grows faster than $k^{n_0}$ when $n$ tends
to infinity or when $k$ tends to infinity and $n \ge 2 n_0$. Additionally, even when $k$ is
small, if we restrict $n$ to be $2 n_0$, we can show that a deep model has considerably
more linear regions than a shallow one. We consider this as a first step towards
understanding the complexity of these models and specifically towards providing
suitable mathematical tools for future analysis.
Keywords: Deep learning, artificial neural network, rectifier unit, hyperplane ar-
rangement, representational power
1 Introduction
Deep systems are believed to play an important role in information processing of intelligent agents.
A common hypothesis underlying this belief is that deep models can be exponentially more efficient
at representing some functions than their shallow counterparts (see Bengio, 2009).
The argument is usually a compositional one. Higher layers in a deep model can re-use primitives
constructed by the lower layers in order to build gradually more complex functions. For example,
on a vision task, one would hope that the first layer learns Gabor filters capable of detecting edges of
different orientation. These edges are then put together at the second layer to form part-of-object
shapes. On higher layers, these part-of-object shapes are combined further to obtain detectors for
more complex part-of-object shapes or objects. Such a behaviour is empirically illustrated, for
instance, in Zeiler and Fergus (2013); Lee et al. (2009). On the other hand, a shallow model has to
construct detectors of target objects based only on the detectors learnt by the first layer.
The representational power of computational systems with shallow and deep architectures has been
studied intensively. A well-known result of Hajnal et al. (1993) derived lower complexity bounds for
shallow threshold networks. Other works have explored the representational power of generative
models based on Boltzmann machines (Montúfar et al., 2011; Martens et al., 2013) and deep belief
networks (Sutskever and Hinton, 2008; Le Roux and Bengio, 2010; Montúfar and Ay, 2011), or have
compared mixtures and products of experts models (Montúfar and Morton, 2012).
In addition to such inspections, a wealth of evidence for the validity of this hypothesis comes from
deep models consistently outperforming shallow ones on a variety of tasks and datasets (see, e.g.,
Goodfellow et al., 2013; Hinton et al., 2012b,a). However, theoretical results on the representational
power of deep models are limited, usually due to the composition of nonlinear functions in deep
models, which makes mathematical analysis difficult. Up to now, theoretical results have focussed
on circuit operations (neural net unit computations) that are substantially different from those being
used in real state-of-the-art deep learning applications, such as logic gates (Håstad, 1986), linear +
threshold units with non-negative weights (Håstad and Goldmann, 1991) or polynomials (Bengio
and Delalleau, 2011). Bengio and Delalleau (2011) show that deep sum-product networks (Poon
and Domingos, 2011) can use exponentially fewer nodes than shallow ones to express some families
of polynomials.
The present note analyzes the representational power of deep MLPs with rectifier units. Rectifier
units (Glorot et al., 2011; Nair and Hinton, 2010) and piecewise linearly activated units in general
(like the maxout unit (Goodfellow et al., 2013)), are becoming popular choices in designing deep
models, and most current state-of-the-art results involve using one of these activations (Goodfellow
et al., 2013; Hinton et al., 2012b). Glorot et al. (2011) show that rectifier units have several properties
that make the optimization problem easier than the more traditional case using smooth and bounded
activations, such as tanh or sigmoid.
In this work we take advantage of the piecewise linear nature of the rectifier unit to mathematically
analyze the behaviour of deep rectifier MLPs. Given that the model is a composition of piecewise
linear functions, it is itself a piecewise linear function. We compare the flexibility of a deep model
with that of a shallow model by counting the number of linear regions they define over the input
space for a fixed number of hidden units. This is the number of pieces available to the model
in order to approximate some arbitrary nonlinear function. For example, if we want to perfectly
approximate some curved boundary between two classes, a rectifier MLP will have to use infinitely
many linear regions. In practice we have a finite number of pieces, and if we assume that we can
perfectly learn their optimal slopes, then the number of linear regions becomes a good proxy for
how well the model approximates this boundary. In this sense, the number of linear regions is an
upper bound for the flexibility of the model. In practice, the linear pieces are not independent and
the model may not be able to learn the right slope for each linear region. Specifically, for deep
models there is a correlation between regions, which results from the sharing of parameters between
the functions that describe the output on each region.
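As an illustration of this counting (an added sketch, not part of the original text), the following snippet estimates how many linear pieces a small random rectifier MLP exhibits by enumerating the distinct hidden-unit activation patterns it produces on a dense grid of 2-dimensional inputs. The architecture, grid range, resolution, and random seed are arbitrary choices made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small random rectifier MLP with 2 inputs and two hidden layers of width 4.
# (Weights and architecture are arbitrary; this only illustrates the counting idea.)
layers = []
fan_in = 2
for width in (4, 4):
    W = rng.standard_normal((width, fan_in))
    b = rng.standard_normal(width)
    layers.append((W, b))
    fan_in = width

def activation_pattern(x):
    """Return the on/off pattern of all hidden units for input x."""
    pattern = []
    h = x
    for W, b in layers:
        pre = W @ h + b
        pattern.extend(pre > 0)        # which rectifiers are active
        h = np.maximum(pre, 0.0)       # rectifier nonlinearity
    return tuple(pattern)

# Each distinct pattern corresponds to a convex set of inputs on which the network
# is affine; counting the patterns met by a fine grid gives a proxy for the number
# of response regions the function has on this square.
grid = np.linspace(-3.0, 3.0, 200)
patterns = {activation_pattern(np.array([x1, x2])) for x1 in grid for x2 in grid}
print("distinct activation patterns on the grid:", len(patterns))
```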
This is by no means a negative observation. If all the linear regions of the deep model were inde-
pendent of each other, by having many more linear regions, deep models would grossly overfit. The
correlation of the linear regions of a deep model results in its ability to generalize, by allowing it to
better represent only a small family of structured functions. These are functions that look compli-
cated (e.g., a distribution with a huge number of modes) but that have an underlying structure that
the network can ‘compress’ into its parameters. The number of regions, which indicates the number
of variations that the network can represent, provides a measure of how well it can fit this family of
structured functions (whose approximation potentially needs infinitely many linear regions).
We believe that this approach, based on counting the number of linear regions, is extensible to
any other piecewise linear activation function and also to other architectures, including the maxout
activation and the convolutional networks with rectifier activations.
We know the maximal number of regions of linearity of functions computable by a shallow model
with a fixed number of hidden units. This number is given by a well studied geometrical problem.
The main insight of the present work is to provide a geometrical construction that describes the
regions of linearity of functions computed by deep models. We show that in the asymptotic regime,
these functions have many more linear regions than the ones computed by shallow models, for the
same number of hidden units.
For the single layer case, each hidden unit divides the input space in two, whereby the boundary is
given by a hyperplane. For all input values on one side of the hyperplane, the unit outputs a positive
value. For all input values on the other side of the hyperplane, the unit outputs 0. Therefore, the
question that we are asking is: Into how many regions do $n$ hyperplanes split the space? This question
is studied in geometry under the name of hyperplane arrangements, with classic results such as
Zaslavsky’s theorem. Section 3 provides a quick introduction to the subject.
For the multilayer version of the model we rely on the following intuition. By using the rectifier
nonlinearity, we identify multiple regions of the input space which are mapped by a given layer into
an equivalent set of activations and thus represent equivalent inputs for the next layers. That is, a
hidden layer can perform a kind of OR operation by reacting similarly to several different inputs. Any
subsequent computation made on these activations is replicated on all equivalent inputs.
This paper is organized as follows. In Section 2 we provide definitions and basic observations about
piecewise linear functions. In Section 3 we discuss rectifier networks with one single hidden layer
and describe their properties in terms of hyperplane arrangements which are fairly well known in
the literature. In Section 4 we discuss deep rectifier networks and prove our main result, Theorem 1,
which describes their complexity in terms of the number of regions of linearity of functions that they
represent. Details about the asymptotic behaviour of the results derived in Sections 3 and 4 are given
in the Appendix A. In Section 5 we analyze a special type of deep rectifier MLP and show that even
for a small number of hidden layers it can generate a large number of linear regions. In Section 6
we offer a discussion of the results.
2 Preliminaries
We consider classes of functions (models) defined in the following way.
Definition 1. A rectifier feedforward network is a layered feedforward network, or multilayer perceptron (MLP), as shown in Fig. 1, with the following properties. Each hidden unit receives as inputs the real valued activations $x_1, \ldots, x_n$ of all units in the previous layer, computes the weighted sum
$$s = \sum_{i \in [n]} w_i x_i + b,$$
and outputs the rectified value
$$\mathrm{rect}(s) = \max\{0, s\}.$$
The real parameters $w_1, \ldots, w_n$ are the input weights and $b$ is the bias of the unit. The output layer is a linear layer, that is, the units in the last layer compute a linear combination of their inputs and output it unrectified.
Given a vector of naturals $\mathbf{n} = (n_0, n_1, \ldots, n_L)$, we denote by $\mathcal{F}_{\mathbf{n}}$ the set of all functions $\mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$ that can be computed by a rectifier feedforward network with $n_0$ inputs and $n_l$ units in layer $l$ for $l \in [L]$. The elements of $\mathcal{F}_{\mathbf{n}}$ are continuous piecewise linear functions.
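For concreteness, here is a minimal numpy sketch of the model in Definition 1: every hidden layer applies $\mathrm{rect}(Wx + b)$ elementwise and the output layer is linear. The function name, the example widths $\mathbf{n} = (3, 5, 5, 2)$, and the random weights are illustrative placeholders, not part of the original text.

```python
import numpy as np

def rectifier_network_forward(x, weights, biases):
    """Forward pass of a rectifier feedforward network (Definition 1).

    weights[l], biases[l] parameterize layer l; every layer but the last
    applies the rectifier rect(s) = max(0, s), and the last layer is linear.
    """
    h = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        s = W @ h + b
        h = s if l == len(weights) - 1 else np.maximum(s, 0.0)
    return h

# Example with n = (n0, n1, n2, nL) = (3, 5, 5, 2): two hidden layers, linear output.
rng = np.random.default_rng(1)
n = (3, 5, 5, 2)
weights = [rng.standard_normal((n[l + 1], n[l])) for l in range(len(n) - 1)]
biases = [rng.standard_normal(n[l + 1]) for l in range(len(n) - 1)]
print(rectifier_network_forward(rng.standard_normal(n[0]), weights, biases))
```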
We denote by $R(\mathbf{n})$ the maximum of the number of regions of linearity, or response regions, over all functions from $\mathcal{F}_{\mathbf{n}}$. For clarity, given a function $f\colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$, a connected open subset $R \subseteq \mathbb{R}^{n_0}$ is called a region of linearity, linear region, or response region of $f$ if the restriction $f|_R$ is a linear function and for any open set $\tilde{R} \supsetneq R$ the restriction $f|_{\tilde{R}}$ is not a linear function. In the next sections we will compute bounds on $R(\mathbf{n})$ for different choices of $\mathbf{n}$. We are especially interested in the comparison of shallow networks with one single very wide hidden layer and deep networks with many narrow hidden layers.
In the remainder of this section we state three simple lemmas.
The next lemma states that a piecewise linear function $f = (f_i)_{i \in [k]}$ has as many regions of linearity as there are distinct intersections of regions of linearity of the coordinates $f_i$.
Lemma 1. Consider a width $k$ layer of rectifier units. Let $\mathcal{R}^i = \{R^i_1, \ldots, R^i_{N_i}\}$ be the regions of linearity of the function $f_i\colon \mathbb{R}^{n_0} \to \mathbb{R}$ computed by the $i$-th unit, for all $i \in [k]$. Then the regions of linearity of the function $f = (f_i)_{i \in [k]}\colon \mathbb{R}^{n_0} \to \mathbb{R}^k$ computed by the rectifier layer are the elements of the set $\{R_{j_1, \ldots, j_k} = R^1_{j_1} \cap \cdots \cap R^k_{j_k}\}_{(j_1, \ldots, j_k) \in [N_1] \times \cdots \times [N_k]}$.
Figure 1: Illustration of a rectifier feedforward network with two hidden layers, $h^{(1)} = \mathrm{rect}(W^{(1)} x + b^{(1)})$ and $h^{(2)} = \mathrm{rect}(W^{(2)} h^{(1)} + b^{(2)})$, followed by a linear output $W^{(out)} h^{(2)}$.
Proof. A function $f = (f_1, \ldots, f_k)\colon \mathbb{R}^n \to \mathbb{R}^k$ is linear iff all its coordinates $f_1, \ldots, f_k$ are.
In regard to the number of regions of linearity of the functions represented by rectifier networks,
the number of output dimensions, i.e., the number of linear output units, is irrelevant. This is the
statement of the next lemma.
Lemma 2. The number of (linear) output units of a rectifier feedforward network does not affect the
maximal number of regions of linearity that it can realize.
Proof. Let $f\colon \mathbb{R}^{n_0} \to \mathbb{R}^k$ be the map of inputs to activations in the last hidden layer of a deep feedforward rectifier model. Let $h = g \circ f$ be the map of inputs to activations of the output units, given by composition of $f$ with the linear output layer, $h(x) = W^{(out)} f(x) + b^{(out)}$. If the row span of $W^{(out)}$ is not orthogonal to any difference of gradients of neighbouring regions of linearity of $f$, then $g$ captures all gradient discontinuities of $f$. In this case both functions $f$ and $h$ have the same number of regions of linearity.
If the number of regions of $f$ is finite, then the number of differences of gradients is finite and there is a vector outside the union of their orthogonal spaces. Hence a matrix with a single row (a single output unit) suffices to capture all transitions between different regions of linearity of $f$.
Lemma 3. A layer of $n_1$ rectifier units with $n_0$ inputs can compute any function that can be computed by the composition of a linear layer with $n_0$ inputs and $n_0'$ outputs and a rectifier layer with $n_0'$ inputs and $n_1$ outputs, for any $n_0, n_0', n_1 \in \mathbb{N}$.
Proof. A rectifier layer computes functions of the form $x \mapsto \mathrm{rect}(W x + b)$, with $W \in \mathbb{R}^{n_1 \times n_0}$ and $b \in \mathbb{R}^{n_1}$. The argument $W x + b$ is an affine function of $x$. The claim follows from the fact that any composition of affine functions is an affine function.
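A quick numerical illustration of Lemma 3 (an added sketch, not part of the original proof): composing a linear layer $x \mapsto Ux + c$ with a rectifier layer $y \mapsto \mathrm{rect}(Vy + d)$ equals a single rectifier layer with input weights $VU$ and bias $Vc + d$. The dimensions and random values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n0, n0_prime, n1 = 3, 4, 5

U = rng.standard_normal((n0_prime, n0)); c = rng.standard_normal(n0_prime)  # linear layer
V = rng.standard_normal((n1, n0_prime)); d = rng.standard_normal(n1)        # rectifier layer

x = rng.standard_normal(n0)
composed = np.maximum(V @ (U @ x + c) + d, 0.0)            # rectifier applied after the affine map
single_layer = np.maximum((V @ U) @ x + (V @ c + d), 0.0)  # one rectifier layer: W = VU, b = Vc + d
assert np.allclose(composed, single_layer)
```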
3 One hidden layer
Let us look at the number of response regions of a single hidden layer MLP with $n_0$ input units and $n$ hidden units. We first formulate the rectifier unit as follows:
$$\mathrm{rect}(s) = I(s) \cdot s, \qquad (1)$$
where $I$ is the indicator function defined as
$$I(s) = \begin{cases} 1, & \text{if } s > 0 \\ 0, & \text{otherwise.} \end{cases} \qquad (2)$$
We can now write the single hidden layer MLP with $n_y$ outputs as the function $f\colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_y}$;
$$f(x) = W^{(out)} \operatorname{diag}\!\left( I\!\left(W^{(1)}_{1,:} x + b^{(1)}_1\right), \ldots, I\!\left(W^{(1)}_{n_1,:} x + b^{(1)}_{n_1}\right) \right) \left( W^{(1)} x + b^{(1)} \right) + b^{(out)}. \qquad (3)$$
From this formulation it is clear that each unit $i$ in the hidden layer has two operational modes. One is when the unit takes the value $0$ and one when it takes a non-zero value. The boundary between these two operational modes is given by the hyperplane $H_i$ consisting of all inputs $x \in \mathbb{R}^{n_0}$ with $W^{(1)}_{i,:} x + b^{(1)}_i = 0$. Below this hyperplane, the activation of the unit is constant equal to zero, and above, it is linear with gradient equal to $W^{(1)}_{i,:}$. It follows that the number of regions of linearity of a single layer MLP is equal to the number of regions formed by the set of hyperplanes $\{H_i\}_{i \in [n_1]}$.
A finite set of hyperplanes in a common $n_0$-dimensional Euclidean space is called an $n_0$-dimensional hyperplane arrangement. A region of an arrangement $\mathcal{A} = \{H_i \subset \mathbb{R}^{n_0}\}_{i \in [n]}$ is a connected component of the complement of the union of the hyperplanes, i.e., a connected component of $\mathbb{R}^{n_0} \setminus (\cup_{i \in [n]} H_i)$. To make this clearer, consider an arrangement $\mathcal{A}$ consisting of hyperplanes $H_i = \{x \in \mathbb{R}^{n_0} : W_{i,:} x + b_i = 0\}$ for all $i \in [n]$, for some $W \in \mathbb{R}^{n \times n_0}$ and some $b \in \mathbb{R}^n$. A region of $\mathcal{A}$ is a set of points of the form $R = \{x \in \mathbb{R}^{n_0} : \operatorname{sgn}(W x + b) = s\}$ for some sign vector $s \in \{-, +\}^n$.
A region of an arrangement is relatively bounded if its intersection with the space spanned by the normals of the hyperplanes is bounded. We denote by $r(\mathcal{A})$ the number of regions and by $b(\mathcal{A})$ the number of relatively bounded regions of an arrangement $\mathcal{A}$. The essentialization of an arrangement $\mathcal{A} = \{H_i\}_i$ is the arrangement consisting of the hyperplanes $H_i \cap N$ for all $i$, defined in the span $N$ of the normals of the hyperplanes $H_i$. For example, the essentialization of an arrangement of two non-parallel planes in $\mathbb{R}^3$ is an arrangement of two lines in a plane.
Problem 1. How many regions are generated by an arrangement of $n$ hyperplanes in $\mathbb{R}^{n_0}$?
The general answer to Problem 1 is given by Zaslavsky’s theorem (Zaslavsky, 1975, Theorem A),
which is one of the central results from the theory of hyperplane arrangements.
We will only need the special case of hyperplanes in general position, which realize the maximal possible number of regions. Formally, an $n$-dimensional arrangement $\mathcal{A}$ is in general position if for any subset $\{H_1, \ldots, H_p\} \subseteq \mathcal{A}$ the following holds. (1) If $p \le n$, then $\dim(H_1 \cap \cdots \cap H_p) = n - p$. (2) If $p > n$, then $H_1 \cap \cdots \cap H_p = \emptyset$. An arrangement is in general position if the weights $W$, $b$ defining its hyperplanes are generic. This means that any arrangement can be perturbed by an arbitrarily small perturbation in such a way that the resulting arrangement is in general position.
For arrangements in general position, Zaslavsky's theorem can be stated in the following way (see Stanley, 2004, Proposition 2.4).
Proposition 1. Let $\mathcal{A}$ be an arrangement of $m$ hyperplanes in general position in $\mathbb{R}^{n_0}$. Then
$$r(\mathcal{A}) = \sum_{s=0}^{n_0} \binom{m}{s}, \qquad b(\mathcal{A}) = \binom{m-1}{n_0}.$$
In particular, the number of regions of a 2-dimensional arrangement $\mathcal{A}_m$ of $m$ lines in general position is equal to
$$r(\mathcal{A}_m) = \binom{m}{2} + m + 1. \qquad (4)$$
Figure 2: Induction step of the hyperplane sweep method for counting the regions of line arrangements in the plane.
For the purpose of illustration, we sketch a proof of eq. (4) using the sweep hyperplane method. We proceed by induction over the number of lines $m$.
Base case $m = 0$. It is obvious that in this case there is a single region, corresponding to the entire plane. Therefore, $r(\mathcal{A}_0) = 1$.
Induction step. Assume that for $m$ lines the number of regions is $r(\mathcal{A}_m) = \binom{m}{2} + m + 1$, and add a new line $L_{m+1}$ to the arrangement. Since we assumed the lines are in general position, $L_{m+1}$ intersects each of the existing lines $L_k$ at a different point. Fig. 2 depicts the situation for $m = 2$. The $m$ intersection points split the line $L_{m+1}$ into $m + 1$ segments. Each of these segments cuts a region of $\mathcal{A}_m$ in two pieces. Therefore, by adding the line $L_{m+1}$ we get $m + 1$ new regions. In Fig. 2 the two intersection points result in three segments that split each of the regions $R_1$, $R_{12}$, $R_2$ in two. Hence
$$r(\mathcal{A}_{m+1}) = r(\mathcal{A}_m) + m + 1 = \frac{m(m-1)}{2} + m + 1 + m + 1 = \frac{m(m+1)}{2} + (m+1) + 1 = \binom{m+1}{2} + (m+1) + 1.$$
For the number of response regions of MLPs with one single hidden layer we obtain the following.
Proposition 2. The regions of linearity of a function in the model $\mathcal{F}_{(n_0, n_1, 1)}$ with $n_0$ inputs and $n_1$ hidden units are given by the regions of an arrangement of $n_1$ hyperplanes in $n_0$-dimensional space. The maximal number of regions of such an arrangement is $R(n_0, n_1, n_y) = \sum_{j=0}^{n_0} \binom{n_1}{j}$.
Proof. This is a consequence of Lemma 1. The maximal number of regions is produced by an $n_0$-dimensional arrangement of $n_1$ hyperplanes in general position, which is given in Proposition 1.
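The count in Proposition 2 is easy to evaluate exactly; the sketch below (an illustrative addition, with a hypothetical helper name) tabulates it for a few widths in the 2-dimensional input case.

```python
from math import comb

def max_regions_shallow(n0, n1):
    """Maximal number of response regions of a single-hidden-layer rectifier MLP
    with n0 inputs and n1 hidden units (Proposition 2 / Zaslavsky's theorem)."""
    return sum(comb(n1, j) for j in range(n0 + 1))

# For n0 = 2 this recovers the line-arrangement formula C(m, 2) + m + 1.
for n1 in (2, 4, 8, 12):
    print(n1, max_regions_shallow(2, n1))   # prints: 2 4, 4 11, 8 37, 12 79
```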
4 Multiple hidden layers
In order to show that a $k$ hidden layer model can be more expressive than a single hidden layer one with the same number of hidden units, we will need the next three propositions.
Proposition 3. Any arrangement can be scaled down and shifted such that all regions of the arrangement intersect the unit ball.
Proof. Let $\mathcal{A}$ be an arrangement and let $S$ be a ball of radius $r$ and center $c$. Let $d$ be the supremum of the distance from the origin to a point in a bounded region of the essentialization of the arrangement $\mathcal{A}$. Consider the map $\phi\colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_0}$ defined by $\phi(x) = \frac{r}{2d} \cdot x + c$. Then $\mathcal{A}' = \phi(\mathcal{A})$ is an arrangement satisfying the claim. It is easy to see that any point with norm bounded by $d$ is mapped to a point inside the ball $S$.
Figure 3: An arrangement $\mathcal{A}$ and a scaled-shifted version $\mathcal{A}'$ whose regions intersect the ball $S$.
The proposition is illustrated in Fig. 3.
We need some additional notation in order to formulate the next proposition. Given a hyperplane $H = \{x : w^\top x + b = 0\}$, we consider the region $H^- = \{x : w^\top x + b < 0\}$ and the region $H^+ = \{x : w^\top x + b \ge 0\}$. If we think about the corresponding rectifier unit, then $H^+$ is the region where the unit is active and $H^-$ is the region where the unit is dead.
Let $R$ be a region delimited by the hyperplanes $\{H_1, \ldots, H_n\}$. We denote by $R^+ \subseteq \{1, \ldots, n\}$ the set of all hyperplane indices $j$ with $R \subseteq H^+_j$. In other words, $R^+$ is the list of hidden units that are active (non-zero) in the input-space region $R$.
The following proposition describes the combinatorics of 2-dimensional arrangements in general
position. More precisely, the proposition describes the combinatorics of n-dimensional arrange-
ments with 2-dimensional essentialization in general position. Recall that the essentialization of
an arrangement is the arrangement that it defines in the subspace spanned by the normals of its
hyperplanes.
The proposition guarantees the existence of input weights and bias for a rectifier layer such that for
any list of consecutive units, there is a region of inputs for which exactly the units from that list are
active.
Proposition 4. For any $n_0, n \in \mathbb{N}$, $n \ge 2$, there exists an $n_0$-dimensional arrangement $\mathcal{A}$ of $n$ hyperplanes such that for any pair $a, b \in \{1, \ldots, n\}$ with $a < b$, there is a region $R$ of $\mathcal{A}$ with $R^+ = \{a, a+1, \ldots, b\}$.
We show that the hyperplanes of a 2-dimensional arrangement in general position can be indexed in
such a way that the claim of the proposition holds. For higher dimensional arrangements the state-
ment follows trivially, applying the 2-dimensional statement to the intersection of the arrangement
with a 2-subspace.
Proof of Proposition 4. Consider first the case $n_0 = 2$. We define the first line $L_1$ of the arrangement to be the x-axis of the standard coordinate system. To define the second line $L_2$, we consider a circle $S_1$ of radius $r \in \mathbb{R}_+$ centered at the origin. We define $L_2$ to be the tangent of $S_1$ at an angle $\alpha_1$ to the y-axis, where $0 < \alpha_1 < \frac{\pi}{2}$. The top left panel of Fig. 4 depicts the situation. In the figure, $R$ corresponds to inputs for which no rectifier unit is active, $R_1$ corresponds to inputs where the first unit is active, $R_2$ to inputs where the second unit is active, and $R_{12}$ to inputs where both units are active. This arrangement has the claimed properties.
Now assume that there is an arrangement of $n$ lines with the claimed properties. To add an $(n+1)$-th line, we first consider the maximal distance $d_{\max}$ from the origin to an intersection of two lines $L_i \cap L_j$ with $1 \le i < j \le n$. We also consider the radius-$(d_{\max} + r)$ circle $S_n$ centered at the origin. The circle $S_n$ contains all intersections of the first $n$ lines. We now choose an angle $\alpha_n$ with $0 < \alpha_n < \alpha_{n-1}$ and define $L_{n+1}$ as the tangent of $S_n$ that forms an angle $\alpha_n$ with the y-axis. Fig. 4 depicts adding the third and fourth line to the arrangement.
After adding line $L_{n+1}$, we have that the arrangement
1. is in general position.
2. has regions $R'_1, \ldots, R'_{n+1}$ with $R'^+_i = \{i, i+1, \ldots, n+1\}$ for all $i \in [n+1]$.
Figure 4: Illustration of the hyperplane arrangement discussed in Proposition 4, in the 2-dimensional case. On the left we have arrangements of two and three lines, and on the right an arrangement of four lines.
The regions of the arrangement are stable under perturbation of the angles and radii used to define the lines. Any slight perturbation of these parameters preserves the list of regions. Therefore, the arrangement is in general position.
The second property comes from the order in which $L_{n+1}$ intersects all previous lines. $L_{n+1}$ intersects the lines in the order in which they were added to the arrangement: $L_1, L_2, \ldots, L_n$. The intersection of $L_{n+1}$ and $L_i$, $B_{i,n+1} = L_{n+1} \cap L_i$, is above the lines $L_{i+1}, L_{i+2}, \ldots, L_n$, and hence the segment $B_{i-1,n+1} B_{i,n+1}$ between the intersections with $L_{i-1}$ and with $L_i$ has to cut the region in which only units $i$ to $n$ are active.
The intersection order is ensured by the choice of angles $\alpha_i$ and the fact that the lines are tangent to the circles $S_i$. For any $i < j$ and $B_{ij} = L_i \cap L_j$, let $T_{ij}$ be the line parallel to the y-axis passing through $B_{ij}$. Each line $T_{ij}$ divides the space in two. Let $H_{ij}$ be the half-space to the right of $T_{ij}$. Within any half-space $H_{ij}$, the intersection $H_{ij} \cap L_i$ is above $H_{ij} \cap L_j$, because the angle $\alpha_{i-1}$ of $L_i$ with the y-axis is larger than $\alpha_{j-1}$ (this means $L_j$ has a steeper decrease). Since $L_{n+1}$ is tangent to the circle that contains all points $B_{ij}$, the line $L_{n+1}$ will intersect the lines $L_i$ and $L_j$ in $H_{ij}$, and therefore it has to intersect $L_i$ first.
For $n_0 > 2$ we can consider an arrangement that is essentially 2-dimensional and has the properties of the arrangement described above. To do this, we construct a 2-dimensional arrangement in a 2-subspace of $\mathbb{R}^{n_0}$ and then extend each of the lines $L_i$ of the arrangement to a hyperplane $H_i$ that crosses $L_i$ orthogonally. The resulting arrangement satisfies all claims of the proposition.
The next proposition guarantees the existence of a collection of affine maps with shared bias, which map a collection of regions to a common output.
Proposition 5. Consider two integers $n_0$ and $p$. Let $S$ denote the $n_0$-dimensional unit ball and let $R_1, \ldots, R_p \subseteq \mathbb{R}^{n_0}$ be some regions with non-empty interiors. Then there is a choice of weights $c \in \mathbb{R}^{n_0}$ and $U_1, \ldots, U_p \in \mathbb{R}^{n_0 \times n_0}$ for which $g_i(R_i) \supseteq S$ for all $i \in [p]$, where $g_i\colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_0}$; $y \mapsto U_i y + c$.
Figure 5: Illustration of Example 1. The units represented by squares build an intermediary layer of linear units between the first and the second hidden layers; the intermediary layer computes $g(h^{(1)})$ and the second hidden layer computes $\mathrm{rect}(V^{(2)} g(h^{(1)}) + d^{(2)})$. The computation of such an intermediary linear layer can be absorbed into the second hidden layer of rectifier units (Lemma 3). The connectivity map depicts the map $g_1$ by dashed arrows and $g_2$ by dashed-dotted arrows.
Proof. To see this, consider the following construction. For each region $R_i$ consider a ball $S_i \subseteq R_i$ of radius $r_i \in \mathbb{R}_+$ and center $s_i = (s_{i1}, \ldots, s_{i n_0}) \in \mathbb{R}^{n_0}$. For each $j = 1, \ldots, n_0$, consider $p$ positive numbers $u_{1j}, \ldots, u_{pj}$ such that $u_{ij} s_{ij} = u_{kj} s_{kj}$ for all $1 \le k < i \le p$. This can be done by fixing $u_{1j}$ equal to $1$ and solving the equation for all other numbers. Let $\eta \in \mathbb{R}$ be such that $r_i \eta u_{ij} > 1$ for any $j$ and $i$. Scaling each region $R_i$ by $U_i = \operatorname{diag}(\eta u_{i1}, \ldots, \eta u_{i n_0})$ transforms the center of $S_i$ to the same point for all $i$. By the choice of $\eta$, the minor radius of all transformed balls is larger than $1$.
We can now set $c$ to be minus the common center of the scaled balls, to obtain the map
$$g_i(x) = \operatorname{diag}(\eta u_{i1}, \ldots, \eta u_{i n_0})\, x - \operatorname{diag}(\eta u_{11}, \ldots, \eta u_{1 n_0})\, s_1, \quad \text{for all } 1 \le i \le p.$$
These $g_i$ satisfy the claimed property, namely that $g_i(R_i)$ contains the unit ball, for all $i$.
Before proceeding, we discuss an example illustrating how the previous propositions and lemmas
are put together to prove our main result below, in Theorem 1.
Example 1. Consider a rectifier MLP with $n_0 = 2$, such that the input space is $\mathbb{R}^2$, and assume that the network has only two hidden layers, each consisting of $n = 2 n_0$ units. Each unit in the first hidden layer defines a hyperplane in $\mathbb{R}^2$, namely the hyperplane that separates the inputs for which it is active from the inputs for which it is not active. Hence the first hidden layer defines an arrangement of $n$ hyperplanes in $\mathbb{R}^2$. By Proposition 4, this arrangement can be made such that it delimits regions of inputs $R_1, \ldots, R_{n_0} \subseteq \mathbb{R}^2$ with the following property. For each input in any given one of these regions, exactly one pair of units in the first hidden layer is active, and, furthermore, the pairs of units that are active on different regions are disjoint.
By the definition of rectifier units, each hidden unit computes a linear function within the half-space of inputs where it is active. In turn, the image of $R_i$ by the pair of units that is active in $R_i$ is a polyhedron in $\mathbb{R}^2$. For each region $R_i$, denote the corresponding polyhedron by $S_i$.
Recall that a rectifier layer computes a map of the form $f\colon \mathbb{R}^n \to \mathbb{R}^m$; $x \mapsto \mathrm{rect}(W x + b)$. Hence a rectifier layer with $n$ inputs and $m$ outputs can compute any composition $f' \circ g$ of an affine map $g\colon \mathbb{R}^n \to \mathbb{R}^k$ and a map $f'$ computed by a rectifier layer with $k$ inputs and $m$ outputs (Lemma 3).
Consider the map computed by the rectifier units in the second hidden layer, i.e., the map that takes activations from the first hidden layer and outputs activations from the second hidden layer. We think of this map as a composition $f' \circ g$ of an affine map $g\colon \mathbb{R}^n \to \mathbb{R}^2$ and a map $f'$ computed by a rectifier layer with $2$ inputs. The map $g$ can be interpreted as an intermediary layer consisting of two linear units, as illustrated in Fig. 5.
Figure 6: Constructing $\left\lfloor \frac{n_1}{n_0} \right\rfloor \sum_{k=0}^{n_0} \binom{n_2}{k}$ response regions in a model with two layers.
Within each input region $R_i$, only two units in the first hidden layer are active. Therefore, for each input region $R_i$, the output of the intermediary layer is an affine transformation of $S_i$. Furthermore, the weights of the intermediary layer can be chosen in such a way that the image of each $R_i$ contains the unit ball.
Now, $f'$ is the map computed by a rectifier layer with $2$ inputs and $n$ outputs. It is possible to define this map in such a way that it has $R$ regions of linearity within the unit ball, where $R$ is the number of regions of a 2-dimensional arrangement of $n$ hyperplanes in general position.
We see that the entire network computes a function which has $R$ regions of linearity within each one of the input regions $R_1, \ldots, R_{n_0}$. Each input region $R_i$ is mapped by the concatenation of the first and intermediate (notional) layer to a subset of $\mathbb{R}^2$ which contains the unit ball. Then, the second layer computes a function which partitions the unit ball into many pieces. The partition computed by the second layer gets replicated in each of the input regions $R_i$, resulting in a subdivision of the input space in exponentially many pieces (exponential in the number of network layers).
Now we are ready to state our main result on the number of response regions of rectifier deep feedforward networks:
Theorem 1. A model with $n_0$ inputs and $k$ hidden layers of widths $n_1, n_2, \ldots, n_k$ can divide the input space into
$$\prod_{i=1}^{k-1} \left\lfloor \frac{n_i}{n_0} \right\rfloor \sum_{i=0}^{n_0} \binom{n_k}{i}$$
or possibly more regions.
Proof of Theorem 1. Let the first hidden layer define an arrangement like the one from Proposition 4. Then there are $p = \left\lfloor \frac{n_1}{n_0} \right\rfloor$ input-space regions $R_i \subseteq \mathbb{R}^{n_0}$, $i \in [p]$, with the following property. For each input vector from the region $R_i$, exactly $n_0$ units from the first hidden layer are active. We denote this set of units by $I_i$. Furthermore, by Proposition 4, for inputs in distinct regions $R_i$, the corresponding sets of active units are disjoint; that is, $I_i \cap I_j = \emptyset$ for all $i, j \in [p]$, $i \ne j$.
To be more specific, for input vectors from $R_1$, exactly the first $n_0$ units of the first hidden layer are active, that is, for these input vectors the value of $h^{(1)}_j$ is non-zero if and only if $j \in I_1 = \{1, \ldots, n_0\}$. For input vectors from $R_2$, only the next $n_0$ units of the first hidden layer are active, that is, the units with index in $I_2 = \{n_0 + 1, \ldots, 2 n_0\}$, and so on.
Now we consider a 'fictitious' intermediary layer consisting of $n_0$ linear units between the first and second hidden layers. As this intermediary layer computes an affine function, it can be absorbed into the second hidden layer (see Lemma 3). We use it only for making the next arguments clearer.
The map taking activations from the first hidden layer to activations from the second hidden layer is $\mathrm{rect}(W^{(2)} x + b^{(2)})$, where $W^{(2)} \in \mathbb{R}^{n_2 \times n_1}$, $b^{(2)} \in \mathbb{R}^{n_2}$.
We can write the input and bias weight matrices as $W^{(2)} = V^{(2)} U^{(2)}$ and $b^{(2)} = d^{(2)} + V^{(2)} c^{(2)}$, where $U^{(2)} \in \mathbb{R}^{n_0 \times n_1}$, $c^{(2)} \in \mathbb{R}^{n_0}$, and $V^{(2)} \in \mathbb{R}^{n_2 \times n_0}$, $d^{(2)} \in \mathbb{R}^{n_2}$.
The weights $U^{(2)}$ and $c^{(2)}$ describe the affine function computed by the intermediary layer, $x \mapsto U^{(2)} x + c^{(2)}$. The weights $V^{(2)}$ and $d^{(2)}$ are the input and bias weights of the rectifier layer following the intermediary layer.
We now consider the sub-matrix $U^{(2)}_i$ of $U^{(2)}$ consisting of the columns of $U^{(2)}$ with indices in $I_i$, for all $i \in [p]$. Then $U^{(2)} = \big[ U^{(2)}_1 \mid \cdots \mid U^{(2)}_p \mid \tilde{U}^{(2)} \big]$, where $\tilde{U}^{(2)}$ is the sub-matrix of $U^{(2)}$ consisting of its last $n_1 - p\, n_0$ columns. In the sequel we set all entries of $\tilde{U}^{(2)}$ equal to zero.
The map $g\colon \mathbb{R}^{n_1} \to \mathbb{R}^{n_0}$; $g(x) = U^{(2)} x + c^{(2)}$ is thus written as the sum $g = \sum_{i \in [p]} g_i + c^{(2)}$, where $g_i\colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_0}$; $g_i(x) = U^{(2)}_i x$, for all $i \in [p]$.
Let $S_i$ be the image of the input-space region $R_i$ by the first hidden layer. By Proposition 5, there is a choice of the weights $U^{(2)}_i$ and bias $c^{(2)}$ such that the image of $S_i$ by $x \mapsto U^{(2)}_i x + c^{(2)}$ contains the $n_0$-dimensional unit ball. Now, for all input vectors from $R_i$, only the units $I_i$ of the first hidden layer are active. Therefore, $g|_{R_i} = g_i|_{R_i} + c^{(2)}$. This implies that the image $g(R_i)$ of the input-space region $R_i$ by the intermediary layer contains the unit ball, for all $i \in [p]$.
We can now choose $V^{(2)}$ and $d^{(2)}$ in such a way that the rectifier function $\mathbb{R}^{n_0} \to \mathbb{R}^{n_2}$; $y \mapsto \mathrm{rect}(V^{(2)} y + d^{(2)})$ defines an arrangement $\mathcal{A}$ of $n_2$ hyperplanes with the property that each region of $\mathcal{A}$ intersects the unit ball in an open neighborhood.
In consequence, the map from input space to activations of the second hidden layer has $r(\mathcal{A})$ regions of linearity within each input-space region $R_i$. Fig. 6 illustrates the situation. All inputs that are mapped to the same activation of the first hidden layer are treated as equivalent by the subsequent layers. In this sense, an arrangement $\mathcal{A}$ defined on the set of common outputs of $R_1, \ldots, R_p$ at the first hidden layer is 'replicated' in each input region $R_1, \ldots, R_p$.
The subsequent layers of the network can be analyzed in a similar way as done above for the first two layers. In particular, the weights $V^{(2)}$ and $d^{(2)}$ can be chosen in such a way that they define an arrangement with the properties from Proposition 4. Then, the map taking activations from the second hidden layer to activations from the third hidden layer can be analyzed by considering again a fictitious intermediary layer between the second and third layers, and so forth, as done above.
For the last hidden layer we choose the input weights $V^{(k)}$ and bias $d^{(k)}$ defining an $n_0$-dimensional arrangement of $n_k$ hyperplanes in general position. The map of inputs to activations of the last hidden layer thus has
$$\prod_{i=1}^{k-1} \left\lfloor \frac{n_i}{n_0} \right\rfloor \sum_{i=0}^{n_0} \binom{n_k}{i}$$
regions of linearity. This number is a lower bound on the maximal number of regions of linearity of functions computable by the network. This completes the proof. The intuition of the construction is illustrated in Fig. 7.
In the Appendix A we derive an asymptotic expansion of the bound given in Theorem 1.
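To make the comparison concrete, the following sketch (an illustrative addition; the helper names and the particular choice $n_0 = 2$, $n = 8$, $k = 4$ are ours) evaluates the lower bound of Theorem 1 for a deep network with $k$ layers of width $n$ against the exact maximum of Proposition 2 for a shallow network with the same total number of hidden units, $kn$.

```python
from math import comb, floor

def shallow_max_regions(n0, n_hidden):
    # Proposition 2: regions of n_hidden hyperplanes in general position in R^{n0}.
    return sum(comb(n_hidden, j) for j in range(n0 + 1))

def deep_lower_bound(n0, widths):
    # Theorem 1: prod_{i < k} floor(n_i / n0) * sum_{i <= n0} C(n_k, i).
    bound = 1
    for n_i in widths[:-1]:
        bound *= floor(n_i / n0)
    return bound * sum(comb(widths[-1], i) for i in range(n0 + 1))

n0, n, k = 2, 8, 4
print("shallow, kn units:      ", shallow_max_regions(n0, k * n))   # 529
print("deep bound, k layers of n:", deep_lower_bound(n0, [n] * k))  # 4^3 * 37 = 2368
```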
5 A special class of deep models
In this section we consider deep rectifier models with $n_0$ input units and hidden layers of width $n = 2 n_0$. This restriction allows us to construct a very efficient deep model in terms of the number of response regions. The analysis that we provide in this section complements the results from the previous section, showing that rectifier MLPs can compute functions with many response regions even when defined with relatively few hidden layers.
Figure 7: Constructing $\left\lfloor \frac{n_2}{n_0} \right\rfloor \left\lfloor \frac{n_1}{n_0} \right\rfloor \sum_{k=0}^{n_0} \binom{n_3}{k}$ response regions in a model with three layers.
Example 2. Let us assume we have a 2-dimensional input, i.e., $n_0 = 2$, and a layer of $n = 4$ rectifiers $f_1, f_2, f_3$, and $f_4$, followed by a linear projection. We construct the rectifier layer in such a way that it divides the input space into four 'square' cones, each of them corresponding to the inputs where two of the rectifier units are active. We define the four rectifiers as:
$$f_1(x) = \max\{0, [1, 0]^\top x\}, \quad
f_2(x) = \max\{0, [-1, 0]^\top x\}, \quad
f_3(x) = \max\{0, [0, 1]^\top x\}, \quad
f_4(x) = \max\{0, [0, -1]^\top x\},$$
where $x = [x_1, x_2]^\top \in \mathbb{R}^{n_0}$. By adding pairs of coordinates of $f = [f_1, f_2, f_3, f_4]^\top$, we can effectively mimic a layer consisting of two absolute-value units $g_1$ and $g_2$:
$$\begin{bmatrix} g_1(x) \\ g_2(x) \end{bmatrix}
= \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}
\begin{bmatrix} f_1(x) \\ f_2(x) \\ f_3(x) \\ f_4(x) \end{bmatrix}
= \begin{bmatrix} \operatorname{abs}(x_1) \\ \operatorname{abs}(x_2) \end{bmatrix}. \qquad (5)$$
The absolute-value unit $g_i$ divides the input space along the $i$-th coordinate axis, taking values which are symmetric about that axis. The combination of $g_1$ and $g_2$ is then a function with four regions of linearity:
$$S_1 = \{(x_1, x_2) \mid x_1 \ge 0,\; x_2 \ge 0\}, \quad
S_2 = \{(x_1, x_2) \mid x_1 \ge 0,\; x_2 < 0\}, \quad
S_3 = \{(x_1, x_2) \mid x_1 < 0,\; x_2 \ge 0\}, \quad
S_4 = \{(x_1, x_2) \mid x_1 < 0,\; x_2 < 0\}.$$
Since the values of $g_i$ are symmetric about the $i$-th coordinate axis, each point $x \in S_i$ has a corresponding point $y \in S_j$ with $g(x) = g(y)$, for all $i$ and $j$.
We can apply the same procedure to the image of $[g_1, g_2]$ to recursively divide the input space, as illustrated in Fig. 8. For instance, if we apply this procedure one more time, we get four regions within each $S_i$, resulting in 16 regions in total within the input space.
Figure 8: Illustration of Example 2. (a) A rectifier layer with two pairs of units, where each pair computes the absolute value of one of two input coordinates. Each input quadrant is mapped to the positive quadrant. (b) Depiction of a two layer model. Both layers simulate the absolute value of their input coordinates.
On the last layer, we may place
rectifiers in any way suitable for the task of interest (e.g., classification). The partition computed by
the last layer will be copied to each of the input space regions that produced the same input for the
last layer. Fig. 9 shows a function that can be implemented efficiently by a deep model using the
previous observations.
Figure 9: (a) Illustration of the partition computed by 8 rectifier units on the outputs $(x_1, x_2)$ of the preceding layer. The color is a heat map of $x_1 x_2$. (b) Heat map of a function computed by a rectifier network with 2 inputs, 2 hidden layers of width 4, and one linear output unit. The black lines delimit the regions of linearity of the function. (c) Heat map of a function computed by a 4 layer model with a total of 24 hidden units. It takes at least 137 hidden units on a shallow model to represent the same function.
The foregoing discussion can be easily generalized to $n_0 > 2$ input variables and $k$ hidden layers, each consisting of $2 n_0$ rectifiers. In that case, the maximal number of linear regions of functions computable by the network is lower-bounded as follows.
Theorem 2. The maximal number of regions of linearity of functions computable by a rectifier neural network with $n_0$ input variables and $k$ hidden layers of width $2 n_0$ is at least
$$2^{(k-1) n_0} \sum_{j=0}^{n_0} \binom{2 n_0}{j}.$$
Proof. We prove this constructively. We define the rectifier units in each hidden layer in pairs, with the sum of each pair giving the absolute value of a coordinate axis. We also interpret the sum of such pairs as the actual input coordinates of the subsequent hidden layers. The rectifiers in the first hidden layer are defined in pairs, such that the sum of each pair is the absolute value of one of the input dimensions, with bias equal to $(\tfrac{1}{2}, \ldots, \tfrac{1}{2})$. In the next hidden layers, the rectifiers are defined in a similar way, with the difference that each pair computes the absolute value of the sum of two of their inputs. The last hidden layer is defined in such a way that it computes a piecewise linear function with the maximal number of pieces, all of them intersecting the unit cube in $\mathbb{R}^{n_0}$. The maximal number of regions of linearity of $m$ rectifier units with $n_0$-dimensional input is $\sum_{j=0}^{n_0} \binom{m}{j}$. This partition is multiplied $2^{n_0}$ times by each previous layer.
The theorem shows that even for a small number of layers $k$, we can have many more linear regions in a deep model than in a shallow one. For example, if we set the input dimensionality to $n_0 = 2$, a shallow model with $4 n_0$ units will have at most 37 linear regions. The equivalent deep model with two layers of $2 n_0$ units can produce 44 linear regions. For $6 n_0$ hidden units the shallow model computes at most 79 regions, while the equivalent three layer model can compute 176 regions.
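These numbers follow directly from Proposition 2 and Theorem 2; the sketch below (an illustrative addition, with hypothetical helper names) reproduces them.

```python
from math import comb

def shallow_max_regions(n0, n_hidden):
    # Proposition 2: maximal regions of a shallow rectifier MLP with n_hidden units.
    return sum(comb(n_hidden, j) for j in range(n0 + 1))

def deep_bound_width_2n0(n0, k):
    # Theorem 2: 2^{(k-1) n0} * sum_{j <= n0} C(2 n0, j).
    return 2 ** ((k - 1) * n0) * sum(comb(2 * n0, j) for j in range(n0 + 1))

n0 = 2
print(shallow_max_regions(n0, 4 * n0), deep_bound_width_2n0(n0, 2))   # 37 vs 44
print(shallow_max_regions(n0, 6 * n0), deep_bound_width_2n0(n0, 3))   # 79 vs 176
```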
6 Discussion and conclusions
In this paper we introduced a novel way of understanding the expressiveness of neural networks
with piecewise linear activations. We count the number of regions of linearity, also called response
regions, of the functions that they can represent. The number of response regions tells us how well
the models can approximate arbitrary curved shapes. Computational Geometry provides us the tool
to make such statements.
We found that deep and narrow rectifier MLPs can generate many more regions of linearity than their shallow counterparts with the same number of computational units or of parameters. We can express this in terms of the ratio between the maximal number of response regions and the number of parameters of both model classes. For a deep model with $n_0 = O(1)$ inputs and $k$ hidden layers of width $n$, the maximal number of response regions per parameter behaves as
$$\Omega\!\left( \left\lfloor \frac{n}{n_0} \right\rfloor^{k-1} \frac{n^{n_0 - 2}}{k} \right).$$
For a shallow model with $n_0 = O(1)$ inputs, the maximal number of response regions per parameter behaves as
$$O\!\left( k^{n_0 - 1} n^{n_0 - 1} \right).$$
We see that the deep model can generate many more response regions per parameter than the shallow model: exponentially more regions per parameter in terms of the number of hidden layers $k$, and at least order $(k - 2)$ polynomially more regions per parameter in terms of the layer width $n$. In particular, there are deep models which use fewer parameters to produce more linear regions than their shallow counterparts. Details about the asymptotic expansions are given in the Appendix A.
In this paper we only considered linear output units, but this is not a restriction, as the output activation itself is not parametrized. If there is a target function $f_{\mathrm{targ}}$ that we want to model with a rectifier MLP with $\sigma$ as its output activation function, then there exists a function $f'_{\mathrm{targ}}$ such that $\sigma(f'_{\mathrm{targ}}) = f_{\mathrm{targ}}$: when $\sigma$ has an inverse (e.g., with sigmoid), $f'_{\mathrm{targ}} = \sigma^{-1}(f_{\mathrm{targ}})$. For activations that do not have an inverse, like softmax, there are infinitely many functions $f'_{\mathrm{targ}}$ that work. We just need to pick one; e.g., for softmax we can pick $\log(f_{\mathrm{targ}})$. By analyzing how well we can model $f'_{\mathrm{targ}}$ with a linear-output rectifier MLP we get an indirect measure of how well we can model $f_{\mathrm{targ}}$ with an MLP that has $\sigma$ as its output activation.
Another interesting observation is that we recover a high ratio of $n$ to $n_0$ if the data lives near a low-dimensional manifold (effectively like reducing the input size $n_0$). One-layer models can reach the upper bound of response regions only by spanning all the dimensions of the input. In other words, shallow models are not capable of concentrating linear response regions in any lower dimensional subspace of the input. If, as commonly assumed, data lives near a low dimensional manifold, then we care only about the number of response regions that a model can generate in the directions of the data manifold. One way of thinking about this is principal component analysis (PCA), where one finds that only a few input space directions (say on the MNIST database) are relevant to the underlying data. In such a situation, one cares about the number of response regions that a model can generate only within the directions in which the data does change. In such situations $n \gg n_0$, and our results show a clear advantage of using deep models.
We believe that the proposed framework can be used to answer many other interesting questions
about these models. For example, one can look at how the number of response regions is affected
by different constraints of the model, like shared weights. We think that this approach can also be
used to study other kinds of piecewise linear models, such as convolutional networks with rectifier
units or maxout networks, or also for comparing between different piecewise linear models.
A Asymptotic
Here we derive asymptotic expressions of the formulas contained in Proposition 2 and Theorem 1.
We use the following standard notation:
• $f(n) = O(g(n))$ means that there is a positive constant $c_2$ such that $f(n) \le c_2\, g(n)$ for all $n$ larger than some $N$.
• $f(n) = \Theta(g(n))$ means that there are two positive constants $c_1$ and $c_2$ such that $c_1\, g(n) \le f(n) \le c_2\, g(n)$ for all $n$ larger than some $N$.
• $f(n) = \Omega(g(n))$ means that there is a positive constant $c_1$ such that $f(n) \ge c_1\, g(n)$ for all $n$ larger than some $N$.
Proposition 6.
• Consider a single layer rectifier MLP with $kn$ units and $n_0$ inputs. Then the maximal number of regions of linearity of the functions represented by this network is
$$R(n_0, kn, 1) = \sum_{s=0}^{n_0} \binom{kn}{s},$$
and
$$R(n_0, kn, 1) = O(k^{n_0} n^{n_0}), \quad \text{when } n_0 = O(1).$$
• Consider a $k$ layer rectifier MLP with hidden layers of width $n$ and $n_0$ inputs. Then the maximal number of regions of linearity of the functions represented by this network satisfies
$$R(n_0, n, \ldots, n, 1) \ge \left( \prod_{i=1}^{k-1} \left\lfloor \frac{n}{n_0} \right\rfloor \right) \sum_{s=0}^{n_0} \binom{n}{s},$$
and
$$R(n_0, n, \ldots, n, 1) = \Omega\!\left( \left\lfloor \frac{n}{n_0} \right\rfloor^{k-1} n^{n_0} \right), \quad \text{when } n_0 = O(1).$$
Proof. Here only the asymptotic expressions remain to be shown. It is known that
$$\sum_{s=0}^{n_0} \binom{m}{s} = \Theta\!\left( \left(1 - \frac{2 n_0}{m}\right)^{-1} \binom{m}{n_0} \right), \quad \text{when } n_0 \le \frac{m}{2} - \sqrt{m}. \qquad (6)$$
Furthermore, it is known that
$$\binom{m}{s} = \frac{m^s}{s!} \left(1 + O\!\left(\frac{1}{m}\right)\right), \quad \text{when } s = O(1). \qquad (7)$$
When $n_0$ is constant, $n_0 = O(1)$, we have that
$$\binom{kn}{n_0} = \frac{k^{n_0}}{n_0!}\, n^{n_0} \left(1 + O\!\left(\frac{1}{kn}\right)\right).$$
In this case, it follows that
$$\sum_{s=0}^{n_0} \binom{kn}{s} = \Theta\!\left( \left(1 - \frac{2 n_0}{kn}\right)^{-1} \binom{kn}{n_0} \right) = \Theta\!\left( k^{n_0} n^{n_0} \right), \quad \text{and also} \quad \sum_{s=0}^{n_0} \binom{n}{s} = \Theta(n^{n_0}).$$
Furthermore,
$$\left( \prod_{i=1}^{k-1} \left\lfloor \frac{n}{n_0} \right\rfloor \right) \sum_{s=0}^{n_0} \binom{n}{s} = \Theta\!\left( \left\lfloor \frac{n}{n_0} \right\rfloor^{k-1} n^{n_0} \right).$$
We now analyze the number of response regions as a function of the number of parameters. When $k$ and $n_0$ are fixed, then $\lfloor n/n_0 \rfloor^{k-1}$ grows polynomially in $n$, and $k^{n_0}$ is constant. On the other hand, when $n$ is fixed with $n > 2 n_0$, then $\lfloor n/n_0 \rfloor^{k-1}$ grows exponentially in $k$, and $k^{n_0}$ grows polynomially in $k$.
Proposition 7. The number of parameters of a deep model with $n_0 = O(1)$ inputs, $n_{\mathrm{out}} = O(1)$ outputs, and $k$ hidden layers of width $n$ is
$$(k-1)\, n^2 + (k + n_0 + n_{\mathrm{out}})\, n + n_{\mathrm{out}} = O(k n^2).$$
The number of parameters of a shallow model with $n_0 = O(1)$ inputs, $n_{\mathrm{out}} = O(1)$ outputs, and $kn$ hidden units is
$$(n_0 + n_{\mathrm{out}})\, kn + kn + n_{\mathrm{out}} = O(kn).$$
Proof. For the deep model, each layer except the first and last has an input weight matrix with $n^2$ entries and a bias vector of length $n$. This gives a total of $(k-1)\, n^2 + (k-1)\, n$ parameters. The first layer has $n\, n_0$ input weights and $n$ biases. The output layer has an $n_{\mathrm{out}} \times n$ input weight matrix and $n_{\mathrm{out}}$ biases. If we sum these together we get
$$(k-1)\, n^2 + n\, (k + n_0 + n_{\mathrm{out}}) + n_{\mathrm{out}} = O(k n^2).$$
For the shallow model, the hidden layer has $kn\, n_0$ input weights and $kn$ biases. The output layer has $kn\, n_{\mathrm{out}}$ input weights and $n_{\mathrm{out}}$ biases. Summing these together we get
$$kn\, (n_0 + n_{\mathrm{out}}) + kn + n_{\mathrm{out}} = O(kn).$$
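The parameter counts of Proposition 7, together with the region counts from Proposition 2 and Theorem 1, can be reproduced directly; the sketch below is an illustrative addition with hypothetical helper names and an arbitrary example configuration.

```python
from math import comb, floor

def deep_params(n0, n, k, n_out=1):
    # (k-1) n^2 + (k-1) n hidden parameters, plus the first and output layers.
    return (k - 1) * n * n + n * (k + n0 + n_out) + n_out

def shallow_params(n0, kn, n_out=1):
    # kn hidden units: kn*n0 weights + kn biases, plus kn*n_out + n_out output parameters.
    return kn * (n0 + n_out) + kn + n_out

def deep_region_bound(n0, n, k):
    # Lower bound of Theorem 1 for k equal-width hidden layers.
    return floor(n / n0) ** (k - 1) * sum(comb(n, j) for j in range(n0 + 1))

def shallow_max_regions(n0, kn):
    # Exact maximum of Proposition 2.
    return sum(comb(kn, j) for j in range(n0 + 1))

n0, n, k = 2, 8, 4
print("deep:    regions >=", deep_region_bound(n0, n, k), "params =", deep_params(n0, n, k))
print("shallow: regions  =", shallow_max_regions(n0, k * n), "params =", shallow_params(n0, k * n))
```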
The number of linear regions per parameter can be given as follows.
Proposition 8. Consider a fixed number of inputs $n_0$ and a fixed number of outputs $n_{\mathrm{out}}$. The maximal ratio of the number of response regions to the number of parameters of a deep model with $k$ layers of width $n$ is
$$\Omega\!\left( \left\lfloor \frac{n}{n_0} \right\rfloor^{k-1} \frac{n^{n_0 - 2}}{k} \right).$$
In the case of a shallow model with $kn$ hidden units, the ratio is
$$O\!\left( k^{n_0 - 1} n^{n_0 - 1} \right).$$
Proof. This follows by combining Proposition 6 and Proposition 7.
We see that, fixing the number of parameters, deep models can compute functions with many more regions of linearity than those computable by shallow models. The ratio is exponential in the number of hidden layers $k$ and thus in the number of hidden units.
Acknowledgments
We would like to thank KyungHyun Cho, Çağlar Gülçehre, and anonymous ICLR reviewers for their comments. Razvan Pascanu is supported by a DeepMind Fellowship.
References
Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
Y. Bengio and O. Delalleau. On the expressive power of deep architectures. In J. Kivinen, C. Szepesvári, E. Ukkonen, and T. Zeugmann, editors, Algorithmic Learning Theory, volume 6925 of Lecture Notes in Computer Science, pages 18–36. Springer Berlin Heidelberg, 2011.
X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011.
I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML'2013, 2013.
A. Hajnal, W. Maass, P. Pudlák, M. Szegedy, and G. Turán. Threshold circuits of bounded depth. Journal of Computer and System Sciences, 46(2):129–154, 1993.
J. Håstad. Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th Annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley, California, 1986. ACM Press.
J. Håstad and M. Goldmann. On the power of small-depth threshold circuits. Computational Complexity, 1:113–129, 1991.
G. Hinton, L. Deng, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6):82–97, Nov. 2012a.
G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580, 2012b.
N. Le Roux and Y. Bengio. Deep belief networks are compact universal approximators. Neural Computation, 22(8):2192–2207, 2010.
H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Montreal (QC), Canada, 2009.
J. Martens, A. Chattopadhya, T. Pitassi, and R. Zemel. On the expressive power of restricted Boltzmann machines. In Advances in Neural Information Processing Systems 26, pages 2877–2885, 2013.
G. Montúfar and N. Ay. Refinements of universal approximation results for deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5):1306–1319, 2011.
G. Montúfar and J. Morton. When does a mixture of products contain a product of mixtures? arXiv preprint arXiv:1206.0387, 2012.
G. Montúfar, J. Rauh, and N. Ay. Expressive power and approximation errors of restricted Boltzmann machines. Advances in Neural Information Processing Systems, 24:415–423, 2011.
V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. pages 807–814, 2010.
H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 689–690, 2011.
R. Stanley. An introduction to hyperplane arrangements. In Lect. notes, IAS/Park City Math. Inst., 2004.
I. Sutskever and G. E. Hinton. Deep, narrow sigmoid belief networks are universal approximators. Neural Computation, 20(11):2629–2636, 2008.
T. Zaslavsky. Facing Up to Arrangements: Face-Count Formulas for Partitions of Space by Hyperplanes. Number 154 in Memoirs of the American Mathematical Society. American Mathematical Society, 1975.
M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. Technical report, arXiv:1311.2901, 2013.
... These can be considered both through the underlying geometry of the polytope partitioning of the network and through the linear function associated with the network within each polytope. Prior work has been done on establishing theoretical bounds on the number of regions that it is possible for a network to have (Pascanu et al., 2013;Montufar et al., 2014;Raghu et al., 2017) and on investigating metrics involving these structures (Novak et al., 2018). ...
... /frai. . nodes in a given layer, and depth, that is, the number of layers (Pascanu et al., 2013;Montufar et al., 2014;Raghu et al., 2017). The main result of this study is that the maximum number of linear regions a network can have grows polynomially in the width and exponentially in the depth (Raghu et al., 2017). ...
... There are also similarities and relationships between different W i or A i -because they are coming from the same network weight matrices with rows removed, there is an inherent structure in the specific values used to construct them. An additional note to make is that the number of regions, m, has the potential to be very large, with exponential growth in the depth and polynomial growth in the width of the network (Pascanu et al., 2013;Montufar et al., 2014;Raghu et al., 2017). Experimentally, trained networks have been shown to typically exhibit polynomial growth with the number of ReLU activations of the network, where the degree of the polynomial is the input dimension (Hanin and Rolnick, 2019b). ...
Article
Full-text available
A ReLU neural network functions as a continuous piecewise linear map from an input space to an output space. The weights in the neural network determine a partitioning of the input space into convex polytopes, where each polytope is associated with a distinct affine mapping. The structure of this partitioning, together with the affine map attached to each polytope, can be analyzed to investigate the behavior of the associated neural network. We investigate simple problems to build intuition on how these regions act and both how they can potentially be reduced in number and how similar structures occur across different networks. To validate these intuitions, we apply them to networks trained on MNIST to demonstrate similarity between those networks and the potential for them to be reduced in complexity.
... The number of polyhedra present in the input space, or within a bounded region thereof, serves as a measure of the network's expressivity and complexity. Bounds, both upper and lower, on the maximum number of attainable polyhedra for a given ReLU FFNN architecture can be found in multiple studies such as Pascanu et al. (2013), Montufar et al. (2014), Raghu et al. (2017), Arora et al. (2018), Serra et al. (2018), Hanin and Rolnick (2019), Hinz and van de Geer (2019), and Safran et al. (2022). In an alternative approach, Wang (2022) investigated local properties of the polyhedra, such as inspheres, hyperplane directions, decision boundaries, and the relevance of surrounding regions, to analyze the behavior of neural networks. ...
... The relationship between the number of polyhedra decomposed by an FFNN and the network depth/width has been extensively explored in prior studies such as Pascanu et al. (2013), Montufar et al. (2014), Raghu et al. (2017), and Hanin and Rolnick (2019).This study aims to investigate the relationship between the mean square error (MSE) of the objective function on the validation dataset and the number of polyhedra. The examination involves exploring different network structures while keeping a consistent number of nodes or layers. ...
Article
Full-text available
This paper investigates the integration of multiple geometries present within a ReLU-based neural network. A ReLU neural network determines a piecewise affine linear continuous map, M , from an input space ℝ m to an output space ℝ n . The piecewise behavior corresponds to a polyhedral decomposition of ℝ m . Each polyhedron in the decomposition can be labeled with a binary vector (whose length equals the number of ReLU nodes in the network) and with an affine linear function (which agrees with M when restricted to points in the polyhedron). We develop a toolbox that calculates the binary vector for a polyhedra containing a given data point with respect to a given ReLU FFNN. We utilize this binary vector to derive bounding facets for the corresponding polyhedron, extraction of “active” bits within the binary vector, enumeration of neighboring binary vectors, and visualization of the polyhedral decomposition (Python code is available at https://github.com/cglrtrgy/GoL_Toolbox ). Polyhedra in the polyhedral decomposition of ℝ m are neighbors if they share a facet. Binary vectors for neighboring polyhedra differ in exactly 1 bit. Using the toolbox, we analyze the Hamming distance between the binary vectors for polyhedra containing points from adversarial/nonadversarial datasets revealing distinct geometric properties. A bisection method is employed to identify sample points with a Hamming distance of 1 along the shortest Euclidean distance path, facilitating the analysis of local geometric interplay between Euclidean geometry and the polyhedral decomposition along the path. Additionally, we study the distribution of Chebyshev centers and related radii across different polyhedra, shedding light on the polyhedral shape, size, clustering, and aiding in the understanding of decision boundaries.
... To ensure the quality of the GEDI waveform data, we filtered the GEDI footprints using the following four criteria: (1) exclude footprints acquired (shooting time) during the daytime, to eliminate the potential influence of solar radiation on the waveform; (2) exclude footprints with low sensitivity (beam sensitivity < 0.95); (3) exclude footprints with low elevation accuracy (the difference between the GEDI-derived elevation and the SRTM elevation is larger than 32 m, approximately the 75% confidence interval of the elevation difference in this study (Pascanu et al., 2013)); (4) exclude any data with low data quality (degrade > 0, stale_return_flag = 1). See details in Table 1. ...
... Existing studies on pavement skid resistance using CNN models tend to achieve favorable outcomes, which are closely related to the preparation of large datasets, model architecture choices, and training methods. Still, there is a dynamic balance between the performance and the overhead of neural networks (Bengio & Lecun, 2007; Pascanu et al., 2013). Well-designed structures and sample expansion can reduce unnecessary resource consumption (Zhuang Liu et al., 2022). ...
Article
Full-text available
The surface texture of asphalt pavement has a significant effect on skid resistance performance. However, its contribution to skid resistance is non-homogeneous and subject to local validity, and few deep learning models take the effective contact texture region into account. This paper proposes a convolutional neural network model based on the effective contact texture region, containing macro- and micro-scale awareness sub-modules. In this study, asphalt mixtures with varying gradations were designed to accurately obtain the effective contact texture region. The textures were then disentangled into macro- and micro-texture scales by applying the fast Fourier transform and fed into the model for training. Finally, the area of the effective contact texture region was calculated, and an effective contact ratio parameter was proposed using a triangulation algorithm. The results showed that the effective contact texture area of the pavement varies with the asphalt mixture type. The effective contact ratio parameter exhibited a significant positive correlation (Pearson correlation coefficient = 0.901, R² = 0.8129) with skid resistance performance and was also influenced by the content of key sieve aggregate sizes from 2.36 to 4.75 mm. Disentangling the data of the effective contact texture region significantly improved model performance (the relative error dropped to 1.81%). The model exhibited improved precision and performance and can be used as an efficient, non-contact alternative method for skid resistance analysis.
Article
Full-text available
Transfer learning for partial differential equations (PDEs) aims to develop a pre-trained neural network that can be used to solve a wide class of PDEs. Existing transfer learning approaches require substantial information about the target PDEs, such as their formulation and/or data from their solutions, for pre-training. In this work, we propose to design transferable neural feature spaces for shallow neural networks from a purely function-approximation perspective, without using PDE information. The construction of the feature space involves re-parameterizing the hidden neurons and uses auxiliary functions to tune the resulting feature space. Theoretical analysis shows the high quality of the produced feature space, i.e., uniformly distributed neurons. We use the proposed feature space as the pre-determined feature space of a random feature model and use existing least-squares solvers to obtain the weights of the output layer. Extensive numerical experiments verify the outstanding performance of our method, including significantly improved transferability, e.g., the same feature space can be used for various PDEs with different domains and boundary conditions, and superior accuracy, e.g., a mean squared error several orders of magnitude smaller than that of state-of-the-art methods.
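The "random feature space plus least-squares output layer" recipe described here can be sketched in a few lines (a toy version with arbitrary tanh features and a 1-D target standing in for a PDE solution; the cited work constructs and tunes its feature space far more carefully):

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 200)[:, None]
y = np.sin(2 * np.pi * x).ravel()                 # toy target playing the role of a PDE solution

W = rng.uniform(-5.0, 5.0, size=(1, 100))         # random hidden weights, fixed and never trained
c = rng.uniform(-5.0, 5.0, size=100)
Phi = np.tanh(x @ W + c)                          # feature matrix, one column per hidden neuron

coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # only the output layer is solved for
print(np.max(np.abs(Phi @ coef - y)))             # maximum fit error on the sample points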
Chapter
Traditional CNNs lack specific interpretable indicators for configuring kernel hyperparameters. Practitioners typically tune model performance via tedious trial-and-error experiments, resulting in longer training times and expensive computational costs. To address this issue, we propose a novel systematic method, CKNA, that considers comprehensive kernel hyperparameters to optimize the network architecture. First, we define channel-dispersion and pseudo-kernel to characterize the behaviors differentiated by kernel values across model layers. Based on these concepts, a kernel scaling algorithm and a flat wave algorithm are proposed to automatically adjust kernel size and number, respectively. Furthermore, we decompose the channel-expansion and down-sampling operations into adjacent layers to break the bottleneck effect in kernel elements. CKNA is applied to classic group-wise CNNs, and experimental results on ImageNet demonstrate that the optimized models obtain at least a 5.73% improvement in classification accuracy.
Preprint
Full-text available
A ReLU neural network leads to a finite polyhedral decomposition of input space and a corresponding finite dual graph. We show that while this dual graph is a coarse quantization of input space, it is sufficiently robust that it can be combined with persistent homology to detect homological signals of manifolds in the input space from samples. This property holds for a variety of networks trained for a wide range of purposes that have nothing to do with this topological application. We found this feature to be surprising and interesting; we hope it will also be useful.
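A sample-based approximation of the dual graph described here can be sketched as follows (nodes are the activation patterns observed on a random sample, and edges join codes that differ in exactly one bit; a toy single-hidden-layer network with random weights, not the preprint's exact construction):

import itertools
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((12, 2))        # one hidden layer of 12 ReLU units (toy, random)
b = rng.standard_normal(12)

def code(x):
    return tuple((W @ x + b) > 0)       # binary vector labelling the region containing x

samples = rng.uniform(-1.0, 1.0, size=(5000, 2))
nodes = {code(x) for x in samples}
edges = [(u, v) for u, v in itertools.combinations(nodes, 2)
         if sum(p != q for p, q in zip(u, v)) == 1]
print(len(nodes), len(edges))           # a coarse, sample-based view of the region adjacency graph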
Conference Paper
Full-text available
We present explicit classes of probability distributions that can be learned by Restricted Boltzmann Machines (RBMs) depending on the number of units that they contain, and which are representative of the expressive power of the model. We use this to show that the maximal Kullback-Leibler divergence to the RBM model with $n$ visible and $m$ hidden units is bounded from above by $n - \left\lfloor \log(m+1) \right\rfloor - \frac{m+1}{2^{\left\lfloor\log(m+1)\right\rfloor}} \approx (n-1) - \log(m+1)$. In this way we can specify the number of hidden units that guarantees a sufficiently rich model containing different classes of distributions and respecting a given error tolerance.
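For concreteness, the bound quoted above can be evaluated directly (assuming, as is standard for binary RBMs, that the logarithm is base 2):

from math import floor, log2

def kl_upper_bound(n, m):
    # n visible units, m hidden units
    k = floor(log2(m + 1))
    return n - k - (m + 1) / 2**k

print(kl_upper_bound(10, 15))   # 10 - 4 - 16/16 = 5.0, matching the approximation (n-1) - log2(m+1)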
Article
Full-text available
When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks and sets new records for speech and object recognition.
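A minimal sketch of the mechanism described here (using the now-common "inverted" scaling at training time; the paper's original formulation instead scales the outgoing weights at test time):

import numpy as np

rng = np.random.default_rng(4)

def dropout(h, p_keep=0.5, training=True):
    if not training:
        return h                                   # no masking, no rescaling at test time
    mask = rng.random(h.shape) < p_keep            # each unit kept independently with probability p_keep
    return h * mask / p_keep                       # rescale so the expected output matches the input

h = rng.standard_normal(8)
print(dropout(h))                                  # roughly half the units zeroed on this training case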
Article
Full-text available
We prove results on the relative representational power of mixtures of product distributions and restricted Boltzmann machines (products of mixtures of pairs of product distributions). Tools of independent interest are mode-based polyhedral approximations sensitive enough to compare full-dimensional models, and characterizations of the possible modes and support sets of multivariate probability distributions that can be represented by both model classes. We find, in particular, that an exponentially larger mixture model, requiring an exponentially larger number of parameters, is needed to represent the probability distributions that can be represented by restricted Boltzmann machines. The title question is intimately related to questions in coding theory and point configurations in hyperplane arrangements.
Conference Paper
The key limiting factor in graphical model inference and learning is the complexity of the partition function. We thus ask the question: what are the most general conditions under which the partition function is tractable? The answer leads to a new kind of deep architecture, which we call sum-product networks (SPNs). SPNs are directed acyclic graphs with variables as leaves, sums and products as internal nodes, and weighted edges. We show that if an SPN is complete and consistent it represents the partition function and all marginals of some graphical model, and give semantics to its nodes. Essentially all tractable graphical models can be cast as SPNs, but SPNs are also strictly more general. We then propose learning algorithms for SPNs, based on backpropagation and EM. Experiments show that inference and learning with SPNs can be both faster and more accurate than with standard deep networks. For example, SPNs perform face completion better than state-of-the-art deep networks for this task. SPNs also have intriguing potential connections to the architecture of the cortex.
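The node semantics can be illustrated with a tiny hand-built SPN over two binary variables (toy weights chosen here for illustration; the network is complete and decomposable, with normalized sum weights):

def spn(l1, nl1, l2, nl2):
    # Leaves are indicator values for X1=1, X1=0, X2=1, X2=0.
    s1a = 0.7 * l1 + 0.3 * nl1          # sum node over X1
    s1b = 0.8 * l2 + 0.2 * nl2          # sum node over X2
    s2a = 0.2 * l1 + 0.8 * nl1
    s2b = 0.5 * l2 + 0.5 * nl2
    p1, p2 = s1a * s1b, s2a * s2b       # product nodes over disjoint scopes
    return 0.6 * p1 + 0.4 * p2          # root sum node with normalized weights

print(spn(1, 0, 0, 1))   # joint probability P(X1=1, X2=0) = 0.124
print(spn(1, 0, 1, 1))   # marginal P(X1=1) = 0.5 (both indicators of the unobserved X2 set to 1)
print(spn(1, 1, 1, 1))   # partition function = 1.0, since the weights are normalized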
Article
Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Mathematics, 1974. Includes bibliographical references (leaves 135–139). Vita.
Article
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks with many hidden layers that are trained using new methods have been shown to outperform Gaussian mixture models on a variety of speech recognition benchmarks, sometimes by a large margin. This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.
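One step of the hybrid setup that this summary leaves implicit, sketched here as an assumption about common practice rather than a claim made by the paper itself: the network's state posteriors are typically divided by the state priors to obtain scaled likelihoods for HMM decoding.

import numpy as np

posteriors = np.array([0.70, 0.20, 0.10])      # network output for one frame (hypothetical values)
priors     = np.array([0.50, 0.30, 0.20])      # state priors, e.g. estimated from the training alignment
scaled_log_likelihoods = np.log(posteriors) - np.log(priors)   # log p(x|s) up to a constant
print(scaled_log_likelihoods)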
Conference Paper
Large Convolutional Neural Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012). However, there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. We also perform an ablation study to discover the performance contribution from different model layers. This enables us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on the Caltech-101 and Caltech-256 datasets.