On the number of response regions of deep
feedforward networks with piecewise linear
activations
Razvan Pascanu
Université de Montréal
Montréal QC H3C 3J7, Canada
r.pascanu@gmail.com

Guido Montúfar
Max Planck Institute for Mathematics in the Sciences
Inselstraße 22, 04103 Leipzig, Germany
montufar@mis.mpg.de

Yoshua Bengio
Université de Montréal
Montréal QC H3C 3J7, Canada
yoshua.bengio@umontreal.ca
Abstract
This paper explores the complexity of deep feedforward networks with linear pre-
synaptic couplings and rectified linear activations. This is a contribution to the
growing body of work contrasting the representational power of deep and shallow
network architectures. In particular, we offer a framework for comparing deep
and shallow models that belong to the family of piecewise linear functions based
on computational geometry. We look at a deep rectifier multi-layer perceptron
(MLP) with linear output units and compare it with a single layer version of the
model. In the asymptotic regime, when the number of inputs stays constant, if the shallow model has $kn$ hidden units and $n_0$ inputs, then the number of linear regions is $O(k^{n_0} n^{n_0})$. For a $k$-layer model with $n$ hidden units on each layer it is $\Omega(\lfloor n/n_0 \rfloor^{k-1} n^{n_0})$. The number $\lfloor n/n_0 \rfloor^{k-1}$ grows faster than $k^{n_0}$ when $n$ tends to infinity, or when $k$ tends to infinity and $n \ge 2n_0$. Additionally, even when $k$ is small, if we restrict $n$ to be $2n_0$, we can show that a deep model has considerably more linear regions than a shallow one. We consider this as a first step towards understanding the complexity of these models and, specifically, towards providing suitable mathematical tools for future analysis.
Keywords: Deep learning, artificial neural network, rectifier unit, hyperplane ar-
rangement, representational power
1 Introduction
Deep systems are believed to play an important role in information processing of intelligent agents.
A common hypothesis underlying this belief is that deep models can be exponentially more efficient
at representing some functions than their shallow counterparts (see Bengio, 2009).
The argument is usually a compositional one. Higher layers in a deep model can re-use primitives
constructed by the lower layers in order to build gradually more complex functions. For example,
on a vision task, one would hope that the first layer learns Gabor filters capable of detecting edges of
different orientations. These edges are then put together at the second layer to form part-of-object
shapes. On higher layers, these part-of-object shapes are combined further to obtain detectors for
more complex part-of-object shapes or objects. Such a behaviour is empirically illustrated, for
instance, in Zeiler and Fergus (2013); Lee et al. (2009). On the other hand, a shallow model has to
construct detectors of target objects based only on the detectors learnt by the first layer.
The representational power of computational systems with shallow and deep architectures has been
studied intensively. A well-known result by Hajnal et al. (1993) derived lower complexity bounds for shallow threshold networks. Other works have explored the representational power of generative models based on Boltzmann machines (Montúfar et al., 2011; Martens et al., 2013) and deep belief networks (Sutskever and Hinton, 2008; Le Roux and Bengio, 2010; Montúfar and Ay, 2011), or have compared mixtures and products of experts models (Montúfar and Morton, 2012).
In addition to such inspections, a wealth of evidence for the validity of this hypothesis comes from
deep models consistently outperforming shallow ones on a variety of tasks and datasets (see, e.g.,
Goodfellow et al., 2013; Hinton et al., 2012b,a). However, theoretical results on the representational
power of deep models are limited, usually due to the composition of nonlinear functions in deep
models, which makes mathematical analysis difficult. Up to now, theoretical results have focussed on circuit operations (neural net unit computations) that are substantially different from those being used in real state-of-the-art deep learning applications, such as logic gates (Håstad, 1986), linear + threshold units with non-negative weights (Håstad and Goldmann, 1991), or polynomials (Bengio and Delalleau, 2011). Bengio and Delalleau (2011) show that deep sum-product networks (Poon and Domingos, 2011) can use exponentially fewer nodes to express some families of polynomials compared to shallow ones.
The present note analyzes the representational power of deep MLPs with rectifier units. Rectifier
units (Glorot et al., 2011; Nair and Hinton, 2010) and piecewise linearly activated units in general
(like the maxout unit (Goodfellow et al., 2013)), are becoming popular choices in designing deep
models, and most current state-of-the-art results involve one of these activations (Goodfellow
et al., 2013; Hinton et al., 2012b). Glorot et al. (2011) show that rectifier units have several properties
that make the optimization problem easier than the more traditional case using smooth and bounded
activations, such as tanh or sigmoid.
In this work we take advantage of the piecewise linear nature of the rectifier unit to mathematically
analyze the behaviour of deep rectifier MLPs. Given that the model is a composition of piecewise
linear functions, it is itself a piecewise linear function. We compare the flexibility of a deep model
with that of a shallow model by counting the number of linear regions they define over the input
space for a fixed number of hidden units. This is the number of pieces available to the model
in order to approximate some arbitrary nonlinear function. For example, if we want to perfectly
approximate some curved boundary between two classes, a rectifier MLP will have to use infinitely
many linear regions. In practice we have a finite number of pieces, and if we assume that we can
perfectly learn their optimal slopes, then the number of linear regions becomes a good proxy for
how well the model approximates this boundary. In this sense, the number of linear regions is an
upper bound for the flexibility of the model. In practice, the linear pieces are not independent and
the model may not be able to learn the right slope for each linear region. Specifically, for deep
models there is a correlation between regions, which results from the sharing of parameters between
the functions that describe the output on each region.
This is by no means a negative observation. If all the linear regions of the deep model were inde-
pendent of each other, by having many more linear regions, deep models would grossly overfit. The
correlation of the linear regions of a deep model results in its ability to generalize, by allowing it to
better represent only a small family of structured functions. These are functions that look compli-
cated (e.g., a distribution with a huge number of modes) but that have an underlying structure that
the network can ‘compress’ into its parameters. The number of regions, which indicates the number
of variations that the network can represent, provides a measure of how well it can fit this family of
structured functions (whose approximation potentially needs infinitely many linear regions).
We believe that this approach, based on counting the number of linear regions, is extensible to
any other piecewise linear activation function and also to other architectures, including the maxout
activation and convolutional networks with rectifier activations.
We know the maximal number of regions of linearity of functions computable by a shallow model
with a fixed number of hidden units. This number is given by a well studied geometrical problem.
The main insight of the present work is to provide a geometrical construction that describes the
regions of linearity of functions computed by deep models. We show that in the asymptotic regime,
these functions have many more linear regions than the ones computed by shallow models, for the
same number of hidden units.
For the single layer case, each hidden unit divides the input space in two, whereby the boundary is
given by a hyperplane. For all input values on one side of the hyperplane, the unit outputs a positive
value. For all input values on the other side of the hyperplane, the unit outputs 0. Therefore, the question that we are asking is: into how many regions do $n$ hyperplanes split the space? This question is studied in geometry under the name of hyperplane arrangements, with classic results such as Zaslavsky's theorem. Section 3 provides a quick introduction to the subject.
For the multilayer version of the model we rely on the following intuition. By using the rectifier nonlinearity, we identify multiple regions of the input space which are mapped by a given layer into an equivalent set of activations and thus represent equivalent inputs for the next layers. That is, a hidden layer can perform a kind of OR operation by reacting similarly to several different inputs. Any subsequent computation made on these activations is replicated on all equivalent inputs.
This paper is organized as follows. In Section 2 we provide definitions and basic observations about
piecewise linear functions. In Section 3 we discuss rectifier networks with one single hidden layer
and describe their properties in terms of hyperplane arrangements which are fairly well known in
the literature. In Section 4 we discuss deep rectifier networks and prove our main result, Theorem 1,
which describes their complexity in terms of the number of regions of linearity of functions that they
represent. Details about the asymptotic behaviour of the results derived in Sections 3 and 4 are given
in the Appendix A. In Section 5 we analyze a special type of deep rectifier MLP and show that even
for a small number of hidden layers it can generate a large number of linear regions. In Section 6
we offer a discussion of the results.
2 Preliminaries
We consider classes of functions (models) defined in the following way.
Definition 1. A rectifier feedforward network is a layered feedforward network, or multilayer perceptron (MLP), as shown in Fig. 1, with the following properties. Each hidden unit receives as inputs the real-valued activations $x_1, \ldots, x_n$ of all units in the previous layer, computes the weighted sum
$$s = \sum_{i \in [n]} w_i x_i + b,$$
and outputs the rectified value
$$\mathrm{rect}(s) = \max\{0, s\}.$$
The real parameters $w_1, \ldots, w_n$ are the input weights and $b$ is the bias of the unit. The output layer is a linear layer; that is, the units in the last layer compute a linear combination of their inputs and output it unrectified.

Given a vector of naturals $\mathbf{n} = (n_0, n_1, \ldots, n_L)$, we denote by $\mathcal{F}_{\mathbf{n}}$ the set of all functions $\mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$ that can be computed by a rectifier feedforward network with $n_0$ inputs and $n_l$ units in layer $l$, for $l \in [L]$. The elements of $\mathcal{F}_{\mathbf{n}}$ are continuous piecewise linear functions.
We denote by $R(\mathbf{n})$ the maximum of the number of regions of linearity, or response regions, over all functions from $\mathcal{F}_{\mathbf{n}}$. For clarity, given a function $f \colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$, a connected open subset $R \subseteq \mathbb{R}^{n_0}$ is called a region of linearity, linear region, or response region of $f$ if the restriction $f|_R$ is a linear function and for any open set $\tilde{R} \supsetneq R$ the restriction $f|_{\tilde{R}}$ is not a linear function. In the next sections we will compute bounds on $R(\mathbf{n})$ for different choices of $\mathbf{n}$. We are especially interested in the comparison of shallow networks with one single very wide hidden layer and deep networks with many narrow hidden layers.
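As a concrete illustration of Definition 1, the following sketch evaluates a rectifier feedforward network with rectified hidden layers and a linear output layer. It is written in plain numpy; the function names are ours, not from any particular library.

```python
import numpy as np

def rect(s):
    """Rectifier activation: rect(s) = max{0, s}, applied elementwise."""
    return np.maximum(0.0, s)

def rectifier_network(x, weights, biases, W_out, b_out):
    """Forward pass of a rectifier feedforward network (Definition 1).

    weights[l], biases[l] parameterize hidden layer l; the output
    layer is linear (no rectification), as in the definition.
    """
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        h = rect(W @ h + b)          # hidden layers: rectified affine maps
    return W_out @ h + b_out         # linear output layer

# Example with n = (1, 2, 1): one input, two hidden units, one output.
# This network computes rect(x) - 2*rect(x - 1), a continuous piecewise
# linear function with three regions of linearity: 0, then x, then 2 - x.
W1 = np.array([[1.0], [1.0]]); b1 = np.array([0.0, -1.0])
Wout = np.array([[1.0, -2.0]]); bout = np.array([0.0])
y = rectifier_network(np.array([2.0]), [W1], [b1], Wout, bout)  # → [0.0]
```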
In the remainder of this section we state three simple lemmas.
The next lemma states that a piecewise linear function $f = (f_i)_{i \in [k]}$ has as many regions of linearity as there are distinct intersections of regions of linearity of the coordinates $f_i$.
Figure 1: Illustration of a rectifier feedforward network with two hidden layers. From the input $x$, the first hidden layer computes $h^{(1)} = \mathrm{rect}(W^{(1)} x + b^{(1)})$, the second hidden layer computes $h^{(2)} = \mathrm{rect}(W^{(2)} h^{(1)} + b^{(2)})$, and the output is $W^{(\mathrm{out})} h^{(2)}$.

Lemma 1. Consider a width-$k$ layer of rectifier units. Let $\mathcal{R}^i = \{R^i_1, \ldots, R^i_{N_i}\}$ be the regions of linearity of the function $f_i \colon \mathbb{R}^{n_0} \to \mathbb{R}$ computed by the $i$-th unit, for all $i \in [k]$. Then the regions of linearity of the function $f = (f_i)_{i \in [k]} \colon \mathbb{R}^{n_0} \to \mathbb{R}^k$ computed by the rectifier layer are the elements of the set $\{R_{j_1, \ldots, j_k} = R^1_{j_1} \cap \cdots \cap R^k_{j_k}\}_{(j_1, \ldots, j_k) \in [N_1] \times \cdots \times [N_k]}$.
Proof. A function $f = (f_1, \ldots, f_k) \colon \mathbb{R}^n \to \mathbb{R}^k$ is linear iff all its coordinates $f_1, \ldots, f_k$ are.
In regard to the number of regions of linearity of the functions represented by rectifier networks,
the number of output dimensions, i.e., the number of linear output units, is irrelevant. This is the
statement of the next lemma.
Lemma 2. The number of (linear) output units of a rectifier feedforward network does not affect the
maximal number of regions of linearity that it can realize.
Proof. Let $f \colon \mathbb{R}^{n_0} \to \mathbb{R}^k$ be the map of inputs to activations in the last hidden layer of a deep feedforward rectifier model. Let $h = g \circ f$ be the map of inputs to activations of the output units, given by composition of $f$ with the linear output layer, $h(x) = W^{(\mathrm{out})} f(x) + b^{(\mathrm{out})}$. If the row span of $W^{(\mathrm{out})}$ is not orthogonal to any difference of gradients of neighbouring regions of linearity of $f$, then $g$ captures all discontinuities of $\nabla f$. In this case both functions $f$ and $h$ have the same number of regions of linearity.

If the number of regions of $f$ is finite, then the number of differences of gradients is finite and there is a vector outside the union of their orthogonal spaces. Hence a matrix with a single row (a single output unit) suffices to capture all transitions between different regions of linearity of $f$.
Lemma 3. A layer of $n$ rectifier units with $n_0$ inputs can compute any function that can be computed by the composition of a linear layer with $n_0$ inputs and $n'_0$ outputs and a rectifier layer with $n'_0$ inputs and $n_1$ outputs, for any $n_0, n'_0, n_1 \in \mathbb{N}$.

Proof. A rectifier layer computes functions of the form $x \mapsto \mathrm{rect}(Wx + b)$, with $W \in \mathbb{R}^{n_1 \times n_0}$ and $b \in \mathbb{R}^{n_1}$. The argument $Wx + b$ is an affine function of $x$. The claim follows from the fact that any composition of affine functions is an affine function.
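Lemma 3 can be checked numerically: a linear layer followed by a rectifier layer computes the same function as a single rectifier layer whose parameters are the composed affine map. A minimal sketch with arbitrary seeded weights (notation is ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n0p, n1 = 3, 4, 5          # n0 inputs, n0' intermediate units, n1 outputs

# A linear layer (A, c) followed by a rectifier layer (V, d) ...
A = rng.standard_normal((n0p, n0)); c = rng.standard_normal(n0p)
V = rng.standard_normal((n1, n0p)); d = rng.standard_normal(n1)

# ... equals one rectifier layer with composed affine parameters,
# since a composition of affine functions is affine.
W = V @ A
b = V @ c + d

x = rng.standard_normal(n0)
two_step = np.maximum(0.0, V @ (A @ x + c) + d)
one_step = np.maximum(0.0, W @ x + b)
assert np.allclose(two_step, one_step)
```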
3 One hidden layer
Let us look at the number of response regions of a single hidden layer MLP with $n_0$ input units and $n$ hidden units. We first formulate the rectifier unit as follows:
$$\mathrm{rect}(s) = I(s) \cdot s, \qquad (1)$$
where $I$ is the indicator function defined as
$$I(s) = \begin{cases} 1, & \text{if } s > 0 \\ 0, & \text{otherwise.} \end{cases} \qquad (2)$$
We can now write the single hidden layer MLP with $n_y$ outputs as the function $f \colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_y}$;
$$f(x) = W^{(\mathrm{out})} \, \mathrm{diag}\!\left( I\big(W^{(1)}_{1,:} x + b^{(1)}_1\big), \ldots, I\big(W^{(1)}_{n_1,:} x + b^{(1)}_{n_1}\big) \right) \big( W^{(1)} x + b^{(1)} \big) + b^{(\mathrm{out})}. \qquad (3)$$
From this formulation it is clear that each unit $i$ in the hidden layer has two operational modes: one where the unit takes the value $0$, and one where it takes a non-zero value. The boundary between these two operational modes is given by the hyperplane $H_i$ consisting of all inputs $x \in \mathbb{R}^{n_0}$ with $W^{(1)}_{i,:} x + b^{(1)}_i = 0$. Below this hyperplane, the activation of the unit is constant equal to zero, and above it, the activation is linear with gradient equal to $W^{(1)}_{i,:}$. It follows that the number of regions of linearity of a single layer MLP is equal to the number of regions formed by the set of hyperplanes $\{H_i\}_{i \in [n_1]}$.
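This correspondence is easy to probe numerically. The sketch below (our own construction, not taken from the paper) places three explicit lines in general position in $\mathbb{R}^2$, samples the input space on a grid, and counts the distinct activation patterns of the corresponding rectifier units. Grid sampling can only exhibit regions, but here it finds all $\binom{3}{2} + 3 + 1 = 7$ of them:

```python
import numpy as np

# Three hyperplanes (lines) in R^2 in general position:
# x = 0, y = 0, and x + y = 1.
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, -1.0])

# Sample the input space on a grid and record which units are active
# at each point; each distinct pattern witnesses one arrangement region.
xs = np.linspace(-2.0, 2.0, 201)
points = np.array([[x, y] for x in xs for y in xs])
patterns = {tuple(row) for row in (points @ W.T + b > 0)}

print(len(patterns))  # → 7
```

The one sign pattern out of $2^3 = 8$ that never appears is $x < 0$, $y < 0$, $x + y > 1$, which is infeasible; the remaining 7 match the count for 3 lines in general position.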
A finite set of hyperplanes in a common $n_0$-dimensional Euclidean space is called an $n_0$-dimensional hyperplane arrangement. A region of an arrangement $\mathcal{A} = \{H_i \subset \mathbb{R}^{n_0}\}_{i \in [n]}$ is a connected component of the complement of the union of the hyperplanes, i.e., a connected component of $\mathbb{R}^{n_0} \setminus (\cup_{i \in [n]} H_i)$. To make this clearer, consider an arrangement $\mathcal{A}$ consisting of hyperplanes $H_i = \{x \in \mathbb{R}^{n_0} \colon W_{i,:} x + b_i = 0\}$ for all $i \in [n]$, for some $W \in \mathbb{R}^{n \times n_0}$ and some $b \in \mathbb{R}^n$. A region of $\mathcal{A}$ is a set of points of the form $R = \{x \in \mathbb{R}^{n_0} \colon \mathrm{sgn}(Wx + b) = s\}$ for some sign vector $s \in \{-, +\}^n$.
A region of an arrangement is relatively bounded if its intersection with the space spanned by the normals of the hyperplanes is bounded. We denote by $r(\mathcal{A})$ the number of regions and by $b(\mathcal{A})$ the number of relatively bounded regions of an arrangement $\mathcal{A}$. The essentialization of an arrangement $\mathcal{A} = \{H_i\}_i$ is the arrangement consisting of the hyperplanes $H_i \cap N$ for all $i$, defined in the span $N$ of the normals of the hyperplanes $H_i$. For example, the essentialization of an arrangement of two non-parallel planes in $\mathbb{R}^3$ is an arrangement of two lines in a plane.
Problem 1. How many regions are generated by an arrangement of $n$ hyperplanes in $\mathbb{R}^{n_0}$?
The general answer to Problem 1 is given by Zaslavsky’s theorem (Zaslavsky, 1975, Theorem A),
which is one of the central results from the theory of hyperplane arrangements.
We will only need the special case of hyperplanes in general position, which realize the maximal possible number of regions. Formally, an $n$-dimensional arrangement $\mathcal{A}$ is in general position if for any subset $\{H_1, \ldots, H_p\} \subseteq \mathcal{A}$ the following holds: (1) if $p \le n$, then $\dim(H_1 \cap \cdots \cap H_p) = n - p$; (2) if $p > n$, then $H_1 \cap \cdots \cap H_p = \emptyset$. An arrangement is in general position if the weights $W$, $b$ defining its hyperplanes are generic. This means that any arrangement can be perturbed by an arbitrarily small perturbation in such a way that the resulting arrangement is in general position.

For arrangements in general position, Zaslavsky's theorem can be stated in the following way (see Stanley, 2004, Proposition 2.4).
Proposition 1. Let $\mathcal{A}$ be an arrangement of $m$ hyperplanes in general position in $\mathbb{R}^{n_0}$. Then
$$r(\mathcal{A}) = \sum_{s=0}^{n_0} \binom{m}{s}, \qquad b(\mathcal{A}) = \binom{m-1}{n_0}.$$
In particular, the number of regions of a 2-dimensional arrangement $\mathcal{A}_m$ of $m$ lines in general position is equal to
$$r(\mathcal{A}_m) = \binom{m}{2} + m + 1. \qquad (4)$$
Figure 2: Induction step of the hyperplane sweep method for counting the regions of line arrangements in the plane.
For the purpose of illustration, we sketch a proof of eq. (4) using the sweep hyperplane method. We proceed by induction over the number of lines $m$.

Base case $m = 0$. It is obvious that in this case there is a single region, corresponding to the entire plane. Therefore, $r(\mathcal{A}_0) = 1$.

Induction step. Assume that for $m$ lines the number of regions is $r(\mathcal{A}_m) = \binom{m}{2} + m + 1$, and add a new line $L_{m+1}$ to the arrangement. Since we assumed the lines are in general position, $L_{m+1}$ intersects each of the existing lines $L_k$ at a different point. Fig. 2 depicts the situation for $m = 2$. The $m$ intersection points split the line $L_{m+1}$ into $m + 1$ segments. Each of these segments cuts a region of $\mathcal{A}_m$ in two pieces. Therefore, by adding the line $L_{m+1}$ we get $m + 1$ new regions. In Fig. 2 the two intersection points result in three segments that split each of the regions $R_1$, $R_{12}$, $R_2$ in two. Hence
$$r(\mathcal{A}_{m+1}) = r(\mathcal{A}_m) + m + 1 = \frac{m(m-1)}{2} + m + 1 + m + 1 = \frac{m(m+1)}{2} + (m+1) + 1 = \binom{m+1}{2} + (m+1) + 1.$$
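The recurrence from this proof sketch can be verified mechanically against the closed form of eq. (4) and against the general-position count of Proposition 1 (a small check in Python; the function name is ours):

```python
from math import comb

def regions_general_position(m, n0):
    """Zaslavsky count for m hyperplanes in general position in R^{n0}."""
    return sum(comb(m, s) for s in range(n0 + 1))

# Sweep-line recurrence: r(0) = 1, r(m+1) = r(m) + m + 1.
r = 1
for m in range(20):
    assert r == comb(m, 2) + m + 1              # eq. (4)
    assert r == regions_general_position(m, 2)  # Proposition 1, n0 = 2
    r += m + 1
```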
For the number of response regions of MLPs with one single hidden layer we obtain the following.

Proposition 2. The regions of linearity of a function in the model $\mathcal{F}_{(n_0, n_1, 1)}$ with $n_0$ inputs and $n_1$ hidden units are given by the regions of an arrangement of $n_1$ hyperplanes in $n_0$-dimensional space. The maximal number of regions of such an arrangement is $R(n_0, n_1, n_y) = \sum_{j=0}^{n_0} \binom{n_1}{j}$.

Proof. This is a consequence of Lemma 1. The maximal number of regions is produced by an $n_0$-dimensional arrangement of $n_1$ hyperplanes in general position, which is given in Proposition 1.
4 Multiple hidden layers
In order to show that a $k$ hidden layer model can be more expressive than a single hidden layer one with the same number of hidden units, we will need the next three propositions.
Proposition 3. Any arrangement can be scaled down and shifted such that all regions of the arrangement intersect the unit ball.

Proof. Let $\mathcal{A}$ be an arrangement and let $S$ be a ball of radius $r$ and center $c$. Let $d$ be the supremum of the distance from the origin to a point in a bounded region of the essentialization of the arrangement $\mathcal{A}$. Consider the map $\phi \colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_0}$ defined by $\phi(x) = \frac{r}{2d} \cdot x + c$. Then $\mathcal{A}' = \phi(\mathcal{A})$ is an arrangement satisfying the claim. It is easy to see that any point with norm bounded by $d$ is mapped to a point inside the ball $S$.
Figure 3: An arrangement $\mathcal{A}$ and a scaled-shifted version $\mathcal{A}'$ whose regions intersect the ball $S$.
The proposition is illustrated in Fig. 3.
We need some additional notation in order to formulate the next proposition. Given a hyperplane $H = \{x \colon w^\top x + b = 0\}$, we consider the region $H^- = \{x \colon w^\top x + b < 0\}$ and the region $H^+ = \{x \colon w^\top x + b \ge 0\}$. If we think about the corresponding rectifier unit, then $H^+$ is the region where the unit is active and $H^-$ is the region where the unit is dead.

Let $R$ be a region delimited by the hyperplanes $\{H_1, \ldots, H_n\}$. We denote by $R^+ \subseteq \{1, \ldots, n\}$ the set of all hyperplane indices $j$ with $R \subset H^+_j$. In other words, $R^+$ is the list of hidden units that are active (non-zero) in the input-space region $R$.
The following proposition describes the combinatorics of 2-dimensional arrangements in general
position. More precisely, the proposition describes the combinatorics of n-dimensional arrange-
ments with 2-dimensional essentialization in general position. Recall that the essentialization of
an arrangement is the arrangement that it defines in the subspace spanned by the normals of its
hyperplanes.
The proposition guarantees the existence of input weights and bias for a rectifier layer such that for
any list of consecutive units, there is a region of inputs for which exactly the units from that list are
active.
Proposition 4. For any $n_0, n \in \mathbb{N}$, $n \ge 2$, there exists an $n_0$-dimensional arrangement $\mathcal{A}$ of $n$ hyperplanes such that for any pair $a, b \in \{1, \ldots, n\}$ with $a < b$, there is a region $R$ of $\mathcal{A}$ with $R^+ = \{a, a+1, \ldots, b\}$.
We show that the hyperplanes of a 2-dimensional arrangement in general position can be indexed in
such a way that the claim of the proposition holds. For higher dimensional arrangements the state-
ment follows trivially, applying the 2-dimensional statement to the intersection of the arrangement
with a 2-subspace.
Proof of Proposition 4. Consider first the case $n_0 = 2$. We define the first line $L_1$ of the arrangement to be the $x$-axis of the standard coordinate system. To define the second line $L_2$, we consider a circle $S_1$ of radius $r \in \mathbb{R}_+$ centered at the origin. We define $L_2$ to be the tangent of $S_1$ at an angle $\alpha_1$ to the $y$-axis, where $0 < \alpha_1 < \frac{\pi}{2}$. The top left panel of Fig. 4 depicts the situation. In the figure, $R_\emptyset$ corresponds to inputs for which no rectifier unit is active, $R_1$ corresponds to inputs where the first unit is active, $R_2$ to inputs where the second unit is active, and $R_{12}$ to inputs where both units are active. This arrangement has the claimed properties.
Now assume that there is an arrangement of $n$ lines with the claimed properties. To add an $(n+1)$-th line, we first consider the maximal distance $d_{\max}$ from the origin to the intersection of two lines $L_i \cap L_j$ with $1 \le i < j \le n$. We also consider the circle $S_n$ of radius $d_{\max} + r$ centered at the origin. The circle $S_n$ contains all intersections of the first $n$ lines. We now choose an angle $\alpha_n$ with $0 < \alpha_n < \alpha_{n-1}$ and define $L_{n+1}$ as the tangent of $S_n$ that forms an angle $\alpha_n$ with the $y$-axis. Fig. 4 depicts adding the third and fourth lines to the arrangement.

After adding line $L_{n+1}$, we have that the arrangement

1. is in general position;

2. has regions $R'_1, \ldots, R'_{n+1}$ with $R'^+_i = \{i, i+1, \ldots, n+1\}$ for all $i \in [n+1]$.
Figure 4: Illustration of the hyperplane arrangement discussed in Proposition 4, in the 2-dimensional case. On the left, arrangements of two and three lines; on the right, an arrangement of four lines.
The regions of the arrangement are stable under perturbation of the angles and radii used to define
the lines. Any slight perturbation of these parameters preserves the list of regions. Therefore, the
arrangement is in general position.
The second property comes from the order in which $L_{n+1}$ intersects all previous lines. $L_{n+1}$ intersects the lines in the order in which they were added to the arrangement: $L_1, L_2, \ldots, L_n$. The intersection of $L_{n+1}$ and $L_i$, $B_{i,n+1} = L_{n+1} \cap L_i$, is above the lines $L_{i+1}, L_{i+2}, \ldots, L_n$, and hence the segment $B_{i-1,n+1} B_{i,n+1}$ between the intersections with $L_{i-1}$ and with $L_i$ has to cut the region in which only units $i$ to $n$ are active.

The intersection order is ensured by the choice of angles $\alpha_i$ and the fact that the lines are tangent to the circles $S_i$. For any $i < j$ and $B_{ij} = L_i \cap L_j$, let $T_{ij}$ be the line parallel to the $y$-axis passing through $B_{ij}$. Each line $T_{ij}$ divides the space in two. Let $H_{ij}$ be the half-space to the right of $T_{ij}$. Within any half-space $H_{ij}$, the intersection $H_{ij} \cap L_i$ is above $H_{ij} \cap L_j$, because the angle $\alpha_{i-1}$ of $L_i$ with the $y$-axis is larger than $\alpha_{j-1}$ (this means $L_j$ has a steeper decrease). Since $L_{n+1}$ is tangent to the circle that contains all points $B_{ij}$, the line $L_{n+1}$ will intersect lines $L_i$ and $L_j$ in $H_{ij}$, and therefore it has to intersect $L_i$ first.
For $n_0 > 2$ we can consider an arrangement that is essentially 2-dimensional and has the properties of the arrangement described above. To do this, we construct a 2-dimensional arrangement in a 2-subspace of $\mathbb{R}^{n_0}$ and then extend each of the lines $L_i$ of the arrangement to a hyperplane $H_i$ that crosses $L_i$ orthogonally. The resulting arrangement satisfies all claims of the proposition.
The next proposition guarantees the existence of a collection of affine maps with shared bias, which
map a collection of regions to a common output.
Proposition 5. Consider two integers $n_0$ and $p$. Let $S$ denote the $n_0$-dimensional unit ball and let $R_1, \ldots, R_p \subseteq \mathbb{R}^{n_0}$ be some regions with non-empty interiors. Then there is a choice of weights $c \in \mathbb{R}^{n_0}$ and $U_1, \ldots, U_p \in \mathbb{R}^{n_0 \times n_0}$ for which $g_i(R_i) \supseteq S$ for all $i \in [p]$, where $g_i \colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_0}$; $y \mapsto U_i y + c$.
Figure 5: Illustration of Example 1. The units represented by squares build an intermediary layer of linear units between the first and the second hidden layers, computing $g(h^{(1)})$ from the first-layer activations $h^{(1)}$; the second hidden layer then computes $\mathrm{rect}(V^{(2)} g(h^{(1)}) + d^{(2)})$. The computation of such an intermediary linear layer can be absorbed into the second hidden layer of rectifier units (Lemma 3). The connectivity map depicts the maps $g_1$ by dashed arrows and $g_2$ by dashed-dotted arrows.
Proof. To see this, consider the following construction. For each region $R_i$ consider a ball $S_i \subseteq R_i$ of radius $r_i \in \mathbb{R}_+$ and center $s_i = (s_{i1}, \ldots, s_{i n_0}) \in \mathbb{R}^{n_0}$. For each $j = 1, \ldots, n_0$, consider $p$ positive numbers $u_{1j}, \ldots, u_{pj}$ such that $u_{ij} s_{ij} = u_{kj} s_{kj}$ for all $1 \le k < i \le p$. This can be done by fixing $u_{1j}$ equal to $1$ and solving the equation for all other numbers. Let $\eta \in \mathbb{R}$ be such that $r_i \eta u_{ij} > 1$ for any $j$ and $i$. Scaling each region $R_i$ by $U_i = \mathrm{diag}(\eta u_{i1}, \ldots, \eta u_{i n_0})$ transforms the center of $S_i$ to the same point for all $i$. By the choice of $\eta$, the minor radius of all transformed balls is larger than $1$.

We can now set $c$ to be minus the common center of the scaled balls, to obtain the map
$$g_i(x) = \mathrm{diag}(\eta u_{i1}, \ldots, \eta u_{i n_0})\, x - \mathrm{diag}(\eta u_{11}, \ldots, \eta u_{1 n_0})\, s_1, \quad \text{for all } 1 \le i \le p.$$
These $g_i$ satisfy the claimed property, namely that $g_i(R_i)$ contains the unit ball, for all $i$.
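The construction in this proof can be sketched numerically. The helper below is ours and, for simplicity, assumes ball centers with strictly positive coordinates, so that fixing $u_{1j} = 1$ gives $u_{ij} = s_{1j}/s_{ij}$; the `margin` parameter is an illustrative stand-in for the choice of $\eta$:

```python
import numpy as np

def shared_bias_maps(centers, radii, margin=1.1):
    """Build U_i = diag(eta * u_i) and shared bias c (Proposition 5 sketch).

    centers: (p, n0) array of ball centers s_i, assumed strictly positive.
    radii:   (p,) array of ball radii r_i.
    Returns (Us, c) with g_i(y) = Us[i] @ y + c mapping each ball
    B(s_i, r_i) onto an origin-centered ellipsoid containing the unit ball.
    """
    centers = np.asarray(centers, float)
    radii = np.asarray(radii, float)
    u = centers[0] / centers                  # u_ij * s_ij equal for all i
    eta = margin / (radii[:, None] * u).min() # every semiaxis r_i*eta*u_ij > 1
    Us = np.stack([np.diag(eta * u_i) for u_i in u])
    c = -eta * centers[0]                     # minus the common center
    return Us, c

# Example: two disjoint balls in R^2 with positive centers.
centers = np.array([[2.0, 1.0], [4.0, 5.0]])
radii = np.array([1.0, 0.5])
Us, c = shared_bias_maps(centers, radii)
```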
Before proceeding, we discuss an example illustrating how the previous propositions and lemmas
are put together to prove our main result below, in Theorem 1.
Example 1. Consider a rectifier MLP with $n_0 = 2$, such that the input space is $\mathbb{R}^2$, and assume that the network has only two hidden layers, each consisting of $n = 2n_0$ units. Each unit in the first hidden layer defines a hyperplane in $\mathbb{R}^2$, namely the hyperplane that separates the inputs for which it is active from the inputs for which it is not active. Hence the first hidden layer defines an arrangement of $n$ hyperplanes in $\mathbb{R}^2$. By Proposition 4, this arrangement can be made such that it delimits regions of inputs $R_1, \ldots, R_{n_0} \subseteq \mathbb{R}^2$ with the following property: for each input in any given one of these regions, exactly one pair of units in the first hidden layer is active, and, furthermore, the pairs of units that are active on different regions are disjoint.

By the definition of rectifier units, each hidden unit computes a linear function within the half-space of inputs where it is active. In turn, the image of $R_i$ by the pair of units that is active in $R_i$ is a polyhedron in $\mathbb{R}^2$. For each region $R_i$, denote the corresponding polyhedron by $S_i$.
Recall that a rectifier layer computes a map of the form $f \colon \mathbb{R}^n \to \mathbb{R}^m$; $x \mapsto \mathrm{rect}(Wx + b)$. Hence a rectifier layer with $n$ inputs and $m$ outputs can compute any composition $f' \circ g$ of an affine map $g \colon \mathbb{R}^n \to \mathbb{R}^k$ and a map $f'$ computed by a rectifier layer with $k$ inputs and $m$ outputs (Lemma 3).

Consider the map computed by the rectifier units in the second hidden layer, i.e., the map that takes activations from the first hidden layer and outputs activations from the second hidden layer. We think of this map as a composition $f' \circ g$ of an affine map $g \colon \mathbb{R}^n \to \mathbb{R}^2$ and a map $f'$ computed by a rectifier layer with $2$ inputs. The map $g$ can be interpreted as an intermediary layer consisting of two linear units, as illustrated in Fig. 5.
Figure 6: Constructing $\left\lfloor \frac{n_1}{n_0} \right\rfloor \sum_{k=0}^{n_0} \binom{n_2}{k}$ response regions in a model with two layers.
Within each input region $R_i$, only two units in the first hidden layer are active. Therefore, for each input region $R_i$, the output of the intermediary layer is an affine transformation of $S_i$. Furthermore, the weights of the intermediary layer can be chosen in such a way that the image of each $R_i$ contains the unit ball.

Now, $f'$ is the map computed by a rectifier layer with $2$ inputs and $n$ outputs. It is possible to define this map in such a way that it has $R$ regions of linearity within the unit ball, where $R$ is the number of regions of a 2-dimensional arrangement of $n$ hyperplanes in general position.

We see that the entire network computes a function which has $R$ regions of linearity within each one of the input regions $R_1, \ldots, R_{n_0}$. Each input region $R_i$ is mapped by the concatenation of the first and the intermediate (notional) layer to a subset of $\mathbb{R}^2$ which contains the unit ball. Then, the second layer computes a function which partitions the unit ball into many pieces. The partition computed by the second layer gets replicated in each of the input regions $R_i$, resulting in a subdivision of the input space into exponentially many pieces (exponential in the number of network layers).
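The replication effect is easiest to see in one dimension. The following sketch (a standard illustration in this spirit, not the paper's 2-dimensional construction) uses a width-2 rectifier layer whose restriction to $[0,1]$ folds both halves of the interval onto $[0,1]$; composing it $k$ times replicates the partition at every level and yields $2^k$ linear pieces:

```python
import numpy as np

def g(x):
    """A width-2 rectifier layer: g(x) = 2*rect(x) - 4*rect(x - 1/2).

    On [0, 1] this is the tent map: both halves of [0, 1] are mapped onto
    [0, 1], so any partition computed on top of g is replicated in each half.
    """
    return 2 * np.maximum(0, x) - 4 * np.maximum(0, x - 0.5)

k = 3
xs = (np.arange(4096) + 0.3) / 4096   # grid offset avoids the breakpoints
f = xs.copy()
for _ in range(k):                    # k-fold composition g ∘ ... ∘ g
    f = g(f)

# Count the linear pieces of the composition via slope sign changes.
s = np.sign(np.diff(f))
pieces = 1 + int(np.count_nonzero(s[1:] != s[:-1]))
print(pieces)  # → 2**k = 8
```

With $k$ hidden layers of 2 units each, the number of pieces grows like $2^k$, while a single layer with the same $2k$ units can only produce $2k + 1$ pieces on the line.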
Now we are ready to state our main result on the number of response regions of rectifier deep feedforward networks.

Theorem 1. A model with $n_0$ inputs and $k$ hidden layers of widths $n_1, n_2, \ldots, n_k$ can divide the input space into
$$\prod_{i=1}^{k-1} \left\lfloor \frac{n_i}{n_0} \right\rfloor \sum_{j=0}^{n_0} \binom{n_k}{j}$$
or possibly more regions.
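The bound of Theorem 1 can be compared with the shallow count of Proposition 2 for a fixed budget of hidden units; a small sketch (function names are ours, the sample sizes arbitrary):

```python
from math import comb, floor

def shallow_regions(n0, n_hidden):
    """Maximal regions of a single hidden layer (Proposition 2)."""
    return sum(comb(n_hidden, j) for j in range(n0 + 1))

def deep_regions_lower(n0, widths):
    """Lower bound of Theorem 1 for hidden layer widths n_1, ..., n_k."""
    prod = 1
    for n in widths[:-1]:
        prod *= floor(n / n0)
    return prod * sum(comb(widths[-1], j) for j in range(n0 + 1))

# Same budget of hidden units: k layers of width n vs one layer of width k*n.
n0, n, k = 2, 8, 4
deep = deep_regions_lower(n0, [n] * k)  # 4^3 * (1 + 8 + 28) = 2368
shallow = shallow_regions(n0, k * n)    # 1 + 32 + 496 = 529
assert deep > shallow
```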
Proof of Theorem 1. Let the first hidden layer define an arrangement like the one from Proposition 4. Then there are $p = \left\lfloor \frac{n_1}{n_0} \right\rfloor$ input-space regions $R_i \subseteq \mathbb{R}^{n_0}$, $i \in [p]$, with the following property: for each input vector from the region $R_i$, exactly $n_0$ units from the first hidden layer are active. We denote this set of units by $I_i$. Furthermore, by Proposition 4, for inputs in distinct regions $R_i$, the corresponding sets of active units are disjoint; that is, $I_i \cap I_j = \emptyset$ for all $i, j \in [p]$, $i \ne j$.

To be more specific, for input vectors from $R_1$, exactly the first $n_0$ units of the first hidden layer are active; that is, for these input vectors the value of $h^{(1)}_j$ is non-zero if and only if $j \in I_1 = \{1, \ldots, n_0\}$. For input vectors from $R_2$, only the next $n_0$ units of the first hidden layer are active, that is, the units with index in $I_2 = \{n_0 + 1, \ldots, 2n_0\}$, and so on.

Now we consider a 'fictitious' intermediary layer consisting of $n_0$ linear units between the first and second hidden layers. As this intermediary layer computes an affine function, it can be absorbed into the second hidden layer (see Lemma 3). We use it only for making the next arguments clearer.
The map taking activations from the first hidden layer to activations from the second hidden layer is $\mathrm{rect}(W^{(2)} x + b^{(2)})$, where $W^{(2)} \in \mathbb{R}^{n_2 \times n_1}$, $b^{(2)} \in \mathbb{R}^{n_2}$.

We can write the input and bias weight matrices as $W^{(2)} = V^{(2)} U^{(2)}$ and $b^{(2)} = d^{(2)} + V^{(2)} c^{(2)}$, where $U^{(2)} \in \mathbb{R}^{n_0 \times n_1}$, $c^{(2)} \in \mathbb{R}^{n_0}$, $V^{(2)} \in \mathbb{R}^{n_2 \times n_0}$, and $d^{(2)} \in \mathbb{R}^{n_2}$.

The weights $U^{(2)}$ and $c^{(2)}$ describe the affine function computed by the intermediary layer, $x \mapsto U^{(2)} x + c^{(2)}$. The weights $V^{(2)}$ and $d^{(2)}$ are the input and bias weights of the rectifier layer following the intermediary layer.
We now consider the sub-matrix U^(2)_i of U^(2) consisting of the columns of U^(2) with indices in I_i, for
all i ∈ [p]. Then U^(2) = [U^(2)_1 | · · · | U^(2)_p | Ũ^(2)], where Ũ^(2) is the sub-matrix of U^(2) consisting
of its last n_1 − p n_0 columns. In the sequel we set all entries of Ũ^(2) equal to zero.

The map g : R^{n_1} → R^{n_0}; g(x) = U^(2) x + c^(2) is thus written as the sum g = Σ_{i ∈ [p]} g_i + c^(2),
where g_i : R^{n_0} → R^{n_0}; g_i(x) = U^(2)_i x, for all i ∈ [p].
Let S_i be the image of the input-space region R_i under the first hidden layer. By Proposition 5, there
is a choice of the weights U^(2)_i and bias c^(2) such that the image of S_i under x ↦ U^(2)_i x + c^(2)
contains the n_0-dimensional unit ball. Now, for all input vectors from R_i, only the units I_i of the
first hidden layer are active. Therefore, g|_{R_i} = g_i|_{R_i} + c^(2). This implies that the image g(R_i) of
the input-space region R_i under the intermediary layer contains the unit ball, for all i ∈ [p].
We can now choose V^(2) and d^(2) in such a way that the rectifier function R^{n_0} → R^{n_2};
y ↦ rect(V^(2) y + d^(2)) defines an arrangement A of n_2 hyperplanes with the property that each region
of A intersects the unit ball at an open neighborhood.
In consequence, the map from the input space to the activations of the second hidden layer has r(A) regions
of linearity within each input-space region R_i; Fig. 6 illustrates the situation. All inputs that are
mapped to the same activations of the first hidden layer are treated as equivalent by the subsequent
layers. In this sense, an arrangement A defined on the set of common outputs of R_1, . . . , R_p at the
first hidden layer is ‘replicated’ in each input region R_1, . . . , R_p.
The subsequent layers of the network can be analyzed in the same way as the first two layers
above. In particular, the weights V^(2) and d^(2) can be chosen in such a way that they define
an arrangement with the properties from Proposition 4. Then the map taking activations of the
second hidden layer to activations of the third hidden layer can be analyzed by considering again
a fictitious intermediary layer, now between the second and third layers, and so forth.
For the last hidden layer we choose the input weights V^(k) and bias d^(k) defining an n_0-dimensional
arrangement of n_k hyperplanes in general position. The map of inputs to activations of the last hidden
layer thus has

    ∏_{i=1}^{k−1} ⌊n_i/n_0⌋ · Σ_{j=0}^{n_0} C(n_k, j)

regions of linearity. This number is a lower bound on the maximal number of regions of linearity of
functions computable by the network. This completes the proof. The intuition behind the construction
is illustrated in Fig. 7.
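As a sanity check, the lower bound of Theorem 1 is straightforward to evaluate numerically. The following sketch (the helper name `theorem1_lower_bound` is ours, not from the paper) computes the product-times-sum expression for a given input dimension and list of hidden layer widths:

```python
from math import comb, prod

def theorem1_lower_bound(n0, widths):
    """Theorem 1 lower bound on the number of linear regions of a
    rectifier MLP with n0 inputs and hidden layer widths n1, ..., nk."""
    *intermediate, nk = widths  # n1, ..., n_{k-1} and the last width nk
    replication = prod(w // n0 for w in intermediate)  # prod_i floor(n_i / n0)
    # Number of regions cut out by nk hyperplanes in general position in R^n0.
    last_layer = sum(comb(nk, j) for j in range(n0 + 1))
    return replication * last_layer

# n0 = 2 with three hidden layers of width 4:
# floor(4/2)^2 * (C(4,0) + C(4,1) + C(4,2)) = 4 * 11 = 44
print(theorem1_lower_bound(2, [4, 4, 4]))  # 44
```

For a single hidden layer the product is empty, so the bound reduces to the shallow count Σ_{j=0}^{n_0} C(n_1, j).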
In Appendix A we derive an asymptotic expansion of the bound given in Theorem 1.
5 A special class of deep models
In this section we consider deep rectifier models with n_0 input units and hidden layers of width
n = 2n_0. This restriction allows us to construct a very efficient deep model in terms of the number
of response regions. The analysis in this section complements the results of the previous section,
showing that rectifier MLPs can compute functions with many response regions even when defined
with relatively few hidden layers.
Example 2. Let us assume we have a 2-dimensional input, i.e., n_0 = 2, and a layer of n = 4
rectifiers f_1, f_2, f_3, and f_4, followed by a linear projection. We construct the rectifier layer in such a
way that it divides the input space into four ‘square’ cones, each of them corresponding to the inputs
Figure 7: Constructing ⌊n_2/n_0⌋ ⌊n_1/n_0⌋ Σ_{j=0}^{n_0} C(n_3, j) response regions in a model with three layers.

where two of the rectifier units are active. We define the four rectifiers as:
    f_1(x) = max{0, [1, 0]^⊤ x},
    f_2(x) = max{0, [−1, 0]^⊤ x},
    f_3(x) = max{0, [0, 1]^⊤ x},
    f_4(x) = max{0, [0, −1]^⊤ x},
where x = [x_1, x_2]^⊤ ∈ R^{n_0}. By adding pairs of coordinates of f = [f_1, f_2, f_3, f_4]^⊤, we can
effectively mimic a layer consisting of two absolute-value units g_1 and g_2:

    [ g_1(x) ]   [ 1 1 0 0 ]   [ f_1(x) ]
    [ g_2(x) ] = [ 0 0 1 1 ] · [ f_2(x) ]  =  [ abs(x_1) ]
                               [ f_3(x) ]     [ abs(x_2) ].    (5)
                               [ f_4(x) ]
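The construction in Eq. (5) can be verified directly. The sketch below (pure Python, with hypothetical helper names) implements the four rectifiers and the pairwise summation, and checks that the result is the coordinatewise absolute value:

```python
def rect(z):
    """Elementwise rectifier max(0, .)."""
    return [max(0.0, v) for v in z]

def matvec(M, x):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

# The four rectifier units f1, ..., f4 of Example 2 (rows are their weights).
W1 = [[1, 0], [-1, 0], [0, 1], [0, -1]]
# The pairwise summation matrix of Eq. (5).
A = [[1, 1, 0, 0], [0, 0, 1, 1]]

def g(x):
    """Computes (|x1|, |x2|) by summing rectifier pairs."""
    return matvec(A, rect(matvec(W1, x)))

print(g([-3.0, 2.0]))  # [3.0, 2.0]
print(g([1.5, -0.5]))  # [1.5, 0.5]
```

For each input only two of the four units are active, as the construction requires.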
The absolute-value unit g_i divides the input space along the i-th coordinate axis, taking values which
are symmetric about that axis. The combination of g_1 and g_2 is then a function with four regions of
linearity:

    S_1 = {(x_1, x_2) | x_1 ≥ 0, x_2 ≥ 0},
    S_2 = {(x_1, x_2) | x_1 ≥ 0, x_2 < 0},
    S_3 = {(x_1, x_2) | x_1 < 0, x_2 ≥ 0},
    S_4 = {(x_1, x_2) | x_1 < 0, x_2 < 0}.

Since the values of g_i are symmetric about the i-th coordinate axis, each point x ∈ S_i has a
corresponding point y ∈ S_j with g(x) = g(y), for all i and j.
We can apply the same procedure to the image of [g1, g2]to recursively divide the input space,
as illustrated in Fig. 8. For instance, if we apply this procedure one more time, we get four regions
Figure 8: Illustration of Example 2. (a) A rectifier layer with two pairs of units, where each pair
computes the absolute value of one of the two input coordinates; each input quadrant is mapped to the
positive quadrant. (b) Depiction of a two-layer model in which both layers simulate the absolute value
of their input coordinates.
within each S_i, resulting in 16 regions in total within the input space. On the last layer, we may place
rectifiers in any way suitable for the task of interest (e.g., classification). The partition computed by
the last layer is then copied into each of the input-space regions that produce the same input for the
last layer. Fig. 9 shows a function that can be implemented efficiently by a deep model using the
previous observations.
Figure 9: (a) Illustration of the partition computed by 8 rectifier units on the outputs (x_1, x_2) of the
preceding layer; the color is a heat map of x_1 − x_2. (b) Heat map of a function computed by a
rectifier network with 2 inputs, 2 hidden layers of width 4, and one linear output unit; the black
lines delimit the regions of linearity of the function. (c) Heat map of a function computed by a
4-layer model with a total of 24 hidden units. It takes at least 137 hidden units in a shallow model to
represent the same function.
The foregoing discussion is easily generalized to n_0 > 2 input variables and k hidden layers,
each consisting of 2n_0 rectifiers. In that case, the maximal number of linear regions of functions
computable by the network is lower-bounded as follows.
Theorem 2. The maximal number of regions of linearity of functions computable by a rectifier
neural network with n_0 input variables and k hidden layers of width 2n_0 is at least

    2^{(k−1) n_0} · Σ_{j=0}^{n_0} C(2n_0, j).
Proof. We prove this constructively. We define the rectifier units in each hidden layer in pairs, with
the sum of each pair giving the absolute value of one coordinate; we also interpret the sums of such
pairs as the input coordinates of the subsequent hidden layer. The rectifiers in the first hidden
layer are defined in pairs such that the sum of each pair is the absolute value of one of the input
dimensions, with bias equal to (−1/2, . . . , −1/2). In the next hidden layers, the rectifiers are defined
in a similar way, with the difference that each pair computes the absolute value of the sum of two
of their inputs. The last hidden layer is defined in such a way that it computes a piecewise linear
function with the maximal number of pieces, all of them intersecting the unit cube in R^{n_0}. The
maximal number of regions of linearity of m rectifier units with n_0-dimensional input is
Σ_{j=0}^{n_0} C(m, j). This partition is multiplied 2^{n_0} times by each of the k − 1 preceding layers,
which gives the stated bound.
The theorem shows that even for a small number of layers k, a deep model can have many more linear
regions than a shallow one. For example, if we set the input dimensionality to n_0 = 2,
a shallow model with 4n_0 units will have at most 37 linear regions, while the equivalent deep model
with two layers of 2n_0 units can produce 44 linear regions. For 6n_0 hidden units the shallow model
computes at most 79 regions, while the equivalent three-layer model can compute 176 regions.
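These counts follow directly from the formulas above; a small script (helper names are ours) reproduces them:

```python
from math import comb

def shallow_max_regions(n0, m):
    """Maximal number of linear regions of m rectifier units on n0 inputs
    (number of regions of m hyperplanes in general position in R^n0)."""
    return sum(comb(m, j) for j in range(n0 + 1))

def deep_lower_bound(n0, k):
    """Theorem 2 lower bound for k hidden layers of width 2 * n0."""
    return 2 ** ((k - 1) * n0) * shallow_max_regions(n0, 2 * n0)

n0 = 2
print(shallow_max_regions(n0, 4 * n0), deep_lower_bound(n0, 2))  # 37 44
print(shallow_max_regions(n0, 6 * n0), deep_lower_bound(n0, 3))  # 79 176
```

Both comparisons use the same total number of hidden units (8 and 12 respectively) for the shallow and the deep model.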
6 Discussion and conclusions
In this paper we introduced a novel way of understanding the expressiveness of neural networks
with piecewise linear activations: we count the number of regions of linearity, also called response
regions, of the functions that they can represent. The number of response regions tells us how well
these models can approximate arbitrary curved shapes, and computational geometry provides us with
the tools to make such statements.
We found that deep and narrow rectifier MLPs can generate many more regions of linearity than
their shallow counterparts with the same number of computational units or of parameters. We can
express this in terms of the ratio between the maximal number of response regions and the number
of parameters of both model classes. For a deep model with n_0 = O(1) inputs and k hidden layers
of width n, the maximal number of response regions per parameter behaves as

    Ω( (n/n_0)^{k−1} · n^{n_0−2} / k ).
For a shallow model with n_0 = O(1) inputs, the maximal number of response regions per parameter
behaves as

    O( k^{n_0−1} · n^{n_0−1} ).
We see that the deep model can generate many more response regions per parameter than the shallow
model: exponentially more regions per parameter in terms of the number of hidden layers k, and at
least order (k − 2) polynomially more regions per parameter in terms of the layer width n. In
particular, there are deep models which use fewer parameters to produce more linear regions than
their shallow counterparts. Details about the asymptotic expansions are given in Appendix A.
In this paper we only considered linear output units, but this is not a restriction, as the output
activation itself is not parametrized. If there is a target function f_targ that we want to model with a
rectifier MLP with σ as its output activation function, then there exists a function f′_targ such that
σ(f′_targ) = f_targ. When σ has an inverse (e.g., the sigmoid), f′_targ = σ^{−1}(f_targ). For activations
that do not have an inverse, like the softmax, there are infinitely many functions f′_targ that work, and
we just need to pick one; e.g., for the softmax we can pick log(f_targ). By analyzing how well we can
model f′_targ with a linear-output rectifier MLP, we get an indirect measure of how well we can model
f_targ with an MLP that has σ as its output activation.
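For a concrete instance of this argument, the sketch below (assuming a sigmoid output activation; helper names are ours) checks that composing the output nonlinearity with its inverse recovers the target value, so a linear-output model of f′_targ = σ^{−1}(f_targ) suffices:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid on (0, 1)."""
    return math.log(p / (1.0 - p))

# If the desired output is f_targ(x) = 0.73 at some input x, a linear-output
# network only needs to produce f'_targ(x) = logit(0.73) there.
p_target = 0.73
assert abs(sigmoid(logit(p_target)) - p_target) < 1e-12
```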
Another interesting observation is that we recover a high ratio of n to n_0 if the data lives near a low-
dimensional manifold (which effectively reduces the input size n_0). One-layer models can reach the
upper bound of response regions only by spanning all the dimensions of the input; in other words,
shallow models are not capable of concentrating linear response regions in any lower-dimensional
subspace of the input. If, as commonly assumed, data lives near a low-dimensional manifold, then
we care only about the number of response regions that a model can generate in the directions of the
data manifold. One way of thinking about this is principal component analysis (PCA), where one
finds that only a few input-space directions (say, on the MNIST database) are relevant to the underlying
data. In such a situation, one cares about the number of response regions that a model can generate
only within the directions in which the data does change. In such situations n ≫ n_0, and our results
show a clear advantage of using deep models.
We believe that the proposed framework can be used to answer many other interesting questions
about these models. For example, one can look at how the number of response regions is affected
by different constraints of the model, like shared weights. We think that this approach can also be
used to study other kinds of piecewise linear models, such as convolutional networks with rectifier
units or maxout networks, and for comparing different piecewise linear models with each other.
A Asymptotic expansions
Here we derive asymptotic expressions for the formulas contained in Proposition 2 and Theorem 1.
We use the following standard notation:
• f(n) = O(g(n)) means that there is a positive constant c_2 such that f(n) ≤ c_2 g(n) for all
n larger than some N.
• f(n) = Θ(g(n)) means that there are two positive constants c_1 and c_2 such that
c_1 g(n) ≤ f(n) ≤ c_2 g(n) for all n larger than some N.
• f(n) = Ω(g(n)) means that there is a positive constant c_1 such that f(n) ≥ c_1 g(n) for all
n larger than some N.
Proposition 6.
• Consider a single-layer rectifier MLP with kn units and n_0 inputs. Then the maximal
number of regions of linearity of the functions represented by this network is

    R(n_0, kn, 1) = Σ_{s=0}^{n_0} C(kn, s),

and

    R(n_0, kn, 1) = O(k^{n_0} n^{n_0}),  when n_0 = O(1).

• Consider a k-layer rectifier MLP with hidden layers of width n and n_0 inputs. Then the
maximal number of regions of linearity of the functions represented by this network satisfies

    R(n_0, n, . . . , n, 1) ≥ ( ∏_{i=1}^{k−1} ⌊n/n_0⌋ ) · Σ_{s=0}^{n_0} C(n, s),

and

    R(n_0, n, . . . , n, 1) = Ω( (n/n_0)^{k−1} n^{n_0} ),  when n_0 = O(1).
Proof. Here only the asymptotic expressions remain to be shown. It is known that

    Σ_{s=0}^{n_0} C(m, s) = Θ( (1 − 2n_0/m)^{−1} · C(m, n_0) ),  when n_0 ≤ m/2 − √m.    (6)

Furthermore, it is known that

    C(m, s) = (m^s / s!) · (1 + O(1/m)),  when s = O(1).    (7)

When n_0 is constant, n_0 = O(1), we have that

    C(kn, n_0) = (k^{n_0} / n_0!) · n^{n_0} · (1 + O(1/(kn))).

In this case, it follows that

    Σ_{s=0}^{n_0} C(kn, s) = Θ( (1 − 2n_0/(kn))^{−1} · C(kn, n_0) ) = Θ(k^{n_0} n^{n_0}),

and also

    Σ_{s=0}^{n_0} C(n, s) = Θ(n^{n_0}).
Furthermore,

    ( ∏_{i=1}^{k−1} ⌊n/n_0⌋ ) · Σ_{s=0}^{n_0} C(n, s) = Θ( (n/n_0)^{k−1} n^{n_0} ).
We now analyze the number of response regions as a function of the number of parameters. When
k and n_0 are fixed, then ⌊n/n_0⌋^{k−1} grows polynomially in n, and k^{n_0} is constant. On the other
hand, when n is fixed with n > 2n_0, then ⌊n/n_0⌋^{k−1} grows exponentially in k, while k^{n_0} grows only
polynomially in k.
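A quick numerical sketch (with n_0 = 2; helper names are ours) illustrates the second regime, where the deep factor eventually dominates as k grows:

```python
n0 = 2  # fixed input dimension

def deep_factor(n, k):
    """floor(n / n0)^(k - 1): exponential in k when n >= 2 * n0."""
    return (n // n0) ** (k - 1)

def shallow_factor(k):
    """k^n0: only polynomial in k."""
    return k ** n0

# With n = 2 * n0 fixed, the deep factor overtakes the shallow one as k grows.
for k in (2, 5, 10):
    print(k, deep_factor(2 * n0, k), shallow_factor(k))
# k = 10: 2^9 = 512 versus 10^2 = 100
```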
Proposition 7. The number of parameters of a deep model with n_0 = O(1) inputs, n_out = O(1)
outputs, and k hidden layers of width n is

    (k − 1) n^2 + (k + n_0 + n_out) n + n_out = O(k n^2).

The number of parameters of a shallow model with n_0 = O(1) inputs, n_out = O(1) outputs, and
kn hidden units is

    (n_0 + n_out) kn + kn + n_out = O(kn).
Proof. For the deep model, each layer except the first and the last has an input weight matrix with n^2
entries and a bias vector of length n. This gives a total of (k − 1) n^2 + (k − 1) n parameters. The first
layer has n n_0 input weights and n biases. The output layer has an n_out × n input weight matrix and
n_out biases. Summing these together we get

    (k − 1) n^2 + n(k + n_0 + n_out) + n_out = O(k n^2).

For the shallow model, the hidden layer has kn n_0 input weights and kn biases. The output layer has
kn n_out input weights and n_out biases. Summing these together we get

    kn(n_0 + n_out) + kn + n_out = O(kn).
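The closed form for the deep model can be checked against a direct layer-by-layer count; the sketch below (function names are ours) does this for a couple of hypothetical architectures:

```python
def deep_params(n0, n, k, nout):
    """Layer-by-layer parameter count of a deep rectifier MLP with
    k hidden layers of width n, n0 inputs, and nout linear outputs."""
    total = n0 * n + n               # first hidden layer: weights + biases
    total += (k - 1) * (n * n + n)   # remaining k - 1 hidden layers
    total += n * nout + nout         # linear output layer
    return total

def deep_params_formula(n0, n, k, nout):
    """Closed form from Proposition 7."""
    return (k - 1) * n * n + n * (k + n0 + nout) + nout

for (n0, n, k, nout) in [(2, 4, 3, 1), (3, 7, 5, 2)]:
    assert deep_params(n0, n, k, nout) == deep_params_formula(n0, n, k, nout)
print("parameter counts agree")
```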
The number of linear regions per parameter can be given as follows.
Proposition 8. Consider a fixed number of inputs n_0 and a fixed number of outputs n_out. The
maximal ratio of the number of response regions to the number of parameters of a deep model with
k layers of width n is

    Ω( (n/n_0)^{k−1} · n^{n_0−2} / k ).

In the case of a shallow model with kn hidden units, the ratio is

    O( k^{n_0−1} · n^{n_0−1} ).
Proof. This follows by combining Proposition 6 and Proposition 7.
We see that, for a fixed number of parameters, deep models can compute functions with many more
regions of linearity than those computable by shallow models. The ratio is exponential in the number
of hidden layers k and thus in the number of hidden units.
Acknowledgments
We would like to thank KyungHyun Cho, Çağlar Gülçehre, and the anonymous ICLR reviewers for their comments.
Razvan Pascanu is supported by a DeepMind Fellowship.
References
Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127,
2009.
Y. Bengio and O. Delalleau. On the expressive power of deep architectures. In J. Kivinen, C. Szepesvári,
E. Ukkonen, and T. Zeugmann, editors, Algorithmic Learning Theory, volume 6925 of Lecture Notes in
Computer Science, pages 18–36. Springer Berlin Heidelberg, 2011.
X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011.
I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML’2013,
2013.
A. Hajnal, W. Maass, P. Pudlák, M. Szegedy, and G. Turán. Threshold circuits of bounded depth. Journal of
Computer and System Sciences, 46(2):129–154, 1993.
J. Håstad. Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th Annual ACM
Symposium on Theory of Computing, pages 6–20, Berkeley, California, 1986. ACM Press.
J. Håstad and M. Goldmann. On the power of small-depth threshold circuits. Computational Complexity, 1:
113–129, 1991.
G. Hinton, L. Deng, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and
B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing
Magazine, 29(6):82–97, Nov. 2012a.
G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by
preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580, 2012b.
N. Le Roux and Y. Bengio. Deep belief networks are compact universal approximators. Neural Computation,
22(8):2192–2207, 2010.
H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised
learning of hierarchical representations. In ICML, Montreal (QC), Canada, 2009.
J. Martens, A. Chattopadhya, T. Pitassi, and R. Zemel. On the expressive power of restricted Boltzmann
machines. In Advances in Neural Information Processing Systems 26, pages 2877–2885, 2013.
G. Montúfar and N. Ay. Refinements of universal approximation results for deep belief networks and restricted
Boltzmann machines. Neural Computation, 23(5):1306–1319, 2011.
G. Montúfar and J. Morton. When does a mixture of products contain a product of mixtures? arXiv preprint
arXiv:1206.0387, 2012.
G. Montúfar, J. Rauh, and N. Ay. Expressive power and approximation errors of restricted Boltzmann machines.
Advances in Neural Information Processing Systems, 24:415–423, 2011.
V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814, 2010.
H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Computer Vision Workshops
(ICCV Workshops), 2011 IEEE International Conference on, pages 689–690, 2011.
R. Stanley. An introduction to hyperplane arrangements. In Lect. notes, IAS/Park City Math. Inst., 2004.
I. Sutskever and G. E. Hinton. Deep, narrow sigmoid belief networks are universal approximators. Neural
Computation, 20(11):2629–2636, 2008.
T. Zaslavsky. Facing Up to Arrangements: Face-Count Formulas for Partitions of Space by Hyperplanes.
Number no. 154 in Memoirs of the American Mathematical Society. American Mathematical Society, 1975.
M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. Technical report,
arXiv:1311.2901, 2013.