On the number of response regions of deep
feedforward networks with piecewise linear
activations
Razvan Pascanu
Université de Montréal
Montréal QC H3C 3J7, Canada
r.pascanu@gmail.com

Guido Montúfar
Max Planck Institute for Mathematics in the Sciences
Inselstraße 22, 04103 Leipzig, Germany
montufar@mis.mpg.de

Yoshua Bengio
Université de Montréal
Montréal QC H3C 3J7, Canada
yoshua.bengio@umontreal.ca
Abstract
This paper explores the complexity of deep feedforward networks with linear pre-
synaptic couplings and rectified linear activations. This is a contribution to the
growing body of work contrasting the representational power of deep and shallow
network architectures. In particular, we offer a framework for comparing deep
and shallow models that belong to the family of piecewise linear functions based
on computational geometry. We look at a deep rectifier multi-layer perceptron
(MLP) with linear output units and compare it with a single layer version of the
model. In the asymptotic regime, when the number of inputs stays constant, if the shallow model has $kn$ hidden units and $n_0$ inputs, then the number of linear regions is $O(k^{n_0} n^{n_0})$. For a $k$-layer model with $n$ hidden units on each layer it is $\Omega(\lfloor n/n_0 \rfloor^{k-1} n^{n_0})$. The number $\lfloor n/n_0 \rfloor^{k-1}$ grows faster than $k^{n_0}$ when $n$ tends to infinity, or when $k$ tends to infinity and $n \ge 2n_0$. Additionally, even when $k$ is small, if we restrict $n$ to be $2n_0$, we can show that a deep model has considerably more linear regions than a shallow one. We consider this as a first step towards understanding the complexity of these models and, specifically, towards providing suitable mathematical tools for future analysis.
Keywords: Deep learning, artificial neural network, rectifier unit, hyperplane ar-
rangement, representational power
1 Introduction
Deep systems are believed to play an important role in information processing of intelligent agents.
A common hypothesis underlying this belief is that deep models can be exponentially more efficient
at representing some functions than their shallow counterparts (see Bengio, 2009).
The argument is usually a compositional one. Higher layers in a deep model can re-use primitives
constructed by the lower layers in order to build gradually more complex functions. For example,
on a vision task, one would hope that the first layer learns Gabor filters capable of detecting edges of
different orientations. These edges are then put together at the second layer to form part-of-object
shapes. On higher layers, these part-of-object shapes are combined further to obtain detectors for
more complex part-of-object shapes or objects. Such a behaviour is empirically illustrated, for
instance, in Zeiler and Fergus (2013); Lee et al. (2009). On the other hand, a shallow model has to
construct detectors of target objects based only on the detectors learnt by the first layer.
The representational power of computational systems with shallow and deep architectures has been
studied intensively. A well-known result by Hajnal et al. (1993) derived lower complexity bounds for shallow threshold networks. Other works have explored the representational power of generative models based on Boltzmann machines (Montúfar et al., 2011; Martens et al., 2013) and deep belief networks (Sutskever and Hinton, 2008; Le Roux and Bengio, 2010; Montúfar and Ay, 2011), or have compared mixtures and products of experts models (Montúfar and Morton, 2012).
In addition to such inspections, a wealth of evidence for the validity of this hypothesis comes from
deep models consistently outperforming shallow ones on a variety of tasks and datasets (see, e.g.,
Goodfellow et al., 2013; Hinton et al., 2012b,a). However, theoretical results on the representational
power of deep models are limited, usually due to the composition of nonlinear functions in deep
models, which makes mathematical analysis difficult. Up to now, theoretical results have focussed on circuit operations (neural net unit computations) that are substantially different from those being used in real state-of-the-art deep learning applications, such as logic gates (Håstad, 1986), linear + threshold units with non-negative weights (Håstad and Goldmann, 1991), or polynomials (Bengio and Delalleau, 2011). Bengio and Delalleau (2011) show that deep sum-product networks (Poon and Domingos, 2011) can use exponentially fewer nodes to express some families of polynomials compared to shallow ones.
The present note analyzes the representational power of deep MLPs with rectifier units. Rectifier
units (Glorot et al., 2011; Nair and Hinton, 2010) and piecewise linearly activated units in general
(like the maxout unit (Goodfellow et al., 2013)), are becoming popular choices in designing deep
models, and most current state-of-the-art results involve one of these activations (Goodfellow
et al., 2013; Hinton et al., 2012b). Glorot et al. (2011) show that rectifier units have several properties
that make the optimization problem easier than the more traditional case using smooth and bounded
activations, such as tanh or sigmoid.
In this work we take advantage of the piecewise linear nature of the rectifier unit to mathematically
analyze the behaviour of deep rectifier MLPs. Given that the model is a composition of piecewise
linear functions, it is itself a piecewise linear function. We compare the flexibility of a deep model
with that of a shallow model by counting the number of linear regions they define over the input
space for a fixed number of hidden units. This is the number of pieces available to the model
in order to approximate some arbitrary nonlinear function. For example, if we want to perfectly
approximate some curved boundary between two classes, a rectifier MLP will have to use infinitely
many linear regions. In practice we have a finite number of pieces, and if we assume that we can
perfectly learn their optimal slopes, then the number of linear regions becomes a good proxy for
how well the model approximates this boundary. In this sense, the number of linear regions is an
upper bound for the flexibility of the model. In practice, the linear pieces are not independent and
the model may not be able to learn the right slope for each linear region. Specifically, for deep
models there is a correlation between regions, which results from the sharing of parameters between
the functions that describe the output on each region.
This is by no means a negative observation. If all the linear regions of the deep model were inde-
pendent of each other, by having many more linear regions, deep models would grossly overfit. The
correlation of the linear regions of a deep model results in its ability to generalize, by allowing it to
better represent only a small family of structured functions. These are functions that look compli-
cated (e.g., a distribution with a huge number of modes) but that have an underlying structure that
the network can ‘compress’ into its parameters. The number of regions, which indicates the number
of variations that the network can represent, provides a measure of how well it can fit this family of
structured functions (whose approximation potentially needs infinitely many linear regions).
We believe that this approach, based on counting the number of linear regions, is extensible to
any other piecewise linear activation function and also to other architectures, including the maxout
activation and convolutional networks with rectifier activations.
We know the maximal number of regions of linearity of functions computable by a shallow model
with a fixed number of hidden units. This number is given by a well studied geometrical problem.
The main insight of the present work is to provide a geometrical construction that describes the
regions of linearity of functions computed by deep models. We show that in the asymptotic regime,
these functions have many more linear regions than the ones computed by shallow models, for the
same number of hidden units.
For the single layer case, each hidden unit divides the input space in two, whereby the boundary is
given by a hyperplane. For all input values on one side of the hyperplane, the unit outputs a positive
value. For all input values on the other side of the hyperplane, the unit outputs 0. Therefore, the question that we are asking is: into how many regions do $n$ hyperplanes split the space? This question is studied in geometry under the name of hyperplane arrangements, with classic results such as Zaslavsky's theorem. Section 3 provides a quick introduction to the subject.
For the multilayer version of the model we rely on the following intuition. By using the rectifier nonlinearity, we identify multiple regions of the input space which are mapped by a given layer into an equivalent set of activations and thus represent equivalent inputs for the next layers. That is, a hidden layer can perform a kind of OR operation by reacting similarly to several different inputs. Any subsequent computation made on these activations is replicated on all equivalent inputs.
This paper is organized as follows. In Section 2 we provide definitions and basic observations about
piecewise linear functions. In Section 3 we discuss rectifier networks with one single hidden layer
and describe their properties in terms of hyperplane arrangements which are fairly well known in
the literature. In Section 4 we discuss deep rectifier networks and prove our main result, Theorem 1,
which describes their complexity in terms of the number of regions of linearity of functions that they
represent. Details about the asymptotic behaviour of the results derived in Sections 3 and 4 are given
in the Appendix A. In Section 5 we analyze a special type of deep rectifier MLP and show that even
for a small number of hidden layers it can generate a large number of linear regions. In Section 6
we offer a discussion of the results.
2 Preliminaries
We consider classes of functions (models) defined in the following way.
Definition 1. A rectifier feedforward network is a layered feedforward network, or multilayer perceptron (MLP), as shown in Fig. 1, with the following properties. Each hidden unit receives as inputs the real-valued activations $x_1, \ldots, x_n$ of all units in the previous layer, computes the weighted sum
$$s = \sum_{i \in [n]} w_i x_i + b,$$
and outputs the rectified value
$$\mathrm{rect}(s) = \max\{0, s\}.$$
The real parameters $w_1, \ldots, w_n$ are the input weights and $b$ is the bias of the unit. The output layer is a linear layer; that is, the units in the last layer compute a linear combination of their inputs and output it unrectified.

Given a vector of naturals $\mathbf{n} = (n_0, n_1, \ldots, n_L)$, we denote by $\mathcal{F}_{\mathbf{n}}$ the set of all functions $\mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$ that can be computed by a rectifier feedforward network with $n_0$ inputs and $n_l$ units in layer $l$, for $l \in [L]$. The elements of $\mathcal{F}_{\mathbf{n}}$ are continuous piecewise linear functions.
We denote by $R(\mathbf{n})$ the maximum of the number of regions of linearity, or response regions, over all functions from $\mathcal{F}_{\mathbf{n}}$. For clarity, given a function $f \colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$, a connected open subset $R \subseteq \mathbb{R}^{n_0}$ is called a region of linearity, linear region, or response region of $f$ if the restriction $f|_R$ is a linear function and for any open set $\tilde{R} \supsetneq R$ the restriction $f|_{\tilde{R}}$ is not a linear function. In the next sections we will compute bounds on $R(\mathbf{n})$ for different choices of $\mathbf{n}$. We are especially interested in the comparison of shallow networks with one single very wide hidden layer and deep networks with many narrow hidden layers.
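As a concrete illustration of Definition 1, the following sketch evaluates a rectifier feedforward network with rectified hidden layers and a linear output layer. It is written in plain numpy; the function names are ours, not from any particular library.

```python
import numpy as np

def rect(s):
    """Rectifier activation: rect(s) = max{0, s}, applied elementwise."""
    return np.maximum(0.0, s)

def rectifier_network(x, weights, biases, W_out, b_out):
    """Forward pass of a rectifier feedforward network (Definition 1).

    weights[l], biases[l] parameterize hidden layer l; the output
    layer is linear (no rectification), as in the definition.
    """
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        h = rect(W @ h + b)          # hidden layers: rectified affine maps
    return W_out @ h + b_out         # linear output layer

# Example with n = (1, 2, 1): one input, two hidden units, one output.
# This network computes rect(x) - 2*rect(x - 1), a continuous piecewise
# linear function with three regions of linearity: 0, then x, then 2 - x.
W1 = np.array([[1.0], [1.0]]); b1 = np.array([0.0, -1.0])
Wout = np.array([[1.0, -2.0]]); bout = np.array([0.0])
y = rectifier_network(np.array([2.0]), [W1], [b1], Wout, bout)  # → [0.0]
```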
In the remainder of this section we state three simple lemmas.
The next lemma states that a piecewise linear function $f = (f_i)_{i \in [k]}$ has as many regions of linearity as there are distinct intersections of regions of linearity of the coordinates $f_i$.
Figure 1: Illustration of a rectifier feedforward network with two hidden layers. From the input $x$, the first hidden layer computes $h^{(1)} = \mathrm{rect}(W^{(1)} x + b^{(1)})$, the second hidden layer computes $h^{(2)} = \mathrm{rect}(W^{(2)} h^{(1)} + b^{(2)})$, and the output is $W^{(\mathrm{out})} h^{(2)}$.

Lemma 1. Consider a width-$k$ layer of rectifier units. Let $\mathcal{R}^i = \{R^i_1, \ldots, R^i_{N_i}\}$ be the regions of linearity of the function $f_i \colon \mathbb{R}^{n_0} \to \mathbb{R}$ computed by the $i$-th unit, for all $i \in [k]$. Then the regions of linearity of the function $f = (f_i)_{i \in [k]} \colon \mathbb{R}^{n_0} \to \mathbb{R}^k$ computed by the rectifier layer are the elements of the set $\{R_{j_1, \ldots, j_k} = R^1_{j_1} \cap \cdots \cap R^k_{j_k}\}_{(j_1, \ldots, j_k) \in [N_1] \times \cdots \times [N_k]}$.
Proof. A function $f = (f_1, \ldots, f_k) \colon \mathbb{R}^n \to \mathbb{R}^k$ is linear iff all its coordinates $f_1, \ldots, f_k$ are.
In regard to the number of regions of linearity of the functions represented by rectifier networks,
the number of output dimensions, i.e., the number of linear output units, is irrelevant. This is the
statement of the next lemma.
Lemma 2. The number of (linear) output units of a rectifier feedforward network does not affect the
maximal number of regions of linearity that it can realize.
Proof. Let $f \colon \mathbb{R}^{n_0} \to \mathbb{R}^k$ be the map of inputs to activations in the last hidden layer of a deep feedforward rectifier model. Let $h = g \circ f$ be the map of inputs to activations of the output units, given by composition of $f$ with the linear output layer, $h(x) = W^{(\mathrm{out})} f(x) + b^{(\mathrm{out})}$. If the row span of $W^{(\mathrm{out})}$ is not orthogonal to any difference of gradients of neighbouring regions of linearity of $f$, then $g$ captures all discontinuities of $\nabla f$. In this case both functions $f$ and $h$ have the same number of regions of linearity.

If the number of regions of $f$ is finite, then the number of differences of gradients is finite and there is a vector outside the union of their orthogonal spaces. Hence a matrix with a single row (a single output unit) suffices to capture all transitions between different regions of linearity of $f$.
Lemma 3. A layer of $n$ rectifier units with $n_0$ inputs can compute any function that can be computed by the composition of a linear layer with $n_0$ inputs and $n'_0$ outputs and a rectifier layer with $n'_0$ inputs and $n_1$ outputs, for any $n_0, n'_0, n_1 \in \mathbb{N}$.

Proof. A rectifier layer computes functions of the form $x \mapsto \mathrm{rect}(Wx + b)$, with $W \in \mathbb{R}^{n_1 \times n_0}$ and $b \in \mathbb{R}^{n_1}$. The argument $Wx + b$ is an affine function of $x$. The claim follows from the fact that any composition of affine functions is an affine function.
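Lemma 3 can be checked numerically: a linear layer followed by a rectifier layer computes the same function as a single rectifier layer whose parameters are the composed affine map. A minimal sketch with arbitrary seeded weights (notation is ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n0p, n1 = 3, 4, 5          # n0 inputs, n0' intermediate units, n1 outputs

# A linear layer (A, c) followed by a rectifier layer (V, d) ...
A = rng.standard_normal((n0p, n0)); c = rng.standard_normal(n0p)
V = rng.standard_normal((n1, n0p)); d = rng.standard_normal(n1)

# ... equals one rectifier layer with composed affine parameters,
# since a composition of affine functions is affine.
W = V @ A
b = V @ c + d

x = rng.standard_normal(n0)
two_step = np.maximum(0.0, V @ (A @ x + c) + d)
one_step = np.maximum(0.0, W @ x + b)
assert np.allclose(two_step, one_step)
```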
3 One hidden layer
Let us look at the number of response regions of a single hidden layer MLP with $n_0$ input units and $n$ hidden units. We first formulate the rectifier unit as follows:
$$\mathrm{rect}(s) = I(s) \cdot s, \qquad (1)$$
where $I$ is the indicator function defined as
$$I(s) = \begin{cases} 1, & \text{if } s > 0 \\ 0, & \text{otherwise.} \end{cases} \qquad (2)$$
We can now write the single hidden layer MLP with $n_y$ outputs as the function $f \colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_y}$;
$$f(x) = W^{(\mathrm{out})} \, \mathrm{diag}\!\left( I\big(W^{(1)}_{1,:} x + b^{(1)}_1\big), \ldots, I\big(W^{(1)}_{n_1,:} x + b^{(1)}_{n_1}\big) \right) \big( W^{(1)} x + b^{(1)} \big) + b^{(\mathrm{out})}. \qquad (3)$$
From this formulation it is clear that each unit $i$ in the hidden layer has two operational modes: one where the unit takes the value $0$, and one where it takes a non-zero value. The boundary between these two operational modes is given by the hyperplane $H_i$ consisting of all inputs $x \in \mathbb{R}^{n_0}$ with $W^{(1)}_{i,:} x + b^{(1)}_i = 0$. Below this hyperplane, the activation of the unit is constant equal to zero, and above it, the activation is linear with gradient equal to $W^{(1)}_{i,:}$. It follows that the number of regions of linearity of a single layer MLP is equal to the number of regions formed by the set of hyperplanes $\{H_i\}_{i \in [n_1]}$.
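This correspondence is easy to probe numerically. The sketch below (our own construction, not taken from the paper) places three explicit lines in general position in $\mathbb{R}^2$, samples the input space on a grid, and counts the distinct activation patterns of the corresponding rectifier units. Grid sampling can only exhibit regions, but here it finds all $\binom{3}{2} + 3 + 1 = 7$ of them:

```python
import numpy as np

# Three hyperplanes (lines) in R^2 in general position:
# x = 0, y = 0, and x + y = 1.
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, -1.0])

# Sample the input space on a grid and record which units are active
# at each point; each distinct pattern witnesses one arrangement region.
xs = np.linspace(-2.0, 2.0, 201)
points = np.array([[x, y] for x in xs for y in xs])
patterns = {tuple(row) for row in (points @ W.T + b > 0)}

print(len(patterns))  # → 7
```

The one sign pattern out of $2^3 = 8$ that never appears is $x < 0$, $y < 0$, $x + y > 1$, which is infeasible; the remaining 7 match the count for 3 lines in general position.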
A finite set of hyperplanes in a common $n_0$-dimensional Euclidean space is called an $n_0$-dimensional hyperplane arrangement. A region of an arrangement $\mathcal{A} = \{H_i \subset \mathbb{R}^{n_0}\}_{i \in [n]}$ is a connected component of the complement of the union of the hyperplanes, i.e., a connected component of $\mathbb{R}^{n_0} \setminus (\cup_{i \in [n]} H_i)$. To make this clearer, consider an arrangement $\mathcal{A}$ consisting of hyperplanes $H_i = \{x \in \mathbb{R}^{n_0} \colon W_{i,:} x + b_i = 0\}$ for all $i \in [n]$, for some $W \in \mathbb{R}^{n \times n_0}$ and some $b \in \mathbb{R}^n$. A region of $\mathcal{A}$ is a set of points of the form $R = \{x \in \mathbb{R}^{n_0} \colon \mathrm{sgn}(Wx + b) = s\}$ for some sign vector $s \in \{-, +\}^n$.
A region of an arrangement is relatively bounded if its intersection with the space spanned by the normals of the hyperplanes is bounded. We denote by $r(\mathcal{A})$ the number of regions and by $b(\mathcal{A})$ the number of relatively bounded regions of an arrangement $\mathcal{A}$. The essentialization of an arrangement $\mathcal{A} = \{H_i\}_i$ is the arrangement consisting of the hyperplanes $H_i \cap N$ for all $i$, defined in the span $N$ of the normals of the hyperplanes $H_i$. For example, the essentialization of an arrangement of two non-parallel planes in $\mathbb{R}^3$ is an arrangement of two lines in a plane.
Problem 1. How many regions are generated by an arrangement of $n$ hyperplanes in $\mathbb{R}^{n_0}$?
The general answer to Problem 1 is given by Zaslavsky’s theorem (Zaslavsky, 1975, Theorem A),
which is one of the central results from the theory of hyperplane arrangements.
We will only need the special case of hyperplanes in general position, which realize the maximal possible number of regions. Formally, an $n$-dimensional arrangement $\mathcal{A}$ is in general position if for any subset $\{H_1, \ldots, H_p\} \subseteq \mathcal{A}$ the following holds: (1) if $p \le n$, then $\dim(H_1 \cap \cdots \cap H_p) = n - p$; (2) if $p > n$, then $H_1 \cap \cdots \cap H_p = \emptyset$. An arrangement is in general position if the weights $W$, $b$ defining its hyperplanes are generic. This means that any arrangement can be perturbed by an arbitrarily small perturbation in such a way that the resulting arrangement is in general position.

For arrangements in general position, Zaslavsky's theorem can be stated in the following way (see Stanley, 2004, Proposition 2.4).
Proposition 1. Let $\mathcal{A}$ be an arrangement of $m$ hyperplanes in general position in $\mathbb{R}^{n_0}$. Then
$$r(\mathcal{A}) = \sum_{s=0}^{n_0} \binom{m}{s}, \qquad b(\mathcal{A}) = \binom{m-1}{n_0}.$$
In particular, the number of regions of a 2-dimensional arrangement $\mathcal{A}_m$ of $m$ lines in general position is equal to
$$r(\mathcal{A}_m) = \binom{m}{2} + m + 1. \qquad (4)$$
Figure 2: Induction step of the hyperplane sweep method for counting the regions of line arrangements in the plane.
For the purpose of illustration, we sketch a proof of eq. (4) using the sweep hyperplane method. We proceed by induction over the number of lines $m$.

Base case $m = 0$. It is obvious that in this case there is a single region, corresponding to the entire plane. Therefore, $r(\mathcal{A}_0) = 1$.

Induction step. Assume that for $m$ lines the number of regions is $r(\mathcal{A}_m) = \binom{m}{2} + m + 1$, and add a new line $L_{m+1}$ to the arrangement. Since we assumed the lines are in general position, $L_{m+1}$ intersects each of the existing lines $L_k$ at a different point. Fig. 2 depicts the situation for $m = 2$. The $m$ intersection points split the line $L_{m+1}$ into $m + 1$ segments. Each of these segments cuts a region of $\mathcal{A}_m$ in two pieces. Therefore, by adding the line $L_{m+1}$ we get $m + 1$ new regions. In Fig. 2 the two intersection points result in three segments that split each of the regions $R_1$, $R_{12}$, $R_2$ in two. Hence
$$r(\mathcal{A}_{m+1}) = r(\mathcal{A}_m) + m + 1 = \frac{m(m-1)}{2} + m + 1 + m + 1 = \frac{m(m+1)}{2} + (m+1) + 1 = \binom{m+1}{2} + (m+1) + 1.$$
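The recurrence from this proof sketch can be verified mechanically against the closed form of eq. (4) and against the general-position count of Proposition 1 (a small check in Python; the function name is ours):

```python
from math import comb

def regions_general_position(m, n0):
    """Zaslavsky count for m hyperplanes in general position in R^{n0}."""
    return sum(comb(m, s) for s in range(n0 + 1))

# Sweep-line recurrence: r(0) = 1, r(m+1) = r(m) + m + 1.
r = 1
for m in range(20):
    assert r == comb(m, 2) + m + 1              # eq. (4)
    assert r == regions_general_position(m, 2)  # Proposition 1, n0 = 2
    r += m + 1
```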
For the number of response regions of MLPs with one single hidden layer we obtain the following.

Proposition 2. The regions of linearity of a function in the model $\mathcal{F}_{(n_0, n_1, 1)}$ with $n_0$ inputs and $n_1$ hidden units are given by the regions of an arrangement of $n_1$ hyperplanes in $n_0$-dimensional space. The maximal number of regions of such an arrangement is $R(n_0, n_1, n_y) = \sum_{j=0}^{n_0} \binom{n_1}{j}$.

Proof. This is a consequence of Lemma 1. The maximal number of regions is produced by an $n_0$-dimensional arrangement of $n_1$ hyperplanes in general position, which is given in Proposition 1.
4 Multiple hidden layers
In order to show that a $k$ hidden layer model can be more expressive than a single hidden layer one with the same number of hidden units, we will need the next three propositions.
Proposition 3. Any arrangement can be scaled down and shifted such that all regions of the arrangement intersect the unit ball.

Proof. Let $\mathcal{A}$ be an arrangement and let $S$ be a ball of radius $r$ and center $c$. Let $d$ be the supremum of the distance from the origin to a point in a bounded region of the essentialization of the arrangement $\mathcal{A}$. Consider the map $\phi \colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_0}$ defined by $\phi(x) = \frac{r}{2d} \cdot x + c$. Then $\mathcal{A}' = \phi(\mathcal{A})$ is an arrangement satisfying the claim. It is easy to see that any point with norm bounded by $d$ is mapped to a point inside the ball $S$.
Figure 3: An arrangement $\mathcal{A}$ and a scaled-shifted version $\mathcal{A}'$ whose regions intersect the ball $S$.
The proposition is illustrated in Fig. 3.
We need some additional notation in order to formulate the next proposition. Given a hyperplane $H = \{x \colon w^\top x + b = 0\}$, we consider the region $H^- = \{x \colon w^\top x + b < 0\}$ and the region $H^+ = \{x \colon w^\top x + b \ge 0\}$. If we think about the corresponding rectifier unit, then $H^+$ is the region where the unit is active and $H^-$ is the region where the unit is dead.

Let $R$ be a region delimited by the hyperplanes $\{H_1, \ldots, H_n\}$. We denote by $R^+ \subseteq \{1, \ldots, n\}$ the set of all hyperplane indices $j$ with $R \subset H^+_j$. In other words, $R^+$ is the list of hidden units that are active (non-zero) in the input-space region $R$.
The following proposition describes the combinatorics of 2-dimensional arrangements in general
position. More precisely, the proposition describes the combinatorics of n-dimensional arrange-
ments with 2-dimensional essentialization in general position. Recall that the essentialization of
an arrangement is the arrangement that it defines in the subspace spanned by the normals of its
hyperplanes.
The proposition guarantees the existence of input weights and bias for a rectifier layer such that for
any list of consecutive units, there is a region of inputs for which exactly the units from that list are
active.
Proposition 4. For any $n_0, n \in \mathbb{N}$, $n \ge 2$, there exists an $n_0$-dimensional arrangement $\mathcal{A}$ of $n$ hyperplanes such that for any pair $a, b \in \{1, \ldots, n\}$ with $a < b$, there is a region $R$ of $\mathcal{A}$ with $R^+ = \{a, a+1, \ldots, b\}$.
We show that the hyperplanes of a 2-dimensional arrangement in general position can be indexed in
such a way that the claim of the proposition holds. For higher dimensional arrangements the state-
ment follows trivially, applying the 2-dimensional statement to the intersection of the arrangement
with a 2-subspace.
Proof of Proposition 4. Consider first the case $n_0 = 2$. We define the first line $L_1$ of the arrangement to be the $x$-axis of the standard coordinate system. To define the second line $L_2$, we consider a circle $S_1$ of radius $r \in \mathbb{R}_+$ centered at the origin. We define $L_2$ to be the tangent of $S_1$ at an angle $\alpha_1$ to the $y$-axis, where $0 < \alpha_1 < \frac{\pi}{2}$. The top left panel of Fig. 4 depicts the situation. In the figure, $R_\emptyset$ corresponds to inputs for which no rectifier unit is active, $R_1$ corresponds to inputs where the first unit is active, $R_2$ to inputs where the second unit is active, and $R_{12}$ to inputs where both units are active. This arrangement has the claimed properties.
Now assume that there is an arrangement of $n$ lines with the claimed properties. To add an $(n+1)$-th line, we first consider the maximal distance $d_{\max}$ from the origin to the intersection of two lines $L_i \cap L_j$ with $1 \le i < j \le n$. We also consider the circle $S_n$ of radius $d_{\max} + r$ centered at the origin. The circle $S_n$ contains all intersections of the first $n$ lines. We now choose an angle $\alpha_n$ with $0 < \alpha_n < \alpha_{n-1}$ and define $L_{n+1}$ as the tangent of $S_n$ that forms an angle $\alpha_n$ with the $y$-axis. Fig. 4 depicts adding the third and fourth lines to the arrangement.

After adding line $L_{n+1}$, we have that the arrangement

1. is in general position;

2. has regions $R'_1, \ldots, R'_{n+1}$ with $R'^+_i = \{i, i+1, \ldots, n+1\}$ for all $i \in [n+1]$.
Figure 4: Illustration of the hyperplane arrangement discussed in Proposition 4, in the 2-dimensional case. On the left, arrangements of two and three lines; on the right, an arrangement of four lines.
The regions of the arrangement are stable under perturbation of the angles and radii used to define
the lines. Any slight perturbation of these parameters preserves the list of regions. Therefore, the
arrangement is in general position.
The second property comes from the order in which $L_{n+1}$ intersects all previous lines. $L_{n+1}$ intersects the lines in the order in which they were added to the arrangement: $L_1, L_2, \ldots, L_n$. The intersection of $L_{n+1}$ and $L_i$, $B_{i,n+1} = L_{n+1} \cap L_i$, is above the lines $L_{i+1}, L_{i+2}, \ldots, L_n$, and hence the segment $B_{i-1,n+1} B_{i,n+1}$ between the intersections with $L_{i-1}$ and with $L_i$ has to cut the region in which only units $i$ to $n$ are active.

The intersection order is ensured by the choice of angles $\alpha_i$ and the fact that the lines are tangent to the circles $S_i$. For any $i < j$ and $B_{ij} = L_i \cap L_j$, let $T_{ij}$ be the line parallel to the $y$-axis passing through $B_{ij}$. Each line $T_{ij}$ divides the space in two. Let $H_{ij}$ be the half-space to the right of $T_{ij}$. Within any half-space $H_{ij}$, the intersection $H_{ij} \cap L_i$ is above $H_{ij} \cap L_j$, because the angle $\alpha_{i-1}$ of $L_i$ with the $y$-axis is larger than $\alpha_{j-1}$ (this means $L_j$ has a steeper decrease). Since $L_{n+1}$ is tangent to the circle that contains all points $B_{ij}$, the line $L_{n+1}$ will intersect lines $L_i$ and $L_j$ in $H_{ij}$, and therefore it has to intersect $L_i$ first.
For $n_0 > 2$ we can consider an arrangement that is essentially 2-dimensional and has the properties of the arrangement described above. To do this, we construct a 2-dimensional arrangement in a 2-subspace of $\mathbb{R}^{n_0}$ and then extend each of the lines $L_i$ of the arrangement to a hyperplane $H_i$ that crosses $L_i$ orthogonally. The resulting arrangement satisfies all claims of the proposition.
The next proposition guarantees the existence of a collection of affine maps with shared bias, which
map a collection of regions to a common output.
Proposition 5. Consider two integers $n_0$ and $p$. Let $S$ denote the $n_0$-dimensional unit ball and let $R_1, \ldots, R_p \subseteq \mathbb{R}^{n_0}$ be some regions with non-empty interiors. Then there is a choice of weights $c \in \mathbb{R}^{n_0}$ and $U_1, \ldots, U_p \in \mathbb{R}^{n_0 \times n_0}$ for which $g_i(R_i) \supseteq S$ for all $i \in [p]$, where $g_i \colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_0}$; $y \mapsto U_i y + c$.
Figure 5: Illustration of Example 1. The units represented by squares build an intermediary layer of linear units between the first and the second hidden layers, computing $g(h^{(1)})$ from the first-layer activations $h^{(1)}$; the second hidden layer then computes $\mathrm{rect}(V^{(2)} g(h^{(1)}) + d^{(2)})$. The computation of such an intermediary linear layer can be absorbed into the second hidden layer of rectifier units (Lemma 3). The connectivity map depicts the maps $g_1$ by dashed arrows and $g_2$ by dashed-dotted arrows.
Proof. To see this, consider the following construction. For each region $R_i$ consider a ball $S_i \subseteq R_i$ of radius $r_i \in \mathbb{R}_+$ and center $s_i = (s_{i1}, \ldots, s_{i n_0}) \in \mathbb{R}^{n_0}$. For each $j = 1, \ldots, n_0$, consider $p$ positive numbers $u_{1j}, \ldots, u_{pj}$ such that $u_{ij} s_{ij} = u_{kj} s_{kj}$ for all $1 \le k < i \le p$. This can be done by fixing $u_{1j}$ equal to $1$ and solving the equation for all other numbers. Let $\eta \in \mathbb{R}$ be such that $r_i \eta u_{ij} > 1$ for any $j$ and $i$. Scaling each region $R_i$ by $U_i = \mathrm{diag}(\eta u_{i1}, \ldots, \eta u_{i n_0})$ transforms the center of $S_i$ to the same point for all $i$. By the choice of $\eta$, the minor radius of all transformed balls is larger than $1$.

We can now set $c$ to be minus the common center of the scaled balls, to obtain the map
$$g_i(x) = \mathrm{diag}(\eta u_{i1}, \ldots, \eta u_{i n_0})\, x - \mathrm{diag}(\eta u_{11}, \ldots, \eta u_{1 n_0})\, s_1, \quad \text{for all } 1 \le i \le p.$$
These $g_i$ satisfy the claimed property, namely that $g_i(R_i)$ contains the unit ball, for all $i$.
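The construction in this proof can be sketched numerically. The helper below is ours and, for simplicity, assumes ball centers with strictly positive coordinates, so that fixing $u_{1j} = 1$ gives $u_{ij} = s_{1j}/s_{ij}$; the `margin` parameter is an illustrative stand-in for the choice of $\eta$:

```python
import numpy as np

def shared_bias_maps(centers, radii, margin=1.1):
    """Build U_i = diag(eta * u_i) and shared bias c (Proposition 5 sketch).

    centers: (p, n0) array of ball centers s_i, assumed strictly positive.
    radii:   (p,) array of ball radii r_i.
    Returns (Us, c) with g_i(y) = Us[i] @ y + c mapping each ball
    B(s_i, r_i) onto an origin-centered ellipsoid containing the unit ball.
    """
    centers = np.asarray(centers, float)
    radii = np.asarray(radii, float)
    u = centers[0] / centers                  # u_ij * s_ij equal for all i
    eta = margin / (radii[:, None] * u).min() # every semiaxis r_i*eta*u_ij > 1
    Us = np.stack([np.diag(eta * u_i) for u_i in u])
    c = -eta * centers[0]                     # minus the common center
    return Us, c

# Example: two disjoint balls in R^2 with positive centers.
centers = np.array([[2.0, 1.0], [4.0, 5.0]])
radii = np.array([1.0, 0.5])
Us, c = shared_bias_maps(centers, radii)
```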
Before proceeding, we discuss an example illustrating how the previous propositions and lemmas
are put together to prove our main result below, in Theorem 1.
Example 1. Consider a rectifier MLP with $n_0 = 2$, such that the input space is $\mathbb{R}^2$, and assume that the network has only two hidden layers, each consisting of $n = 2n_0$ units. Each unit in the first hidden layer defines a hyperplane in $\mathbb{R}^2$, namely the hyperplane that separates the inputs for which it is active from the inputs for which it is not active. Hence the first hidden layer defines an arrangement of $n$ hyperplanes in $\mathbb{R}^2$. By Proposition 4, this arrangement can be made such that it delimits regions of inputs $R_1, \ldots, R_{n_0} \subseteq \mathbb{R}^2$ with the following property: for each input in any given one of these regions, exactly one pair of units in the first hidden layer is active, and, furthermore, the pairs of units that are active on different regions are disjoint.

By the definition of rectifier units, each hidden unit computes a linear function within the half-space of inputs where it is active. In turn, the image of $R_i$ by the pair of units that is active in $R_i$ is a polyhedron in $\mathbb{R}^2$. For each region $R_i$, denote the corresponding polyhedron by $S_i$.
Recall that a rectifier layer computes a map of the form $f \colon \mathbb{R}^n \to \mathbb{R}^m$; $x \mapsto \mathrm{rect}(Wx + b)$. Hence a rectifier layer with $n$ inputs and $m$ outputs can compute any composition $f' \circ g$ of an affine map $g \colon \mathbb{R}^n \to \mathbb{R}^k$ and a map $f'$ computed by a rectifier layer with $k$ inputs and $m$ outputs (Lemma 3).

Consider the map computed by the rectifier units in the second hidden layer, i.e., the map that takes activations from the first hidden layer and outputs activations from the second hidden layer. We think of this map as a composition $f' \circ g$ of an affine map $g \colon \mathbb{R}^n \to \mathbb{R}^2$ and a map $f'$ computed by a rectifier layer with $2$ inputs. The map $g$ can be interpreted as an intermediary layer consisting of two linear units, as illustrated in Fig. 5.
Figure 6: Constructing $\left\lfloor \frac{n_1}{n_0} \right\rfloor \sum_{k=0}^{n_0} \binom{n_2}{k}$ response regions in a model with two layers.
Within each input region $R_i$, only two units in the first hidden layer are active. Therefore, for each input region $R_i$, the output of the intermediary layer is an affine transformation of $S_i$. Furthermore, the weights of the intermediary layer can be chosen in such a way that the image of each $R_i$ contains the unit ball.

Now, $f'$ is the map computed by a rectifier layer with $2$ inputs and $n$ outputs. It is possible to define this map in such a way that it has $R$ regions of linearity within the unit ball, where $R$ is the number of regions of a 2-dimensional arrangement of $n$ hyperplanes in general position.

We see that the entire network computes a function which has $R$ regions of linearity within each one of the input regions $R_1, \ldots, R_{n_0}$. Each input region $R_i$ is mapped by the concatenation of the first and the intermediate (notional) layer to a subset of $\mathbb{R}^2$ which contains the unit ball. Then, the second layer computes a function which partitions the unit ball into many pieces. The partition computed by the second layer gets replicated in each of the input regions $R_i$, resulting in a subdivision of the input space into exponentially many pieces (exponential in the number of network layers).
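The replication effect is easiest to see in one dimension. The following sketch (a standard illustration in this spirit, not the paper's 2-dimensional construction) uses a width-2 rectifier layer whose restriction to $[0,1]$ folds both halves of the interval onto $[0,1]$; composing it $k$ times replicates the partition at every level and yields $2^k$ linear pieces:

```python
import numpy as np

def g(x):
    """A width-2 rectifier layer: g(x) = 2*rect(x) - 4*rect(x - 1/2).

    On [0, 1] this is the tent map: both halves of [0, 1] are mapped onto
    [0, 1], so any partition computed on top of g is replicated in each half.
    """
    return 2 * np.maximum(0, x) - 4 * np.maximum(0, x - 0.5)

k = 3
xs = (np.arange(4096) + 0.3) / 4096   # grid offset avoids the breakpoints
f = xs.copy()
for _ in range(k):                    # k-fold composition g ∘ ... ∘ g
    f = g(f)

# Count the linear pieces of the composition via slope sign changes.
s = np.sign(np.diff(f))
pieces = 1 + int(np.count_nonzero(s[1:] != s[:-1]))
print(pieces)  # → 2**k = 8
```

With $k$ hidden layers of 2 units each, the number of pieces grows like $2^k$, while a single layer with the same $2k$ units can only produce $2k + 1$ pieces on the line.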
Now we are ready to state our main result on the number of response regions of rectifier deep feedforward networks.

Theorem 1. A model with $n_0$ inputs and $k$ hidden layers of widths $n_1, n_2, \ldots, n_k$ can divide the input space into
$$\prod_{i=1}^{k-1} \left\lfloor \frac{n_i}{n_0} \right\rfloor \sum_{j=0}^{n_0} \binom{n_k}{j}$$
or possibly more regions.
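The bound of Theorem 1 can be compared with the shallow count of Proposition 2 for a fixed budget of hidden units; a small sketch (function names are ours, the sample sizes arbitrary):

```python
from math import comb, floor

def shallow_regions(n0, n_hidden):
    """Maximal regions of a single hidden layer (Proposition 2)."""
    return sum(comb(n_hidden, j) for j in range(n0 + 1))

def deep_regions_lower(n0, widths):
    """Lower bound of Theorem 1 for hidden layer widths n_1, ..., n_k."""
    prod = 1
    for n in widths[:-1]:
        prod *= floor(n / n0)
    return prod * sum(comb(widths[-1], j) for j in range(n0 + 1))

# Same budget of hidden units: k layers of width n vs one layer of width k*n.
n0, n, k = 2, 8, 4
deep = deep_regions_lower(n0, [n] * k)  # 4^3 * (1 + 8 + 28) = 2368
shallow = shallow_regions(n0, k * n)    # 1 + 32 + 496 = 529
assert deep > shallow
```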
Proof of Theorem 1. Let the first hidden layer define an arrangement like the one from Proposition 4. Then there are $p = \left\lfloor \frac{n_1}{n_0} \right\rfloor$ input-space regions $R_i \subseteq \mathbb{R}^{n_0}$, $i \in [p]$, with the following property: for each input vector from the region $R_i$, exactly $n_0$ units from the first hidden layer are active. We denote this set of units by $I_i$. Furthermore, by Proposition 4, for inputs in distinct regions $R_i$, the corresponding sets of active units are disjoint; that is, $I_i \cap I_j = \emptyset$ for all $i, j \in [p]$, $i \ne j$.

To be more specific, for input vectors from $R_1$, exactly the first $n_0$ units of the first hidden layer are active; that is, for these input vectors the value of $h^{(1)}_j$ is non-zero if and only if $j \in I_1 = \{1, \ldots, n_0\}$. For input vectors from $R_2$, only the next $n_0$ units of the first hidden layer are active, that is, the units with index in $I_2 = \{n_0 + 1, \ldots, 2n_0\}$, and so on.

Now we consider a 'fictitious' intermediary layer consisting of $n_0$ linear units between the first and second hidden layers. As this intermediary layer computes an affine function, it can be absorbed into the second hidden layer (see Lemma 3). We use it only for making the next arguments clearer.
The map taking activations from the first hidden layer to activations from the second hidden layer is $\mathrm{rect}(W^{(2)} x + b^{(2)})$, where $W^{(2)} \in \mathbb{R}^{n_2 \times n_1}$, $b^{(2)} \in \mathbb{R}^{n_2}$.

We can write the input and bias weight matrices as $W^{(2)} = V^{(2)} U^{(2)}$ and $b^{(2)} = d^{(2)} + V^{(2)} c^{(2)}$, where $U^{(2)} \in \mathbb{R}^{n_0 \times n_1}$, $c^{(2)} \in \mathbb{R}^{n_0}$, $V^{(2)} \in \mathbb{R}^{n_2 \times n_0}$, and $d^{(2)} \in \mathbb{R}^{n_2}$.

The weights $U^{(2)}$ and $c^{(2)}$ describe the affine function computed by the intermediary layer, $x \mapsto U^{(2)} x + c^{(2)}$. The weights $V^{(2)}$ and $d^{(2)}$ are the input and bias weights of the rectifier layer following the intermediary layer.
We now consider the sub-matrix U^(2)_i of U^(2) consisting of the columns of U^(2) with indices in I_i, for
all i ∈ [p]. Then U^(2) = [U^(2)_1 | · · · | U^(2)_p | Ũ^(2)], where Ũ^(2) is the sub-matrix of U^(2) consisting
of its last n_1 − p n_0 columns. In the sequel we set all entries of Ũ^(2) equal to zero.

The map g : R^{n_1} → R^{n_0}; g(x) = U^(2) x + c^(2) is thus written as the sum g = Σ_{i ∈ [p]} g_i + c^(2),
where g_i : R^{n_0} → R^{n_0}; g_i(x) = U^(2)_i x, for all i ∈ [p].
Let S_i be the image of the input-space region R_i under the first hidden layer. By Proposition 5, there
is a choice of the weights U^(2)_i and bias c^(2) such that the image of S_i under x ↦ U^(2)_i x + c^(2)
contains the n_0-dimensional unit ball. Now, for all input vectors from R_i, only the units I_i of the
first hidden layer are active. Therefore, g|_{R_i} = g_i|_{R_i} + c^(2). This implies that the image g(R_i) of
the input-space region R_i under the intermediary layer contains the unit ball, for all i ∈ [p].
We can now choose V^(2) and d^(2) in such a way that the rectifier function R^{n_0} → R^{n_2};
y ↦ rect(V^(2) y + d^(2)) defines an arrangement A of n_2 hyperplanes with the property that each region
of A intersects the unit ball at an open neighborhood.
In consequence, the map from the input space to the activations of the second hidden layer has r(A) regions
of linearity within each input-space region R_i; Fig. 6 illustrates the situation. All inputs that are
mapped to the same activations of the first hidden layer are treated as equivalent by the subsequent
layers. In this sense, an arrangement A defined on the set of common outputs of R_1, . . . , R_p at the
first hidden layer is ‘replicated’ in each input region R_1, . . . , R_p.
The subsequent layers of the network can be analyzed in the same way as the first two layers
above. In particular, the weights V^(2) and d^(2) can be chosen in such a way that they define
an arrangement with the properties from Proposition 4. Then the map taking activations of the
second hidden layer to activations of the third hidden layer can be analyzed by considering again
a fictitious intermediary layer, now between the second and third layers, and so forth.
For the last hidden layer we choose the input weights V^(k) and bias d^(k) defining an n_0-dimensional
arrangement of n_k hyperplanes in general position. The map of inputs to activations of the last hidden
layer thus has

    ∏_{i=1}^{k−1} ⌊n_i/n_0⌋ · Σ_{j=0}^{n_0} C(n_k, j)

regions of linearity. This number is a lower bound on the maximal number of regions of linearity of
functions computable by the network. This completes the proof. The intuition behind the construction
is illustrated in Fig. 7.
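As a sanity check, the lower bound of Theorem 1 is straightforward to evaluate numerically. The following sketch (the helper name `theorem1_lower_bound` is ours, not from the paper) computes the product-times-sum expression for a given input dimension and list of hidden layer widths:

```python
from math import comb, prod

def theorem1_lower_bound(n0, widths):
    """Theorem 1 lower bound on the number of linear regions of a
    rectifier MLP with n0 inputs and hidden layer widths n1, ..., nk."""
    *intermediate, nk = widths  # n1, ..., n_{k-1} and the last width nk
    replication = prod(w // n0 for w in intermediate)  # prod_i floor(n_i / n0)
    # Number of regions cut out by nk hyperplanes in general position in R^n0.
    last_layer = sum(comb(nk, j) for j in range(n0 + 1))
    return replication * last_layer

# n0 = 2 with three hidden layers of width 4:
# floor(4/2)^2 * (C(4,0) + C(4,1) + C(4,2)) = 4 * 11 = 44
print(theorem1_lower_bound(2, [4, 4, 4]))  # 44
```

For a single hidden layer the product is empty, so the bound reduces to the shallow count Σ_{j=0}^{n_0} C(n_1, j).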
In Appendix A we derive an asymptotic expansion of the bound given in Theorem 1.
5 A special class of deep models
In this section we consider deep rectifier models with n_0 input units and hidden layers of width
n = 2n_0. This restriction allows us to construct a very efficient deep model in terms of the number
of response regions. The analysis in this section complements the results of the previous section,
showing that rectifier MLPs can compute functions with many response regions even when defined
with relatively few hidden layers.
Example 2. Let us assume we have a 2-dimensional input, i.e., n_0 = 2, and a layer of n = 4
rectifiers f_1, f_2, f_3, and f_4, followed by a linear projection. We construct the rectifier layer in such a
way that it divides the input space into four ‘square’ cones, each of them corresponding to the inputs
Figure 7: Constructing ⌊n_2/n_0⌋ ⌊n_1/n_0⌋ Σ_{j=0}^{n_0} C(n_3, j) response regions in a model with three layers.

where two of the rectifier units are active. We define the four rectifiers as:
    f_1(x) = max{0, [1, 0]^⊤ x},
    f_2(x) = max{0, [−1, 0]^⊤ x},
    f_3(x) = max{0, [0, 1]^⊤ x},
    f_4(x) = max{0, [0, −1]^⊤ x},
where x = [x_1, x_2]^⊤ ∈ R^{n_0}. By adding pairs of coordinates of f = [f_1, f_2, f_3, f_4]^⊤, we can
effectively mimic a layer consisting of two absolute-value units g_1 and g_2:

    [ g_1(x) ]   [ 1 1 0 0 ]   [ f_1(x) ]
    [ g_2(x) ] = [ 0 0 1 1 ] · [ f_2(x) ]  =  [ abs(x_1) ]
                               [ f_3(x) ]     [ abs(x_2) ].    (5)
                               [ f_4(x) ]
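The construction in Eq. (5) can be verified directly. The sketch below (pure Python, with hypothetical helper names) implements the four rectifiers and the pairwise summation, and checks that the result is the coordinatewise absolute value:

```python
def rect(z):
    """Elementwise rectifier max(0, .)."""
    return [max(0.0, v) for v in z]

def matvec(M, x):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

# The four rectifier units f1, ..., f4 of Example 2 (rows are their weights).
W1 = [[1, 0], [-1, 0], [0, 1], [0, -1]]
# The pairwise summation matrix of Eq. (5).
A = [[1, 1, 0, 0], [0, 0, 1, 1]]

def g(x):
    """Computes (|x1|, |x2|) by summing rectifier pairs."""
    return matvec(A, rect(matvec(W1, x)))

print(g([-3.0, 2.0]))  # [3.0, 2.0]
print(g([1.5, -0.5]))  # [1.5, 0.5]
```

For each input only two of the four units are active, as the construction requires.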
The absolute-value unit g_i divides the input space along the i-th coordinate axis, taking values which
are symmetric about that axis. The combination of g_1 and g_2 is then a function with four regions of
linearity:

    S_1 = {(x_1, x_2) | x_1 ≥ 0, x_2 ≥ 0},
    S_2 = {(x_1, x_2) | x_1 ≥ 0, x_2 < 0},
    S_3 = {(x_1, x_2) | x_1 < 0, x_2 ≥ 0},
    S_4 = {(x_1, x_2) | x_1 < 0, x_2 < 0}.

Since the values of g_i are symmetric about the i-th coordinate axis, each point x ∈ S_i has a
corresponding point y ∈ S_j with g(x) = g(y), for all i and j.
We can apply the same procedure to the image of [g1, g2]to recursively divide the input space,
as illustrated in Fig. 8. For instance, if we apply this procedure one more time, we get four regions
Figure 8: Illustration of Example 2. (a) A rectifier layer with two pairs of units, where each pair
computes the absolute value of one of the two input coordinates; each input quadrant is mapped to the
positive quadrant. (b) Depiction of a two-layer model in which both layers simulate the absolute value
of their input coordinates.
within each S_i, resulting in 16 regions in total within the input space. On the last layer, we may place
rectifiers in any way suitable for the task of interest (e.g., classification). The partition computed by
the last layer is then copied into each of the input-space regions that produce the same input for the
last layer. Fig. 9 shows a function that can be implemented efficiently by a deep model using the
previous observations.
Figure 9: (a) Illustration of the partition computed by 8 rectifier units on the outputs (x_1, x_2) of the
preceding layer; the color is a heat map of x_1 − x_2. (b) Heat map of a function computed by a
rectifier network with 2 inputs, 2 hidden layers of width 4, and one linear output unit; the black
lines delimit the regions of linearity of the function. (c) Heat map of a function computed by a
4-layer model with a total of 24 hidden units. It takes at least 137 hidden units in a shallow model to
represent the same function.
The foregoing discussion is easily generalized to n_0 > 2 input variables and k hidden layers,
each consisting of 2n_0 rectifiers. In that case, the maximal number of linear regions of functions
computable by the network is lower-bounded as follows.
Theorem 2. The maximal number of regions of linearity of functions computable by a rectifier
neural network with n_0 input variables and k hidden layers of width 2n_0 is at least

    2^{(k−1) n_0} · Σ_{j=0}^{n_0} C(2n_0, j).
Proof. We prove this constructively. We define the rectifier units in each hidden layer in pairs, with
the sum of each pair giving the absolute value of one coordinate; we also interpret the sums of such
pairs as the input coordinates of the subsequent hidden layer. The rectifiers in the first hidden
layer are defined in pairs such that the sum of each pair is the absolute value of one of the input
dimensions, with bias equal to (−1/2, . . . , −1/2). In the next hidden layers, the rectifiers are defined
in a similar way, with the difference that each pair computes the absolute value of the sum of two
of their inputs. The last hidden layer is defined in such a way that it computes a piecewise linear
function with the maximal number of pieces, all of them intersecting the unit cube in R^{n_0}. The
maximal number of regions of linearity of m rectifier units with n_0-dimensional input is
Σ_{j=0}^{n_0} C(m, j). This partition is multiplied 2^{n_0} times by each of the k − 1 preceding layers,
which gives the stated bound.
The theorem shows that even for a small number of layers k, a deep model can have many more linear
regions than a shallow one. For example, if we set the input dimensionality to n_0 = 2,
a shallow model with 4n_0 units will have at most 37 linear regions, while the equivalent deep model
with two layers of 2n_0 units can produce 44 linear regions. For 6n_0 hidden units the shallow model
computes at most 79 regions, while the equivalent three-layer model can compute 176 regions.
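These counts follow directly from the formulas above; a small script (helper names are ours) reproduces them:

```python
from math import comb

def shallow_max_regions(n0, m):
    """Maximal number of linear regions of m rectifier units on n0 inputs
    (number of regions of m hyperplanes in general position in R^n0)."""
    return sum(comb(m, j) for j in range(n0 + 1))

def deep_lower_bound(n0, k):
    """Theorem 2 lower bound for k hidden layers of width 2 * n0."""
    return 2 ** ((k - 1) * n0) * shallow_max_regions(n0, 2 * n0)

n0 = 2
print(shallow_max_regions(n0, 4 * n0), deep_lower_bound(n0, 2))  # 37 44
print(shallow_max_regions(n0, 6 * n0), deep_lower_bound(n0, 3))  # 79 176
```

Both comparisons use the same total number of hidden units (8 and 12 respectively) for the shallow and the deep model.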
6 Discussion and conclusions
In this paper we introduced a novel way of understanding the expressiveness of neural networks
with piecewise linear activations: we count the number of regions of linearity, also called response
regions, of the functions that they can represent. The number of response regions tells us how well
these models can approximate arbitrary curved shapes, and computational geometry provides us with
the tools to make such statements.
We found that deep and narrow rectifier MLPs can generate many more regions of linearity than
their shallow counterparts with the same number of computational units or of parameters. We can
express this in terms of the ratio between the maximal number of response regions and the number
of parameters of both model classes. For a deep model with n_0 = O(1) inputs and k hidden layers
of width n, the maximal number of response regions per parameter behaves as

    Ω( (n/n_0)^{k−1} · n^{n_0−2} / k ).
For a shallow model with n_0 = O(1) inputs, the maximal number of response regions per parameter
behaves as

    O( k^{n_0−1} · n^{n_0−1} ).
We see that the deep model can generate many more response regions per parameter than the shallow
model: exponentially more regions per parameter in terms of the number of hidden layers k, and at
least order (k − 2) polynomially more regions per parameter in terms of the layer width n. In
particular, there are deep models which use fewer parameters to produce more linear regions than
their shallow counterparts. Details about the asymptotic expansions are given in Appendix A.
In this paper we only considered linear output units, but this is not a restriction, as the output
activation itself is not parametrized. If there is a target function f_targ that we want to model with a
rectifier MLP with σ as its output activation function, then there exists a function f′_targ such that
σ(f′_targ) = f_targ. When σ has an inverse (e.g., the sigmoid), f′_targ = σ^{−1}(f_targ). For activations
that do not have an inverse, like the softmax, there are infinitely many functions f′_targ that work, and
we just need to pick one; e.g., for the softmax we can pick log(f_targ). By analyzing how well we can
model f′_targ with a linear-output rectifier MLP, we get an indirect measure of how well we can model
f_targ with an MLP that has σ as its output activation.
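For a concrete instance of this argument, the sketch below (assuming a sigmoid output activation; helper names are ours) checks that composing the output nonlinearity with its inverse recovers the target value, so a linear-output model of f′_targ = σ^{−1}(f_targ) suffices:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid on (0, 1)."""
    return math.log(p / (1.0 - p))

# If the desired output is f_targ(x) = 0.73 at some input x, a linear-output
# network only needs to produce f'_targ(x) = logit(0.73) there.
p_target = 0.73
assert abs(sigmoid(logit(p_target)) - p_target) < 1e-12
```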
Another interesting observation is that we recover a high ratio of n to n_0 if the data lives near a low-
dimensional manifold (which effectively reduces the input size n_0). One-layer models can reach the
upper bound of response regions only by spanning all the dimensions of the input; in other words,
shallow models are not capable of concentrating linear response regions in any lower-dimensional
subspace of the input. If, as commonly assumed, data lives near a low-dimensional manifold, then
we care only about the number of response regions that a model can generate in the directions of the
data manifold. One way of thinking about this is principal component analysis (PCA), where one
finds that only a few input-space directions (say, on the MNIST database) are relevant to the underlying
data. In such a situation, one cares about the number of response regions that a model can generate
only within the directions in which the data does change. In such situations n ≫ n_0, and our results
show a clear advantage of using deep models.
We believe that the proposed framework can be used to answer many other interesting questions
about these models. For example, one can look at how the number of response regions is affected
by different constraints of the model, like shared weights. We think that this approach can also be
used to study other kinds of piecewise linear models, such as convolutional networks with rectifier
units or maxout networks, and for comparing different piecewise linear models with each other.
A Asymptotic expansions
Here we derive asymptotic expressions for the formulas contained in Proposition 2 and Theorem 1.
We use the following standard notation:
• f(n) = O(g(n)) means that there is a positive constant c_2 such that f(n) ≤ c_2 g(n) for all
n larger than some N.
• f(n) = Θ(g(n)) means that there are two positive constants c_1 and c_2 such that
c_1 g(n) ≤ f(n) ≤ c_2 g(n) for all n larger than some N.
• f(n) = Ω(g(n)) means that there is a positive constant c_1 such that f(n) ≥ c_1 g(n) for all
n larger than some N.
Proposition 6.
• Consider a single-layer rectifier MLP with kn units and n_0 inputs. Then the maximal
number of regions of linearity of the functions represented by this network is

    R(n_0, kn, 1) = Σ_{s=0}^{n_0} C(kn, s),

and

    R(n_0, kn, 1) = O(k^{n_0} n^{n_0}),  when n_0 = O(1).

• Consider a k-layer rectifier MLP with hidden layers of width n and n_0 inputs. Then the
maximal number of regions of linearity of the functions represented by this network satisfies

    R(n_0, n, . . . , n, 1) ≥ ( ∏_{i=1}^{k−1} ⌊n/n_0⌋ ) · Σ_{s=0}^{n_0} C(n, s),

and

    R(n_0, n, . . . , n, 1) = Ω( (n/n_0)^{k−1} n^{n_0} ),  when n_0 = O(1).
Proof. Here only the asymptotic expressions remain to be shown. It is known that

    Σ_{s=0}^{n_0} C(m, s) = Θ( (1 − 2n_0/m)^{−1} · C(m, n_0) ),  when n_0 ≤ m/2 − √m.    (6)

Furthermore, it is known that

    C(m, s) = (m^s / s!) · (1 + O(1/m)),  when s = O(1).    (7)

When n_0 is constant, n_0 = O(1), we have that

    C(kn, n_0) = (k^{n_0} / n_0!) · n^{n_0} · (1 + O(1/(kn))).

In this case, it follows that

    Σ_{s=0}^{n_0} C(kn, s) = Θ( (1 − 2n_0/(kn))^{−1} · C(kn, n_0) ) = Θ(k^{n_0} n^{n_0}),

and also

    Σ_{s=0}^{n_0} C(n, s) = Θ(n^{n_0}).
Furthermore,

    ( ∏_{i=1}^{k−1} ⌊n/n_0⌋ ) · Σ_{s=0}^{n_0} C(n, s) = Θ( (n/n_0)^{k−1} n^{n_0} ).
We now analyze the number of response regions as a function of the number of parameters. When
k and n_0 are fixed, then ⌊n/n_0⌋^{k−1} grows polynomially in n, and k^{n_0} is constant. On the other
hand, when n is fixed with n > 2n_0, then ⌊n/n_0⌋^{k−1} grows exponentially in k, while k^{n_0} grows only
polynomially in k.
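A quick numerical sketch (with n_0 = 2; helper names are ours) illustrates the second regime, where the deep factor eventually dominates as k grows:

```python
n0 = 2  # fixed input dimension

def deep_factor(n, k):
    """floor(n / n0)^(k - 1): exponential in k when n >= 2 * n0."""
    return (n // n0) ** (k - 1)

def shallow_factor(k):
    """k^n0: only polynomial in k."""
    return k ** n0

# With n = 2 * n0 fixed, the deep factor overtakes the shallow one as k grows.
for k in (2, 5, 10):
    print(k, deep_factor(2 * n0, k), shallow_factor(k))
# k = 10: 2^9 = 512 versus 10^2 = 100
```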
Proposition 7. The number of parameters of a deep model with n_0 = O(1) inputs, n_out = O(1)
outputs, and k hidden layers of width n is

    (k − 1) n^2 + (k + n_0 + n_out) n + n_out = O(k n^2).

The number of parameters of a shallow model with n_0 = O(1) inputs, n_out = O(1) outputs, and
kn hidden units is

    (n_0 + n_out) kn + kn + n_out = O(kn).
Proof. For the deep model, each layer except the first and the last has an input weight matrix with n^2
entries and a bias vector of length n. This gives a total of (k − 1) n^2 + (k − 1) n parameters. The first
layer has n n_0 input weights and n biases. The output layer has an n_out × n input weight matrix and
n_out biases. Summing these together we get

    (k − 1) n^2 + n(k + n_0 + n_out) + n_out = O(k n^2).

For the shallow model, the hidden layer has kn n_0 input weights and kn biases. The output layer has
kn n_out input weights and n_out biases. Summing these together we get

    kn(n_0 + n_out) + kn + n_out = O(kn).
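The closed form for the deep model can be checked against a direct layer-by-layer count; the sketch below (function names are ours) does this for a couple of hypothetical architectures:

```python
def deep_params(n0, n, k, nout):
    """Layer-by-layer parameter count of a deep rectifier MLP with
    k hidden layers of width n, n0 inputs, and nout linear outputs."""
    total = n0 * n + n               # first hidden layer: weights + biases
    total += (k - 1) * (n * n + n)   # remaining k - 1 hidden layers
    total += n * nout + nout         # linear output layer
    return total

def deep_params_formula(n0, n, k, nout):
    """Closed form from Proposition 7."""
    return (k - 1) * n * n + n * (k + n0 + nout) + nout

for (n0, n, k, nout) in [(2, 4, 3, 1), (3, 7, 5, 2)]:
    assert deep_params(n0, n, k, nout) == deep_params_formula(n0, n, k, nout)
print("parameter counts agree")
```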
The number of linear regions per parameter can be given as follows.
Proposition 8. Consider a fixed number of inputs n_0 and a fixed number of outputs n_out. The
maximal ratio of the number of response regions to the number of parameters of a deep model with
k layers of width n is

    Ω( (n/n_0)^{k−1} · n^{n_0−2} / k ).

In the case of a shallow model with kn hidden units, the ratio is

    O( k^{n_0−1} · n^{n_0−1} ).
Proof. This follows by combining Proposition 6 and Proposition 7.
We see that, for a fixed number of parameters, deep models can compute functions with many more
regions of linearity than those computable by shallow models. The ratio is exponential in the number
of hidden layers k and thus in the number of hidden units.
Acknowledgments
We would like to thank KyungHyun Cho, Çağlar Gülçehre, and the anonymous ICLR reviewers for their comments.
Razvan Pascanu is supported by a DeepMind Fellowship.
References
Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127,
2009.
Y. Bengio and O. Delalleau. On the expressive power of deep architectures. In J. Kivinen, C. Szepesvári,
E. Ukkonen, and T. Zeugmann, editors, Algorithmic Learning Theory, volume 6925 of Lecture Notes in
Computer Science, pages 18–36. Springer Berlin Heidelberg, 2011.
X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011.
I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML’2013,
2013.
A. Hajnal, W. Maass, P. Pudlák, M. Szegedy, and G. Turán. Threshold circuits of bounded depth. Journal of
Computer and System Sciences, 46(2):129–154, 1993.
J. Håstad. Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th Annual ACM
Symposium on Theory of Computing, pages 6–20, Berkeley, California, 1986. ACM Press.
J. Håstad and M. Goldmann. On the power of small-depth threshold circuits. Computational Complexity, 1:
113–129, 1991.
G. Hinton, L. Deng, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and
B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing
Magazine, 29(6):82–97, Nov. 2012a.
G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by
preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580, 2012b.
N. Le Roux and Y. Bengio. Deep belief networks are compact universal approximators. Neural Computation,
22(8):2192–2207, 2010.
H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised
learning of hierarchical representations. In ICML, Montreal (QC), Canada, 2009.
J. Martens, A. Chattopadhya, T. Pitassi, and R. Zemel. On the expressive power of restricted Boltzmann
machines. In Advances in Neural Information Processing Systems 26, pages 2877–2885, 2013.
G. Montúfar and N. Ay. Refinements of universal approximation results for deep belief networks and restricted
Boltzmann machines. Neural Computation, 23(5):1306–1319, 2011.
G. Montúfar and J. Morton. When does a mixture of products contain a product of mixtures? arXiv preprint
arXiv:1206.0387, 2012.
G. Montúfar, J. Rauh, and N. Ay. Expressive power and approximation errors of restricted Boltzmann machines.
Advances in Neural Information Processing Systems, 24:415–423, 2011.
V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814, 2010.
H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Computer Vision Workshops
(ICCV Workshops), 2011 IEEE International Conference on, pages 689–690, 2011.
R. Stanley. An introduction to hyperplane arrangements. In Lect. notes, IAS/Park City Math. Inst., 2004.
I. Sutskever and G. E. Hinton. Deep, narrow sigmoid belief networks are universal approximators. Neural
Computation, 20(11):2629–2636, 2008.
T. Zaslavsky. Facing Up to Arrangements: Face-Count Formulas for Partitions of Space by Hyperplanes.
Number no. 154 in Memoirs of the American Mathematical Society. American Mathematical Society, 1975.
M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. Technical report,
arXiv:1311.2901, 2013.