Competitive Mixtures of Simple Neurons
Karthik Sridharan Matthew J. Beal Venu Govindaraju
{ks236,mbeal,govind}@cse.buffalo.edu
Department of Computer Science and Engineering
State University of New York at Buffalo
Buffalo, NY 14260-2000, USA
Abstract
We propose a competitive finite mixture of neurons (or
perceptrons) for solving binary classification problems.
Our classifier includes a prior for the weights between dif-
ferent neurons such that it prefers mixture models made up
from neurons having classification boundaries as orthog-
onal to each other as possible. We derive an EM algo-
rithm for learning the mixing proportions and weights of
each neuron, consisting of an exact E step and a partial M
step, and show that our model covers the regions of high
posterior probability in weight space and tends to reduce
overfitting. We demonstrate the way in which our mixture
classifier works using a toy 2-dimensional data set, show-
ing the effective use of strategically positioned components
in the mixture. We further compare its performance against
SVMs and one-hidden-layer neural networks on four real-
world data sets from the UCI repository, and show that even
a relatively small number of neurons with appropriate com-
petitive priors can achieve superior classification accura-
cies on held-out test data.
1 Introduction
One of the main challenges in the problem of classifica-
tion is to find a hypothesis that does not overfit the train-
ing data; there is always a trade-off between training and
test accuracies. High training accuracy is often indicative
of an overfitted classifier that may lead to poor generaliza-
tion performance on test data. An effective way to alle-
viate the problem of overfitting is to combine the classifi-
cation results of several classifiers. A Bayes-optimal rule
for combining the results of several classifiers is to take a
linear weighted sum of their predictions, weighting by the
posterior probability of each classifier having generated the
training data set. Because even for moderate dimensionalities of the parameter space it is difficult to sample from the parameters' posterior distribution,
this weighted prediction is often approximated using a very
large set of bagged or bootstrap-trained classifiers [1]. An
alternative to such bagged estimates is to use Markov chain
Monte Carlo methods, in which a trajectory is simulated
through the parameter space that converges to a stationary
distribution that is the true posterior distribution [8]. How-
ever, both these methods are computationally intensive.
We propose a solution that instead uses a small finite mixture of simple neurons, each forming a simple linear boundary, combined in a mixture-model formalism so that different components model different parts of the data. Such a mixture model has indeed been used before for classification [5]. However, in contrast to standard mixture-model approaches, we introduce a penalty into the cost function of each neuron in the form of a squared cosine term between its weights and those of the other neurons, such that the model learns solutions that are not only good in terms of classification performance, but are also encouraged to be different
from each other. Our algorithm can be thought of as a mix-
ture of experts model with competitive penalties between
the experts (as shown in Figure 1). We use an EM algo-
rithm [2] for learning the weights of each of the neurons
and the mixing proportions between them. Since we treat
the loss function of a neuron as a negative log probabil-
ity, we automatically have a probabilistic interpretation and
hence can evaluate posterior responsibilities of neurons for
data given priors on each of them.
2 Preliminaries
Consider a binary classification problem with feature space $\mathcal{X}$ and binary target space $\mathcal{T} = \{0, 1\}$. Given $n$ training samples $\{x_1, \ldots, x_n\}$ from the space $\mathcal{X}$ with binary labels $\{t_1, \ldots, t_n\} \in \mathcal{T}$, our task is to find a function $f : \mathcal{X} \to \mathcal{T}$ mapping a given input to a binary target classification. If we consider a single neuron having a logistic output function, the function $f(x; w) = (1 + e^{-w^\top x + b})^{-1}$ is thresholded to obtain the classification, where $w$ is the set of weights for the neuron. One possible general error function that can be minimized during training of the neuron is
$$E_\alpha(w) = \frac{\alpha}{2} \sum_{i=1}^{n} \bigl(t_i - f(x_i; w)\bigr)^2 .$$
If we assume that this loss function is the negative log probability of the data, then under this interpretation we are in fact modeling, for each data point $i$, the random variable $t_i - f(x_i; w)$ as a Gaussian distribution with mean zero and precision $\alpha$.
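To make this concrete, the following is a minimal sketch of the single-neuron model and the error function above; all names are hypothetical (our own, not the paper's), and for brevity the bias is absorbed into $w$ by appending a constant input of 1, as described in the next section.

import numpy as np

def neuron_output(x, w):
    # logistic output f(x; w); the bias is assumed absorbed into w via a
    # constant 1 appended to each input x
    return 1.0 / (1.0 + np.exp(-x @ w))

def squared_error(w, X, t, alpha):
    # E_alpha(w) = (alpha / 2) * sum_i (t_i - f(x_i; w))^2
    return 0.5 * alpha * np.sum((t - neuron_output(X, w)) ** 2)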
3 Mixture of Neurons
Consider $k$ neurons with weights given by $w_1, \ldots, w_k$.
Note that without loss of generality we can model a bias
into the neuron by appending a bias weight and augmenting
the inputs with an extra dimension that is fixed to a value of
1. Each neuron in the mixture attempts to reduce its total
squared prediction error over all the data, but we also add
a term in the cost function that explicitly pushes each neu-
ron’s parameters away from the remaining neurons. Thus
our final classifier is the expected output of a set of classi-
fiers, each trained to classify well, but at the same time as
different from each other as possible. This overall cost func-
tion for the classifier modeling the $i$th data point is given by
$$E_i = \frac{\alpha}{2} \sum_{j=1}^{k} \bigl(t_i - f(x_i; w_j)\bigr)^2 + \frac{\beta}{2} \sum_{j=1}^{k} \sum_{l=1,\, l \neq j}^{k} \cos^2(w_j, w_l) \qquad (1)$$
where the squared cosine term is employed to penalize similar weights in different neurons. This term is important to break the symmetry between the $k$ classifiers, since without it the weights of all the neurons would converge to the same value (modulo undesirable local minima), which is clearly an uninteresting predictive ensemble. By minimizing the inner product in weight space we are effectively making the neuron weights as orthogonal to each other as possible, thus covering the posterior more thoroughly. In
data space, since the angle between any two weight vectors
is the angle between the corresponding separating hyper-
planes, reducing the squared cosine term causes a splaying
of the separating hyperplanes to the widest possible angles
whilst also striving for good classification capability. If we consider this cost function as the negative log probability of the datum, not only are we trying to model the prediction error with a Gaussian distribution with precision $\alpha$ and mean $0$, but we are also including a prior stating that we believe that the cosine of the weights of each neuron with respect to the other neurons is also drawn from a Gaussian distribution with mean $0$ (orthogonal) and precision $\beta$.
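As a sketch of this cost function (Equation (1)), with hypothetical helper names, assuming the bias is already absorbed into each weight vector:

import numpy as np

def cos2(wj, wl):
    # squared cosine of the angle between two weight vectors
    c = (wj @ wl) / (np.linalg.norm(wj) * np.linalg.norm(wl))
    return c ** 2

def cost_i(xi, ti, W, alpha, beta):
    # Equation (1): per-datum cost over the k neurons with weight vectors W[0..k-1]
    k = len(W)
    fit = 0.5 * alpha * sum((ti - 1.0 / (1.0 + np.exp(-xi @ W[j]))) ** 2
                            for j in range(k))
    competition = 0.5 * beta * sum(cos2(W[j], W[l])
                                   for j in range(k) for l in range(k) if l != j)
    return fit + competition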
Now let $\Theta_w = \{w_1, \ldots, w_k\}$ denote the setting of the collection of weights for the $k$ neurons. Let the hidden indicator variable $h_i$ taking on value $j$ denote the $i$th datum being modeled by the $j$th neuron with probability $\pi_j$; then the probability of observing datum $(x_i, t_i)$ given the parameters and the indicator $h_i = j$ is
$$P(x_i, t_i \mid h_i = j, \Theta_w) \propto e^{-\frac{\alpha}{2}(t_i - f(x_i; w_j))^2 - \frac{\beta}{2}\sum_{l=1, l \neq j}^{k} \cos^2(w_j, w_l)} = e^{-E_{ij}}.$$
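The per-neuron energy $E_{ij}$ and the corresponding unnormalized component likelihood can be sketched as follows (again with hypothetical names of our own choosing):

import numpy as np

def energy_ij(xi, ti, W, j, alpha, beta):
    # E_ij: squared prediction error of neuron j on datum i, plus its share of
    # the orthogonality prior against every other neuron
    fj = 1.0 / (1.0 + np.exp(-xi @ W[j]))
    fit = 0.5 * alpha * (ti - fj) ** 2
    prior = 0.5 * beta * sum(
        ((W[j] @ W[l]) / (np.linalg.norm(W[j]) * np.linalg.norm(W[l]))) ** 2
        for l in range(len(W)) if l != j)
    return fit + prior

def component_likelihood(xi, ti, W, j, alpha, beta):
    # P(x_i, t_i | h_i = j, Theta_w), up to a normalizing constant
    return np.exp(-energy_ij(xi, ti, W, j, alpha, beta))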
3.1 EM Algorithm
We now derive in brief an EM algorithm [2] to estimate
the parameters of our model. The EM algorithm we con-
sider uses a batch update of the weights, i.e. all the instances are considered together and the parameters of all the neurons are updated simultaneously.

Figure 1. Model of the competitive classifier.

The probability of observing $n$ i.i.d. training samples $\{x_i, t_i\}_{i=1}^{n}$ given the weights $\Theta_w$ and mixing proportions $\pi$ is given by
$$P(\{x_i, t_i\}_{i=1}^{n} \mid \Theta_w, \pi) = \prod_{i=1}^{n} \sum_{j=1}^{k} P(h_i = j \mid \pi)\, P(x_i, t_i \mid h_i = j, \Theta_w)$$
where each $h_i$ is a hidden variable distributed according to the prior mixing proportions $\pi$. Taking the logarithm of this probability, and introducing a set of variational distributions $Q = \{Q_i(h_i)\}_{i=1}^{n}$ over the hidden indicators for each of the data points, using Jensen's inequality we obtain a lower bound on the log likelihood of the parameters $\Theta_w$ for the model, which we denote $\mathcal{F}(Q, \Theta_w)$:
$$\mathcal{L}(\Theta_w) \equiv \log P(x_1, \ldots, x_n, t_1, \ldots, t_n \mid \Theta_w) \geq \sum_{i=1}^{n} \sum_{h_i=1}^{k} Q_i(h_i) \log \frac{P(h_i)\, P(x_i, t_i \mid h_i, \Theta_w)}{Q_i(h_i)} \equiv \mathcal{F}(Q, \Theta_w). \qquad (2)$$
Therefore we have $\mathcal{L}(\Theta_w) \geq \mathcal{F}(Q, \Theta_w)$; since we cannot directly optimize $\mathcal{L}(\Theta_w)$ we instead optimize $\mathcal{F}(Q, \Theta_w)$, which guarantees that the log-likelihood of the data never decreases. The EM algorithm alternately optimizes the distributions $\{Q_i\}$ and the weights $\Theta_w = \{w_1, \ldots, w_k\}$, keeping one constant while optimizing the other, thus performing a coordinate ascent.
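The resulting coordinate ascent can be organised roughly as follows; fit_mixture, e_step and m_step_weights are hypothetical names, and the latter two correspond to the E and M steps derived below (sketches of both follow those derivations).

def fit_mixture(X, t, W, pi, alpha, beta, eta, n_iters=100):
    # alternate an exact E step with a partial M step (one gradient step on the
    # weights plus a closed-form update of the mixing proportions)
    for _ in range(n_iters):
        Q = e_step(X, t, W, pi, alpha, beta)        # responsibilities, Eq. (3)
        pi = Q.sum(axis=0) / len(X)                 # mixing proportions, Eq. (4)
        W = m_step_weights(X, t, W, Q, beta, eta)   # gradient step, Eq. (5)
    return W, pi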
E step: In the E step we optimize $\mathcal{F}(Q, \Theta_w)$ with respect to each $Q_i(h_i)$. Since $Q_i(\cdot)$ is a probability distribution we use a Lagrange multiplier $\lambda_i$ to enforce normalization. Taking derivatives of $\mathcal{F}$ with respect to each $Q_i(h_i)$ yields
$$\frac{\partial \mathcal{F}(Q, \Theta_w)}{\partial Q_i(h_i)} = \log P(x_i, t_i, h_i \mid \Theta_w) - \log Q_i(h_i) + 1 - \lambda_i,$$
which, upon setting to zero, finding the extremum, and solving for $\lambda_i$, yields
$$Q_i(h_i = j) = \frac{\pi_j\, P(x_i, t_i \mid h_i = j, \Theta_w)}{\sum_{l=1}^{k} \pi_l\, P(x_i, t_i \mid h_i = l, \Theta_w)}, \qquad (3)$$
where $\pi_j = P(h_i = j \mid \Theta_w)$ is the prior mixing proportion associated with neuron $j$.
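A sketch of this E step, reusing the hypothetical energy_ij helper sketched in Section 3:

import numpy as np

def e_step(X, t, W, pi, alpha, beta):
    # Q_i(h_i = j) proportional to pi_j * exp(-E_ij), normalised over j (Eq. 3)
    n, k = len(X), len(W)
    Q = np.zeros((n, k))
    for i in range(n):
        for j in range(k):
            Q[i, j] = pi[j] * np.exp(-energy_ij(X[i], t[i], W, j, alpha, beta))
        Q[i] /= Q[i].sum()
    return Q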
M step: From Equation (2) we have that
$$\mathcal{F}(Q, \Theta_w) = \sum_{i=1}^{n} \sum_{j=1}^{k} Q_i(j) \log \frac{\pi_j\, P(x_i, t_i \mid h_i = j, \Theta_w)}{Q_i(j)}.$$
In the M step we maximize $\mathcal{F}(Q, \Theta_w)$ with respect to each of the weights $w_j$ and the priors $\pi_j$. To find the optimal prior $\pi$ we differentiate with respect to each of its elements $\pi_j$ while enforcing normalization with a Lagrange multiplier $\lambda_0$, resulting in $\sum_{i=1}^{n} \frac{Q_i(j)}{\pi_j} + \lambda_0 = 0 \Rightarrow \pi_j = -\frac{\sum_{i=1}^{n} Q_i(j)}{\lambda_0}$. From normalization, $\lambda_0 = -\sum_{j=1}^{k} \sum_{i=1}^{n} Q_i(j) = -n$, so to maximize $\mathcal{F}(Q, \Theta_w)$ we set
$$\pi_j = \frac{\sum_{i=1}^{n} Q_i(j)}{n}. \qquad (4)$$
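With the responsibilities stored in an n-by-k array Q (as returned by the hypothetical e_step sketched above), Equation (4) reduces to a single line:

# Equation (4): each mixing proportion is the average responsibility of its neuron
pi = Q.sum(axis=0) / Q.shape[0]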
Now to maximize $\mathcal{F}(Q, \Theta_w)$ with respect to each of the weights $w_j$ we use gradient ascent. The gradient of $\mathcal{F}(Q, \Theta_w)$ with respect to each $w_j$ is given by
$$\frac{\partial \mathcal{F}(Q, \Theta_w)}{\partial w_j} = \sum_{i=1}^{n} Q_i(j)\, \frac{\partial \log P(x_i, t_i \mid h_i = j, \Theta_w)}{\partial w_j} = -\sum_{i=1}^{n} Q_i(j)\, \frac{\partial E_{i,j}}{\partial w_j} + c_0,$$
where $c_0$ is some constant. With learning rate $\eta$, the $j$th neuron is updated by following the negative gradient of the cost:
$$\Delta w_j = \eta \sum_{i=1}^{n} Q_i(j)\,\bigl(t_i - f(x_i; w_j)\bigr)\bigl(1 - f(x_i; w_j)\bigr) f(x_i; w_j)\, x_i \;-\; \eta\beta \sum_{l \neq j} \frac{w_j^\top w_l}{|w_j|\,|w_l|} \cdot \frac{w_l - \frac{w_j^\top w_l}{|w_j|^2}\, w_j}{|w_j|\,|w_l|}. \qquad (5)$$
Because we take only a single gradient step rather than maximizing $\mathcal{F}$ exactly with respect to the weights, the M step is partial.
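A sketch of this partial M step, with the gradient of the squared-cosine penalty written out as in Equation (5); the names are hypothetical, and the weights are updated simultaneously against their old values, consistent with the batch scheme described above.

import numpy as np

def m_step_weights(X, t, W_old, Q, beta, eta):
    # one simultaneous gradient step per neuron (partial M step), Equation (5)
    W_new = []
    for j, wj in enumerate(W_old):
        grad = np.zeros_like(wj)
        # data term: responsibility-weighted logistic squared-error gradient
        for i in range(len(X)):
            fij = 1.0 / (1.0 + np.exp(-X[i] @ wj))
            grad += Q[i, j] * (t[i] - fij) * (1.0 - fij) * fij * X[i]
        # competition term: push w_j towards orthogonality with the other neurons
        for l, wl in enumerate(W_old):
            if l == j:
                continue
            nj, nl = np.linalg.norm(wj), np.linalg.norm(wl)
            cos_jl = (wj @ wl) / (nj * nl)
            grad -= beta * cos_jl * (wl - (wj @ wl) / nj ** 2 * wj) / (nj * nl)
        W_new.append(wj + eta * grad)
    return W_new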
4 Performance Evaluation
We demonstrate the working of our proposed method on
some 2-d toy examples. Figure 2(a) shows the classification
of 2-d data in which class 1 is a Gaussian with mean zero
and unit spherical covariance and class 2 is made of 2 Gaus-
sians with unit spherical covariances and means (2,0) and
(0,2). This data set is successfully classified using a mixture
of just two neurons. Figure 2(b) shows a binary classifica-
tion problem where class 1 consists of points drawn from a
Gaussian of mean 0and unit spherical covariance and is sur-
rounded in a circular fashion by class 2 points. The Bayes
boundary is shown as a circle of radius 1.5, as well as the
boundary obtained using our proposed approach with just 4
neurons; we have effectively used the 4 neurons to form a
closed boundary resembling the required boundary.
Lastly, one of the compelling reasons to design neural
networks with hidden layers is to be able to classify the
XOR function. We show that we can achieve this with a
mixture of 4 neurons. Figure 2(c) shows the hyperplanes
laid down by the 4 neurons and Figure 2(d) shows the deci-
sion boundary formed by this mixture of neurons. We see
that the simple mixture of neurons is able to discern even
this nonlinear boundary well.
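For reference, a small sketch of how such a toy XOR data set can be generated and how the final prediction can be formed as the mixture-weighted expected output of the trained neurons; the data-generation details (cluster means, spreads, sample counts) are our own assumptions, not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

# XOR-like toy data: four Gaussian clusters; opposite corners share a label
means = [(-2, -2), (2, 2), (-2, 2), (2, -2)]
labels = [0, 0, 1, 1]
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in means])
t = np.repeat(labels, 50)
X = np.hstack([X, np.ones((len(X), 1))])   # append the constant bias input

def predict(X, W, pi):
    # expected output of the mixture of neurons, thresholded for a hard label
    p = sum(pi[j] / (1.0 + np.exp(-X @ W[j])) for j in range(len(W)))
    return (p > 0.5).astype(int)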
To test the performance of the proposed mixture of neu-
rons, we used four 2-class real datasets from the UCI Ma-
chine Learning repository [3]: PIMA, WDBC, ION and
BUPA (see Table 1 for dimensionalities and numbers of
instances). In all the experiments we used a randomly se-
lected 60% of the data for training and 40% for testing. Ta-
ble 1 summarizes the accuracies of the proposed method
against Gaussian-kernel SVMs, polynomial-kernel SVMs
and a simple backpropagation neural network. The results
are averaged over 20 trials and in each trial the same train-
ing and testing data was given to all four classifiers. First,
we see that our method with competition between classi-
fiers ($\beta > 0$) always beats a simple combination of classifiers ($\beta = 0$). We also see that the proposed method outper-
forms neural networks in all cases, and significantly beats
polynomial- and Gaussian-kernel SVMs in (a different) 3
out of the 4 data sets. Further, we note the stability of our
method as evidenced in its consistently low standard error.
Finally we note that all algorithms had their hyperparame-
ters individually tuned to report their best possible results:
in our proposed method the $\alpha$ and $\beta$ parameters were set to 0.03 and 0.01, respectively. According to (1), $\alpha$ and $\beta$ trade
off the costs of classification and orthogonality of classifier
boundaries, respectively, and this ratio of 3 : 1 was found to
be optimal. For the experiments above, a mixture of 16 neu-
rons was found to be effective (through cross-validation).
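The evaluation protocol described above (random 60%/40% splits, accuracies averaged over 20 trials, standard error of the mean) can be sketched as follows; train_and_test is a hypothetical callable standing in for any of the compared classifiers and returning held-out accuracy.

import numpy as np

def evaluate(X, t, train_and_test, n_trials=20, train_frac=0.6, seed=0):
    # average held-out accuracy over repeated random 60%/40% splits
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_trials):
        idx = rng.permutation(len(X))
        cut = int(train_frac * len(X))
        tr, te = idx[:cut], idx[cut:]
        accs.append(train_and_test(X[tr], t[tr], X[te], t[te]))
    accs = np.asarray(accs)
    return accs.mean(), accs.std() / np.sqrt(n_trials)   # mean and standard error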
5 Related Work
In [6] a hierarchical mixture of experts is proposed, in
which a probabilistic formulation of the experts is trained
using an EM algorithm. A precursor to the work in [6]
is a model described in [5], which is most similar to our
proposed method. The authors even suggest the idea of
using a cost function that introduces competition be-
tween the classifiers, but do not elaborate on this or per-
form such experiments. In our approach we do indeed find
that our cosine squared cost function does help in creating
a competition between the classifiers. We have also deter-
mined that there is an intriguing link between our proposed
classifier and margin-based classifiers such as that pro-
posed in [4]. This link becomes clear when we use the pro-
posed approach for a linearly separable classification prob-
lem where, apart from the mean squared error term, we
use the cosine squared term to update only the bias term
of the weight of the classifiers. In this case all the neu-
rons have approximately the same orientation and the co-
sine squared term for the bias pushes the neurons as far
[Figure 2: panels (a)-(d), two-dimensional scatter plots with both axes spanning -4 to 4.]
Figure 2. Demonstrative synthetic examples (see text for key)
Table 1. Comparison of the proposed model with state-of-the-art approaches (test accuracy, %)

Dataset   dim.   instances   Back-prop       SVM (Gaussian)   SVM (polynomial)   Mix. of Neurons      Mix. of Neurons
                                                                                 (no competition)     (with competition)
PIMA        8       768      64.36 ± 0.45    69.12 ± 0.58     75.50 ± 0.39       76.74 ± 0.47         77.69 ± 0.18
WDBC       39       569      96.67 ± 0.15    97.04 ± 0.15     93.42 ± 0.36       97.24 ± 0.21         97.68 ± 0.17
ION        33       353      86.86 ± 0.69    87.93 ± 2.48     85.00 ± 0.67       87.00 ± 0.33         87.79 ± 0.78
BUPA        6       345      62.61 ± 0.82    67.25 ± 0.77     71.01 ± 0.70       70.22 ± 0.52         70.72 ± 0.72

Note: errors quoted are the standard errors of the mean.
away from each other as possible, while classifying the data
correctly. Therefore the extremal hyperplanes (neurons) ar-
range themselves in a way so as to increase the gap (linear
margin) between them. When all the terms of the weights
of the classifier are updated according to the cosine squared
function, the weight vectors tend to be as different from each other as possible while still classifying the data accurately. Thus the model behaves similarly to a mixture of SVM classifiers such as that proposed in [7]. Therefore, an extension to
our proposed method is to directly formulate the method as
a mixture of margin classifiers and use quadratic program-
ming to find the optimal parameter setting.
6 Conclusion
We have presented a probabilistic mixture of neurons for
binary classification that can tackle the problem of overfit-
ting by combining results of neurons that are as orthogonal
to each other as possible while each still strives to do well on classification. We have shown the effective performance of the
proposed method on both synthetic and real data, and have
shown for the most part superior performance over neural
networks and two types of SVMs. One phenomenon we ob-
serve while performing classification of 2-d data is that oc-
casionally some neurons end up putting hyperplanes at un-
likely locations where no data is found just to cancel out the
effects of other neurons’ errors. This is mainly because the
decision of a single neuron is weighted uniformly all across
the hyperplane. Hence certain neurons seem to be sacri-
ficed to compensate for errors far away from actual data.
One way to counter this problem is to place a distribution over each hyperplane that weights its classification by proximity to the data: near the data a hyperplane contributes strongly, while far from the data its weight, and hence its confidence in the classification, is reduced.
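One possible realisation of such a weighting, offered only as a sketch of the idea rather than anything implemented in this paper, is a Gaussian falloff with distance to the nearest training point:

import numpy as np

def local_confidence(x, X_train, sigma=1.0):
    # down-weight a neuron's vote far from the training data: confidence decays
    # with the distance from x to its nearest training point
    d = np.min(np.linalg.norm(X_train - x, axis=1))
    return np.exp(-0.5 * (d / sigma) ** 2)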
References
[1] L. Breiman. Bagging predictors. Mach. Learn., 24(2):123–
140, 1996.
[2] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum like-
lihood from incomplete data via the EM algorithm. Journal
of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[3] C. L. Blake, D. J. Newman, S. Hettich, and C. J. Merz. UCI repository of machine learning databases, 1998.
[4] Y. Freund and R. E. Schapire. Experiments with a new boost-
ing algorithm. In ICML, pages 148–156, 1996.
[5] R. A. Jacobs, M. I. Jordan, S. Nowlan, and G. Hinton. Adap-
tive mixture of local experts. Neural Comp., 3(1):79–87,
1991.
[6] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of ex-
perts and the EM algorithm. Neural Comp., 6(2):181–214,
1994.
[7] J. T.-Y. Kwok. Support vector mixture for classification and
regression problems. In ICPR, volume 14, page 255, Wash-
ington, DC, USA, 1998. IEEE Computer Society.
[8] R. M. Neal. Bayesian Learning for Neural Networks.
Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1996.