Competitive Mixtures of Simple Neurons
Karthik Sridharan Matthew J. Beal Venu Govindaraju
{ks236,mbeal,govind}@cse.buffalo.edu
Department of Computer Science and Engineering
State University of New York at Buffalo
Buffalo, NY 14260-2000, USA
Abstract
We propose a competitive finite mixture of neurons (or
perceptrons) for solving binary classification problems.
Our classifier includes a prior for the weights between dif-
ferent neurons such that it prefers mixture models made up
from neurons having classification boundaries as orthog-
onal to each other as possible. We derive an EM algo-
rithm for learning the mixing proportions and weights of
each neuron, consisting of an exact E step and a partial M
step, and show that our model covers the regions of high
posterior probability in weight space and tends to reduce
overfitting. We demonstrate the way in which our mixture
classifier works using a toy 2-dimensional data set, show-
ing the effective use of strategically positioned components
in the mixture. We further compare its performance against
SVMs and one-hidden-layer neural networks on four real-
world data sets from the UCI repository, and show that even
a relatively small number of neurons with appropriate com-
petitive priors can achieve superior classification accura-
cies on held-out test data.
1 Introduction
One of the main challenges in the problem of classifica-
tion is to find a hypothesis that does not overfit the train-
ing data; there is always a trade-off between training and
test accuracies. High training accuracy is often indicative
of an overfitted classifier that may lead to poor generaliza-
tion performance on test data. An effective way to alle-
viate the problem of overfitting is to combine the classifi-
cation results of several classifiers. A Bayes-optimal rule
for combining the results of several classifiers is to take a
linear weighted sum of their predictions, weighting by the
posterior probability of each classifier having generated the
training data set. Because it is difficult to sample from the
parameters' posterior distribution even for moderate dimen-
sionalities of parameter space, this weighted prediction is
often approximated using a very large set of bagged or
bootstrap-trained classifiers [1]. An
alternative to such bagged estimates is to use Markov chain
Monte Carlo methods, in which a trajectory is simulated
through the parameter space that converges to a stationary
distribution that is the true posterior distribution [8]. How-
ever, both these methods are computationally intensive.
We propose a solution that uses a much smaller, finite set
of simple neurons, each forming a simple linear boundary,
and combines them in a mixture model formalism to model
various parts of the data. Indeed, such a mixture model has
been used before for classification [5]. However, in contrast
to standard mixture model
approaches, we introduce a penalty into the cost function of
each neuron in the form of a squared cosine term between
the weights of other neurons, such that the model learns
solutions that are not only good in terms of the classifica-
tion performance, but are also encouraged to be different
from each other. Our algorithm can be thought of as a mix-
ture of experts model with competitive penalties between
the experts (as shown in Figure 1). We use an EM algo-
rithm [2] for learning the weights of each of the neurons
and the mixing proportions between them. Since we treat
the loss function of a neuron as a negative log probabil-
ity, we automatically have a probabilistic interpretation and
hence can evaluate posterior responsibilities of neurons for
data given priors on each of them.
2 Preliminaries
Consider a binary classification problem with feature space $\mathcal{X}$ and binary target space $\mathcal{T} = \{0, 1\}$. Given $n$ training samples $\{x_1, \ldots, x_n\}$ from the space $\mathcal{X}$ with binary labels $\{t_1, \ldots, t_n\} \in \mathcal{T}$, our task is to find a function $f : \mathcal{X} \to \mathcal{T}$ mapping a given input to a binary target classification. If we consider a single neuron having a logistic output function, the function $f(x; w) = (1 + e^{-(w^\top x + b)})^{-1}$ is thresholded to obtain the classification, where $w$ is the set of weights for the neuron. One possible general error function that can be minimized during the training of the neuron is
$$E_\alpha(w) = \frac{\alpha}{2} \sum_{i=1}^{n} \big(t_i - f(x_i; w)\big)^2.$$
If we assume that this loss function is the negative log probability of the data, then by this interpretation we are in fact modeling, for each data point $i$, the random variable $t_i - f(x_i; w)$ as a Gaussian distribution with mean zero and precision $\alpha$.
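As a concrete illustration (not part of the original paper), a single logistic neuron and the squared-error cost $E_\alpha(w)$ above can be sketched as follows; the function names and NumPy implementation are our own assumptions.

```python
import numpy as np

def logistic_output(x, w):
    """f(x; w) = 1 / (1 + exp(-w^T x)); the bias b is assumed to be folded
    into w by appending a constant 1 to the input x (see Section 3)."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def squared_error_cost(X, t, w, alpha):
    """E_alpha(w) = (alpha / 2) * sum_i (t_i - f(x_i; w))^2."""
    preds = np.array([logistic_output(x, w) for x in X])
    return 0.5 * alpha * np.sum((t - preds) ** 2)
```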
3 Mixture of Neurons
Consider $k$ neurons with weights given by $w_1, \ldots, w_k$.
Note that without loss of generality we can model a bias
into the neuron by appending a bias weight and augmenting
the inputs with an extra dimension that is fixed to a value of
1. Each neuron in the mixture attempts to reduce its total
squared prediction error over all the data, but we also add
a term in the cost function that explicitly pushes each neu-
ron’s parameters away from the remaining neurons. Thus
our final classifier is the expected output of a set of classi-
fiers, each trained to classify well, but at the same time as
different from each other as possible. This overall cost function for the classifier modeling the $i$th data point is given by
$$E_i = \frac{\alpha}{2} \sum_{j=1}^{k} \big(t_i - f(x_i; w_j)\big)^2 + \frac{\beta}{2} \sum_{j=1}^{k} \sum_{\substack{l=1 \\ l \neq j}}^{k} \cos^2(w_j, w_l) \qquad (1)$$
where the squared cosine term is employed to penalize similar weights in different neurons. This term is important to break the symmetry between the $k$ classifiers, since without it the weights of all the neurons would converge to the same value (modulo undesirable local minima), which is clearly an uninteresting predictive ensemble. By minimizing the inner product in weight space we are effectively making the neuron weights as orthogonal to each other as possible, thus covering the posterior more thoroughly. In data space, since the angle between any two weight vectors is the angle between the corresponding separating hyperplanes, reducing the squared cosine term causes a splaying of the separating hyperplanes to the widest possible angles whilst also striving for good classification capability. If we consider this cost function as the negative log probability of the datum, not only are we trying to model the prediction error with a Gaussian distribution with precision $\alpha$ and mean $0$, but we are also including a prior stating that we believe that the cosine of the angle between the weights of each neuron and those of the other neurons is also drawn from a Gaussian distribution with mean $0$ (orthogonal) and precision $\beta$.
Now let $\Theta_w = \{w_1, \ldots, w_k\}$ denote the setting of the collection of weights for the $k$ neurons. Let the hidden indicator variable $h_i$ taking on value $j$ denote the $i$th datum being modeled by the $j$th neuron with probability $\pi_j$; then the probability of observing datum $(x_i, t_i)$ given the parameters and the indicator $h_i = j$ is
$$P(x_i, t_i \mid h_i = j, \Theta_w) \propto e^{-\frac{\alpha}{2}(t_i - f(x_i; w_j))^2 - \frac{\beta}{2}\sum_{l=1, l\neq j}^{k} \cos^2(w_j, w_l)} = e^{-E_{ij}}.$$
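To make the component model concrete, a minimal sketch (our own, with hypothetical helper names) of the per-datum energy $E_{ij}$ and the unnormalized component likelihood $e^{-E_{ij}}$ is given below; `W` is assumed to be an array whose rows are the neuron weight vectors $w_1, \ldots, w_k$.

```python
import numpy as np

def cos2(wj, wl):
    """Squared cosine of the angle between two weight vectors."""
    c = np.dot(wj, wl) / (np.linalg.norm(wj) * np.linalg.norm(wl))
    return c ** 2

def energy(x, t, j, W, alpha, beta):
    """E_ij: neuron j's squared prediction error on (x, t) plus the
    competitive squared-cosine penalty against the other neurons."""
    f = 1.0 / (1.0 + np.exp(-np.dot(W[j], x)))
    fit = 0.5 * alpha * (t - f) ** 2
    penalty = 0.5 * beta * sum(cos2(W[j], W[l]) for l in range(len(W)) if l != j)
    return fit + penalty

def component_likelihood(x, t, j, W, alpha, beta):
    """Unnormalized P(x, t | h = j, Theta_w), proportional to exp(-E_ij)."""
    return np.exp(-energy(x, t, j, W, alpha, beta))
```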
3.1 EM Algorithm
We now derive in brief an EM algorithm [2] to estimate
the parameters of our model. The EM algorithm we con-
sider uses a batch update of weights, i.e., all the instances are considered together and the parameters of all the neurons are updated simultaneously.

Figure 1. Model of the competitive classifier.

The probability of observing $n$ i.i.d. training samples $\{x_i, t_i\}_{i=1}^{n}$ given the weights $\Theta_w$ and mixing proportions $\pi$ is given by
$$P(\{x_i, t_i\}_{i=1}^{n} \mid \Theta_w, \pi) = \prod_{i=1}^{n} \sum_{j=1}^{k} P(h_i = j \mid \pi)\, P(x_i, t_i \mid h_i = j, \Theta_w),$$
where each $h_i$ is a hidden variable distributed according to the prior mixing proportions $\pi$. Taking the logarithm of this probability, and introducing a set of variational distributions $Q = \{Q_i(h_i)\}_{i=1}^{n}$ over the hidden indicators for each of the data points, using Jensen's inequality we obtain a lower bound on the likelihood of the parameters $\Theta_w$ for the model, which we denote $\mathcal{F}(\Theta_w, \{Q_i(h_i)\})$:
$$\mathcal{L}(\Theta_w) \equiv \log P(x_1, \ldots, x_n, t_1, \ldots, t_n \mid \Theta_w) \geq \sum_{i=1}^{n} \sum_{h_i=1}^{k} Q_i(h_i) \log \frac{P(h_i)\, P(x_i, t_i \mid h_i, \Theta_w)}{Q_i(h_i)} \equiv \mathcal{F}(Q, \Theta_w). \qquad (2)$$
Therefore we have $\mathcal{L}(\Theta_w) \geq \mathcal{F}(Q, \Theta_w)$, and since we cannot directly optimize $\mathcal{L}(\Theta_w)$ we instead optimize $\mathcal{F}(Q, \Theta_w)$; this guarantees that the log-likelihood of the data never decreases. The EM algorithm alternately optimizes the distributions $\{Q_i\}$ and the weights $\Theta_w = \{w_1, \ldots, w_k\}$, keeping one constant while optimizing the other, thus performing a coordinate ascent.
E step: In the E step we optimize $\mathcal{F}(Q, \Theta_w)$ with respect to each $Q_i(h_i)$. Since $Q_i(\cdot)$ is a probability distribution we use a Lagrange multiplier $\lambda_i$ to enforce normalization. Taking derivatives of $\mathcal{F}$ with respect to each $Q_i(h_i)$ yields
$$\frac{\partial \mathcal{F}(Q, \Theta_w)}{\partial Q_i(h_i)} = \log P(x_i, t_i, h_i \mid \Theta_w) - \log Q_i(h_i) - 1 + \lambda_i,$$
which upon setting to zero, finding the extremum, and solving for $\lambda_i$, yields
$$Q_i(h_i = j) = \frac{\pi_j\, P(x_i, t_i \mid h_i = j, \Theta_w)}{\sum_{l=1}^{k} \pi_l\, P(x_i, t_i \mid h_i = l, \Theta_w)}, \qquad (3)$$
where $\pi_j = P(h_i = j \mid \Theta_w)$ is the prior mixing proportion associated with neuron $j$.
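A minimal sketch of the E step of Equation (3), under the same assumptions as the earlier sketch (NumPy, `W` holding the weight rows, `pi` the mixing proportions); this is illustrative code, not the authors' implementation.

```python
import numpy as np

def e_step(X, t, W, pi, alpha, beta):
    """Responsibilities Q[i, j] = Q_i(h_i = j) as in Equation (3)."""
    n, k = len(X), len(W)
    norms = np.linalg.norm(W, axis=1)
    Q = np.zeros((n, k))
    for i in range(n):
        for j in range(k):
            f = 1.0 / (1.0 + np.exp(-np.dot(W[j], X[i])))
            penalty = sum((np.dot(W[j], W[l]) / (norms[j] * norms[l])) ** 2
                          for l in range(k) if l != j)
            E_ij = 0.5 * alpha * (t[i] - f) ** 2 + 0.5 * beta * penalty
            Q[i, j] = pi[j] * np.exp(-E_ij)
        Q[i, :] /= Q[i, :].sum()  # normalization enforced by the Lagrange multiplier
    return Q
```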
M step: From Equation (2) we have that
$$\mathcal{F}(Q, \Theta_w) = \sum_{i=1}^{n} \sum_{j=1}^{k} Q_i(j) \log \frac{\pi_j\, P(x_i, t_i \mid h_i = j, \Theta_w)}{Q_i(j)}.$$
In the M step we maximize $\mathcal{F}(Q, \Theta_w)$ with respect to each of the weights $w_j$ and the priors $\pi_j$. To find the optimal prior $\pi$ we differentiate with respect to each of its elements $\pi_j$ while enforcing normalization with a Lagrange multiplier $\lambda_0$, resulting in
$$\sum_{i=1}^{n} \frac{Q_i(j)}{\pi_j} + \lambda_0 = 0 \;\Rightarrow\; \pi_j = -\frac{\sum_{i=1}^{n} Q_i(j)}{\lambda_0}.$$
From normalization, $\lambda_0 = -\sum_{j=1}^{k} \sum_{i=1}^{n} Q_i(j) = -n$, so to maximize $\mathcal{F}(Q, \Theta_w)$ we set
$$\pi_j = \frac{\sum_{i=1}^{n} Q_i(j)}{n}. \qquad (4)$$
Now, to maximize $\mathcal{F}(Q, \Theta_w)$ with respect to each of the weights $w_j$, we use gradient ascent. The gradient of $\mathcal{F}(Q, \Theta_w)$ w.r.t. each of the weights $w_j$ is given by
$$\frac{\partial \mathcal{F}(Q, \Theta_w)}{\partial w_j} = \sum_{i=1}^{n} Q_i(j) \frac{\partial \log P(x_i, t_i \mid h_i = j, \Theta_w)}{\partial w_j} = -\sum_{i=1}^{n} Q_i(j) \frac{\partial E_{i,j}}{\partial w_j} + c_0,$$
where $c_0$ is some constant. With learning rate $\eta$, the $j$th neuron is updated following the negative gradient of the cost:
$$\Delta w_j = \eta \sum_{i=1}^{n} Q_i(j)\,\big(t_i - f(x_i; w_j)\big)\big(1 - f(x_i; w_j)\big) f(x_i; w_j)\, x_i \;-\; \eta \beta \sum_{l \neq j} \frac{w_j^\top w_l}{|w_j|\,|w_l|} \cdot \frac{w_l - w_j \frac{w_j^\top w_l}{|w_j|^2}}{|w_j|\,|w_l|}. \qquad (5)$$
Because we only follow the gradient rather than maximizing exactly, the M step is partial.
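The partial M step can then be sketched as follows (again our own illustrative code, not the paper's): the mixing proportions are updated in closed form via Equation (4), and each neuron's weights take a single gradient step following Equation (5).

```python
import numpy as np

def m_step(X, t, W, Q, eta, beta):
    """Partial M step: closed-form update of pi (Eq. 4) and one gradient
    step on each neuron's weights (Eq. 5). W has one weight vector per row."""
    n, k = Q.shape
    pi = Q.sum(axis=0) / n                       # Equation (4)
    W_new = W.copy()
    for j in range(k):
        grad = np.zeros_like(W[j])
        # data-fit term: sum_i Q_i(j) (t_i - f)(1 - f) f x_i
        for i in range(n):
            f = 1.0 / (1.0 + np.exp(-np.dot(W[j], X[i])))
            grad += Q[i, j] * (t[i] - f) * (1.0 - f) * f * X[i]
        # competitive term: derivative of the squared-cosine penalty
        for l in range(k):
            if l == j:
                continue
            nj, nl = np.linalg.norm(W[j]), np.linalg.norm(W[l])
            cos_jl = np.dot(W[j], W[l]) / (nj * nl)
            grad -= beta * cos_jl * (W[l] - W[j] * np.dot(W[j], W[l]) / nj**2) / (nj * nl)
        W_new[j] = W[j] + eta * grad             # single step, hence a partial M step
    return W_new, pi
```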
4 Performance Evaluation
We demonstrate the working of our proposed method on
some 2-d toy examples. Figure 2(a) shows the classification
of 2-d data in which class 1 is a Gaussian with mean zero
and unit spherical covariance and class 2 is made of 2 Gaus-
sians with unit spherical covariances and means (2,0) and
(0,2). This data set is successfully classified using a mixture of just two neurons. Figure 2(b) shows a binary classification problem where class 1 consists of points drawn from a Gaussian with mean 0 and unit spherical covariance and is sur-
rounded in a circular fashion by class 2 points. The Bayes
boundary is shown as a circle of radius 1.5, as well as the
boundary obtained using our proposed approach with just 4
neurons; we have effectively used the 4 neurons to form a
closed boundary resembling the required boundary.
Lastly, one of the compelling reasons to design neural
networks with hidden layers is to be able to classify the
XOR function. We show that we can achieve this with a
mixture of 4 neurons. Figure 2(c) shows the hyperplanes
laid down by the 4 neurons and Figure 2(d) shows the deci-
sion boundary formed by this mixture of neurons. We see
that the simple mixture of neurons is able to discern even
this nonlinear boundary well.
To test the performance of the proposed mixture of neu-
rons, we used four 2-class real datasets from the UCI Ma-
chine Learning repository [3]: PIMA, WDBC, ION and
BUPA (see Table 1 for dimensionalities and numbers of
instances). In all the experiments we used randomly se-
lected 60% of the data for training and 40% for testing. Ta-
ble 1 summarizes the accuracies of the proposed method
against Gaussian-kernel SVMs, polynomial-kernel SVMs
and a simple backpropagation neural network. The results
are averaged over 20 trials and in each trial the same train-
ing and testing data was given to all four classifiers. First,
we see that our method with competition between classi-
fiers (β > 0) always beats a simple combination of classi-
fiers (β = 0). We also see that the proposed method outper-
forms neural networks in all cases, and significantly beats
polynomial- and Gaussian-kernel SVMs in (a different) 3
out of the 4 data sets. Further, we note the stability of our
method as evidenced in its consistently low standard error.
Finally we note that all algorithms had their hyperparame-
ters individually tuned to report their best possible results:
in our proposed method the α and β parameters were set to 0.03 and 0.01 respectively. According to (1), α and β trade off the costs of classification and orthogonality of classifier boundaries, respectively, and this ratio of 3:1 was found to
be optimal. For the experiments above, a mixture of 16 neu-
rons was found to be effective (through cross-validation).
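The evaluation protocol described above (random 60/40 splits repeated over 20 trials, with every classifier in a trial seeing the same split) could be implemented roughly as below; the classifier factory objects and their `fit`/`score` methods are hypothetical stand-ins for the actual models compared in Table 1.

```python
import numpy as np

def evaluate(X, t, make_classifiers, n_trials=20, train_frac=0.6, seed=0):
    """Average test accuracy over repeated random 60/40 splits; every
    classifier in a trial sees the same training and testing data.
    X and t are NumPy arrays; make_classifiers maps name -> factory."""
    rng = np.random.default_rng(seed)
    n = len(X)
    accs = {name: [] for name in make_classifiers}
    for _ in range(n_trials):
        perm = rng.permutation(n)
        n_train = int(train_frac * n)
        tr, te = perm[:n_train], perm[n_train:]
        for name, make in make_classifiers.items():
            clf = make()                  # fresh, untrained classifier
            clf.fit(X[tr], t[tr])         # hypothetical interface
            accs[name].append(clf.score(X[te], t[te]))
    # report mean accuracy and standard error of the mean, as in Table 1
    return {name: (np.mean(a), np.std(a) / np.sqrt(n_trials))
            for name, a in accs.items()}
```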
5 Related Work
In [6] a hierarchical mixture of experts is proposed, in
which a probabilistic formulation of the experts is trained
using an EM algorithm. A precursor to the work in [6]
is a model described in [5], which is most similar to our
proposed method. The authors even suggest the idea of using a cost function that introduces competition be-
tween the classifiers, but do not elaborate on this or per-
form such experiments. In our approach we do indeed find
that our cosine squared cost function does help in creating
a competition between the classifiers. We have also deter-
mined that there is an intriguing link between our proposed
classifier and the margin-based classifiers such as that pro-
posed in [4]. This link becomes clear when we use the pro-
posed approach for a linearly separable classification prob-
lem where, apart from the mean squared error term, we
use the cosine squared term to update only the bias term
of the weight of the classifiers. In this case all the neu-
rons have approximately the same orientation and the co-
sine squared term for the bias pushes the neurons as far
Figure 2. Demonstrative synthetic examples (see text for key)
Table 1. Comparison of proposed model with state-of-the-art approaches

Dataset | dim. | instances | Back-prop | SVM (Gaussian) | SVM (polynomial) | Mix. of Neurons (no competition) | Mix. of Neurons (with competition)
PIMA    |  8   | 768       | 64.36 ± 0.45 | 69.12 ± 0.58 | 75.50 ± 0.39 | 76.74 ± 0.47 | 77.69 ± 0.18
WDBC    | 39   | 569       | 96.67 ± 0.15 | 97.04 ± 0.15 | 93.42 ± 0.36 | 97.24 ± 0.21 | 97.68 ± 0.17
ION     | 33   | 353       | 86.86 ± 0.69 | 87.93 ± 2.48 | 85.00 ± 0.67 | 87.00 ± 0.33 | 87.79 ± 0.78
BUPA    |  6   | 345       | 62.61 ± 0.82 | 67.25 ± 0.77 | 71.01 ± 0.70 | 70.22 ± 0.52 | 70.72 ± 0.72

Note: accuracies are percentages; errors quoted are the standard errors of the mean.
away from each other as possible, while classifying the data
correctly. Therefore the extremal hyperplanes (neurons) ar-
range themselves in a way so as to increase the gap (linear
margin) between them. When all the terms of the weights
of the classifier are updated according to the cosine squared
function, the weight terms tend to be as different from each
other as possible yet classifying the data accurately. Thus
the model has similar behavior to a mixture of SVM classi-
fiers like that proposed in [7]. Therefore, an extension to
our proposed method is to directly formulate the method as
a mixture of margin classifiers and use quadratic program-
ming to find the optimal parameter setting.
6 Conclusion
We have presented a probabilistic mixture of neurons for
binary classification that can tackle the problem of overfit-
ting by combining results of neurons that are as orthogonal
to each other as possible yet each strives to do well on clas-
sification. We have shown the effective performance of the
proposed method on both synthetic and real data, and have
shown for the most part superior performance over neural
networks and two types of SVMs. One phenomenon we ob-
serve while performing classification of 2-d data is that oc-
casionally some neurons end up putting hyperplanes at un-
likely locations where no data is found just to cancel out the
effects of other neurons’ errors. This is mainly because the
decision of a single neuron is uniformly weighted all across the hyperplane. Hence certain neurons seem to be sacrificed to compensate for errors far away from the actual data. One way to counter this problem is to weight the classification along each hyperplane with a distribution, such that the hyperplanes carry more weight near the data and less weight far from it, thus decreasing their confidence about classification away from the data points.
References
[1] L. Breiman. Bagging predictors. Mach. Learn., 24(2):123–
140, 1996.
[2] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum like-
lihood from incomplete data via the EM algorithm. Journal
of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[3] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998.
[4] Y. Freund and R. E. Schapire. Experiments with a new boost-
ing algorithm. In ICML, pages 148–156, 1996.
[5] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Comp., 3(1):79–87, 1991.
[6] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of ex-
perts and the EM algorithm. Neural Comp., 6(2):181–214,
1994.
[7] J. T.-Y. Kwok. Support vector mixture for classification and
regression problems. In ICPR, volume 14, page 255, Wash-
ington, DC, USA, 1998. IEEE Computer Society.
[8] R. M. Neal. Bayesian Learning for Neural Networks.
Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1996.
We present a tree-structured architecture for supervised learning. The statistical model underlying the architecture is a hierarchical mixture model in which both the mixture coefficients and the mixture components are generalized linear models (GLIMs). Learning is treated as a maximum likelihood problem; in particular, we present an expectation-maximization (EM) algorithm for adjusting the parameters of the architecture. We also develop an online learning algorithm in which the parameters are updated incrementally. Comparative simulation results are presented in the robot dynamics domain.