Discriminant Parallel Perceptrons
Ana González, Iván Cantador and José R. Dorronsoro*
Depto. de Ingeniería Informática and Instituto de Ingeniería del Conocimiento
Universidad Autónoma de Madrid, 28049 Madrid, Spain
Abstract. Parallel perceptrons (PPs), a novel approach to committee
machine training that requires minimal communication between outputs and
hidden units, allow the construction of efficient and stable nonlinear
classifiers. In this work we explore how to improve their performance by
allowing their output weights to take real values, computed by applying
Fisher's linear discriminant analysis to the committee machine's perceptron
outputs. We shall see that the final performance of the resulting classifiers
is comparable to that of multilayer perceptrons, which are more complex and
costlier to train.
* With partial support of Spain's CICyT, projects TIC 01-572, TIN2004-07676.
1 Introduction
After their heyday in the early sixties, interest in machines made up of
Rosenblatt's perceptrons greatly decayed. The main reason for this was the lack
of suitable training methods: even if perceptron combinations could provide
complex decision boundaries, there were no efficient and robust procedures for
constructing them. An example of this are the well known committee machines
(CM; [4], chapter 6) for 2-class classification problems. They are made up of
an odd number $H$ of standard perceptrons, the output of the $i$-th perceptron
$P_i(X)$ over a $D$-dimensional input pattern $X$ being given by
$P_i(X) = s(\mathrm{act}_i(X))$ (we assume $x_D = 1$ for bias purposes). Here
$s(\cdot)$ denotes the sign function and $\mathrm{act}_i(X) = W_i \cdot X$ is
the activation of $P_i$ on $X$. The CM output is then
$$h(X) = s\left(\sum_{i=1}^{H} P_i(X)\right) = s(V(X)),$$
i.e., the sign of the overall perceptron vote count $V(X)$. Assuming that each
$X$ has a class label $y_X = \pm 1$, $X$ is correctly classified if
$y_X h(X) = 1$. If not, CM training applies Rosenblatt's rule
$$W_i := W_i + \eta\, y_X X \qquad (1)$$
to the smallest number of incorrect perceptrons whose correction changes the
vote (this number is $(1 + |V(X)|)/2$); moreover, this is done for those
incorrect perceptrons for which $|\mathrm{act}_i(X)|$ is smallest. Although
sensible, this training is somewhat unstable and builds only moderately strong
classifiers. A simple but powerful variant of classical CM training, the
so-called parallel perceptrons (PPs), recently introduced by Auer et al. in
[1], allows a very fast construction of more powerful classifiers, with
capabilities close to those of the more complex (and costlier to train)
multilayer perceptrons (MLPs). In PP training, (1) is applied to all wrong
perceptrons but
the key ingredient of PP training is an output stabilization procedure that
tries to keep the activation $\mathrm{act}_i(X)$ of a correct $P_i$ away from
0, so that small random changes in $X$ do not cause it to be assigned to
another class. More precisely, when $X$ is correctly classified, but for a
given margin $\gamma$ and a perceptron $P_i$ we have
$0 < y_X \mathrm{act}_i(X) < \gamma$, Rosenblatt's rule is essentially applied
again in order to push $y_X \mathrm{act}_i(X)$ further away from zero. The
value of the margin $\gamma$ is also adjusted dynamically so that most of the
correctly classified patterns have activation margins greater than the final
$\gamma^*$ (see section 2). In spite of their very simple structure, PPs do
have a universal approximation property and, as shown in [1], provide results
in classification and regression problems quite close to those offered by C4.5
decision trees or MLPs.
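The classical committee machine step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names (`sign`, `cm_output`, `cm_train_step`), the convention $s(0) = +1$ and the default learning rate are assumptions made here, not part of the original text.

```python
import numpy as np

def sign(t):
    """Sign function s(.) with values in {-1, +1}; s(0) = +1 is an assumption."""
    return np.where(np.asarray(t) >= 0, 1, -1)

def cm_output(W, x):
    """Committee machine output for weights W (H x D) and a pattern x (D,).

    Returns the hypothesis h(X), the vote count V(X) and the activations.
    """
    acts = W @ x                      # act_i(X) = W_i . X
    V = int(sign(acts).sum())         # V(X) = sum_i P_i(X); odd because H is odd
    return (1 if V > 0 else -1), V, acts

def cm_train_step(W, x, y, eta=0.1):
    """One classical CM step on pattern x with label y in {-1, +1}."""
    h, V, acts = cm_output(W, x)
    if y * h == 1:
        return W                      # correctly classified: nothing to do
    wrong = np.flatnonzero(y * sign(acts) < 0)
    n_fix = (1 + abs(V)) // 2         # fewest perceptrons whose flip changes h
    # among the wrong perceptrons, pick those with the smallest |act_i(X)|
    chosen = wrong[np.argsort(np.abs(acts[wrong]))[:n_fix]]
    W_new = W.copy()
    W_new[chosen] += eta * y * x      # Rosenblatt's rule (1)
    return W_new
```

In PP training, by contrast, rule (1) would be applied to every entry of `wrong` rather than only to the `n_fix` closest ones.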
There is much work being done in computational learning theory to build
efficient classifiers based on low complexity information processing methods.
This is particularly important for high dimensionality problems, such as those
arising in text mining or bioinformatics. As just mentioned, PPs combine simple
processing with good performance. A natural way to try to get a richer behavior
is to relax the clamping of their output weights to 1, allowing these weights
to take real values. In fact, PP performance usually does not depend on the
number of perceptrons used, 3 being typically good enough. For classification
problems, a natural option for choosing these weights, which we shall explore
in this work, is to use standard linear discriminant analysis. We shall briefly
describe in section 2 the training of these discriminant PPs as well as their
handling of margins, while in section 3 we will numerically analyze their
performance over several classification problems, comparing it to that of
standard PPs and MLPs. As we shall see, discriminant PPs give results somewhat
better than those of standard PPs and essentially similar to those of MLPs.
2 Discriminant PPs
We discuss first perceptron weight and margin updates. Assume that a set
$W = (W_1, \ldots, W_H)$ of perceptron weights and a vector
$A = (a_1, \ldots, a_H)^t$ of Fisher's weights have been computed. The output
hypothesis of the resulting discriminant PP is
$$h(X) = s\left(A \cdot (P(X) - \tilde{P})\right),$$
with $\tilde{P} = (P_+ + P_-)/2$ and $P_\pm$ the averages of the perceptron
outputs over the positive and negative classes. We assume that the sign of the
$A$ vector has been adjusted so that a pattern $X$ is correctly classified if
$y_X h(X) = 1$. Now,
$$|(P_\pm)_i| = \left|\frac{1}{N_\pm} \sum_{X' \in C_\pm} P_i(X')\right| \le \frac{1}{N_\pm} \sum_{X' \in C_\pm} |P_i(X')| = 1,$$
with $N_\pm$ the sizes of the positive and negative classes $C_\pm$. We can
expect in fact that $|(P_\pm)_i| < 1$ and hence $|\tilde{P}_i| < 1$ too.
Therefore, $y_X a_i (P_i(X) - \tilde{P}_i) > 0$ if and only if
$y_X a_i P_i(X) > 0$, and if $X$ is not correctly classified, we should
increase $y_X a_i P_i(X)$ over those wrong perceptrons for which
$y_X a_i P_i(X) < 0$. This is equivalent to increasing
$y_X a_i \mathrm{act}_i(X) = y_X a_i W_i \cdot X$, which can be simply achieved
by using again Rosenblatt's rule (1) adjusted in terms of $A$:
$$W_i := W_i + \eta\, s(y_X a_i) X, \qquad (2)$$
for then we have
$$y_X a_i (W_i + \eta\, s(y_X a_i) X) \cdot X = y_X a_i W_i \cdot X + \eta\, |y_X a_i|\, |X|^2 > y_X a_i W_i \cdot X.$$
Fig. 1. Margin evolution for the thyroid (left) and diabetes datasets. Values
depicted are 10 times 10-fold cross-validation averages of 500 iteration
training runs.
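As a hedged illustration of update (2), the following NumPy sketch applies it to the perceptrons that vote against the label and lets one check numerically that $y_X a_i W_i \cdot X$ grows by $\eta |a_i| |X|^2$. The function name `disc_pp_update`, the use of hard sign outputs and the default learning rate are assumptions of this sketch.

```python
import numpy as np

def disc_pp_update(W, a, x, y, eta=0.1):
    """Update (2) on a misclassified pattern x with label y in {-1, +1}.

    W has shape (H, D) and a holds the H Fisher weights. The caller is
    assumed to have already checked that y * h(x) != 1.
    """
    P = np.where(W @ x >= 0, 1.0, -1.0)        # perceptron outputs P_i(X)
    wrong = y * a * P < 0                      # perceptrons with y a_i P_i(X) < 0
    W_new = W.copy()
    # W_i := W_i + eta * s(y a_i) * X, applied only to the wrong perceptrons
    W_new[wrong] += eta * np.sign(y * a[wrong])[:, None] * x
    return W_new
```

For instance, with `W = [[-1, 0.5], [1, 0.2]]`, `a = [2, -0.5]`, `x = [1, 1]` and `y = 1`, both perceptrons are wrong and each value of `y * a * (W @ x)` increases by `eta * abs(a) * (x @ x)`.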
On the other hand, the margin stabilization of discriminant PPs is essentially
that of standard PPs. More precisely, if $X$ is correctly classified, a correct
perceptron $P_i$ verifies $y_X a_i P_i(X) > 0$ and thus
$s(y_X a_i)\,\mathrm{act}_i(X) > 0$, which we want to remain positive after
small perturbations of $X$. For this we may again apply (2), now in the form
$W_i := W_i + \lambda \eta\, s(y_X a_i) X$, to those correct perceptrons with a
too small margin, i.e., those for which
$0 < s(y_X a_i)\,\mathrm{act}_i(X) < \gamma$, so that we push
$s(y_X a_i)\,\mathrm{act}_i(X)$ further away from zero. The new parameter
$\lambda$ measures the importance we give to wide margins. The value of the
margin $\gamma$ is also adjusted dynamically from a starting value $\gamma_0$.
More precisely, at the beginning of the $t$-th batch pass, we set
$\gamma_t = \gamma_{t-1}$; then, if a pattern $X$ is processed correctly, we
set $\gamma_t := \gamma_t + 0.25\eta$ if all perceptrons $P_i$ that process $X$
correctly also verify $s(y_X a_i)\,\mathrm{act}_i(X) \ge \gamma_{t-1}$, while
we set $\gamma_t := \gamma_t - 0.75\eta$ if for at least one $P_i$ we have
$0 < s(y_X a_i)\,\mathrm{act}_i(X) < \gamma_{t-1}$. These $\gamma_t$ usually
converge stably to a limit margin $\gamma^*$ (see figure 1). We normalize the
$W_i$ weights after each batch pass so that the margin is meaningful. We also
adjust the learning rate as $\eta_t = \eta_0/\sqrt{t}$ after each batch pass,
as suggested in [1].
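The margin handling just described might be sketched as follows. Several details are assumptions of this sketch rather than statements of the text: the names `margin_step` and `end_of_pass`, the use of $\gamma_{t-1}$ as the threshold for the weight push, and the choice of normalizing each $W_i$ to unit Euclidean norm.

```python
import numpy as np

def margin_step(W, a, x, y, gamma_prev, gamma_t, eta, lam=1.0):
    """Margin handling for one correctly classified pattern x.

    gamma_prev is gamma_{t-1}; gamma_t is the running value during pass t.
    Returns the (possibly updated) weights and the new gamma_t.
    """
    s = np.sign(y * a)
    m = s * (W @ x)                         # s(y a_i) act_i(X)
    small = (m > 0) & (m < gamma_prev)      # correct, but inside the margin
    if small.any():
        gamma_t -= 0.75 * eta               # at least one margin is too small
        W = W.copy()
        W[small] += lam * eta * s[small][:, None] * x
    else:
        gamma_t += 0.25 * eta               # all correct margins are wide enough
    return W, gamma_t

def end_of_pass(W, eta0, t):
    """After batch pass t: normalize each W_i (assumed: unit norm) and decay eta."""
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    return W, eta0 / np.sqrt(t)
```

The asymmetric steps (+0.25η versus −0.75η) mean that γ only keeps growing while clearly fewer patterns fall inside the margin than outside it.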
We recall that for 2-class problems, Fisher's discriminants are very simple
to construct. In fact, the vector $A = S_T^{-1}(P_+ - P_-)$ minimizes [2] the
ratio $J = s_T/s_B = s_T(A)/s_B(A)$ of the total variance $s_T$ of the
discriminant PP outputs to their between class variance $s_B$. However, the
total covariance matrix $S_T$ of the perceptrons' outputs is quite likely to be
singular (notice that the output space for $H$ perceptrons has just $2^H$
distinct values). To avoid this, we will take as the output of perceptron $i$
the value $P_i^\star(X) = \sigma_\gamma(W_i \cdot X)$, with the ramp function
$\sigma_\gamma$ taking the values $\sigma_\gamma(t) = s(t)$ if
$|t| > \lambda = \min(1, 2\gamma)$ and $\sigma_\gamma(t) = t/\lambda$ when
$|t| \le \lambda$ (here $\lambda$ denotes the ramp width, not the margin weight
used in the stabilization update). This makes it quite unlikely that $S_T$ will
be singular and, together with the $\eta$ and $\gamma$ updates, allows for a
fast and quite stable learning convergence.
We finally comment on the complexity of this procedure. For $N$ training
patterns, $D$-dimensional inputs and $H$ perceptrons, Rosenblatt's rule has an
$O(NDH)$ cost. For its part, the computation of the covariance matrix $S_T$ has
an $O(NH^2)$ cost, which dominates the $O(H^3)$ cost of its inversion. While
these are formally similar to the complexity estimates of MLPs, computing times
are much smaller for discriminant PPs (and more so for standard PPs), as their
weight updates are much simpler.
Problem set   size   pos. %   input dim.   num. hid.   lr. rate   num. hid.   lr. rate   num. hid.
heart dis.           46.1     13
Table 1. Input dimensions and training parameters used for the 7 comparison
datasets. MLPs were trained by conjugate gradient minimization.
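The ramp outputs and the Fisher step $A = S_T^{-1}(P_+ - P_-)$ can be sketched as below. The helper names (`ramp`, `fisher_weights`, `fisher_J`, `disc_output`) are hypothetical, and `fisher_J` takes $s_T(A) = A^t S_T A$ and $s_B(A) = (A \cdot (P_+ - P_-))^2$, that is, the between class variance only up to a constant factor that does not change the minimizer; both choices are assumptions of this sketch.

```python
import numpy as np

def ramp(t, gamma):
    """sigma_gamma(t): s(t) if |t| > lam, t / lam otherwise, lam = min(1, 2*gamma)."""
    lam = min(1.0, 2.0 * gamma)
    return np.clip(np.asarray(t, dtype=float) / lam, -1.0, 1.0)

def fisher_weights(P, y):
    """Fisher weights from outputs P (N x H) and a NumPy label array y in {-1, +1}."""
    P_plus = P[y == 1].mean(axis=0)
    P_minus = P[y == -1].mean(axis=0)
    # total covariance of the perceptron outputs (normalized by N)
    S_T = np.atleast_2d(np.cov(P, rowvar=False, bias=True))
    A = np.linalg.solve(S_T, P_plus - P_minus)   # A = S_T^{-1} (P_+ - P_-)
    P_tilde = 0.5 * (P_plus + P_minus)
    return A, P_tilde, S_T

def fisher_J(A, S_T, d):
    """J(A) = s_T(A) / s_B(A), with s_B taken up to a constant factor."""
    return (A @ S_T @ A) / (A @ d) ** 2

def disc_output(P, A, P_tilde):
    """Discriminant PP hypothesis h = s(A . (P - P_tilde)) for each row of P."""
    return np.where((P - P_tilde) @ A >= 0, 1, -1)
```

With this choice of `A`, the projection of $P_+ - \tilde{P}$ is positive whenever $S_T$ is positive definite, so no further sign adjustment of `A` is needed in the sketch. In a full training loop `P` would be `ramp(X @ W.T, gamma)` for a pattern matrix `X`.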
3 Numerical results
We shall compare the performance of discriminant PPs with that of standard
PPs and also of multilayer perceptrons (MLPs) over 7 classification problem
sets from the well known UCI database; they are listed in table 1, together
with their positive class sizes, their input dimensions and the training
parameters used. Some of them (glass, vehicle, thyroid) are multi-class
problems; to reduce them to 2-class problems, we take as the minority classes
class 1 in the vehicle dataset and class 7 in the glass problem, and merge both
sick thyroid classes into a single class. We refer to the UCI database
documentation [3] for more details. In what follows we shall compare the
performance of standard and discriminant PPs and also that of standard
multilayer perceptrons first in terms of accuracy, that is, the percentage of
correctly classified patterns, but also in terms of the value
$g = \sqrt{a^+ a^-}$, where $a^\pm$ are the accuracies of the positive and
negative classes (see [5]). Notice that for sample imbalanced data sets a high
accuracy could be achieved simply by assigning all patterns to the (possibly
much larger) negative class; $g$ gives a more balanced classification
performance measure.
In all cases, training has been carried out as a batch procedure using 10-times
10-fold cross-validation. Updates (1) and (2) have been applied in standard and
discriminant PP training, while conjugate gradient has been used for MLPs.
The number of perceptrons in all cases and the initial learning rates for PPs
and discriminant PPs for each dataset are described in table 1. Table 2
presents average values of the cross-validation procedure just described (best
values in bold face, second best in cursive) for accuracies and $g$ values,
together with their standard deviations. As can be seen, discriminant PPs give
the best accuracy in 3 problems and second best in the other 4, MLPs give the
best accuracy in 3 problems and second best in another 2, while standard PPs'
accuracy is best only in the cancer problem and is second best over the glass
data set. If we consider $g$ values, discriminant PPs' $g$ is highest in 4
problems and second best in the other 3, MLPs' $g$ is highest in 2 problems and
second best in another 4, while standard PPs' is highest only in the cancer
problem. The performance of discriminant PPs and MLPs is thus quite close and
better than that of standard PPs.
(2.16) (2.22) (2.15) (2.22) (1.67) (1.72)
74.97 71.87 74.25
(2.45) (3.98) (3.21) (5.34) (3.09) (4.33)
96.91 92.12 94.26
(2.09) (11.01) (2.09) (11.09) (2.87) (8.52)
79.97 78.95 73.90
(3.80) (3.88) (3.80) (3.88) (4.05) (4.24)
84.06 82.16 76.97
(4.58) (4.54) (3.91) (4.11) (4.19) (4.54)
(0.40) (1.84) (0.91) (9.46) (1.63) (4.41)
(0.41) (3.46) (2.50) (5.00) (4.26) (5.77)
96.10 96.57 96.15 95.84 95.53
diabetes 68.63 76.00 70.45
glass 84.29 94.05 85.27
heart dis. 73.80 75.22 74.68
ionosphere 74.32 84.83 81.36
thyroid 82.55 97.62 94.01
vehicle 65.03 81.51 74.48
Table 2. Accuracy, g test values and their standard deviations for 7 datasets
and different classifier construction procedures. It can be seen that
discriminant PPs' results are comparable to those of MLPs and both are better
than those of standard PPs.
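To make the difference between accuracy and $g$ concrete, here is a small sketch; the function name `accuracy_and_g` is hypothetical and it assumes labels in {-1, +1} with both classes present.

```python
import numpy as np

def accuracy_and_g(y_true, y_pred):
    """Return (accuracy, g), with g = sqrt(a_plus * a_minus).

    a_plus and a_minus are the accuracies over the positive and negative
    classes; both classes are assumed to be present in y_true.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    acc = np.mean(y_true == y_pred)
    a_plus = np.mean(y_pred[y_true == 1] == 1)
    a_minus = np.mean(y_pred[y_true == -1] == -1)
    return acc, np.sqrt(a_plus * a_minus)
```

On a sample with 1 positive and 9 negative patterns, a classifier that always answers -1 reaches an accuracy of 0.9 but a $g$ value of 0, which is the imbalance effect mentioned above.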
We finish this section by noting that the weight update (2) only aims to
reduce the classification error and to achieve a clear margin, and there is no
a priori reason why it should also minimize the Fisher criterion $J$. However,
as seen in figure 2, this also happens. The figure depicts in logarithmic X and
Y scales the evolution of $J$ for the ionosphere, glass, diabetes and thyroid
datasets (clockwise from top left). Values depicted are 10 times 10-fold
cross-validation averages of 500 iteration training runs. Although not always
monotonic (as in the glass, thyroid and diabetes problems), the overall
behavior of $J$ is clearly decreasing.
Fig. 2. From top left, clockwise: evolution of Fisher's criterion for the
ionosphere, glass, diabetes and thyroid datasets. Values depicted are 10 times
10-fold cross-validation averages of 500 iteration training runs (all figures
in log X and Y scale).
4 Conclusions
Parallel perceptron training offers a very fast procedure to build good and
stable committee machine-like classifiers. In this work we have seen that their
classification performance can be improved by allowing their output weights to
take real values, obtained by applying Fisher's analysis over the perceptron
outputs. The final performance of these discriminant PPs is essentially that of
standard MLPs, which are powerful but costlier to build.
References
1. P. Auer, H. Burgsteiner, W. Maass, Reducing Communication for Distributed
Learning in Neural Networks, Proceedings of ICANN'2002, Lecture Notes in
Computer Science 2415 (2002), 123-128.
2. R. Duda, P. Hart, D. Stork, Pattern Classification (second edition), Wiley, 2000.
3. P. Murphy, D. Aha, UCI Repository of Machine Learning Databases, Tech. Report,
University of California, Irvine, 1994.
4. N. Nilsson, The Mathematical Foundations of Learning Machines, Morgan
5. J.A. Swets, Measuring the accuracy of diagnostic systems, Science 240 (1988),