Discriminant Parallel Perceptrons.
ABSTRACT Parallel perceptrons (PPs), a novel approach to committee machine training requiring minimal communication between outputs
and hidden units, allow the construction of efficient and stable nonlinear classifiers. In this work we shall explore how
to improve their performance by allowing their output weights to have real values, computed by applying Fisher’s linear discriminant
analysis to the committee machine’s perceptron outputs. We shall see that the final performance of the resulting classifiers
is comparable to that of the more complex and costlier to train multilayer perceptrons.
Ana González, Iván Cantador and José R. Dorronsoro*
Depto. de Ingeniería Informática and Instituto de Ingeniería del Conocimiento
Universidad Autónoma de Madrid, 28049 Madrid, Spain
1 Introduction
After their heyday in the early sixties, interest in machines made up of Rosenblatt’s perceptrons greatly decayed. The main reason for this was the lack of suitable training methods: even if perceptron combinations could provide complex decision boundaries, there were no efficient and robust procedures for constructing them. An example of this is the well known committee machine (CM; [4], chapter 6) for 2–class classification problems. A CM is made up of an odd number $H$ of standard perceptrons, the output of the $i$–th perceptron $P_i(X)$ over a $D$–dimensional input pattern $X$ being given by $P_i(X) = s(\mathrm{act}_i(X))$ (we assume $x_D = 1$ for bias purposes). Here $s(\cdot)$ denotes the sign function and $\mathrm{act}_i(X) = W_i \cdot X$ is the activation of $P_i$ on $X$. The CM output is then

$$h(X) = s\Big(\sum_{i=1}^{H} P_i(X)\Big) = s(V(X)),$$

i.e., the sign of the overall perceptron vote count $V(X)$. Assuming that each $X$ has a class label $y_X = \pm 1$, $X$ is correctly classified if $y_X h(X) = 1$. If not, CM training applies Rosenblatt’s rule

$$W_i := W_i + \eta\, y_X X \qquad (1)$$

to the smallest number of incorrect perceptrons needed to change the vote (this number is $(1+|V(X)|)/2$); moreover, this is done for those incorrect perceptrons for which $|\mathrm{act}_i(X)|$ is smallest. Although sensible, this training is somewhat unstable and can only build relatively weak classifiers.

A simple but powerful variant of classical CM training, the so–called parallel perceptrons (PPs), recently introduced by Auer et al. in [1], allows a very fast construction of more powerful classifiers, with capabilities close to those of the more complex (and costlier to train) multilayer perceptrons (MLPs). In PP training, (1) is applied to all wrong perceptrons, but the key PP training ingredient is an output stabilization procedure that tries to keep the activation $\mathrm{act}_i(X)$ of a correct $P_i$ away from 0, so that small random changes in $X$ do not cause it to be assigned to another class. More precisely, when $X$ is correctly classified but, for a given margin $\gamma$ and a perceptron $P_i$, we have $0 < y_X \mathrm{act}_i(X) < \gamma$, Rosenblatt’s rule is essentially applied again in order to push $y_X \mathrm{act}_i(X)$ further away from zero. The value of the margin $\gamma$ is also adjusted dynamically so that most of the correctly classified patterns have activation margins greater than the final $\gamma^*$ (see section 2). In spite of their very simple structure, PPs do have a universal approximation property and, as shown in [1], provide results in classification and regression problems quite close to those offered by C4.5 decision trees or MLPs.

* With partial support of Spain’s CICyT, projects TIC 01–572 and TIN2004–07676.
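As an illustration only, the committee vote and the PP-style update described above may be sketched in plain Python as follows. The function names (`cm_output`, `pp_update`), the convention $s(0) = +1$ and the use of the same learning rate for the margin step are assumptions made for this sketch; they are not taken from [1].

```python
def sign(t):
    """Sign function s(t); mapping s(0) to +1 is an assumption of this sketch."""
    return 1 if t >= 0 else -1


def dot(w, x):
    """Inner product W_i . X of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def cm_output(W, x):
    """Committee output h(X) = s(V(X)), V(X) being the sum of perceptron votes."""
    votes = sum(sign(dot(w, x)) for w in W)
    return sign(votes)


def pp_update(W, x, y, eta, gamma):
    """One PP-style update of the weight lists W on pattern x with label y = +1 or -1.

    If x is misclassified, rule (1) is applied to every wrong perceptron.
    If x is correctly classified, rule (1) is applied to the correct
    perceptrons whose margin y * act_i(x) lies in (0, gamma).
    """
    pattern_ok = cm_output(W, x) == y
    for w in W:
        act = dot(w, x)
        wrong = sign(act) != y
        small_margin = 0 < y * act < gamma
        if (not pattern_ok and wrong) or (pattern_ok and small_margin):
            for j in range(len(w)):
                w[j] += eta * y * x[j]
    return W
```

With three perceptrons, two of which vote wrongly, a single call to `pp_update` moves both wrong weight vectors towards the pattern and leaves the correct one untouched.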
There is much work being done in computational learning theory to build
efficient classifiers based on low complexity information processing methods.
This is particularly important for high dimensionality problems, such as those
arising in text mining or bioinformatics. As just mentioned, PPs combine simple
processing with good performance. A natural way to try to get a richer behavior
is to relax their clamping of output weights to 1, allowing these weights to have
real values. In fact, usually PP performance does not depend on the number
of perceptrons used, 3 being typically good enough. For classification problems,
a natural option, that we shall explore in this work, is to use standard linear
discriminant analysis to do so. We shall briefly describe in section 2 the training
of these discriminant PPs as well as their handling of margins, while in section 3
we will numerically analyze their performance over several classification problems,
comparing it to that of standard PPs and MLPs. As we shall see, discriminant
PPs will give results somewhat better than those of standard PPs and essentially
similar to those of MLPs.
2 Discriminant PPs
We discuss first perceptron weight and margin updates. Assume that a set $W = (W_1, \ldots, W_H)$ of perceptron weights and a vector $A = (a_1, \ldots, a_H)^t$ of Fisher’s weights have been computed. The output hypothesis of the resulting discriminant PP is

$$h(X) = s\big(A \cdot (P(X) - \tilde{P})\big),$$

with $P(X) = (P_1(X), \ldots, P_H(X))^t$, $\tilde{P} = (P^+ + P^-)/2$ and $P^\pm$ the averages of the perceptron outputs over the positive and negative classes. We assume that the sign of the $A$ vector has been adjusted so that a pattern $X$ is correctly classified if $y_X h(X) = 1$. Now,

$$|(P^\pm)_i| = \Big|\frac{1}{N_\pm} \sum_{X' \in C_\pm} P_i(X')\Big| \le \frac{1}{N_\pm} \sum_{X' \in C_\pm} |P_i(X')| = 1,$$

with $N_\pm$ the sizes of the positive and negative classes $C_\pm$. We can expect in fact that $|(P^\pm)_i| < 1$ and hence $|\tilde{P}_i| < 1$ too. Therefore, since $|P_i(X)| = 1$, $y_X a_i (P_i(X) - \tilde{P}_i) > 0$ if and only if $y_X a_i P_i(X) > 0$, and if $X$ is not correctly classified, we should augment $y_X a_i P_i(X)$ over those wrong perceptrons for which $y_X a_i P_i(X) < 0$. This is equivalent to augmenting $y_X a_i \mathrm{act}_i(X) = y_X a_i W_i \cdot X$, which can be simply achieved by using again Rosenblatt’s rule (1) adjusted in terms of $A$:

$$W_i := W_i + \eta\, s(y_X a_i) X, \qquad (2)$$

for then we have

$$y_X a_i (W_i + \eta\, s(y_X a_i) X) \cdot X = y_X a_i W_i \cdot X + \eta\, |y_X a_i|\, |X|^2 > y_X a_i W_i \cdot X.$$

Fig. 1. Margin evolution for the thyroid (left) and diabetes (right) datasets. Values depicted are 10 times 10–fold cross–validation averages of 500 iteration training runs.
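The discriminant PP hypothesis and the update (2) of this section can be illustrated by the following minimal Python sketch. It is only a sketch under stated assumptions: the names `dpp_output` and `dpp_update` are hypothetical, $s(0)$ is taken as $+1$, and the Fisher weights `A` and the midpoint `P_tilde` are assumed to be already available.

```python
def sign(t):
    """Sign function s(t); mapping s(0) to +1 is an assumption of this sketch."""
    return 1 if t >= 0 else -1


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def dpp_output(W, A, P_tilde, x):
    """Discriminant PP hypothesis h(X) = s(A . (P(X) - P_tilde))."""
    P = [sign(dot(w, x)) for w in W]
    return sign(sum(a * (p - pt) for a, p, pt in zip(A, P, P_tilde)))


def dpp_update(W, A, P_tilde, x, y, eta):
    """Apply rule (2) to the wrong perceptrons of a misclassified pattern.

    A perceptron i is treated as wrong when y * a_i * P_i(x) < 0; its weights
    are then moved by eta * s(y * a_i) * x, which increases y * a_i * W_i . x.
    """
    if dpp_output(W, A, P_tilde, x) == y:
        return W  # correctly classified: rule (2) is not applied here
    for w, a in zip(W, A):
        if y * a * sign(dot(w, x)) < 0:
            step = eta * sign(y * a)
            for j in range(len(w)):
                w[j] += step * x[j]
    return W
```

In this sketch a perceptron with a negative Fisher weight is pushed in the direction opposite to the label, which is precisely what the factor $s(y_X a_i)$ in (2) encodes.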
On the other hand, the margin stabilization of discriminant PPs is essentially that of standard PPs. More precisely, if $X$ is correctly classified, a correct perceptron $P_i$ verifies $y_X a_i P_i(X) > 0$ and thus $s(y_X a_i)\,\mathrm{act}_i(X) > 0$, which we want to remain positive after small perturbations of $X$. For this we may again apply (2), now in the form $W_i := W_i + \lambda \eta\, s(y_X a_i) X$, to those correct perceptrons with a too small margin, i.e., those for which $0 < s(y_X a_i)\,\mathrm{act}_i(X) < \gamma$, so that we push $s(y_X a_i)\,\mathrm{act}_i(X)$ further away from zero. The new parameter $\lambda$ measures the importance we give to wide margins. The value of the margin $\gamma$ is also adjusted dynamically from a starting value $\gamma_0$. More precisely, at the beginning of the $t$–th batch pass, we set $\gamma_t = \gamma_{t-1}$; then, if a pattern $X$ is processed correctly, we set $\gamma_t := \gamma_t + 0.25\eta$ if all perceptrons $P_i$ that process $X$ correctly also verify $s(y_X a_i)\,\mathrm{act}_i(X) \ge \gamma_{t-1}$, while we set $\gamma_t := \gamma_t - 0.75\eta$ if for at least one $P_i$ we have $0 < s(y_X a_i)\,\mathrm{act}_i(X) < \gamma_{t-1}$. These $\gamma_t$ usually converge stably to a limit margin $\gamma^*$ (see figure 1). We normalize the $W_i$ weights after each batch pass so that the margin is meaningful. We also adjust the learning rate as $\eta_t = \eta_0/\sqrt{t}$ after each batch pass, as suggested in [1].
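The dynamic margin adjustment and the learning rate schedule just described can be sketched as below. This is a hedged illustration: `update_margin` and `learning_rate` are hypothetical names, and the input is assumed to be, for each correctly classified pattern of the batch, the list of values $s(y_X a_i)\,\mathrm{act}_i(X)$ over all perceptrons.

```python
import math


def update_margin(gamma_prev, eta, margins_per_pattern):
    """Return gamma_t from gamma_{t-1} after one batch pass.

    margins_per_pattern holds, for each correctly classified pattern, the
    values s(y a_i) * act_i(X) of all perceptrons. Perceptrons with a
    positive value are the ones that process the pattern correctly.
    """
    gamma = gamma_prev
    for margins in margins_per_pattern:
        correct = [m for m in margins if m > 0]
        if any(m < gamma_prev for m in correct):
            gamma -= 0.75 * eta  # at least one correct perceptron is too close to 0
        else:
            gamma += 0.25 * eta  # all correct perceptrons clear the previous margin
    return gamma


def learning_rate(eta0, t):
    """Learning rate eta_t = eta_0 / sqrt(t) for batch pass t = 1, 2, ..."""
    return eta0 / math.sqrt(t)
```

Since the decrease step is three times the increase step, the margin in this sketch only grows when clearly more than three quarters of the correctly classified patterns clear the previous margin.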
We recall that for 2–class problems, Fisher’s discriminants are very simple to construct. In fact, the vector $A = S_T^{-1}(P^+ - P^-)$ minimizes [2] the ratio $J = s_T/s_B = s_T(A)/s_B(A)$ of the total variance $s_T$ of discriminant PP outputs to their between class variance $s_B$. However, the total covariance matrix $S_T$ of the perceptrons’ outputs is quite likely to be singular (notice that the output space for $H$ perceptrons has just $2^H$ distinct values). To avoid this, we will take as the output of perceptron $i$ the value $P_i^\star(X) = \sigma_\gamma(W_i \cdot X)$, with the ramp function $\sigma_\gamma$ taking the values $\sigma_\gamma(t) = s(t)$ if $|t| > \lambda = \min(1, 2\gamma)$ and $\sigma_\gamma(t) = t/\lambda$ when $|t| \le \lambda$. This makes it quite unlikely that $S_T$ will be singular and, together with the $\eta$ and $\gamma$ updates, allows for a fast and quite stable learning convergence.

We finally comment on the complexity of this procedure. For $N$ training patterns, $D$–dimensional inputs and $H$ perceptrons, Rosenblatt’s rule has an $O(NDH)$ cost. For its part, the computation of the covariance matrix $S_T$ has an $O(NH^2)$ cost, which dominates the $O(H^3)$ cost of its inversion. While these estimates are formally similar to the complexity estimates of MLPs, computing times are much smaller for discriminant PPs (and more so for standard PPs), as their weight updates are much simpler.

Problem set   size   pos. %   input dim.   num. hid.   lr. rate   num. hid.   lr. rate   num. hid.
heart dis.           46.1     13
Table 1. Input dimensions and training parameters used for the 7 comparison datasets. MLPs were trained by conjugate gradient minimization.
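To make the construction of the Fisher weights in this section concrete, the following plain Python sketch computes the ramp outputs, the total covariance matrix $S_T$ and the vector $A = S_T^{-1}(P^+ - P^-)$. Everything here is illustrative: the names `ramp`, `solve` and `fisher_weights` are hypothetical, $\gamma > 0$ is assumed, $S_T$ is normalized by the number of patterns (an assumption that only rescales $A$), and the linear system is solved by a small Gaussian elimination rather than by an explicit inverse.

```python
def ramp(t, gamma):
    """Ramp sigma_gamma(t): s(t) if |t| > lam, t / lam otherwise, lam = min(1, 2*gamma)."""
    lam = min(1.0, 2.0 * gamma)
    if abs(t) > lam:
        return 1.0 if t > 0 else -1.0
    return t / lam


def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]  # raises ZeroDivisionError if M is singular
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (aug[r][n] - s) / aug[r][r]
    return z


def fisher_weights(outputs, labels):
    """Return (A, P_tilde) from ramp outputs (one row per pattern) and labels +1/-1."""
    n, H = len(outputs), len(outputs[0])
    mean = [sum(o[i] for o in outputs) / n for i in range(H)]

    def class_mean(c):
        rows = [o for o, y in zip(outputs, labels) if y == c]
        return [sum(o[i] for o in rows) / len(rows) for i in range(H)]

    p_pos, p_neg = class_mean(1), class_mean(-1)
    # Total covariance matrix S_T of the perceptron outputs (H x H).
    S_T = [[sum((o[i] - mean[i]) * (o[j] - mean[j]) for o in outputs) / n
            for j in range(H)] for i in range(H)]
    A = solve(S_T, [p - q for p, q in zip(p_pos, p_neg)])
    P_tilde = [(p + q) / 2 for p, q in zip(p_pos, p_neg)]
    return A, P_tilde
```

The double loop that builds `S_T` is the $O(NH^2)$ step of the complexity estimate, and `solve` is the $O(H^3)$ step.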
3 Numerical results
We shall compare the performance of discriminant PPs with that of standard PPs and also of multilayer perceptrons (MLPs) over 7 classification problem sets from the well known UCI database; they are listed in table 1, together with their positive class sizes, their input dimensions and the training parameters used. Some of them (glass, vehicle, thyroid) are multi–class problems; to reduce them to 2–class problems, we take as the minority class class 1 in the vehicle dataset and class 7 in the glass problem, and merge both sick thyroid classes into a single class. We refer to the UCI database documentation [3] for more details.
In what follows we shall compare the performance of standard and discriminant PPs and also that of standard multilayer perceptrons, first in terms of accuracy, that is, the percentage of correctly classified patterns, but also in terms of the value $g = \sqrt{a^+ a^-}$, where $a^\pm$ are the accuracies of the positive and negative classes (see [5]). Notice that for sample imbalanced data sets a high accuracy could be achieved simply by assigning all patterns to the (possibly much larger) negative class; $g$ gives a more balanced classification performance measure. In all cases, training has been carried out as a batch procedure using 10–times 10–fold cross–validation. Updates (1) and (2) have been applied in standard and discriminant PP training, while conjugate gradient has been used for MLPs. The number of perceptrons in all cases and the initial learning rates for PPs and discriminant PPs for each dataset are described in table 1. Table 2 presents average values of the cross–validation procedure just described (best values in bold face, second best in cursive) for accuracies and $g$ values, together with their standard deviations. As can be seen, discriminant PPs give the best accuracy in 3 problems and second best in the other 4, MLPs give the best accuracy in 3 problems and second best in another 2, while standard PPs’ accuracy is best only in the cancer problem and is second best over the glass data set. If we consider $g$ values, discriminant PPs’ $g$ is highest in 4 problems and second best in the other 3, MLPs’ $g$ is highest in 2 problems and second best in another 4, while standard PPs’ is highest only in the cancer problem. The performance of discriminant PPs and MLPs is thus quite close and better than that of standard PPs.

(2.16) (2.22) (2.15) (2.22) (1.67) (1.72)
74.97 71.87 74.25
(2.45) (3.98) (3.21) (5.34) (3.09) (4.33)
96.91 92.12 94.26
(2.09) (11.01) (2.09) (11.09) (2.87) (8.52)
79.97 78.95 73.90
(3.80) (3.88) (3.80) (3.88) (4.05) (4.24)
84.06 82.16 76.97
(4.58) (4.54) (3.91) (4.11) (4.19) (4.54)
(0.40) (1.84) (0.91) (9.46) (1.63) (4.41)
(0.41) (3.46) (2.50) (5.00) (4.26) (5.77)
96.10 96.57 96.15 95.84 95.53
diabetes 68.63 76.00 70.45
glass 84.29 94.05 85.27
heart dis. 73.80 75.22 74.68
ionosphere 74.32 84.83 81.36
thyroid 82.55 97.62 94.01
vehicle 65.03 81.51 74.48
Table 2. Accuracy, g test values and their standard deviations (in parentheses) for 7 datasets and different classifier construction procedures. It can be seen that discriminant PPs results are comparable to those of MLPs and both are better than those of standard PPs.
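The accuracy and the balanced measure $g = \sqrt{a^+ a^-}$ used in this section can be computed as in the short Python sketch below. The function name `accuracy_and_g` is hypothetical, labels are assumed to be coded as $+1$ and $-1$, and both classes are assumed to be non-empty.

```python
import math


def accuracy_and_g(y_true, y_pred):
    """Return (accuracy, g), with g the geometric mean of the per-class accuracies."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == -1]
    a_pos = sum(1 for p in pos if p == 1) / len(pos)    # accuracy on the positive class
    a_neg = sum(1 for p in neg if p == -1) / len(neg)   # accuracy on the negative class
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return acc, math.sqrt(a_pos * a_neg)
```

For instance, with 2 positive and 8 negative patterns, a classifier that assigns every pattern to the negative class reaches an accuracy of 0.8 while its $g$ value is 0, which is the imbalance effect mentioned above.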
We finish this section by noticing that the weight update (2) only aims to reduce the classification error and to achieve a clear margin, but there is no reason why it should minimize the Fisher criterion $J$. However, as seen in figure 2, this also happens. The figure depicts in logarithmic X and Y scales the evolution of $J$ for the ionosphere, glass, diabetes and thyroid datasets (clockwise from top left). Values depicted are 10 times 10–fold cross–validation averages of 500 iteration training runs. Although not always monotonic (as in the glass, thyroid and diabetes problems), the overall $J$ behavior is clearly decreasing and [...]

Fig. 2. From top left, clockwise: evolution of Fisher’s criterion for the ionosphere, glass, diabetes and thyroid datasets. Values depicted are 10 times 10–fold cross–validation averages of 500 iteration training runs (all figures in log X and Y scale).
4 Conclusions
Parallel perceptron training offers a very fast procedure to build good and stable committee machine–like classifiers. In this work we have seen that their classification performance can be improved by allowing their output weights to have real values, obtained by applying Fisher’s analysis over the perceptron outputs. The final performance of these discriminant PPs is essentially that of the powerful but costlier to build standard MLPs.
References
1. P. Auer, H. Burgsteiner, W. Maass, Reducing Communication for Distributed Learning in Neural Networks, Proceedings of ICANN’2002, Lecture Notes in Computer Science 2415 (2002), 123–128.
2. R. Duda, P. Hart, D. Stork, Pattern Classification (second edition), Wiley, 2000.
3. P. Murphy, D. Aha, UCI Repository of Machine Learning Databases, Tech. Report, University of California, Irvine, 1994.
4. N. Nilsson, The Mathematical Foundations of Learning Machines, Morgan
5. J.A. Swets, Measuring the accuracy of diagnostic systems, Science 240 (1988),