
Discriminant Parallel Perceptrons

Ana González, Iván Cantador and José R. Dorronsoro*

Depto. de Ingeniería Informática and Instituto de Ingeniería del Conocimiento

Universidad Autónoma de Madrid, 28049 Madrid, Spain

Abstract. Parallel perceptrons (PPs), a novel approach to committee machine training that requires minimal communication between outputs and hidden units, allow the construction of efficient and stable nonlinear classifiers. In this work we explore how to improve their performance by allowing their output weights to take real values, computed by applying Fisher's linear discriminant analysis to the outputs of the committee machine's perceptrons. We shall see that the final performance of the resulting classifiers is comparable to that of multilayer perceptrons, which are more complex and costlier to train.

1 Introduction

After their heyday in the early sixties, interest in machines made up of Rosenblatt's perceptrons greatly decayed. The main reason for this was the lack of suitable training methods: even if perceptron combinations could provide complex decision boundaries, there were no efficient and robust procedures for constructing them. An example of this are the well-known committee machines (CM; [4], chapter 6) for 2-class classification problems. They are made up of an odd number $H$ of standard perceptrons, the output of the $i$-th perceptron $P_i(X)$ over a $D$-dimensional input pattern $X$ being given by $P_i(X) = s(act_i(X))$ (we assume $x_D = 1$ for bias purposes). Here $s(\cdot)$ denotes the sign function and $act_i(X) = W_i \cdot X$ is the activation of $P_i$ on $X$. The CM output is then

$$h(X) = s\left(\sum_{i=1}^{H} P_i(X)\right) = s(V(X)),$$

i.e., the sign of the overall perceptron vote count $V(X)$. Assuming that each $X$ has a class label $y_X = \pm 1$, $X$ is correctly classified if $y_X h(X) = 1$. If not, CM training applies Rosenblatt's rule

$$W_i := W_i + \eta\, y_X X \tag{1}$$

to the smallest number of incorrect perceptrons whose correction would reverse the vote (this number is $(1 + |V(X)|)/2$); moreover, this is done for those incorrect perceptrons for which $|act_i(X)|$ is smallest. Although sensible, this training is somewhat unstable and yields only moderately strong classifiers. A simple but powerful variant of classical CM training, the so-called parallel perceptrons (PPs), recently introduced by Auer et al. in [1], allows a very fast construction of more powerful classifiers, with capabilities close to those of the more complex (and costlier to train) multilayer perceptrons (MLPs). In PP training, (1) is applied to all wrong perceptrons, but the key PP training ingredient is an output stabilization procedure that tries to keep the activation $act_i(X)$ of a correct $P_i$ away from 0, so that small random changes in $X$ do not cause it to be assigned to another class. More precisely, when $X$ is correctly classified but, for a given margin $\gamma$ and a perceptron $P_i$, we have $0 < y_X act_i(X) < \gamma$, Rosenblatt's rule is essentially applied again in order to push $y_X act_i(X)$ further away from zero. The value of the margin $\gamma$ is also adjusted dynamically so that most of the correctly classified patterns have activation margins greater than the final $\gamma^*$ (see section 2). In spite of their very simple structure, PPs do have a universal approximation property and, as shown in [1], provide results in classification and regression problems quite close to those offered by C4.5 decision trees or MLPs.

* With partial support of Spain's CICyT, projects TIC 01–572, TIN2004–07676.
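The classical committee machine step described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' code: the function names (`cm_output`, `cm_train_step`) are hypothetical, weights are stored as plain lists, and mapping $s(0)$ to $+1$ is an assumption made here for definiteness.

```python
def sign(t):
    # Sign function s(.); mapping 0 to +1 is an assumption of this sketch.
    return 1 if t >= 0 else -1


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def cm_output(W, x):
    """Return (h(X), V(X), activations) for a committee with weight rows W."""
    acts = [dot(w, x) for w in W]
    vote = sum(sign(a) for a in acts)  # V(X); odd whenever H is odd
    return sign(vote), vote, acts


def cm_train_step(W, x, y, eta):
    """Classical CM step: if x is misclassified, apply rule (1) to the
    (1 + |V(X)|) / 2 wrong perceptrons with the smallest |act_i(X)|."""
    h, vote, acts = cm_output(W, x)
    if y * h == 1:
        return W  # pattern already correctly classified
    n_fix = (1 + abs(vote)) // 2
    wrong = sorted((i for i, a in enumerate(acts) if y * sign(a) != 1),
                   key=lambda i: abs(acts[i]))
    for i in wrong[:n_fix]:
        # Rule (1): W_i := W_i + eta * y_X * X (W is modified in place)
        W[i] = [wi + eta * y * xi for wi, xi in zip(W[i], x)]
    return W
```

In PP training, by contrast, the update would be applied to every wrong perceptron rather than only to the `n_fix` least wrong ones.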

There is much work being done in computational learning theory to build efficient classifiers based on low complexity information processing methods. This is particularly important for high dimensionality problems, such as those arising in text mining or bioinformatics. As just mentioned, PPs combine simple processing with good performance. A natural way to try to get a richer behavior is to relax the clamping of their output weights to 1, allowing these weights to take real values. In fact, PP performance usually does not depend on the number of perceptrons used, 3 being typically good enough. For classification problems, a natural option for computing these real weights, which we shall explore in this work, is to use standard linear discriminant analysis. We shall briefly describe in section 2 the training of these discriminant PPs as well as their handling of margins, while in section 3 we will numerically analyze their performance over several classification problems, comparing it to that of standard PPs and MLPs. As we shall see, discriminant PPs give results somewhat better than those of standard PPs and essentially similar to those of MLPs.

2 Discriminant PPs

We first discuss perceptron weight and margin updates. Assume that a set $W = (W_1, \ldots, W_H)$ of perceptron weights and a vector $A = (a_1, \ldots, a_H)^t$ of Fisher's weights have been computed. The output hypothesis of the resulting discriminant PP is

$$h(X) = s\left(A \cdot (P(X) - \tilde{P})\right) = s\left(\sum_{i=1}^{H} a_i \left(P_i(X) - \tilde{P}_i\right)\right),$$

with $\tilde{P} = (P_+ + P_-)/2$ and $P_\pm$ the averages of the perceptron outputs over the positive and negative classes. We assume that the sign of the vector $A$ has been adjusted so that a pattern $X$ is correctly classified if $y_X h(X) = 1$. Now,

$$|(P_\pm)_i| \le \frac{1}{N_\pm} \sum_{X' \in C_\pm} |P_i(X')| = 1,$$

with $N_\pm$ the sizes of the positive and negative classes $C_\pm$. We can in fact expect that $|(P_\pm)_i| < 1$ and hence that $|\tilde{P}_i| < 1$ too. Therefore, $y_X a_i (P_i(X) - \tilde{P}_i) > 0$


Fig. 1. Margin evolution for the thyroid (left) and diabetes datasets. Values depicted are 10 times 10-fold cross-validation averages of 500 iteration training runs.

if and only if $y_X a_i P_i(X) > 0$, and if $X$ is not correctly classified, we should increase $y_X a_i P_i(X)$ over those wrong perceptrons for which $y_X a_i P_i(X) < 0$. This is equivalent to increasing $y_X a_i\, act_i(X) = y_X a_i W_i \cdot X$, which can be simply achieved by again using Rosenblatt's rule (1), adjusted in terms of $A$:

$$W_i := W_i + \eta\, s(y_X a_i) X, \tag{2}$$

for then we have

$$y_X a_i \left(W_i + \eta\, s(y_X a_i) X\right) \cdot X = y_X a_i W_i \cdot X + \eta\, |y_X a_i|\, |X|^2 > y_X a_i W_i \cdot X.$$
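The inequality above can be checked numerically with a small sketch. It assumes list-based weight vectors and a nonzero Fisher weight $a_i$; the names `dpp_update` and `signed_activation` are hypothetical and chosen only for this illustration.

```python
def sign(t):
    # Sign function s(.); mapping 0 to +1 is an assumption of this sketch.
    return 1 if t >= 0 else -1


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def dpp_update(w_i, x, y, a_i, eta):
    """Rule (2): W_i := W_i + eta * s(y_X * a_i) * X. Returns a new list."""
    step = eta * sign(y * a_i)
    return [wi + step * xi for wi, xi in zip(w_i, x)]


def signed_activation(w_i, x, y, a_i):
    """The quantity y_X * a_i * (W_i . X) that rule (2) is meant to increase."""
    return y * a_i * dot(w_i, x)
```

For instance, with `w_i = [0.5, -1.0]`, `x = [2.0, 1.0]`, `y = -1`, `a_i = 0.4` and `eta = 0.1`, the signed activation should grow by $\eta\,|a_i|\,|X|^2 = 0.1 \cdot 0.4 \cdot 5 = 0.2$, in agreement with the displayed identity.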

On the other hand, the margin stabilization of discriminant PPs is essentially that of standard PPs. More precisely, if $X$ is correctly classified and $y_X a_i P_i(X) > 0$, then $s(y_X a_i)\, act_i(X) > 0$, which we want to remain positive after small perturbations of $X$. For this we may again apply (2), now in the form $W_i := W_i + \lambda \eta\, s(y_X a_i) X$, to those correct perceptrons with too small a margin, i.e., those for which $0 < s(y_X a_i)\, act_i(X) < \gamma$, so that we push $s(y_X a_i)\, act_i(X)$ further away from zero. The new parameter $\lambda$ measures the importance we give to wide margins. The value of the margin $\gamma$ is also adjusted dynamically from a starting value $\gamma_0$. More precisely, at the beginning of the $t$-th batch pass, we set $\gamma_t = \gamma_{t-1}$; then, if a pattern $X$ is processed correctly, we set $\gamma_t := \gamma_t + 0.25\eta$ if all perceptrons $P_i$ that process $X$ correctly also verify $s(y_X a_i)\, act_i(X) \ge \gamma_{t-1}$, while we set $\gamma_t := \gamma_t - 0.75\eta$ if for at least one $P_i$ we have $0 < s(y_X a_i)\, act_i(X) < \gamma_{t-1}$. These $\gamma_t$ usually converge stably to a limit margin $\gamma^*$ (see figure 1). We normalize the $W_i$ weights after each batch pass so that the margin is meaningful. We also adjust the learning rate as $\eta_t = \eta_0/\sqrt{t}$ after each batch pass, as suggested in [1].
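The margin and learning rate schedules just described might be sketched as follows. This is a hedged illustration only: the function names are hypothetical, the input to `update_margin` is assumed to contain, for each correctly classified pattern, the values $s(y_X a_i)\, act_i(X)$ of all perceptrons, and normalizing each $W_i$ to unit Euclidean norm is an assumption, since the text does not say which norm is used.

```python
import math


def update_margin(gamma_prev, signed_acts_per_pattern, eta):
    """One batch pass of the margin update.

    signed_acts_per_pattern: one list per correctly classified pattern,
    holding s(y_X a_i) * act_i(X) for every perceptron i.
    """
    gamma = gamma_prev  # gamma_t starts at gamma_{t-1}
    for acts in signed_acts_per_pattern:
        correct = [v for v in acts if v > 0]  # perceptrons correct on X
        if any(v < gamma_prev for v in correct):
            gamma -= 0.75 * eta  # at least one margin is too small
        elif correct:
            gamma += 0.25 * eta  # all correct perceptrons clear the margin
    return gamma


def learning_rate(eta0, t):
    """eta_t = eta_0 / sqrt(t) for batch pass t = 1, 2, ..."""
    return eta0 / math.sqrt(t)


def normalize(w):
    """Rescale a weight vector to unit Euclidean norm (assumed norm)."""
    n = math.sqrt(sum(v * v for v in w))
    return [v / n for v in w] if n > 0 else list(w)
```

The asymmetric steps (0.25 up, 0.75 down) mean the margin grows only when clearly more patterns clear it than fail it, which is consistent with the stable convergence reported in the text.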

We recall that for 2-class problems, Fisher's discriminants are very simple to construct. In fact, the vector $A = S_T^{-1}(P_+ - P_-)$ minimizes [2] the ratio $J = s_T/s_B = s_T(A)/s_B(A)$ of the total variance $s_T$ of discriminant PP outputs to their between class variance $s_B$. However, the total covariance matrix $S_T$ of


                                          discr. PPs            PPs                   MLPs
Problem set     pos. size %  input dim.   num. hid.  lr. rate   num. hid.  lr. rate   num. hid.
breast cancer   34.5         9            5          0.001      3          0.001      5
diabetes        34.9         7            5          0.01       3          0.001      5
glass           13.6         9            5          0.01       3          0.01       5
heart dis.      46.1         13           5          0.001      5          0.001      5
ionosphere      35.9         33           5          0.001      5          0.0001     7
thyroid         7.4          8            5          0.0005     5          0.001      5
vehicle         25.7         18           10         0.01       5          0.001      5

Table 1. Input dimensions and training parameters used for the 7 comparison datasets. MLPs were trained by conjugate gradient minimization.

the perceptrons' outputs is quite likely to be singular (notice that the output space for $H$ perceptrons has just $2^H$ distinct values). To avoid this, we will take as the output of perceptron $i$ the value $P'_i(X) = \sigma_\gamma(W_i \cdot X)$, with the ramp function $\sigma_\gamma$ taking the values $\sigma_\gamma(t) = s(t)$ if $|t| > \lambda = \min(1, 2\gamma)$ and $\sigma_\gamma(t) = t/\lambda$ when $|t| \le \lambda$. This makes it quite unlikely that $S_T$ will be singular and, together with the $\eta$ and $\gamma$ updates, allows for a fast and quite stable learning convergence.

We finally comment on the complexity of this procedure. For $D$-dimensional inputs, $N$ patterns and $H$ perceptrons, Rosenblatt's rule has an $O(NDH)$ cost. For its part, the computation of the $S_T$ covariance matrix has an $O(NH^2)$ cost, which dominates the $O(H^3)$ cost of its inversion. While formally similar to the complexity estimates of MLPs, computing times are much smaller for discriminant PPs (and more so for standard PPs), as their weight updates are much simpler.
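As an illustration of the ramp outputs and of the Fisher weights $A = S_T^{-1}(P_+ - P_-)$, the following standard-library sketch may help. It is not the authors' implementation: the helper names are hypothetical, the covariance is normalized by the sample size (an assumption; rescaling $S_T$ only rescales $A$), and the linear system is solved by plain Gaussian elimination instead of an explicit inverse.

```python
def ramp(t, gamma):
    """sigma_gamma(t): sign(t) if |t| > lam, else t / lam, lam = min(1, 2*gamma)."""
    lam = min(1.0, 2.0 * gamma)
    if lam == 0 or abs(t) > lam:
        return 1.0 if t >= 0 else -1.0
    return t / lam


def mean_vec(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def covariance(rows):
    """Total covariance S_T of the perceptron outputs: O(N H^2) work."""
    m, n = mean_vec(rows), len(rows)
    h = len(m)
    S = [[0.0] * h for _ in range(h)]
    for r in rows:
        d = [ri - mi for ri, mi in zip(r, m)]
        for i in range(h):
            for j in range(h):
                S[i][j] += d[i] * d[j] / n
    return S


def solve(S, b):
    """Solve S a = b by Gaussian elimination with partial pivoting: O(H^3)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(S, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("S_T is (numerically) singular")
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    a = [0.0] * n
    for r in range(n - 1, -1, -1):
        a[r] = (M[r][n] - sum(M[r][k] * a[k] for k in range(r + 1, n))) / M[r][r]
    return a


def fisher_weights(outputs, labels):
    """Return (A, P_tilde) from ramp outputs (one row per pattern) and +/-1 labels."""
    pos = [o for o, y in zip(outputs, labels) if y == 1]
    neg = [o for o, y in zip(outputs, labels) if y != 1]
    p_pos, p_neg = mean_vec(pos), mean_vec(neg)
    diff = [a - b for a, b in zip(p_pos, p_neg)]
    A = solve(covariance(outputs), diff)
    p_mid = [(a + b) / 2 for a, b in zip(p_pos, p_neg)]
    return A, p_mid
```

A pattern would then be assigned the sign of $A \cdot (P'(X) - \tilde{P})$. With pure $\pm 1$ outputs, identical rows are frequent and `solve` is more likely to hit the singular case, which is the situation the ramp outputs are meant to avoid.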

3 Numerical results

We shall compare the performance of discriminant PPs with that of standard PPs and also of multilayer perceptrons (MLPs) over 7 classification problem sets from the well-known UCI database; they are listed in table 1, together with their positive class sizes, their input dimensions and the training parameters used. Some of them (glass, vehicle, thyroid) are multi-class problems; to reduce them to 2-class problems, we take as the minority class class 1 in the vehicle dataset and class 7 in the glass problem, and merge both sick thyroid classes into a single class. We refer to the UCI database documentation [3] for more details.

In what follows we shall compare the performance of standard and discriminant PPs and also that of standard multilayer perceptrons, first in terms of accuracy, that is, the percentage of correctly classified patterns, but also in terms of the value $g = \sqrt{a^+ a^-}$, where $a^\pm$ are the accuracies of the positive and negative classes (see [5]). Notice that for sample imbalanced data sets a high accuracy could be achieved simply by assigning all patterns to the (possibly much larger) negative class; $g$ gives a more balanced classification performance measure. In all cases, training has been carried out as a batch procedure using 10 times


              discr. PPs                    PPs                           MLPs
Problem set   acc.           g              acc.           g              acc.           g
cancer        96.50 (2.16)   96.10 (2.22)   96.57 (2.15)   96.15 (2.22)   95.84 (1.67)   95.53 (1.72)
diabetes      74.97 (2.45)   71.87 (3.98)   74.25 (3.21)   68.63 (5.34)   76.00 (3.09)   70.45 (4.33)
glass         96.91 (2.09)   92.12 (11.01)  94.26 (2.09)   84.29 (11.09)  94.05 (2.87)   85.27 (8.52)
heart dis.    79.97 (3.80)   78.95 (3.88)   73.90 (3.80)   73.80 (3.88)   75.22 (4.05)   74.68 (4.24)
ionosphere    84.06 (4.58)   82.16 (4.54)   76.97 (3.91)   74.32 (4.11)   84.83 (4.19)   81.36 (4.54)
thyroid       97.89 (0.40)   92.06 (1.84)   96.86 (0.91)   82.55 (9.46)   97.62 (1.63)   94.01 (4.41)
vehicle       76.18 (0.41)   70.57 (3.46)   74.82 (2.50)   65.03 (5.00)   81.51 (4.26)   74.48 (5.77)

Table 2. Accuracy, g test values and their standard deviations (in parentheses) for 7 datasets and different classifier construction procedures. It can be seen that the results of discriminant PPs are comparable to those of MLPs and both are better than those of standard PPs.

10-fold cross-validation. Updates (1) and (2) have been applied in standard and discriminant PP training respectively, while conjugate gradient has been used for MLPs. The number of perceptrons in all cases and the initial learning rates for PPs and discriminant PPs for each dataset are given in table 1. Table 2 presents average values of the cross-validation procedure just described (best values in bold face, second best in italics) for accuracies and g values, together with their standard deviations. As can be seen, discriminant PPs give the best accuracy in 3 problems and the second best in the other 4; MLPs give the best accuracy in 3 problems and the second best in another 2; and standard PPs' accuracy is best only in the cancer problem and second best over the glass data set. If we consider g values, discriminant PPs' g is highest in 4 problems and second best in the other 3, MLPs' g is highest in 2 problems and second best in another 4, while standard PPs' g is highest only in the cancer problem. The performance of discriminant PPs and MLPs is thus quite close, and better than that of standard PPs.
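To make the difference between accuracy and the balanced measure $g$ used in this comparison concrete, a minimal sketch follows. The function names are hypothetical, labels are assumed to be $\pm 1$, and both classes are assumed to be present in the test sample.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of correctly classified patterns."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def g_measure(y_true, y_pred):
    """g = sqrt(a+ * a-), with a+/a- the accuracies on each class."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == -1]
    a_pos = sum(1 for p in pos if p == 1) / len(pos)
    a_neg = sum(1 for p in neg if p == -1) / len(neg)
    return math.sqrt(a_pos * a_neg)
```

On a sample with 2 positive and 8 negative patterns, a classifier that assigns everything to the negative class reaches an accuracy of 0.8 but a $g$ of 0, which is why $g$ is the more informative figure for the imbalanced datasets.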

We finish this section by noting that the weight update (2) only aims to reduce the classification error and to achieve a clear margin, and there is no a priori reason why it should also minimize the Fisher criterion $J$. However, as seen in figure 2, this also happens. The figure depicts, in logarithmic X and Y scales, the evolution of $J$ for the ionosphere, glass, diabetes and thyroid datasets (clockwise from top left). Values depicted are 10 times 10-fold cross-validation averages of 500 iteration training runs. Although not always monotonic (as in the glass, thyroid and diabetes problems), the overall behavior of $J$ is clearly decreasing and convergent.


Fig. 2. From top left, clockwise: evolution of Fisher's criterion for the ionosphere, glass, diabetes and thyroid datasets. Values depicted are 10 times 10-fold cross-validation averages of 500 iteration training runs (all figures in log X and Y scale).

4 Conclusions

Parallel perceptron training offers a very fast procedure to build good and stable committee machine-like classifiers. In this work we have seen that their classification performance can be improved by allowing their output weights to take real values, obtained by applying Fisher's analysis to the perceptron outputs. The final performance of these discriminant PPs is essentially that of standard MLPs, which are powerful but costlier to build.

References

1. P. Auer, H. Burgsteiner, W. Maass, Reducing Communication for Distributed Learning in Neural Networks, Proceedings of ICANN'2002, Lecture Notes in Computer Science 2415 (2002), 123–128.

2. R. Duda, P. Hart, D. Stork, Pattern Classification (second edition), Wiley, 2000.

3. P. Murphy, D. Aha, UCI Repository of Machine Learning Databases, Tech. Report, University of California, Irvine, 1994.

4. N. Nilsson, The Mathematical Foundations of Learning Machines, Morgan Kaufmann, 1990.

5. J.A. Swets, Measuring the accuracy of diagnostic systems, Science 240 (1988), 1285–1293.