
Chapter

OPTIMIZATION FOR MULTI LAYER

PERCEPTRON: WITHOUT THE GRADIENT

Dr Bojan Ploj

ABSTRACT

During the last twenty years, learning in the field of Feed Forward Artificial Neural Networks has been dominated by gradient-based methods. These are derivatives of the Backpropagation method and share its deficiencies, including an inability to cluster data, reduce noise, quantify the quality of the learning data, or eliminate redundant learning patterns. Other potential areas for improvement include the random initialization of the free parameters, dynamic learning from new data as it becomes available, and the explanation of the states and settings in the hidden layers of a trained ANN, among others.

This chapter deals with a contemporary, non-gradient approach to ANN learning, which is no longer based on the gradual reduction of the remaining learning error and which tries to eliminate most of the mentioned deficiencies. The introduction includes a chronological description of some methods that address these problems: Initializing Neural Networks using Decision Trees (Arunava Banerjee, 1994), DistAl: An inter-pattern distance-based constructive learning algorithm (Jihoon Yang, 1998), Geometrical synthesis of multilayer feedforward neural networks or Multi-Layer Perceptron (Rita Delogu, 2006) and Bipropagation - a new way of MLP learning (Bojan Ploj, 2009). We continue with the description of a new learning method - the Border Pairs Method (BPM), which in comparison with the gradient methods offers numerous advantages and eliminates most of its predecessors' deficiencies. The BPM identifies and uses border pairs - pairs of learning patterns in the input space which are located close to the class border.

The number of border pairs gives us some information about the complexity of the learning problem. Border pairs are also the perfect basis for noise reduction; we show that it is sufficient to perform noise reduction on the border pairs alone.
By dividing the input space, homogeneous areas (clusters) are established. For every linear segment of the border we assign one neuron in the first layer. MLP learning begins in the first layer by adapting individual neurons. Neurons in the first layer operate in saturation, so we get a binary code at the output of the first layer - the code is unified for all members of the same cluster. Logical operations based on the data from the first layer are executed in the following layers. Testing showed that such learning is reliable, is not subject to overfitting, is appropriate for on-line learning and supports concept drift during learning (forgetting and additional learning).

1. INTRODUCTION

Artificial Neural Networks (ANN) have been known for several decades and during this time have become well accepted and one of the most common intelligent systems. One of their best features is that they are easy to use, because the user does not need to know anything about their inner workings (black box principle). Their learning is unfortunately not as simple and efficient, so scientists have continually tried to improve it since ANNs came into existence. The most frequently used are iterative learning methods, which are based on the gradual reduction of the cumulative learning error and have many shortcomings:

•Do not detect and eliminate the noise

•Do not know how to estimate the complexity of the learning patterns

•Do not cluster data

•Do not find features

•Unreliable learning

•Inaccurate learning

•Unconstructive learning

•Non-resistance to over-fitting


•No elimination of barren patterns

•Non-modular learning

•Iterative learning

•Unsuitable for dynamic learning

These iterative methods are based on the calculation of the error gradient, so the learning error decreases in a number of small steps, and in doing so the system often gets stuck in a local minimum.
In the following text we will first review gradient descent learning and then present two new algorithms: Bipropagation and the Border Pairs Method. The first algorithm, Bipropagation, is a smaller improvement of the well-known Backpropagation algorithm and still remains a gradient descent method. The second algorithm, the Border Pairs Method, is a completely new design and more or less successfully does away with all the enumerated disadvantages of the gradient methods. Some areas of the BPM that have not yet been thoroughly researched will also be mentioned.

2. GRADIENT LEARNING

A typical representative of gradient descent learning is Perceptron - a simple

Feedforward Neural Network (FFNN), introduced by Frank Rosenblatt in 1957

[11, 14, 38]. All neurons in Perceptron are arranged in one straight line, which is

called the layer (Figure 1). Its input and output values can be continuous or

discrete. The learning process is called delta rule and takes place gradually over a

number of repetitions (iterative process).


Figure 1. Perceptron with three inputs, two neurons and two outputs. The small circles in

the graph are inputs, the large circles are neurons, and the arrows are synapses.
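To illustrate the iterative nature of this learning, the following is a minimal sketch of the delta rule for a single threshold neuron; the learning rate, the number of epochs and the toy AND problem are illustrative choices, not values taken from the chapter.

```python
import numpy as np

def delta_rule(X, t, lr=0.1, epochs=100):
    """Train one threshold neuron with the delta rule (illustrative sketch)."""
    w = np.zeros(X.shape[1])   # weights, initialised to zero here for simplicity
    b = 0.0                    # bias
    for _ in range(epochs):
        for x, target in zip(X, t):
            y = 1.0 if np.dot(w, x) + b > 0 else 0.0   # step activation
            w += lr * (target - y) * x                 # delta rule weight update
            b += lr * (target - y)                     # delta rule bias update
    return w, b

# A linearly separable toy problem (logical AND):
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)
w, b = delta_rule(X, t)
```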

The structure and learning of the Perceptron are simple, but so are the problems it can effectively solve: learning is only successful when the learning patterns can be separated by a straight line or its multidimensional equivalent (linear separability). Real-life problems are usually not so simple.

With the serial connection of several perceptrons a multi-layer perceptron (MLP) is created, as shown in Figure 2, which no longer has the mentioned linear constraint. If the MLP is large enough, it can be used to solve any demanding task (Kolmogorov's theorem), but with its size the complexity of its learning also increases. Its original learning algorithm is called Backpropagation. Discovered by Paul Werbos in 1974, it gained acceptance only in 1986 [14] and then caused a resurgence of neural networks, which had already slowly begun to sink into oblivion. The event was so remarkable that the period before this resurgence is called the AI Winter, because until then there was no appropriate learning algorithm for the multi-layer perceptron, and the single-layer perceptron gave too modest results.

Figure 2. Multi-layer Perceptron. It has four inputs (small circles), three neurons in the

hidden layer (lower large circles) and two neurons in the output layer (upper large circle).

The arrows represent connections or synapses.

Backpropagation is an algorithm for supervised learning, where the desired output values for each learning pattern are known. During the learning process the weights of the synapses and the biases are iteratively changed so that the cumulative error over all learning patterns is gradually reduced. These changes are based on the gradient of the cumulative learning error; therefore, during the learning process one can fall into a local minimum and get stuck there. At the start of learning, the weights of the synapses are randomly selected, which normally means that they are far from the optimal values, resulting in slow, lengthy and unreliable learning.

Over the years, the backpropagation algorithm has received a number of variants (Quickprop, Quasi-Newton, Levenberg-Marquardt...), which all preserve the common properties of gradient descent learning and its associated drawbacks:

•Slowness - for successful learning a large number of iterations are

needed, where free parameters are calculated gradually in small steps.

Usually, several thousand such increments are required to yield the corresponding learned neural network. This is not the only cause of slowness: besides the useful learning patterns (support vectors), barren ones are also used.

•Unreliability - during the learning process the cumulative error of the learning patterns is iteratively calculated and gradually reduced with gradient descent corrections. We look for the global minimum, but it frequently happens that we wander off and get stuck in a local minimum, and thus conclude the learning process while the learning error is still much greater than acceptable. In this case, the initial values of the free parameters are changed and the whole learning process is repeated as many times as required until there is a change of luck. Due to the unconstructiveness of these learning methods (see next paragraph) we have to guess an appropriate MLP construction. It is also possible that, due to an unsuitable MLP construction, learning is never successful. Too small an MLP cannot be satisfactorily learnt; too big an MLP is subject to overfitting, which is also not good.

•Unconstructiveness - during learning with a gradient descent algorithm we do not know anything about the optimum number of layers or the optimal number of neurons in the individual layers of the neural network. The optimal construction is looked for through guesswork: we learn many diverse MLPs and in the end use the best one. Due to the unreliability of learning (see previous paragraph), we must learn each MLP several times.

The last two shortcomings reinforce each other, so it may happen that learning gets into a vicious circle from which there is no way out. Because of these shortcomings, learning an MLP with gradient methods is complex and relatively inefficient.
Apart from the above shortcomings, this type of learning raises a number of questions:

•Is it necessary for the learning process to use all the learning

patterns, or might it suffice to use the already selected patterns? How to

choose these patterns? We can assume that learning with selected patterns

will be quicker and with fewer complications. A similar idea is used by

the method of SVM [9, 22, 34], where the selected learning patterns are

called support vectors.

•Is it possible to find the initial value of the free parameters of MLP

that are better than random? Is it possible to determine the values of

free parameters in a single step (non-iterative)? In part, this is true for the

bipropagation method, where some weights are determined in advance

[1].

•Can the use of a feature with a greater dimension than the input

space simplify and improve learning? How to find these potential

features? Usually the feature has a smaller dimension than the original

learning pattern, since it is aimed to eliminate redundancy in the data.

The support vector method goes the other way round: it seeks features of higher dimension in which the patterns are linearly separable [34].

•Does there exist in the hidden layers of the learned MLP some kind

of rule? Does that rule in some way inform something about the

functioning of the internal layers of MLP? Until now, it was felt that the

MLP is a black box, which at the output gives a result that is not

accompanied by an explanation [14].

•Is it possible during the learning process to find a (near) optimal MLP construction for the given learning patterns? Usually, the term optimal construction is understood to mean a small MLP that still solves the given problem, because small MLPs are less subject to overfitting [14]. When we want a robust MLP it is necessary to add some redundant neurons that can replace damaged ones. Some other methods find a suitable MLP construction during learning, although it is not necessarily optimal.

•Is it possible during the learning process to reduce the noise of the learning patterns and thus simplify learning and increase its accuracy? There are several methods of noise reduction that are carried out before learning begins. Optimal noise reduction takes place during the learning process; it is carried out to the appropriate extent and affects only the relevant learning patterns.

•Is it possible to assess, already before the learning process, the quality of the learning patterns and how difficult it will be to learn them? During the learning process it is good to know what kind of learning data is being dealt with, because requirements and expectations can then be adjusted. If the learning data is bad (noise, overlapping, unrepresentativeness...), we should not expect a good learning outcome [54].

•Is it possible to perform clustering data during the learning process?

Clustering input data during the learning process means that the complex

problem is divided into several simpler ones, which also facilitates an

understanding of how MLP solves the given problem.

•Is effective incremental and online MLP learning possible? With the Backpropagation algorithm it is usually necessary to repeat the whole learning process from the beginning because of one additional learning pattern [54]. An ideal algorithm allows the continuation of prior learning.

•Is it possible to handle concept drift during the learning process? The same applies to the removal of obsolete learning patterns as to the addition of fresh ones: because of a single removed or added learning pattern we do not want to repeat the whole learning process.

3. BIPROPAGATION ALGORITHM

The Bipropagation algorithm is an intermediate link between the gradient and non-gradient algorithms. It was introduced by Bojan Ploj in 2009 [1] as an improvement of the backpropagation algorithm. Bipropagation retains its predecessor's iterativeness and non-constructiveness and takes over the idea of kernel functions from the support vector machine. The kernel function in the first layer partially linearises the input space and thus allows faster and more reliable learning of the subsequent layers. It often happens that with difficult learning patterns the backpropagation method completely fails, while thanks to this linearization the bipropagation method is fast, efficient and reliable under the same conditions.

The original idea of the Bipropagation algorithm is that the hidden layers of the MLP obtain desired values. Thus the N-layer perceptron is divided into N single-layer perceptrons, and with that the complex learning problem is divided into several simpler problems, independent of each other. Learning thus becomes easier, faster and more reliable than with the backpropagation method. The prefix bi- in the name of the algorithm arose because the corrections of the synapse weights spread in both directions (forward and backward) during learning.

The gradualness of the bipropagation method from layer to layer is also evident in the matrices of Euclidean distances between learning patterns. The elements at the same position of the matrix gradually change their values from the input layer towards the output (Equations 1, 2 and 3).

3.1. Description of the Bipropagation Algorithm

We will describe the algorithm using the logical XOR function as an example. As already stated, the basic idea of the bipropagation algorithm is that the inner layers are no longer hidden but receive desired output values. With this measure, learning of the entire MLP is divided into the learning of individual, single layers, which has no problem with local minima.
A suitable MLP construction for this problem is shown in Figure 3. The XOR function was chosen for this example because it is small and yet contains local minima, which cause problems for many other methods.


Figure 3. Multi-layer perceptron, which successfully solves the XOR problem.

Table 1. Logical XOR function.

X  Y  XOR
0  0  0
0  1  1
1  0  1
1  1  0

The matrix Ri of Euclidean distances (Equation 1) of the input space for the data shown in Table 1 looks like this:

$$R_i = \begin{pmatrix} 0 & 1 & 1 & \sqrt{2} \\ 1 & 0 & \sqrt{2} & 1 \\ 1 & \sqrt{2} & 0 & 1 \\ \sqrt{2} & 1 & 1 & 0 \end{pmatrix} \quad \text{(Equation 1)}$$

The elements of the Ri matrix are the Euclidean distances between individual learning patterns. The distance between patterns n and m is found in row n and column m of the matrix. Because the Euclidean distance is commutative, the Ri matrix is symmetric about its main diagonal, which contains only zeros, since these are the distances from each learning pattern to itself. In a similar way we get the matrix of Euclidean distances for the data of the output layer:

$$R_o = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix} \quad \text{(Equation 2)}$$

Table 2. Inner values Xn and Yn.

X  Y  Xn  Yn  XOR
0  0  0   0   0
0  1  0   1   1
1  0  1   0   1
1  1  0   0   0


How the inner values Xn and Yn shown in Table 2 are obtained will be explained a little later. For now we are interested only in the matrix Rn of Euclidean distances in the inner layer:

$$R_n = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & \sqrt{2} & 1 \\ 1 & \sqrt{2} & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix} \quad \text{(Equation 3)}$$

We can see a similarity between the matrices Ri, Rn and Ro. If we ignore the anti-diagonal, the values of the elements at the same positions in all three matrices are the same. Let us look at the anti-diagonal in more detail. The input matrix Ri has all its anti-diagonal values equal to the square root of two, the inner matrix Rn already contains two zeros there, and the output matrix Ro contains only zeros. We note that the values of the Rn matrix lie between the values of Ri and Ro at the same positions, and that these values change gradually, which gives us hope that we are on the right path to a solution.
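The gradual change of the distance matrices can be checked numerically. The following is a minimal sketch, assuming NumPy, that computes Ri, Rn and Ro for the XOR data of Tables 1 and 2; the function name is illustrative.

```python
import numpy as np

def distance_matrix(P):
    """Pairwise Euclidean distances between the rows (patterns) of P."""
    P = np.asarray(P, dtype=float)
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

inputs  = [[0, 0], [0, 1], [1, 0], [1, 1]]   # input patterns from Table 1
inner   = [[0, 0], [0, 1], [1, 0], [0, 0]]   # inner values (Xn, Yn) from Table 2
outputs = [[0], [1], [1], [0]]               # XOR output values

Ri, Rn, Ro = map(distance_matrix, (inputs, inner, outputs))
# The anti-diagonal shrinks from sqrt(2) (Ri) through two zeros (Rn) to all zeros (Ro).
```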

The biggest problem of the bipropagation method is the search for suitable

inner desired values (Xn and Yn), which allow a gradual decline of values on the

same position of R matrices. The author of the algorithm proposes two ways to

search for relevant inner desired values:

1 Analytical: In this method of defining the desired values of inner layers a

new concept is introduced called patterns quality (PQ).

Equation 4

Where is:

∑DDC = the sum of the distances between patterns of different classes

∑DSC = the sum of the distances between patterns of the same class

We want the pattern quality to increase with each transition to the next layer, i.e. the members of the same class should draw closer to each other and the members of different classes should move further apart. To calculate the values of the inner layers a non-linear system of equations with free parameters (kernel functions) is used. When calculating the quality of the patterns for the inner layer, we obtain an expression that contains only constants and free parameters. By choosing the free parameters, the quality of the patterns in the inner layer is maximized.
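The exact form of Equation 4 is not reproduced here; the sketch below assumes that the pattern quality is the ratio of the summed between-class distances (∑DDC) to the summed within-class distances (∑DSC), which is consistent with the definitions above but should be treated as an assumption.

```python
import numpy as np

def pattern_quality(P, labels):
    """Assumed form of PQ: sum of between-class distances divided by
    sum of within-class distances (higher = better separated classes)."""
    P, labels = np.asarray(P, dtype=float), np.asarray(labels)
    ddc = dsc = 0.0
    for i in range(len(P)):
        for j in range(i + 1, len(P)):
            d = np.linalg.norm(P[i] - P[j])
            if labels[i] != labels[j]:
                ddc += d      # distance between patterns of different classes
            else:
                dsc += d      # distance between patterns of the same class
    return ddc / dsc
```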

An example of such a non-linear system of equations is shown in Equation 5. The pairs (x, y) form a two-dimensional set of learning patterns and play the role of variables; the constants a, b, c, d, e and f are free parameters.

$$x_n = a\,x + b\,y + c\,x y, \qquad y_n = d\,x + e\,y + f\,x y \quad \text{(Equation 5)}$$

After the transformation with equation 5 we get new inner values:

Table 3. Transformed inner values.

From Table 2        From Equation 5
Xn    Yn            Xn        Yn
0     0             0         0
0     1             b         e
1     0             a         d
0     0             a+b+c     d+e+f

From this the free parameters can easily be calculated:

a = 1, b = 0, c = -1, d = 0, e = 1, f = -1 (Equation 6)

By inserting them in the non-linear system of equations we get:

$$x_n = 1\cdot x + 0\cdot y - 1\cdot x y = x - x y, \qquad y_n = 0\cdot x + 1\cdot y - 1\cdot x y = y - x y \quad \text{(Equation 7)}$$
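A quick check, under the reconstruction of Equation 7 above, that the transform maps the XOR patterns onto the inner values of Table 2 (sketch; names are illustrative):

```python
def inner_values(x, y):
    """Kernel transform of Equation 7: x_n = x - x*y, y_n = y - x*y."""
    return x - x * y, y - x * y

patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
transformed = [inner_values(x, y) for x, y in patterns]
# -> [(0, 0), (0, 1), (1, 0), (0, 0)], matching the inner values (Xn, Yn) in Table 2
```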

2 Graphically: In this way the desired values of the inner layer are selected in the interval between the input values and the desired output values; their arithmetic mean can be used. In the case of Figure 4 all the grey circles move halfway towards the point (0, 0) and all the white circles halfway towards the point (1, 1). If there are more inner layers, the principle of gradualness is used: in a four-layer perceptron a move of one third is carried out in the first inner layer and a move of two thirds in the second inner layer.
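A minimal sketch of this graphical rule, assuming NumPy; the class-to-target mapping mirrors Figure 4 and the function name is illustrative.

```python
import numpy as np

def move_towards(patterns, labels, targets, fraction=0.5):
    """Move each pattern a given fraction of the way towards its class target point."""
    patterns = np.asarray(patterns, dtype=float)
    labels = np.asarray(labels)
    out = patterns.copy()
    for cls, target in targets.items():
        mask = labels == cls
        out[mask] += fraction * (np.asarray(target, dtype=float) - out[mask])
    return out

# XOR example from Figure 4: class 0 moves halfway to (0, 0), class 1 halfway to (1, 1)
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]
desired_inner = move_towards(X, y, targets={0: (0, 0), 1: (1, 1)})
```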


The Bipropagation algorithm has been tested with different sets of learning patterns and has shown itself to be fast and reliable. Learning the logical XOR function with the bipropagation method runs more than 25 times faster than with the backpropagation method [1]. In addition, the number of necessary learning epochs is much more constant (the standard deviation is smaller), which indicates that we depend less on luck in choosing the initial values of the weights.


Figure 4. Graphical survey of learning samples prior to the transformation (left) and after it (right).

Testing with real learning data also gives good results. In this test we recognized handwritten decimal digits. The learning set consisted of approximately 60 thousand digits, read at a resolution of 28 x 28 points and contributed by more than 500 different writers. Despite the extensive data, all the learning digits were correctly identified after only some ten iterations.
The test digits also yielded good results. The Backpropagation method completely failed in the same circumstances, as its learning error did not even begin to decrease. Used in this comparison was the Levenberg-Marquardt variant of backpropagation, which is considered one of the most effective gradient learning methods.

4. BORDER PAIRS METHOD

One step more advanced than the Bipropagation method is the Border Pairs Method - a constructive, non-gradient method of machine learning for classification. It provides much more than the gradient methods: validation of the learning patterns, elimination of barren learning patterns, clustering of learning patterns, formation of features, noise reduction and classification of the learning patterns. The whole new method is carried out non-iteratively, reliably, without being subject to overfitting, and on top of all that it is also suitable for dynamic learning (online or offline) and for concept drift [54]. Among other things, it differs from the bipropagation method in that the boundary between the classes is completely linearized, piecewise, section by section.

4.1. Border Pairs Definition

First, we define the concept of near points. We want to define the word near as accurately and simply as possible, and at the same time in a way suitable for an input space with an arbitrary number of dimensions. These requirements are met by the Euclidean distance dAB between the points A and B in N-dimensional space:

$$d_{AB} = \sqrt{\sum_{i=1}^{N} (A_i - B_i)^2} \quad \text{(Equation 8)}$$

where N is the number of dimensions of the input space.

All points in two-dimensional space that lie at the same Euclidean distance from a starting point (the center) form a circle. Figure 5 shows the points A and B, which are called near points if inside the intersection of their circles - each with a radius corresponding to their mutual Euclidean distance (the hatched area) - there is no third point.


Figure 5. Points A and B in two-dimensional space. The hatched area represents the intersection of the circles around those points.

Definition: A border pair is a pair of two near points belonging to different classes, i.e. points whose mutual intersection of circles contains no other pattern.
In three-dimensional space the circles are replaced with spheres, and in spaces with four or more dimensions with hyperspheres.
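A minimal sketch of this definition, assuming NumPy: an opposite-class pair is kept as a border pair if no third pattern lies inside the lens formed by the two circles of radius dAB. The function and variable names are illustrative.

```python
import numpy as np
from itertools import combinations

def border_pairs(P, labels):
    """Find border pairs: opposite-class points whose lens-shaped circle
    intersection (radius = their mutual distance) contains no third pattern."""
    P, labels = np.asarray(P, dtype=float), np.asarray(labels)
    pairs = []
    for i, j in combinations(range(len(P)), 2):
        if labels[i] == labels[j]:
            continue                          # only opposite-class pairs qualify
        d = np.linalg.norm(P[i] - P[j])
        empty = True
        for k in range(len(P)):
            if k in (i, j):
                continue
            # k is inside the intersection if it is within distance d of both points
            if np.linalg.norm(P[k] - P[i]) < d and np.linalg.norm(P[k] - P[j]) < d:
                empty = False
                break
        if empty:
            pairs.append((i, j))
    return pairs
```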

4.2. The Impact of the Learning Data Feature on Border Pairs

Figure 6 shows an example of randomly distributed learning patterns. The dashed line is the optimum borderline; the black circles are negative learning patterns and the white ones positive. Border pair participants are drawn as larger circles, and participants of the same border pair are connected by a line. In Figure 6 we can see seven border pairs; the same learning pattern can also be involved in more than one pair (for example, pattern no. 50). Since the data shown do not include noise, all the white circles are on the right side of the border line and all the black ones on the left.


Figure 6. An Example of learning patterns at the start.

4.2.1. Influence of the Number of Patterns on the Number of Border Pairs

This research assesses how the number of learning patterns influence the

number of border pairs. The start data changes the number of learning patterns

from ten to two hundred according to the logarithmic scale, with the number of

positive patterns always the same as the number of negatives. In order to eliminate

the influence of coincidence the research was repeated a hundred of times and the

average was calculated. The data in Table 4 show the absolute number and

relative share of the learning patterns that are participants in border pairs. [54]

Table 4. Impact of the number of patterns on the share of border pairs.

Number of patterns in one class | Average number of border pairs | Relative share of participants
10   | 2.96  | 0.2960
20   | 4.08  | 0.2040
50   | 6.25  | 0.1250
100  | 9.42  | 0.0942
200  | 13.0  | 0.0650

It was found that by increasing the number of patterns the number of border pairs also increases, but much more slowly than the number of patterns, and therefore the relative proportion of participants decreases. A twentyfold increase in the number of learning patterns (from 10 to 200) causes only about a fourfold increase in the number of border pairs (from 2.96 to 13.00).

4.2.2. The Impact of the Ratio of Learning Patterns on the Number of

Border Pairs

The ratio between the number of positive and negative patterns also has an impact on the number of border pairs. As in the previous experiment, we excluded the impact of randomness by repeating the calculation a hundred times and averaging the results. It was found that the maximum number of pairs is obtained when there are equally many positive and negative patterns; as the difference grows, the number of pairs declines (see Table 5).
We note that it makes sense for the minority class to contain at least 20 percent of the learning patterns, otherwise the number of border pairs decreases strongly and the determination of the borderline becomes difficult.

Table 5. Impact of the ratio between patterns on the number of border pairs.

Number of positive patterns | Number of negative patterns | Share of participants

10 90 0.0774

20 80 0.1034

30 70 0.1222

40 60 0.1236

50 50 0.1250

60 40 0.1237

70 30 0.1225

80 20 0.1031

90 10 0.0779

4.2.3. The Impact of Noise on the Number of Border Pairs

By increasing the noise, the number of border pairs also increases. The noise intensity in Table 6 is given as a percentage of the entire range of input values; 2% noise therefore means that the value of the variable x after adding noise lies in the interval x ± 0.01.


Table 6. Impact of noise intensity on the number of border pairs.

Noise intensity in % Average number of border pairs

0 6.25

1 6.38

2 6.49

5 7.14

10 8.27

20 10.13

50 22.17

100 38.62

From the research results in Table 6 it is evident that as the noise increases, the number of border pairs also increases monotonically.

4.2.4. The Impact of Noise on Outliers

Here we research how the noise intensity affects the number of outliers - learning patterns which, due to noise, cross the boundary between the classes. In doing so, we observe two separate types of outliers: participants and non-participants in border pairs. From the research results in Table 7 we find:

1) The vast majority of outliers are involved in border pairs.

2) A very large majority of patterns, which are not border pair participants,

are not outliers.

Table 7. Impact of noise on outliers.

Noise (%) | Average number of outliers | Average number of participating outliers | Average number of border pairs | Average number of participants | Share of outliers participating in border pairs (%) | Share of non-outliers in patterns outside of border pairs (%)

0 0.00 0.00 6.37 11.33 - 100.00

1 0.26 0.25 6.44 11.41 96.15 99.99

2 0.62 0.59 6.55 11.55 95.16 99.97

5 1.19 1.12 7.07 12.11 94.12 99.92

10 2.49 2.38 7.46 12.46 95.58 99.87

20 4.69 4.27 10.32 15.57 91.04 99.50

50 12.43 11.31 22.26 30.15 90.99 98.40

100 25.38 22.66 39.90 52.93 89.28 94.22


5. NOISE REDUCTION WITH BORDER PAIRS

Noise is an undesirable data component and is difficult to remove. Because it affects both the learning and the test data, it can move the learning patterns - and with them the border line - in one direction and the test patterns in the opposite direction. There are therefore two causes of misclassification:
•Noise in the learning patterns causes us to define the wrong position of the border line during learning.
•Noise in the test patterns causes them to cross the given border line.

5.1. The Principle of Noise Reduction with Border Pairs

In noise reduction we decide which learning patterns will be moved, in which direction and by how much. From the results of the research on the impact of noise on the border pairs we can conclude two things:
1) It does not make sense to reduce the noise of (i.e. move) learning patterns that are not involved in border pairs, as it is very likely that they are not outliers.
2) It is reasonable to reduce the noise of learning patterns that are involved in border pairs by moving them towards the nearest pattern of the same class that is not involved in border pairs. They can also be moved away from the nearest non-participating pattern of the opposite class.

The results of noise reduction will be seen later, under the title Noisy data.
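A minimal sketch of this rule, assuming NumPy and the border_pairs helper sketched earlier; the step size of 0.5 is an illustrative assumption, since the chapter does not fix how far a participant is moved.

```python
import numpy as np

def reduce_noise(P, labels, pairs, step=0.5):
    """Pull every border-pair participant part of the way towards the nearest
    same-class pattern that is NOT involved in any border pair (sketch)."""
    P = np.asarray(P, dtype=float).copy()
    labels = np.asarray(labels)
    participants = {i for pair in pairs for i in pair}
    for i in participants:
        candidates = [j for j in range(len(P))
                      if j not in participants and labels[j] == labels[i]]
        if not candidates:
            continue                                   # nothing to move towards
        nearest = min(candidates, key=lambda j: np.linalg.norm(P[j] - P[i]))
        P[i] += step * (P[nearest] - P[i])             # denoised position
    return P
```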

6. CLUSTERING DATA WITH THE BORDER PAIRS METHOD

Clustering is a procedure that divides the input space so as to obtain two or more homogeneous areas [43]. An area is homogeneous if it contains learning patterns of only one class. The border between areas is a line in two-dimensional space, a surface in three-dimensional space, and a hyper-surface in four or more dimensions. In general, these lines, surfaces and hyper-surfaces can also be curved, i.e. non-linear.


We will limit ourselves to linear separation, since in the continuation we will use this algorithm for learning an MLP in which only LTU neurons, which have a linear border line or (hyper)surface, are used.

6.1. Clustering with Border Pairs

The basic idea of clustering with border pairs is simple. Between the two patterns of a border pair we draw a line (Figure 8) that separates them, called a "border line" (lines a, b and c). The border lines divide the whole input area into several areas (A, B, C...). Homogeneous areas of learning patterns are called "clusters". The same border line can separate several border pairs; in doing so, the separated patterns of the same class must lie in the same half-plane. At most as many border lines as there are border pairs are needed to separate all pairs; if two or more border pairs are separated by the same line, we have successfully reduced the number of necessary border lines.
The idea of clustering with the border pairs method formed during the study of learned MLPs, when we observed the behavior of individual neurons in the inner layer and came to the following conclusions:

•That the output value of neurons is always close to the value 0 or 1, in

spite of the fact that they have a continuous transfer function - neurons

operate in saturation.

•That the value at the output of the first layer does not change as long as

we stay within the same area, which means that the input area is divided

into several homogeneous areas - clusters.

•That in the second and any subsequent layers a logic operation is

performed with the data of the previous layer.

To each area of the input space belongs a binary code with as many bits as there are border lines and neurons in the first layer. Each bit tells us on which side of the corresponding border line we are. The area codes obtained in this way are features.
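A minimal sketch of such an area code, assuming NumPy: each border line is represented as a (weight vector, bias) pair and each bit is the side of that line the pattern falls on. The three example lines standing in for lines a, b and c of Figure 8 are hypothetical values.

```python
import numpy as np

def area_code(x, border_lines):
    """Binary code of the area containing pattern x: one bit per border line,
    telling on which side of that line the pattern lies."""
    x = np.asarray(x, dtype=float)
    return tuple(int(np.dot(w, x) + b > 0) for w, b in border_lines)

# Hypothetical border lines (w, b) standing in for lines a, b and c of Figure 8
lines = [(np.array([1.0, 0.0]), -0.5),
         (np.array([0.0, 1.0]), -0.5),
         (np.array([1.0, 1.0]), -1.5)]
code = area_code([0.2, 0.8], lines)   # -> (0, 1, 0) for this example point
```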


Figure 7. Homogeneity of areas. The area on the right side of the dotted line is not homogeneous, because it contains a pattern from the opposite class (white circle).

The described clustering and feature determination are based solely on border pairs, i.e. on the Euclidean distance; for this reason everything can also be generalized to an input space with an arbitrary number of dimensions.

Let us look at the impact of noise on the position of the border line. If we add a little noise to the individual learning data, the learning patterns each move slightly in a different direction (Figure 9). Thus the noise of some patterns (patterns 1, 3 and 7) tends to move the border line to the left, and the noise of other patterns (patterns 2, 4, 5, 6 and 8) to the right.


Figure 8. Learning dataset with border lines and areas. 1, 2, 3, …, 18: learning patterns. A, B, C, …, G: areas. a, b, c: border lines. Arrows indicate the border pairs.


Figure 9. Noisy data. Dashed line – pattern without noise. Solid line – pattern with noise.


Table 8. Area codes of the input space.

area | a | b | c

A 1 1 1

B 0 1 1

C 1 1 0

D 1 0 1

E 0 1 0

F 1 0 0

G 0 0 1

Since their effects on the border line are contradictory, they partially cancel out, and therefore the position of a border line that is surrounded by numerous border pairs hardly changes due to noise. This applies as long as the noise is small enough that the learning patterns do not cross the border line.

6.2. Complication in Clustering With Border Pairs

Let us see whether, by separating all border pairs, we obtain only homogeneous areas. The three border lines in Figure 10 successfully separate all border pairs, the pair (12, 15) even twice. The input area is thus divided into five sections, if we count only those that contain at least one learning pattern. All five areas are homogeneous (one white and four black), so the clustering succeeded. This simple algorithm was used for the clustering:
Algorithm 1: A simple algorithm for clustering a set of training data
Step 1: Find the border pairs.
Step 2: Separate all border pairs with as few border lines as possible.

In the case of more complex learning patterns after separation of all the

border pairs, there sometimes remain a few learning patterns on the wrong side of

the line, and some areas remain heterogeneous (Figure 11). Algorithm 1 finds

only border pairs on the left and right sides of the square (the white patterns),

which are separated by the two vertical lines. The area in the middle remains

heterogeneous, as except for the white patterns it contains two black patterns (39

and 40).


Figure 10. Successful clustering of triangular learning data.


Figure 11. Deficient clustering. The central area with white patterns is not homogeneous; the distracting patterns are the black ones numbered 39 and 40. The pair (6, 40) was not detected because of interference from the black patterns numbered 18, 19, 36 and 37 in the adjacent areas.


The question arises as to why these two patterns do not participate in any of the border pairs. If we observe only the patterns in the central area separately, we find that patterns 39 and 40 do participate in border pairs; in searching for them we were distracted by the learning patterns in the neighboring areas. The formation of the upper border pair was hampered by the black patterns numbered 18, 19, 36 and 37.

Figure 12. Successful clustering of learning patterns. Shown are nine homogeneous areas,

one positive and eight negative.

The same applies below. This realization makes it possible to improve the algorithm so that it finds all the border pairs among the learning patterns. If a heterogeneous area remains after the separation of the border pairs, it is dealt with once again, but this time separately from the other areas. With this upgrade of the algorithm we find, in every heterogeneous area, additional border pairs that were overlooked by Algorithm 1 (Figure 12). The search for border pairs is completed only when all areas are homogeneous. The improved algorithm for finding border pairs is as follows:

Algorithm 2: Improved algorithm for clustering learning data sets


Step 1: Find the border pairs.
Step 2: Separate the border pairs with border lines.
Step 3: Check the homogeneity of the resulting areas.
Step 4: If you find a heterogeneous area, search in it for additional border pairs and continue with Step 2.

Table 9. Number of border pairs found in different data sets.

Data set | Number of border pairs (algorithm 1 / algorithm 2) | Number of border lines (algorithm 1 / algorithm 2)

Vine 1 14 18 1 2

Vine 2 25 27 2 4

Vine 3 11 11 1 1

Iris Setosa 2 2 1 1

Iris Versicolor 25 27 2 4

Iris Virginica 12 15 3 5

Digit 0 4 5 2 3

Digit 1 12 16 2 5

Digit 2 5 6 1 2

Digit 3 5 11 1 4

Digit 4 8 11 2 5

Digit 5 8 9 2 3

Digit 6 5 6 1 1

Digit 7 8 10 2 3

Digit 8 6 12 2 6

Digit 9 9 10 2 3

The improved algorithm for clustering data with border pairs functioned correctly on all test datasets. The clustering results for sixteen sets of learning data with both algorithms are shown in Table 9, where it is clear that the improved Algorithm 2 usually finds a few more border pairs than the simple Algorithm 1.

6.3. Combining of Border Pairs

We have already mentioned that the same border line can separate two or more border pairs. Likewise, we know that the number of border lines corresponds to the number of neurons in the first layer of the MLP. From these facts it follows that each line should separate as many pairs as possible if we want to get a small MLP (a near minimal construction). Which border pairs can be separated by the same line is a difficult question, and it needs to be researched further.

Algorithm 3: Algorithm for combining border pairs
Step 1: Search for the border pairs.
Step 2: Take a new neuron and learn it with the data of the next border pair.
Step 3: Add a new (nearby) border pair and teach the same neuron. If learning succeeds, retain the new pair and mark it as used, otherwise discard it.
Step 4: Continue with Step 3 up until the last border pair.
Step 5: Continue with Step 2 until all border pairs are used.
Step 6: Check all areas of the input space. If an area is heterogeneous, search for additional border pairs and continue with Step 2.

Algorithm 3 tries to combine each border pair with all the other border pairs

and therefore is time consuming. The result also depends on the order of

combining attempts.
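A minimal sketch of this greedy combination, assuming NumPy and 0/1 class labels: one threshold neuron is trained with the perceptron rule on the points of the accepted pairs, and a candidate pair is kept only if the neuron can still separate everything. The helper names, learning rate and epoch limit are illustrative.

```python
import numpy as np

def separable_by_one_neuron(points, targets, epochs=200, lr=0.2):
    """Try to learn one threshold neuron separating the given points;
    return (w, b) on success, None otherwise (perceptron-rule sketch)."""
    X, t = np.asarray(points, dtype=float), np.asarray(targets, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, t):
            y = 1.0 if np.dot(w, x) + b > 0 else 0.0
            if y != target:
                w += lr * (target - y) * x
                b += lr * (target - y)
                errors += 1
        if errors == 0:
            return w, b
    return None

def combine_pairs(P, labels, pairs):
    """Greedily merge border pairs that one neuron can separate (Algorithm 3 sketch)."""
    P, labels = np.asarray(P, dtype=float), np.asarray(labels)
    unused, neurons = list(pairs), []
    while unused:
        accepted = [unused.pop(0)]                 # start a new neuron with the next pair
        for pair in list(unused):
            trial = accepted + [pair]
            pts = [P[i] for p in trial for i in p]
            tgt = [labels[i] for p in trial for i in p]
            if separable_by_one_neuron(pts, tgt) is not None:
                accepted.append(pair)              # keep the pair for this neuron
                unused.remove(pair)
        pts = [P[i] for p in accepted for i in p]
        tgt = [labels[i] for p in accepted for i in p]
        neurons.append(separable_by_one_neuron(pts, tgt))
    return neurons                                 # one (w, b) per first-layer neuron
```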

7. CLASSIFICATION OF DATA WITH THE BORDER

PAIRS METHOD

7.1. Description of the Data Classification with Border

Pairs Method

The binary values at the perceptron output, which we obtained by clustering data with the border pairs method, can be used to classify the data into classes, and this is feasible in several ways. One of them is Boolean algebra (the use of logic functions), since the data produced by the clustering are binary. In this research we continue with the use of additional perceptrons: to the perceptron that clusters the input data we cascade new perceptrons. The added neural network is called the multi-layer perceptron remainder (MLP remainder). We have researched two possibilities for the construction of the MLP remainder:


1 The layers in the remainder of the MLP are formed in exactly the same way as the first layer was created. When only one neuron remains in the next layer, the construction of the MLP is concluded.
2 All subsequent layers are treated as an additional MLP that is learned with one of the established gradient methods. It turns out that the eventual bottleneck is only in the first layer, so learning of the additional MLP runs quickly and reliably. In all cases the learning error declines rapidly and monotonically, so it seems that the residual-learning-error function of the additional MLP contains no local minimum.
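As a sketch of the second option, the following trains a single sigmoid output neuron by plain gradient descent on the binary cluster codes produced by the first layer; for the XOR example of Section 7.2.1 this remainder reduces to learning a logical OR. NumPy is assumed and the hyper-parameters are illustrative.

```python
import numpy as np

def train_remainder(codes, targets, lr=0.5, epochs=2000):
    """Train one sigmoid output neuron on the binary cluster codes
    produced by the first (border-pair) layer - a minimal 'MLP remainder' sketch."""
    X = np.asarray(codes, dtype=float)
    t = np.asarray(targets, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        y = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid output for all patterns
        grad = y - t                             # gradient of the cross-entropy loss
        w -= lr * X.T @ grad / len(X)
        b -= lr * grad.mean()
    return w, b

# XOR cluster codes from the first layer (see Table 11) and their classes
codes, targets = [(0, 0), (0, 1), (1, 0)], [0, 1, 1]
w, b = train_remainder(codes, targets)
```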

7.2. Examples of Learning with the Border Pairs Method

Both of the classification approaches just described were tested on a number of real and synthetic learning data sets. Linearly separable sets of learning patterns are already classified in the first layer (a plain perceptron), so the subsequent layers are not needed for them. That is why we started the research with the non-linear XOR learning data set.

7.2.1. XOR

A feature of the XOR set is that it contains only four learning patterns, which are only two-dimensional, yet it causes problems for numerous learning algorithms. The reason is the local minima of the XOR error function, into which gradient methods often drift and usually get stuck. In that case learning stops while the residual learning error is still very large, or too large.

Let us look at the course of learning the XOR function with the Border Pairs Method. First, we find the border pairs. From Figure 13 it is evident that between the pairs of points A and B, A and C, B and D, and C and D there are no intermediate points, so we have four border pairs: AB, AC, BD and CD. After the search for border pairs, their combining is next. The pairs AB and AC can be combined, as they can be separated by the same line, i.e. one neuron. Because one line is also sufficient for the remaining pairs, we have succeeded in separating all the border pairs with only two lines, which means that the first layer of the MLP contains only two neurons. Thus the search, separation and combining of border pairs are done easily, quickly and without drifting away. At the output of the first layer we obtain:


Figure 13. Input space of XOR function. A, B, C, D: learning patterns, a, b: border lines.

Table 10. Inner values.

Learning pattern | X | Y | Xn | Yn
A | 0 | 0 | 0 | 0
B | 0 | 1 | 0 | 1
C | 1 | 0 | 1 | 0
D | 1 | 1 | 0 | 0

Because the learning data are two-dimensional, the situation in the inner layer can be drawn. From Figure 14 it is evident that the data at the output of the inner layer are linearly separable; consequently one additional neuron in the second layer, which is simultaneously the last, is sufficient for the final result. We come to the same conclusion if we build the second layer in the same manner as the first. The transformed learning patterns obtained at the output of the first layer are given in Table 11 and serve as the input to the remaining layers of the MLP.
The first and fourth learning patterns (A and D) are both mapped to the value An = (0, 0). Thus only three learning patterns remain for the second layer of the MLP, and they form two border pairs: (An, Bn) and (An, Cn). Because these can be separated by the same straight line, we once again find that a single neuron is sufficient in the second layer.


Figure 14. Feature space of the XOR function. Points A, B and C remain in their positions; point D is moved to point A.

Table 11. Learning data of the inner and output MLP layers.

Learning pattern | Xn (inner layer) | Yn (inner layer) | XOR (output layer)
An = A = D | 0 | 0 | 0
Bn = B     | 0 | 1 | 1
Cn = C     | 1 | 0 | 1
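To make the resulting construction concrete, the following is a hand-built 2-2-1 network with saturated (threshold) neurons that realises the two border lines and the output neuron described above. The specific weight values are illustrative choices that reproduce Tables 10 and 11, not values computed in the chapter.

```python
import numpy as np

def ltu(x, w, b):
    """Linear threshold unit (saturated neuron)."""
    return 1 if np.dot(w, x) + b > 0 else 0

def xor_mlp(x, y):
    """Hand-built 2-2-1 MLP for XOR; the two hidden neurons realise
    border lines a and b (illustrative weight values)."""
    xn = ltu([x, y], [1, -1], -0.5)     # fires only for pattern C = (1, 0)
    yn = ltu([x, y], [-1, 1], -0.5)     # fires only for pattern B = (0, 1)
    return ltu([xn, yn], [1, 1], -0.5)  # output layer: logical OR of the features

assert [xor_mlp(x, y) for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```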

7.2.2. Triangle

In the case of the triangular learning dataset we determine which two-dimensional points lie inside a triangle. The dataset of learning patterns is similar to those in Figures 8 and 10; the difference is only in the number of learning patterns. This time we used many more learning patterns (200), which are no longer evenly distributed, since their positions are random. Around a quarter of these patterns are positive, i.e. lie inside the triangle. Due to the random position of the samples, the learning process was repeated ten times and the mean result and standard deviation were calculated. The results were compared with those obtained by the backpropagation method. Table 12 shows that the border pairs method gives substantially better results. In fact, the advantage of the border pairs is even greater than seen in the table, because with this method we found a nearly optimal MLP structure, which we then also used for learning with the reference backpropagation method, making it more successful. As a matter of interest we mention another finding: on the triangle the BPM was not always restricted to three straight lines; sometimes there were more of them. This phenomenon is due to the random position of the learning patterns and the primitive algorithm for combining border pairs.

Table 12. Comparative results of the triangle learning.

Number of test RMSE error

BPM BP

1 0.036 0.200

2 0.075 0.128

3 0.008 0.417

4 0.016 0.072

5 0.024 0.240

6 0.048 0.051

7 0.048 0.339

8 0.115 0.026

9 0.008 0.155

10 0.016 0.042

Average 0.039 0.167

Standard deviation 0.034 0.133

7.2.3. Recognition Of Irises

For the first set of real data to test the BPM method we have chosen Irises, as

it is one of the most popular and oldest sets [5]. It contains data on three types of

irises – Iris Setosa, Iris Virginica and Iris Versicolor, each having 50 instances.

For each flower are given four parameters: the length and width of the petal and

the length and width of the sepal. Some researchers in the field of cluster analysis

due to the partial overlap of clusters cite irises as a difficult set of data.

Overlapping is predominant with the iris species of Iris Versicolor and Iris

Virginica [5].

Because in this research we were interested in how successfully the BPM separates patterns that partly overlap, we used the whole set of data for learning and tested with the same data. We used the "one against all" approach and began by identifying the iris type Iris Setosa. In doing so, it was shown that the complete set of training data contains only two border pairs and that only four of the 150 learning patterns were needed for successful learning. Due to the favorable disposition of the border pairs, a single border hyper-plane is sufficient to separate them.
The classification of the remaining two types of irises (Iris Virginica and Iris Versicolor) was conducted in the same way. We got only slightly more border pairs and hyper-planes than in the case of Setosa. In all three cases the BPM succeeded in separating all the learning data correctly. Data on the remaining RMSE learning errors are given in Table 13.

Table 13. MLP suitable for learning for "Irises" datasets.

Given

class

Border

pairs

Neurons in

1st layer

RMSE error

BPM BP SVM DT

Setosa 2 1 0.0000 0.0071 0.0000 0.0000

Versicolor 14 5 0.0000 0.1114 0.0816 0.1323

Virginica 12 4 0.0000 0.1129 0.1155 0.1325

Figure 15. Clustering of irises. White circles represent Iris Setosa, the black ones Iris Virginica or Iris Versicolor.


For comparison we also learned with other methods (bipropagation BP, support vector machine SVM and decision tree DT), which all turn out to be inferior because they have a greater remaining RMSE.

Due to the relatively small number of learning patterns it is difficult in this

dataset to determine whether the MLP is overfitted. When we randomly halved

the data and used one half of the data for learning and the other half for testing, it

turned out that in all methods we got just one or two incorrectly classified irises,

which means good generalization. This did not surprise us, because it usually

holds that MLP with a small number of neurons has a good generalization.

Figure 15 shows the separation of Iris setosa from the other type of Irises. In

the figure we transformed the four-dimensional data into two-dimensional. On the

X axis we add up the width and length of the petal, and on the Y axis, the width

and length of sepal. Despite this primitive reduction of the dimensions, in the

figure is still visible the separation of Iris setosa from the others, as the black and

white circles are not mixed together.

7.2.4. Pen-Based Recognition Of Handwritten Digits

Pen-Based Recognition of Handwritten Digits is the name of a comprehensive, real, validated and verified set of learning data [6]. It contains digits written by 44 people with an electronic pen at a resolution of five hundred by five hundred pixels. The data set is preprocessed (equal size of digits and centering in the middle of the frame). An individual digit is described by seventeen attributes. The first sixteen numbers represent the x and y coordinates of eight points; these are integers between zero and one hundred, recorded with the electronic pen at intervals of one hundred milliseconds. The seventeenth number represents the class to which the digit belongs.

Table 14 Examples of handwritten decimal digits.

Eight coordinates of pen Digit

1 2 3 4 5 6 7 8

0 39 2 62 11 5 63 0 100 43 89 99 36 100 0 57 0

0 57 31 68 72 90 100 100 76 75 50 51 28 25 16 0 1

0 89 27 100 42 75 29 45 15 15 37 0 69 2 100 6 2

35 76 57 100 100 92 68 66 81 38 82 9 32 0 0 17 3

0 100 7 92 5 68 19 45 86 34 100 45 74 23 67 0 4

13 89 12 50 72 38 56 0 4 17 0 61 32 94 100 100 5

99 100 88 99 49 74 17 47 0 16 37 0 73 16 20 20 6

0 85 38 100 81 88 87 50 84 12 58 0 53 22 100 24 7

47 100 27 81 57 37 26 0 0 23 56 53 100 90 40 98 8

74 87 31 100 0 69 62 64 100 79 100 38 84 0 18 1 9


Examples of learning patterns for all ten digits are shown in Table 14 and Figure 16.

Figure 16. Graphical representation of digits in the table 14.

Classification was first done using the BPM and then with three control methods - backpropagation (BP), support vector machine (SVM) and decision tree (DT). For learning we used two hundred samples, and for validation 3498 new patterns. The comparative learning results are given in Table 15. The reason for using a small number of learning patterns is the non-optimized source code of the learning program, which is written for an interpreter and is therefore very slow. For this reason we have not performed any speed measurements or comparisons. Despite the small set of learning data, it is evident from Table 15 that the selected learning patterns are representative and that the learning succeeded.
The percentage of misrecognized digits is similar to that of the support vector machine (SVM) and about one percentage point better than that of the decision tree. The backpropagation method fared better this time. This is probably due to overfitting with BPM, since it identified all the learning patterns correctly. As a point of interest, even a human does not correctly recognize all the digits in this dataset.


Table 15. A comparison of learning results on recognition of handwritten

digits.

Digit

Share of misclassified patterns [%]

BPM BP SVM DT

0 2.5 2.2 2.0 1.7

1 6.8 7.1 7.7 9.6

2 4.2 4.4 4.9 4.5

3 5.2 1.8 1.5 5.2

4 1.7 0.7 1.3 2.6

5 4.8 5.5 8.3 6.0

6 0.8 0.4 0.3 2.8

7 2.2 1.2 3.9 7.1

8 3.4 1.7 4.8 5.5

9 4.7 3.2 2.9 5.2

Average 3.63 2.82 3.76 5.02

7.2.5. Ionosphere

Ionosphere is a classification dataset obtained when using aviation radar [45].

Table 16. Learning results from a ionosphere data set.

Dataset Number of misclassified patterns

BPM BP SVM DT

1 65 50 81 106

2 50 47 59 33

3 71 110 72 82

4 62 76 64 50

5 55 48 66 53

6 111 104 126 69

7 126 126 126 126

Average 77.14 80.14 84.86 74.14

The dataset contains 351 patterns without missing values, each composed of 34 attributes and a class, which can be positive or negative. The complete set of patterns was divided into seven parts of 50 or 51 patterns. We learned seven times, each time with a different part of the dataset, but we always tested with the whole dataset. The classification results are shown in Table 16.


7.3. Noisy Data

In this research we determined how resistant the border pairs method is to noise in the learning data. The precise amount of noise in the learning data can be known only when we use an artificial set of learning data to which we add the noise ourselves. During the research we gradually increased the noise and measured the percentage of incorrectly identified test patterns; in doing so, we did not use early stopping of learning. Once again we used a two-dimensional set of learning patterns - the image of a square. The positions of the patterns are random and evenly distributed. There are 500 patterns, half for learning and the other half for the evaluation of the learning results. The first learning run was carried out without added noise, which was then increased (1%, 2%, 5% and 10%). Each learning run was repeated 10 times and the averages and standard deviations were calculated.
The obtained results were compared with the results of the backpropagation method.


Figure 17. Learning patterns that contain noise. Three circles have left the square, and a cross has entered it.


Table 17. RMSE error at different levels of noise.

Algorithm            |                      | Level of noise (%): 0 | 1      | 2      | 5      | 10
Backpropagation      | Average RMSE         | 0.0692 | 0.1257 | 0.1094 | 0.1333 | 0.1519
                     | Standard deviation   | 0.0290 | 0.1136 | 0.0953 | 0.1146 | 0.0782
Border pairs method  | Average RMSE         | 0.0399 | 0.0308 | 0.0497 | 0.0302 | 0.1046
                     | Standard deviation   | 0.0304 | 0.0250 | 0.0491 | 0.0235 | 0.0280

In evaluating the data in Table 17 it is also necessary to take into consideration that the BPM itself finds the optimal structure of the MLP, while the BP method cannot. Because, for comparability of the results, we used the same MLP structure for both methods, BP gains somewhat and its results are unjustifiably slightly better, but they are still worse than those of BPM.

8. DYNAMIC LEARNING WITH BORDER PAIRS METHOD

Machine learning is especially expedient when the conditions are dynamic [23] and new learning data can be added to the intelligent system during its operation. Two benefits are thus achieved:
•An increased set of learning data. Data gained during operation is added to the initial learning data. This is especially helpful when the initial set was small.
•Adaptation to new circumstances. Often in machine learning we know only part of the input vector. Among the unknown factors are also some that change very slowly, so that newer learning patterns contribute more to the quality of learning than the older ones. Some authors call this obsolescence of data or concept drift. In such circumstances it is reasonable to add new patterns and also to remove the old ones.


8.1. Dynamic Learning Approaches with the Border Pairs Method

Different machine learning approaches accommodate the addition and removal of learning patterns to different degrees. In the most "rigid" approaches, a single added or removed pattern requires the whole learning to be repeated. Such an approach is certainly not the most appropriate for dynamic intelligent systems. Unfortunately, the gradient learning methods, with backpropagation at the helm, belong to this group.

There are two strategies for learning in dynamic intelligent systems:

•Incremental learning. When enough new learning data has accumulated, the use of the intelligent system is interrupted and additional learning is performed. The criterion for additional learning is determined in advance, for example when a certain number of new learning patterns has accumulated, or when a greater change in the properties of the learning data is perceived.

•Online learning. When a new learning pattern appears, it is used immediately for further learning. The intelligent system is thus fully up to date, but the continuous flow of supplementary learning unfortunately slows down its operation.

When the dynamics of the system are large and there is enough time to learn, it makes sense to use additional online learning. If the dynamics of the system are small, or there is not enough time for continual further learning, incremental learning is the better choice.
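The two strategies differ only in when the model is updated, which the following control-flow sketch illustrates. The retrain and update callables, and the threshold of 100 patterns, are hypothetical placeholders for whatever learning procedure (BPM or otherwise) is used underneath; they are not part of the method itself.

class IncrementalTrainer:
    """Collect new patterns and retrain only when a criterion is met.

    `retrain` is any callable that rebuilds the model from a list of
    (pattern, target) tuples; the threshold is an illustrative criterion."""
    def __init__(self, model, retrain, threshold=100):
        self.model, self.retrain, self.threshold = model, retrain, threshold
        self.buffer = []

    def observe(self, x, y):
        self.buffer.append((x, y))
        if len(self.buffer) >= self.threshold:      # criterion decided in advance
            self.model = self.retrain(self.model, self.buffer)
            self.buffer.clear()
        return self.model


class OnlineTrainer:
    """Use every new pattern immediately for further learning."""
    def __init__(self, model, update):
        self.model, self.update = model, update     # `update` learns from one pattern

    def observe(self, x, y):
        self.model = self.update(self.model, x, y)  # model stays fully up to date
        return self.model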

The BP method has difficulties with dynamic learning. Additional learning often fails because the new learning patterns increase the residual learning error to the extent that additional learning can no longer satisfactorily reduce it. The synapses in the neural network behave as if they were "woody" and their values hardly change. During additional learning the network as a rule ends up in a local minimum, and the residual learning error does not even begin to decline. Sometimes the local minimum can be escaped by slightly and randomly changing the existing weight values. If this trick does not succeed, the network has to be learned again from the beginning. Let us see how appropriate BPM is for dynamic learning.


8.2. Incremental Learning with the Border Pairs Method

In incremental learning we first learn from the data that were available before learning began. This is followed by the use of the MLP, during which new data arise and are evaluated each time. If the error on the new data does not increase over time, the new patterns are simply added to the existing set of learning data.

Figure 18. Different positions of learning patterns during additional online learning. The full circle represents a new pattern, the dashed line the former border line. a) There is no need to move the border line. b) It is necessary to move the border line. c) The shift is not sufficient, an additional border line is needed.

Otherwise, when the error increases, we speak of a concept change, and it is then also reasonable to eliminate old data. Adding new and possibly removing old learning data is followed by additional learning, in which we only seek additional border pairs and, if needed, additional border lines (planes). Thus only that part of the first MLP layer is changed which corresponds to an area that has become heterogeneous. In non-constructive learning methods we do not know which pattern affects which part of the neural network, so for additional learning the entire set of learning patterns has to be used and the whole MLP relearned.
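A minimal sketch of this incremental step is given below. It assumes that a border pair is identified as a pair of opposite-class patterns that are mutual nearest neighbours across the class boundary, which is one simple way to operationalise the definition used in this chapter and not necessarily the exact criterion of the reference implementation; the boolean mask `affected`, marking the area that has become heterogeneous, is likewise a hypothetical input.

import numpy as np

def border_pairs(X, y):
    """Pairs of opposite-class patterns that are mutual nearest neighbours
    of the other class -- one simple way to identify border pairs."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    pairs = set()
    for i in range(len(X)):
        opp_i = np.where(y != y[i])[0]             # patterns of the other class
        j = opp_i[np.argmin(d[i, opp_i])]          # nearest opposite-class pattern to i
        opp_j = np.where(y != y[j])[0]
        if opp_j[np.argmin(d[j, opp_j])] == i:     # ...and i is nearest back to j
            pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)

def incremental_update(X_old, y_old, X_new, y_new, affected):
    """Search for additional border pairs only among the patterns of the
    affected (now heterogeneous) area plus the newly arrived patterns."""
    X = np.vstack([X_old[affected], X_new])
    y = np.concatenate([y_old[affected], y_new])
    return border_pairs(X, y)                      # only these pairs drive relearning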

8.3. Online Learning with the Border Pairs Method

The principle of online learning with the border pairs method is shown in Figure 18. During use, the MLP classifies unknown samples into classes and simultaneously ascertains how successfully the classification was done. When it is successful (Figure 18a), no further action is required: the borderline is still located in the right position, and the construction of the MLP and the values of its weights remain unchanged. Otherwise, when the classification is incorrect, additional learning is necessary, which can be done in several different ways:

A) With or without reconstruction of the MLP.

B) With or without forgetting or unlearning.

ad A) Figure 18a shows an arrangement of learning patterns that needs no additional learning, because the new learning pattern (full circle) is correctly classified (located on the right side of the borderline). When the classification is incorrect, we first determine which border line is closest to the new learning pattern. The neuron in the first layer that corresponds to this line is additionally learned, and the appurtenant line is thereby moved. To move the line we use only those border pairs that were already used in determining its location, together with the new learning pattern. By moving the borderline we try to get the new pattern onto the right side of the line, or at least close to it. When reconstruction of the MLP (adding neurons) is not allowed, this is the only possible measure. In Figure 18b such a shift is possible, but in Figure 18c moving the borderline is no longer sufficient. In the case of Figure 18c we have to decide:

•whether to add a new borderline (neuron), or

•whether to come to terms with the wrong classification.

When the learning patterns contain noise, it is usually advisable to accept the smaller classification error. This measure reduces excessive learning and improves the generalization of learning. When the error is greater, or a wrongly classified pattern lies far beyond the borderline, it is better to move the border line. If this is not feasible, the borderline needs to be replaced with two (breaking the borderline). Such a measure adds a new neuron to the first layer of the MLP.
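The "move the borderline" step can be pictured as refitting a single linear neuron. The sketch below uses a plain perceptron update over the border-pair participants that originally defined the line plus the new pattern; this is an illustrative stand-in for the neuron-learning procedure described earlier in the chapter, not a verbatim reimplementation of it.

import numpy as np

def refit_borderline(w, b, X_pairs, y_pairs, x_new, y_new, lr=0.1, epochs=200):
    """Adjust one first-layer neuron (w, b) using only the border-pair
    participants that defined this line plus the new pattern."""
    X = np.vstack([X_pairs, x_new])
    y = 2 * np.append(y_pairs, y_new) - 1        # map class labels {0, 1} -> {-1, +1}
    for _ in range(epochs):
        misclassified = False
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:           # pattern on the wrong side of the line
                w = w + lr * yi * xi             # classic perceptron correction
                b = b + lr * yi
                misclassified = True
        if not misclassified:                    # every pattern on the right side
            break
    return w, b, not misclassified               # flag: did the shift suffice?

If the returned flag is false, the shift did not suffice (the Figure 18c case) and one must either accept the error or add a new neuron, as discussed above.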

ad B) Everything written in point A applies to learning without forgetting or unlearning. The oldest and the newest learning patterns influence this kind of learning equally. So far we have used only local data without a time dimension; in this case we cannot talk about out-of-date data, and unlearning therefore does not make sense. When the data contains a time dimension, forgetting can be a welcome feature. As an example we mention the conditions in predicting stock exchange rates for securities. Forecasting stock exchange rates is based on past experience. The difficulty in stock exchange prediction is that we know only part of the factors affecting the stock quotes. All the unknown factors can be combined into one, which we call the "spirit of the time". Brokers have coined terms for it such as boom, recession, stagnation and the like. Old learning patterns were obtained under different economic conditions, in a different spirit of the time than the new ones. Older learning patterns can therefore be considered as (partially) out of date, and in learning we give them less weight or even eliminate them completely. Figure 19a shows the conditions without forgetting. Both border pairs remain in their positions. The new pattern has to remain on the wrong side of the borderline, because the position of the line can no longer be improved. If unlearning is allowed, we can loosen the old border pairs by increasing the distance between the two participants of a border pair. By doing this we gain some extra leeway for moving the borderline. Thus the new pattern in Figure 19b can be positioned on the right side of the borderline.

Figure 19. Learning with forgetting (unlearning). a) Circumstances before forgetting, where the border line cannot be moved. b) Circumstances after forgetting (the upper border pair is expanded), where the border line can be moved.

Let us now look at the algorithm for online learning with the border pairs

method:

Algorithm 4. Online learning with the border pairs method.

Step 1: Take the new learning pattern and classify it.
Step 2: If the classification succeeded, go to Step 6.
Step 3: Find all the patterns in this cluster.
Step 4: Find all border pairs in this cluster.
Step 5: Use the found border pairs for learning the neurons in the first MLP layer.
Step 6: Continue with Step 1 until the last online learning pattern is processed.
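Read as code, Algorithm 4 is a simple loop. The sketch below shows the control flow only; the four callables (classify, find_cluster, find_border_pairs, relearn_first_layer) are injected placeholders for the operations described in the text, not functions of any existing library.

def online_learning(mlp, stream, classify, find_cluster, find_border_pairs,
                    relearn_first_layer):
    """Control flow of Algorithm 4; the four callables stand in for the
    operations described in the text."""
    for x, target in stream:                      # Step 1: take the next pattern
        if classify(mlp, x) == target:            # Step 2: correct -> nothing to do
            continue
        cluster = find_cluster(mlp, x)            # Step 3: patterns of this cluster
        pairs = find_border_pairs(cluster)        # Step 4: border pairs in the cluster
        relearn_first_layer(mlp, pairs)           # Step 5: adjust first-layer neurons
    return mlp                                    # Step 6: loop until the stream ends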

8.3.1. Online Recognition of Digits

For testing the online algorithm (Algorithm 4) and evaluating its results, we again used the data set of handwritten digits with which we already became acquainted above. We took an MLP that had previously been learnt with the offline method and learned it further with the online method, so that after learning was concluded the online and offline results could be compared with each other. For additional online learning we used one hundred new learning patterns that had not yet participated in the offline learning and testing. Because the BPM method is constructive, the MLP structure can change during online learning; if necessary, new neurons are added in the various layers of the MLP. Normally only the first layer extends. The procedure for testing the quality of learning remained the same as in the offline method, again using the same 3498 test patterns. The results obtained with additional online learning are shown in Table 18. It was expected that the additional learning would improve the results, which in most cases also happened; but in some individual cases, such as digit 4, the result is somewhat worse. Overall, after additional learning the number of incorrectly identified digits decreased. The deterioration for some individual digits is attributed to noise in the additional learning patterns.

Table 18. Results of online learning.

           Misclassified (%)
Digit      Before    After
0          2.5       2.5
1          6.8       6.0
2          4.2       3.3
3          5.2       4.2
4          1.7       2.1
5          4.8       4.8
6          0.8       1.1
7          2.2       4.4
8          3.4       3.0
9          4.7       3.9
Average    3.63      3.50


9. CONCLUSION

The results of online learning conclude the description of BPM learning. The work has been successful and has produced some interesting findings. Here is a summary of the most important ones:

Complexity of data evaluation. From the number of border pairs and the share of learning patterns participating in border pairs, we can conclude how difficult the learning will be. Where only a few patterns participate in border pairs, the learning is not a problem. A large number of border pairs tells us that the learning is demanding or that the data contain a lot of noise.

Noise reduction. Using the border pairs method it is possible to successfully reduce noise in the data. This is done by finding the participants of border pairs and moving them towards the nearest patterns of the same class that do not participate in border pairs. In this way noise is reduced only in the relevant patterns. We found that the generalization of learning from noisy data is better with the border pairs method than with the backpropagation method.
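As an illustration, the denoising step can be sketched as follows. It assumes that each border-pair participant is pulled some fraction of the way towards its nearest same-class non-participant, which is one plausible reading of the procedure summarised above; the step size of 0.5 is an arbitrary illustrative choice.

import numpy as np

def reduce_noise(X, y, pair_indices, step=0.5):
    """Pull every border-pair participant towards its nearest same-class
    non-participant; only these relevant patterns are modified."""
    participants = sorted({i for pair in pair_indices for i in pair})
    X_clean = X.copy()
    for i in participants:
        candidates = [j for j in range(len(X))
                      if y[j] == y[i] and j not in participants]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: np.linalg.norm(X[k] - X[i]))
        X_clean[i] = X[i] + step * (X[j] - X[i])   # move part-way towards it
    return X_clean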

Clustering. With the border pairs method it is possible to cluster data. This is done by finding all border pairs and separating them linearly. If any heterogeneous area remains after the separation, it is divided further into subsections. This is repeated until all areas are homogeneous.

Search for features. With the border pairs method it is possible to find quality features. Each borderline has two sides, labelled binarily with 0 and 1. The features are formed by writing down, for each pattern, on which side of every borderline it lies. Since a cluster is located within a single area, all members of the same cluster also have the same features.
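A small sketch of how such binary features can be computed from the first-layer borderlines is given below; the weight matrix W and bias vector b stand for the already-learned first-layer neurons, and the numeric values in the example are purely illustrative.

import numpy as np

def binary_features(X, W, b):
    """Binary code of each pattern: on which side of every borderline it lies.

    W has one row per first-layer neuron (borderline), b one bias per neuron."""
    return (X @ W.T + b > 0).astype(int)   # members of one cluster share the same code

# Example with two borderlines in a 2-D input space (illustrative values only)
W = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([-0.5, -0.5])
X = np.array([[0.2, 0.8], [0.7, 0.7], [0.9, 0.1]])
print(binary_features(X, W, b))            # [[0 1], [1 1], [1 0]]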

Reliability of learning. Learning with the border pairs method is reliable, as it never got stuck and always ended successfully. The binary features formed in the first layer are combined in the further layers by a logical operation that calculates from the features to which class the pattern belongs.

Learning accuracy. Learning with the border pairs method is accurate. The

residual learning error was almost always less than that of the BP method.

Constructiveness. Learning with the border pairs method, unlike gradient methods, is constructive. During learning it finds a near-minimal construction, which is also the reason for the good generalization of the learning data.

Overfitting. The border pairs method has no problem with overfitting. The reason is that each linear border line (plane) is adapted to numerous learning patterns at the same time, so there is no risk of over-adjustment to individual patterns.


Selected learning patterns. Only those learning patterns that participate in border pairs, and therefore carry information, take part in learning. By separating out the useful learning patterns, a picture of the complexity of the learning data set can be formed even before learning.

Gradual and modular learning. Learning with the border pairs method runs progressively from the input layer towards the output. The complex learning process is thus divided into several smaller independent processes that can be solved separately (modularly).

Non-iterative learning. Learning with the border pairs method takes place non-iteratively. The free parameters are calculated separately for each neuron and in one step.

Suitability for dynamic learning. The border pairs method is suitable for

dynamic learning (incremental and online), as it allows the addition and (gradual)

removal of data, without having to repeat the whole process of learning.

Findings on the features of the BPM method are summarized in Table 19.

A comparison of the features of the BPM method with established methods is given in Table 20. Bold type denotes the approach with the lowest percentage of incorrectly identified test patterns, i.e. the best result.

The summary table shows that, regarding accuracy, the BPM method once achieved the best result among the three control methods and twice the second-best result. Overall, the BPM approach was the most successful of all. Due to the relatively small number of learning pattern sets used, this finding must be treated with caution.

Table 19. Overview of the BPM method features.

Noise reduction Yes

Validation of learning patterns Yes

Clustering data Yes

Searching for features Yes

Reliable learning Yes

Accurate learning Yes

Constructive learning Yes

Resistance to overfitting Yes

Elimination of barren patterns Yes

Modular learning Yes

Non-iterative learning Yes

Dynamic learning Yes


Table 20. Summary table of results.

             Misclassified patterns (%)
             BPM      BP       SVM      DT
Irises*      0.00     1.33     1.33     2.00
Digits       3.63     2.82     3.76     5.02
Ionosphere** 77.14    80.14    84.86    74.14

* The test data set was the same as the learning data set.
** Average number of misclassified patterns (cf. Table 16), not a percentage.

The research results have inspired us to further research work, which offers a number of options. Here are some of them:

•Optimizing the search for border pairs: In extensive datasets the search for border pairs can be a very time-consuming task, and it is therefore reasonable to use an optimized algorithm.

•Optimization of combining border pairs: When there are many border pairs in the data, there are also many different combinations in which they can be combined with each other. For complex learning data, this combining is the most time-consuming task in the classification process.

•Regression version of the BPM method: The addressed BPM method only allows classification (binary data). In nature many things are continuous and also take values between 0% and 100%, for example weather forecasting or temperature regulation. The discussed algorithm can probably be remodelled so as to become suitable for regression (continuous data).

•Multi-class method: The discussed method had only one neuron at the output, which conveyed whether the pattern belongs to a specific class. An MLP with multiple output neurons could decide among more than two classes.

•Implementation of the BPM method in the Weka software and other related validated tools for machine learning: Such an implementation would certainly facilitate the research work of many interested researchers and bring the method closer to them.


REFERENCES

[1] B. Ploj, Bipropagation - a new way of learning the multilayer perceptron (MLP) (in Slovenian), Proceedings of the Eighteenth International Electrotechnical and Computer Science Conference ERK 2009, Slovenian section IEEE, pp. 199-202, 2009

[2] Hebb, Donald Olding, The Organization of Behaviour: A

Neuropsychological Theory, 1949

[3] Yann LeCun, Corinna Cortes: http://yann.lecun.com/exdb/mnist, Handwritten digit database

[4] Jihoon Yang, Rajesh Parekh, Vasant Honavar: DistAl: An inter-pattern

distance-based constructive learning algorithm, Intelligent Data Analysis,

Volume 3, Issue 1, May 1999, Pages 55–73

[5] Iris data set, http://en.wikipedia.org/wiki/Iris_flower_data_set, 12. 1. 2013

[6] Pen-based handwritten digits data set, http://archive.ics.uci.edu/ml/support/Pen-Based+Recognition+of+Handwritten+Digits, 12. 1. 2013

[7] Weka software. http://en.wikipedia.org/wiki/Weka_(machine_learning) , 18.

4. 2013

[8] Decision tree, http://en.wikipedia.org/wiki/Decision_tree, 12. 3. 2013

[9] Support vector machine, http://en.wikipedia.org/wiki/Support_vector_machine, 12. 3. 2013

[10] Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine

Learning Tools and Techniques, Third Edition,The Morgan Kaufmann

Series in Data Management Systems, 2011

[11] Feed forward neural network, http://en.wikipedia.org/wiki/Feed-forward_neural_networks, 12. 3. 2013

[12] Recurrent neural network, http://en.wikipedia.org/wiki/Recurrent_neural_

networks, 12. 3. 2013

[13] Perceptron, http://en.wikipedia.org/wiki/Perceptron, 12. 3. 2013

[14] David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams, Learning representations by back-propagating errors, Nature, October 1986

[15] P.J.G. Lisboa, T.A. Etchells and D.C. Pountney, Minimal MLPs do not

model the XOR logic, School of Computing and Mathematical Sciences

[16] T. L. Andersen, T.R. Martinez, DMP3: A Dynamic Multilayer Perceptron

Construction Algorithm, Brigham Young University, Utah USA

[17] Iris data set, http://archive.ics.uci.edu/ml/datasets/Iris, 12. 3. 2013

[18] Wine data set, http://archive.ics.uci.edu/ml/datasets/Wine, 12. 3. 2013

[19] DistAl: An inter-pattern distance-based constructive learning algorithm,

Jihoon Yang, Rajesh Parekh, Vasant Honavar, Neural Networks

Proceedings, 1998. IEEE World Congress on Computational Intelligence.


The 1998 IEEE International Joint Conference, 4-9 May 1998, Volume: 3,

On Pages: 2208 - 2213 vol.3

[20] Geometrical synthesis of MLP neural networks, Rita Delogu, Alessandra

Fanni and Augusto Montisci, Neurocomputing,Volume 71, Issues 4–6,

January 2008, Pages 919–930,Neural Networks: Algorithms and

Applications, 4th International Symposium on Neural Networks

[21] Arunava Banerjee, Initializing Neural Networks using Decision Trees,

Computational learning theory and natural learning systems: Volume IV,

MIT Press Cambridge, 1997, ISBN:0-262-57118-8

[22] Cortes, Corinna; and Vapnik, Vladimir N.; "Support-Vector Networks",

Machine Learning, 20, 1995, http://www.springerlink.com/content/

k238jx04hm87j80g/, 12. 3. 2013

[23] Neapolitan, Richard; Jiang, Xia (2012). Contemporary Artificial

Intelligence. Chapman & Hall/CRC. ISBN 978-1-4398-4469-4.

[24] Mitchell, T.: Machine Learning, McGraw Hill, 1997, ISBN 0-07-042807-7,

p.2.

[25] Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar: Foundations of

Machine Learning, The MIT Press, 2012, ISBN 9780262018258.

[26] Ross, Brian H.; Kennedy, Patrick T: Generalizing from the use of earlier

examples in problem solving, Journal of Experimental Psychology:

Learning, Memory, and Cognition, Vol 16(1), Jan 1990, pages 42-55.

[27] Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar Foundations of

Machine Learning, The MIT Press, 2012, ISBN 9780262018258.

[28] Vapnik, V. N. The Nature of Statistical Learning Theory (2nd Ed.),

Springer Verlag, 2000

[29] Oded Maimon and Lior Rokach: DATA MINING AND KNOWLEDGE

DISCOVERY HANDBOOK, Springer, 2010

[30] Hipp, J.; Güntzer, U.; Nakhaeizadeh, G.: "Algorithms for association rule

mining - a general survey and comparison". ACM SIGKDD Explorations

Newsletter 2: 58. doi:10.1145/360402.360421, 2000

[31] J. J. HOPFIELD Neural networks and physical systems with emergent

collective computational abilities. Proc. NatL Acad. Sci. USA Vol. 79, pp.

2554-2558, April 1982 Biophysics

[32] Fogel, L.J., Owens, A.J., Walsh, M.J. (1966), Artificial Intelligence through

Simulated Evolution, John Wiley

[33] Muggleton, S. (1994). "Inductive Logic Programming: Theory and

methods". The Journal of Logic Programming. 19-20: 629–679.

doi:10.1016/0743-1066(94)90035-3


[34] Cortes, Corinna; and Vapnik, Vladimir N.; "Support-Vector Networks",

Machine Learning, 20, 1995. http://www.springerlink.com/content/

k238jx04hm87j80g/

[35] Ben-Gal, Irad (2007). Bayesian Networks (PDF). In Ruggeri, Fabrizio;

Kennett, Ron S.; Faltin, Frederick W. "Encyclopedia of Statistics in Quality

and Reliability". Encyclopedia of Statistics in Quality and Reliability. John

Wiley & Sons. doi:10.1002/9780470061572.eqr089. ISBN 978-0-470-

01861-3.

[36] John Peter Jesan, Donald M. Lauro: Human Brain and Neural Network

behavior a comparison, Ubiquity, Volume 2003 Issue November

[37] McCulloch, W. and Pitts, W. (1943). A logical calculus of the ideas

immanent in nervous activity. Bulletin of Mathematical Biophysics, 7:115 -

133.

[38] Rosenblatt, Frank, The Perceptron--a perceiving and recognizing

automaton. Report 85-460-1, Cornell Aeronautical Laboratory, 1957

[39] Russell, Ingrid. "The Delta Rule". University of Hartford, November 2012

[40] P.J.G. Lisboa, T.A. Etchells, D.C. Pountney: Minimal MLPs do not model

the XOR logic, Neurocomputing, Volume 48, Issues 1–4, October 2002,

Pages 1033–1037

[41] Deza, E.; Deza, M.: Dictionary of Distances, Elsevier, ISBN 0-444-52087-

2, 2006

[42] Roland Priemer : Introductory Signal Processing. World Scientific. p. 1.

ISBN 9971509199, 1991

[43] Estivill-Castro, V.: "Why so many clustering algorithms". ACM SIGKDD

Explorations Newsletter 4: 65. doi:10.1145/568574.568575, 2002

[44] Borodin, A.; El-Yaniv, R.: Online Computation and Competitive Analysis.

Cambridge University Press. ISBN 0-521-56392-5, 1998

[45] Ionosphere data set, http://archive.ics.uci.edu/ml/machine-learning-

databases/ ionosphere, 12. 3. 2013

[46] Alsmadi M. S., Omar B. K.: Back Propagation Algorithm: The Best Algorithm Among the Multi-layer Perceptron Algorithm, IJCSNS, April 2009

[47] Sharma K. S., Constructive Neural Networks: A Review, International Journal of Engineering Science and Technology, 2010, pp. 7847-7855

[48] Aizenberg I., Moraga C.: Multilayer Feedforward Neural Network Based on

Multi-Valued Neurons and Backpropagation Learning Algorithm, Soft

Computing, January 2007, pp. 169-183


[49] P. A. Castillo, J. Carpio, J. J. Merelo, A. Prieto, V. Rivas, G, Romero:

Evolving Multilayer Perceptrons, Neural Processing Letters, 2000, pp. 115-

127

[50] J. L. Subirat, L. Franco, I. Molina, J. M. Jerez:Active Learning Using a

Constructive Neural Network Algorithm, Constructive Neural Networks,

pp. 193-206, 2009, Springer Verlag

[51] Y. G. Smetanin: Neural networks as systems for recognizing patterns, Journal of Mathematical Sciences, 1998

[52] E. Ferrari, M. Muselli: Efficient Constructive Techniques for Training

Switching Neural Networks, Constructive Neural Networks, pp. 24-48,

2009, Springer Verlag

[53] J. F. C. Khaw, B. S. Lim, L. E. N. Lim: Optimal Design of Neural Networks

Using the Taguchi Method, Neurocomputing, 1995, pp. 225-245

[54] B. Ploj, R. Harb, M. Zorman, Border Pairs Method—constructive MLP

learning classification algorithm, Neurocomputing, Volume 126, 27

February 2014, Pages 180-187
