Chapter
OPTIMIZATION FOR MULTI LAYER
PERCEPTRON: WITHOUT THE GRADIENT
Dr Bojan Ploj
ABSTRACT
During the last twenty years, gradient-based methods have dominated the
field of Feed Forward Artificial Neural Network learning. They are
derivatives of the Backpropagation method and share various deficiencies.
These include an inability to cluster data, reduce noise, quantify the
quality of the learning data, or eliminate redundant learning patterns. Other
potential areas for improvement have been identified, including the random
initialization of the free parameters, dynamic learning from new data as
it becomes available, and the explanation of the states and settings in the
hidden layers of a learned ANN, among others.
This chapter deals with a contemporary, non-gradient approach to
ANN learning, which is no longer based on the gradual reduction of the
remaining learning error and which tries to eliminate most of the mentioned
deficiencies. The introduction includes a chronological description of some
methods that address these problems: Initializing Neural Networks
using Decision Trees (Arunava Banerjee, 1994), DistAl: An inter-pattern
distance-based constructive learning algorithm (Jihoon Yang, 1998),
Geometrical synthesis of multilayer feedforward neural networks or Multi-
Layer Perceptron (Rita Delogu, 2006) and Bipropagation - a new way of
MLP learning (Bojan Ploj, 2009). We continue with the description of a new
learning method - the Border Pairs Method (BPM), which in comparison with
the gradient methods carries numerous advantages and eliminates most of its
predecessors' deficiencies. The BPM identifies and uses border pairs -
pairs of learning patterns in the input space that are located close to the class
border.
The number of border pairs gives us some information about the
complexity of the learning task. Border pairs are also an excellent basis
for noise reduction; we determine that performing noise reduction on
the border pairs alone is sufficient.
By dividing the input space, homogeneous areas (clusters) are
established. For every linear segment of the border we assign one neuron in the
first layer. The MLP learning begins in the first layer by adapting individual
neurons. Neurons in the first layer are saturated, so we get a binary code at
the output of the first layer - the code is identical for all members of the same
cluster. Logical operations based on the data from the first layer are executed
in the following layers. Testing showed that such learning is reliable, is not
subject to overfitting, is appropriate for on-line learning, and accommodates
concept drift in the process of learning (forgetting and additional learning).
1. INTRODUCTION
Artificial Neural Networks (ANN) have been known for several
decades and during this time have become well accepted and one of the most
common intelligent systems. One of their best features is that they are easy to
use, because the user does not need to know anything about their inner workings
(black box principle). Their learning is unfortunately not as simple and efficient,
so scientists have continually tried to improve it since ANNs came
into existence. The most frequently used are the iterative learning methods, which
are based on the gradual reduction of the cumulative learning error and have
many shortcomings:
- No detection or elimination of noise
- No estimation of the complexity of the learning patterns
- No clustering of data
- No feature discovery
- Unreliable learning
- Inaccurate learning
- Unconstructive learning
- No resistance to over-fitting
- No elimination of barren patterns
- Non-modular learning
- Iterative learning
- Unsuitable for dynamic learning
These iterative methods are based on the calculation of the error gradient;
their learning error therefore decreases in a number of small steps, and in doing so the
system often gets stuck in a local minimum.
In the following text we will first look at gradient descent learning, and then
present two new algorithms: Bipropagation and the Border Pairs Method.
The first algorithm, Bipropagation, is a modest improvement of the well-known
Backpropagation algorithm and still relies on gradient descent. The second
algorithm, the Border Pairs Method, is a completely new design and more or less
successfully does away with all the enumerated disadvantages of the gradient
methods. We will also mention some aspects of the BPM which have not yet
been thoroughly researched.
2. GRADIENT LEARNING
A typical representative of gradient descent learning is the Perceptron - a simple
Feedforward Neural Network (FFNN) introduced by Frank Rosenblatt in 1957
[11, 14, 38]. All neurons in the Perceptron are arranged in a single row, which is
called a layer (Figure 1). Its input and output values can be continuous or
discrete. The learning procedure is called the delta rule and takes place gradually over a
number of repetitions (an iterative process).
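A minimal sketch of the delta rule for a single threshold neuron follows; the learning rate, epoch budget and the AND example are our own illustrative choices, not taken from the chapter.

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=100):
    # Delta rule for one threshold neuron: sweep over the patterns and
    # nudge the weights by eta * (target - output) * input until no errors remain.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, t):
            o = 1 if w @ x + b > 0 else 0      # neuron output
            w += eta * (target - o) * x        # weight update
            b += eta * (target - o)            # bias update
            errors += int(o != target)
        if errors == 0:                        # converged (linearly separable case)
            break
    return w, b

# Example: logical AND, a linearly separable problem a single layer can solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, t)
print(w, b, [1 if w @ x + b > 0 else 0 for x in X])
```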
Figure 1. Perceptron with three inputs, two neurons and two outputs. The small circles in
the graph are inputs, the large circles are neurons, and the arrows are synapses.
The learning and structure of the Perceptron are simple, but unfortunately so are
the problems it can effectively solve, as learning is only successful when the
learning patterns can be separated by a straight line or its multidimensional
equivalent (linear separability). Real-life problems are usually not so simple.
With the serial connection of several perceptrons a multi-layer perceptron
(MLP) is created, as shown in Figure 2, which no longer has the mentioned linear
constraint. If the MLP is large enough, it can be used to solve any demanding
task (Kolmogorov's theorem), but with its size the complexity of its learning also
increases. Its original learning algorithm is called Backpropagation. Discovered
by Paul Werbos in 1974, it gained acceptance only in 1986 [14], when it caused
a resurgence of neural networks, which had already slowly begun to sink into
oblivion. The event was so remarkable that the period before this resurgence is
called the AI Winter, because until then there was no appropriate learning algorithm
for the multi-layer perceptron, and the single-layer perceptron gave too modest results.
Figure 2. Multi-layer Perceptron. It has four inputs (small circles), three neurons in the
hidden layer (lower large circles) and two neurons in the output layer (upper large circles).
The arrows represent connections or synapses.
Backpropagation is an algorithm for supervised learning, where the
desired output values for each learning pattern are known. During the learning
process the weights of the synapses and the biases are iteratively changed so that the
cumulative error over all learning patterns is gradually reduced. These changes are
based on the gradient of the cumulative learning error; therefore, during the
learning process one can reach a local minimum and get stuck there. At the start of
learning, the weights of the synapses are randomly selected, which normally means
that they are far from their optimal values, resulting in slow, lengthy and unreliable
learning.
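For reference, all gradient descent variants share the same generic weight update; in standard textbook notation (with a learning rate η, patterns p and output neurons k) it reads:

```latex
E = \frac{1}{2}\sum_{p}\sum_{k}\left(t_{k}^{(p)} - o_{k}^{(p)}\right)^{2},
\qquad
\Delta w_{ij} = -\eta\,\frac{\partial E}{\partial w_{ij}},
\qquad
w_{ij} \leftarrow w_{ij} + \Delta w_{ij}
```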
Over the years, the backpropagation algorithm has received a number of
variants (Quickprop, Quasi-Newton, Levenberg-Marquardt...), which all preserve
the common properties of gradient descent learning and its associated
drawbacks:
Slowness - for successful learning a large number of iterations is
needed, in which the free parameters are adjusted gradually in small steps.
Usually, several thousand such increments are required to obtain a
correspondingly learned neural network. This is not the only cause of
slowness: in addition to the useful learning patterns (support vectors),
barren ones are also used.
Unreliability - during the learning process the cumulative error of the
learning patterns is iteratively calculated and gradually reduced with
gradient descent corrections. We look for the global minimum, but it
frequently happens that we drift off and get stuck in a local minimum,
and thus conclude the learning process while the learning error is still
much greater than acceptable. In this case, the initial values of the free
parameters are changed and the whole learning process is repeated as
many times as is required until our luck changes. Due to the
unconstructiveness of these learning methods (see the next paragraph) we have
to guess an appropriate MLP construction. It is also possible that, due to an
unsuitable MLP construction, learning is never successful. Too small an
MLP cannot be satisfactorily trained; too big an MLP is subject to overfitting,
which is also not good.
Unconstructiveness - during learning with a gradient descent algorithm
we do not learn anything about the optimum number of layers or the
optimal number of neurons in the individual layers of the neural network.
The optimal construction is sought through guesswork: we train
many diverse MLPs and in the end use the best one. Due to the unreliability
of learning (see the previous paragraph), we must train each MLP several
times.
The last two shortcomings reinforce each other, so learning may get into a
vicious circle from which there is no way out. Because of these
shortcomings, the learning of an MLP with the gradient methods is complex and
relatively inefficient.
Apart from the above shortcomings, a number of questions arise during this type
of learning:
Is it necessary to use all the learning patterns in the learning
process, or might it suffice to use selected patterns? How do we
choose these patterns? We can assume that learning with selected patterns
will be quicker and have fewer complications. A similar idea is used by
the SVM method [9, 22, 34], where the selected learning patterns are
called support vectors.
Is it possible to find initial values of the free parameters of the MLP
that are better than random ones? Is it possible to determine the values of
the free parameters in a single step (non-iteratively)? In part, this is true for the
bipropagation method, where some weights are determined in advance
[1].
Can the use of features with a greater dimension than the input
space simplify and improve learning? How do we find such potential
features? Usually a feature has a smaller dimension than the original
learning pattern, since the aim is to eliminate redundancy in the data.
The support vector method works the other way round: it seeks
features of larger dimensions, which are linearly separable [34].
Does some kind of rule exist in the hidden layers of the learned
MLP? Does that rule tell us something about the
functioning of the internal layers of the MLP? Until now, it was felt that the
MLP is a black box, which gives a result at the output that is not
accompanied by an explanation [14].
Is it possible during the learning process to find a (near) optimal MLP
construction for the given learning patterns? Under the term
optimal construction we usually understand a small MLP which still solves
the given problem, because small MLPs are less subject to overfitting [14].
When we want a robust MLP it is necessary to add some redundant
neurons that can replace damaged ones. Some other methods find a
suitable MLP construction during learning, but it is not necessarily
optimal.
Is it possible during the learning process to reduce the noise of the learning
patterns and thereby simplify learning and increase its accuracy?
There are several methods of noise reduction, which are carried out
before the beginning of learning. Optimal noise reduction takes
place during the learning process; it is carried out to an appropriate
extent and affects only the relevant learning patterns.
Is it possible to identify the quality of the learning patterns, and how
difficult learning will be, already before the learning process? It is
good to know during the learning process what kind of learning data we are
dealing with, because then requirements and expectations can be adjusted. If
the learning data is bad (noise, overlapping, unrepresentativeness...), we
should not expect a good learning outcome [54].
Is it possible to cluster the data during the learning process?
Clustering the input data during the learning process means that a complex
problem is divided into several simpler ones, which also facilitates an
understanding of how the MLP solves the given problem.
Is effective incremental and online MLP learning possible? With the
Backpropagation algorithm it is usually necessary to repeat the whole
learning process from the beginning because of a single additional
learning pattern [54]. An ideal algorithm allows the continuation of prior
learning.
Is concept drift possible during the learning process? What applies
to adding fresh learning patterns also applies to removing obsolete ones.
We do not want to repeat the whole learning process because of a single
removed or added learning pattern.
3. BIPROPAGATION ALGORITHM
The Bipropagation algorithm is an intermediate link between the gradient
and non-gradient algorithms. It was introduced by Bojan Ploj in 2009 [1] as an
improvement of the backpropagation algorithm. Bipropagation has retained its
predecessor's iterativeness and non-constructiveness and has taken over the idea of
kernel functions from the support vector machine. A kernel function in the first
layer partially linearizes the input space and thus allows faster
and more reliable learning of the subsequent layers. It often happens that with
difficult learning patterns the backpropagation method completely fails, while
thanks to this linearization the bipropagation method is fast, efficient and reliable
under the same conditions.
The original idea of the Bipropagation algorithm is that the hidden layers of the
MLP obtain desired values. Thus, an N-layer perceptron is divided into N
single-layer perceptrons and the complex learning problem is thereby divided
into several simpler problems, independent of each other. Learning thus becomes
easier, faster and more reliable than with the backpropagation method. The prefix bi- in
the name of the algorithm arose because the corrections of the synaptic weights
during learning spread in both directions (forward and backward).
The gradualness of the bipropagation method from layer to layer is also
evident in the matrices of Euclidean distances between the learning patterns (Equation
1). The elements at the same positions of the matrices gradually change their values
from the input layer towards the output (Equations 1, 2 and 3).
3.1. Description of the Bipropagation Algorithm
We will describe the algorithm with the example of the logical XOR function. As
already stated, the basic idea of the bipropagation algorithm is that the inner layers
are no longer hidden, but get desired output values. With this measure the learning
of the entire MLP is divided into the learning of two individual, linear layers, which
have no problems with local minima.
A suitable MLP construction for this problem is shown in Figure 3. The XOR
function was chosen for this example because it is not extensive yet contains
local minima, which cause problems for many other methods.
Figure 3. Multi-layer perceptron, which successfully solves the XOR problem.
Table 1. Logical XOR function.
X    Y    XOR
0    0    0
0    1    1
1    0    1
1    1    0
The matrix Ri of Euclidean distances (Equation 1) of the input space for the data
shown in Table 1 looks like this:

     [ 0    1    1    √2 ]
Ri = [ 1    0    √2   1  ]
     [ 1    √2   0    1  ]
     [ √2   1    1    0  ]          Equation 1
The elements of the Ri matrix are the Euclidean distances between individual
learning patterns. The distance between patterns n and m is found in row n and
column m of the matrix. Because the Euclidean distance is symmetric, the Ri matrix
is symmetric about its main diagonal, which contains only zeros, since these are the
distances of the learning patterns to themselves. In a similar way we get the matrix
of Euclidean distances for the data of the output layer:
     [ 0    1    1    0 ]
Ro = [ 1    0    0    1 ]
     [ 1    0    0    1 ]
     [ 0    1    1    0 ]          Equation 2
Table 2. The inner values Xn and Yn.
X    Y    Xn    Yn    XOR
0    0    0     0     0
0    1    0     1     1
1    0    1     0     1
1    1    0     0     0
How we get the inner values Xn and Yn shown in Table 2 will be explained a
little later. For now we are interested only in the matrix Rn of Euclidean distances
in the inner layer:
     [ 0    1    1    0 ]
Rn = [ 1    0    √2   1 ]
     [ 1    √2   0    1 ]
     [ 0    1    1    0 ]          Equation 3
We can see a similarity between the matrices Ri, Rn and Ro. If we ignore the
anti-diagonal, the values of the elements at the same positions of all three matrices
are the same. Let us look at the anti-diagonal in more detail. The input matrix Ri
has all its anti-diagonal values equal to the square root of two, the inner matrix Rn
already contains two zeros there, and the output matrix Ro contains only zeros. We note
that the values of the Rn matrix lie between the values of Ri and Ro at the same
positions, and that these values change gradually, which gives us hope that we are
on the right path to a solution.
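The gradual change of the distance matrices can be checked with a few lines of code. This sketch simply recomputes Ri, Rn and Ro from the values in Tables 1 and 2; nothing beyond those tables is assumed.

```python
import numpy as np

def distance_matrix(P):
    # pairwise Euclidean distances between the rows of P
    return np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)

inputs  = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # (x, y), Table 1
inner   = np.array([[0, 0], [0, 1], [1, 0], [0, 0]])   # (Xn, Yn), Table 2
outputs = np.array([[0], [1], [1], [0]])               # XOR, Table 1

Ri, Rn, Ro = (distance_matrix(P) for P in (inputs, inner, outputs))
print(Ri)  # anti-diagonal: sqrt(2) everywhere
print(Rn)  # anti-diagonal: two zeros appear
print(Ro)  # anti-diagonal: zeros only
```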
The biggest problem of the bipropagation method is the search for suitable
inner desired values (Xn and Yn), which allow a gradual decline of the values at the
same positions of the R matrices. The author of the algorithm proposes two ways to
search for suitable inner desired values:
1. Analytical: In this method of defining the desired values of the inner layers a
new concept is introduced, called patterns quality (PQ).
Equation 4
where:
∑DDC = the sum of the distances between patterns of different classes
∑DSC = the sum of the distances between patterns of the same class
We want the patterns quality to increase with each transition to the next layer,
that is, the members of the same class should draw closer to each other, and
members of different classes should distance themselves from each
other. To calculate the values of the inner layers, a non-linear system of equations
with free parameters (kernel functions) is used. When calculating the quality of
the patterns for the inner layer, we obtain an expression that contains only
constants and free parameters. By choosing the free parameters, the quality of the
patterns in the inner layer is maximized.
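A small sketch of the pattern-quality measure follows, assuming PQ is the ratio of the two sums defined above (the exact form of Equation 4 is not reproduced here, and the function name is ours); on the XOR data it shows PQ increasing from the input values to the inner values of Table 2.

```python
import numpy as np

def patterns_quality(P, labels):
    # assumed form: PQ = (sum of distances between patterns of different classes)
    #                  / (sum of distances between patterns of the same class)
    ddc = dsc = 0.0
    n = len(P)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(P[i] - P[j])
            if labels[i] == labels[j]:
                dsc += d
            else:
                ddc += d
    return ddc / dsc

labels = [0, 1, 1, 0]
print(patterns_quality(np.array([[0, 0], [0, 1], [1, 0], [1, 1]]), labels))  # ~1.41
print(patterns_quality(np.array([[0, 0], [0, 1], [1, 0], [0, 0]]), labels))  # ~2.83
```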
An example of such a non-linear system of equations is shown in Equation 5. The
pairs (x, y) form a two-dimensional set of learning patterns and play the role
of variables; the constants a, b, c, d, e and f are free parameters.

xn = a·x + b·y + c·x·y
yn = d·x + e·y + f·x·y          Equation 5
After the transformation with Equation 5 we get new inner values:
Table 3. Transformed inner values.
From Table 2          From Equation 5
Xn    Yn              Xn         Yn
0     0               0          0
0     1               b          e
1     0               a          d
0     0               a+b+c      d+e+f
From that the free parameters can easily be calculated:
a = 1, b = 0, c = -1, d = 0, e = 1, f = -1          Equation 6
By inserting them into the non-linear system of equations we get:

xn = 1·x + 0·y - 1·x·y = x - x·y
yn = 0·x + 1·y - 1·x·y = y - x·y          Equation 7
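A quick check of Equation 7 on the four XOR inputs (plain Python, just for verification): the transformed values match the inner values of Table 2, with the pattern (1, 1) collapsing onto (0, 0).

```python
# Apply the kernel transformation of Equation 7 to the XOR inputs
for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    xn = x - x * y
    yn = y - x * y
    print((x, y), '->', (xn, yn))
```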
2. Graphically: In this approach the desired values of the inner layer are selected in
the interval between the input values and the desired output values; their arithmetic
mean can be used. In the case of Figure 4, all the grey circles move halfway to the
point (0, 0), and all the white circles halfway to the point (1, 1). If there are more
inner layers, the principle of gradualness is used: in a four-layer perceptron, a
move of one third is carried out in the first inner layer and a move of two thirds in
the second inner layer.
The Bipropagation algorithm has been tested with different sets of learning patterns and
has shown itself to be fast and reliable. Learning the logical XOR function with the
bipropagation method runs more than 25 times faster than with the backpropagation
method [1]. In addition, the number of necessary learning epochs is much more
constant (the standard deviation is smaller), which indicates that we are less
dependent on luck in choosing the initial values of the weights.
Figure 4. Graphical survey of the learning samples prior to the transformation (left) and
after it (right).
Testing with real learning data also gives good results. In the test we
recognized handwritten decimal digits. The learning set consisted of
approximately 60 thousand digits that were scanned at a resolution of 28 x 28 points
and were contributed by more than 500 different writers. Despite the extensive
data, all the learning digits were correctly identified after only some ten
iterations.
The test digits also yielded good results. The Backpropagation method completely
failed in the same circumstances, as its learning error did not even begin to
decrease. Used in this comparison was the Levenberg-Marquardt variant of
backpropagation, which is considered one of the most effective gradient
learning methods.
4. BORDER PAIRS METHOD
One step more advanced than the Bipropagation method is the Border Pairs
Method - a constructive, non-gradient method for machine learning classification.
It provides much more than the gradient methods: validation of the learning
patterns, elimination of barren learning patterns, clustering of the learning
patterns, formation of features, noise reduction and classification of the learning
patterns. The whole new method is carried out non-iteratively and reliably, without
being subject to overfitting, and on top of all that it is also suitable for dynamic
learning (online or offline) and for concept drift [54]. Among other things, it
differs from the bipropagation method in that the border between the classes is
completely linearized, section by section.
4.1. Border Pairs Definition
First, we define the concept of near points. We want to define the word near as
accurately and simply as possible, and in a way suitable for an input space with an
arbitrary number of dimensions. These requirements are met by the Euclidean
distance dAB between the points A and B in N-dimensional space:

dAB = √( Σi (ai − bi)² ),  i = 1 … N          Equation 8

where N is the number of dimensions of the input space.
All points in the two-dimensional space that lie at the same Euclidean distance
from a starting point (the center) form a circle. Figure 5 shows the points A and B,
which are called near points if inside the intersection of their circles with radius
equal to their mutual Euclidean distance (the hatched area) there is no third point.
Figure 5. Points A and B in two-dimensional space. The hatched area represents the
intersection of the circles around those points.
Definition: A border pair is a pair of two near points belonging to different
classes, i.e. points whose circle intersection contains no other learning pattern.
In three-dimensional space the circles are replaced with spheres, and in spaces
with four or more dimensions with hyperspheres.
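The definition translates directly into a few lines of code. The sketch below is our own illustration of the test (points exactly on a circle are treated as lying outside the intersection); on the XOR data of Section 7.2.1 it returns the four pairs AB, AC, BD and CD.

```python
import numpy as np

def find_border_pairs(X, y):
    """Return index pairs (i, j) of opposite-class patterns that form border pairs.

    A pair is a border pair when no third pattern lies inside the intersection
    of the two circles (hyperspheres) of radius d(i, j) centred at i and j.
    """
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n = len(X)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            if y[i] == y[j]:
                continue
            r = D[i, j]
            empty = all(k in (i, j) or D[i, k] >= r or D[j, k] >= r
                        for k in range(n))
            if empty:
                pairs.append((i, j))
    return pairs

# XOR example: expect the four border pairs AB, AC, BD, CD
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])
print(find_border_pairs(X, y))   # [(0, 1), (0, 2), (1, 3), (2, 3)]
```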
4.2. The Impact of the Learning Data Feature on Border Pairs
Figure 6 shows an example of randomly distributed learning patterns. The
dashed line is the optimal borderline; the black circles are negative learning
patterns, and the white ones positive. Border pair participants are drawn with
larger circles, and participants of the same border pair are connected by a line. In
Figure 6 we can see seven border pairs; the same learning pattern can also
participate in more than one pair (for example, pattern no. 50). Since the data shown do not
include noise, all the white circles are on the right side of the border line, and all
the black ones on the left.
Figure 6. An Example of learning patterns at the start.
4.2.1. Influence of the Number of Patterns on the Number of Border Pairs
This research assesses how the number of learning patterns influences the
number of border pairs. Starting from the initial data, the number of learning patterns
is varied from ten to two hundred on a logarithmic scale, with the number of
positive patterns always equal to the number of negative ones. In order to eliminate
the influence of coincidence, the experiment was repeated a hundred times and the
average was calculated. The data in Table 4 show the absolute number and
the relative share of the learning patterns that participate in border pairs [54].
Table 4. Impact of the number of patterns on the share of border pairs.
Number of patterns in one class    Average number of border pairs    Relative share of participants
10                                 2.96                              0.2960
20                                 4.08                              0.2040
50                                 6.25                              0.1250
100                                9.42                              0.0942
200                                13.00                             0.0650
It was found that by increasing the number of patterns the number of
border pairs also increases, but much more slowly than the number of patterns, so
the relative proportion of participants decreases. A twenty-fold
increase in the number of learning patterns (from 10 to 200) causes only
about a four-fold increase in the number of border pairs (from 2.96 to 13.00).
4.2.2. The Impact of the Ratio of Learning Patterns on the Number of
Border Pairs
The ratio between the number of positive and negative patterns also has an
impact on the number of border pairs. As in the previous experiment, we
excluded the impact of randomness by repeating the calculation a
hundred times and averaging the results. It was found that the maximum number of pairs
is obtained when there are equally many positive and negative patterns; as the
difference increases, the number of pairs declines (see Table 5).
We note that it makes sense to have at least 20 percent of the learning patterns
in the minority class, because otherwise the number of border pairs decreases sharply
and determining the borderline becomes difficult.
Table 5. Impact of the ratio between patterns on the number of border pairs.
Number of positive patterns    Number of negative patterns    Share of participants
10                             90                             0.0774
20                             80                             0.1034
30                             70                             0.1222
40                             60                             0.1236
50                             50                             0.1250
60                             40                             0.1237
70                             30                             0.1225
80                             20                             0.1031
90                             10                             0.0779
4.2.3. The Impact of Noise on the Number of Border Pairs
Increasing the noise also increases the number of border pairs. The intensity
of the noise in Table 6 is given as a percentage of the entire range of the input values;
2% noise therefore means that after adding noise the value of the variable x
lies in the interval x ± 0.01.
Table 6. Impact of noise intensity on the number of border pairs.
Noise intensity (%)    Average number of border pairs
0                      6.25
1                      6.38
2                      6.49
5                      7.14
10                     8.27
20                     10.13
50                     22.17
100                    38.62
From the results in Table 6 it is evident that as the noise increases, the number
of border pairs also increases monotonically.
4.2.4. The Impact of Noise on Outliers
Here we investigate how the noise intensity affects the number of
outliers - learning patterns which, due to noise, cross the boundary between the
classes. In doing so, we observe two separate types of outliers: participants and
non-participants in border pairs. From the results in Table 7 we find that:
1) The vast majority of outliers are involved in border pairs.
2) A very large majority of patterns that are not border pair participants
are not outliers.
Table 7. Impact of noise on outliers.
Noise (%)    Average number of outliers    Average number of participating outliers    Average number of border pairs    Average number of participants    Share of outliers participating in border pairs (%)    Share of non-outliers in patterns outside of border pairs (%)
0 0.00 0.00 6.37 11.33 - 100.00
1 0.26 0.25 6.44 11.41 96.15 99.99
2 0.62 0.59 6.55 11.55 95.16 99.97
5 1.19 1.12 7.07 12.11 94.12 99.92
10 2.49 2.38 7.46 12.46 95.58 99.87
20 4.69 4.27 10.32 15.57 91.04 99.50
50 12.43 11.31 22.26 30.15 90.99 98.40
100 25.38 22.66 39.90 52.93 89.28 94.22
5. NOISE REDUCTION WITH BORDER PAIRS
Noise is an undesirable data component and is difficult to remove. Because
it affects both the learning and the test data, it can move the learning patterns - and
with them the border line - in one direction and the test patterns in the opposite
direction. There are therefore two causes of misclassification:
The noise of the learning patterns causes us to place the border line in the wrong
position during learning.
The noise of the test patterns causes them to cross the given border line.
5.1. The Principle of Noise Reduction with Border Pairs
In noise reduction we decide which learning patterns will be moved, in which
direction and by how much. From the results of the research on the impact of noise on
the border pairs we can conclude two things:
1) It does not make sense to reduce the noise of (i.e. to move) learning patterns
that are not involved in border pairs, as it is very likely that they are not outliers.
2) It is reasonable to reduce the noise of learning patterns that are involved
in border pairs by moving them toward the nearest pattern of the same
class that is not involved in any border pair. They can also be moved away from
the nearest non-participating pattern of the opposite class (a small sketch of this
rule follows below).
The results of noise reduction are presented later, in the section on noisy data.
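A minimal sketch of the moving rule, assuming the simpler of the two options above (moving toward the nearest quiet same-class pattern); the fractional step size is a hypothetical choice, and the `pairs` argument can come from a border-pair search such as the one sketched in Section 4.1.

```python
import numpy as np

def reduce_noise(X, y, pairs, step=0.25):
    """Move each border-pair participant a fraction `step` of the way toward the
    nearest same-class pattern that is not involved in any border pair."""
    participants = {i for pair in pairs for i in pair}
    X_new = X.astype(float).copy()
    for i in participants:
        candidates = [j for j in range(len(X))
                      if j not in participants and y[j] == y[i]]
        if not candidates:
            continue                       # no quiet same-class neighbour found
        j = min(candidates, key=lambda k: np.linalg.norm(X[k] - X[i]))
        X_new[i] = X[i] + step * (X[j] - X[i])
    return X_new
```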
6. CLUSTERING DATA WITH THE BORDER PAIRS METHOD
Clustering is a procedure which divides the input space so as to obtain two or
more homogeneous areas [43]. An area is homogeneous if it contains learning
patterns of only one class. The border between areas in a two-dimensional space is a
line; in three-dimensional space it is replaced by a surface, and in four or more
dimensions by a hyper-surface. In general, these lines, surfaces and hyper-surfaces
can also be curved, i.e. non-linear.
We will limit ourselves to linear separation, since in the continuation we will
use this algorithm for learning an MLP in which only LTU neurons, which have a
linear border line or (hyper)surface, are used.
6.1. Clustering with Border Pairs
The basic idea of clustering with border pairs is simple. Between the two patterns
of each border pair we draw a line (Figure 8) which separates them and is called a
"border line" (lines a, b and c). The border lines divide the whole input space into
several areas (A, B, C...). Homogeneous areas of learning patterns are called
"clusters". The same border line can separate several border pairs; in doing so, the
separated patterns of the same class must lie in the same half-plane. To separate
all pairs, at most as many border lines as there are border pairs are necessary. If two
or more border pairs are separated with the same line, we have successfully reduced
the number of necessary border lines.
The idea of clustering with the border pairs method formed during the study
of a learned MLP, when we observed the behavior of the individual neurons in the
inner layer and came to the following conclusions:
The output value of the neurons is always close to the value 0 or 1, in
spite of the fact that they have a continuous transfer function - the neurons
operate in saturation.
The value at the output of the first layer does not change as long as
we stay within the same area, which means that the input space is divided
into several homogeneous areas - clusters.
In the second and every subsequent layer, a logic operation is
performed on the data of the previous layer.
To each area of the input space belongs a binary code, which has as many
bits as there are border lines and neurons in the first layer. Each individual bit tells
us on which side of the corresponding border line we are. The area codes
obtained in this way are features.
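A tiny sketch of the area code: one bit per border line, given as the sign of w·x + b. The two lines and the test points below are hypothetical values chosen only for illustration.

```python
import numpy as np

def area_code(x, lines):
    # one bit per border line: which side of w·x + b = 0 the point x lies on;
    # patterns sharing a code lie in the same area (cluster)
    return tuple(int(w @ x + b > 0) for w, b in lines)

lines = [(np.array([1.0, 0.0]), -0.5),    # vertical line  x = 0.5
         (np.array([0.0, 1.0]), -0.5)]    # horizontal line y = 0.5
print(area_code(np.array([0.2, 0.8]), lines))   # (0, 1)
print(area_code(np.array([0.9, 0.9]), lines))   # (1, 1)
```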
Figure 7. Homogeneity of areas. The area on the right side of the dotted line is not
homogeneous, because it contains a pattern from the opposite class (white circle).
The described clustering and feature determination are based solely on border
pairs, i.e. on the Euclidean distance; for this reason everything can be generalized
to an input space with an arbitrary number of dimensions.
Let us look at the impact of noise on the position of a border line. If we add a
little noise to the individual learning data, the learning patterns move slightly, each
in a different direction (Figure 9). Thus the noise of some patterns tends to
move the border line to the left (patterns 1, 3 and 7), and the noise of other
patterns (patterns 2, 4, 5, 6 and 8) to the right.
Figure 8. Learning dataset with border lines and areas. 1, 2, 3, ... 18: learning patterns. A,
B, C, ... G: areas. a, b, c: border lines. Arrows indicate the border pairs.
Figure 9. Noisy data. Dashed line - pattern without noise. Solid line - pattern with noise.
Table 8. Area codes of the input space.
Area    Border lines
        a    b    c
A       1    1    1
B       0    1    1
C       1    1    0
D       1    0    1
E       0    1    0
F       1    0    0
G       0    0    1
Since their effects on the border line are contradictory, they partially cancel out;
therefore the position of a border line that is surrounded by numerous border
pairs hardly changes due to noise. This holds as long as the noise is
small enough that the learning patterns do not cross the border line.
6.2. Complication in Clustering With Border Pairs
Let us see whether, by separating all the border pairs, we obtain only
homogeneous areas. The three border lines in Figure 10 successfully separate all the
border pairs, the pair (12, 15) even twice. The input space is thus divided into five
areas, if we count only those that contain at least one learning pattern. All five
areas are homogeneous (one white and four black), so the clustering succeeded.
This simple algorithm was used for clustering:
Algorithm 1: A simple algorithm for clustering a set of training data
Step 1: Find the border pairs.
Step 2: Separate all border pairs with as few border lines as possible.
In the case of more complex learning patterns, after the separation of all the
border pairs a few learning patterns sometimes remain on the wrong side of a
line, and some areas remain heterogeneous (Figure 11). Algorithm 1 finds
only the border pairs on the left and right sides of the square (the white patterns),
which are separated by the two vertical lines. The area in the middle remains
heterogeneous, as besides the white patterns it contains two black patterns (39
and 40).
Figure 10. Successful clustering of triangular learning data.
Figure 11. Deficient clustering. The central area with white patterns is not homogeneous;
the disturbing elements are the black patterns number 39 and 40. The pair (6, 40) was not
detected because of the interference of the black patterns numbered 18, 19, 36 and 37 in
the adjacent areas.
The question arises as to why these two patterns do not participate in any
of the border pairs. If we observe only the patterns in the central area separately,
we find that patterns 39 and 40 do participate in border pairs; when searching for
them, the learning patterns in the neighboring areas distracted us. The formation of
the upper border pair was hampered by the black patterns numbered 18, 19, 36
and 37.
Figure 12. Successful clustering of learning patterns. Shown are nine homogeneous areas,
one positive and eight negative.
The same applies below. This realization makes it possible to improve the
algorithm in such a way that it finds all the border pairs among the learning
patterns. If a heterogeneous area remains after the separation of the border pairs,
it is dealt with once again, but this time separately from the other areas. With this
upgrade of the algorithm we find, in every heterogeneous area, the additional border
pairs that were overlooked by Algorithm 1 (Figure 12). The search for border
pairs is completed only when all areas are homogeneous. The improved search
for border pairs runs as follows:
Algorithm 2: Improved algorithm for clustering learning data sets
Step 1: Find the border pairs.
Step 2: Separate the border pairs with border lines.
Step 3: Check the homogeneity of the resulting areas.
Step 4: If you find a heterogeneous area, search in it for additional border
pairs and continue with Step 2.
Table 9. Number of border pairs found in different data sets.
Data set          Border pairs (algorithm 1)    Border pairs (algorithm 2)    Border lines (algorithm 1)    Border lines (algorithm 2)
Vine 1 14 18 1 2
Vine 2 25 27 2 4
Vine 3 11 11 1 1
Iris Setosa 2 2 1 1
Iris Versicolor 25 27 2 4
Iris Virginica 12 15 3 5
Digit 0 4 5 2 3
Digit 1 12 16 2 5
Digit 2 5 6 1 2
Digit 3 5 11 1 4
Digit 4 8 11 2 5
Digit 5 8 9 2 3
Digit 6 5 6 1 1
Digit 7 8 10 2 3
Digit 8 6 12 2 6
Digit 9 9 10 2 3
The improved algorithm for clustering data with border pairs functioned
correctly on all test datasets. The clustering results for sixteen sets of learning data
with both algorithms are shown in Table 9, where it is clear
that the improved Algorithm 2 usually finds a few more border pairs than the
simple Algorithm 1.
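The sketch below is our own simplified rendering of Algorithm 2, under two assumptions that are not part of the method as described: each border pair gets its own line (the combining of pairs from Section 6.3 is omitted), and that line is simply the perpendicular bisector of the pair. It also assumes no two identical patterns carry different labels.

```python
import numpy as np

def find_border_pairs(X, y):
    # same border-pair test as in the sketch of Section 4.1
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n = len(X)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if y[i] != y[j] and all(k in (i, j) or D[i, k] >= D[i, j]
                                    or D[j, k] >= D[i, j] for k in range(n))]

def bisector(p, q):
    # one simple way to separate a single border pair: the perpendicular
    # bisector of the segment pq, written as w·x + b = 0
    w = q - p
    b = -w @ (p + q) / 2.0
    return w, b

def cluster(X, y):
    # Algorithm 2 (sketch): separate the border pairs, then re-examine every
    # heterogeneous area separately until all areas are homogeneous.
    X = X.astype(float)
    lines = []
    areas = [np.arange(len(X))]
    while areas:
        idx = areas.pop()
        pairs = find_border_pairs(X[idx], y[idx])
        lines += [bisector(X[idx[i]], X[idx[j]]) for i, j in pairs]
        groups = {}
        for k in range(len(X)):
            code = tuple(int(w @ X[k] + b > 0) for w, b in lines)
            groups.setdefault(code, []).append(k)
        areas = [np.array(g) for g in groups.values() if len(set(y[g])) > 1]
    return lines

# XOR again: this uncombined sketch returns one line per border pair (four),
# whereas combining pairs (Algorithm 3) reduces the count to two.
Xx = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(len(cluster(Xx, np.array([0, 1, 1, 0]))))   # 4
```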
6.3. Combining of Border Pairs
We have already mentioned that the same border line can separate two or
more border pairs. We also know that the number of border lines
corresponds to the number of neurons in the first layer of the MLP. From these facts
it follows that each line should separate as many pairs as possible if we
want to get a small MLP (a near-minimal construction). Which border pairs can be
separated by the same line is a difficult question that needs to be researched further.
Algorithm 3: Algorithm for combining border pairs
Step 1: Search for the border pairs.
Step 2: Take a new neuron and train it with the data of the next border
pair.
Step 3: Add a new (nearby) border pair and train the same neuron. If
learning succeeds, retain the new pair and mark it as used, otherwise
discard it.
Step 4: Repeat Step 3 up to the last border pair.
Step 5: Repeat from Step 2 until all border pairs are used.
Step 6: Check all areas of the input space. If an area is heterogeneous,
search it for additional border pairs and continue with Step 2.
Algorithm 3 tries to combine each border pair with all the other border pairs
and is therefore time consuming. The result also depends on the order of the
combining attempts.
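A greedy sketch of the combining step follows. It is our own simplification: a fixed-budget run of the perceptron rule serves as the separability test for "learning succeeds", the pairs are processed in the order given (so the result is order dependent, as noted above), and contradictory duplicate patterns are assumed absent. Step 6 of Algorithm 3 is not included.

```python
import numpy as np

def separable(P, t, epochs=500, eta=0.2):
    # crude linear-separability test: run the perceptron rule for a fixed
    # epoch budget and see whether the error reaches zero
    w, b = np.zeros(P.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(P, t):
            o = 1 if w @ x + b > 0 else 0
            w += eta * (target - o) * x
            b += eta * (target - o)
            errors += int(o != target)
        if errors == 0:
            return True, (w, b)
    return False, None

def combine_pairs(X, y, pairs):
    # Greedily assign as many border pairs as possible to each neuron, keeping
    # a pair only if the neuron can still separate all its assigned patterns.
    X = X.astype(float)
    neurons, unused = [], list(pairs)
    while unused:
        assigned = [unused.pop(0)]
        for pair in unused[:]:
            members = [k for p in assigned + [pair] for k in p]
            ok, _ = separable(X[members], y[members])
            if ok:
                assigned.append(pair)
                unused.remove(pair)
        members = [k for p in assigned for k in p]
        neurons.append(separable(X[members], y[members])[1])
    return neurons   # one (w, b) per neuron of the first layer

# XOR example: the four border pairs combine into two neurons (two lines)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])
print(len(combine_pairs(X, y, [(0, 1), (0, 2), (1, 3), (2, 3)])))   # 2
```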
7. CLASSIFICATION OF DATA WITH THE BORDER
PAIRS METHOD
7.1. Description of the Data Classification with Border
Pairs Method
The binary values at the perceptron output, which we obtained by clustering the data
with the border pairs method, can be used to classify the data into classes, and this is
feasible in several ways. One of them is Boolean algebra (the use of logic functions),
since the data coming from the clustering are binary. In this research we
continue with the use of additional perceptrons: to the perceptron which clusters the
input data we cascade further new perceptrons. The added neural network is
called the multi-layer perceptron remainder (MLP remainder). We have
researched two possibilities for the construction of the MLP remainder:
1. The layers in the remainder of the MLP are formed in exactly the same
way as the first layer was created. When only one neuron remains in the next
layer, the construction of the MLP is concluded.
2. All subsequent layers are treated as an additional MLP that is trained with
one of the established gradient methods (a sketch of this option follows below). It
turns out that the only potential bottleneck is in the first layer, so learning of the
additional MLP runs quickly and reliably. In all cases the learning error declines
rapidly and monotonically, so it seems that no residual-learning-error function of
the additional MLP contains a local minimum.
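A minimal sketch of option 2, assuming the first layer is already available as a list of (w, b) border lines; the off-the-shelf scikit-learn classifier, its layer size and solver are our own illustrative stand-ins for "one of the established gradient methods", not the chapter's implementation.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def first_layer_codes(X, neurons):
    # saturated outputs of the first layer: one bit per neuron (border line)
    return np.array([[int(w @ x + b > 0) for w, b in neurons] for x in X])

def train_remainder(codes, y):
    # treat everything after the first layer as a small MLP and train it with
    # a standard gradient-based learner on the binary codes
    remainder = MLPClassifier(hidden_layer_sizes=(4,), solver='lbfgs',
                              max_iter=1000)
    return remainder.fit(codes, y)
```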
7.2. Examples of Learning with the Border Pairs Method
Both of the just-described classification approaches were tested with a number of
valid, real and synthetic learning data sets. Linearly separable sets of learning
patterns are classified already in the first layer (a plain perceptron), so
determining the following layers is not necessary for them. That is why we
started the research with the non-linearly-separable XOR learning set.
7.2.1. XOR
A feature of the XOR set is that it contains only four learning patterns, which
are only two-dimensional, yet it causes problems for numerous learning algorithms.
The reason is the local minima of the XOR error function, into which the
gradient methods often drift and usually get stuck. In this case learning
stops while the residual learning error is still very large, or too large.
Let us look at the course of learning the XOR function with the Border Pairs
Method. First, we find the border pairs. From Figure 13 it is evident that between the
pairs of points A and B, A and C, B and D, and C and D there are no intermediate
points, so we have four border pairs: AB, AC, BD and CD. After finding the
border pairs, combining them comes next. The pairs AB and AC can be combined, as
they can be separated by the same line, i.e. one neuron. Because one line also
suffices for the remaining pairs, we have succeeded in separating all the border pairs
with only two lines, which means that the first layer of the MLP contains only two
neurons. The search, separation and combining of the border pairs are thus done
easily, quickly and without going astray. At the output of the first layer we
obtain:
Figure 13. Input space of XOR function. A, B, C, D: learning patterns, a, b: border lines.
Table 10. Inner values.
Learning pattern    X    Y    Xn    Yn
A                   0    0    0     0
B                   0    1    0     1
C                   1    0    1     0
D                   1    1    0     0
Because the learning data is two-dimensional, the situation in the inner
layer can be drawn. From Figure 14 it is evident that the data at the output of the
inner layer are linearly separable; consequently, one additional neuron in the second
layer, which is simultaneously the last, is sufficient for the final result. We can
also come to the same conclusion if we build the second layer in the same manner
as the first. The transformed learning patterns, which are obtained at the output of the
first layer and are given in Table 11, are passed on to the next layer of the MLP.
The first and fourth learning patterns (A and D) are both mapped to the value
An = (0, 0). Thus, only three learning patterns remain for the second layer of the MLP,
and they form two border pairs: (An, Bn) and (An, Cn). Because they can
be separated by the same straight line, we once again find that a single neuron is
sufficient in the second layer.
Figure 14. Feature space of the XOR function. Points A, B and C remain in their positions;
point D is moved to point A.
Table 11. Learning data of the inner and output MLP layers.
Learning pattern    Xn (inner)    Yn (inner)    XOR (output)
An = A = D          0             0             0
Bn = B              0             1             1
Cn = C              1             0             1
7.2.2. Triangle
In the case of the triangular learning dataset we determine which two-dimensional
points lie inside a triangle. The dataset of learning patterns used is similar to
those in Figures 8 and 10; the difference is only in the number of learning
patterns. This time we used many more learning patterns (200), which are no
longer evenly distributed, since their positions are random. Around a quarter of these
patterns are positive, i.e. lie inside the triangle. Due to the random positions of the
samples, the learning process was repeated ten times and the mean result and
standard deviation were calculated. The results were compared with the results
obtained by the backpropagation method. Table 12 shows that the border pairs
method gives substantially better results. In fact, the advantage of the border
pairs method is even greater than the table suggests, because with this method we found an MLP
structure which is nearly optimal and then used it for learning with
the control backpropagation method, which thereby became more successful than it
otherwise would be. As a matter of interest we would also like to mention another
finding: with the BPM the triangle was not always bounded by three straight lines,
sometimes there were more of them. This phenomenon is due to the random positions
of the learning patterns and the primitive algorithm for combining border pairs.
Table 12. Comparative results of the triangle learning.
Test number            RMSE (BPM)    RMSE (BP)
1                      0.036         0.200
2                      0.075         0.128
3                      0.008         0.417
4                      0.016         0.072
5                      0.024         0.240
6                      0.048         0.051
7                      0.048         0.339
8                      0.115         0.026
9                      0.008         0.155
10                     0.016         0.042
Average                0.039         0.167
Standard deviation     0.034         0.133
7.2.3. Recognition Of Irises
For the first set of real data to test the BPM we chose the Irises, as
it is one of the most popular and oldest data sets [5]. It contains data on three types of
irises - Iris Setosa, Iris Virginica and Iris Versicolor - each having 50 instances.
For each flower four parameters are given: the length and width of the petal and
the length and width of the sepal. Because of the partial overlap of the clusters, some
researchers in the field of cluster analysis cite the irises as a difficult data set.
The overlap mostly concerns the species Iris Versicolor and Iris
Virginica [5].
Because in this research we were interested in how successfully the BPM
separates patterns that partially overlap, we used the
whole data set for learning and then tested with the same data. We used the
approach "one against all" and began by identifying the type Iris Setosa. In
doing so, it was shown that the complete set of training data contains only
two border pairs and that only four of the 150 learning patterns were
sufficient for successful learning. Due to the favorable disposition of the border
pairs, a single border hyper-plane suffices for their separation.
The classification of the remaining two types of irises (Iris Virginica and Iris
Versicolor) was conducted in the same way. We got only slightly more border
pairs and hyper-planes than in the case of Setosa. In all three cases, the BPM
succeeded in separating all the learning data correctly. Data on the remaining
learning errors (RMSE) are given in Table 13.
Table 13. MLPs suitable for learning the "Irises" dataset.
Given class    Border pairs    Neurons in 1st layer    RMSE (BPM)    RMSE (BP)    RMSE (SVM)    RMSE (DT)
Setosa         2               1                       0.0000        0.0071       0.0000        0.0000
Versicolor     14              5                       0.0000        0.1114       0.0816        0.1323
Virginica      12              4                       0.0000        0.1129       0.1155        0.1325
Figure 15. Clustering of irises. White circles represent Iris Setosa, black ones Iris
Virginica or Iris Versicolor.
For comparison we also learned with other methods (backpropagation BP,
support vector machine SVM and decision tree DT), which all turn out to be
inferior, because they have a greater remaining RMSE.
Due to the relatively small number of learning patterns it is difficult to determine
on this dataset whether the MLP is overfitted. When we randomly halved
the data and used one half for learning and the other half for testing, it
turned out that with all methods we got just one or two incorrectly classified irises,
which means good generalization. This did not surprise us, because an MLP with a
small number of neurons usually generalizes well.
Figure 15 shows the separation of Iris Setosa from the other types of irises. In
the figure we transformed the four-dimensional data into two dimensions: on the
X axis we add up the width and length of the petal, and on the Y axis the width
and length of the sepal. Despite this primitive reduction of the dimensionality, the
separation of Iris Setosa from the others is still visible in the figure, as the black and
white circles are not mixed together.
7.2.4. Pen-Based Recognition Of Handwritten Digits
Pen-Based Recognition of Handwritten Digits is the name of a
comprehensive, real, validated and verified set of learning data [6]. It contains digits
written by 44 people with an electronic pen at a resolution of five hundred
by five hundred pixels. The data set is preprocessed (the digits are scaled to equal
size and centered in the middle of the frame). An individual digit is described by
seventeen attributes. The first sixteen numbers represent the x and y coordinates of
eight points; these are integers between zero and one hundred, recorded with the
electronic pen at intervals of one hundred milliseconds. The seventeenth number
represents the class to which the digit belongs.
Table 14. Examples of handwritten decimal digits.
Pen coordinates (x, y) for points 1-8                                        Digit
0 39 2 62 11 5 63 0 100 43 89 99 36 100 0 57 0
0 57 31 68 72 90 100 100 76 75 50 51 28 25 16 0 1
0 89 27 100 42 75 29 45 15 15 37 0 69 2 100 6 2
35 76 57 100 100 92 68 66 81 38 82 9 32 0 0 17 3
0 100 7 92 5 68 19 45 86 34 100 45 74 23 67 0 4
13 89 12 50 72 38 56 0 4 17 0 61 32 94 100 100 5
99 100 88 99 49 74 17 47 0 16 37 0 73 16 20 20 6
0 85 38 100 81 88 87 50 84 12 58 0 53 22 100 24 7
47 100 27 81 57 37 26 0 0 23 56 53 100 90 40 98 8
74 87 31 100 0 69 62 64 100 79 100 38 84 0 18 1 9
Examples of learning patterns for all ten digits are shown in Table 14 and
Figure 16.
Figure 16. Graphical representation of the digits in Table 14.
Classification was first done using the BPM, and then with three
control methods - backpropagation (BP), support vector machine (SVM) and
decision tree (DT). For learning we used two hundred samples, for validation
3498 new patterns. The comparative learning results are given in Table 15.
The reason for using a small number of learning patterns is the non-optimized
source code of the learning program, which is written in an interpreted language and
is therefore very slow. For this reason we have not performed any speed
measurements or comparisons. Despite the small set of learning data, it is evident
from Table 15 that the selected learning patterns are representative and that the
learning succeeded.
The percentage of misrecognized digits is similar to that of the support vector
machine (SVM) and about one percentage point better than that of the decision tree.
The backpropagation method fared better this time. This is probably due to
over-training with the BPM, as it identified all the learning patterns correctly. As
a point of interest we would like to add that even a human does not correctly recognize
all the digits in this dataset.
Table 15. A comparison of learning results on recognition of handwritten digits.
Digit      Share of misclassified patterns [%]
           BPM     BP      SVM     DT
0 2.5 2.2 2.0 1.7
1 6.8 7.1 7.7 9.6
2 4.2 4.4 4.9 4.5
3 5.2 1.8 1.5 5.2
4 1.7 0.7 1.3 2.6
5 4.8 5.5 8.3 6.0
6 0.8 0.4 0.3 2.8
7 2.2 1.2 3.9 7.1
8 3.4 1.7 4.8 5.5
9 4.7 3.2 2.9 5.2
Average 3.63 2.82 3.76 5.02
7.2.5. Ionosphere
Ionosphere is a classification dataset obtained using aviation radar [45].
Table 16. Learning results from the ionosphere data set.
Dataset part    Number of misclassified patterns
                BPM     BP      SVM     DT
1 65 50 81 106
2 50 47 59 33
3 71 110 72 82
4 62 76 64 50
5 55 48 66 53
6 111 104 126 69
7 126 126 126 126
Average 77.14 80.14 84.86 74.14
The dataset contains 351 patterns without missing values, each composed of
34 attributes and a class, which can be positive or negative. The complete set of
patterns was divided into seven parts of 50 or 51 patterns. We learned seven
times, each time with a different part of the dataset, but we always tested with the
whole dataset. The classification results are shown in Table 16.
7.3. Noisy Data
In this research we determined how resistant the border pairs method is to
noise in the learning data. The precise amount of noise in the learning data can be
known only when we use an artificial set of learning data to which we add the noise
ourselves. During the research we increased the noise and measured the percentage of
incorrectly identified test patterns; in doing so, we did not use early stopping of
learning. Once again we used a two-dimensional set of learning patterns - the
image of a square. The positions of the patterns are random and evenly distributed.
There are 500 patterns, half for learning and the other half for the evaluation of the
learning results. The first learning run was carried out without added noise, which was
then increased (1%, 2%, 5% and 10%). Each learning run was repeated 10 times
and the averages and standard deviations were calculated.
The obtained results were compared with the results of the
backpropagation method.
Figure 17. Learning patterns that contain noise. Three circles have left the square and a
cross has entered it.
Table 17. RMSE error at different levels of noise.

Level of noise (%)                          0       1       2       5       10
Backpropagation      Average RMSE        0.0692  0.1257  0.1094  0.1333  0.1519
                     Standard deviation  0.0290  0.1136  0.0953  0.1146  0.0782
Border pairs method  Average RMSE        0.0399  0.0308  0.0497  0.0302  0.1046
                     Standard deviation  0.0304  0.0250  0.0491  0.0235  0.0280
When evaluating the data in Table 17 it is also necessary to take into consideration that the BPM method itself finds the optimal structure of the MLP, while the BP method cannot. Because, for the sake of comparability, we used the same MLP structure for both methods, BP gains a slight, unearned advantage; despite that, its results are still worse than those of BPM.
8. DYNAMIC LEARNING WITH BORDER PAIRS METHOD
Machine learning is especially expedient when the conditions are dynamic [23] and new learning data can be added to the intelligent system during operation. Two benefits are thus achieved:
Increased set of learning data. Data gained during operation is added to the initial learning data. This is especially helpful when the initial set was small.
Adapting to new circumstances. Often in machine learning we know only part of the input vector. Among the unknown factors are some that change very slowly, so newer learning patterns contribute more to the quality of learning than older ones. Some authors call this obsolescence of data or concept drift. In such circumstances it is reasonable to add new patterns and also to remove old ones.
8.1. Dynamic Learning Approaches with Border Pairs Method
Different machine learning approaches are adapted to different degrees to the addition and removal of learning patterns. In the most "rigid" approaches, a single added or removed pattern makes it necessary to repeat the whole learning. Such an approach is certainly not appropriate for dynamic intelligent systems. Unfortunately, the gradient learning methods, with backpropagation at the helm, belong to this group.
There are two strategies for learning in dynamic intelligent systems:
Incremental learning. When enough new learning data has accumulated, we interrupt the use of the intelligent system and learn additionally. The criteria for additional learning are set in advance, for example when a certain number of new learning patterns has accumulated, or when a larger change in the properties of the learning data is perceived.
Online learning. As soon as a new learning pattern appears, it is immediately used for further learning. The intelligent system is thus fully up to date, but the continuous flow of supplementary learning unfortunately slows down its operation.
When the dynamics of the system are high and there is enough time to learn, it makes sense to use online learning. If the dynamics of the system are low, or there is not enough time for continual further learning, incremental learning is the better choice.
The BP method has difficulties with dynamic learning. Additional learning is often unsuccessful because the new learning patterns increase the residual learning error to the extent that additional learning can no longer satisfactorily reduce it. The synapses in the neural network behave as if they were "woody" and their values hardly change. During additional learning the network as a rule gets stuck in a local minimum and the residual learning error does not even begin to decline. Sometimes one can escape from the local minimum by slightly and randomly perturbing the existing weight values. If this trick does not succeed, the network has to be learned again from the beginning. Let us see how appropriate BPM is for dynamic learning.
8.2. Incremental Learning with the Border Pairs Method
In incremental learning we first learn from the data that were available before the learning began. This is followed by the use of the MLP, during which new data arise and are evaluated each time. If, over time, the errors on the new data do not increase, only the new patterns are added to the existing set of learning data.
Figure 18. Different positions of learning patterns during additional online learning. The full circle represents a new pattern; the dashed line is the former border line. a) There is no need to move the border line. b) It is necessary to move the border line. c) The shift is not sufficient; an additional border line is needed.
Otherwise, when the error increases, we speak of a concept change, and it is then also reasonable to eliminate old data. The addition of new and the possible removal of old learning data is followed by additional learning, in which we only look for additional border pairs and possibly additional border lines (planes). Thus only that part of the first MLP layer is changed which corresponds to an area that has become heterogeneous. In non-constructive learning methods we do not know which pattern influences which part of the neural network, so for additional learning we have to use the entire set of learning patterns and relearn the whole MLP.
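Sketched below is one way such an incremental step could look, assuming a batch trigger and a simple error-based drift check; the bpm_fit routine and the nearest-neighbour stand-in are hypothetical placeholders, not the method's actual implementation.

# Sketch of incremental learning with a drift check (illustrative only)
from collections import deque

class IncrementalBPM:
    """Assumes a hypothetical bpm_fit(X, y) that performs one (additional) BPM
    learning run and returns a model with a predict(x) method."""

    def __init__(self, bpm_fit, X0, y0, batch_size=50, drift_threshold=0.1, keep_old=1000):
        self.bpm_fit = bpm_fit
        self.X, self.y = list(X0), list(y0)
        self.model = bpm_fit(self.X, self.y)        # learn from the initially available data
        self.buffer = deque()
        self.batch_size = batch_size
        self.drift_threshold = drift_threshold
        self.keep_old = keep_old

    def observe(self, x, t):
        """Collect a new labelled pattern; learn additionally once a batch has accumulated."""
        self.buffer.append((x, t))
        if len(self.buffer) >= self.batch_size:
            self._additional_learning()

    def _additional_learning(self):
        # error of the current model on the newly gathered data
        err = sum(self.model.predict(x) != t for x, t in self.buffer) / len(self.buffer)
        if err > self.drift_threshold:
            # concept change: also remove the oldest part of the old learning data
            self.X, self.y = self.X[-self.keep_old:], self.y[-self.keep_old:]
        self.X += [x for x, _ in self.buffer]
        self.y += [t for _, t in self.buffer]
        self.model = self.bpm_fit(self.X, self.y)   # additional learning
        self.buffer.clear()

# Toy stand-in for bpm_fit: a 1-nearest-neighbour "model" (illustration only)
class _NN:
    def __init__(self, X, y): self.X, self.y = list(X), list(y)
    def predict(self, x):
        d = [sum((a - b) ** 2 for a, b in zip(p, x)) for p in self.X]
        return self.y[d.index(min(d))]

learner = IncrementalBPM(_NN, [(0.0, 0.0), (1.0, 1.0)], [0, 1], batch_size=2)
learner.observe((0.1, 0.2), 0)
learner.observe((0.9, 0.8), 1)   # batch full: evaluate, (possibly) prune, relearn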
8.3. Online Learning with the Border Pairs Method
The principle of online learning with the border pairs method is shown in Figure 18. During use, the MLP classifies unknown samples into classes and simultaneously ascertains how successful the classification was. When it is successful (Figure 18a), no further action is required: the border line is still in the right position, and the construction of the MLP and the values of its weights remain unchanged. Otherwise, when the classification is incorrect, additional learning is necessary, which can be done in several different ways:
A) With or without reconstruction of the MLP.
B) With or without forgetting (unlearning).
ad A) Figure 18a shows an arrangement of learning patterns that does not need any additional learning, because the new learning pattern (full circle) is correctly classified (located on the right side of the border line). When the classification is incorrect, we first determine which border line is closest to the new learning pattern. The neuron in the first layer that corresponds to this line is additionally learned, which moves the corresponding line. To move the line we use only those border pairs that were already used to determine its location, plus the new learning pattern. By moving the border line we try to get the new pattern onto the right side of the line, or at least close to it. When reconstruction of the MLP (adding neurons) is not allowed, this is the only possible measure. In Figure 18b such a shift is possible, but in Figure 18c moving the border line is no longer sufficient. In the case of Figure 18c we have to decide:
whether to add a new border line (neuron), or
whether to come to terms with the wrong classification.
When the learning patterns contain noise, it is usually advisable to accept the small misclassification. This measure reduces excessive learning and improves the generalization of learning. When the error is larger, or a wrongly classified pattern lies far beyond the border line, it is better to move the border line. If this is not feasible, the border line needs to be replaced with two (breaking the border line). Such a measure adds a new neuron to the first layer of the MLP.
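A minimal sketch of this move-or-split decision in two dimensions follows, assuming the border line is a single perceptron-style neuron refitted only on its old border-pair participants plus the new pattern; the coordinates are invented for illustration and the refit is not the author's exact procedure.

# Sketch: try to move the border line; if the shift is insufficient, add a new line
import numpy as np

def refit_line(points_0, points_1, epochs=200, lr=0.1):
    """Perceptron refit of one border line on the given class-0 and class-1 points."""
    X = np.vstack([points_0, points_1])
    y = np.array([0] * len(points_0) + [1] * len(points_1))
    w, b = np.zeros(2), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            out = 1 if x @ w + b > 0 else 0
            w += lr * (t - out) * x
            b += lr * (t - out)
    return w, b

# Border-pair participants of the nearest line, plus the new pattern (class 1)
old_0 = np.array([[0.2, 0.5], [0.3, 0.4]])
old_1 = np.array([[0.6, 0.5], [0.7, 0.4]])
new = np.array([0.45, 0.9])

w, b = refit_line(old_0, np.vstack([old_1, new]))
if new @ w + b > 0:                         # shift was sufficient (Figure 18b)
    print("border line moved, new pattern is on the right side")
else:                                       # shift not sufficient (Figure 18c)
    print("add a new border line (one more first-layer neuron)")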
ad B) Everything written under point A applies to learning without forgetting (unlearning). The oldest and the newest learning patterns influence such learning equally. So far we have used only local data, without a time dimension. In this case we cannot speak of out-of-date data, and unlearning therefore makes no sense. When the data contain a time dimension, forgetting can be a welcome feature. As an example we mention the conditions in predicting stock exchange rates for securities. Forecasting stock exchange rates is based on past experience. The difficulty in stock exchange prediction is that we know only part of the factors affecting the stock quotes. All the unknown factors can be combined into one, which we call the "spirit of the time". Brokers have coined terms such as boom, recession, stagnation and the like for this purpose. Old learning patterns were obtained under different economic conditions, in a different spirit of the time than the new ones. Older learning patterns can therefore be considered (partially) out of date and can be given less weight in learning or even completely eliminated. Figure 19a shows the conditions without forgetting. Both border pairs remain in their positions. The new pattern has to remain on the wrong side of the border line, because the position of the line can no longer be improved. If unlearning is allowed, we can loosen the old border pairs by increasing the distance between the two participants of a border pair. By doing this we gain some extra leeway for moving the border line. Thus we have made it possible for the new pattern in Figure 19b to be positioned on the right side of the border line.
Figure 19. Learning with forgetting (unlearning). a) Circumstances before forgetting, where the border line cannot be moved. b) Circumstances after forgetting (the upper border pair is expanded), where the border line can be moved.
Let us now look at the algorithm for online learning with the border pairs method:
Algorithm 4. Online learning with the border pairs method
Step 1: Take the new learning pattern and classify it.
Step 2: If the classification succeeded, go to Step 6.
Step 3: Find all the patterns in this cluster.
Step 4: Find all border pairs in this cluster.
Step 5: Use the found border pairs for learning the neurons in the first MLP layer.
Step 6: Continue with Step 1 until the last online learning pattern is processed.
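A minimal sketch of Algorithm 4 is given below. It assumes two-dimensional data, a single cluster and one saturated first-layer neuron (a perceptron), and it approximates border pairs as mutual nearest opposite-class neighbours; it is an illustration of the loop, not the author's implementation.

# Sketch of Algorithm 4 (online loop, illustrative assumptions as stated above)
import numpy as np

def border_pairs(X, y):
    """Index pairs (i, j) of opposite-class patterns that are mutual
    nearest opposite-class neighbours (a simple border-pair proxy)."""
    idx0, idx1 = np.where(y == 0)[0], np.where(y == 1)[0]
    pairs = []
    for i in idx0:
        j = idx1[np.argmin(np.linalg.norm(X[idx1] - X[i], axis=1))]
        if idx0[np.argmin(np.linalg.norm(X[idx0] - X[j], axis=1))] == i:
            pairs.append((i, j))
    return pairs

def fit_neuron(X, y, pairs, epochs=200, lr=0.1):
    """Fit one first-layer neuron using only patterns that participate in border pairs."""
    idx = sorted({k for p in pairs for k in p})
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x, t in zip(X[idx], y[idx]):
            out = 1.0 if x @ w + b > 0 else 0.0
            w += lr * (t - out) * x
            b += lr * (t - out)
    return w, b

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (20, 2))
y = (X[:, 0] > 0.5).astype(float)                    # toy, linearly separable task
w, b = fit_neuron(X, y, border_pairs(X, y))          # offline start

for _ in range(100):                                 # Step 1: take a new pattern and classify it
    x_new = rng.uniform(0, 1, 2)
    t_new = float(x_new[0] > 0.5)
    ok = (1.0 if x_new @ w + b > 0 else 0.0) == t_new
    X, y = np.vstack([X, x_new]), np.append(y, t_new)
    if not ok:                                       # Step 2 failed
        # Steps 3-5: re-find the border pairs (here: over all patterns) and relearn the neuron
        w, b = fit_neuron(X, y, border_pairs(X, y))
print("final error:", np.mean((X @ w + b > 0) != y))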
8.3.1. Online Recognition of Digits
For testing the online algorithm (Algorithm 4) and evaluating its results, we again used the dataset of handwritten digits that we became acquainted with in the previous section. We took an MLP that had previously been learned with the offline method and learned it further with the online method. After the learning concluded, we could thus compare the online and offline results with each other. For the additional online learning we used one hundred new learning patterns that had not yet participated in the offline learning or testing. Because the BPM method is constructive, the MLP structure can change during online learning. If necessary, new neurons are added in the various layers of the MLP; normally only the first layer grows. The procedure for testing the quality of learning remained the same as in the offline method, for which we again used the same 3498 test patterns. The results obtained with additional online learning are shown in Table 18. It was expected that the additional learning would improve the results, which in most cases also happened, but in some individual cases, such as digit 4, the result is somewhat worse. Overall, the number of incorrectly identified digits decreased after the additional learning. The deterioration for some individual digits is attributed to noise in the additional learning patterns.
Table 18. Results of online learning.
Misclassified (%)
Digit Before After
0 2.5 2.5
1 6.8 6.0
2 4.2 3.3
3 5.2 4.2
4 1.7 2.1
5 4.8 4.8
6 0.8 1.1
7 2.2 4.4
8 3.4 3.0
9 4.7 3.9
Average 3.63 3.50
9. CONCLUSION
With the results of online learning, the description of BPM learning concludes. We note that the work has been successful and that some interesting results were found. Here is a summary of the most important ones:
Evaluation of data complexity. From the number of border pairs and the share of learning patterns participating in border pairs, we can conclude how difficult the learning will be. Where only a few patterns participate in border pairs, the learning is not a problem. A large number of border pairs tells us that the learning is demanding or that the data contain a lot of noise.
Noise reduction. Using the border pairs method it is possible to successfully reduce noise in the data. This is done by finding the participants of border pairs and moving them towards the nearest non-participating patterns of the same class. In this way we reduce noise only in the relevant patterns. We found that the generalization of learning from noisy data is better with the border pairs method than with the backpropagation method.
Clustering. With the border pairs method it is possible to cluster data. This is done by finding all border pairs and linearly separating them. If a heterogeneous area remains after the separation, it is split into subsections. This is repeated until all areas are homogeneous.
Search for features. With the border pairs method it is possible to find quality features. Each border line has two sides, labelled binary 0 and 1. The features are formed by recording, for each pattern, on which side of each border line it lies; a short sketch is given after this list. Since a cluster lies within a single area, the members of the same cluster also have the same features.
Reliability of learning. Learning with the border pairs method is reliable: it never got stuck and has always ended successfully. From the binary features formed in the first layer, the further layers construct a logical operation that calculates from the features which class the pattern belongs to.
Learning accuracy. Learning with the border pairs method is accurate. The
residual learning error was almost always less than that of the BP method.
Constructiveness. Learning with the border pairs method, unlike the gradient methods, is constructive. During learning we find a near-minimal construction, which is also the cause of the good generalization of the learning data.
Overfitting. The border pairs method has no problem with overfitting. The reason is that the linear border line (plane) is adapted to numerous learning data at the same time, so there is no risk of over-adjustment to individual data.
Selected learning patterns. Only the learning patterns that participate in border pairs, and thus carry information, take part in learning. While separating out the useful learning patterns, a picture of the complexity of the learning dataset can be formed even before the learning starts.
Gradual and modular learning. Learning with the border pairs method runs progressively from the input layer towards the output. The complex learning process is thereby divided into several smaller independent processes that can be solved separately (modularly).
Non-iterative learning. Learning with the border pairs method takes place non-iteratively. The free parameters are calculated separately for each neuron and in one step, non-iteratively.
Suitability for dynamic learning. The border pairs method is suitable for
dynamic learning (incremental and online), as it allows the addition and (gradual)
removal of data, without having to repeat the whole process of learning.
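As a small illustration of the binary features described under "Search for features" above, the following sketch codes each pattern by the side of every first-layer border line (hyperplane) on which it lies; the weights, biases and points are invented for illustration and are not taken from the experiments.

# Sketch: binary features as the first-layer code of each pattern
import numpy as np

def binary_features(X, W, b):
    """Binary code of each pattern: which side of every first-layer
    hyperplane (w.x + b > 0) the pattern lies on."""
    return (X @ W.T + b > 0).astype(int)

# Hypothetical first layer with two hyperplanes in two dimensions
W = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([-0.5, -0.5])
X = np.array([[0.2, 0.8], [0.7, 0.7], [0.9, 0.1]])
print(binary_features(X, W, b))   # e.g. [[0 1] [1 1] [1 0]]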
Findings on the features of the BPM method are summarized in Table 19.
A comparison of the results of the BPM method with the established methods is given in Table 20. Bold type denotes the approach with the lowest percentage of incorrectly identified test patterns, i.e. the best result.
From the summary table it is evident that, regarding accuracy, the BPM method achieved the best result once and the second-best result twice when compared with the three control methods. Overall, the BPM approach was the most successful of all. Due to the relatively small number of learning datasets used, this finding should be treated with caution.
Table 19. Overview of the BPM method features.
Noise reduction Yes
Validation of learning patterns Yes
Clustering data Yes
Searching for features Yes
Reliable learning Yes
Accurate learning Yes
Constructive learning Yes
Resistance to overfitting Yes
Elimination of barren patterns Yes
Modular learning Yes
Non-iterative learning Yes
Dynamic learning Yes
Table 20. Summary table of results.
Misclassified patterns
BPM BP SVM DT
Irises* (%) 0.00 1.33 1.33 2.00
Digits (%) 3.63 2.82 3.76 5.02
Ionosphere** 77.14 80.14 84.86 74.14
* Test data set was the same as the learning data set.
** Average number of misclassified patterns out of 351 (cf. Table 16), not a percentage.
The research results have inspired us to further research work, which offers a number of options. Here are some of them:
Optimizing the search for border pairs: In extensive datasets the search for border pairs can be a very time-consuming task, so it is reasonable to use an optimized algorithm.
Optimization of combining border pairs: When there are many border pairs in the data, there are also many different ways in which they can be combined with each other. In complex learning data this combining is the most time-consuming task in the classification process.
A regression version of the BPM method: The addressed BPM method only allows classification (binary data). In nature many quantities are continuous and also take values between 0% and 100%, for example in weather forecasting or temperature regulation. It is probably possible to remodel the discussed algorithm so that it becomes suitable for regression (continuous data).
A multi-class method: The discussed method has only one output neuron, which conveys whether the pattern belongs to a specific class. An MLP with multiple output neurons could decide among more than two classes.
Implementation of the BPM method in the "Weka" software and other related, validated tools for machine learning: An implementation of the BPM method would certainly facilitate the research work of many interested researchers and bring the method closer to them.
REFERENCES
[1] B. Ploj, Bipropagation - a new way of MLP learning,
Proceedings of the Eighteenth International Electrotechnical and Computer
Science Conference ERK 2009, Slovenian section IEEE, pp. 199-202, 2009
[2] Hebb, Donald Olding, The Organization of Behaviour: A
Neuropsychological Theory, 1949
[3] Yann LeCun, Corinna Cortes: yan.lecun.com/exdb/mnist, Handwritten digit
database
[4] Jihoon Yang, Rajesh Parekh, Vasant Honavar: DistAl: An inter-pattern
distance-based constructive learning algorithm, Intelligent Data Analysis,
Volume 3, Issue 1, May 1999, Pages 55–73
[5] Iris data set, http://en.wikipedia.org/wiki/Iris_flower_data_set, 12. 1. 2013
[6] Pen based handwritten digits data set, http://archive.ics.uci.edu/ml/
support/Pen-Based+Recognition+of+Handwritten+Digits, 12. 1. 2013
[7] Weka software. http://en.wikipedia.org/wiki/Weka_(machine_learning) , 18.
4. 2013
[8] Decision tree, http://en.wikipedia.org/wiki/Decision_tree, 12. 3. 2013
[9] Support vector machine, http://en.wikipedia.org/wiki/Support_vector_
machine, 12. 3. 2013
[10] Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine
Learning Tools and Techniques, Third Edition,The Morgan Kaufmann
Series in Data Management Systems, 2011
[11] Feed forward neural network, http://en.wikipedia.org/wiki/Feed-
forward_neural _networks, 12. 3. 2013
[12] Recurrent neural network, http://en.wikipedia.org/wiki/Recurrent_neural_
networks, 12. 3. 2013
[13] Perceptron, http://en.wikipedia.org/wiki/Perceptron, 12. 3. 2013
[14] David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams,
Learning representations by back-propagating errors, Nature, October 1986
[15] P.J.G. Lisboa, T.A. Etchells and D.C. Pountney, Minimal MLPs do not
model the XOR logic, School of Computing and Mathematical Sciences
[16] T. L. Andersen, T.R. Martinez, DMP3: A Dynamic Multilayer Perceptron
Construction Algorithm, Brigham Young University, Utah USA
[17] Iris data set, http://archive.ics.uci.edu/ml/datasets/Iris, 12. 3. 2013
[18] Wine data set, http://archive.ics.uci.edu/ml/datasets/Wine, 12. 3. 2013
[19] DistAl: An inter-pattern distance-based constructive learning algorithm,
Jihoon Yang, Rajesh Parekh, Vasant Honavar, Neural Networks
Proceedings, 1998. IEEE World Congress on Computational Intelligence.
The 1998 IEEE International Joint Conference, 4-9 May 1998, Volume: 3,
On Pages: 2208 - 2213 vol.3
[20] Geometrical synthesis of MLP neural networks, Rita Delogu, Alessandra
Fanni and Augusto Montisci, Neurocomputing,Volume 71, Issues 4–6,
January 2008, Pages 919–930,Neural Networks: Algorithms and
Applications, 4th International Symposium on Neural Networks
[21] Arunava Banerjee, Initializing Neural Networks using Decision Trees,
Computational learning theory and natural learning systems: Volume IV,
MIT Press Cambridge, 1997, ISBN:0-262-57118-8
[22] Cortes, Corinna; and Vapnik, Vladimir N.; "Support-Vector Networks",
Machine Learning, 20, 1995, http://www.springerlink.com/content/
k238jx04hm87j80g/, 12. 3. 2013
[23] Neapolitan, Richard; Jiang, Xia (2012). Contemporary Artificial
Intelligence. Chapman & Hall/CRC. ISBN 978-1-4398-4469-4.
[24] Mitchell, T.: Machine Learning, McGraw Hill, 1997, ISBN 0-07-042807-7,
p.2.
[25] Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar: Foundations of
Machine Learning, The MIT Press, 2012, ISBN 9780262018258.
[26] Ross, Brian H.; Kennedy, Patrick T.: Generalizing from the use of earlier
examples in problem solving, Journal of Experimental Psychology:
Learning, Memory, and Cognition, Vol 16(1), Jan 1990, pages 42-55.
[27] Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar Foundations of
Machine Learning, The MIT Press, 2012, ISBN 9780262018258.
[28] Vapnik, V. N. The Nature of Statistical Learning Theory (2nd Ed.),
Springer Verlag, 2000
[29] Oded Maimon and Lior Rokach: DATA MINING AND KNOWLEDGE
DISCOVERY HANDBOOK, Springer, 2010
[30] Hipp, J.; Güntzer, U.; Nakhaeizadeh, G.: "Algorithms for association rule
mining - a general survey and comparison". ACM SIGKDD Explorations
Newsletter 2: 58. doi:10.1145/360402.360421, 2000
[31] J. J. HOPFIELD Neural networks and physical systems with emergent
collective computational abilities. Proc. NatL Acad. Sci. USA Vol. 79, pp.
2554-2558, April 1982 Biophysics
[32] Fogel, L.J., Owens, A.J., Walsh, M.J. (1966), Artificial Intelligence through
Simulated Evolution, John Wiley
[33] Muggleton, S. (1994). "Inductive Logic Programming: Theory and
methods". The Journal of Logic Programming. 19-20: 629–679.
doi:10.1016/0743-1066(94)90035-3
[34] Cortes, Corinna; and Vapnik, Vladimir N.; "Support-Vector Networks",
Machine Learning, 20, 1995. http://www.springerlink.com/content/
k238jx04hm87j80g/
[35] Ben-Gal, Irad (2007). Bayesian Networks (PDF). In Ruggeri, Fabrizio;
Kennett, Ron S.; Faltin, Frederick W. "Encyclopedia of Statistics in Quality
and Reliability". Encyclopedia of Statistics in Quality and Reliability. John
Wiley & Sons. doi:10.1002/9780470061572.eqr089. ISBN 978-0-470-
01861-3.
[36] John Peter Jesan, Donald M. Lauro: Human Brain and Neural Network
behavior a comparison, Ubiquity, Volume 2003 Issue November
[37] McCulloch, W. and Pitts, W. (1943). A logical calculus of the ideas
immanent in nervous activity. Bulletin of Mathematical Biophysics, 7:115 -
133.
[38] Rosenblatt, Frank, The Perceptron--a perceiving and recognizing
automaton. Report 85-460-1, Cornell Aeronautical Laboratory, 1957
[39] Russell, Ingrid. "The Delta Rule". University of Hartford, November 2012
[40] P.J.G. Lisboa, T.A. Etchells, D.C. Pountney: Minimal MLPs do not model
the XOR logic, Neurocomputing, Volume 48, Issues 1–4, October 2002,
Pages 1033–1037
[41] Deza, E.; Deza, M.: Dictionary of Distances, Elsevier, ISBN 0-444-52087-
2, 2006
[42] Roland Priemer : Introductory Signal Processing. World Scientific. p. 1.
ISBN 9971509199, 1991
[43] Estivill-Castro, V.: "Why so many clustering algorithms". ACM SIGKDD
Explorations Newsletter 4: 65. doi:10.1145/568574.568575, 2002
[44] Borodin, A.; El-Yaniv, R.: Online Computation and Competitive Analysis.
Cambridge University Press. ISBN 0-521-56392-5, 1998
[45] Ionosphere data set, http://archive.ics.uci.edu/ml/machine-learning-
databases/ ionosphere, 12. 3. 2013
[46] Alsmadi M. S., Omar B. K.: Back Propagation Algorithm: The Best
Algorithm Among the Multi-layer Perceptron Algorithm, IJCSNS, April
2009
[47] Sharma K. S., Constructive Neural Networks: a review, International
Journal of Engineering Science and Technology, 2010, pp. 7847-7855
[48] Aizenberg I., Moraga C.: Multilayer Feedforward Neural Network Based on
Multi-Valued Neurons and Backpropagation Learning Algorithm, Soft
Computing, January 2007, pp. 169-183
[49] P. A. Castillo, J. Carpio, J. J. Merelo, A. Prieto, V. Rivas, G. Romero:
Evolving Multilayer Perceptrons, Neural Processing Letters, 2000, pp. 115-
127
[50] J. L. Subirat, L. Franco, I. Molina, J. M. Jerez: Active Learning Using a
Constructive Neural Network Algorithm, Constructive Neural Networks,
pp. 193-206, 2009, Springer Verlag
[51] Y. G. Smetanin: Neural Networks as system for recognizing patterns,
Journal of Mathematical Science, 1998
[52] E. Ferrari, M. Muselli: Efficient Constructive Techniques for Training
Switching Neural Networks, Constructive Neural Networks, pp. 24-48,
2009, Springer Verlag
[53] J. F. C. Khaw, B. S. Lim, L. E. N. Lim: Optimal Design of Neural Networks
Using the Taguchi Method, Neurocomputing, 1995, pp. 225-245
[54] B. Ploj, R. Harb, M. Zorman, Border Pairs Method—constructive MLP
learning classification algorithm, Neurocomputing, Volume 126, 27
February 2014, Pages 180-187