Lecture Notes on Machine Learning
Classifiers for Non-Linearly Separable Classes (Part 2)
Christian Bauckhage
B-IT, University of Bonn
A single binary linear classifier cannot achieve high accuracy if the
two classes under consideration are not linearly separable. Here, we
therefore look at how an ensemble of linear classifiers can achieve
high accuracy in such a setting. Mastering the ideas discussed in this
note will help us later on to understand the inner workings of neural
networks.
Introduction
Previously (Bauckhage, 2019a), we discussed the problem of binary classification where
data vectors x ∈ ℝᵐ are supposed to be classified into two distinct
classes. We saw that a binary linear classifier
    y(x) = sign(wᵀx − θ)    (1)

with parameters w ∈ ℝᵐ and θ ∈ ℝ accomplishes this by mapping
its inputs to either +1 or −1. Analyzing the right-hand side of (1), we
observed that the decision of the classifier is simply based on testing
whether or not its input resides within the half-space

    HS = { x ∈ ℝᵐ : wᵀx ≥ θ }    (2)

above the hyperplane

    HP = { x ∈ ℝᵐ : wᵀx = θ }.    (3)
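For concreteness, the classifier in (1), and with it the half-space test in (2), can be sketched in a few lines of code. The following is a minimal sketch in Python with NumPy; the weight vector, threshold, and test points are illustrative values chosen here, not taken from the note.

```python
import numpy as np

def linear_classifier(x, w, theta):
    """Binary linear classifier y(x) = sign(w^T x - theta), cf. (1).
    Returns +1 if x lies in the half-space {x : w^T x >= theta}, cf. (2), else -1."""
    return 1 if np.dot(w, x) >= theta else -1

# illustrative parameters: the hyperplane x1 + x2 = 0
w, theta = np.array([1.0, 1.0]), 0.0
print(linear_classifier(np.array([ 0.5,  0.5]), w, theta))   # +1 (inside the half-space)
print(linear_classifier(np.array([-1.0, -0.5]), w, theta))   # -1 (outside the half-space)
```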
Given this geometric insight, we concluded that an appropriately
parametrized linear classifier will work perfectly if (the data from)
the two classes are linearly separable (Bauckhage, 2019d). At the same time, we had to
concede that linear classifiers cannot achieve perfect accuracy if the
two classes are not linearly separable. Yet, a lack of linear separability
is nothing to worry about. In fact, we also already discussed how
non-linear transformations of the input data allow us to deal with
such a setting (Bauckhage, 2019b).
Figure 1: 2D data sampled from two classes that are not linearly separable.
In this note, we discuss yet another way to adapt linear
classifiers to problems involving non-linearly separable classes. This
time, the idea is to combine several linear classifiers into a single non-
linear classifier. Again, at this point, we only outline the underlying
principles and defer the crucial question of how to actually train such
a classifier to later notes.
Classifier Ensembles for Non-Linearly Separable Classes
Figure 1 shows an example of 2D data points sampled from classes
that are not linearly separable. Hence, no single hyperplane can
perfectly separate the blue data (representing the first class) from the
orange ones (representing the second class).
Figure 2: Linear classifiers divide the data space into two half-spaces.
Panels: (a) y₁(x) = sign(w₁ᵀx − θ₁), (b) y₂(x) = sign(w₂ᵀx − θ₂),
(c) y₃(x) = sign(w₃ᵀx − θ₃), (d) y₄(x) = sign(w₄ᵀx − θ₄).
Dealing with data from non-linearly separable classes, classification based
on only two half-spaces is usually weak in that only slightly more than half of
the data are classified correctly. (In each of the four examples shown here, all the
blue and about a quarter of the orange data points are classified correctly.)
Now, pretend we asked a machine learning expert to have a look
at these data and to design four binary linear classifiers which would
solve our classification problem "half-decently". Figure 2 visualizes
what the expert may come up with.
Each panel in this figure shows a pair of half-spaces of ℝ² tested
for by one of the following four classifiers

    y₁(x) = sign(w₁ᵀx − θ₁) = sign( e₁ᵀx + 1.3)    (4)
    y₂(x) = sign(w₂ᵀx − θ₂) = sign(−e₁ᵀx + 1.3)    (5)
    y₃(x) = sign(w₃ᵀx − θ₃) = sign( e₂ᵀx + 1.3)    (6)
    y₄(x) = sign(w₄ᵀx − θ₄) = sign(−e₂ᵀx + 1.3)    (7)

where e₁ = [1, 0]ᵀ and e₂ = [0, 1]ᵀ are the standard basis vectors in ℝ².
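For illustration, the four weak classifiers in (4) to (7) can be written out directly in code. The following minimal Python/NumPy sketch (the test points are chosen by us, not taken from the note) evaluates all four on two inputs:

```python
import numpy as np

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # standard basis vectors of R^2

# weight vectors and thresholds of the weak classifiers in (4)-(7);
# note that sign(e1^T x + 1.3) equals sign(e1^T x - theta) with theta = -1.3
ws     = [e1, -e1, e2, -e2]
thetas = [-1.3, -1.3, -1.3, -1.3]

def y(i, x):
    """i-th weak classifier y_i(x) = sign(w_i^T x - theta_i)."""
    return 1 if np.dot(ws[i], x) >= thetas[i] else -1

print([y(i, np.array([0.2, -0.4])) for i in range(4)])   # [1, 1, 1, 1]   (central point)
print([y(i, np.array([1.8,  0.0])) for i in range(4)])   # [1, -1, 1, 1]  (point far to the right)
```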
Classifiers such as those in Fig. 2 are called weak classifiers.
This means that they work only slightly better than random guessing
in that they correctly classify just more than half of the given data.
Interestingly, however, an ensemble of weak classifiers can form a
strong classifier.
Figure 3: Intersecting the four blue half-spaces in Fig. 2 forms a polytope that
separates the blue from the orange data points.
In order to see how this works, we recall that the intersection
of finitely many half-spaces forms a polytope (Bauckhage, 2019c).
Figure 3 shows the polytope P that results from intersecting the
four blue half-spaces in Fig. 2. Evidently, it contains all of the
points from our first class and none of the points from our second
class. Hence, if we had a function Y(x) to test whether or not x ∈ ℝ²
resides within P, we would have solved our classification problem.
Writing

    P = HS₁ ∩ HS₂ ∩ HS₃ ∩ HS₄    (8)

where

    HSᵢ = { x ∈ ℝ² : wᵢᵀx ≥ θᵢ } = { x ∈ ℝ² : yᵢ(x) = +1 }    (9)

denotes the half-space of those x ∈ ℝ² for which the i-th of our weak
classifiers returns a value of +1, we realize that

    x ∈ P  ⇔  y₁(x) + y₂(x) + y₃(x) + y₄(x) = 4.    (10)
In other words, a data point x ∈ ℝ² is contained in the intersection
of the four blue half-spaces in Fig. 2 if each of the four weak classifiers
maps it to +1, or, equivalently, if summing the four classifier
responses amounts to 4.

Therefore, since 4 is greater than, say, 3.5, we can write down an
expression for a stronger, non-linear classifier Y(x) that looks eerily
similar to the one for a linear classifier in (1), namely

    Y(x) = sign( ∑ᵢ₌₁⁴ yᵢ(x) − 3.5 ).    (11)
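In code, the classifier in (11) is just a vote count over the four weak classifiers. A minimal self-contained sketch in Python with NumPy (the test points are again chosen by us for illustration):

```python
import numpy as np

W     = np.array([[ 1.0,  0.0],    # w_1 =  e_1
                  [-1.0,  0.0],    # w_2 = -e_1
                  [ 0.0,  1.0],    # w_3 =  e_2
                  [ 0.0, -1.0]])   # w_4 = -e_2
theta = np.full(4, -1.3)

def Y(x):
    """Strong classifier (11): +1 iff all four weak classifiers vote +1, cf. (10)."""
    votes = np.where(W @ x >= theta, 1, -1)     # the four responses y_i(x)
    return 1 if votes.sum() - 3.5 >= 0 else -1

print(Y(np.array([0.2, -0.4])))   # +1, the point lies inside the polytope P
print(Y(np.array([1.8,  0.0])))   # -1, the point lies outside of P
```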
Finally, to generalize the specific result in (11) to settings where we
are dealing with data x ∈ ℝᵐ and a total of k weak linear classifiers,
we note that we can gather their results yᵢ(x) in a vector

    y(x) = [ y₁(x), y₂(x), …, yₖ(x) ]ᵀ    (12)

and write the corresponding strong non-linear classifier as

    Y(x) = sign(1ᵀy(x) − ϑ).    (13)

Note that we can also write the vector in (12) as y(x) = σ(Wx − θ),
where W is the k × m matrix whose rows are the wᵢᵀ, θ = [θ₁, …, θₖ]ᵀ,
and σ : ℝᵏ → {−1, +1}ᵏ is a vectorial sign function.
More generally still, we further note that there may be situations
where we might want to individually weigh the contributions of the
classifiers yᵢ(x) using weights ωᵢ. The resulting ensemble classifier
would then be

    Y(x) = sign(ωᵀy(x) − ϑ).    (14)
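A minimal sketch of the general ensemble classifier in (12) to (14) in Python with NumPy, using the matrix notation from the remark after (13); the parameter values at the bottom reproduce our running 2D example and are otherwise arbitrary:

```python
import numpy as np

def ensemble_classifier(x, W, theta, omega, vartheta):
    """Weighted ensemble (14): Y(x) = sign(omega^T y(x) - vartheta), where the
    vector y(x) = sigma(W x - theta) collects the k weak responses, cf. (12)."""
    y = np.where(W @ x - theta >= 0.0, 1.0, -1.0)            # vectorial sign function sigma
    return 1 if np.dot(omega, y) - vartheta >= 0.0 else -1

# running example: four weak classifiers, equal weights omega_i = 1, threshold 3.5
W        = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
theta    = np.full(4, -1.3)
omega    = np.ones(4)
vartheta = 3.5

print(ensemble_classifier(np.array([0.2, -0.4]), W, theta, omega, vartheta))   # +1
print(ensemble_classifier(np.array([1.8,  0.0]), W, theta, omega, vartheta))   # -1
```

With ω = 1 and ϑ = 3.5, this reduces to the classifier in (11).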
Summary and Outlook
In this note, we were again concerned with binary classification of
data from two non-linearly separable classes. While a single linear
classifier cannot work perfectly in such a setting, we discussed by
means of an example how an ensemble of several, individually weak
linear classifiers

    yᵢ(x) = sign(wᵢᵀx − θᵢ)

can form a strong(er) non-linear classifier

    Y(x) = sign(ωᵀy(x) − ϑ)

which is typically able to achieve much higher classification accuracy.
Figure 4: Classifiers such as in (14) are shallow feed forward neural networks!
(The figure shows a three-layer network with bias units labeled "1", input neurons
x₁, x₂, hidden neurons y₁, …, y₄, an output neuron Y, and directed edges carrying
the weights wᵢⱼ, −θᵢ, ωᵢ, and −ϑ.)
While we did not discuss how the parameters {wᵢ, θᵢ}ᵢ₌₁ᵏ ∪ {ω, ϑ}
of such a classifier can be learned from training data (which will hap-
pen later on), we may already point out that Y(x) can be understood
as a shallow feed forward neural network.
Keeping with our above example, Fig. 4 illustrates how combining
four weak classifiers y₁(x), …, y₄(x) into a stronger one Y(x) forms
a neural network that classifies data x ∈ ℝ² into two classes.
The network consists of three layers of neurons. Neurons labeled
"1" are so-called bias units; neurons labeled "xᵢ" are input neurons;
neurons labeled "yᵢ" are internal or hidden neurons; and the neu-
ron labeled "Y" provides the output of this network. The network's
directed edges are weighted by appropriate parameters and indicate
the flow of information through this system. As information only
flows in one direction, the network is a feed forward network; as
there is only one layer of hidden neurons, it is a shallow network.
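To make the correspondence to Fig. 4 explicit, here is a brief sketch of the forward pass through this small network in Python with NumPy; the parameters are those of our running example, and the comments indicate which part of the figure each step belongs to.

```python
import numpy as np

# input layer -> hidden layer: each hidden neuron y_i computes sign(w_i^T x - theta_i);
# the constant -theta_i enters via the edge from the bias unit "1"
W     = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])   # rows w_i^T
theta = np.full(4, -1.3)

# hidden layer -> output layer: the neuron Y computes sign(omega^T y - vartheta)
omega, vartheta = np.ones(4), 3.5

def forward(x):
    hidden = np.where(W @ x - theta >= 0.0, 1.0, -1.0)      # activations y_1(x), ..., y_4(x)
    return 1 if omega @ hidden - vartheta >= 0.0 else -1    # network output Y(x)

print(forward(np.array([0.0, 0.0])))   # +1, point inside the polytope
print(forward(np.array([2.0, 2.0])))   # -1, point outside the polytope
```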
Acknowledgments
This material was prepared within project P3ML which is funded by
the Ministry of Education and Research of Germany (BMBF) under
grant number 01/S17064. The authors gratefully acknowledge this
support.
References
C. Bauckhage. Lecture Notes on Machine Learning: Binary Linear Classifiers. B-IT, University of Bonn, 2019a.

C. Bauckhage. Lecture Notes on Machine Learning: Classifiers for Non-Linearly Separable Classes (Part 1). B-IT, University of Bonn, 2019b.

C. Bauckhage. Lecture Notes on Machine Learning: Convex Sets. B-IT, University of Bonn, 2019c.

C. Bauckhage. Lecture Notes on Machine Learning: Linear Separability. B-IT, University of Bonn, 2019d.