A MapReduce based distributed SVM
algorithm for binary classification
Ferhat Özgür Çatak1, Mehmet Erdal Balaban2
1National Research Institute of Electronics and Cryptology, TUBITAK, Turkey,
Tel: 0-262-6481070, e-mail: ozgur.catak@tubitak.gov.tr
2Faculty of Business Administration, Quantitative Methods, Istanbul University, Turkey,
Tel: 0-212-4400000, e-mail: balaban@gmail.com
Abstract
Although the Support Vector Machine (SVM) algorithm generalizes well to unseen examples after the
training phase and has a small loss value, it is not well suited to large real-life classification and
regression problems: SVM training does not scale to datasets with hundreds of thousands of examples.
In previous studies on distributed machine learning algorithms, SVMs were trained on costly, preconfigured
computing environments. In this research, we present a MapReduce based distributed parallel SVM training
algorithm for binary classification problems. This work shows how to distribute the optimization problem over
cloud computing systems with the MapReduce technique. In the second step of this work, we use statistical
learning theory to find the predictive hypothesis that minimizes the empirical risk over the hypothesis spaces
created by the reduce function of MapReduce. The results of this research are important for training SVM
based classifiers on big datasets. We show that, with iterative training on the split dataset using the
MapReduce technique, the accuracy of the classifier function converges to the accuracy of the globally optimal
classifier function in a finite number of iterations. The performance of the algorithm was measured on samples
from the letter recognition and pen-based recognition of handwritten digits datasets.
Key Words: Support Vector Machine, Machine Learning, Cloud Computing, MapReduce, Large Scale Dataset
1. Introduction
Most machine learning algorithms struggle with the computational complexity of the
training phase on large scale datasets, and applying classification algorithms to such
datasets is computationally expensive. The computation time and storage space of the
Support Vector Machine (SVM) algorithm are largely determined by the large scale kernel
matrix [1]. Computational complexity and computation time are always limiting factors for
machine learning in practice. To overcome this complexity problem, researchers have
developed several techniques: feature selection, feature extraction and distributed computing.
Feature selection methods are used to construct machine learning models with a reduced
number of features. Feature selection is a basic approach for reducing feature vector size [2].
A new feature subset is obtained with various algorithms such as information gain [3],
correlation based feature selection [4], Gini index [5] and t-statistics. Feature selection
methods address two main problems. The first is reducing the number of features in the
training set so that computing resources such as memory and CPU are used effectively; the
second is removing noisy features from the dataset in order to improve the performance of
the classification algorithm [6].
Feature extraction methods are used to overcome the curse of dimensionality, which refers to
the problems that arise as the dimensionality increases. In this approach, a high dimensional
feature space is transformed into a low dimensional feature space. There are several feature
extraction algorithms, such as Principal Component Analysis (PCA) [7], Singular Value
Decomposition (SVD) [8] and Independent Component Analysis (ICA) [9].
The last approach to overcoming the large memory and computation requirements of
training on large scale datasets is chunking or distributed computing [10]. Graf et al. [11]
proposed the cascade SVM for very large scale classification problems. In this method, the
dataset is split into parts in feature space. Non-support vectors of each sub dataset are
filtered out and only support vectors are transmitted. The margin optimization process then
uses only the combined sub datasets to find the support vectors. Collobert et al. [12] proposed
a parallel SVM training and classification algorithm in which each subset of a dataset is
trained with an SVM and the classifiers are then combined into a single final classifier
function. Lu et al. [13] proposed a distributed support vector machine algorithm based on a
strongly connected network. In this method, the dataset is split into roughly equal parts, one
for each computer in a network, and support vectors are then exchanged among these
computers. Ruping et al. [14] proposed an incremental learning method with SVM. Syed et
al. [15] proposed another incremental learning method, in which a fusion center collects all
support vectors from the distributed computers. Caragea et al. [16] built on this method: in
their algorithm, the fusion center iteratively sends support vectors back to the computers.
Sun et al. [17] proposed a method for parallelized SVM based on the MapReduce technique.
Their method follows the cascade SVM model and is built on Twister, an iterative MapReduce
model, whereas our implementation uses Hadoop based MapReduce. They use only the support
vectors of a sub dataset to find an optimal classifier function. Another difference from our
approach is that they apply feature selection with the correlation coefficient method to reduce
the number of features in the datasets before training the SVM, in order to improve training
time.
In our previous research [18], we developed a MapReduce based SVM training approach for
binary classification problems. We used several UCI datasets to show the generalization
property of our algorithm.
In this paper, we propose a novel approach and a formal analysis of the models generated
with the MapReduce based binary SVM training method. We distribute the whole training
dataset over the data nodes of a cloud computing system. At each node, a subset of the training
dataset is used to train a binary classifier function. The algorithm collects the support vectors
(SVs) from every node in the cloud computing system and then merges all SVs to save them as
global SVs. Our algorithm is evaluated on the letter recognition [19] and pen-based
recognition of handwritten digits [20] datasets with Hadoop streaming using the mrjob Python
library. Our algorithm is built on LibSVM and implemented using the Hadoop
implementation of MapReduce.
The organization of this article is as follows. In the next section, we provide an
overview of SVM formulations. In Section 3, we present the MapReduce pattern in detail.
Section 4 explains the system model with our implementation of the MapReduce pattern for
SVM training. In Section 5, the convergence of our algorithm is explained. In Section 6,
simulation results with the letter recognition and pen-based recognition of handwritten digits
datasets are shown. Finally, we give concluding remarks in Section 7.
2. Support Vector Machine
In machine learning, SVM is a supervised learning algorithm for classification and
regression problems, depending on the type of output. SVM uses statistical learning theory to
maximize the generalization property of the generated classifier model, and thereby avoids
overfitting to the training dataset. Statistical learning theory measures the quality of fit to
the training data through the empirical error. The empirical risk is
$$R_{emp}(f) = \frac{1}{m} \sum_{i=1}^{m} L(f(x_i), y_i),$$
which is the average loss of the chosen estimator $f$ over the training set $D$ [21]. SVM uses a
set of training data and predicts, for each given input, which of the two possible classes
$y \in \{-1, 1\}$ it belongs to. As shown in Figure 1, the hyperplane is defined by
$w \cdot x + b = 0$, where $w$ is a vector orthogonal to the hyperplane and $b$ is the bias. The
training data $D$ is a set of points of the form
$$D = \{(x_i, y_i) \mid x_i \in \mathbb{R}^d,\ y_i \in \{-1, 1\}\}_{i=1}^{m} \tag{1}$$
where $x_i$ is a $d$-dimensional real vector and $y_i$ is the class of input vector $x_i$, either -1 or 1.
Figure 1 Classification of an SVM with maximum-margin hyperplane trained with
samples from two classes.
SVMs aim to find a hyperplane that maximizes the margin between the two classes of
samples in $D$ with the smallest empirical risk [22]. For the generalization property of SVM,
two parallel hyperplanes are defined such that $w \cdot x + b = 1$ and $w \cdot x + b = -1$. One can
combine these two conditions into a single one:
$$y_i (w \cdot x_i + b) \ge 1 \tag{2}$$
SVM aims to maximize the distance between these two hyperplanes, which is $\frac{2}{\|w\|}$.
The training of SVM for the non-separable case is solved using the quadratic optimization
problem shown in Equation (3).
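$$\min_{w, b, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \tag{3}$$
$$\text{subject to}\quad y_i(\langle w, \phi(x_i) \rangle + b) \ge 1 - \xi_i,\quad \xi_i \ge 0$$
for $i = 1, \dots, m$, where $\xi_i$ are slack variables and $C$ is the cost variable of each slack. $C$ is
a control parameter for the trade-off between margin maximization and empirical risk
minimization. The decision function of SVM is $f(x) = w^T \phi(x) + b$, where $w$ and $b$ are
calculated by the optimization problem in Equation (3). By using Lagrange multipliers, the
optimization problem in Equation (3) can be expressed as
$$\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle \phi(x_i), \phi(x_j) \rangle$$
$$\text{subject to}\quad 0 \le \alpha_i \le C,\quad \sum_{i=1}^{m} \alpha_i y_i = 0$$
where $\alpha = [\alpha_1, \dots, \alpha_m]$ is the vector of Lagrange multipliers. It is not necessary to
know the function $\phi$ explicitly; it is only necessary to know how to compute the modified
inner product, which is called the kernel function and is written
$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$. Thus, $f(x) = \sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b$, and the
predicted class is $\mathrm{sign}[f(x)]$ [23].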
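The kernel form of the decision function, a weighted sum of kernel evaluations against the support vectors plus the bias, can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the linear kernel choice and the two-point toy problem are assumptions made here, not part of the original implementation, which relies on LibSVM.

```python
def linear_kernel(u, v):
    """K(u, v) = <u, v>, i.e. phi is the identity map."""
    return sum(a * b for a, b in zip(u, v))


def decision_value(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, summed over support vectors."""
    return sum(
        alpha * y * kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) + b


def predict(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """Predicted class in {-1, +1}: the sign of the decision value."""
    value = decision_value(x, support_vectors, labels, alphas, b, kernel)
    return 1 if value >= 0 else -1


# Hand-solved toy problem (illustrative, not from the paper):
# x_1 = (1, 1) with y_1 = +1 and x_2 = (-1, -1) with y_2 = -1.
# The hard-margin solution is alpha_1 = alpha_2 = 0.25 and b = 0.
svs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]
alphas = [0.25, 0.25]
bias = 0.0

print(decision_value((1.0, 1.0), svs, ys, alphas, bias))  # 1.0 (on the margin)
print(predict((2.0, 0.0), svs, ys, alphas, bias))         # 1
print(predict((-3.0, 1.0), svs, ys, alphas, bias))        # -1
```

Only the support vectors enter the sum, which is why exchanging support vectors between nodes is enough to carry a trained model's information.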
3. MapReduce Model
MapReduce is a programming model derived from the combination of the map and reduce
functions of functional programming. The MapReduce model is widely used to run parallel
applications for large scale dataset processing. MapReduce uses key/value pairs as the data
type in its map and reduce functions [24]. An overview of the MapReduce system is shown in
Figure 2.
Figure 2 Overview of MapReduce System
The MapReduce pattern is divided into two functions, map and reduce. These two
functions are separated by a shuffle step over the intermediate key/value data. The MapReduce
framework executes these functions in parallel over any number of computers [25].
Put simply, a MapReduce job executes three basic operations on a dataset distributed across
many shared-nothing cluster nodes. The first is the Map function, which each node runs in
parallel without transferring any data to other nodes. In the next operation, the data
processed by the Map function is repartitioned across all nodes of the cluster. Lastly, the
Reduce task is executed in parallel by each node on its partition of the data.
A file in the distributed file system (DFS) is split into multiple chunks and each chunk is
stored on a different data node. The input of a map function is a key/value pair from the input
chunks of the dataset, and its output is a list of key/value pairs:
$$\mathrm{map}(k_1, v_1) \rightarrow \mathrm{list}(k_2, v_2)$$
A reduce function takes a key and its list of values as input, and generates a list of new
values as output:
$$\mathrm{reduce}(k_2, \mathrm{list}(v_2)) \rightarrow \mathrm{list}(v_3)$$
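The map, shuffle and reduce operations described above can be imitated in a single process. The sketch below is illustrative only: `run_mapreduce`, the mapper and the reducer are hypothetical names, the class-counting job is an invented example, and no Hadoop API is involved.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle and reduce sequentially in one process."""
    # Map: each input (k1, v1) pair yields a list of (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))
    # Shuffle: group all v2 values by their k2 key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce: each (k2, list(v2)) group yields a list of output values.
    return {k2: reducer(k2, values) for k2, values in sorted(groups.items())}


def count_labels_mapper(chunk_id, rows):
    """Emit (label, 1) for every (features, label) row of one chunk."""
    return [(label, 1) for _features, label in rows]


def sum_reducer(label, counts):
    """Sum the per-row counts that share one label."""
    return [sum(counts)]


# Two chunks of a tiny labeled dataset, keyed by chunk id.
chunks = [
    (0, [((0.1, 0.2), 1), ((0.3, 0.1), -1)]),
    (1, [((0.5, 0.9), 1), ((0.7, 0.4), 1)]),
]
class_counts = run_mapreduce(chunks, count_labels_mapper, sum_reducer)
print(class_counts)  # {-1: [1], 1: [3]}
```

In a real cluster the mapper calls run on different nodes and the shuffle moves data over the network; the data flow between the three stages is the same.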
4. System Model
The cloud computing based binary class support vector machine algorithm works as
follows. The training set is split into subsets. Each node within the cloud computing system
trains an SVM locally on its sub dataset and obtains the α values (i.e. the support vectors
(SVs)), and then passes the calculated SVs on to be merged into the global SVs. In the Map
stage of the MapReduce job, each subset of the training set is combined with the global
support vectors. In the Reduce step, an SVM is trained on the merged subset, and the resulting
new support vectors are combined with the global support vectors. In other words, each node
first reads the global SV set, then merges it with its local subset of the training data and
trains an SVM on the result. Finally, the SV sets computed on all cloud nodes are merged, and
the algorithm updates the global SV set with the new ones. Our notation is given in Table 1,
followed by the steps of the algorithm.
Table 1: The notation used in our work.
Notation         Description
$t$              Iteration number
$n$              Number of computers (or MapReduce function size)
$h^t$            Best hypothesis at iteration $t$
$D_l$            Sub data set at computer $l$
$SV_l$           Support vectors at computer $l$
$SV_{Global}$    Global support vector
1. As initialization, set $t = 0$ and the global support vector set $SV_{Global}^{0} = \emptyset$
2. t = t + 1;
3. Each computer $l$, $l = 1, \dots, n$, reads the global SVs and merges them with its subset of
the training data.
4. Train the SVM algorithm on the merged dataset
5. Find the support vectors
6. After all computers in the cloud system complete their training phase, merge all calculated SVs
and save the result to the global SVs
7. If $h^t = h^{t-1}$ stop, otherwise go to step 2
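Algorithm 1 Map Function of Binary SVM Algorithm
  $SV_{Global} = \emptyset$    // Empty global support vector set
  while $h^t \ne h^{t-1}$ do
    for $l = 1, \dots, n$ do    // For each subset loop
      $D_l^{t} \leftarrow D_l \cup SV_{Global}^{t-1}$
    end for
  end while
Algorithm 2 Reduce Function of Binary SVM Algorithm
  while $h^t \ne h^{t-1}$ do
    for $l = 1, \dots, n$ do
      // Train merged Dataset to obtain Support
      // Vectors and Binary-Class Hypothesis
      $SV_l^{t}, h_l^{t} \leftarrow \mathrm{binarySvm}(D_l^{t})$
    end for
    for $l = 1, \dots, n$ do
      $SV_{Global}^{t} \leftarrow SV_{Global}^{t} \cup SV_l^{t}$
    end for
  end while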
Pseudocode for the Map and Reduce functions of our algorithm is given in Algorithm 1 and
Algorithm 2.
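The iterative procedure can also be sketched as a single-process Python loop. The sketch rests on several assumptions that are not in the original: `train_1d_svm` is a toy hard-margin trainer for separable one-dimensional data that stands in for LibSVM, the global SV set is replaced by the union of the node SV sets in each iteration (one reading of step 6), and the stopping test compares the per-node thresholds rather than the empirical risk of the best hypothesis.

```python
def train_1d_svm(samples):
    """Toy stand-in for LibSVM: hard-margin SVM on separable 1-D data.

    samples holds (x, y) tuples with y in {-1, +1}. Returns the two support
    vectors and the decision threshold (the midpoint between them).
    """
    sv_neg = max((s for s in samples if s[1] == -1), key=lambda s: s[0])
    sv_pos = min((s for s in samples if s[1] == 1), key=lambda s: s[0])
    if sv_neg[0] >= sv_pos[0]:
        raise ValueError("toy trainer assumes linearly separable 1-D data")
    return {sv_neg, sv_pos}, (sv_neg[0] + sv_pos[0]) / 2.0


def distributed_svm(subsets, train=train_1d_svm, max_iter=20):
    """Iterate map (merge) and reduce (train, collect SVs) until stable."""
    global_svs = set()                 # step 1: empty global SV set
    previous = None
    hypotheses = []
    iteration = 0
    while iteration < max_iter:
        iteration += 1                 # step 2
        results = []
        for subset in subsets:         # one pass per node
            merged = set(subset) | global_svs    # step 3: merge with global SVs
            results.append(train(merged))        # steps 4-5: train, get SVs
        # Step 6: merge the SVs of all nodes into the new global SV set.
        global_svs = set().union(*(svs for svs, _ in results))
        hypotheses = [h for _, h in results]
        if hypotheses == previous:     # step 7: stop when nothing changes
            break
        previous = hypotheses
    return global_svs, hypotheses, iteration


# Three hypothetical node subsets; each contains both classes.
subsets = [
    [(-5.0, -1), (-2.0, -1), (3.0, 1), (6.0, 1)],
    [(-4.0, -1), (-1.0, -1), (2.0, 1), (7.0, 1)],
    [(-3.0, -1), (-0.5, -1), (1.0, 1), (8.0, 1)],
]
global_svs, thresholds, iterations = distributed_svm(subsets)
print(thresholds, iterations)  # [0.25, 0.25, 0.25] 3
```

In this toy run every node ends with the threshold that a single SVM trained on the full dataset would produce, since the two globally closest points reach all nodes through the global SV set after the first iteration.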
For training the SVM classifier functions, we used LibSVM with various kernels.
Appropriate parameter values were found by cross validation; we used 10-fold cross
validation. The whole system is implemented with Hadoop and the mrjob Python streaming
library.
5. Convergence of The Algorithm with Statistical Learning Theory
Let $S$ denote a subset of the training dataset $D$, let $F(S)$ be the optimal classifier function
over dataset $S$, let $h^*$ be the global optimal hypothesis, which has minimal empirical risk
$R_{emp}(h)$ over dataset $D$, and let $V$ be the vector space of all possible outputs over sub
dataset $S$. Our algorithm's aim is to find a classifier function $f$ such that
$R_{emp}(f) = R_{emp}(h^*)$. Let $H$ be the hypothesis space of functions $f$. Our algorithm
starts with $SV_{Global}^{0} = \emptyset$, and generates a non-increasing sequence of positive
sets of vectors $SV_{Global}^{t}$, where $SV_{Global}^{t}$ is the vector of support vectors at the
$t$-th iteration. We used hinge loss for testing the models trained with our algorithm. Hinge
loss works well for SVM as a classifier, since the more a sample violates the margin, the higher
the penalty is [26]. The hinge loss function is the following:
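$$\ell(f(x), y) = \max(0,\ 1 - y \cdot f(x)) \tag{5}$$
Empirical risk can be computed with an approximation:
$$R_{emp}(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i), y_i) \tag{6}$$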
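Equations (5) and (6) translate directly into code. The following is a minimal sketch with hypothetical function names and made-up scores; it assumes labels in {-1, +1} and raw (unthresholded) classifier scores.

```python
def hinge_loss(score, y):
    """max(0, 1 - y * f(x)) for a label y in {-1, +1} and raw score f(x)."""
    return max(0.0, 1.0 - y * score)


def empirical_risk(scores, labels):
    """Average hinge loss over a sample, as in the empirical risk estimate."""
    losses = [hinge_loss(s, y) for s, y in zip(scores, labels)]
    return sum(losses) / len(losses)


# Illustrative scores: two confident correct predictions (loss 0), one
# correct prediction inside the margin (loss 0.5), one error (loss 1.3).
scores = [2.0, 0.5, -0.3, -1.5]
labels = [1, 1, 1, -1]
risk = empirical_risk(scores, labels)
print(round(risk, 4))  # 0.45
```

A correctly classified sample still contributes a positive loss when it lies inside the margin, which is the behavior the text attributes to hinge loss.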
According to the empirical risk minimization principle, the binary class learning algorithm
should choose a hypothesis $\hat{h}$ in the hypothesis space $H$ which minimizes the empirical risk:
$$\hat{h} = \arg\min_{h \in H} R_{emp}(h) \tag{7}$$
A hypothesis is found in every cloud node. Let $D_l$ be the subset of training data at cloud
node $l$, where $D_l \subset D$, let $SV_{Global}^{t}$ be the vector of support vectors at the
$t$-th iteration, and let $h_l^t$ be the hypothesis at node $l$ at iteration $t$.
The algorithm's stopping point is reached when the empirical risk of the hypothesis is the
same as in the previous iteration. That is:
$$R_{emp}(h^t) = R_{emp}(h^{t-1}) \tag{11}$$
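The stopping rule can be written as a small check over the sequence of empirical risks. The sketch below is an assumption-laden illustration: the function name and the optional tolerance `tol` are introduced here, and the example sequence reuses the first five average hinge loss values reported for the letter recognition dataset in Table 5 purely as sample input (they are averages over folds, not a single run).

```python
def stopping_iteration(risks, tol=0.0):
    """Return the 1-based iteration t with |R(h^t) - R(h^(t-1))| <= tol.

    risks[0] is the risk after iteration 1. Returns None if the sequence
    never stabilizes within the given values.
    """
    for t in range(1, len(risks)):
        if abs(risks[t] - risks[t - 1]) <= tol:
            return t + 1
    return None


letter_avg_loss = [0.00925, 0.00045, 0.00005, 0.00005, 0.00005]
print(stopping_iteration(letter_avg_loss))  # 4
```

With exact equality (`tol=0.0`) the loop stops at the first repeated value; a small positive tolerance would stop earlier when the risk is still changing by a negligible amount.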
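Lemma: The accuracy of the classifier function of our algorithm at iteration $t$ is always
greater than or equal to the maximum accuracy of the classifier function at iteration $t-1$.
In terms of empirical risk, that is
$$R_{emp}(h^t) \le \min_{1 \le l \le n} R_{emp}(h_l^{t-1}) \tag{12}$$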
Proof: Without loss of generality, iterated MapReduce binary class SVM monotonically
converges to an optimum classifier.
$$SV_{Global}^{t} = \bigcup_{l=1}^{n} SV_l^{t} \tag{13}$$
where n is the dataset split size (or cloud node size). Then, the training set for the SVM
algorithm at node $l$ is
$$D_l^{t} = D_l \cup SV_{Global}^{t-1} \tag{14}$$
Adding more samples cannot decrease the optimal value. The generalization accuracy of the
sub problem in each node monotonically increases in each iteration step.
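The step "adding more samples cannot decrease the optimal value" can be spelled out for the soft-margin objective of Equation (3). The following is a sketch under the assumption that each slack variable is written in its hinge form; the symbols $P_S$ and $P_S^{*}$ (the objective and its optimal value on a sample set $S$) are introduced here for illustration and do not appear in the original.

```latex
\begin{aligned}
P_S(w, b) &= \tfrac{1}{2}\|w\|^2
  + C \sum_{i \in S} \max\bigl(0,\ 1 - y_i(\langle w, \phi(x_i) \rangle + b)\bigr), \\
S \subseteq S' \ \Rightarrow\ P_{S'}(w, b) &= P_S(w, b)
  + C \sum_{i \in S' \setminus S} \max\bigl(0,\ 1 - y_i(\langle w, \phi(x_i) \rangle + b)\bigr)
  \ \ge\ P_S(w, b), \\
\Rightarrow\ P_{S'}^{*} = \min_{w, b} P_{S'}(w, b) &\ \ge\ \min_{w, b} P_S(w, b) = P_S^{*}.
\end{aligned}
```

Since every added term is non-negative for $C > 0$, the inequality holds for every $(w, b)$ and therefore for the minima. This covers the optimal objective value only; it is not by itself a statement about test accuracy.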
6. Simulation Results
Our experimental datasets are real handwriting data. The first dataset, the pen-based
recognition of handwritten digits dataset [20], contains 250 samples from 44 different writers.
All input features are numerical. The class label of the dataset ranges from 0 to 9. The second
dataset is the letter recognition dataset, which contains capital letters in 20 different fonts.
Linear kernels were used; the optimal parameters were estimated by cross-validation. In
our experiments, the datasets are randomly partitioned into 10 sub datasets of approximately
equal size. We ensured that all sub datasets are balanced and that classes are uniformly
distributed. We fit the classifier function with 90% of the original dataset and then use this
classifier function to predict the classes of the remaining 10% test dataset. The
cross-validation process is repeated 10 times, with each part used once as the test samples.
We sum the errors on all 10 parts to calculate the overall error.
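The evaluation protocol (balanced 10-way partition, train on nine parts, test on the tenth, sum the errors) can be sketched as follows. Everything in this block is illustrative: the function names, the round-robin way of balancing classes and the one-dimensional midpoint classifier are assumptions made here, not the scripts used in the experiments.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with near-equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)   # deal each class round-robin
    return folds


def cross_validation_error(samples, labels, fit, predict, k=10, seed=0):
    """Sum misclassifications over the k held-out folds, divided by the sample count."""
    folds = stratified_folds(labels, k, seed)
    errors = 0
    for test_indices in folds:
        held_out = set(test_indices)
        train_indices = [i for i in range(len(labels)) if i not in held_out]
        model = fit([samples[i] for i in train_indices],
                    [labels[i] for i in train_indices])
        errors += sum(predict(model, samples[i]) != labels[i]
                      for i in test_indices)
    return errors / len(labels)


# Hypothetical 1-D classifier used only to exercise the loop.
def fit_midpoint(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def predict_midpoint(threshold, x):
    return 1 if x > threshold else -1


xs = [1.0 + 0.5 * i for i in range(20)] + [-1.0 - 0.5 * i for i in range(20)]
ys = [1] * 20 + [-1] * 20
print(cross_validation_error(xs, ys, fit_midpoint, predict_midpoint))  # 0.0
```

Dealing the shuffled indices of each class round-robin keeps every fold close to the overall class ratio, which mirrors the requirement that the sub datasets be balanced.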
6.1. Computation Time Comparison Between SVM and MapReduce Based SVM
In our experiments, we compared the single node SVM training algorithm with the
MapReduce based SVM training algorithm. We used the single node training model as the
baseline to compute the speedup. The speedup is the computation time of the single node
training model divided by the computation time with MapReduce. The results for different
numbers of nodes are shown in Table 2 and Table 3.
Table 2: Letter recognition dataset SVM training speedup using MapReduce with
different numbers of nodes.
Number of MapReduce jobs (nodes)    Speedup
1                                   1.00
2                                   3.39
4                                   4.45
6                                   4.76
8                                   5.97
10                                  6.42
Table 3: The pen-based recognition of handwritten digits dataset SVM training speedup
using MapReduce with different numbers of nodes.
Number of MapReduce jobs (nodes)    Speedup
1                                   1.00
2                                   2.72
4                                   4.39
6                                   4.56
8                                   6.46
10                                  7.78
With 10 nodes, the speedups on the two datasets are roughly 6x to 8x (6.42 and 7.78). The
speedups shown in Table 2 and Table 3 are averages over fifty runs.
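The speedup definition, and how far each configuration is from ideal linear scaling, can be checked with a short script. The speedup values below are copied from Table 2 and Table 3; the function names are hypothetical, and parallel efficiency (speedup divided by node count) is a derived quantity added here for illustration that the paper does not report.

```python
def speedup(single_node_seconds, mapreduce_seconds):
    """Baseline (single node) time divided by the MapReduce time."""
    return single_node_seconds / mapreduce_seconds


def parallel_efficiency(speedup_value, nodes):
    """Fraction of ideal linear scaling achieved with the given node count."""
    return speedup_value / nodes


# Speedups reported in Table 2 (letter) and Table 3 (pen digits).
letter_speedup = {1: 1.00, 2: 3.39, 4: 4.45, 6: 4.76, 8: 5.97, 10: 6.42}
pen_digit_speedup = {1: 1.00, 2: 2.72, 4: 4.39, 6: 4.56, 8: 6.46, 10: 7.78}

for nodes in sorted(letter_speedup):
    print(
        nodes,
        round(parallel_efficiency(letter_speedup[nodes], nodes), 3),
        round(parallel_efficiency(pen_digit_speedup[nodes], nodes), 3),
    )
```

An efficiency above 1 at two nodes means the reported speedup exceeds the node count, which can happen when training cost grows faster than linearly with the number of samples per node.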
6.2. Results with MapReduce Based SVM
Figure 3 shows the average test error (hinge loss) for each dataset. The figure shows the
improvement of the MapReduce based SVM at each iteration and its stability on large datasets.
Figure 4 shows the average number of SVs for each dataset; the number of SVs stabilizes as
the iterations of the MapReduce based SVM proceed.
Figure 3 Hinge loss values over iterations with two datasets.
Figure 4 Support vector sizes over iterations with two datasets.
To analyze our algorithm, we randomly distributed all the training data over a cloud
computing system of 10 computers running pseudo-distributed Hadoop. We developed Python
scripts for the distributed support vector machine algorithm with scikit, scipy, numpy, mrjob,
matplotlib and libsvm. The prediction performance (hinge loss) over the iterations is shown in
Table 4 and Table 5.
Table 4: Average, max. and min. value of hinge loss for the pen-based recognition of
handwritten digits dataset with 10-fold cross validation.
Iter. No    Loss (avg.)    Loss (max.)    Loss (min.)
1           0.02550        0.03605        0.01736
2           0.00961        0.01602        0.00401
3           0.00801        0.01335        0.00267
4           0.00694        0.01335        0.00134
5           0.00681        0.01335        0.00134
6           0.00654        0.01335        0.00134
7           0.00654        0.01335        0.00134
8           0.00641        0.01335        0.00134
9           0.00641        0.01335        0.00134
10          0.00641        0.01335        0.00134
Table 5: Average, max. and min. value of hinge loss for the letter recognition dataset with
10-fold cross validation.
Iter. No    Loss (avg.)    Loss (max.)    Loss (min.)
1           0.00925        0.01201        0.00600
2           0.00045        0.00150        0.00000
3           0.00005        0.00050        0.00000
4           0.00005        0.00050        0.00000
5           0.00005        0.00050        0.00000
6           0.00005        0.00050        0.00000
7           0.00005        0.00050        0.00000
8           0.00005        0.00050        0.00000
9           0.00005        0.00050        0.00000
10          0.00005        0.00050        0.00000
The total numbers of SVs are shown in Table 6. By about the fifth iteration, the test accuracy
on both datasets is at or near its highest value, which corresponds to the smallest hinge loss
of the empirical error. If the number of iterations is increased further, the test accuracy
settles into a steady state and no longer changes for a large enough number of iterations.
Table 6: Average number of support vectors for the pen-based recognition of handwritten
digits and letter recognition datasets with 10-fold cross validation.
Iter. No    Pen digit    Letter recognition
1           1068.7       186.9
2           2147.6       314.9
3           2837.7       418.2
4           2981.1       487.6
5           3003.8       520.4
6           2995.8       541.0
7           2996.7       550.1
8           2996.5       553.8
9           2997.5       556.9
10          3001.0       558.2