A MapReduce based distributed SVM algorithm for binary classification
ABSTRACT Although the Support Vector Machine (SVM) algorithm generalizes well to unseen examples after training and attains a small loss value, it is not well suited to large-scale real-life classification and regression problems: SVMs cannot be trained on datasets with hundreds of thousands of examples. In previous studies on distributed machine learning algorithms, SVMs were trained on costly, preconfigured computing environments. In this research, we present a MapReduce based distributed parallel SVM training algorithm for binary classification problems. This work shows how to distribute the optimization problem over cloud computing systems with the MapReduce technique. In the second step of this work, we use statistical learning theory to find the predictive hypothesis that minimizes the empirical risk over the hypothesis spaces created by the reduce function of MapReduce. The results of this research are important for training SVM based classifiers on big datasets. We show that, with iterative training on the split dataset using MapReduce, the accuracy of the classifier function converges to the accuracy of the globally optimal classifier function in a finite number of iterations. The performance of the algorithm was measured on samples from the letter recognition and pen-based recognition of handwritten digits datasets.
Ferhat Özgür Çatak1, Mehmet Erdal Balaban2
1National Research Institute of Electronics and Cryptology, TUBITAK, Turkey,
Tel: 0-262-6481070, e-mail: firstname.lastname@example.org
2Faculty of Business Administration, Quantitative Methods, Istanbul University, Turkey,
Tel: 0-212-4400000, e-mail: email@example.com
Key Words: Support Vector Machine, Machine Learning, Cloud Computing, MapReduce, Large Scale Dataset
1. Introduction
Most machine learning algorithms suffer from the computational complexity of the training
phase on large scale datasets, and applying classification algorithms to such datasets is
computationally expensive. The computation time and storage space of the Support Vector
Machine (SVM) algorithm are largely determined by the large scale kernel matrix. In practice,
computational complexity and computation time are always limiting factors for machine
learning. To overcome this complexity problem, researchers have developed several techniques:
feature selection, feature extraction, and distributed computing.
Feature selection methods build a machine learning model from a reduced number of features.
Feature selection is a basic approach for reducing the size of the feature vector. A new
feature subset is obtained with algorithms such as information gain, correlation based
feature selection, the Gini index, and t-statistics. Feature selection addresses two main
problems. First, it reduces the number of features in the training set so that computing
resources such as memory and CPU are used effectively. Second, it removes noisy features
from the dataset in order to improve the performance of the classification algorithm.
Feature extraction methods are used to mitigate the curse of dimensionality, which refers to
the problems that arise as the dimensionality increases. In this approach, a high dimensional
feature space is transformed into a low dimensional feature space. Feature extraction
algorithms include Principal Component Analysis (PCA), Singular Value Decomposition (SVD),
and Independent Component Analysis (ICA).
The last approach to overcoming the large memory and computation requirements of training on
large scale datasets is chunking or distributed computing. Graf et al. proposed the cascade
SVM for very large scale classification problems. In this method, the dataset is split into
parts in feature space. The non-support vectors of each sub dataset are filtered out and only
the support vectors are transmitted. The margin optimization process then uses only the
combined sub datasets to find the support vectors. Collobert et al. proposed a parallel SVM
training and classification algorithm in which an SVM is trained on each subset of the
dataset and the resulting classifiers are combined into a single final classifier function.
Lu et al. proposed a distributed support vector machine algorithm based on a strongly
connected network. In this method, the dataset is split into roughly equal parts, one for
each computer in the network, and support vectors are then exchanged among these computers.
Ruping et al. proposed an incremental learning method with SVM. Syed et al. proposed another
incremental learning method, in which a fusion center collects all support vectors from the
distributed computers. Caragea et al. built on this method; in their algorithm, the fusion
center iteratively sends the support vectors back to the computers. Sun et al. proposed a
method for parallelized SVM based on the MapReduce technique. Their method follows the
cascade SVM model and is built on Twister, an iterative MapReduce model, which differs from
our Hadoop based MapReduce implementation. As in the cascade SVM, they use only the support
vectors of a sub dataset to find an optimal classifier function. Another difference from our
approach is that they apply feature selection with the correlation coefficient method to
reduce the number of features in the datasets before training the SVM, in order to improve
the training time.
In our previous research, we developed a novel approach to MapReduce based SVM training for
binary classification problems. We used several UCI datasets to demonstrate the
generalization property of our algorithm.
In this paper, we propose a novel approach and a formal analysis of the models generated by
the MapReduce based binary SVM training method. We distribute the whole training dataset
over the data nodes of a cloud computing system. At each node, a subset of the training
dataset is used to train a binary classifier function. The algorithm collects the support
vectors (SVs) from every node in the cloud computing system and merges them into the set of
global SVs. We evaluate our algorithm on the letter recognition and pen-based recognition of
handwritten digits datasets with Hadoop streaming, using the MrJob Python library. Our
algorithm is built on LibSVM and implemented using the Hadoop implementation of MapReduce.
The organization of this article is as follows. In the next section, we give an overview of
the SVM formulation. In Section 3, we present the MapReduce pattern in detail. Section 4
explains the system model and our implementation of the MapReduce pattern for SVM training.
In Section 5, the convergence of our algorithm is explained. In Section 6, simulation
results on the letter recognition and pen-based recognition of handwritten digits datasets
are shown. Finally, we give concluding remarks in Section 7.
2. Support Vector Machine
In the machine learning field, SVM is a supervised learning algorithm for classification and
regression problems, depending on the type of output. SVM uses statistical learning theory
to maximize the generalization property of the generated classifier model, and it avoids
overfitting to the training dataset. Statistical learning theory generalizes the quality of
fitting the training data (empirical error). The empirical risk is

$$R_{emp}(h) = \frac{1}{m} \sum_{i=1}^{m} L\big(h(x_i), y_i\big),$$

which is the average loss of the chosen estimator $h$ over the training set
$D = \{(x_i, y_i)\}_{i=1}^{m}$. SVM uses a set of training data and predicts, for each given
input, which of two possible classes it belongs to. As shown in Figure 1, the hyperplane is
defined by $w \cdot x + b = 0$, where $w$ is a vector orthogonal to the hyperplane and $b$
is the bias. The training data $D$ is a set of points of the form

$$D = \{(x_i, y_i) \mid x_i \in \mathbb{R}^{d},\ y_i \in \{-1, 1\}\}_{i=1}^{m} \tag{1}$$

where $x_i$ is a $d$-dimensional real vector and $y_i$ is the class of the input vector
$x_i$, either -1 or 1.
Figure 1 Classification of an SVM with maximum-margin hyperplane trained with
samples from two classes.
SVMs search for a hyperplane that maximizes the margin between the two classes of samples in
$D$ with the smallest empirical risk. For the generalization property of SVM, two parallel
hyperplanes are defined such that $w \cdot x + b = 1$ and $w \cdot x + b = -1$. These two
conditions can be combined into a single constraint:

$$y_i (w \cdot x_i + b) \ge 1 \tag{2}$$

SVM aims to maximize the distance between these two hyperplanes, which is $2 / \|w\|$.
Training an SVM in the non-separable case is solved as the quadratic optimization problem
shown in Equation (3).

$$\min_{w, b, \xi}\ \frac{1}{2}\|w\|^{2} + C \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad y_i\big(\langle w, \phi(x_i)\rangle + b\big) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \tag{3}$$

for $i = 1, \dots, m$, where $\xi_i$ are slack variables and $C$ is the cost of each slack.
$C$ is a control parameter that trades off margin maximization against empirical risk
minimization. The decision function of SVM is $f(x) = w^{T}\phi(x) + b$, where $w$ and $b$
are obtained from the optimization problem in Equation (3). By using Lagrange multipliers,
the optimization problem in Equation (3) can be expressed as

$$\min_{\alpha}\ \frac{1}{2}\alpha^{T} Q \alpha - \mathbf{1}^{T}\alpha \quad \text{subject to} \quad 0 \le \alpha_i \le C,\ \ y^{T}\alpha = 0 \tag{4}$$

where $\alpha$ is the vector of Lagrange multipliers and
$[Q]_{ij} = y_i y_j \phi(x_i)^{T}\phi(x_j)$. It is not necessary to know the function $\phi$
itself, only how to compute the modified inner product, which is called the kernel function
and written $K(x_i, x_j) = \phi(x_i)^{T}\phi(x_j)$. Thus,
$[Q]_{ij} = y_i y_j K(x_i, x_j)$.
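The quantities defined in this section can be illustrated with a minimal Python sketch. It assumes a linear kernel (so that $\phi$ is the identity map), and the two toy samples and the hyperplane $w$, $b$ are made up for illustration; they are not taken from the experiments.

```python
import math

def linear_kernel(a, b):
    """K(a, b) = <a, b>, i.e. phi is assumed to be the identity map."""
    return sum(ai * bi for ai, bi in zip(a, b))

def decision(w, b, x):
    """Decision function f(x) = w . x + b for a linear kernel."""
    return linear_kernel(w, x) + b

def margin_width(w):
    """Distance 2 / ||w|| between the hyperplanes w.x + b = 1 and w.x + b = -1."""
    return 2.0 / math.sqrt(linear_kernel(w, w))

def q_matrix(X, y, kernel=linear_kernel):
    """[Q]_ij = y_i * y_j * K(x_i, x_j), the matrix of the dual problem."""
    n = len(X)
    return [[y[i] * y[j] * kernel(X[i], X[j]) for j in range(n)] for i in range(n)]

# Toy data (hypothetical): one sample per class.
X = [(2.0, 1.0), (0.0, 1.0)]
y = [1, -1]
w, b = (1.0, 0.0), -1.0  # hand-picked separating hyperplane x_1 = 1

# y_i * f(x_i) for each sample; a value of 1 means the sample lies on the margin.
margins = [yi * decision(w, b, xi) for xi, yi in zip(X, y)]
Q = q_matrix(X, y)
```

Both toy samples satisfy constraint (2) with equality, so under this hand-picked hyperplane they would both be support vectors.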
3. Map Reduce Model
MapReduce is a programming model derived from the combination of the map and reduce
functions of functional programming. The MapReduce model is widely used to run parallel
applications that process large scale datasets. MapReduce uses the key/value pair data type
in its map and reduce functions. An overview of the MapReduce system is shown in Figure 2.
Figure 2 Overview of MapReduce System
The MapReduce pattern is divided into two functions, map and reduce, which are separated by
a shuffle step over the intermediate key/value data. The MapReduce framework executes these
functions in parallel over any number of computers.
Put simply, a MapReduce job executes three basic operations on a dataset distributed across
many shared-nothing cluster nodes. The first is the Map function, which each node executes
in parallel without exchanging any data with other nodes. In the next operation, the data
processed by the Map function is repartitioned across all nodes of the cluster. Lastly, the
Reduce task is executed in parallel by each node on its partition of the data.
A file in the distributed file system (DFS) is split into multiple chunks, and each chunk is
stored on a different data node. The input of a map function is a key/value pair from the
input chunks of the dataset, and its output is a list of key/value pairs:

$$\mathrm{map}(k_1, v_1) \rightarrow \mathrm{list}(k_2, v_2)$$

A reduce function takes a key and its list of values as input, and generates a list of new
values as output:

$$\mathrm{reduce}(k_2, \mathrm{list}(v_2)) \rightarrow \mathrm{list}(v_3)$$
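The map, shuffle and reduce operations described in this section can be mimicked in a single process. The following is a minimal sketch, not Hadoop: `run_mapreduce` is a hypothetical helper, and word counting is used only as a stand-in workload to show the key/value signatures.

```python
from collections import defaultdict

def map_fn(key, value):
    """map(k1, v1) -> list(k2, v2): emit (word, 1) for each word in a line."""
    return [(word.lower(), 1) for word in value.split()]

def reduce_fn(key, values):
    """reduce(k2, list(v2)) -> list(v3): sum the counts of one word."""
    return [sum(values)]

def run_mapreduce(records, map_fn, reduce_fn):
    """In-memory imitation of a MapReduce job (hypothetical helper)."""
    # Map phase: every input record is processed independently of the others.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle phase: group the intermediate values by key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase: every key group is reduced independently of the others.
    return {k2: reduce_fn(k2, v2_list) for k2, v2_list in groups.items()}

records = [(0, "map reduce map"), (1, "reduce shuffle map")]
counts = run_mapreduce(records, map_fn, reduce_fn)
```

On a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; the sketch only preserves the data flow.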
4. System Model
The cloud computing based binary class support vector machine algorithm works as follows.
The training set is split into subsets. Each node within the cloud computing system trains
an SVM on its sub dataset locally and obtains the α values (i.e., the support vectors
(SVs)), and then passes the calculated SVs on to be merged into the global SVs. In the Map
stage of the MapReduce job, the subset of the training set is combined with the global
support vectors. In the Reduce step, the merged subset of training data is evaluated, and
the resulting new support vectors are combined with the global support vectors. In other
words, each node in the cloud computing system first reads the global support vector set,
then merges it with its local subset of the training dataset and trains a classifier with
the SVM algorithm. Finally, the SV sets computed on all cloud nodes are merged, and the
algorithm updates the global SV set with the new ones. Our notation is given in Table 1,
and the algorithm consists of the steps listed after it.
Table 1: The notation used in our work.
Notation         Description
$n$              Number of computers (or MapReduce function size)
$h^{t}$          Best hypothesis at iteration $t$
$D_l$            Sub dataset at computer $l$
$SV_l$           Support vectors at computer $l$
$SV_{Global}$    Global support vectors
1. Initialization: set $t = 0$ and the global support vector set $SV_{Global}^{t} = \emptyset$.
2. $t = t + 1$.
3. Each computer $l$, $l = 1, \dots, n$, reads the global SVs and merges them with its subset
of the training data.
4. Train the SVM algorithm on the merged dataset.
5. Find the support vectors $SV_l$.
6. After all computers in the cloud system complete their training phase, merge all
calculated SVs and save the result as the global SVs.
7. If $R_{emp}(h^{t}) = R_{emp}(h^{t-1})$, stop; otherwise go to step 2.
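Algorithm 1 Map Function of Binary SVM Algorithm
$SV_{Global} = \emptyset$ // Empty global support vector set
for $l = 1, \dots, n$ do // For each subset loop
    $D_l \leftarrow D_l \cup SV_{Global}$ // Merge the subset with the global support vectors
end for
Algorithm 2 Reduce Function of Binary SVM Algorithm
for $l = 1, \dots, n$ do
    $SV_l, h^{t,l} \leftarrow \mathrm{svm}(D_l)$ // Train merged dataset to obtain support
                                                 // vectors and binary-class hypothesis
end for
$SV_{Global} \leftarrow \bigcup_{l} SV_l$ // Merge all support vectors into the global set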
Pseudo code of the Map and Reduce functions of our algorithm is given in Algorithm 1 and
Algorithm 2.
For training the SVM classifier functions, we used LibSVM with various kernels. Appropriate
parameter values were found by cross-validation, using the 10-fold cross-validation method.
The whole system is implemented with Hadoop streaming and the Python package mrjob.
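The iteration described in this section can be sketched in a single Python process. This is a sketch under stated assumptions, not the Hadoop/LibSVM implementation: `train_linear_svm` is a stand-in trainer (batch subgradient descent on the soft-margin objective with a linear kernel), support vectors are approximated as the samples on or inside the margin, all function names are hypothetical, and the toy data is made up.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train_linear_svm(data, C=1.0, epochs=200, lr=0.01):
    """Stand-in trainer (NOT LibSVM): batch subgradient descent on
    (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + b))."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0        # gradient of (1/2)||w||^2 is w
        for x, y in data:
            if y * (dot(w, x) + b) < 1:      # margin violated: hinge subgradient
                grad_w = [g - C * y * xi for g, xi in zip(grad_w, x)]
                grad_b -= C * y
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def support_vectors(data, w, b, tol=1e-6):
    """Approximation: samples on or inside the margin, y * f(x) <= 1."""
    return {(x, y) for x, y in data if y * (dot(w, x) + b) <= 1 + tol}

def hinge_risk(data, w, b):
    return sum(max(0.0, 1 - y * (dot(w, x) + b)) for x, y in data) / len(data)

def distributed_svm(subsets, max_iter=10, eps=1e-9):
    all_data = [s for part in subsets for s in part]
    global_svs, prev_risk = set(), None                  # step 1
    for t in range(1, max_iter + 1):                     # step 2
        # Map: every node merges its subset with the global SVs (step 3).
        merged = [sorted(set(part) | global_svs) for part in subsets]
        # Reduce: train on each merged subset and collect its SVs (steps 4-5).
        models = [train_linear_svm(d) for d in merged]
        new_svs = set()
        for d, (w, b) in zip(merged, models):
            new_svs |= support_vectors(d, w, b)
        global_svs = new_svs                             # step 6
        # Best hypothesis of this iteration: lowest empirical risk on all data.
        risk, best = min(((hinge_risk(all_data, w, b), (w, b)) for w, b in models),
                         key=lambda pair: pair[0])
        if prev_risk is not None and abs(risk - prev_risk) <= eps:  # step 7
            break
        prev_risk = risk
    return best, global_svs, t

# Two "nodes", each holding a linearly separable toy subset (hypothetical data).
subsets = [
    [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((-2.0, -2.0), -1), ((-3.0, -3.0), -1)],
    [((2.0, 3.0), 1), ((3.0, 2.0), 1), ((-2.0, -3.0), -1), ((-3.0, -2.0), -1)],
]
(w, b), global_svs, iterations = distributed_svm(subsets)
accuracy = sum((dot(w, x) + b > 0) == (y > 0) for part in subsets for x, y in part) / 8
```

Because the stand-in trainer only approximates the optimum, the stopping test on equal empirical risk may not fire exactly, which is why the sketch also caps the number of iterations with `max_iter`.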
5. Convergence of The Algorithm with Statistical Learning Theory
Let $S$ denote a subset of the training dataset $D$, $F(S)$ the optimal classifier function
over dataset $S$, $h^{*}$ the global optimal hypothesis, which has minimal empirical risk
$R_{emp}(h)$ over dataset $D$, and $V$ the vector space of all possible outputs over sub
dataset $S$. The aim of our algorithm is to find a classifier function $f$ whose empirical
risk $R_{emp}(f)$ approaches $R_{emp}(h^{*})$. Let $H$ be the hypothesis space of functions
$h$. Our algorithm starts with $SV_{Global}^{0} = \emptyset$ and generates a non-increasing
sequence of positive sets of vectors $SV_{Global}^{t}$, where $SV_{Global}^{t}$ is the
vector of support vectors at the $t$-th iteration. We used the hinge loss to test the models
trained with our algorithm. The hinge loss suits the SVM as a classifier, since the more a
sample violates the margin, the higher the penalty. The hinge loss function is

$$L\big(f(x), y\big) = \max\big(0,\ 1 - y f(x)\big)$$

The empirical risk can be computed with an approximation:

$$R_{emp}(h) = \frac{1}{m}\sum_{i=1}^{m} L\big(h(x_i), y_i\big)$$
According to the empirical risk minimization principle, the binary class learning algorithm
should choose the hypothesis $\hat{h}$ in the hypothesis space $H$ that minimizes the
empirical risk:

$$\hat{h} = \arg\min_{h \in H} R_{emp}(h)$$

A hypothesis is found at every cloud node. Let $D_l$ be the subset of training data at cloud
node $l$, $SV_{Global}^{t}$ the vector of support vectors at the $t$-th iteration, and
$h^{t,l}$ the hypothesis at node $l$ in iteration $t$.
The stopping point of the algorithm is reached when the empirical risk of the hypothesis is
the same as in the previous iteration. That is:

$$R_{emp}(h^{t}) = R_{emp}(h^{t-1})$$
Lemma: The accuracy of the classifier function of our algorithm at iteration $t$ is always
greater than or equal to the maximum accuracy of the classifier function at iteration
$t-1$. That is,

$$R_{emp}(h^{t}) \le R_{emp}(h^{t-1})$$

Proof: Without loss of generality, the iterated MapReduce binary class SVM converges
monotonically to an optimum classifier. The global support vector set is

$$SV_{Global}^{t} = \bigcup_{l=1}^{n} SV_{l}^{t}$$

where $n$ is the dataset split size (or cloud node size). Then, the training set for the SVM
algorithm at node $l$ in iteration $t+1$ is

$$D_l \cup SV_{Global}^{t}$$

Adding more samples cannot decrease the optimal value. The generalization accuracy of the
sub problem at each node increases monotonically at each iteration step.
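The hinge loss, the empirical risk and the empirical risk minimization rule used in this section can be written down directly. The following Python sketch uses a finite list of candidate hypotheses in place of the hypothesis space; the two candidate functions and the three samples are made up for illustration.

```python
def hinge_loss(score, y):
    """Hinge loss max(0, 1 - y * f(x)) for a real-valued score f(x)."""
    return max(0.0, 1.0 - y * score)

def empirical_risk(h, samples):
    """Average hinge loss of hypothesis h over the (x, y) samples."""
    return sum(hinge_loss(h(x), y) for x, y in samples) / len(samples)

def erm(hypotheses, samples):
    """Empirical risk minimization over a finite set of hypotheses."""
    return min(hypotheses, key=lambda h: empirical_risk(h, samples))

def h_a(x):
    return x          # hypothetical candidate f(x) = x

def h_b(x):
    return 2.0 * x    # hypothetical candidate f(x) = 2x

# One-dimensional toy samples (hypothetical): (x, y) with y in {-1, 1}.
samples = [(2.0, 1), (0.5, 1), (-1.0, -1)]

risk_a = empirical_risk(h_a, samples)   # only x = 0.5 violates the margin
risk_b = empirical_risk(h_b, samples)   # no sample violates the margin
best = erm([h_a, h_b], samples)
```

Here `h_a` classifies every sample correctly but is still penalized for the sample at $x = 0.5$, which lies inside the margin; this is the behavior the hinge loss is chosen for.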
6. Simulation Results
Our experimental datasets are real handwriting data. The first dataset, the pen-based
recognition of handwritten digits dataset, contains 250 samples from 44 different writers.
All input features are numerical. The class label of the dataset ranges from 0 to 9. The
second dataset is the letter recognition dataset, which contains capital letters in 20
different fonts.
Linear kernels were used, with parameters estimated by cross-validation. In our experiments,
the datasets are randomly partitioned into 10 sub datasets of approximately equal size. We
ensured that all sub datasets are balanced and that classes are uniformly distributed. We
fit the classifier function on 90% of the original dataset and then use it to predict the
classes of the remaining 10%, the test dataset. The cross-validation process is repeated 10
times, with each part used once as the test samples. We sum the errors on all 10 parts to
calculate the overall error.
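The 10-fold procedure described above can be sketched as follows. This is a simplified illustration: the helper names are hypothetical, the split is a plain random partition, and the class balancing mentioned in the text is not reproduced. The `fit` and `count_errors` callables stand in for SVM training and prediction.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k parts of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

def cross_validated_error(n_samples, fit, count_errors, k=10):
    """Sum the test errors over all k parts and divide by the sample count."""
    total = 0
    for train, test in k_fold_indices(n_samples, k):
        model = fit(train)                  # train on roughly 90% of the data
        total += count_errors(model, test)  # test on the held-out 10%
    return total / n_samples

splits = list(k_fold_indices(100))

# Dummy callables (hypothetical): a model that makes one error per fold.
error_rate = cross_validated_error(100, lambda train: None, lambda model, test: 1)
```

With 100 samples each fold holds 10 test indices and 90 training indices, matching the 90%/10% split in the text.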
6.1. Computation Time Comparison Between SVM and MapReduce Based SVM
In our experiments, we compared the single node SVM training algorithm with the MapReduce
based SVM training algorithm, using the single node training model as the baseline for the
speedup. The speedup is the computation time of the single node training model divided by
the computation time with MapReduce. The results for different node sizes are shown in
Table 2 and Table 3.
Table 2: Letter recognition dataset SVM training speedup using MapReduce with
different node size.
Num. of MapReduce Job Speedup
Table 3: The pen-based recognition of handwriting digit dataset SVM training speedup
using MapReduce with different node size.
Num. of MapReduce Job Speedup
The speedups on both datasets range from 6x to 7x. The speedups shown in Table 2 and
Table 3 are averages over fifty runs.
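As a small worked example of the speedup computation, the sketch below takes speedup as single node time divided by MapReduce time and averages the per-run ratios. Averaging per-run ratios (rather than dividing the averaged times) is an assumption, and the timings are made-up placeholders, not measurements from this study.

```python
from statistics import mean

def speedup(single_node_seconds, mapreduce_seconds):
    """Speedup of the MapReduce run relative to the single node baseline."""
    return single_node_seconds / mapreduce_seconds

def average_speedup(runs):
    """Mean of the per-run speedups; runs is a list of (single, mapreduce) times."""
    return mean(speedup(single, mr) for single, mr in runs)

# Hypothetical timings in seconds: (single node, MapReduce).
runs = [(120.0, 20.0), (130.0, 20.0), (140.0, 20.0)]
avg = average_speedup(runs)
```

A value above 1 means the MapReduce based training finished faster than the single node baseline.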
6.2. Results with MapReduce Based SVM
Figure 3 shows the average test error for each dataset. The figure shows the improvement of
the MapReduce based SVM at each iteration and its stability on large datasets. Figure 4
shows the average number of SVs for each dataset. The figure shows that the number of SVs
of the MapReduce based SVM stabilizes over the iterations.
Figure 3 Hinge loss values over iterations with two datasets.
Figure 4 Support vector sizes over iterations with two datasets.
To analyze our algorithm, we randomly distributed all the training data over a cloud
computing system of 10 computers running pseudo-distributed Hadoop. We developed a Python
script for the distributed support vector machine algorithm with scikit, scipy, numpy,
mrjob, matplotlib and libsvm. Dataset prediction accuracies over the iterations are shown
in Table 4 and Table 5.
Table 4: Average, max. and min. value of hinge loss for the pen-based recognition of
handwriting digit dataset with 10 fold cross validation.
Table 5: Average, max. and min. value of hinge loss for the letter recognition dataset with
10 fold cross validation.
Iter. No Loss( )
The total numbers of SVs are shown in Table 6. When the number of iterations reaches 5, the
test accuracy on all datasets reaches its highest value, which corresponds to the smallest
value of the hinge loss. If the number of iterations is increased further, the test
accuracy settles into a steady state and no longer changes.
Table 6: Average support vectors size for pen-based recognition of handwriting digit and
letter recognition dataset with 10 fold cross validation.
Iter. No Pen digit.