
Distributed Machine Learning: A Review of current progress



This paper was part of coursework at Leeds University.
Karim Ouda
School of Computing, University of Leeds, LS2 9JT, United Kingdom
Abstract—The need to solve Machine Learning problems at scale
using the power of distributed computing is evident given the
growing amount of untapped data produced daily, from which many
insights can be discovered. Many solutions have emerged in the
last 5 years to help tackle this problem. This paper reviews the
current status and future direction of distributed machine
learning, followed by a performance comparison of two popular
products in this field, Mahout and Spark+MLlib.
Machine Learning is one of the long-established key research and
application fields in Computer Science, and it is rapidly becoming
part of our daily life: think of song and movie recommendation
systems, cell phone and web personalization, computer vision
and CCTV applications, and so forth. One of the main drivers of
the current boom in Machine Learning demand is the huge
amount of data produced since the Web 2.0 era. Facebook alone
was processing 500 TB of new data per day in 2012 [1], and it is
estimated that the Digital Universe will reach 44 zettabytes in
2020, a 50-fold growth since 2010 [2]. With such a pile of
untapped data at hand, companies need to make use of it by
finding patterns and insights, which in turn can lead to improved
business performance and a better understanding of users. This
is where Machine Learning comes to the rescue.
Machine Learning (ML) is a problem-solving technique in
which the machine (computer) learns from data automatically,
loosely mimicking the way the brain works. The process starts by
extracting Features (a machine-friendly representation of each
data item), then running one of several ML algorithms to
produce a learned Model (a representation of the collective
patterns and rules in the data), which can then be used to
predict, classify and perform other functions on new, unseen
data.
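The feature, model and prediction steps described above can be illustrated with a minimal sketch. It assumes a toy bag-of-words representation and a simple word-overlap rule; the function names and the sample sentences are hypothetical and are not taken from any of the systems reviewed in this paper.

```python
from collections import Counter

def extract_features(text):
    """Turn raw text into a bag-of-words feature vector (word -> count)."""
    return Counter(text.lower().split())

def train(examples):
    """Learn a model: one summed word-count profile per label."""
    model = {}
    for text, label in examples:
        model.setdefault(label, Counter()).update(extract_features(text))
    return model

def predict(model, text):
    """Pick the label whose profile shares the most word occurrences with the input."""
    features = extract_features(text)

    def overlap(label):
        profile = model[label]
        return sum(min(count, profile[word]) for word, count in features.items())

    return max(model, key=overlap)

# Hypothetical training data, for illustration only.
training_data = [
    ("great film loved it", "positive"),
    ("loved the great acting", "positive"),
    ("boring film hated it", "negative"),
    ("hated the boring plot", "negative"),
]
model = train(training_data)
prediction = predict(model, "loved the acting")
print(prediction)
```

Real ML algorithms replace the overlap rule with a statistical model, but the flow from features to a learned model to predictions on unseen data is the same.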
A practical real-life application of Machine Learning on
Big Data is recommendation systems for songs, videos and
movies. Companies like Netflix can earn more money by
recommending the right film to the right person based on that
person's history and the collective patterns of other users.
Another application is display advertising targeting: companies
like Google can gain more revenue if the ad system shows the
right ad to the right person, which also relies on patterns
learned from the huge amount of data held by such a company.
Companies trying to apply Machine Learning to Big Data
faced many difficulties. ML algorithms and tools are
processing- and memory-intensive and were not designed to
handle such amounts of data efficiently, which makes sense
because this rapid growth of data happened only in the last 10
years. On the other hand, following Moore's law, processing
power is doubling every 18 months, with more cores per
system. In addition, the cloud computing concept allows
commodity distributed machines to be used instead of big
supercomputers. Finally, the existence of distributed software
frameworks and parallel programming models like MapReduce
inspired academics and big companies to find ways to
parallelize Machine Learning tasks on multicore systems.
The next sections of this paper discuss approaches and
solutions to the problem explained above. The common
architecture is explained, and the results of a performance
comparison between Hadoop and Spark are presented, along with
the results of a proof-of-concept (PoC) clustering task that
was run on Spark+MLlib on a cloud platform. Finally, I
discuss future trends and directions.
I believe the first effective endeavor to tackle the lack of a
distributed framework for machine learning was a 2007 paper
from Stanford, co-authored by Andrew Ng, titled
“Map-reduce for machine learning on multicore” [3], which
was later implemented by the Apache community and is now
known as the Mahout project [4].
In that paper the authors showed that any algorithm that fits
the “Statistical Query Model” can be written in a
“summation form” that can be fed to a MapReduce-like
framework for processing. They claimed that the class of
algorithms that depend on statistics or gradients fits this
model, and that such calculations can be spread across cores
and aggregated at the end of processing. They experimented on
10 algorithms and showed speedups that grew linearly with the
number of cores.
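To make the “summation form” idea concrete, the sketch below fits a one-variable least-squares line. It is not code from [3]; the function names and the data are hypothetical. Each partition computes a handful of sums independently (the map step), the sums are added together (the reduce step), and the line is solved from the combined totals.

```python
from functools import reduce

def partition_stats(points):
    """Map step: sufficient statistics (n, sum x, sum y, sum x*x, sum x*y) for one partition."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Reduce step: statistics from two partitions simply add up."""
    return tuple(u + v for u, v in zip(a, b))

def solve(stats):
    """Closed-form least-squares slope and intercept from the combined sums."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical data lying on y = 2x + 1, split across three partitions (e.g. three cores).
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
    [(4.0, 9.0)],
]
stats = reduce(combine, map(partition_stats, partitions))
slope, intercept = solve(stats)
print(slope, intercept)
```

Because the combined sums are identical to the sums over the whole dataset, the result does not depend on how the data is partitioned, which is what allows the work to be spread across cores.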
The following is a review of the current solutions.
A. Mahout
Mahout is a scalable machine learning library built on top of
Hadoop [5], an open-source framework for distributed
computing. Mahout started in 2008 as part of the Apache Lucene
project with the goal of implementing the paper mentioned
above. The first version under the name Mahout was released in
April 2010 [6]. The library supports many ML algorithms [7],
such as:
Naive Bayes
Hidden Markov Models
Logistic Regression
Random Forest
Fuzzy k-Means
Streaming KMeans
Spectral Clustering
B. MLlib (MLBase)
MLbase is the product of a paper titled “MLbase: A
Distributed Machine-learning System” by Tim Kraska et al. in
2013 [8]. The aim of the paper was to create an optimized,
scalable ML framework for users (ML researchers) with only a
basic background in distributed systems. In addition, the
authors provided a declarative way to specify ML tasks, with an
optimizer that can auto-tune and dynamically choose the
learning algorithm.
As we can see in Illustration 1, MLbase aimed to provide the
user with a high-level ML language that abstracts away non-ML
issues such as handling distributed computing, or even which
algorithm to choose. In this example the user specifies the
training data (X) and labels (y); the system tries different
algorithms, trains with the best one, and returns the final
model (fn-model) and summary information to the user.
MLbase has been part of Apache Spark [9] since September
2013 [10] under the name MLlib. Apache Spark is a fast,
scalable and general engine for data processing; general in this
context means that it supports many data processing models,
such as graphs, streaming, machine learning and SQL.
Unfortunately, the current version (1.3.0) shipped with Spark
does not include the ML language and algorithm optimization
parts, only support for the ML algorithms listed below [11]:
Linear SVM and logistic regression
Classification and regression tree
K-means clustering
Recommendation via alternating least squares
Singular value decomposition
Linear regression with L1- and L2-regularization
Multinomial naive Bayes
Basic statistics
Feature transformations
C. GraphLab
GraphLab [12] is a Python library for graph processing as
well as distributed machine learning. It provides integration
with Spark/Hadoop and the ability to auto-tune parameters.
The product originated as research work at CMU in 2012 [13].
D. SystemML
A system proposed by researchers at IBM to provide a
Declarative Machine Learning (DML) language that is
transformed into optimized MapReduce jobs to be executed
in a distributed environment [14].
E. Parameter Server
An open-source project [15] by researchers from CMU,
Google and Baidu [16] which aims to provide a flexible, scalable
distributed machine learning framework in which all nodes
share global parameters.
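The idea of nodes sharing global parameters can be sketched as a single-process simulation. This is only an illustration under simplifying assumptions: the class, method and variable names are hypothetical and are not the API of the project in [15], and workers here take turns in a loop, whereas a real deployment runs them on separate machines that communicate over the network.

```python
class ParameterServer:
    """Holds the global parameters shared by all workers."""

    def __init__(self, params):
        self.params = list(params)

    def pull(self):
        """Worker fetches a copy of the current global parameters."""
        return list(self.params)

    def push(self, gradient, learning_rate):
        """Worker sends a gradient; the server applies it to the global parameters."""
        self.params = [p - learning_rate * g for p, g in zip(self.params, gradient)]

def worker_gradient(params, shard):
    """Mean squared-error gradient of y ~ w*x + b on one worker's data shard."""
    w, b = params
    gw = gb = 0.0
    for x, y in shard:
        err = (w * x + b) - y
        gw += 2 * err * x
        gb += 2 * err
    return [gw / len(shard), gb / len(shard)]

# Hypothetical data on y = 2x + 1, split into one shard per worker.
shards = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
server = ParameterServer([0.0, 0.0])
for step in range(2000):
    for shard in shards:  # each iteration plays the role of one worker
        server.push(worker_gradient(server.pull(), shard), learning_rate=0.05)
w, b = server.params
print(round(w, 4), round(b, 4))
```

No worker ever sees another worker's data; they only exchange parameters and gradients through the server, which is what lets the model be trained on data spread across many nodes.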
F. Sibyl
Google's distributed machine learning system [17] utilizing
MapReduce and Google GFS and other proprietary
G. Other Systems
In addition to the ones listed above, there are many other
relevant systems such as YahooLDA [18] for topic modeling,
Petuum [19], DistBelief [20], a framework for distributed Deep
Learning, and Google Predict [21], a commercial PaaS service
for ML tasks.
Illustration 1: MLBase Highlevel ML Language example [8]
Since most Distributed Machine Learning solutions are
built on top of MapReduce or MapReduce-like systems, I
will start by showing the general MapReduce concept and
architecture.
A. Apache Hadoop
As shown in Illustration 2, a typical distributed
processing system requires a distributed filesystem (DFS),
which stores the data and replicates it across machines so that
it can be processed close to where it is stored and so that
failures can be tolerated.
The NameNode holds a centralized index of the distributed
filesystem, while the DataNodes store and provide access to
the actual files.
C. MapReduce
The MapReduce processing engine uses the DFS to process
data in an efficient and reliable manner. The JobTracker
receives MapReduce jobs and distributes tasks (e.g. a Map task
for a subset of the data) to TaskTrackers according to the
location of the data in the cluster.
MapReduce is based on the fact that many computations
can be reduced to simple mapping and/or arithmetic operations,
and can therefore be split across different machines and
aggregated afterwards. This model can also be applied to
Machine Learning.
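The map, group and aggregate flow described above can be sketched in a few lines. This is a single-machine illustration of the programming model only, not Hadoop code; `map_reduce`, `length_mapper` and `sum_reducer` are hypothetical names, and the example counts words by their length.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run mapper on every record, group emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:                    # map phase (parallel per record in a real cluster)
        for key, value in mapper(record):
            groups[key].append(value)         # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}  # reduce phase

def length_mapper(line):
    """Emit (word length, 1) for every word in the line."""
    for word in line.split():
        yield len(word), 1

def sum_reducer(key, values):
    """Aggregate all values emitted for one key."""
    return sum(values)

lines = ["to be or not to be", "that is the question"]
counts_by_length = map_reduce(lines, length_mapper, sum_reducer)
print(counts_by_length)
```

In a real cluster the map calls run on the machines that hold each block of data and the grouped values are moved over the network to the reducers; the user still only writes the mapper and the reducer.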
B. MapReduce & Machine Learning
An example application of MapReduce in Machine
Learning is the execution of the k-means algorithm. The list of
points is split into partitions, and each partition is sent to a
different machine for processing. In a map operation, the
distance between each point in the partition and every centroid
is calculated, and the point is assigned to the nearest centroid.
Once all partitions are processed and each point is assigned to
a centroid, a reduce function is applied to calculate the new
position of each centroid; this can also be done on a separate
machine for each centroid, as shown in Illustration 3.
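The k-means iteration described above can be sketched as follows. This is a single-process illustration under stated assumptions, not the Mahout or MLlib implementation: the function names, the sample points and the initial centroids are hypothetical, and the loops over partitions stand in for work that would run on separate machines.

```python
import math
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def map_assign(points, centroids):
    """Map: emit (centroid index, point) for every point in one partition."""
    return [(nearest(p, centroids), p) for p in points]

def reduce_centroid(points):
    """Reduce: the new centroid position is the mean of the points assigned to it."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans_step(partitions, centroids):
    """One k-means iteration expressed as a map phase followed by a reduce phase."""
    assigned = defaultdict(list)
    for part in partitions:                   # each partition could run on its own machine
        for idx, p in map_assign(part, centroids):
            assigned[idx].append(p)           # shuffle by centroid index
    # A centroid that received no points keeps its old position.
    return [reduce_centroid(assigned[i]) if i in assigned else centroids[i]
            for i in range(len(centroids))]

# Hypothetical 2-D points in two partitions, with k = 2 initial centroids.
partitions = [[(1.0, 1.0), (9.0, 9.0)], [(2.0, 2.0), (8.0, 8.0)]]
centroids = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(10):
    centroids = kmeans_step(partitions, centroids)
print(centroids)
```

Once the centroids are learned, `nearest(new_point, centroids)` assigns a previously unseen point to a cluster. Emitting every point from the map phase keeps the sketch close to the description; sending only per-partition sums and counts to the reducers would be a natural optimization.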
C. Apache Spark
The Spark architecture [24] is simple and flexible. Each
application (a Job in Hadoop terminology) has its own executor
processes, and Spark can run on top of Hadoop or on its own
cluster, sending data and launching new executors on slave
nodes (Worker nodes), as shown in Illustration 4.
A. Cluster Setup
Two virtual machines, provisioned by the OpenNebula cloud
manager, were used to set up a simple cluster for the experiment.
Both machines run Ubuntu 12.04.1 (server edition) x86_64 with
kernel version 3.2.0-69-generic. The Master machine has 6 GB
of memory, 4 cores and 12 VCPUs; the Slave machine has
4 GB of memory, 4 cores and 10 VCPUs.
Illustration 2: Hadoop Architecture [22]
Illustration 3: MapReduce Example for Machine Learning - KMeans [23]
Illustration 4: Spark Architecture [34]
Software versions:
Apache Spark 1.3.0 [27]
Mahout 0.9 [28]
Hadoop 2.6.0 [29]
B. DataSets
Dataset #1 is a corpus of labeled tweets used for sentiment
analysis by Sentiment140 [25]. There are 2 files in the dataset;
“training.1600000.processed.noemoticon.csv” is the file used
for testing MapReduce efficiency on word counting. It is
288 MB in size and contains 1,600,000 rows.
Dataset #2 is a bag-of-words set provided by the University of
California, Irvine [26]. The “NYTimes news articles” collection
was chosen for the machine learning clustering experiment; the
file docword.nytimes.txt was reduced to 680 MB and 50,000,000
records before being used for clustering.
C. Simple MapReduce comparison between Hadoop/Spark
To test the efficiency of both systems, Dataset #1 was
passed to the Hadoop and Spark clusters to count all words in
all tweets and sort them in descending order. For Hadoop, a
modified version of the [30] example was used:
the map function was changed to parse only the last column of
the dataset, which contains the tweet text, and an additional
map function and a new job were added to sort the final counts
by value instead of by key. For the Spark experiment, a modified
version of the [31] example was used, with
additional mapping/sorting and an additional line of
code to save the output to the filesystem.
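The logic of this job (take the tweet text from the last CSV column, count the words, then sort by count in descending order) can be sketched on a single machine as below. This is not the Hadoop or Spark code used in the experiment; the three sample rows are hypothetical and only assume that the tweet text is the last column, as described above.

```python
import csv
import io
from collections import Counter

# Hypothetical rows; only the position of the tweet text (last column) matters here.
sample_csv = io.StringIO(
    '"0","1","Mon Apr 06","NO_QUERY","user_a","good morning world"\n'
    '"4","2","Mon Apr 06","NO_QUERY","user_b","good night"\n'
    '"4","3","Mon Apr 06","NO_QUERY","user_c","good morning"\n'
)

counts = Counter()
for row in csv.reader(sample_csv):
    counts.update(row[-1].split())            # only the last column holds the tweet text

# Sort by count (descending); ties are broken alphabetically for a stable result.
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
for word, count in ranked:
    print(word, count)
```

On a cluster the counting step is a MapReduce job keyed by word, and the sort by value needs a second pass because the framework sorts by key, which is why an extra map function and job were needed on Hadoop.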
D. Spark + MLLib Distributed Machine Learning Experiment
To demonstrate the concept of Distributed Machine
Learning, an experiment was conducted to cluster a large data
file (Dataset #2) on the Spark/MLlib cluster. A modified version
of [32] was used: some changes were made
so that the file works with the new MLlib version, and an
additional prediction function was added to predict the cluster
of a new point based on the learned model.
The following command and settings were used to run the
experiment, with number of clusters k = 2 and max_iterations = 10:
~/spark-1.3.0-bin-hadoop2.4/bin/spark-submit \
  --master spark://hadoopmaster:7077 --driver-memory 5G \
  --class MyJavaKMeansNYDS my-mllib-kmeans-nyds.jar \
  ~/spark-1.3.0-bin-hadoop2.4/data/mllib/docword.nytimes.splitaa2 2 10
In all experiments Spark showed better performance than
its Hadoop counterpart; in the word-count comparison it was
about 3.3 times faster (66 versus 20 seconds). The final results
for each experiment are given below.
A. Simple MapReduce comparison between Hadoop/Spark
Hadoop/Spark MapReduce performance
        Hadoop        Spark
Time    66 seconds    20 seconds
B. Spark + MLLib Distributed Machine Learning Experiment
Spark/MLlib Clustering Performance
Clustering time 10 minutes
As the amount of data continues to increase over time,
current solutions are moving towards more optimized code,
better architectures, auto-tuning, improved scheduling and
distributed communication, and better compression
techniques. Some researchers are also working on introducing
high-level abstract languages [14][33] and constructs for
Machine Learning, to make machine learning applications
simpler for end users.
Machine learning is the present and the future of problem
solving in computing. Along with the increasing trends in data
and processing power, Distributed Machine Learning solutions
are evolving to cater for the needs and challenges of both the
scientific and business worlds. In this paper I discussed the
concept of Distributed Machine Learning and the problem it
solves, reviewed the current solutions, presented the results
of the practical experiments conducted, and gave a glimpse of
the future direction.
[1] CNet, “Facebook processes more than 500 TB of data daily”
[2] EMC, “New Digital Universe Study Reveals Big Data Gap”
[3] Chu, Cheng, et al. "Map-reduce for machine learning on multicore."
Advances in neural information processing systems 19 (2007): 281.
[4] Apache Mahout: Scalable machine learning and data mining
https:// mahout
[5] What Is Apache Hadoop?
[6] Where can I find the origins of the Mahout project?
[7] Mahout Algorithms
[8] Kraska, Tim, et al. "MLbase: A Distributed Machine-learning System."
CIDR. 2013.
[9] Apache Spark
[10] Ameet Talwalkar, “MLlib: Spark’s Machine Learning Library”,
Presentation https://databricks-
[11] Apache Spark MLLib
[12] GraphLab
[13] Low, Yucheng, et al. "Distributed GraphLab: a framework for machine
learning and data mining in the cloud." Proceedings of the VLDB
Endowment 5.8 (2012): 716-727.
[14] Ghoting, Amol, et al. "SystemML: Declarative machine learning on
MapReduce." Data Engineering (ICDE), 2011 IEEE 27th International
Conference on. IEEE, 2011.
[15] Parameter Server
[16] Li, Mu, et al. "Scaling distributed machine learning with the parameter
server." Operating Systems Design and Implementation (OSDI). 2014.
[17] Chandra, Tushar, et al. "Sibyl: a system for large scale machine
learning." Keynote I PowerPoint presentation, Jul 28 (2010).
[18] YahooLDA
[19] Petuum
[20] Dean, Jeffrey, et al. "Large scale distributed deep networks." Advances
in Neural Information Processing Systems. 2012.
[21] Google Predict
[22] Hadoop
[23] Designing algorithms for Map Reduce
[24] Hadoop multi node cluster setup
[25] Tweets Dataset
[26] Bag of Words Dataset
[27] Spark Download
[28] Mahout Download
[29] Hadoop Download
[30] Hadoop WordCount example
[31] Spark JavaWordCount example file
[32] MLLib JavaKMeans example file
[33] Sparks, Evan R., et al. "MLI: An API for distributed machine learning."
Data Mining (ICDM), 2013 IEEE 13th International Conference on.
IEEE, 2013.
[34] Spark Architecture