
ADMM based Scalable Machine Learning on Spark

Sauptik Dhar
Research and Technology Center, Robert Bosch LLC
Palo Alto, CA 94304, USA
sauptik.dhar@us.bosch.com

Congrui Yi
Department of Statistics and Actuarial Science, University of Iowa
Iowa City, IA 52242, USA
congrui-yi@uiowa.edu

Naveen Ramakrishnan
Research and Technology Center, Robert Bosch LLC
Palo Alto, CA 94304, USA
naveen.ramakrishnan@us.bosch.com

Mohak Shah
Research and Technology Center, Robert Bosch LLC
Palo Alto, CA 94304, USA
mohak.shah@us.bosch.com

Abstract—Most machine learning algorithms involve solving a convex optimization problem. Traditional in-memory convex optimization solvers do not scale well with the increase in data. This paper identifies a generic convex problem that underlies most machine learning algorithms and solves it using the Alternating Direction Method of Multipliers (ADMM). The resulting ADMM problem reduces to an iterative system of linear equations, which can easily be solved at scale in a distributed fashion. We implement this framework in Apache Spark and compare it with the widely used Machine Learning LIBrary (MLLIB) in Apache Spark 1.3.

Keywords—Distributed Optimization; ADMM; Spark; MLLIB

I. INTRODUCTION

Convex optimization lies at the core of machine learning algorithms like linear regression, logistic regression, and support vector machines. With the advent of big data, traditional machine learning algorithms face critical challenges from the continually increasing volume of data. This motivates research into scalable systems and algorithms, particularly ones suited to solving general classes of convex optimization problems that would in turn help scale machine learning algorithms. There are two aspects of this research which need to be considered jointly, as both play a crucial role in the performance of any optimization solver in the big-data setting: 1. algorithms for distributed optimization, and 2. systems for the big-data framework. We briefly describe the state of the art for these aspects, and the corresponding choices we make in this paper, in the following subsections.

A. Algorithms

Recent years have seen a deluge of novel optimization algorithms for solving big-data machine learning problems. (S. Dhar and C. Yi contributed equally. This work was done during the course of C. Yi's internship at Robert Bosch LLC, Palo Alto, CA 94304, USA.) The majority of these approaches follow a distributed framework, and can be broadly categorized as:

1) variations of stochastic gradient descent (SGD) [1], [2],
2) Alternating Direction Method of Multipliers (ADMM) [3]–[6],
3) approaches that utilize functional approximation based on the local portion of the data [7]–[9],
4) Bayesian approaches [10], and
5) distributed delayed optimization [11].

Among these, SGD-based approaches have been the most influential and widely used. For example, the Machine Learning Library (MLLIB) packaged with the Spark 1.3 distribution uses SGD [12]. However, ongoing research and advancements in ADMM present it as a competitive candidate for such distributed problems [13]–[15]. Unfortunately, very few tools offer any ADMM-based distributed machine learning solutions [14], [15]. In this paper we explore the ADMM approach to tackle a generic convex problem, which in turn can be used to solve many machine learning algorithms. We show that at the heart of the ADMM algorithm is a Quadratic Program (QP) which can be solved in a distributed fashion. The proposed framework provides a scalable solution for a gamut of machine learning algorithms (see Table I), and is comparable, in terms of computational complexity, to the publicly available solutions provided by MLLIB [12].

B. Systems

Another important aspect is the big-data framework used. A variety of architectures have been proposed for big-data analytics. On the basis of storage and computation technology, they can be broadly categorized as:

1) Single-node in-memory analytics, where the entire data is loaded and processed in the memory of a single computer. Such systems require a huge amount of memory. Typical tools that use this approach include MATLAB [16], R [17], KNIME [18], RapidMiner [19], Weka [20], etc.
2) In-disk analytics, where the entire data resides on disk, and chunks of it are loaded and processed in memory. Typical tools that use this approach are Revolution R (ScaleR) [21], MATLAB (memmap) [16], GraphLab [22], etc.
3) In-database analytics, where the data is stored in a database and the processing is taken to the database where the data resides. Typical tools that use this approach are Oracle Data Miner [23], HP Vertica [24], Pivotal [25], etc.
4) Distributed storage and computing systems, where the data resides on multiple nodes and the computation is distributed among those nodes. Typical tools that use this approach are RHadoop (Map-Reduce on the Hadoop File System, a.k.a. HDFS) [26], Mahout (Map-Reduce on HDFS) [27], Apache Spark (distributed in-memory analytics with storage on HDFS) [28], Alpine Data Labs (Map-Reduce/Spark with storage on HDFS) [29], etc.

In this paper we use the Spark computing framework [28] over data stored in HDFS, as it offers several advantages over MapReduce. Specifically, the caching mechanism and lazy execution model of Spark make it fast and fault-tolerant, especially for iterative tasks, compared to MapReduce, which needs to write all intermediate data to disk. For details on the Spark framework and its performance comparison to MapReduce, please refer to [28].

Lately, there has been an enormous amount of research on ADMM and its modifications, typically directed towards faster convergence under specific conditions [3], [15]. This paper does not provide new modifications to the ADMM algorithm. The main contribution of this paper is identifying a generic optimization problem applicable to a gamut of machine learning algorithms (see Table I), and using the standard ADMM algorithm to solve the optimization in a distributed fashion in Spark. The availability of such a repository of machine learning algorithms in Spark, as an alternative to the currently available Machine Learning LIBrary (MLLIB) [12], can be very useful to the big-data analytics community. We provide the update steps for all the algorithms in Table I, and show that at the core of the ADMM updates is a QP which can easily be solved in a distributed fashion. We benchmark the performance of this generic solver (implemented on Spark 1.3) and compare it with the publicly available MLLIB on big-data problems.

The rest of the paper is organized as follows. In Section II, we introduce the basics of ADMM following [3]. In Section III we present the generic optimization problem and provide ADMM updates for a number of machine learning algorithms (shown in Table I). Section IV presents a performance comparison of our ADMM implementation and MLLIB. Finally, we provide conclusions in Section V.

II. ALTERNATING DIRECTION METHOD OF MULTIPLIERS

The Alternating Direction Method of Multipliers (ADMM) was first proposed in the mid-70s by Glowinski & Marrocco [6] and Gabay & Mercier [4] as a general convex optimization algorithm. Lately there has been a tremendous amount of research on ADMM due to its applicability to the distributed data setting; as an outcome of this research, ADMM presents itself as a competitive technique for distributed optimization. A critical feature of the ADMM formulation is that it divides an optimization problem into smaller sub-problems and enables them to be solved in a distributed setting. Next we present a brief description of the ADMM methodology; a more detailed description can be found in [3].

A. Basic Form

Let us consider optimization problems of the following form (note that we use lowercase bold letters to represent vectors throughout the paper):

\min_{w,z} \; f(w) + g(z) \quad \text{s.t.} \quad Aw + Bz = c    (1)

We form the augmented Lagrangian given below,

L_\rho(w, z, u) = f(w) + g(z) + u^\top(Aw + Bz - c) + \frac{\rho}{2}\|Aw + Bz - c\|_2^2,    (2)

where u is the Lagrange multiplier. Note that the augmented Lagrangian contains a quadratic penalty term in addition to the usual Lagrangian, controlled by the penalty parameter \rho (see [3] for details). The ADMM iterations to solve eq. (1) are,

w^{k+1} = \arg\min_{w} L_\rho(w, z^k, u^k)
z^{k+1} = \arg\min_{z} L_\rho(w^{k+1}, z, u^k)    (3)
u^{k+1} = u^k + \rho(Aw^{k+1} + Bz^{k+1} - c)

For practical purposes, a more widely used version is the scaled ADMM. In that case, the linear and quadratic terms involving the primal residual r = Aw + Bz - c in eq. (2) are combined, and the resulting ADMM updates become,

w^{k+1} = \arg\min_{w} \; f(w) + \frac{\rho}{2}\|Aw + Bz^k - c + u^k\|_2^2
z^{k+1} = \arg\min_{z} \; g(z) + \frac{\rho}{2}\|Aw^{k+1} + Bz - c + u^k\|_2^2    (4)
u^{k+1} = u^k + (Aw^{k+1} + Bz^{k+1} - c)

For the rest of the paper we shall use the scaled version of ADMM, following [3].

Note that for a problem where the objective function can be decomposed as a sum of two functions (f(w) and g(z) in eq. (1)), ADMM provides a framework to solve two separate sub-problems (the w-step and z-step of eq. (3)) to obtain the final solution.
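To make the scaled updates of eq. (4) concrete, the following minimal sketch instantiates them for the lasso (f the least-squares loss, g the l1 penalty, A = I, B = -I, c = 0), assuming small in-memory NumPy data; the distributed Spark variants of these steps are developed in Section III.

```python
import numpy as np

def scaled_admm_lasso(X, y, lam, rho=1.0, n_iter=100):
    # Scaled ADMM of eq. (4) for the lasso: f(w) = 1/(2N)*||y - Xw||^2,
    # g(z) = lam*||z||_1, with constraint w - z = 0.
    N, D = X.shape
    z = np.zeros(D)
    u = np.zeros(D)
    P = X.T @ X / N + rho * np.eye(D)   # fixed factor reused by every w-step
    Xty = X.T @ y / N
    for _ in range(n_iter):
        w = np.linalg.solve(P, Xty + rho * (z - u))                      # w-step
        z = np.sign(w + u) * np.maximum(np.abs(w + u) - lam / rho, 0.0)  # z-step
        u = u + w - z                                                    # scaled dual update
    return z
```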

B. Consensus ADMM

Next we present a specific form called consensus ADMM. This serves as a very useful approach for solving many problems in a distributed setting (as shown later in Section III for the linear SVM). For this case, consider an optimization formulation where f(w) in eq. (1) can be decomposed into M independent parts, i.e. \sum_{t=1}^{M} f_t(w_t). Then the consensus ADMM problem can be written as,

\min_{w_1,\ldots,w_M,z} \; f_1(w_1) + \ldots + f_M(w_M) + g(z)
\text{subject to} \; Aw_t + Bz = c, \quad t = 1,\ldots,M    (5)

Note that, different from eq. (1), here we solve M independent sub-problems. The equality constraint is called the global consensus constraint, since it requires all the vectors w_1,\ldots,w_M to reach a consensus with a global variable z. This results in the following ADMM steps in scaled form (see [3] for details),

w_t^{k+1} = \arg\min_{w_t} \; f_t(w_t) + \frac{\rho}{2}\|Aw_t + Bz^k - c + u_t^k\|_2^2
z^{k+1} = \arg\min_{z} \; g(z) + \frac{\rho}{2}\sum_t \|Aw_t^{k+1} + Bz - c + u_t^k\|_2^2    (6)
u_t^{k+1} = u_t^k + Aw_t^{k+1} + Bz^{k+1} - c

Compared to eq. (4), the w_t's in eq. (6) are updated independently and can be easily parallelized, as sketched below.
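As an illustration, the sketch below runs one round of eq. (6) on an RDD of blocks, with A = I, B = -I, c = 0 and g = 0 (so the z-step is a plain average); solve_local is a hypothetical per-block solver for the w_t sub-problem, and a regularized g would instead apply its proximal operator to the average.

```python
import numpy as np

def consensus_round(parts, z, rho, solve_local):
    # parts: RDD of (w_t, u_t, data_t), one element per block B_t.
    # w-step: each block solves argmin f_t(w) + rho/2 * ||w - z + u_t||^2 in parallel.
    stepped = parts.map(
        lambda p: (solve_local(p[2], z - p[1], rho), p[1], p[2])).cache()
    M = stepped.count()
    # z-step on the driver: for g = 0 this is the average of w_t + u_t.
    z_new = stepped.map(lambda p: p[0] + p[1]).reduce(np.add) / M
    # u-step uses the *new* z, as in eq. (6).
    return stepped.map(lambda p: (p[0], p[1] + p[0] - z_new, p[2])), z_new
```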

III. ADMM BASED DISTRIBUTED MACHINE LEARNING ALGORITHMS

In this section we discuss how this ADMM framework can be used to solve many machine learning algorithms. Under the inductive setting, a typical supervised machine learning problem involves estimating a function from noisy training samples (x_i, y_i)_{i=1}^{N}, where N is the number of training samples [30], [31]. There are two common types of supervised learning problems:

• Regression, or real-valued function estimation, y = \hat{f}(x). In this case y \in \mathbb{R} and x \in \mathbb{R}^D, where D is the dimension of the input space. The quality of the prediction/estimation is measured by a user-defined loss function L(\hat{f}_{w,b}(x_i), y_i); typical examples include the squared loss, the \epsilon-insensitive loss, etc.
• Classification, or estimation of an indicator function, y = \hat{f}(x). In this case y \in \{+1,-1\} and x \in \mathbb{R}^D, where D is the dimension of the input space. As before, the quality of the prediction/estimation is measured by a user-defined loss function L(\hat{f}_{w,b}(x_i), y_i) like the logit loss, hinge loss, 0/1 loss, etc.

A common optimization problem that is solved for both of the supervised learning settings discussed above is:

\min_{w,b} \; \frac{1}{N}\sum_{i=1}^{N} L(\hat{f}_{w,b}(x_i), y_i) + \lambda R(w)    (7)

Here N is the total number of samples used to estimate the model \hat{f}_{w,b}, parameterized by w \in \mathbb{R}^D and b \in \mathbb{R}. In this paper we limit ourselves to linear parameterizations, where \hat{f}_{w,b}(x) = w^\top x + b. L is a convex loss which measures the discrepancy between the model estimates and the true values/labels. R is a convex regularizer that penalizes model complexity for better generalization on unseen future test samples.

In this paper we propose to solve the general class of optimization problems shown in eq. (7), and use this solver for many popular supervised machine learning algorithms (see Table I). We provide the ADMM updates for each of these algorithms and show that at the heart of the ADMM updates for eq. (7) is a QP problem arising in the w-step, of the following form,

\min_{w} \; \frac{1}{2} w^\top P w - q^\top w \quad \text{subject to} \quad l \le w \le u    (8)

We adopt the following strategies to solve this QP (a combined sketch follows the list):

1) Unconstrained case (i.e. l = -\infty, u = \infty): we use a direct matrix inversion, w^* = P^{-1} q. Note that for high-dimensional problems such matrix-inversion operations could become a bottleneck; more advanced QP solvers, e.g. ones based on conjugate gradient, can be added in future versions of this work.
2) Constrained case (i.e. l, u are finite): we solve the QP using the L-BFGS method [32] and apply a warm-start strategy, i.e. we initialize with the w value from the previous iteration.
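A minimal NumPy/SciPy sketch of both strategies, assuming elementwise box constraints; SciPy's L-BFGS-B variant is used here since plain L-BFGS does not handle bounds.

```python
import numpy as np
from scipy.optimize import minimize

def solve_qp(P, q, l=None, u=None, w0=None):
    # min_w 0.5 w^T P w - q^T w  s.t.  l <= w <= u  (eq. 8)
    if l is None and u is None:
        return np.linalg.solve(P, q)             # unconstrained: w* = P^{-1} q
    w0 = np.zeros_like(q) if w0 is None else w0  # warm start from previous ADMM iterate
    bounds = list(zip(np.broadcast_to(l, q.shape), np.broadcast_to(u, q.shape)))
    res = minimize(lambda w: 0.5 * w @ P @ w - q @ w,
                   w0, jac=lambda w: P @ w - q,
                   method="L-BFGS-B", bounds=bounds)
    return res.x
```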

Next we present the ADMM updates for the different ML algorithms in Table I.

A. L1/L2 Regression

In this sub-section we consider the more generic elastic-net regularizer [33] with the least-squares loss. Given input training data (x_i, y_i)_{i=1}^{N} with x \in \mathbb{R}^D and y \in \mathbb{R}, linear regression with elastic-net regularization solves the following optimization:

\min_{w} \; \lambda \sum_{j=1}^{D} \delta_j \Big[ \alpha |w_j| + (1-\alpha)\cdot\frac{w_j^2}{2} \Big] + \frac{1}{2N}\sum_{i=1}^{N} (y_i - x_i^\top w)^2    (9)

Table I
MACHINE LEARNING ALGORITHMS IN THE FORM OF EQ. (7)

| Methods | Loss function (with \hat{f}_{w,b}(x) = w^\top x + b) | Regularizer |
| L1, L2, L1-L2 regularized linear regression | least squares: \frac{1}{2N}\sum_i (y_i - b - x_i^\top w)^2 | \alpha\|w\|_1 + (1-\alpha)\cdot\frac{1}{2}\|w\|_2^2, with \alpha \in [0,1] |
| L1, L2, L1-L2 regularized logistic regression | logit loss: \frac{1}{N}\sum_i \log(1 + e^{-y_i(x_i^\top w + b)}) | (same as above) |
| L1, L2, L1-L2 regularized linear SVM | hinge loss: \frac{1}{N}\sum_i (1 - y_i(x_i^\top w + b))_+ | (same as above) |
| Group lasso | \frac{1}{2N}\sum_i (y_i - b - x_i^\top w)^2 | \sum_{k=1}^{G} \sqrt{d_k}\,[\alpha\|w_k\|_2 + (1-\alpha)\cdot\frac{1}{2}\|w_k\|_2^2]; G = total groups, d_k = size of the kth group |

In this form we can include the intercept in the optimization problem by augmenting a column of ones to the input samples, i.e. \hat{x} = [x, 1] \in \mathbb{R}^{D+1}, and solving for \hat{w} = [w, b] \in \mathbb{R}^{D+1}.

Note that the current form is generic and can easily be adapted to solve both the lasso (\alpha = 1) and ridge regression (\alpha = 0), in addition to the elastic net [33]. Further, \delta_j provides additional flexibility in this optimization problem:

• As discussed in [31], penalization of the intercept would make the algorithm depend on the origin chosen for y; we avoid this by setting \delta_{D+1} = 0.
• In addition, we can incorporate a priori information in the penalization term. A special case is the group lasso, where we set \delta_j = \sqrt{d_k} for the kth group of size d_k.

Here, the ADMM formulation is given as,

\min_{w,z} \; \frac{1}{2N}\sum_{i=1}^{N} (y_i - x_i^\top w)^2 + \lambda\sum_{j=1}^{D} \delta_j \Big[ \alpha |z_j| + (1-\alpha)\cdot\frac{z_j^2}{2} \Big]    (10)
\text{subject to} \; w - z = 0,

and the corresponding updates are,

w^{k+1} = \arg\min_{w} \; \frac{1}{2N}\sum_{i=1}^{N} (y_i - x_i^\top w)^2 + \frac{\rho}{2}\|w - z^k + u^k\|_2^2
        = \arg\min_{w} \; \frac{1}{2} w^\top P w - q^\top w = P^{-1} q    (11)

with P = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^\top + \rho I_D and q = \frac{1}{N}\sum_{i=1}^{N} y_i x_i + \rho(z^k - u^k),

z^{k+1} = \arg\min_{z} \; \lambda\sum_j \delta_j \Big( \alpha |z_j| + (1-\alpha)\cdot\frac{z_j^2}{2} \Big) + \frac{\rho}{2}\|w^{k+1} - z + u^k\|_2^2    (12)

which has the closed-form solution z_j^{k+1} = \frac{S_{\kappa_j}(w_j^{k+1} + u_j^k)}{1 + \lambda\delta_j(1-\alpha)/\rho}, where \kappa_j = \lambda\delta_j\alpha/\rho and S_\kappa(t) = (t-\kappa)_+ - (-t-\kappa)_+ is the soft-thresholding operator, and

u^{k+1} = u^k + w^{k+1} - z^{k+1}    (13)

As seen above, for big-data problems the w-step is the main bottleneck. It can, however, be scaled through distributed computation of \sum_i x_i x_i^\top and \sum_i y_i x_i, after which the w-update reduces to the matrix inversion shown in eq. (11). The z- and u-updates follow directly from eqs. (12) and (13). A sketch of the distributed computation is given below.
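A hedged PySpark sketch of these distributed sums: per-sample outer products and cross products are accumulated per partition and reduced to the driver, after which every ADMM iteration only re-forms q and solves the D x D system (names such as sufficient_stats are illustrative, not part of the paper's codebase).

```python
import numpy as np

def sufficient_stats(data_rdd, D):
    # data_rdd: RDD of (x, y) pairs, x a length-D NumPy array, y a float.
    # Returns (sum_i x_i x_i^T, sum_i y_i x_i, N) for the w-step of eq. (11).
    def seq(acc, xy):
        x, y = xy
        return (acc[0] + np.outer(x, x), acc[1] + y * x, acc[2] + 1)
    def comb(a, b):
        return (a[0] + b[0], a[1] + b[1], a[2] + b[2])
    zero = (np.zeros((D, D)), np.zeros(D), 0)
    return data_rdd.aggregate(zero, seq, comb)

# One-time pass over the data; afterwards each iteration solves
# (XX/N + rho*I) w = Xy/N + rho*(z - u) entirely on the driver.
```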

B. Group-Lasso

Next we consider a very specific method called the group lasso. In this case we assume that a priori grouping information is available to form a composite weight vector of G groups, denoted as w = [\underbrace{w_1^{(1)} \cdots w_{d_1}^{(1)}}_{\text{group 1}}, \ldots, \underbrace{w_1^{(g)} \cdots w_{d_g}^{(g)}}_{\text{group } g}, \ldots, w^{(G)}], where w_k^{(g)} is the kth feature of the gth group, w^{(G)} = b (the intercept), and d_g is the size of the gth group. The group-lasso regularized linear regression model is then given by,

\min_{w} \; \frac{1}{2N}\sum_{i=1}^{N} (y_i - x_i^\top w)^2 + \lambda\sum_{g=1}^{G} \delta_g \Big[ \alpha\|w_g\|_2 + (1-\alpha)\cdot\frac{1}{2}\|w_g\|_2^2 \Big]    (14)

In practice we use \delta_g = \sqrt{d_g}, with \delta_G = 0 for the intercept. Following the same procedure as above, the w-update and u-update are exactly the same as in the elastic-net regularized case; the only difference is the z-update, which is given as,

z_g^{k+1} = \frac{S_{\kappa_g}(w_g^{k+1} + u_g^k)}{1 + \lambda\delta_g(1-\alpha)/\rho}    (15)

where \kappa_g = \lambda\delta_g\alpha/\rho, and S_\kappa is the block soft-thresholding operator S_\kappa(t) = \big(1 - \frac{\kappa}{\|t\|_2}\big)_+ t.
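For reference, a minimal NumPy version of this block z-step (eq. 15), with all parameter names illustrative:

```python
import numpy as np

def block_soft_threshold(t, kappa):
    # S_kappa(t) = (1 - kappa/||t||_2)_+ * t
    nrm = np.linalg.norm(t)
    return np.zeros_like(t) if nrm == 0 else max(0.0, 1.0 - kappa / nrm) * t

def group_z_step(w_g, u_g, delta_g, lam, alpha, rho):
    # z_g update of eq. (15) for one group g.
    kappa = lam * delta_g * alpha / rho
    return block_soft_threshold(w_g + u_g, kappa) / (1.0 + lam * delta_g * (1 - alpha) / rho)
```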

C. L1/L2-Logistic Regression

In this sub-section we switch to classification problems. Specifically, we consider the logistic regression classification method. Given input training data (x_i, y_i)_{i=1}^{N} with x \in \mathbb{R}^D and y \in \{-1,+1\}, the logistic regression model is estimated by solving the following optimization problem:

\min_{w} \; \frac{1}{N}\sum_{i=1}^{N} \log(1 + e^{-y_i x_i^\top w})    (16)
\quad + \lambda\sum_{j=1}^{D} \delta_j \Big\{ \alpha |w_j| + (1-\alpha)\cdot\frac{w_j^2}{2} \Big\}    (17)

The corresponding ADMM form is as follows:

\min_{w,z} \; \frac{1}{N}\sum_{i=1}^{N} \log(1 + e^{-y_i x_i^\top w}) + \lambda\sum_{j=1}^{D} \delta_j \Big\{ \alpha |z_j| + (1-\alpha)\cdot\frac{z_j^2}{2} \Big\}    (18)
\text{subject to} \; w - z = 0

As before we use \hat{x} = [x, 1] \in \mathbb{R}^{D+1}, with \hat{w} = [w, b] \in \mathbb{R}^{D+1}. The resulting ADMM updates are,

w^{k+1} = \arg\min_{w} \; \frac{1}{N}\sum_i \log(1 + e^{-y_i x_i^\top w}) + \frac{\rho}{2}\|w - z^k + u^k\|_2^2    (19)
z^{k+1} = \arg\min_{z} \; \lambda\sum_j \delta_j \Big( \alpha |z_j| + (1-\alpha)\cdot\frac{z_j^2}{2} \Big) + \frac{\rho}{2}\|w^{k+1} - z + u^k\|_2^2
u^{k+1} = w^{k+1} - z^{k+1} + u^k

Note that the z- and u-updates are the same as in eqs. (12) and (13) respectively. For the w-step we use the Newton updates given below. Let,

l(w) = \frac{1}{N}\sum_i \log(1 + e^{-y_i x_i^\top w}) + \frac{\rho}{2}\|w - z^k + u^k\|_2^2

then,

\nabla_w l(w) = -\frac{1}{N}\sum_i y_i (1 - p_i) x_i + \rho(w - z^k + u^k),
\nabla_w^2 l(w) = \frac{1}{N}\sum_i p_i (1 - p_i) x_i x_i^\top + \rho I

where p_i = 1/(1 + e^{-y_i w^\top x_i}). Hence, the optimal w^{k+1} can be obtained through the iterative Algorithm 1.

Algorithm 1: Iterative Newton algorithm for w^{k+1}
Input: w^k, z^k, u^k
Output: w^{k+1}
initialize v^{(0)} \leftarrow w^k, j \leftarrow 0;
while not converged do
    p_i^{(j)} \leftarrow 1/(1 + e^{-y_i x_i^\top v^{(j)}});
    P^{(j)} \leftarrow \frac{1}{N}\sum_i p_i^{(j)}(1 - p_i^{(j)}) x_i x_i^\top + \rho I;
    q^{(j)} \leftarrow -\frac{1}{N}\sum_i y_i(1 - p_i^{(j)}) x_i + \rho(v^{(j)} - z^k + u^k);
    v^{(j+1)} \leftarrow v^{(j)} - (P^{(j)})^{-1} q^{(j)}  (distributed);
    j \leftarrow j + 1;
return w^{k+1} \leftarrow v^{(j)};
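A minimal in-memory NumPy version of Algorithm 1 follows; on Spark, the two data-dependent sums would be accumulated per partition and reduced, exactly as for the linear-regression statistics above.

```python
import numpy as np

def logistic_w_step(X, y, z, u, rho, tol=1e-6, max_iter=50):
    # Newton iterations of Algorithm 1 for the logistic w-step of eq. (19).
    N, D = X.shape
    v = np.zeros(D)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-y * (X @ v)))       # p_i = sigma(y_i x_i^T v)
        grad = -(X.T @ (y * (1.0 - p))) / N + rho * (v - z + u)
        H = (X.T * (p * (1.0 - p))) @ X / N + rho * np.eye(D)
        step = np.linalg.solve(H, grad)
        v -= step
        if np.linalg.norm(step) < tol:               # Newton step small enough
            break
    return v
```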

D. Linear SVM

Finally, we show how to use a similar framework to solve the linear SVM. Note that a detailed analysis of distributed SVM using ADMM has already been given in [15]. However, even though the technicalities are similar, we solve a slightly different problem (hinge loss + elastic net) and show it for completeness. Further, different from [15], [34], we use an L-BFGS approach to solve each sub-problem, as discussed next.

Given input training data (x_i, y_i)_{i=1}^{N} with x \in \mathbb{R}^D and y \in \{-1,+1\}, the elastic-net regularized linear SVM solves the following optimization problem:

\min_{w} \; \frac{C}{N}\sum_{i=1}^{N} (1 - y_i x_i^\top w)_+ + \sum_{j=1}^{D} \delta_j \Big\{ \alpha |w_j| + (1-\alpha)\cdot\frac{w_j^2}{2} \Big\}    (20)

As before, \alpha \in [0,1] controls the trade-off between L1 and L2 regularization, and the \delta_j corresponding to the intercept is set to 0 to avoid regularizing the intercept. Here, unlike the previous models, the hinge loss in the SVM is nonsmooth. To tackle this issue we use consensus ADMM and instead solve smaller SVM-like sub-problems, as also shown in [15]. The advantage of this approach is that each smaller SVM-like sub-problem can now be solved in the dual space using a QP solver. This is shown next.

The consensus ADMM formulation for the problem is,

\min_{w_1,\ldots,w_M,z} \; \frac{C}{N}\sum_{t=1}^{M}\sum_{i \in B_t} (1 - y_i x_i^\top w_t)_+ + \sum_j \delta_j \Big\{ \alpha |z_j| + (1-\alpha)\cdot\frac{z_j^2}{2} \Big\}    (21)
\text{subject to} \; w_t - z = 0, \quad t = 1,\ldots,M

and the corresponding updates are,

w_t^{k+1} = \arg\min_{w_t} \; \frac{C}{N}\sum_{i \in B_t} (1 - y_i x_i^\top w_t)_+ + \frac{\rho}{2}\|w_t - z^k + u_t^k\|_2^2, \quad t = 1,\ldots,M    (22)
z_j^{k+1} = \frac{S_{\kappa_j}\big( \frac{1}{M}\sum_{t=1}^{M} (w_{tj}^{k+1} + u_{tj}^k) \big)}{1 + \lambda\delta_j(1-\alpha)/\rho}, \quad j = 1,\ldots,D
u_t^{k+1} = u_t^k + w_t^{k+1} - z^{k+1}, \quad t = 1,\ldots,M

Note that the w-update is now an SVM-like problem on a subset B_t, which can be solved in the dual form as shown next. For each subset B_t,

w_t^{k+1} = \arg\min_{w_t} \; \frac{C}{N}\sum_{i \in B_t} \xi_i + \frac{\rho}{2}\|w_t - z^k + u_t^k\|_2^2
\text{s.t.} \; y_i x_i^\top w_t \ge 1 - \xi_i, \; \xi_i \ge 0, \; i \in B_t    (23)

This transforms to the following box-constrained QP,

\min_{\alpha} \; \frac{1}{2}\alpha^\top P \alpha + q^\top \alpha    (24)
\text{s.t.} \; 0 \le \alpha_i \le C/N, \; i \in B_t

with P_{ij} = y_i y_j x_i^\top x_j and q_i = y_i x_i^\top (z^k - u_t^k) - 1.

We use L-BFGS to solve the above QP and finally obtain,

w_t^{k+1} = \frac{1}{\rho}\sum_{i \in B_t} \alpha_i^{k+1} y_i x_i + z^k - u_t^k    (25)

as the final SVM solution. This can also be used to accommodate the non-linear SVM following [35].
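A sketch of this per-block dual step with SciPy's box-constrained L-BFGS is given below; note that a 1/\rho factor is included in P here, which is the scaling consistent with the recovery formula in eq. (25) (an assumption on our part, since eq. (24) states P without it).

```python
import numpy as np
from scipy.optimize import minimize

def svm_block_w_step(Xb, yb, z, u_t, rho, C, N):
    # Solve the dual QP (eq. 24) for block B_t and recover w_t via eq. (25).
    v = z - u_t
    G = yb[:, None] * Xb                    # rows are y_i * x_i
    P = G @ G.T / rho                       # P_ij = y_i y_j x_i^T x_j / rho (see lead-in)
    q = G @ v - 1.0                         # q_i = y_i x_i^T (z - u_t) - 1
    n = len(yb)
    res = minimize(lambda a: 0.5 * a @ P @ a + q @ a,
                   np.zeros(n), jac=lambda a: P @ a + q,
                   method="L-BFGS-B", bounds=[(0.0, C / N)] * n)
    return v + (G.T @ res.x) / rho          # w_t = (1/rho) sum_i alpha_i y_i x_i + z - u_t
```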

IV. EXPERIMENTS AND RESULTS

Next we provide a performance comparison of our implemented algorithms against the publicly available MLLIB library packaged with Apache Spark 1.3.

A. System Configuration

The Hadoop cluster configuration for our experiments is provided below:
– No. of nodes = 6 (Hadoop version: Apache 1.1.1)
– No. of cores (per node) = 12 (Intel Xeon @ 3.20GHz)
– RAM size (per node) = 32 GB
– Hard disk size (per node) = 500 GB

For the implementation we use the Python interface (pyspark) available in [28]. Further, our Spark framework has been configured based on the recommendations available at [36], listed below (a programmatic sketch follows the list):
– spark.num.executors = 17
– spark.executor.memory = 6 GB
– spark.driver.memory = 4 GB
– spark.driver.maxResultSize = 4 GB
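For illustration, one way to apply these settings programmatically (we assume the spark.num.executors entry above corresponds to the standard Spark property spark.executor.instances; driver-side memory settings are usually passed via spark-submit instead):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.instances", "17")
        .set("spark.executor.memory", "6g")
        .set("spark.driver.memory", "4g")          # normally set via spark-submit
        .set("spark.driver.maxResultSize", "4g"))
sc = SparkContext(conf=conf)
```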

B. Datasets

We generate synthetic datasets of different sizes for our experiments. The datasets are generated to capture the sparsity as well as the grouping behavior of the different methods. The dataset used for the classification methods is described below.

Dataset for classification problems: In this case x \in \mathbb{R}^D is generated from a multivariate normal distribution N(0, \Sigma). Here the correlation matrix \Sigma is block diagonal with G blocks of size 10 x 10, where each block has ones on the diagonal and 0.2 elsewhere; this structure controls the grouping properties of the problem. We fix the number of variables per group at 10, so the pairwise correlation within each group is 0.2 and that between groups is 0. The y-value (class label) is generated as,

y = \text{sign}(w^\top x + \varepsilon)    (26)

where w controls the sparsity of the problem. For this paper we set the sparsity parameter to 0.8, i.e. 80% of the groups have a zero weight vector. For the remaining 20%, the weights within each group alternate between +1 and -1, i.e.

w = [\underbrace{1,1,1,1,1,-1,-1,-1,-1,-1}_{\text{group 1}}, \ldots, \underbrace{1,1,1,1,1,-1,-1,-1,-1,-1}_{\text{group } g}, \underbrace{0,0,0,\ldots}_{\text{remaining 80\% sparse groups}}]

Further, we add Gaussian noise \varepsilon \sim N(0,1) to the model. The above settings are used to generate two separate datasets of different sizes:
– No. of training samples N = 2,000,000 with dimension D = 100,
– No. of training samples N = 20,000,000 with dimension D = 100.
The generated data is saved in comma-separated format, which takes up approximately 5 GB and 50 GB of disk space respectively.
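A NumPy sketch of this generator follows (the regression variant below only changes the last line to y = X @ w + 2 + noise); sampling block by block is valid here because the groups are uncorrelated.

```python
import numpy as np

def gen_classification_data(N, D=100, group=10, sparsity=0.8, seed=0):
    rng = np.random.default_rng(seed)
    G = D // group
    # One 10x10 within-group block: ones on the diagonal, 0.2 elsewhere.
    L = np.linalg.cholesky(0.2 * np.ones((group, group)) + 0.8 * np.eye(group))
    X = np.hstack([rng.standard_normal((N, group)) @ L.T for _ in range(G)])
    w = np.zeros(D)
    for g in range(int(G * (1 - sparsity))):       # the 20% non-sparse groups
        w[g * group : g * group + 5] = 1.0
        w[g * group + 5 : (g + 1) * group] = -1.0
    y = np.sign(X @ w + rng.standard_normal(N))    # eq. (26)
    return X, y, w
```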

Dataset for regression problems: The generation of this data follows exactly the same scheme as for the classification problems, except that the y-values are generated as,

y = w^\top x + 2 + \varepsilon    (27)

As before, we use two separate datasets of different sizes:
– No. of training samples N = 2,000,000 with dimension D = 100,
– No. of training samples N = 20,000,000 with dimension D = 100.

C. Results

Here we compare the computation times of our ADMM implementation against MLLIB for both classification and regression problems. In general, the computation time of the ADMM-based methods depends heavily on a number of parameters like the \rho-update (see [3], [13]), the convergence criteria, etc. For simplicity, we follow the \rho-update suggested in eq. (3.13) of [3]. Further, our current stopping criterion declares convergence when the primal and dual residuals jointly fall below a tolerance of 10^{-3} (following [3]); a minimal version of this check is sketched below. MLLIB, on the other hand, does not provide any control over the convergence criteria, so for our experiments we keep its default settings.
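A minimal version of this stopping rule (a simplified form of the residual-based criteria in sec. 3.3 of [3], assuming the constraint w - z = 0):

```python
import numpy as np

def converged(w, z, z_old, rho, tol=1e-3):
    # Primal residual r = w - z, dual residual s = rho * (z - z_old);
    # declare convergence when both fall below the tolerance.
    r = np.linalg.norm(w - z)
    s = rho * np.linalg.norm(z - z_old)
    return r < tol and s < tol
```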

Tables II and III provide the average computation times (in seconds) over three runs of the experiment for the classification and regression problems respectively; standard deviations are given in parentheses. In the current version of the paper our results are limited to L1/L2 logistic regression and L1/L2 linear regression. Additional results for L1/L2 SVM and the group lasso shall be provided in an extended version of the paper.

Based on our results in Tables II and III, the ADMM implementation performs similarly to MLLIB in terms of computation speed, except for the regression problems.² For the regression problems we report the computation time for one iteration of the ADMM updates; this approximate solution still outperformed MLLIB's solution in terms of accuracy. Hence, the current ADMM-based framework provides a viable alternative to the SGD-based approach implemented in MLLIB. In addition, this framework supports a wide range of scalable ML algorithms, which can prove a useful arsenal for data scientists tackling big-data problems.

² The MLLIB package distributed with Spark 1.3 provides an incorrect implementation of the original logistic regression algorithm. A correction has been made in the latest Spark 1.4 version (see [37]); this has not been included in this paper. However, Spark 1.3's implementation can still be considered an approximate comparison representative of the SGD approach. Further, the convergence criteria for MLLIB cannot be controlled. In terms of accuracy, for both classification and regression problems, the MLLIB tool provided sub-optimal solutions.

Table II
COMPUTATION TIME COMPARISON BETWEEN ADMM VS. MLLIB (IN SEC) FOR CLASSIFICATION METHODS

Data set size = 5 GB (N = 2,000,000, D = 100)
| Methods | ADMM | MLLIB |
| L2 logistic regression (\lambda = 0.1, \alpha = 0) | 157.57 (0.04) | 139.68 (2.06) |
| L1 logistic regression (\lambda = 0.1, \alpha = 1) | 157.05 (1.54) | 266.9 (169.16) |
| L1+L2 logistic regression (\lambda = 0.1, \alpha = 0.5) | 155.2 (1.23) | Not available |

Data set size = 50 GB (N = 20,000,000, D = 100)
| Methods | ADMM | MLLIB |
| L2 logistic regression (\lambda = 0.1, \alpha = 0) | 13937.3 (10.34) | 14045.7 (411.78) |
| L1 logistic regression (\lambda = 0.1, \alpha = 1) | 15381.8 (5.59) | 13155.2 (307.60) |
| L1+L2 logistic regression (\lambda = 0.1, \alpha = 0.5) | 15472.1 (13.25) | Not available |

Table III
COMPUTATION TIME COMPARISON BETWEEN ADMM VS. MLLIB (IN SEC) FOR REGRESSION METHODS

Data set size = 5 GB (N = 2,000,000, D = 100)
| Methods | ADMM | MLLIB |
| L2 linear regression (\lambda = 0.1, \alpha = 0) | 425.09 (17.26) | 429.83 (190.02) |
| L1 linear regression (\lambda = 0.1, \alpha = 1) | 416.95 (3.5) | 444.79 (210.22) |
| L1+L2 linear regression (\lambda = 0.1, \alpha = 0.5) | 409.50 (2.5) | Not available |

Data set size = 50 GB (N = 20,000,000, D = 100)
| Methods | ADMM | MLLIB |
| L2 linear regression (\lambda = 0.1, \alpha = 0) | 4209.95 (10.5) | 29244.43 (100.12) |
| L1 linear regression (\lambda = 0.1, \alpha = 1) | 4233.13 (6.89) | 23526.45 (200.23) |
| L1+L2 linear regression (\lambda = 0.1, \alpha = 0.5) | 4150.81 (10.25) | Not available |

V. CONCLUSION

In this paper we present a generic convex optimization problem covering most ML algorithms. We identify ADMM as a viable approach to solve this generic convex optimization problem, and derive the ADMM updates specific to each ML algorithm listed in Table I. The current paper provides the update steps for linear parameterizations; however, it can easily be extended to non-linear cases following [14]. As shown in Section III, at the heart of the ADMM updates lies a QP which can be solved in a distributed fashion. Our results show that this ADMM-based approach performs similarly to the publicly available MLLIB in terms of computation speed. This presents ADMM as a viable alternative to MLLIB for big-data problems, with the added advantage of supporting more machine learning algorithms.

Finally, we note that the current implementation is limited by the dimension of the problem, as it needs to solve a QP in the w-update. This motivates the need for future research towards scalable options for the QP problem. In addition, there has been a gamut of research towards newer ADMM update strategies for faster convergence of the algorithm [3], [13]. These advanced strategies have not been included in this version of the paper and can be explored as future work.

ACKNOWLEDGMENT

The authors would like to thank Juergen Heit from the Research and Technology Center, Robert Bosch LLC, for multiple discussions on the Spark configurations for the experimental settings. They would also like to thank Max Rizvanov for his support with the Hadoop cluster management.

REFERENCES

[1] B. Recht, C. Re, S. Wright, and F. Niu, "Hogwild: A lock-free approach to parallelizing stochastic gradient descent," in Advances in Neural Information Processing Systems, 2011, pp. 693–701.
[2] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola, "Parallelized stochastic gradient descent," in Advances in Neural Information Processing Systems, 2010, pp. 2595–2603.
[3] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[4] D. Gabay and B. Mercier, "A dual algorithm for the solution of nonlinear variational problems via finite element approximation," Computers & Mathematics with Applications, vol. 2, no. 1, pp. 17–40, 1976.
[5] T. Goldstein, B. O'Donoghue, and S. Setzer, "Fast alternating direction optimization methods," CAM Report 12-35, 2012.
[6] R. Glowinski and A. Marroco, "Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires," ESAIM: Mathematical Modelling and Numerical Analysis, vol. 9, no. R2, pp. 41–76, 1975.
[7] D. Mahajan, S. S. Keerthi, S. Sundararajan, and L. Bottou, "A functional approximation based distributed learning algorithm," arXiv preprint arXiv:1310.8418, 2013.
[8] O. Shamir, N. Srebro, and T. Zhang, "Communication efficient distributed optimization using an approximate Newton-type method," arXiv preprint arXiv:1312.7853, 2013.
[9] C. H. Teo, S. Vishwanthan, A. J. Smola, and Q. V. Le, "Bundle methods for regularized risk minimization," The Journal of Machine Learning Research, vol. 11, pp. 311–365, 2010.
[10] X. Zhang, "Probabilistic methods for distributed learning," Ph.D. dissertation, Duke University, 2014.
[11] A. Agarwal and J. C. Duchi, "Distributed delayed stochastic optimization," in Advances in Neural Information Processing Systems, 2011, pp. 873–881.
[12] X. Meng, J. K. Bradley, B. Yavuz, E. R. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar, "MLlib: Machine learning in Apache Spark," CoRR, vol. abs/1505.06807, 2015. [Online]. Available: http://arxiv.org/abs/1505.06807
[13] R. Nishihara, L. Lessard, B. Recht, A. Packard, and M. I. Jordan, "A general analysis of the convergence of ADMM," ArXiv e-prints, Feb. 2015.
[14] V. Sindhwani and H. Avron, "High-performance kernel machines with implicit distributed optimization and randomization," ArXiv e-prints, Sep. 2014.
[15] C. Zhang, H. Lee, and K. G. Shin, "Efficient distributed linear classification algorithms via the alternating direction method of multipliers," in International Conference on Artificial Intelligence and Statistics, 2012, pp. 1398–1406.
[16] MATLAB, version 8.5 (R2015a). Natick, Massachusetts: The MathWorks Inc., 2015.
[17] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2013. [Online]. Available: http://www.R-project.org/
[18] M. R. Berthold, N. Cebron, F. Dill, T. R. Gabriel, T. Kötter, T. Meinl, P. Ohl, C. Sieb, K. Thiel, and B. Wiswedel, "KNIME: The Konstanz Information Miner," in Studies in Classification, Data Analysis, and Knowledge Organization (GfKL 2007). Springer, 2007.
[19] "RapidMiner," https://rapidminer.com/, accessed: 2015-06-30.
[20] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: An update," SIGKDD Explorations, vol. 11, no. 1, pp. 10–18, 2009. [Online]. Available: http://www.sigkdd.org/explorations/issues/11-1-2009-07/p2V11n1.pdf
[21] "Revolution R," http://www.revolutionanalytics.com/revolution-r-enterprise, accessed: 2015-06-30.
[22] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein, "GraphLab: A new parallel framework for machine learning," in Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July 2010.
[23] "Oracle Data Miner," http://www.oracle.com, accessed: 2015-06-30.
[24] "HP Vertica," http://www.vertica.com/, accessed: 2015-06-30.
[25] "Pivotal," http://pivotal.io/, accessed: 2015-06-30.
[26] "Revolution Analytics RHadoop," https://github.com/RevolutionAnalytics/RHadoop/wiki, accessed: 2015-06-30.
[27] Apache Software Foundation, "Apache Mahout: Scalable machine-learning and data-mining library." [Online]. Available: http://mahout.apache.org
[28] "Apache Spark," https://spark.apache.org/, accessed: 2015-06-30.
[29] "Alpine Data Labs," http://alpinenow.com/, accessed: 2015-06-30.
[30] V. Cherkassky and F. M. Mulier, Learning from Data: Concepts, Theory, and Methods. Wiley-IEEE Press, 2007.
[31] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, ser. Springer Series in Statistics. New York, NY, USA: Springer New York Inc., 2001.
[32] D. P. Bertsekas, Nonlinear Programming. Belmont, MA: Athena Scientific, 1999.
[33] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B, vol. 67, pp. 301–320, 2005.
[34] C.-Y. Lin, C.-H. Tsai, C.-P. Lee, and C.-J. Lin, "Large-scale logistic regression and linear support vector machines using Spark," in Big Data (Big Data), 2014 IEEE International Conference on. IEEE, 2014, pp. 519–528.
[35] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," in Advances in Neural Information Processing Systems, 2007, pp. 1177–1184.
[36] "How-to: Tune your Apache Spark jobs (part 2)," http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/, accessed: 2015-06-30.
[37] "MLlib (Spark) question," https://www.mail-archive.com/user@spark.apache.org/msg32244.html, accessed: 2015-06-30.