ADMM based Scalable Machine Learning on Spark
Sauptik Dhar
Research and Technology Center
Robert Bosch LLC
Palo Alto, CA 94304, USA
sauptik.dhar@us.bosch.com
Congrui Yi
Department of Statistics and Actuarial Science
University of Iowa
Iowa City, IA 52242, USA
congrui-yi@uiowa.edu
Naveen Ramakrishnan
Research and Technology Center
Robert Bosch LLC
Palo Alto, CA 94304, USA
naveen.ramakrishnan@us.bosch.com
Mohak Shah
Research and Technology Center
Robert Bosch LLC
Palo Alto, CA 94304, USA
mohak.shah@us.bosch.com
Abstract—Most machine learning algorithms involve solving
a convex optimization problem. Traditional in-memory convex
optimization solvers do not scale well with the increase in
data. This paper identifies a generic convex problem for most
machine learning algorithms and solves it using the Alternating
Direction Method of Multipliers (ADMM). Finally such an
ADMM problem transforms to an iterative system of linear
equations, which can be easily solved at scale in a distributed
fashion. We implement this framework in Apache Spark and
compare it with the widely used Machine Learning LIBrary
(MLLIB) in Apache Spark 1.3.
Keywords: Distributed Optimization; ADMM; Spark; MLLIB
I. INTRODUCTION
Convex optimization lies at the core of machine learning algorithms such as linear regression, logistic regression, and support vector machines. With the advent of big data, traditional machine learning algorithms face critical challenges from the continually increasing volume of data. This motivates the need for research in scalable systems and algorithms, particularly ones suited for solving general classes of convex optimization problems that would in turn help scale machine learning algorithms. Two aspects of this research need to be considered jointly, as both play a crucial role in the performance of any optimization solver in the big-data setting: 1. algorithms for distributed optimization, and 2. systems for the big-data framework. We briefly describe the state of the art for these aspects, and the corresponding choices we make for this paper, in the following subsections.
A. Algorithms
Recent years have seen a deluge of novel optimization algorithms for solving big-data machine learning problems. The majority of these approaches follow a distributed framework and can be broadly categorized as:
1) variations of stochastic gradient descent (SGD) [1], [2],
2) Alternating Direction Method of Multipliers (ADMM) [3]-[6],
3) approaches that utilize functional approximation based on the local portion of the data [7]-[9],
4) Bayesian approaches [10], and
5) distributed delayed optimization [11].
(S. Dhar and C. Yi contributed equally. This work was done during the course of C. Yi's internship at Robert Bosch LLC, Palo Alto, CA 94304, USA.)
Among these, SGD-based approaches have been the most influential and widely used. For example, the Machine Learning Library (MLLIB) packaged with the Spark 1.3 distribution uses SGD [12]. However, ongoing research and advancements in ADMM present it as a competitive candidate for such distributed problems [13]-[15]. Unfortunately, very few tools offer any ADMM-based distributed machine learning solutions [14], [15]. In this paper we explore the ADMM approach to tackle a generic convex problem, which in turn can be used to solve many machine learning algorithms. We show that at the heart of the ADMM algorithm is a Quadratic Program (QP) which can be solved in a distributed fashion. The proposed framework provides a scalable solution for a gamut of machine learning algorithms (see Table I), and is comparable, in terms of computational complexity, to the publicly available solutions provided by MLLIB [12].
B. Systems
Another important aspect is the big-data framework used. A variety of architectures have been proposed for big-data analytics. On the basis of storage and computation technology, they can be broadly categorized as:
1) Single-node in-memory analytics, where the entire data is loaded and processed in the memory of a single computer. Such systems require a huge amount of memory. Typical tools that use this approach include MATLAB [16], R [17], KNIME [18], RapidMiner [19], Weka [20], etc.
2) In-disk analytics, where the entire data resides on disk and chunks of it are loaded and processed in memory. Typical tools that use this approach are Revolution R (ScaleR) [21], MATLAB (memmap) [16], GraphLab [22], etc.
3) In-database analytics, where the data is stored in a database and the processing is taken to the database where the data resides. Typical tools that use this approach are Oracle Data Miner [23], HP Vertica [24], Pivotal [25], etc.
4) Distributed storage and computing systems, where the data resides on multiple nodes and the computation is distributed among those nodes. Typical tools that use this approach are RHadoop (MapReduce on the Hadoop Distributed File System, a.k.a. HDFS) [26], Mahout (MapReduce on HDFS) [27], Apache Spark (distributed in-memory analytics with storage on HDFS) [28], Alpine Data Labs (MapReduce/Spark with storage on HDFS) [29], etc.
In this paper we use the Spark computing framework [28] over data stored in HDFS, as it offers several advantages over MapReduce. Specifically, the caching mechanism and lazy execution model of Spark make it very fast and fault-tolerant, especially for iterative tasks, compared to MapReduce, which needs to write all intermediate data to disk. For details on the Spark framework and its performance comparison to MapReduce, please refer to [28].
Lately, there has been an enormous amount of research on ADMM and its modifications, typically directed towards faster convergence under specific conditions [3], [15]. This paper does not provide new modifications to the ADMM algorithm. The main contribution of this paper is identifying a generic optimization problem applicable to a gamut of machine learning algorithms (see Table I), and using the standard ADMM algorithm to solve this optimization in a distributed fashion in Spark. The availability of such a repository of machine learning algorithms in Spark, as an alternative to the currently available Machine Learning LIBrary (MLLIB) [12], can be very useful to the big-data analytics community. We provide the update steps for all the algorithms in Table I, and show that at the core of the ADMM updates is a QP which can be easily solved in a distributed fashion. We benchmark the performance of this generic solver (implemented on Spark 1.3) and compare it with the publicly available Machine Learning LIBrary (MLLIB) for big-data problems.
The rest of the paper is organized as follows. In Section II, we introduce the basics of ADMM following [3]. In Section III we present the generic optimization problem and provide ADMM updates for a number of machine learning algorithms (shown in Table I). Section IV presents a performance comparison of our ADMM implementation and MLLIB. Finally, we provide the conclusions in Section V.
II. ALTERNATING DIRECTION METHOD OF MULTIPLIERS
The Alternating Direction Method of Multipliers (ADMM) was first proposed in the mid-70s by Glowinski & Marrocco [6] and Gabay & Mercier [4] as a general convex optimization algorithm. Lately there has been a tremendous amount of research on ADMM due to its applicability to the distributed data setting. As an outcome of that research, ADMM presents itself as a competitive technique for distributed optimization. A critical feature of the ADMM formulation is that it divides an optimization problem into smaller sub-problems and enables their solution in a distributed setting. Next we present a brief description of the ADMM methodology. A more detailed description can be found in [3].
A. Basic Form
Let us consider optimization problems of the following form¹:
$$\min_{\mathbf{w},\mathbf{z}} \; f(\mathbf{w}) + g(\mathbf{z}) \quad \text{s.t.} \quad A\mathbf{w} + B\mathbf{z} = \mathbf{c} \qquad (1)$$
We form the augmented Lagrangian given below,
$$L_\rho(\mathbf{w},\mathbf{z},\mathbf{u}) = f(\mathbf{w}) + g(\mathbf{z}) + \mathbf{u}^\top(A\mathbf{w} + B\mathbf{z} - \mathbf{c}) + \frac{\rho}{2}\|A\mathbf{w} + B\mathbf{z} - \mathbf{c}\|_2^2, \qquad (2)$$
where $\mathbf{u}$ is the Lagrange multiplier. Note that the augmented Lagrangian contains a quadratic penalty term, controlled by the penalization factor $\rho$, in addition to the usual Lagrangian (see [3] for details). Then the ADMM iterations to solve eq. 1 are,
$$\begin{aligned}
\mathbf{w}^{k+1} &= \arg\min_{\mathbf{w}} \; L_\rho(\mathbf{w}, \mathbf{z}^k, \mathbf{u}^k)\\
\mathbf{z}^{k+1} &= \arg\min_{\mathbf{z}} \; L_\rho(\mathbf{w}^{k+1}, \mathbf{z}, \mathbf{u}^k) \qquad (3)\\
\mathbf{u}^{k+1} &= \mathbf{u}^k + \rho\,(A\mathbf{w}^{k+1} + B\mathbf{z}^{k+1} - \mathbf{c})
\end{aligned}$$
For practical purposes, a more widely used version is the scaled ADMM. In that case, the linear and quadratic terms of the primal residual $\mathbf{r} = A\mathbf{w} + B\mathbf{z} - \mathbf{c}$ in eq. 2 are combined, and the resulting ADMM updates become,
$$\begin{aligned}
\mathbf{w}^{k+1} &= \arg\min_{\mathbf{w}} \; f(\mathbf{w}) + \frac{\rho}{2}\|A\mathbf{w} + B\mathbf{z}^k - \mathbf{c} + \mathbf{u}^k\|_2^2\\
\mathbf{z}^{k+1} &= \arg\min_{\mathbf{z}} \; g(\mathbf{z}) + \frac{\rho}{2}\|A\mathbf{w}^{k+1} + B\mathbf{z} - \mathbf{c} + \mathbf{u}^k\|_2^2 \qquad (4)\\
\mathbf{u}^{k+1} &= \mathbf{u}^k + (A\mathbf{w}^{k+1} + B\mathbf{z}^{k+1} - \mathbf{c})
\end{aligned}$$
¹Note that we use lowercase bold letters to represent vectors throughout the paper.
For the rest of the paper we shall use the scaled version of ADMM, following [3]. Note that for a problem where the objective function can be decomposed as a sum of two functions ($f(\mathbf{w})$ and $g(\mathbf{z})$ in eq. 1), ADMM provides a framework to solve two separate sub-problems (the $\mathbf{w}$-step and $\mathbf{z}$-step of eq. 3) to obtain the final solution.
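As an illustration (not the paper's implementation), a minimal NumPy sketch of the scaled ADMM loop of eq. 4 is given below; the solver callbacks `w_update` and `z_update`, and the residual-based stopping test, are placeholders standing in for the problem-specific sub-problems discussed in Section III.

```python
import numpy as np

def scaled_admm(w_update, z_update, A, B, c, rho=1.0, max_iter=100, tol=1e-3):
    """Generic scaled ADMM loop for min f(w) + g(z) s.t. Aw + Bz = c (eq. 4).

    w_update(z, u) and z_update(w, u) are problem-specific sub-problem
    solvers supplied by the caller (placeholders here)."""
    w = np.zeros(A.shape[1])
    z = np.zeros(B.shape[1])
    u = np.zeros(c.shape[0])                 # scaled dual variable
    for _ in range(max_iter):
        w = w_update(z, u)                   # w-step of eq. 4
        z_old = z
        z = z_update(w, u)                   # z-step of eq. 4
        r = A @ w + B @ z - c                # primal residual
        s = rho * (A.T @ (B @ (z - z_old)))  # dual residual (see [3])
        u = u + r                            # scaled dual update
        if np.linalg.norm(r) < tol and np.linalg.norm(s) < tol:
            break
    return w, z, u
```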
B. Consensus ADMM
Next we present a specific form called consensus ADMM. This serves as a very useful approach to solve many problems in a distributed setting (as shown later in Section III for linear SVM). For this case, consider an optimization formulation where $f(\mathbf{w})$ in eq. 1 can be decomposed into $M$ independent parts, i.e. $\sum_{t=1}^{M} f_t(\mathbf{w}_t)$. Then the consensus ADMM problem can be written as,
$$\begin{aligned}
\min_{\mathbf{w}_1,\ldots,\mathbf{w}_M,\mathbf{z}} \quad & f_1(\mathbf{w}_1) + \ldots + f_M(\mathbf{w}_M) + g(\mathbf{z})\\
\text{subject to} \quad & A\mathbf{w}_1 + B\mathbf{z} = \mathbf{c} \qquad (5)\\
& \quad\vdots\\
& A\mathbf{w}_M + B\mathbf{z} = \mathbf{c}
\end{aligned}$$
Note that, different from eq. 1, here we solve $M$ independent sub-problems. The equality constraint is called the global consensus constraint, since it requires all the $\mathbf{w}_1 \ldots \mathbf{w}_M$ vectors to reach a consensus with a global variable $\mathbf{z}$. This results in the following ADMM steps (in scaled form; see [3] for details),
$$\begin{aligned}
\mathbf{w}_t^{k+1} &= \arg\min_{\mathbf{w}_t} \; f_t(\mathbf{w}_t) + \frac{\rho}{2}\|A\mathbf{w}_t + B\mathbf{z}^k - \mathbf{c} + \mathbf{u}_t^k\|_2^2\\
\mathbf{z}^{k+1} &= \arg\min_{\mathbf{z}} \; g(\mathbf{z}) + \frac{\rho}{2}\sum_t \|A\mathbf{w}_t^{k+1} + B\mathbf{z} - \mathbf{c} + \mathbf{u}_t^k\|_2^2 \qquad (6)\\
\mathbf{u}_t^{k+1} &= \mathbf{u}_t^k + A\mathbf{w}_t^{k+1} + B\mathbf{z}^{k+1} - \mathbf{c}
\end{aligned}$$
Compared to eq. 4, the $\mathbf{w}_t$'s in eq. 6 are updated independently and can be easily parallelized.
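For concreteness, a minimal sketch of the consensus updates of eq. 6 for the common special case $A = I$, $B = -I$, $\mathbf{c} = \mathbf{0}$ (i.e. the constraint $\mathbf{w}_t = \mathbf{z}$ used in Section III) is shown below; the callbacks `local_solvers` and `g_prox` are assumed placeholders, not part of the paper.

```python
import numpy as np

def consensus_admm(local_solvers, dim, g_prox, rho=1.0, max_iter=100):
    """Consensus ADMM of eq. 6 for the special case A = I, B = -I, c = 0.

    local_solvers[t](z, u_t) returns argmin_w f_t(w) + (rho/2)||w - z + u_t||^2.
    g_prox(v, rho*M) returns the proximal operator of g at v (placeholder)."""
    M = len(local_solvers)
    z = np.zeros(dim)
    u = [np.zeros(dim) for _ in range(M)]
    for _ in range(max_iter):
        # w-step: the M sub-problems are independent and can run in parallel
        ws = [local_solvers[t](z, u[t]) for t in range(M)]
        # z-step: prox of g applied to the average of (w_t + u_t)
        v = np.mean([ws[t] + u[t] for t in range(M)], axis=0)
        z = g_prox(v, rho * M)
        # u-step: local dual updates
        u = [u[t] + ws[t] - z for t in range(M)]
    return z
```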
III. ADMM BASED DISTRIBUTED MACHINE LEARNING ALGORITHMS
In this section we discuss how we can utilize this ADMM framework to solve many machine learning algorithms. Under the inductive setting, a typical supervised machine learning problem involves estimating a function from noisy training samples $(\mathbf{x}_i, y_i)_{i=1}^{N}$, where $N$ is the number of training samples [30], [31]. There are two common types of supervised learning problems:
- Regression, or real-valued function estimation, $y = \hat{f}(\mathbf{x})$. In this case $y \in \Re$ and $\mathbf{x} \in \Re^D$, where $D$ is the dimension of the input space. The quality of prediction/estimation is measured by a user-defined loss function $L(\hat{f}_{\mathbf{w},b}(\mathbf{x}_i), y_i)$. Typical examples include squared loss, $\epsilon$-insensitive loss, etc.
- Classification, or estimation of an indicator function, $y = \hat{f}(\mathbf{x})$. In this case $y \in \{+1,-1\}$ and $\mathbf{x} \in \Re^D$, where $D$ is the dimension of the input space. As before, the quality of prediction/estimation is measured by a user-defined loss function $L(\hat{f}_{\mathbf{w},b}(\mathbf{x}_i), y_i)$, such as logit loss, hinge loss, 0/1 loss, etc.
A common optimization problem that is solved for both of the supervised learning problems discussed above is:
$$\min_{\mathbf{w},b} \; \frac{1}{N}\sum_{i=1}^{N} L(\hat{f}_{\mathbf{w},b}(\mathbf{x}_i), y_i) + \lambda R(\mathbf{w}) \qquad (7)$$
Here $N$ is the total number of samples used to estimate the model $\hat{f}_{\mathbf{w},b}$, parameterized by $\mathbf{w} \in \Re^D$ and $b \in \Re$. In this paper, we limit ourselves to linear parameterizations where $\hat{f}_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + b$. $L$ is a convex loss which measures the discrepancy between the model estimates and their true values/labels. $R$ is a convex regularizer that penalizes the model complexity for better generalization on unseen future test samples.
In this paper we propose to solve the general class of optimization problems shown in eq. 7, and use this solver for many popular supervised machine learning algorithms (see Table I). We provide the ADMM updates for each of these algorithms and show that at the heart of the ADMM updates for eq. 7 is a QP problem arising in the $\mathbf{w}$-step, which has the following form,
$$\min_{\mathbf{w}} \; \frac{1}{2}\mathbf{w}^\top P\mathbf{w} - \mathbf{q}^\top\mathbf{w} \quad \text{subject to} \quad \mathbf{l} \le \mathbf{w} \le \mathbf{u} \qquad (8)$$
We adopt the following strategies to solve this QP:
1) Unconstrained case (i.e. $\mathbf{l} = -\infty$, $\mathbf{u} = \infty$): here we use a direct matrix inversion, $\mathbf{w} = P^{-1}\mathbf{q}$. Note that for high-dimensional problems such matrix-inversion operations could become a bottleneck; however, more advanced QP solvers can be added in future versions of this work, e.g. ones based on conjugate gradient.
2) Constrained case (i.e. $\mathbf{l}$, $\mathbf{u}$ are finite): we solve the QP using the L-BFGS method [32] and apply a warm-start strategy, i.e. we initialize with the $\mathbf{w}$ value from the previous iteration (see the sketch below).
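The following is a minimal SciPy sketch of the two strategies (illustrative only); in particular, the use of `scipy.optimize.minimize` with the bound-constrained L-BFGS-B variant and the warm-start argument `w0` are our assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import minimize

def solve_qp_unconstrained(P, q):
    """Unconstrained case of eq. 8: w = P^{-1} q (solve the system instead of an explicit inverse)."""
    return np.linalg.solve(P, q)

def solve_qp_box(P, q, l, u, w0):
    """Box-constrained case of eq. 8 via L-BFGS-B with warm start w0 (assumed setup)."""
    obj = lambda w: 0.5 * w @ P @ w - q @ w      # objective of eq. 8
    grad = lambda w: P @ w - q                   # its gradient
    res = minimize(obj, w0, jac=grad, method="L-BFGS-B",
                   bounds=list(zip(l, u)))
    return res.x
```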
Next we present the ADMM updates for the different ML
algorithms in Table I.
Table I
MACHINE LEARNING ALGORITHMS IN THE FORM OF EQ. 7

Methods | Loss Function (with $\hat{f}_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + b$) | Regularizer
L1-, L2-, L1-L2-regularized linear regression | least squares: $\frac{1}{2N}\sum_i (y_i - b - \mathbf{x}_i^\top\mathbf{w})^2$ | $\alpha\|\mathbf{w}\|_1 + (1-\alpha)\cdot\frac{1}{2}\|\mathbf{w}\|_2^2$, with $\alpha\in[0,1]$
L1-, L2-, L1-L2-regularized logistic regression | logit loss: $\frac{1}{N}\sum_i \log(1 + e^{-y_i(\mathbf{x}_i^\top\mathbf{w} + b)})$ | same as above
L1-, L2-, L1-L2-regularized linear SVM | hinge loss: $\frac{1}{N}\sum_i \left[1 - y_i(\mathbf{x}_i^\top\mathbf{w} + b)\right]_+$ | same as above
Group Lasso | $\frac{1}{2N}\sum_i (y_i - b - \mathbf{x}_i^\top\mathbf{w})^2$ | $\sum_{k=1}^{G} d_k\left[\alpha\|\mathbf{w}_k\|_2 + (1-\alpha)\cdot\frac{1}{2}\|\mathbf{w}_k\|_2^2\right]$, where $G$ := total groups, $d_k$ := size of the $k$th group

A. L1/L2 Regression
In this sub-section, we consider the more generic elastic-net regularizer [33] with the least-squares loss. The problem formulation is as follows: given input training data $(\mathbf{x}_i, y_i)_{i=1}^{N}$ with $\mathbf{x} \in \Re^D$ and $y \in \Re$, linear regression with elastic-net regularization solves the following optimization:
$$\min_{\mathbf{w}} \; \lambda\sum_{j=1}^{D}\delta_j\left[\alpha|w_j| + (1-\alpha)\cdot\frac{w_j^2}{2}\right] + \frac{1}{2N}\sum_{i=1}^{N}(y_i - \mathbf{x}_i^\top\mathbf{w})^2 \qquad (9)$$
In this form we can include the intercept in the optimization problem by augmenting a column of ones to the input samples, i.e. $\hat{\mathbf{x}} = [\mathbf{x}, 1]_{(D+1)\times 1}$, and solving for $\hat{\mathbf{w}} = [\mathbf{w}, b]_{(D+1)\times 1}$.
Note that the current form is more generic and can be easily adapted to solve both the lasso ($\alpha = 1$) and ridge regression ($\alpha = 0$), in addition to the elastic net [33]. Further, $\delta_j$ provides additional flexibility to this optimization problem:
- As discussed in [31], penalization of the intercept would make the algorithm depend on the origin chosen for $y$; we can avoid that by setting $\delta_{D+1} = 0$.
- In addition, we can incorporate a priori information into the penalization term. A special case is the group lasso, where we set $\delta_j = d_k$ for the $k$th group of size $d_k$.
Here, the ADMM formulation is given as,
$$\begin{aligned}
\min_{\mathbf{w},\mathbf{z}} \quad & \frac{1}{2N}\sum_{i=1}^{N}(y_i - \mathbf{x}_i^\top\mathbf{w})^2 + \lambda\sum_{j=1}^{D}\delta_j\left[\alpha|z_j| + (1-\alpha)\cdot\frac{z_j^2}{2}\right] \qquad (10)\\
\text{subject to} \quad & \mathbf{w} - \mathbf{z} = \mathbf{0},
\end{aligned}$$
and the corresponding updates are,
$$\begin{aligned}
\mathbf{w}^{k+1} &= \arg\min_{\mathbf{w}} \; \frac{1}{2N}\sum_{i=1}^{N}(y_i - \mathbf{x}_i^\top\mathbf{w})^2 + \frac{\rho}{2}\|\mathbf{w} - \mathbf{z}^k + \mathbf{u}^k\|_2^2 \qquad (11)\\
&= \arg\min_{\mathbf{w}} \; \frac{1}{2}\mathbf{w}^\top P\mathbf{w} - \mathbf{q}^\top\mathbf{w} \;=\; P^{-1}\mathbf{q},\\
&\quad \text{with } P = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^\top + \rho I_D \;\text{ and }\; \mathbf{q} = \frac{1}{N}\sum_{i=1}^{N}y_i\mathbf{x}_i + \rho(\mathbf{z}^k - \mathbf{u}^k),
\end{aligned}$$
$$\begin{aligned}
\mathbf{z}^{k+1} &= \arg\min_{\mathbf{z}} \; \lambda\sum_{j}\delta_j\left(\alpha|z_j| + (1-\alpha)\cdot\frac{z_j^2}{2}\right) + \frac{\rho}{2}\|\mathbf{w}^{k+1} - \mathbf{z} + \mathbf{u}^k\|_2^2 \qquad (12)\\
z_j^{k+1} &= \frac{S_{\kappa_j}(w_j^{k+1} + u_j^k)}{1 + \lambda\delta_j(1-\alpha)/\rho},
\end{aligned}$$
where $\kappa_j = \lambda\delta_j\alpha/\rho$ and $S_\kappa(t) = \left(1 - \frac{\kappa}{|t|}\right)_+ t = (t-\kappa)_+ - (-t-\kappa)_+$ is the soft-thresholding operator, and
$$\mathbf{u}^{k+1} = \mathbf{u}^k + \mathbf{w}^{k+1} - \mathbf{z}^{k+1} \qquad (13)$$
As seen above, for big-data problems the $\mathbf{w}$-step poses the main bottleneck. This can however be scaled through distributed computation of $\sum_i \mathbf{x}_i\mathbf{x}_i^\top$ and $\sum_i y_i\mathbf{x}_i$. Finally, the $\mathbf{w}$-update is transformed to a matrix inversion problem as shown in eq. 11. The $\mathbf{z}$- and $\mathbf{u}$-updates can be easily obtained as shown in eqs. 12 and 13.
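As an illustration of how the $\mathbf{w}$-step of eq. 11 can be distributed, the sketch below aggregates $P$ and $\mathbf{q}$ over a pyspark RDD of `(y, x)` records and solves the resulting $D \times D$ system on the driver; the RDD layout, function name, and use of `treeAggregate` are our assumptions rather than the paper's exact code.

```python
import numpy as np

def elastic_net_w_step(data_rdd, z, u, rho, N, D):
    """w-step of eq. 11: aggregate P = (1/N) sum_i x_i x_i^T + rho*I and
    q = (1/N) sum_i y_i x_i + rho*(z - u) over an RDD of (y, x) pairs,
    then solve the D x D linear system on the driver."""
    def seq_op(acc, record):
        P_part, q_part = acc
        y, x = record                     # x is a length-D numpy array
        return P_part + np.outer(x, x), q_part + y * x

    def comb_op(a, b):
        return a[0] + b[0], a[1] + b[1]

    zero = (np.zeros((D, D)), np.zeros(D))
    P_sum, q_sum = data_rdd.treeAggregate(zero, seq_op, comb_op)

    P = P_sum / N + rho * np.eye(D)
    q = q_sum / N + rho * (z - u)
    return np.linalg.solve(P, q)          # w^{k+1} = P^{-1} q
```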
B. Group-Lasso
Next we consider a very specific method called the Group Lasso. In this case we assume that a priori grouping information is available to form a composite weight vector of $G$ groups, denoted as
$$\mathbf{w} = [\underbrace{w_1^{(1)}\cdots w_{d_1}^{(1)}}_{\text{group 1}}, \ldots, \underbrace{w_1^{(g)}\cdots w_{d_g}^{(g)}}_{\text{group }g}, \ldots, w^{(G)}],$$
where $w_k^{(g)}$ is the $k$th feature of the $g$th group, $w^{(G)} = b$ (the intercept), and $d_g$ is the size of the $g$th group. Then the group-lasso regularized linear regression model is given by,
$$\min_{\mathbf{w}} \; \frac{1}{2N}\sum_{i=1}^{N}(y_i - \mathbf{x}_i^\top\mathbf{w})^2 + \lambda\sum_{g=1}^{G}\delta_g\left[\alpha\|\mathbf{w}_g\|_2 + (1-\alpha)\cdot\frac{1}{2}\|\mathbf{w}_g\|_2^2\right] \qquad (14)$$
In practice, we use $\delta_g = \sqrt{d_g}$ and $\delta_G = 0$ for the intercept. Following the same procedure as above, we get the $\mathbf{w}$-update and $\mathbf{u}$-update, which are exactly the same as in the elastic-net regularized case; the only difference is the $\mathbf{z}$-update, which is given as,
$$\mathbf{z}_g^{k+1} = \frac{S_{\kappa_g}(\mathbf{w}_g^{k+1} + \mathbf{u}_g^k)}{1 + \lambda\delta_g(1-\alpha)/\rho} \qquad (15)$$
where $\kappa_g = \lambda\delta_g\alpha/\rho$, and $S_\kappa$ is the block soft-thresholding operator $S_\kappa(\mathbf{t}) = \left(1 - \frac{\kappa}{\|\mathbf{t}\|_2}\right)_+ \mathbf{t}$.
C. L1/L2-Logistic Regression
In this sub-section, we switch to classification problems. Specifically, we consider the logistic regression classification method. Given input training data $(\mathbf{x}_i, y_i)_{i=1}^{N}$ with $\mathbf{x} \in \Re^D$ and $y \in \{-1,+1\}$, the logistic regression model is estimated by solving the following optimization problem:
$$\min_{\mathbf{w}} \; \frac{1}{N}\sum_{i=1}^{N}\log(1 + e^{-y_i\mathbf{x}_i^\top\mathbf{w}}) \qquad (16)$$
$$\qquad + \lambda\sum_{j=1}^{D}\delta_j\left\{\alpha|w_j| + (1-\alpha)\cdot\frac{w_j^2}{2}\right\} \qquad (17)$$
The corresponding ADMM form is as follows:
$$\begin{aligned}
\min_{\mathbf{w},\mathbf{z}} \quad & \frac{1}{N}\sum_{i=1}^{N}\log(1 + e^{-y_i\mathbf{x}_i^\top\mathbf{w}}) + \lambda\sum_{j=1}^{D}\delta_j\left\{\alpha|z_j| + (1-\alpha)\cdot\frac{z_j^2}{2}\right\} \qquad (18)\\
\text{subject to} \quad & \mathbf{w} - \mathbf{z} = \mathbf{0}
\end{aligned}$$
As before, we use $\hat{\mathbf{x}} = [\mathbf{x}, 1]_{(D+1)\times 1}$ with $\hat{\mathbf{w}} = [\mathbf{w}, b]_{(D+1)\times 1}$. The resulting ADMM updates are,
$$\begin{aligned}
\mathbf{w}^{k+1} &= \arg\min_{\mathbf{w}} \; \frac{1}{N}\sum_i \log(1 + e^{-y_i\mathbf{x}_i^\top\mathbf{w}}) + \frac{\rho}{2}\|\mathbf{w} - \mathbf{z}^k + \mathbf{u}^k\|_2^2 \qquad (19)\\
\mathbf{z}^{k+1} &= \arg\min_{\mathbf{z}} \; \lambda\sum_j \delta_j\left(\alpha|z_j| + (1-\alpha)\cdot\frac{z_j^2}{2}\right) + \frac{\rho}{2}\|\mathbf{w}^{k+1} - \mathbf{z} + \mathbf{u}^k\|_2^2\\
\mathbf{u}^{k+1} &= \mathbf{w}^{k+1} - \mathbf{z}^{k+1} + \mathbf{u}^k
\end{aligned}$$
Note that the $\mathbf{z}$- and $\mathbf{u}$-updates are the same as in eqs. 12 and 13, respectively. For the $\mathbf{w}$-step we use the Newton updates given below. Let,
$$l(\mathbf{w}) = \frac{1}{N}\sum_i \log(1 + e^{-y_i\mathbf{x}_i^\top\mathbf{w}}) + \frac{\rho}{2}\|\mathbf{w} - \mathbf{z}^k + \mathbf{u}^k\|_2^2,$$
then,
$$\nabla_\mathbf{w} l(\mathbf{w}) = -\frac{1}{N}\sum_i y_i(1-p_i)\mathbf{x}_i + \rho(\mathbf{w} - \mathbf{z}^k + \mathbf{u}^k),$$
$$\nabla^2_\mathbf{w} l(\mathbf{w}) = \frac{1}{N}\sum_i p_i(1-p_i)\mathbf{x}_i\mathbf{x}_i^\top + \rho I,$$
where $p_i = 1/(1 + e^{-y_i\mathbf{w}^\top\mathbf{x}_i})$. Hence, the optimal $\mathbf{w}^{k+1}$ can be obtained through the iterative Algorithm 1.
Algorithm 1: Iterative Newton algorithm for $\mathbf{w}^{k+1}$
Input: $\mathbf{w}^k, \mathbf{z}^k, \mathbf{u}^k$
Output: $\mathbf{w}^{k+1}$
initialize $\mathbf{v}^{(0)} \leftarrow \mathbf{w}^k$, $j \leftarrow 0$;
while not converged do
    $p_i^{(j)} \leftarrow 1/(1 + e^{-y_i\mathbf{x}_i^\top\mathbf{v}^{(j)}})$;
    $P^{(j)} \leftarrow \frac{1}{N}\sum_i p_i^{(j)}(1 - p_i^{(j)})\mathbf{x}_i\mathbf{x}_i^\top + \rho I$;
    $\mathbf{q}^{(j)} \leftarrow -\frac{1}{N}\sum_i y_i(1 - p_i^{(j)})\mathbf{x}_i + \rho(\mathbf{v}^{(j)} - \mathbf{z}^k + \mathbf{u}^k)$;
    $\mathbf{v}^{(j+1)} \leftarrow \mathbf{v}^{(j)} - (P^{(j)})^{-1}\mathbf{q}^{(j)}$ (distributed);
    $j \leftarrow j + 1$;
return $\mathbf{w}^{k+1} \leftarrow \mathbf{v}^{(j)}$;
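An illustrative pyspark rendering of Algorithm 1 is sketched below; the `(y, x)` RDD layout, the fixed number of Newton iterations in place of the convergence test, and the use of `treeAggregate` are our assumptions.

```python
import numpy as np

def logreg_w_step(data_rdd, w_k, z_k, u_k, rho, N, D, n_newton=10):
    """Newton iterations of Algorithm 1 for the w-step of eq. 19.
    Each iteration aggregates the Hessian P and gradient q over an RDD
    of (y, x) records, then takes a Newton step on the driver."""
    v = w_k.copy()
    for _ in range(n_newton):
        def seq_op(acc, record):
            P_part, g_part = acc
            y, x = record
            p = 1.0 / (1.0 + np.exp(-y * x.dot(v)))      # p_i^{(j)}
            P_part += p * (1.0 - p) * np.outer(x, x)     # Hessian contribution
            g_part += -y * (1.0 - p) * x                 # gradient contribution
            return P_part, g_part

        def comb_op(a, b):
            return a[0] + b[0], a[1] + b[1]

        P_sum, g_sum = data_rdd.treeAggregate(
            (np.zeros((D, D)), np.zeros(D)), seq_op, comb_op)
        P = P_sum / N + rho * np.eye(D)
        q = g_sum / N + rho * (v - z_k + u_k)
        v = v - np.linalg.solve(P, q)                    # Newton update
    return v
```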
D. Linear SVM
Finally we show how to use this same framework to solve the linear SVM. Note that a detailed analysis of distributed SVM using ADMM has already been presented in [15]. However, even though the technicalities are similar, we solve a slightly different problem (hinge loss + elastic net), and show it for completeness. Further, different from [15], [34], we use an L-BFGS approach to solve each sub-problem, as discussed next.
The SVM problem formulation is provided next. Given input training data $(\mathbf{x}_i, y_i)_{i=1}^{N}$ with $\mathbf{x} \in \Re^D$ and $y \in \{-1,+1\}$, the elastic-net regularized linear SVM solves the following optimization problem:
$$\min_{\mathbf{w}} \; \frac{C}{N}\sum_{i=1}^{N}(1 - y_i\mathbf{x}_i^\top\mathbf{w})_+ + \sum_{j=1}^{D}\delta_j\left\{\alpha|w_j| + (1-\alpha)\cdot\frac{w_j^2}{2}\right\} \qquad (20)$$
As before, $\alpha \in [0,1]$ controls the trade-off between L1 and L2 regularization, and $\delta_D = 0$ is used to avoid regularization of the intercept. Unlike the previous models, the hinge loss in the SVM is nonsmooth. To tackle this issue, we use consensus ADMM and instead solve smaller SVM-like sub-problems, as also shown in [15]. The advantage of this approach is that each smaller SVM-like sub-problem can now be solved in the dual space using a QP solver. This is shown next.
The consensus ADMM formulation for the problem is,
$$\begin{aligned}
\min_{\mathbf{w}_1,\ldots,\mathbf{w}_M,\mathbf{z}} \quad & \frac{C}{N}\sum_{t=1}^{M}\sum_{i\in B_t}(1 - y_i\mathbf{x}_i^\top\mathbf{w}_t)_+ + \sum_{j}\delta_j\left\{\alpha|z_j| + (1-\alpha)\cdot\frac{z_j^2}{2}\right\} \qquad (21)\\
\text{subject to} \quad & \mathbf{w}_t - \mathbf{z} = \mathbf{0}, \quad t = 1,\ldots,M
\end{aligned}$$
and the corresponding updates are,
$$\begin{aligned}
\mathbf{w}_t^{k+1} &= \arg\min_{\mathbf{w}_t} \; \frac{C}{N}\sum_{i\in B_t}(1 - y_i\mathbf{x}_i^\top\mathbf{w}_t)_+ + \frac{\rho}{2}\|\mathbf{w}_t - \mathbf{z}^k + \mathbf{u}_t^k\|_2^2, \quad t = 1,\ldots,M \qquad (22)\\
z_j^{k+1} &= \frac{S_{\kappa_j}\!\left(\frac{1}{M}\sum_{t=1}^{M}(w_{tj}^{k+1} + u_{tj}^k)\right)}{1 + \lambda\delta_j(1-\alpha)/\rho}, \quad j = 1,\ldots,D\\
\mathbf{u}_t^{k+1} &= \mathbf{u}_t^k + \mathbf{w}_t^{k+1} - \mathbf{z}^{k+1}, \quad t = 1,\ldots,M
\end{aligned}$$
Note that now the $\mathbf{w}$-update is an SVM-like problem on a subset $B_t$. This can be solved in the dual form, as shown next.
For each subset $B_t$,
$$\begin{aligned}
\mathbf{w}_t^{k+1} = \arg\min_{\mathbf{w}_t} \quad & \frac{C}{N}\sum_{i\in B_t}\xi_i + \frac{\rho}{2}\|\mathbf{w}_t - \mathbf{z}^k + \mathbf{u}_t^k\|_2^2\\
\text{s.t.} \quad & y_i\mathbf{x}_i^\top\mathbf{w} \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i \in B_t \qquad (23)
\end{aligned}$$
This transforms to the following constrained QP,
$$\begin{aligned}
\min_{\boldsymbol{\alpha}} \quad & \frac{1}{2}\boldsymbol{\alpha}^\top P\boldsymbol{\alpha} + \mathbf{q}^\top\boldsymbol{\alpha} \qquad (24)\\
\text{s.t.} \quad & 0 \le \alpha_i \le C/N, \;\; i \in B_t
\end{aligned}$$
with,
$$P_{ij} = y_iy_j\mathbf{x}_i^\top\mathbf{x}_j, \qquad q_i = y_i\mathbf{x}_i^\top(\mathbf{z}^k - \mathbf{u}_t^k) - 1$$
We use L-BFGS to solve the above QP and finally obtain,
$$\mathbf{w}_t^{k+1} = \frac{1}{\rho}\sum_{i\in B_t}\alpha_i^{k+1}y_i\mathbf{x}_i + \mathbf{z}^k - \mathbf{u}_t^k \qquad (25)$$
as the final SVM solution. This can also be used to accommodate non-linear SVMs following [35].
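A minimal SciPy sketch of solving the dual QP of eq. 24 and recovering $\mathbf{w}_t^{k+1}$ via eq. 25 is shown below; the paper states that L-BFGS with warm start is used, and the choice of `scipy.optimize.minimize` with the bound-constrained L-BFGS-B variant here is our assumption.

```python
import numpy as np
from scipy.optimize import minimize

def svm_subproblem_w_step(X_t, y_t, z, u_t, C, N, rho, alpha0=None):
    """w_t-step of eq. 22 solved via the box-constrained dual QP of eq. 24.
    X_t, y_t hold the samples of partition B_t; alpha0 is an optional warm start."""
    m = X_t.shape[0]
    v = z - u_t
    Yx = y_t[:, None] * X_t                              # rows y_i x_i
    P = Yx @ Yx.T                                        # P_ij = y_i y_j x_i^T x_j
    q = y_t * (X_t @ v) - 1.0                            # q_i = y_i x_i^T (z - u_t) - 1

    obj = lambda a: 0.5 * a @ P @ a + q @ a
    grad = lambda a: P @ a + q
    a0 = np.zeros(m) if alpha0 is None else alpha0
    res = minimize(obj, a0, jac=grad, method="L-BFGS-B",
                   bounds=[(0.0, C / N)] * m)            # 0 <= alpha_i <= C/N
    alpha = res.x
    return v + (alpha * y_t) @ X_t / rho                 # recover w_t via eq. 25
```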
IV. EXPERIMENTS AND RESULTS
Next we provide a performance comparison of our implemented algorithms with the publicly available MLLIB library packaged with Apache Spark 1.3.
A. System Configuration
The Hadoop cluster configuration for our experiments is provided below:
- No. of nodes = 6 (Hadoop version: Apache 1.1.1)
- No. of cores (per node) = 12 (Intel Xeon @ 3.20 GHz)
- RAM size (per node) = 32 GB
- Hard disk size (per node) = 500 GB
For implementation we use the Python interface (pyspark) already available in [28]. Further, our Spark framework has been configured based on the recommendations available at [36], with the following settings (an illustrative way of applying them is sketched after the list):
- spark.num.executors = 17
- spark.executor.memory = 6 GB
- spark.driver.memory = 4 GB
- spark.driver.maxResultSize = 4 GB
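For reference, a hedged sketch of applying settings like those above programmatically from pyspark is shown below; the property name `spark.executor.instances` (the standard equivalent of the executor count listed above) and the application name are our assumptions, and in practice driver-side settings are usually supplied at submit time rather than after the driver JVM has started.

```python
from pyspark import SparkConf, SparkContext

# Illustrative configuration mirroring the listed settings; note that
# spark.driver.memory generally must be set at launch time (e.g., via
# spark-submit) to take effect.
conf = (SparkConf()
        .setAppName("admm-ml")
        .set("spark.executor.instances", "17")
        .set("spark.executor.memory", "6g")
        .set("spark.driver.memory", "4g")
        .set("spark.driver.maxResultSize", "4g"))
sc = SparkContext(conf=conf)
```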
B. Datasets
We generate synthetic datasets of different sizes for our experiments. The datasets are generated to capture the sparsity as well as the grouping behavior of the different methods. The dataset used for the classification methods is described below.

Dataset for Classification Problems: In this case $\mathbf{x} \in \Re^D$ is generated from a multivariate normal distribution $\mathcal{N}(\mathbf{0}, \Sigma)$. Here the correlation matrix $\Sigma$ is block diagonal and controls the grouping properties of the problem. We fix the number of variables per group to 10, so each diagonal block of $\Sigma$ is $10 \times 10$ with ones on the diagonal. The pairwise correlation within each group is 0.2, and that between groups is 0. The $y$-value (class label) is generated as shown below,
$$y = \mathrm{sign}(\mathbf{w}^\top\mathbf{x} + \varepsilon) \qquad (26)$$
where $\mathbf{w}$ controls the sparsity of the problem. For this paper we set the sparsity parameter to 0.8, i.e. 80% of the groups have a zero weight vector. For the remaining 20%, the weights within each group alternate between +1 and -1, i.e.
$$\mathbf{w} = [\underbrace{1,-1,1,-1,1,-1,1,-1,1,-1}_{\text{group 1}}, \ldots, \underbrace{1,-1,1,-1,1,-1,1,-1,1,-1}_{\text{group }g}, \underbrace{0,0,0,\ldots}_{\text{remaining 80\% sparse groups}}]$$
Further, we add Gaussian noise $\varepsilon \sim \mathcal{N}(0,1)$ to the model. The above settings are used to generate two separate datasets of different sizes:
- No. of training samples (N) = 2,000,000 and dimension of each sample (D) = 100,
- No. of training samples (N) = 20,000,000 and dimension of each sample (D) = 100.
The generated data is saved in comma-separated format, which takes up approximately 5 GB and 50 GB of disk space, respectively.
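An illustrative NumPy sketch of the described generator is given below; the function name, random seed, and exact block-covariance construction are our rendering of the description above, not the paper's code.

```python
import numpy as np

def make_classification_data(N, D=100, group_size=10, sparsity=0.8, rho_corr=0.2, seed=0):
    """Generate the synthetic classification data described above:
    block-diagonal correlation (0.2 within a group, 0 across groups),
    80% of groups with zero weights, alternating +1/-1 weights otherwise,
    and labels y = sign(w^T x + eps) with eps ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    G = D // group_size
    # Block-diagonal correlation matrix
    block = np.full((group_size, group_size), rho_corr)
    np.fill_diagonal(block, 1.0)
    Sigma = np.kron(np.eye(G), block)
    X = rng.multivariate_normal(np.zeros(D), Sigma, size=N)
    # Sparse weight vector: the first 20% of groups get alternating +1/-1 weights
    w = np.zeros(D)
    n_active = int(round((1.0 - sparsity) * G))
    for g in range(n_active):
        w[g * group_size:(g + 1) * group_size] = [(-1) ** k for k in range(group_size)]
    y = np.sign(X @ w + rng.standard_normal(N))
    return X, y, w
```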
Dataset for Regression Problems: The generation of this data follows exactly the same procedure as for the classification problems, except that the $y$-values are generated as
$$y = \mathbf{w}^\top\mathbf{x} + 2 + \varepsilon \qquad (27)$$
As before, we use two separate datasets of different sizes:
- No. of training samples (N) = 2,000,000 and dimension of each sample (D) = 100,
- No. of training samples (N) = 20,000,000 and dimension of each sample (D) = 100.

Table II
COMPUTATION TIME COMPARISON BETWEEN ADMM AND MLLIB (IN SEC) FOR CLASSIFICATION METHODS

Methods | ADMM | MLLIB
Data set size = 5 GB with N = 2,000,000, D = 100
L2-logistic regression (λ = 0.1, α = 0) | 157.57 (0.04) | 139.68 (2.06)
L1-logistic regression (λ = 0.1, α = 1) | 157.05 (1.54) | 266.9 (169.16)
L1+L2-logistic regression (λ = 0.1, α = 0.5) | 155.2 (1.23) | Not available
Data set size = 50 GB with N = 20,000,000, D = 100
L2-logistic regression (λ = 0.1, α = 0) | 13937.3 (10.34) | 14045.7 (411.78)
L1-logistic regression (λ = 0.1, α = 1) | 15381.8 (5.59) | 13155.2 (307.60)
L1+L2-logistic regression (λ = 0.1, α = 0.5) | 15472.1 (13.25) | Not available
C. Results
Here we provide a comparison of the computation times of our ADMM implementation vs. MLLIB for both classification and regression problems. In general, the computation time of the ADMM-based methods depends heavily on a number of parameters, such as the $\rho$-update (see [3], [13]) and the convergence criteria. For simplicity, we follow the $\rho$-update suggested in eq. 3.13 of [3]. Further, our current stopping criterion declares convergence when the primal and dual residuals jointly fall below a tolerance of $10^{-3}$ (following [3]). On the other hand, MLLIB does not provide any control over the convergence criteria; hence for our experiments we keep the default settings. Tables II and III provide the average computation times (in seconds) over three runs of the experiment for the classification and regression problems, respectively. The standard deviations are provided in parentheses. In the current version of the paper our results are limited to L1/L2-logistic and L1/L2-linear regression. Additional results for L1/L2-SVM and Group Lasso shall be provided in an extended version of the paper.
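For concreteness, a sketch of the residual-based stopping test and an adaptive $\rho$-update in the spirit of eq. 3.13 of [3] is given below; the constants $\mu = 10$ and $\tau = 2$ are the commonly cited defaults from [3] and are not values reported in this paper.

```python
import numpy as np

def update_rho(rho, r_norm, s_norm, mu=10.0, tau=2.0):
    """Adaptive rho-update following [3]: grow rho when the primal residual
    dominates, shrink it when the dual residual dominates.
    If rho changes, the scaled dual variable u should be rescaled accordingly (see [3])."""
    if r_norm > mu * s_norm:
        return rho * tau
    if s_norm > mu * r_norm:
        return rho / tau
    return rho

def converged(w, z, z_old, u, rho, tol=1e-3):
    """Stopping test: primal residual r = w - z and dual residual s = rho*(z - z_old)
    must both fall below the tolerance (10^-3 in our experiments)."""
    r_norm = np.linalg.norm(w - z)
    s_norm = rho * np.linalg.norm(z - z_old)
    return r_norm < tol and s_norm < tol
```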
Based on our results in Tables II and III, the ADMM implementation performs similarly to MLLIB in terms of computation speed, except for the regression problems². For the regression problems we report the computation time for one iteration of the ADMM updates; this approximate solution still outperformed MLLIB's solution in terms of accuracy. Hence, the current ADMM-based framework provides a viable alternative to the SGD-based approach implemented in MLLIB. In addition, this framework supports a wide range of scalable ML algorithms, which can prove a useful arsenal for data scientists tackling big-data problems.

²The MLLIB package distributed with Spark 1.3 provides an incorrect implementation of the original logistic regression algorithm. A correction has been made in the latest Spark 1.4 version (see [37]); this has not been included in this paper. However, Spark 1.3's implementation can still be considered an approximate comparison representative of the SGD approach. Further, the convergence criteria for MLLIB cannot be controlled. In terms of accuracy, for both classification and regression problems, the MLLIB tool provided sub-optimal solutions.

Table III
COMPUTATION TIME COMPARISON BETWEEN ADMM AND MLLIB (IN SEC) FOR REGRESSION METHODS

Methods | ADMM | MLLIB
Data set size = 5 GB with N = 2,000,000, D = 100
L2-linear regression (λ = 0.1, α = 0) | 425.09 (17.26) | 429.83 (190.02)
L1-linear regression (λ = 0.1, α = 1) | 416.95 (3.5) | 444.79 (210.22)
L1+L2-linear regression (λ = 0.1, α = 0.5) | 409.50 (2.5) | Not available
Data set size = 50 GB with N = 20,000,000, D = 100
L2-linear regression (λ = 0.1, α = 0) | 4209.95 (10.5) | 29244.43 (100.12)
L1-linear regression (λ = 0.1, α = 1) | 4233.13 (6.89) | 23526.45 (200.23)
L1+L2-linear regression (λ = 0.1, α = 0.5) | 4150.81 (10.25) | Not available
V. CONCLUSION
In this paper we present a generic convex optimization problem covering most ML algorithms. We identify ADMM as a viable approach to solve this generic convex optimization problem, and derive the ADMM updates specific to each ML algorithm listed in Table I. The current paper provides the update steps for linear parameterizations; however, they can be easily extended to non-linear cases following [14]. As shown in Section III, at the heart of the ADMM updates lies a QP which can be solved in a distributed fashion. Our results show that this ADMM-based approach performs similarly to the publicly available MLLIB in terms of computation speed. This presents ADMM as a viable alternative to MLLIB for big-data problems, with the added advantage of a wider range of machine learning algorithms.

Finally, we note that the current implementation is limited by the dimension of the problem, as it needs to solve a QP in the $\mathbf{w}$-update. This motivates the need for future research towards scalable options for the QP problem. In addition, there has been a gamut of research towards newer ADMM update strategies for faster convergence of the algorithm [3], [13]. These advanced strategies have not been included in this version of the paper and can be extended as future work.
ACKNOWLEDGMENT
The authors would like to thank Juergen Heit from the Research and Technology Center, Robert Bosch LLC, for multiple discussions on the Spark configurations for the experimental settings. They would also like to thank Max Rizvanov for his support with the Hadoop cluster management.
REFERENCES
[1] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-
free approach to parallelizing stochastic gradient descent,” in
Advances in Neural Information Processing Systems, 2011,
pp. 693–701.
[2] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola, “Paral-
lelized stochastic gradient descent,” in Advances in Neural
Information Processing Systems, 2010, pp. 2595–2603.
[3] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[4] D. Gabay and B. Mercier, “A dual algorithm for the solution
of nonlinear variational problems via finite element approxi-
mation,” Computers & Mathematics with Applications, vol. 2,
no. 1, pp. 17–40, 1976.
[5] T. Goldstein, B. ODonoghue, and S. Setzer, “Fast alternating
direction optimization methods,” CAM report, pp. 12–35,
2012.
[6] R. Glowinski and A. Marroco, “Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires,” ESAIM: Mathematical Modelling and Numerical Analysis - Modélisation Mathématique et Analyse Numérique, vol. 9, no. R2, pp. 41–76, 1975.
[7] D. Mahajan, S. S. Keerthi, S. Sundararajan, and L. Bottou,
“A functional approximation based distributed learning algo-
rithm,” arXiv preprint arXiv:1310.8418, 2013.
[8] O. Shamir, N. Srebro, and T. Zhang, “Communication effi-
cient distributed optimization using an approximate newton-
type method,” arXiv preprint arXiv:1312.7853, 2013.
[9] C. H. Teo, S. Vishwanthan, A. J. Smola, and Q. V. Le, “Bundle methods for regularized risk minimization,” The Journal of Machine Learning Research, vol. 11, pp. 311–365, 2010.
[10] X. Zhang, “Probabilistic methods for distributed learning,”
Ph.D. dissertation, Duke University, 2014.
[11] A. Agarwal and J. C. Duchi, “Distributed delayed stochastic
optimization,” in Advances in Neural Information Processing
Systems, 2011, pp. 873–881.
[12] X. Meng, J. K. Bradley, B. Yavuz, E. R. Sparks,
S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, M. Amde,
S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh,
M. Zaharia, and A. Talwalkar, “Mllib: Machine learning in
apache spark,” CoRR, vol. abs/1505.06807, 2015. [Online].
Available: http://arxiv.org/abs/1505.06807
[13] R. Nishihara, L. Lessard, B. Recht, A. Packard, and M. I. Jordan, “A general analysis of the convergence of ADMM,” ArXiv e-prints, Feb. 2015.
[14] V. Sindhwani and H. Avron, “High-performance Kernel Ma-
chines with Implicit Distributed Optimization and Random-
ization,” ArXiv e-prints, Sep. 2014.
[15] C. Zhang, H. Lee, and K. G. Shin, “Efficient distributed linear
classification algorithms via the alternating direction method
of multipliers,” in International Conference on Artificial In-
telligence and Statistics, 2012, pp. 1398–1406.
[16] MATLAB, version 8.5 (R2015a). Natick, Massachusetts:
The MathWorks Inc., 2015.
[17] R Core Team, R: A Language and Environment for Statistical
Computing, R Foundation for Statistical Computing, Vienna,
Austria, 2013. [Online]. Available: http://www.R-project.org/
[18] M. R. Berthold, N. Cebron, F. Dill, T. R. Gabriel, T. Kötter, T. Meinl, P. Ohl, C. Sieb, K. Thiel, and B. Wiswedel, “KNIME: The Konstanz Information Miner,” in Studies in Classification, Data Analysis, and Knowledge Organization (GfKL 2007). Springer, 2007.
[19] “Rapidminer,” https://rapidminer.com/, accessed: 2015-06-30.
[20] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: An update,” SIGKDD Explorations, vol. 11, no. 1, pp. 10–18, 2009. [Online]. Available: http://www.sigkdd.org/explorations/issues/11-1-2009-07/p2V11n1.pdf
[21] “Revolution R,” http://www.revolutionanalytics.com/revolution-r-enterprise, accessed: 2015-06-30.
[22] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and
J. M. Hellerstein, “Graphlab: A new parallel framework for
machine learning,” in Conference on Uncertainty in Artificial
Intelligence (UAI), Catalina Island, California, July 2010.
[23] “Oracle data miner,” http://www.oracle.com, accessed: 2015-
06-30.
[24] “Hp vertica,” http://www.vertica.com/, accessed: 2015-06-30.
[25] “Pivotal,” http://pivotal.io/, accessed: 2015-06-30.
[26] “Revolution Analytics RHadoop,” https://github.com/RevolutionAnalytics/RHadoop/wiki, accessed: 2015-06-30.
[27] Apache Software Foundation. Apache mahout:: Scalable
machine-learning and data-mining library. [Online].
Available: http://mahout.apache.org
[28] “Apache spark,” https://spark.apache.org/, accessed: 2015-06-
30.
[29] “Alpine data labs,” http://alpinenow.com/, accessed: 2015-06-
30.
[30] V. Cherkassky and F. M. Mulier, Learning from Data: Con-
cepts, Theory, and Methods. Wiley-IEEE Press, 2007.
[31] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of
Statistical Learning, ser. Springer Series in Statistics. New
York, NY, USA: Springer New York Inc., 2001.
[32] D. P. Bertsekas, Nonlinear Programming. Belmont, MA:
Athena Scientific, 1999.
[33] H. Zou and T. Hastie, “Regularization and variable selection
via the elastic net,” Journal of the Royal Statistical Society,
Series B, vol. 67, pp. 301–320, 2005.
[34] C.-Y. Lin, C.-H. Tsai, C.-P. Lee, and C.-J. Lin, “Large-scale
logistic regression and linear support vector machines using
spark,” in Big Data (Big Data), 2014 IEEE International
Conference on. IEEE, 2014, pp. 519–528.
[35] A. Rahimi and B. Recht, “Random features for large-scale
kernel machines,” in Advances in neural information process-
ing systems, 2007, pp. 1177–1184.
[36] “How-to: Tune your Apache Spark jobs (part 2),” http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/, accessed: 2015-06-30.
[37] “MLlib (Spark) question,” https://www.mail-archive.com/user@spark.apache.org/msg32244.html, accessed: 2015-06-30.
Article
An interdisciplinary framework for learning methodologies-covering statistics, neural networks, and fuzzy logic, this book provides a unified treatment of the principles and methods for learning dependencies from data. It establishes a general conceptual framework in which various learning methods from statistics, neural networks, and fuzzy logic can be applied-showing that a few fundamental principles underlie most new methods being proposed today in statistics, engineering, and computer science. Complete with over one hundred illustrations, case studies, and examples making this an invaluable text.