On Markov Chain Gradient Descent
Tao Sun, Yuejiao Sun, and Wotao Yin
National University of Defense Technology & University of California, Los Angeles
1. Markov Chain Gradient Descent
We consider a kind of stochastic algorithm, developed on the trajectory of a Markov chain, for solving the finite-sum minimization problem (and its continuous generalization, i.e., population risk minimization)

$$\operatorname*{minimize}_{x\in X\subseteq\mathbb{R}^d}\quad f(x)\equiv\frac{\sum_{i=1}^{M}f_i(x)}{M},\qquad(1)$$
where $X\subseteq\mathbb{R}^d$ is a closed convex set, and each $f_i$ is convex, or nonconvex but differentiable. The algorithm is proposed to overcome two drawbacks of implementing traditional stochastic gradient descent: 1. direct sampling is difficult, so a Markov chain is used for sampling; 2. the data are stored distributedly on different machines, which are connected by a graph. The iteration of MCGD is, therefore, modeled as
$$x^{k+1}=\operatorname{Proj}_X\!\big(x^k-\gamma_k\hat\nabla f_{j_k}(x^k)\big),\qquad(2)$$

where $(j_k)_{k\ge0}$ is a trajectory of a Markov chain on $\{1,2,\dots,M\}$ that has a uniform stationary distribution, and $\hat\nabla f_i$ denotes a subgradient of $f_i$ if $f_i$ is convex, and the gradient of $f_i$ if $f_i$ is nonconvex but differentiable.
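To make scheme (2) concrete, here is a minimal Python sketch (ours, not the authors' code; the callables `grad`, `next_state`, and `proj` are assumptions standing in for the problem data):

```python
import numpy as np

def mcgd(grad, next_state, x0, num_iters, proj=lambda x: x, q=0.501):
    """Scheme (2): gradient steps along a single Markov-chain trajectory.

    grad(j, x)    -- a (sub)gradient of f_j at x
    next_state(j) -- draws j_{k+1} from row j of the transition matrix
    proj          -- projection onto the closed convex set X
    """
    x, j = np.asarray(x0, dtype=float).copy(), 0  # start the chain at state 0
    for k in range(1, num_iters + 1):
        gamma = 1.0 / k**q                  # stepsize gamma_k = 1/k^q, 1/2 < q < 1
        x = proj(x - gamma * grad(j, x))    # x^{k+1} = Proj_X(x^k - gamma_k grad)
        j = next_state(j)                   # advance the Markov chain one step
    return x
```

Only one gradient is evaluated per iteration, and consecutive samples $j_k, j_{k+1}$ are dependent; this is exactly what distinguishes MCGD from i.i.d.-sampling SGD.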
2. Numerical results
We present two kinds of numerical results. The first shows that MCGD uses fewer samples to train both a convex model and a nonconvex model. The second demonstrates the advantage of the faster mixing of a non-reversible Markov chain.
2.1 Comparison with SGD
Let us compare:

1. MCGD, where $j_k$ is taken from one trajectory of the Markov chain;
2. SGD$_T$, for $T=1,2,4,8,16,32$, where each $j_k$ is the $T$-th sample of a fresh, independent trajectory. All trajectories are generated by starting from the same state 0.

To compute $T$ gradients, SGD$_T$ uses $T$ times as many samples as MCGD. We did not try to adapt $T$ as $k$ increases because there is no theoretical guidance for doing so.
In the first test, we recover a vector $u$ from an auto-regressive process. Set the matrix $A$ as a subdiagonal matrix with random entries $A_{i,i-1}\overset{\text{i.i.d.}}{\sim}\mathcal{U}[0.8,0.99]$. Randomly sample a vector $u\in\mathbb{R}^d$, $d=50$, with unit 2-norm. Our data $(\xi_t^1,\xi_t^2)_{t=1}^{\infty}$ are generated according to the following auto-regressive process:

$$\xi_t^1=A\xi_{t-1}^1+e_1W_t,\qquad W_t\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,1);$$

$$\bar\xi_t^2=\begin{cases}1,&\text{if }\langle u,\xi_t^1\rangle>0,\\0,&\text{otherwise;}\end{cases}\qquad
\xi_t^2=\begin{cases}\bar\xi_t^2,&\text{with probability }0.8,\\1-\bar\xi_t^2,&\text{with probability }0.2.\end{cases}$$
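Under these definitions, the data stream can be simulated in a few lines (a sketch under our naming assumptions; `sample_chain` is not from the poster):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
A = np.zeros((d, d))
A[np.arange(1, d), np.arange(d - 1)] = rng.uniform(0.8, 0.99, size=d - 1)  # A_{i,i-1}
u = rng.standard_normal(d)
u /= np.linalg.norm(u)                                   # unit 2-norm

def sample_chain(T, rng=rng):
    """Yield T samples (xi1_t, xi2_t) of the auto-regressive Markov chain."""
    xi1 = np.zeros(d)
    e1 = np.zeros(d); e1[0] = 1.0
    for _ in range(T):
        xi1 = A @ xi1 + e1 * rng.standard_normal()       # xi1_t = A xi1_{t-1} + e1 W_t
        bar = 1.0 if u @ xi1 > 0 else 0.0                # thresholded label
        xi2 = bar if rng.random() < 0.8 else 1.0 - bar   # flip with probability 0.2
        yield xi1.copy(), xi2
```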
Clearly, $(\xi_t^1,\xi_t^2)_{t=1}^{\infty}$ forms a Markov chain. Let $\Pi$ denote the stationary distribution of this Markov chain. We recover $u$ as the solution to the following problem:

$$\operatorname*{minimize}_{x}\quad\mathbb{E}_{(\xi^1,\xi^2)\sim\Pi}\,\ell(x;\xi^1,\xi^2).$$
We consider both convex and nonconvex loss functions, which had not been done before in the literature. The convex one is the logistic loss

$$\ell(x;\xi^1,\xi^2)=-\xi^2\log\big(\sigma(\langle x,\xi^1\rangle)\big)-(1-\xi^2)\log\big(1-\sigma(\langle x,\xi^1\rangle)\big),$$

where $\sigma(t)=\frac{1}{1+\exp(-t)}$.
Figure 1: Comparisons of MCGD and SGD$_T$ for $T=1,2,4,8,16,32$; $\bar x^k$ is the average of $x^1,\dots,x^k$. (Four panels plot $f(\bar x^k)-f(x^*)$ for the convex and nonconvex cases against both the number of gradient evaluations and the number of samples.)
And the nonconvex one is taken as

$$\ell(x;\xi^1,\xi^2)=\tfrac{1}{2}\big(\sigma(\langle x,\xi^1\rangle)-\xi^2\big)^2.$$
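Both losses have closed-form gradients, which is what MCGD evaluates at each iteration. A sketch (ours, derived by standard calculus from the definitions above):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_logistic(x, xi1, xi2):
    """Gradient of the convex logistic loss: (sigma(<x,xi1>) - xi2) * xi1."""
    return (sigmoid(x @ xi1) - xi2) * xi1

def grad_squared(x, xi1, xi2):
    """Gradient of the nonconvex loss 0.5*(sigma(<x,xi1>) - xi2)^2,
    i.e. (sigma - xi2) * sigma * (1 - sigma) * xi1 by the chain rule."""
    s = sigmoid(x @ xi1)
    return (s - xi2) * s * (1.0 - s) * xi1
```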
We choose $\gamma_k=\frac{1}{k^q}$ as our stepsize, where $q=0.501$. This choice is consistent with our theory below. Our results in Figure 1 are surprisingly positive on MCGD, more so than we had expected. As expected, MCGD used significantly fewer total samples than SGD$_T$ for every $T$. But it is surprising that MCGD did not even need more gradient evaluations. Randomly generated data must have helped homogenize the samples over the different states, making it less important for a trajectory to converge. It is important to note that SGD$_1$ and SGD$_2$, as well as SGD$_4$ in the nonconvex case, stagnate at noticeably lower accuracies because their $T$ values are too small for convergence.
2.2 Comparison of reversible and non-reversible Markov chains for MCGD

We also compare the convergence of MCGD when working with reversible and non-reversible Markov chains. It is well known that transforming a reversible Markov chain into a non-reversible one can significantly accelerate the mixing process. This technique also helps accelerate the convergence of MCGD.
In our experiment, we first construct an undirected connected graph with $n=20$ nodes and randomly generated edges. Let $G$ denote the adjacency matrix of the graph, that is,

$$G_{i,j}=\begin{cases}1,&\text{if }i,j\text{ are connected;}\\0,&\text{otherwise.}\end{cases}$$

Let $d_{\max}$ be the maximum degree of a node. The transition probability of the reversible Markov chain, known as the Metropolis-Hastings Markov chain, is then defined by

$$P_{i,j}=\begin{cases}\frac{1}{d_{\max}},&\text{if }j\ne i,\ G_{i,j}=1;\\[2pt]1-\frac{\sum_{j\ne i}G_{i,j}}{d_{\max}},&\text{if }j=i;\\[2pt]0,&\text{otherwise.}\end{cases}$$
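This construction is easy to code; a sketch (ours) from the adjacency matrix:

```python
import numpy as np

def metropolis_hastings(G):
    """Reversible transition matrix P from a 0/1 adjacency matrix G
    (symmetric, zero diagonal)."""
    n = G.shape[0]
    d_max = G.sum(axis=1).max()          # maximum node degree
    P = G / d_max                        # P_ij = 1/d_max where G_ij = 1, j != i
    P[np.arange(n), np.arange(n)] = 1.0 - G.sum(axis=1) / d_max  # self-loop mass
    return P
```

Rows sum to one by construction.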
Figure 2: Comparison of reversible and non-reversible Markov chains, plotting $f(x^k)-f^*$ against the iteration count. The second largest eigenvalues of the reversible and non-reversible Markov chains are 0.75 and 0.66, respectively.
Obviously, $P$ is symmetric and its stationary distribution is uniform. The non-reversible Markov chain is constructed by adding cycles. The edges of these cycles are directed; let $V$ denote the adjacency matrix of these cycles. If $V_{i,j}=1$, then $V_{j,i}=0$. Let $w_0>0$ be the weight of the flows along these cycles. We then construct the transition probability of the non-reversible Markov chain as

$$Q_{i,j}=\frac{W_{i,j}}{\sum_l W_{i,l}},$$

where $W=d_{\max}P+w_0V$.
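A sketch of the construction of $Q$ (ours; `P`, `V`, `d_max`, `w0` as defined above):

```python
import numpy as np

def nonreversible(P, V, d_max, w0):
    """Non-reversible transition matrix: Q_ij = W_ij / sum_l W_il,
    where W = d_max * P + w0 * V adds directed cycle flows to P."""
    W = d_max * P + w0 * V
    return W / W.sum(axis=1, keepdims=True)
```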
In our experiment, we add 5 cycles of length 4, with edges existing in $G$; $w_0$ is set to $\frac{d_{\max}}{2}$. We test MCGD on a least squares problem. First, we select $\beta^*\sim\mathcal{N}(0,I_d)$ with $d=10$; then, for each node $i$, we generate $x_i\sim\mathcal{N}(0,I_d)$ and $y_i=x_i^T\beta^*$. The objective function is defined as

$$f(\beta)=\frac{1}{2}\sum_{i=1}^{n}(x_i^T\beta-y_i)^2.$$
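Putting the pieces together, one run of MCGD on this problem might look like the following sketch (ours; the seeding and `run_mcgd` are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 10
X = rng.standard_normal((n, d))      # one row x_i per node
beta_star = rng.standard_normal(d)   # beta* ~ N(0, I_d)
y = X @ beta_star                    # y_i = x_i^T beta*

def run_mcgd(Q, num_iters, q=0.501, rng=rng):
    """The chain with transition matrix Q walks the graph; each step uses
    only the gradient of the local term 0.5*(x_j^T beta - y_j)^2."""
    beta, j = np.zeros(d), 0
    errors = []
    for k in range(1, num_iters + 1):
        beta -= ((X[j] @ beta - y[j]) * X[j]) / k**q    # gamma_k = 1/k^q
        j = rng.choice(n, p=Q[j])                       # next node from row j of Q
        errors.append(0.5 * np.sum((X @ beta - y)**2))  # f(beta); here f* = 0
    return beta, errors
```

Running this with both the reversible $P$ and the non-reversible $Q$ reproduces the kind of comparison shown in Figure 2.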
The convergence results are depicted in Figure 2.
3. Convergence analysis
Assumption 1
The Markov chain $(X_k)_{k\ge0}$ is time-homogeneous, irreducible, and aperiodic. It has transition matrix $P$ and stationary distribution $\pi^*$.
3.1 Convex cases
Assumption 2
The set Xis assumed to be convex and compact.
Convergence of MCGD in the convex case
Let Assumptions 1 and 2 hold and let $(x^k)_{k\ge0}$ be generated by scheme (2). Assume that $f_i$, $i\in[M]$, are convex functions, and the stepsizes satisfy

$$\sum_k\gamma_k=+\infty,\qquad\sum_k\ln k\cdot\gamma_k^2<+\infty.\qquad(3)$$

Then we have $\lim_k\mathbb{E}f(x^k)=f^*$.
The stepsize requirement (3) is nearly identical to that of SGD and subgradient algorithms. In the theorem above, we use the stepsize setting $\gamma_k=O(\frac{1}{k^q})$ with $\frac{1}{2}<q<1$. This kind of stepsize requirement also works for SGD and subgradient algorithms. The convergence rate of MCGD is $O(\frac{1}{\sum_{i=1}^k\gamma_i})=O(\frac{1}{k^{1-q}})$, which is also the same as that of SGD and subgradient algorithms for $\gamma_k=O(\frac{1}{k^q})$.
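To see why $\gamma_k=\frac{1}{k^q}$ with $\frac{1}{2}<q<1$ satisfies (3), note that

$$\sum_k\frac{1}{k^q}=+\infty\ \ (q<1),\qquad\sum_k\frac{\ln k}{k^{2q}}<+\infty\ \ (2q>1),$$

where the second series converges because $\ln k\le k^{\epsilon}$ for any $\epsilon>0$ and all large $k$, so its terms are eventually dominated by $k^{-(2q-\epsilon)}$ with $2q-\epsilon>1$.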
3.2 Nonconvex cases
Assumption 3
The gradients of $f_i$ are assumed to be bounded; i.e., there exists $D>0$ such that $\|\nabla f_i(x)\|\le D$, $i\in[M]$.
Nonconvex case
Let Assumptions 1 and 3 hold, let $X$ be the full space, and let $(x^k)_{k\ge0}$ be generated by MCGD. Also assume each $f_i$ is differentiable with $L$-Lipschitz gradient $\nabla f_i$, and the stepsizes satisfy

$$\sum_k\gamma_k=+\infty,\qquad\sum_k\ln^2k\cdot\gamma_k^2<+\infty.\qquad(4)$$

Then we have $\lim_k\mathbb{E}\|\nabla f(x^k)\|=0$.
Compared with MCGD in the convex case, the stepsize requirements of nonconvex MCGD become slightly stronger: in the summable part, we need $\sum_k\ln^2k\cdot\gamma_k^2<+\infty$ rather than $\sum_k\ln k\cdot\gamma_k^2<+\infty$. Nevertheless, we can still use $\gamma_k=O(\frac{1}{k^q})$ for $\frac{1}{2}<q<1$.
4. Convergence analysis for continuous state space

When the state space $\Xi$ is a continuum, there are infinitely many possible states. In this case, we consider an infinite-state Markov chain that is time-homogeneous and reversible. We aim to solve the following problem:

$$\operatorname*{minimize}_{x\in X\subseteq\mathbb{R}^n}\quad\mathbb{E}_\xi\big(F(x;\xi)\big)=\int_\Xi F(x;\xi)\,d\Pi(\xi),\qquad(5)$$

using the algorithm

$$x^{k+1}=\operatorname{Proj}_X\!\big(x^k-\gamma_k\hat\nabla F(x^k;\xi_k)\big).\qquad(6)$$
Convex cases
Assume $F(\cdot;\xi)$ is convex for each $\xi\in\Xi$. Let the stepsizes satisfy (3) and let $(x^k)_{k\ge0}$ be generated by Algorithm (6). Let $F^*:=\min_{x\in X}\mathbb{E}_\xi(F(x;\xi))$. Assume that for any $\xi\in\Xi$: $|F(x;\xi)-F(y;\xi)|\le L\|x-y\|$; $\sup_{x\in X,\xi\in\Xi}\{\|\hat\nabla F(x;\xi)\|\}\le D$; $\mathbb{E}_\xi\hat\nabla F(x;\xi)\in\partial\,\mathbb{E}_\xi F(x;\xi)$; and $\sup_{x,y\in X,\xi\in\Xi}|F(x;\xi)-F(y;\xi)|\le H$. Then we have

$$\lim_k\mathbb{E}\big(\mathbb{E}_\xi(F(x^k;\xi))-F^*\big)=0.$$
Non-convex cases
Let the stepsizes satisfy (4) and let $(x^k)_{k\ge0}$ be generated by Algorithm (6) with $X$ being the full space. Assume that for any $\xi\in\Xi$, $F(x;\xi)$ is differentiable and $\|\nabla F(x;\xi)-\nabla F(y;\xi)\|\le L\|x-y\|$. In addition, $\sup_{x\in X,\xi\in\Xi}\{\|\nabla F(x;\xi)\|\}<+\infty$ and $\mathbb{E}_\xi\nabla F(x;\xi)=\nabla\mathbb{E}_\xi F(x;\xi)$. Then we have

$$\lim_k\mathbb{E}\|\nabla\,\mathbb{E}_\xi(F(x^k;\xi))\|=0.$$