
# On Markov Chain Gradient Descent

Authors:
Tao Sun, Yuejiao Sun, and Wotao Yin
National University of Defense Technology & University of California, Los Angeles
We consider a kind of stochastic algorithm, developed on the trajectory of a Markov chain, for solving the finite-sum minimization problem (and its continuous generalization, i.e., population risk minimization)
$$\operatorname*{minimize}_{x \in X \subseteq \mathbb{R}^d} \; f(x) = \frac{\sum_{i=1}^{M} f_i(x)}{M}, \qquad (1)$$
where $X \subseteq \mathbb{R}^d$ is a closed convex set, and each $f_i$ is convex, or nonconvex but differentiable. The algorithm is proposed to overcome two drawbacks of implementing traditional stochastic gradient descent: 1. direct sampling is difficult, so a Markov chain is used for sampling; 2. the data are stored distributedly on different machines that are connected by a graph. The iteration of MCGD is therefore modeled as
$$x^{k+1} = \operatorname{Proj}_X\!\big(x^k - \gamma_k \hat\nabla f_{j_k}(x^k)\big), \qquad (2)$$
where $(j_k)_{k \ge 0}$ is a trajectory of a Markov chain on $\{1, 2, \ldots, M\}$ that has a uniform stationary distribution, and $\hat\nabla f_i$ denotes a subgradient of $f_i$ if $f_i$ is convex, and the gradient of $f_i$ if $f_i$ is nonconvex but differentiable.
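As a concrete illustration of iteration (2), here is a minimal sketch in Python. The toy setup is ours, not the poster's: quadratic components $f_i$, a Euclidean-ball constraint set, and a uniform transition matrix (i.i.d. sampling viewed as a special Markov chain).

```python
import numpy as np

def mcgd(grad_fns, P, proj, x0, num_iters, rng=None):
    """Markov chain gradient descent sketch: at step k, take a (sub)gradient
    step along component f_{j_k}, where (j_k) follows the transition matrix P."""
    rng = np.random.default_rng() if rng is None else rng
    M = P.shape[0]
    j = rng.integers(M)                      # arbitrary initial state of the chain
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        gamma = 1.0 / (k + 1) ** 0.501       # stepsize gamma_k = 1/k^q with q = 0.501
        x = proj(x - gamma * grad_fns[j](x)) # projected step along f_{j_k}
        j = rng.choice(M, p=P[j])            # advance the chain: j_{k+1} ~ P[j_k, :]
    return x

# Toy instance: f_i(x) = 0.5 * ||x - a_i||^2, X = Euclidean ball of radius 10.
rng = np.random.default_rng(0)
anchors = rng.standard_normal((4, 3))
grads = [lambda x, a=a: x - a for a in anchors]
P = np.full((4, 4), 0.25)                    # uniform chain (i.i.d. sampling)
proj = lambda x: x if np.linalg.norm(x) <= 10 else 10 * x / np.linalg.norm(x)
x_hat = mcgd(grads, P, proj, np.zeros(3), 5000, rng)
# x_hat approaches the mean of the anchors, the minimizer of the average loss
```

The same routine works with any row-stochastic `P`; only the sampling of `j` changes between MCGD and ordinary SGD.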
2. Numerical results
We present two kinds of numerical results. The first shows that MCGD uses fewer samples to train both a convex model and a nonconvex model. The second demonstrates the advantage of the faster mixing of a non-reversible Markov chain.
2.1 Comparison with SGD
Let us compare:
1. MCGD, where $j_k$ is taken from one trajectory of the Markov chain;
2. SGD$_T$, for $T = 1, 2, 4, 8, 16, 32$, where each $j_k$ is the $T$th sample of a fresh, independent trajectory. All trajectories are generated by starting from the same state 0.
To compute the same number of gradients, SGD$_T$ uses $T$ times as many samples as MCGD. We did not try to adapt $T$ as $k$ increases because there is no theoretical guidance for doing so.
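The sampling-cost gap between the two schemes can be made explicit in a short sketch (the function and variable names below are ours):

```python
import numpy as np

def sgd_T_index(P, T, rng):
    """Draw one index for SGD_T: start a fresh chain at state 0 and return
    its T-th sample; this consumes T chain samples per gradient evaluation."""
    j = 0                                    # every trajectory starts from state 0
    for _ in range(T):
        j = rng.choice(P.shape[0], p=P[j])   # advance the fresh trajectory
    return j

# Sample-count accounting after k iterations: MCGD consumes one chain sample
# per gradient, while SGD_T consumes T samples per gradient.
rng = np.random.default_rng(1)
P = np.array([[0.5, 0.5], [0.5, 0.5]])
k, T = 100, 8
samples_mcgd = k
samples_sgd_T = k * T
```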
In the first test, we recover a vector $u$ from an auto-regressive process, which closely resembles the first experiment. Set matrix $A$ as a subdiagonal matrix with random entries $A_{i,i-1} \overset{\text{i.i.d.}}{\sim} \mathcal{U}[0.8, 0.99]$. Randomly sample a vector $u \in \mathbb{R}^d$, $d = 50$, with unit 2-norm. Our data $(\xi_t^1, \xi_t^2)_{t=1}^{\infty}$ are generated according to the following auto-regressive process:
$$\xi_t^1 = A \xi_{t-1}^1 + e_1 W_t, \qquad W_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1),$$
$$\bar\xi_t^2 = \begin{cases} 1, & \text{if } \langle u, \xi_t^1 \rangle > 0, \\ 0, & \text{otherwise;} \end{cases} \qquad
\xi_t^2 = \begin{cases} \bar\xi_t^2, & \text{with probability } 0.8, \\ 1 - \bar\xi_t^2, & \text{with probability } 0.2. \end{cases}$$
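A minimal simulation of this data stream, following the setup above (the function name and the reduced dimensions used in the demo call are our own):

```python
import numpy as np

def make_ar_chain(d=50, n_steps=1000, seed=0):
    """Generate (xi1_t, xi2_t) from the auto-regressive process:
    xi1_t = A xi1_{t-1} + e1 * W_t, with labels flipped w.p. 0.2."""
    rng = np.random.default_rng(seed)
    A = np.diag(rng.uniform(0.8, 0.99, size=d - 1), k=-1)  # subdiagonal A_{i,i-1}
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                   # unit 2-norm vector to be recovered
    e1 = np.zeros(d)
    e1[0] = 1.0
    xi1 = np.zeros(d)
    data = []
    for _ in range(n_steps):
        xi1 = A @ xi1 + e1 * rng.standard_normal()  # W_t ~ N(0, 1)
        label = 1.0 if u @ xi1 > 0 else 0.0         # noiseless label xi2_bar
        if rng.random() < 0.2:                      # flip with probability 0.2
            label = 1.0 - label
        data.append((xi1.copy(), label))
    return u, data

u, data = make_ar_chain(d=5, n_steps=200)
```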
Clearly, $(\xi_t^1, \xi_t^2)_{t=1}^{\infty}$ forms a Markov chain. Let $\Pi$ denote the stationary distribution of this Markov chain. We recover $u$ as the solution to the following problem:
$$\operatorname*{minimize}_{x} \; \mathbb{E}_{(\xi^1, \xi^2) \sim \Pi} \, \ell(x; \xi^1, \xi^2).$$
We consider both convex and nonconvex loss functions, which had not been done before in the literature. The convex one is the logistic loss
$$\ell(x; \xi^1, \xi^2) = -\xi^2 \log\big(\sigma(\langle x, \xi^1 \rangle)\big) - (1 - \xi^2) \log\big(1 - \sigma(\langle x, \xi^1 \rangle)\big),$$
where $\sigma(t) = \frac{1}{1 + \exp(-t)}$.
Figure 1: Comparisons of MCGD and SGD$_T$ for $T = 1, 2, 4, 8, 16, 32$. $\bar{x}^k$ is the average of $x^1, \ldots, x^k$. (The four panels plot $f(\bar{x}^k) - f(x^*)$ against the number of samples, for the convex and nonconvex cases.)
And the nonconvex one is taken as
$$\ell(x; \xi^1, \xi^2) = \tfrac{1}{2}\big(\sigma(\langle x, \xi^1 \rangle) - \xi^2\big)^2.$$
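Both losses have closed-form gradients in $x$, which is what MCGD evaluates at each step. A sketch (the function names are ours):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_grad(x, xi1, xi2):
    """Gradient in x of the convex logistic loss
    -xi2*log(sigma(<x,xi1>)) - (1-xi2)*log(1-sigma(<x,xi1>))."""
    return (sigmoid(x @ xi1) - xi2) * xi1

def squared_sigmoid_grad(x, xi1, xi2):
    """Gradient in x of the nonconvex loss 0.5*(sigma(<x,xi1>) - xi2)^2,
    via the chain rule with sigma'(z) = sigma(z)*(1 - sigma(z))."""
    s = sigmoid(x @ xi1)
    return (s - xi2) * s * (1.0 - s) * xi1
```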
We choose $\gamma_k = \frac{1}{k^q}$ as our stepsize, where $q = 0.501$. This choice is consistent with our theory below. Our results in Figure 1 are surprisingly positive on MCGD, even more so than we expected. As expected, MCGD used significantly fewer total samples than SGD$_T$ for every $T$. But it is surprising that MCGD did not need even more gradient evaluations. Randomly generated data must have helped homogenize the samples over the different states, making it less important for a trajectory to converge. It is important to note that SGD1 and SGD2, as well as SGD4 in the nonconvex case, stagnate at noticeably lower accuracies because their $T$ values are too small for convergence.
2.2 Comparison of reversible and non-reversible Markov chains for MCGD
We also compare the convergence of MCGD when working with reversible and non-reversible Markov chains. It is well known that transforming a reversible Markov chain into a non-reversible one can significantly accelerate mixing. This technique also helps to accelerate the convergence of MCGD.
In our experiment, we first construct an undirected connected graph with $n = 20$ nodes and randomly generated edges. Let $G$ denote the adjacency matrix of the graph, that is,
$$G_{i,j} = \begin{cases} 1, & \text{if } i, j \text{ are connected;} \\ 0, & \text{otherwise.} \end{cases}$$
Let $d_{\max}$ be the maximum degree of a node. Select $d = 10$ and sample $\beta \sim \mathcal{N}(0, I_d)$. The transition probability of the reversible Markov chain, known as the Metropolis-Hastings Markov chain, is then defined by
$$P_{i,j} = \begin{cases} \frac{1}{d_{\max}}, & \text{if } j \ne i \text{ and } G_{i,j} = 1; \\ 1 - \sum_{j \ne i} \frac{G_{i,j}}{d_{\max}}, & \text{if } j = i; \\ 0, & \text{otherwise.} \end{cases}$$
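This construction can be sketched as follows; the 4-node graph below is our own hypothetical example, whereas the experiment uses a random graph with $n = 20$ nodes.

```python
import numpy as np

def metropolis_hastings_P(G):
    """Reversible Metropolis-Hastings transition matrix on the graph G:
    probability 1/d_max on each edge, remaining mass on the diagonal."""
    d_max = G.sum(axis=1).max()               # maximum degree d_max
    P = G / d_max                             # P_ij = 1/d_max when G_ij = 1, j != i
    np.fill_diagonal(P, 0.0)                  # adjacency has no self-loops
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))  # P_ii = 1 - sum_{j!=i} G_ij / d_max
    return P

G = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = metropolis_hastings_P(G)
# On any undirected graph, P is symmetric and doubly stochastic,
# so its stationary distribution is uniform.
```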
Figure 2: Comparison of reversible and non-reversible Markov chains, plotting $f(x^k) - f^*$ against iteration. The second largest eigenvalues of the reversible and non-reversible Markov chains are 0.75 and 0.66, respectively.
Obviously, $P$ is symmetric and its stationary distribution is uniform. The non-reversible Markov chain is constructed by adding cycles. The edges of these cycles are directed; let $V$ denote the adjacency matrix of these cycles. If $V_{i,j} = 1$, then $V_{j,i} = 0$. Let $w_0 > 0$ be the weight of the flows along these cycles. We then construct the transition probability of the non-reversible Markov chain as
$$Q_{i,j} = \frac{W_{i,j}}{\sum_l W_{i,l}},$$
where $W = d_{\max} P + w_0 V$.
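A sketch of this construction on a hypothetical 3-node complete graph with one directed cycle. (For this particular toy the second-largest eigenvalue modulus need not actually drop; the size of the mixing speedup depends on the graph and the cycle weights.)

```python
import numpy as np

def nonreversible_Q(P, V, w0, d_max):
    """Row-normalize W = d_max * P + w0 * V to obtain the non-reversible chain Q."""
    W = d_max * P + w0 * V
    return W / W.sum(axis=1, keepdims=True)

def second_largest_eig_modulus(T):
    """Modulus of the second-largest eigenvalue of a transition matrix;
    a smaller value indicates faster mixing."""
    mods = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
    return mods[1]

# Metropolis-Hastings chain on the complete 3-graph (d_max = 2),
# plus the directed cycle 0 -> 1 -> 2 -> 0 with weight w0.
P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
V = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
Q = nonreversible_Q(P, V, w0=1.0, d_max=2)
# Q is row-stochastic and asymmetric, hence non-reversible w.r.t. the
# uniform distribution, which remains stationary here (Q is doubly stochastic).
```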
In our experiment, we add 5 cycles of length 4, with edges existing in $G$; $w_0$ is set to $\frac{d_{\max}}{2}$. We test MCGD on a least-squares problem. First, we select $\beta \sim \mathcal{N}(0, I_d)$; then, for each node $i$, we generate $x_i \sim \mathcal{N}(0, I_d)$ and $y_i = x_i^T \beta$. The objective function is defined as
$$f(\beta) = \frac{1}{2} \sum_{i=1}^{n} (x_i^T \beta - y_i)^2.$$
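The least-squares experiment can be sketched end to end as follows. Two simplifications are our own: a simple random walk on a ring of nodes stands in for the randomly generated graph, and the stepsize $\gamma_k = O(1/k^q)$ is scaled by a small constant so the earliest iterations stay stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 10
beta = rng.standard_normal(d)              # ground truth beta ~ N(0, I_d)
X = rng.standard_normal((n, d))            # one feature vector x_i per node
y = X @ beta                               # noiseless targets y_i = x_i^T beta

# MCGD: walk over the nodes and step along the visited node's gradient.
beta_hat = np.zeros(d)
j = 0
for k in range(50000):
    gamma = 0.05 / (k + 1) ** 0.501        # scaled gamma_k = O(1/k^q), q = 0.501
    grad = (X[j] @ beta_hat - y[j]) * X[j] # gradient of 0.5*(x_j^T b - y_j)^2
    beta_hat -= gamma * grad               # X is the full space: no projection
    j = (j + rng.choice([-1, 1])) % n      # advance the walk to a ring neighbor
# beta_hat drifts toward beta, since the noiseless system is consistent
```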
The convergence results are depicted in Figure 2.
3. Convergence analysis
Assumption 1
The Markov chain $(X_k)_{k \ge 0}$ is time-homogeneous, irreducible, and aperiodic. It has a transition matrix $P$ and stationary distribution $\pi$.
3.1 Convex cases
Assumption 2
The set $X$ is assumed to be convex and compact.
Convergence of MCGD in the convex cases
Let Assumptions 1 and 2 hold and $(x^k)_{k \ge 0}$ be generated by scheme (2). Assume that the $f_i$, $i \in [M]$, are convex functions, and the stepsizes satisfy
$$\sum_k \gamma_k = +\infty, \qquad \sum_k \ln k \cdot \gamma_k^2 < +\infty. \qquad (3)$$
Then, we have $\lim_k \mathbb{E} f(x^k) = f^*$.
The stepsize requirement (3) is nearly identical to that of SGD and subgradient algorithms. In the theorem above, we use the stepsize setting $\gamma_k = O(\frac{1}{k^q})$ with $\frac{1}{2} < q < 1$. This kind of stepsize requirement also works for SGD and subgradient algorithms. The convergence rate of MCGD is
$$O\Big(\frac{1}{\sum_{i=1}^{k} \gamma_i}\Big) = O\Big(\frac{1}{k^{1-q}}\Big),$$
which is the same as that of SGD and subgradient algorithms.
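To see why $\gamma_k = O(\frac{1}{k^q})$ with $\frac{1}{2} < q < 1$ meets requirement (3), both conditions can be checked directly (a standard series argument, not taken from the poster):

```latex
\sum_{k} \gamma_k \;\asymp\; \sum_{k} \frac{1}{k^{q}} \;=\; +\infty
\quad \text{since } q < 1,
\qquad
\sum_{k} \ln k \cdot \gamma_k^2 \;\asymp\; \sum_{k} \frac{\ln k}{k^{2q}} \;<\; +\infty
\quad \text{since } 2q > 1,
```

where the second series converges because $\ln k \le C_\varepsilon k^\varepsilon$ for any $\varepsilon \in (0, 2q - 1)$, so its terms are bounded by a constant times $k^{-(2q - \varepsilon)}$ with $2q - \varepsilon > 1$. The same choice also satisfies the stronger condition $\sum_k \ln^2 k \cdot \gamma_k^2 < +\infty$ by the identical argument.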
3.2 Nonconvex cases
Assumption 3
The gradients of the $f_i$ are assumed to be bounded; i.e., there exists $D > 0$ such that $\|\nabla f_i(x)\| \le D$, $i \in [M]$.
Nonconvex case
Let Assumptions 1 and 3 hold, let $X$ be the full space, and let $(x^k)_{k \ge 0}$ be generated by MCGD. Also assume each $f_i$ is differentiable with $\nabla f_i$ being $L$-Lipschitz, and the stepsizes satisfy
$$\sum_k \gamma_k = +\infty, \qquad \sum_k \ln^2 k \cdot \gamma_k^2 < +\infty. \qquad (4)$$
Then, we have $\lim_k \mathbb{E} \|\nabla f(x^k)\| = 0$.
Compared with MCGD in the convex case, the stepsize requirements of nonconvex MCGD become slightly stronger: in the summable part, we need $\sum_k \ln^2 k \cdot \gamma_k^2 < +\infty$ rather than $\sum_k \ln k \cdot \gamma_k^2 < +\infty$. Nevertheless, we can still use $\gamma_k = O(\frac{1}{k^q})$ for $\frac{1}{2} < q < 1$.
5. Convergence analysis for continuous state space
When the state space $\Xi$ is a continuum, there are infinitely many possible states. In this case, we consider an infinite-state Markov chain that is time-homogeneous and reversible. We aim to solve the following problem:
$$\operatorname*{minimize}_{x \in X \subseteq \mathbb{R}^n} \; \mathbb{E}_\xi\big(F(x; \xi)\big) = \int F(x; \xi) \, d\Pi(\xi), \qquad (5)$$
with the algorithm
$$x^{k+1} = \operatorname{Proj}_X\!\big(x^k - \gamma_k \hat\nabla F(x^k; \xi_k)\big). \qquad (6)$$
Convex cases
Assume $F(\cdot\,; \xi)$ is convex for each $\xi \in \Xi$. Let the stepsizes satisfy (3) and $(x^k)_{k \ge 0}$ be generated by Algorithm (6). Let $F^* := \min_{x \in X} \mathbb{E}_\xi(F(x; \xi))$. Assume that for any $\xi \in \Xi$, $|F(x; \xi) - F(y; \xi)| \le L \|x - y\|$, $\sup_{x \in X,\, \xi \in \Xi} \{\|\hat\nabla F(x; \xi)\|\} \le D$, $\mathbb{E}_\xi \hat\nabla F(x; \xi) \in \partial\, \mathbb{E}_\xi F(x; \xi)$, and $\sup_{x, y \in X,\, \xi \in \Xi} |F(x; \xi) - F(y; \xi)| \le H$. Then we have
$$\lim_k \mathbb{E}\big(\mathbb{E}_\xi(F(x^k; \xi)) - F^*\big) = 0.$$
Non-convex cases
Let the stepsizes satisfy (4) and $(x^k)_{k \ge 0}$ be generated by Algorithm (6) with $X$ being the full space. Assume that for any $\xi \in \Xi$, $F(x; \xi)$ is differentiable, and $\|\nabla F(x; \xi) - \nabla F(y; \xi)\| \le$