
On Markov Chain Gradient Descent

Tao Sun, Yuejiao Sun, and Wotao Yin

National University of Defense Technology & University of California, Los Angeles

1. Markov Chain Gradient Descent

We consider a kind of stochastic algorithm, developed on the trajectory of a Markov chain, for solving the finite-sum minimization problem (and its continuous generalization, i.e., population risk minimization)

    minimize_{x ∈ X ⊆ R^d}   f(x) ≡ (1/M) Σ_{i=1}^M f_i(x),    (1)

where X ⊆ R^d is a closed convex set, and each f_i is either convex, or nonconvex but differentiable. The algorithm is proposed to overcome two drawbacks of implementing traditional stochastic gradient descent: 1. direct sampling is difficult, so a Markov chain is used for the sampling; 2. the data are stored distributedly on different machines, which are connected by a graph. The iteration of MCGD is, therefore, modeled as

    x^{k+1} = Proj_X( x^k − γ_k ∇̂f_{j_k}(x^k) ),    (2)

where (j_k)_{k≥0} is a trajectory of a Markov chain on {1, 2, . . . , M} that has a uniform stationary distribution, and ∇̂f_i denotes the subgradient of f_i if f_i is convex, and the gradient of f_i if f_i is nonconvex but differentiable.
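As a concrete illustration, iteration (2) can be sketched in a few lines of Python. Here `grad`, `sample_next_state` (the Markov-chain transition rule), the stepsize schedule `gamma`, and the projection `proj` are all user-supplied assumptions, not part of the original algorithm statement:

```python
import numpy as np

def mcgd(grad, sample_next_state, x0, j0, gamma, num_iters, proj=lambda x: x):
    """Markov Chain Gradient Descent, a sketch of iteration (2).

    grad(j, x)           -- (sub)gradient of f_j at x
    sample_next_state(j) -- draws j_{k+1} from the Markov chain at state j
    gamma(k)             -- stepsize gamma_k
    proj                 -- projection onto the constraint set X
    """
    x, j = np.asarray(x0, dtype=float), j0
    for k in range(num_iters):
        x = proj(x - gamma(k) * grad(j, x))
        j = sample_next_state(j)   # advance one step along the chain trajectory
    return x
```

For example, with f_i(x) = (1/2)(x − a_i)² and uniform i.i.d. sampling (a special case of a Markov chain with uniform stationary distribution), the iterates approach the mean of the a_i.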

2. Numerical results

We present two kinds of numerical results. The first shows that MCGD uses fewer samples to train both a convex model and a nonconvex model. The second demonstrates the advantage of the faster mixing of a non-reversible Markov chain.

2.1 Comparison with SGD

Let us compare:

1. MCGD, where j_k is taken from one trajectory of the Markov chain;

2. SGD_T, for T = 1, 2, 4, 8, 16, 32, where each j_k is the T-th sample of a fresh, independent trajectory. All trajectories are generated by starting from the same state 0.

To compute T gradients, SGD_T uses T times as many samples as MCGD. We did not try to adapt T as k increases because theoretical guidance is lacking.

In the first test, we recover a vector u from an auto-regressive process, which closely resembles the first experiment. Set the matrix A as a subdiagonal matrix with random entries A_{i,i−1} i.i.d. ∼ U[0.8, 0.99]. Randomly sample a vector u ∈ R^d, d = 50, with unit 2-norm. Our data (ξ¹_t, ξ²_t)_{t=1}^∞ are generated according to the following auto-regressive process:

    ξ¹_t = A ξ¹_{t−1} + e₁ W_t,   W_t i.i.d. ∼ N(0, 1);

    ξ̄²_t = 1 if ⟨u, ξ¹_t⟩ > 0, and ξ̄²_t = 0 otherwise;

    ξ²_t = ξ̄²_t with probability 0.8, and ξ²_t = 1 − ξ̄²_t with probability 0.2.

Clearly, (ξ¹_t, ξ²_t)_{t=1}^∞ forms a Markov chain. Let Π denote the stationary distribution of this Markov chain. We recover u as the solution to the following problem:

    minimize_x  E_{(ξ¹,ξ²)∼Π} ℓ(x; ξ¹, ξ²).
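The data-generation recipe above can be simulated directly. The sketch below follows the stated process, with the chain initialized at ξ¹₀ = 0 (an assumption, since the initialization is not specified in the text):

```python
import numpy as np

def generate_ar_chain(T, d=50, seed=0):
    """Simulate the auto-regressive Markov chain (xi1_t, xi2_t) of Section 2.1."""
    rng = np.random.default_rng(seed)
    # Subdiagonal matrix A with entries A[i, i-1] ~ U[0.8, 0.99]
    A = np.diag(rng.uniform(0.8, 0.99, size=d - 1), k=-1)
    # Ground-truth vector u with unit 2-norm
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    e1 = np.zeros(d)
    e1[0] = 1.0
    xi1 = np.zeros(d)          # assumed initialization xi1_0 = 0
    samples = []
    for _ in range(T):
        xi1 = A @ xi1 + e1 * rng.standard_normal()   # xi1_t = A xi1_{t-1} + e1 W_t
        label = 1.0 if u @ xi1 > 0 else 0.0          # noiseless label bar(xi2)_t
        xi2 = label if rng.random() < 0.8 else 1.0 - label  # flip with prob. 0.2
        samples.append((xi1.copy(), xi2))
    return u, samples
```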

We consider both convex and nonconvex loss functions, which had not been done before in the literature. The convex one is the logistic loss

    ℓ(x; ξ¹, ξ²) = −ξ² log(σ(⟨x, ξ¹⟩)) − (1 − ξ²) log(1 − σ(⟨x, ξ¹⟩)),

where σ(t) = 1/(1 + exp(−t)).

[Figure 1: Comparisons of MCGD and SGD_T for T = 1, 2, 4, 8, 16, 32. Four log-log panels plot the error f(x̄^k) − f(x*) against the number of gradient evaluations and the number of samples, for the convex and nonconvex cases; x̄^k is the average of x¹, . . . , x^k.]

And the nonconvex one is taken as

    ℓ(x; ξ¹, ξ²) = (1/2) (σ(⟨x, ξ¹⟩) − ξ²)².
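Both losses admit simple closed-form gradients in x, using the identity σ′(t) = σ(t)(1 − σ(t)); a minimal sketch:

```python
import numpy as np

def sigmoid(t):
    """Logistic function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def logistic_loss_grad(x, xi1, xi2):
    """Gradient of the convex logistic loss at x for one sample (xi1, xi2)."""
    return (sigmoid(xi1 @ x) - xi2) * xi1

def squared_sigmoid_loss_grad(x, xi1, xi2):
    """Gradient of the nonconvex loss 0.5 * (sigmoid(<x, xi1>) - xi2)^2."""
    s = sigmoid(xi1 @ x)
    return (s - xi2) * s * (1.0 - s) * xi1
```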

We choose γ_k = 1/k^q as our stepsize, where q = 0.501. This choice is consistent with our theory below. Our results in Figure 1 are surprisingly positive for MCGD, even more so than we had expected. As expected, MCGD used significantly fewer total samples than SGD_T for every T. But it is surprising that MCGD did not even need more gradient evaluations. Randomly generated data must have helped homogenize the samples over the different states, making it less important for a trajectory to converge. It is important to note that SGD1 and SGD2, as well as SGD4 in the nonconvex case, stagnate at noticeably lower accuracies because their T values are too small for convergence.

2.2 Comparison of reversible and non-reversible Markov chains for MCGD

We also compare the convergence of MCGD when working with reversible and non-reversible Markov chains. It is well known that transforming a reversible Markov chain into a non-reversible Markov chain can significantly accelerate the mixing process. This technique also helps to accelerate the convergence of MCGD.

In our experiment, we first construct an undirected connected graph with n = 20 nodes and randomly generated edges. Let G denote the adjacency matrix of the graph, that is,

    G_{i,j} = 1 if i and j are connected, and G_{i,j} = 0 otherwise.

Let d_max be the maximum number of outgoing edges of a node. Select d = 10 and compute β* ∼ N(0, I_d). The transition probability of the reversible Markov chain, known as the Metropolis-Hastings Markov chain, is then defined by

    P_{i,j} = 1/d_max                        if j ≠ i and G_{i,j} = 1;
    P_{i,j} = 1 − (Σ_{j≠i} G_{i,j})/d_max    if j = i;
    P_{i,j} = 0                              otherwise.
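This Metropolis-Hastings construction can be coded directly from the adjacency matrix; a minimal sketch, assuming G has a zero diagonal:

```python
import numpy as np

def metropolis_hastings_P(G):
    """Transition matrix of the reversible chain from adjacency matrix G.

    P[i, j] = 1/d_max for each neighbor j of i, and the leftover mass
    1 - deg(i)/d_max stays at i, so P is symmetric and the uniform
    distribution is stationary.
    """
    d_max = G.sum(axis=1).max()                      # maximum node degree
    P = G / d_max                                    # off-diagonal entries
    np.fill_diagonal(P, 1.0 - G.sum(axis=1) / d_max) # self-loop mass
    return P
```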

[Figure 2: Comparison of reversible and irreversible Markov chains: f(β^k) − f* versus iteration, on a log-log scale. The second largest eigenvalues of the reversible and non-reversible Markov chains are 0.75 and 0.66, respectively.]

Obviously, P is symmetric and its stationary distribution is uniform. The non-reversible Markov chain is constructed by adding cycles. The edges of these cycles are directed, and we let V denote the adjacency matrix of these cycles. If V_{i,j} = 1, then V_{j,i} = 0. Let w₀ > 0 be the weight of the flows along these cycles. Then we construct the transition probability of the non-reversible Markov chain as follows:

    Q_{i,j} = W_{i,j} / Σ_l W_{i,l},   where W = d_max · P + w₀ V.
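The construction of Q can be sketched as follows, together with a helper for the second largest eigenvalue modulus, which governs the mixing rate compared in Figure 2. The graph and cycle used in the usage check are a hypothetical 4-node example, not the n = 20 graph of the experiment:

```python
import numpy as np

def nonreversible_Q(P, V, d_max, w0):
    """Non-reversible chain: row-normalize W = d_max * P + w0 * V."""
    W = d_max * P + w0 * V
    return W / W.sum(axis=1, keepdims=True)

def second_largest_eig_modulus(Q):
    """Second largest eigenvalue modulus of a transition matrix Q."""
    return np.sort(np.abs(np.linalg.eigvals(Q)))[-2]
```

When every node has exactly one outgoing and one incoming cycle edge, all rows and columns of W sum to d_max + w₀, so Q stays doubly stochastic and the uniform stationary distribution is preserved.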

In our experiment, we add 5 cycles of length 4, with edges existing in G; w₀ is set to d_max/2. We test MCGD on a least squares problem. First, we select β* ∼ N(0, I_d); then, for each node i, we generate x_i ∼ N(0, I_d) and y_i = x_iᵀβ*. The objective function is defined as

    f(β) = (1/2) Σ_{i=1}^n (x_iᵀβ − y_i)².

The convergence results are depicted in Figure 2.

3. Convergence analysis

Assumption 1

The Markov chain (X_k)_{k≥0} is time-homogeneous, irreducible, and aperiodic. It has a transition matrix P and stationary distribution π*.

3.1 Convex cases

Assumption 2

The set X is assumed to be convex and compact.

Convergence of MCGD in the convex cases

Let Assumptions 1 and 2 hold and let (x^k)_{k≥0} be generated by scheme (2). Assume that the f_i, i ∈ [M], are convex functions, and that the stepsizes satisfy

    Σ_k γ_k = +∞,   Σ_k ln k · γ_k² < +∞.    (3)

Then, we have lim_k E f(x^k) = f*.

The stepsize requirement (3) is nearly identical to that of SGD and subgradient algorithms. In the theorem above, we use the stepsize setting γ_k = O(1/k^q) with 1/2 < q < 1. This kind of stepsize requirement also works for SGD and subgradient algorithms. The convergence rate of MCGD is O(1/Σ_{i=1}^k γ_i) = O(1/k^{1−q}), which is the same as that of SGD and subgradient algorithms for γ_k = O(1/k^q).
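As a quick sanity check (not spelled out in the original text), the canonical choice γ_k = k^{−q} with 1/2 < q < 1 indeed satisfies (3):

```latex
\sum_{k\ge 1} \gamma_k = \sum_{k\ge 1} \frac{1}{k^{q}} = +\infty
\quad (\text{since } q \le 1),
\qquad
\sum_{k\ge 1} \ln k \cdot \gamma_k^2 = \sum_{k\ge 1} \frac{\ln k}{k^{2q}} < +\infty
\quad (\text{since } 2q > 1).
```

The second series converges because ln k grows more slowly than any positive power of k, so ln k / k^{2q} ≤ C / k^{1+ε} for some ε > 0 whenever 2q > 1.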

3.2 Nonconvex cases

Assumption 3

The gradients of the f_i are assumed to be bounded, i.e., there exists D > 0 such that ‖∇f_i(x)‖ ≤ D, i ∈ [M].

Nonconvex case

Let Assumptions 1 and 3 hold, let X be the full space, and let (x^k)_{k≥0} be generated by MCGD. Also assume that each f_i is differentiable and ∇f_i is L-Lipschitz, and that the stepsizes satisfy

    Σ_k γ_k = +∞,   Σ_k ln²k · γ_k² < +∞.    (4)

Then, we have lim_k E‖∇f(x^k)‖ = 0.

Compared with MCGD in the convex case, the stepsize requirements of nonconvex MCGD become slightly stronger: in the summability part, we need Σ_k ln²k · γ_k² < +∞ rather than Σ_k ln k · γ_k² < +∞. Nevertheless, we can still use γ_k = O(1/k^q) for 1/2 < q < 1.

5. Convergence analysis for continuous state space

When the state space Ξ is a continuum, there are infinitely many possible states. In this case, we consider an infinite-state Markov chain that is time-homogeneous and reversible. We aim to solve the following problem:

    minimize_{x ∈ X ⊆ R^n}  E_ξ(F(x; ξ)) = ∫_Ξ F(x; ξ) dΠ(ξ),    (5)

with the algorithm

    x^{k+1} = Proj_X( x^k − γ_k ∇̂F(x^k; ξ_k) ).    (6)

Convex cases

Assume F(·; ξ) is convex for each ξ ∈ Ξ. Let the stepsizes satisfy (3) and let (x^k)_{k≥0} be generated by Algorithm (6). Let F* := min_{x∈X} E_ξ(F(x; ξ)). Assume that, for any ξ ∈ Ξ, |F(x; ξ) − F(y; ξ)| ≤ L‖x − y‖, sup_{x∈X, ξ∈Ξ} ‖∇̂F(x; ξ)‖ ≤ D, E_ξ ∇̂F(x; ξ) ∈ ∂ E_ξ F(x; ξ), and sup_{x,y∈X, ξ∈Ξ} |F(x; ξ) − F(y; ξ)| ≤ H. Then we have

    lim_k E( E_ξ(F(x^k; ξ)) − F* ) = 0.

Non-convex cases

Let the stepsizes satisfy (4) and let (x^k)_{k≥0} be generated by Algorithm (6) with X being the full space. Assume that, for any ξ ∈ Ξ, F(x; ξ) is differentiable and ‖∇F(x; ξ) − ∇F(y; ξ)‖ ≤ L‖x − y‖. In addition, sup_{x∈X, ξ∈Ξ} ‖∇F(x; ξ)‖ < +∞ and E_ξ ∇F(x; ξ) = ∇E_ξ F(x; ξ). Then, we have

    lim_k E‖∇E_ξ(F(x^k; ξ))‖ = 0.