
On Markov Chain Gradient Descent

Authors:
Tao Sun, Yuejiao Sun, and Wotao Yin
National University of Defense Technology & University of California, Los Angeles
1. Markov Chain Gradient Descent
We consider a class of stochastic algorithms, developed on the trajectory of a Markov chain, for solving the finite-sum minimization problem (and its continuous generalization, i.e., population risk minimization)
\[
\underset{x \in X \subseteq \mathbb{R}^d}{\operatorname{minimize}} \quad f(x) \equiv \frac{1}{M}\sum_{i=1}^{M} f_i(x), \tag{1}
\]
where $X \subseteq \mathbb{R}^d$ is a closed convex set and each $f_i$ is either convex or nonconvex but differentiable. The algorithm is designed to overcome two drawbacks of the traditional stochastic gradient descent: (1) direct i.i.d. sampling may be difficult, so a Markov chain is used for the sampling; (2) the data may be stored distributedly on different machines that are connected by a graph. The iteration of MCGD is, therefore, modeled as
\[
x^{k+1} = \operatorname{Proj}_X\!\left(x^{k} - \gamma_k \hat{\nabla} f_{j_k}(x^{k})\right), \tag{2}
\]
where $(j_k)_{k\ge 0}$ is a trajectory of a Markov chain on $\{1,2,\dots,M\}$ that has a uniform stationary distribution, and $\hat{\nabla} f_i$ denotes a subgradient of $f_i$ if $f_i$ is convex, and the gradient of $f_i$ if $f_i$ is nonconvex but differentiable.
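As an illustration only, here is a minimal Python sketch of iteration (2). The index chain, the projection, and the (sub)gradient oracle are placeholders the reader supplies; the names, the simple ball projection, and the default constants are our own assumptions, not part of the poster.

```python
import numpy as np

def mcgd(grad_f, P, x0, num_iters, radius=10.0, q=0.501, seed=0):
    """Minimal sketch of iteration (2).

    grad_f(i, x) : (sub)gradient oracle for f_i at x (user supplied)
    P            : M x M transition matrix of the Markov chain on {0, ..., M-1}
    radius       : X is taken here as a Euclidean ball, purely for illustration
    q            : stepsize exponent, gamma_k = 1 / k^q with 1/2 < q < 1
    """
    rng = np.random.default_rng(seed)
    M = P.shape[0]
    x = np.array(x0, dtype=float)
    j = rng.integers(M)                      # initial state of the index chain
    for k in range(1, num_iters + 1):
        x = x - (1.0 / k**q) * grad_f(j, x)  # step along the sampled (sub)gradient
        nrm = np.linalg.norm(x)
        if nrm > radius:                     # projection onto X = {||x|| <= radius}
            x *= radius / nrm
        j = rng.choice(M, p=P[j])            # advance the Markov chain one step
    return x
```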
2. Numerical results
We present two kinds of numerical results. The first shows that MCGD uses fewer samples to train both a convex model and a nonconvex model. The second demonstrates the advantage of the faster mixing of a non-reversible Markov chain.
2.1 Comparison with SGD
Let us compare:
1. MCGD, where $j_k$ is taken from one trajectory of the Markov chain;
2. SGD$_T$, for $T = 1, 2, 4, 8, 16, 32$, where each $j_k$ is the $T$th sample of a fresh, independent trajectory. All trajectories are generated by starting from the same state 0.
To compute $T$ gradients, SGD$_T$ uses $T$ times as many samples as MCGD. We did not try to adapt $T$ as $k$ increases because there is no theoretical guidance for doing so.
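To make the two sampling schemes concrete, here is a minimal Python sketch (the function names and the transition-matrix interface are our own assumptions): MCGD walks one long trajectory and uses every visited state, whereas SGD$_T$ restarts a fresh trajectory from state 0 for every gradient and keeps only its $T$th state.

```python
import numpy as np

def mcgd_indices(P, num_iters, start=0, seed=0):
    """One long trajectory; every visited state is used as a sample (MCGD)."""
    rng = np.random.default_rng(seed)
    j, out = start, []
    for _ in range(num_iters):
        j = rng.choice(P.shape[0], p=P[j])
        out.append(j)
    return out                    # num_iters samples from num_iters chain steps

def sgd_T_indices(P, num_iters, T, start=0, seed=0):
    """Fresh trajectory per gradient; only the T-th state is kept (SGD_T)."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(num_iters):
        j = start                 # restart from the same state 0
        for _ in range(T):        # burn in T steps of the fresh trajectory
            j = rng.choice(P.shape[0], p=P[j])
        out.append(j)
    return out                    # num_iters samples from T * num_iters chain steps
```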
In the first test, we recover a vector $u$ from an autoregressive process, which closely resembles the first experiment. Set the matrix $A$ as a subdiagonal matrix with random entries $A_{i,i-1} \overset{\text{i.i.d.}}{\sim} \mathcal{U}[0.8, 0.99]$. Randomly sample a vector $u \in \mathbb{R}^d$, $d = 50$, with unit 2-norm. Our data $(\xi_t^1, \xi_t^2)_{t=1}^{\infty}$ are generated according to the following autoregressive process:
\[
\xi_t^1 = A\,\xi_{t-1}^1 + e_1 W_t, \qquad W_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1),
\]
\[
\bar{\xi}_t^2 =
\begin{cases}
1, & \text{if } \langle u, \xi_t^1 \rangle > 0,\\
0, & \text{otherwise};
\end{cases}
\qquad
\xi_t^2 =
\begin{cases}
\bar{\xi}_t^2, & \text{with probability } 0.8,\\
1 - \bar{\xi}_t^2, & \text{with probability } 0.2.
\end{cases}
\]
Clearly, $(\xi_t^1, \xi_t^2)_{t=1}^{\infty}$ forms a Markov chain. Let $\Pi$ denote the stationary distribution of this Markov chain. We recover $u$ as the solution to the following problem:
\[
\underset{x}{\operatorname{minimize}} \quad \mathbb{E}_{(\xi^1,\xi^2)\sim\Pi}\, \ell(x;\xi^1,\xi^2).
\]
We consider both convex and nonconvex loss functions, which had not been done before in the literature. The convex one is the logistic loss
\[
\ell(x;\xi^1,\xi^2) = -\xi^2 \log\!\big(\sigma(\langle x, \xi^1\rangle)\big) - (1-\xi^2)\log\!\big(1 - \sigma(\langle x, \xi^1\rangle)\big),
\]
where $\sigma(t) = \frac{1}{1+\exp(-t)}$.
[Figure 1: four panels plotting $f(\bar{x}^k) - f(x^*)$ against the number of gradient evaluations and the number of samples, for the convex and nonconvex cases, with curves for MCGD (MCSGD) and SGD$_T$, $T = 1, 2, 4, 8, 16, 32$.]
Figure 1: Comparison of MCGD and SGD$_T$ for $T = 1, 2, 4, 8, 16, 32$. $\bar{x}^k$ is the average of $x^1, \dots, x^k$.
The nonconvex one is taken as
\[
\ell(x;\xi^1,\xi^2) = \tfrac{1}{2}\big(\sigma(\langle x,\xi^1\rangle) - \xi^2\big)^2.
\]
We choose $\gamma_k = \frac{1}{k^q}$ as our stepsize, where $q = 0.501$. This choice is consistent with our theory below. Our results in Figure 1 are surprisingly positive for MCGD, even more so than we expected. As expected, MCGD used significantly fewer total samples than SGD$_T$ for every $T$. But it is surprising that MCGD did not need more gradient evaluations either. Randomly generated data must have helped homogenize the samples over the different states, making it less important for a trajectory to converge. It is important to note that SGD$_1$ and SGD$_2$, as well as SGD$_4$ in the nonconvex case, stagnate at noticeably lower accuracies because their $T$ values are too small for convergence.
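Below is a minimal, self-contained Python sketch of this experiment under our own assumptions: the autoregressive chain above generates the data on the fly, the convex (logistic) loss is used, $\gamma_k = 1/k^{0.501}$, and the projection step is omitted; the iteration count, seed, and variable names are illustrative, not taken from the poster.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
A = np.diag(rng.uniform(0.8, 0.99, size=d - 1), k=-1)   # subdiagonal matrix A
u = rng.standard_normal(d); u /= np.linalg.norm(u)       # unit-norm ground truth
e1 = np.zeros(d); e1[0] = 1.0

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def next_sample(xi1):
    """One step of the autoregressive Markov chain (xi^1_t, xi^2_t)."""
    xi1 = A @ xi1 + e1 * rng.standard_normal()
    label = 1.0 if u @ xi1 > 0 else 0.0
    if rng.random() < 0.2:                                # flip the label w.p. 0.2
        label = 1.0 - label
    return xi1, label

# MCGD with the logistic loss and stepsize gamma_k = 1 / k^0.501
# (projection onto X omitted for simplicity in this sketch).
x = np.zeros(d)
xi1 = np.zeros(d)
q = 0.501
for k in range(1, 10_000 + 1):
    xi1, xi2 = next_sample(xi1)
    grad = (sigma(x @ xi1) - xi2) * xi1                   # gradient of the logistic loss
    x -= grad / k**q
```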
2.2 Comparison of reversible and
non-reversible Markov chains for MCGD
We also compare the convergence of MCGD when working with reversible and non-reversible Markov chains. It is well known that transforming a reversible Markov chain into a non-reversible one can significantly accelerate the mixing process. This technique also helps accelerate the convergence of MCGD.
In our experiment, we first construct an undirected connected graph with $n = 20$ nodes and randomly generated edges. Let $G$ denote the adjacency matrix of the graph, that is,
\[
G_{i,j} =
\begin{cases}
1, & \text{if } i, j \text{ are connected};\\
0, & \text{otherwise}.
\end{cases}
\]
Let $d_{\max}$ be the maximum degree of a node. The transition probability matrix of the reversible Markov chain, known as the Metropolis-Hastings Markov chain, is then defined by
\[
P_{i,j} =
\begin{cases}
\frac{1}{d_{\max}}, & \text{if } j \ne i \text{ and } G_{i,j} = 1;\\[2pt]
1 - \frac{\sum_{j\ne i} G_{i,j}}{d_{\max}}, & \text{if } j = i;\\[2pt]
0, & \text{otherwise}.
\end{cases}
\]
[Figure 2: objective error versus iteration for MCGD driven by the reversible and the non-reversible Markov chain.]
Figure 2: Comparison of reversible and non-reversible Markov chains. The second largest eigenvalues of the reversible and non-reversible Markov chains are 0.75 and 0.66, respectively.
Obviously, $P$ is symmetric and its stationary distribution is uniform. The non-reversible Markov chain is constructed by adding cycles. The edges of these cycles are directed; let $V$ denote the adjacency matrix of these cycles, so if $V_{i,j} = 1$ then $V_{j,i} = 0$. Let $w_0 > 0$ be the weight of the flow along these cycles. We then construct the transition probability matrix of the non-reversible Markov chain as
\[
Q_{i,j} = \frac{W_{i,j}}{\sum_{l} W_{i,l}},
\qquad \text{where } W = d_{\max} P + w_0 V.
\]
In our experiment, we add 5 cycles of length 4, with edges existing in $G$; $w_0$ is set to $\frac{d_{\max}}{2}$. We test MCGD on a least-squares problem. First, we select $\beta \sim \mathcal{N}(0, I_d)$ with $d = 10$; then, for each node $i$, we generate $x_i \sim \mathcal{N}(0, I_d)$ and $y_i = x_i^{T}\beta$. The objective function is
\[
f(\beta) = \frac{1}{2}\sum_{i=1}^{n}\big(x_i^{T}\beta - y_i\big)^2.
\]
The convergence results are depicted in Figure 2.
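For illustration, here is a minimal Python sketch of this construction and test under our own assumptions: the random graph generator, the particular placement of the cycles, the iteration count, and the stepsize are ours; only the Metropolis-Hastings rule, the mixture $W = d_{\max}P + w_0 V$, and the least-squares objective follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 10

# Undirected connected graph: a ring (for connectivity) plus random extra edges.
G = np.zeros((n, n))
for i in range(n):
    G[i, (i + 1) % n] = G[(i + 1) % n, i] = 1.0
for i, j in rng.integers(0, n, size=(15, 2)):
    if i != j:
        G[i, j] = G[j, i] = 1.0

# Five directed cycles of length 4; their undirected edges are kept in G
# so that every cycle edge exists in the graph (illustrative placement).
V = np.zeros((n, n))
for c in range(5):
    nodes = [(4 * c + t) % n for t in range(4)]
    for a, b in zip(nodes, nodes[1:] + nodes[:1]):
        V[a, b] = 1.0
        G[a, b] = G[b, a] = 1.0

dmax = int(G.sum(axis=1).max())

# Reversible Metropolis-Hastings transition matrix P.
P = G / dmax
np.fill_diagonal(P, 1.0 - G.sum(axis=1) / dmax)

# Non-reversible transition matrix Q from W = dmax * P + w0 * V.
w0 = dmax / 2
W = dmax * P + w0 * V
Q = W / W.sum(axis=1, keepdims=True)

# Least-squares test: f(beta) = 1/2 * sum_i (x_i^T beta - y_i)^2.
beta_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ beta_true

def run_mcgd(T_mat, num_iters=20000, q=0.501):
    """MCGD along the chain with transition matrix T_mat, stepsize 1/k^q."""
    beta, i = np.zeros(d), 0
    for k in range(1, num_iters + 1):
        grad = (X[i] @ beta - y[i]) * X[i]   # gradient of the i-th summand
        beta -= grad / k**q
        i = rng.choice(n, p=T_mat[i])        # advance the chain one step
    return beta

beta_rev = run_mcgd(P)       # driven by the reversible chain
beta_nonrev = run_mcgd(Q)    # driven by the non-reversible chain
```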
3. Convergence analysis
Assumption 1
The Markov chain $(X_k)_{k\ge 0}$ is time-homogeneous, irreducible, and aperiodic. It has a transition matrix $P$ and a stationary distribution $\pi$.
3.1 Convex cases
Assumption 2
The set $X$ is assumed to be convex and compact.
Convergence of MCGD in the convex cases
Let Assumptions 1 and 2 hold and let $(x^k)_{k\ge 0}$ be generated by scheme (2). Assume that $f_i$, $i \in [M]$, are convex functions, and the stepsizes satisfy
\[
\sum_k \gamma_k = +\infty, \qquad \sum_k \ln k \cdot \gamma_k^2 < +\infty. \tag{3}
\]
Then, we have $\lim_k \mathbb{E} f(x^k) = f^*$.
The stepsize requirement (3) is nearly identical to that of SGD and subgradient algorithms. In the theorem above, we use the stepsize setting $\gamma_k = O(\frac{1}{k^q})$ with $\frac{1}{2} < q < 1$. This kind of stepsize requirement also works for SGD and subgradient algorithms. The convergence rate of MCGD is $O(\frac{1}{\sum_{i=1}^{k}\gamma_i}) = O(\frac{1}{k^{1-q}})$, which is the same as that of SGD and subgradient algorithms for $\gamma_k = O(\frac{1}{k^q})$.
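As a quick worked check (our own addition, not part of the poster's statements), the canonical choice $\gamma_k = \frac{1}{k^q}$ with $\frac{1}{2} < q < 1$ indeed satisfies (3):
\[
\sum_{k\ge 1} \frac{1}{k^{q}} = +\infty \ \ (\text{since } q \le 1),
\qquad
\sum_{k\ge 2} \frac{\ln k}{k^{2q}} < +\infty \ \ (\text{since } 2q > 1 \text{ and } \ln k = O(k^{\epsilon}) \text{ for any } \epsilon > 0),
\]
taking $0 < \epsilon < 2q - 1$ in the second sum. The same choice also satisfies condition (4) below, since $\ln^2 k$ likewise grows slower than any positive power of $k$.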
3.2 Nonconvex cases
Assumption 3
The gradients of $f_i$ are assumed to be bounded; i.e., there exists $D > 0$ such that $\|\nabla f_i(x)\| \le D$, $i \in [M]$.
Nonconvex case
Let Assumptions 1 and 3 hold, let $X$ be the full space, and let $(x^k)_{k\ge 0}$ be generated by MCGD. Also assume each $f_i$ is differentiable with $L$-Lipschitz gradient $\nabla f_i$, and the stepsizes satisfy
\[
\sum_k \gamma_k = +\infty, \qquad \sum_k \ln^2 k \cdot \gamma_k^2 < +\infty. \tag{4}
\]
Then, we have $\lim_k \mathbb{E}\|\nabla f(x^k)\| = 0$.
Compared with MCGD in the convex case, the stepsize requirement for nonconvex MCGD becomes slightly stronger: in the summability condition, we need $\sum_k \ln^2 k \cdot \gamma_k^2 < +\infty$ rather than $\sum_k \ln k \cdot \gamma_k^2 < +\infty$. Nevertheless, we can still use $\gamma_k = O(\frac{1}{k^q})$ for $\frac{1}{2} < q < 1$.
5. Convergence analysis for continuous
state space
When the state space $\Xi$ is a continuum, there are infinitely many possible states. In this case, we consider an infinite-state Markov chain that is time-homogeneous and reversible. We aim to solve the following problem:
\[
\underset{x \in X \subseteq \mathbb{R}^n}{\operatorname{minimize}} \quad \mathbb{E}_{\xi}\big(F(x;\xi)\big) = \int F(x;\xi)\, d\Pi(\xi), \tag{5}
\]
with the algorithm
\[
x^{k+1} = \operatorname{Proj}_X\!\left(x^{k} - \gamma_k \hat{\nabla} F(x^{k};\xi_k)\right). \tag{6}
\]
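A minimal sketch of iteration (6) in the continuous-state setting; purely as an assumption for illustration, the samples $\xi_k$ are generated by a stationary (reversible) Gaussian AR(1) chain and $X$ is a Euclidean ball, with all names and constants our own.

```python
import numpy as np

def mcgd_continuous(grad_F, x0, num_iters, rho=0.9, radius=10.0, q=0.501, seed=0):
    """Sketch of iteration (6): projected stochastic gradient steps where the
    samples xi_k come from a continuous-state Markov chain (here an AR(1) chain,
    an illustrative assumption)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    xi = rng.standard_normal(x.size)              # initial state of the chain
    for k in range(1, num_iters + 1):
        x = x - (1.0 / k**q) * grad_F(x, xi)      # grad_F(x, xi): (sub)gradient of F(.; xi)
        nrm = np.linalg.norm(x)
        if nrm > radius:                          # projection onto a ball (stand-in for X)
            x *= radius / nrm
        xi = rho * xi + np.sqrt(1 - rho**2) * rng.standard_normal(xi.size)  # AR(1) step
    return x
```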
Convex cases
Assume $F(\cdot;\xi)$ is convex for each $\xi \in \Xi$. Let the stepsizes satisfy (3) and let $(x^k)_{k\ge 0}$ be generated by Algorithm (6). Let $F^* := \min_{x\in X}\mathbb{E}_\xi(F(x;\xi))$. Assume that for any $\xi \in \Xi$, $|F(x;\xi) - F(y;\xi)| \le L\|x - y\|$, $\sup_{x\in X,\, \xi\in\Xi}\{\|\hat{\nabla} F(x;\xi)\|\} \le D$, $\mathbb{E}_\xi \hat{\nabla} F(x;\xi) \in \partial\, \mathbb{E}_\xi F(x;\xi)$, and $\sup_{x,y\in X,\, \xi\in\Xi}|F(x;\xi) - F(y;\xi)| \le H$. Then we have
\[
\lim_{k} \mathbb{E}\Big(\mathbb{E}_\xi\big(F(x^k;\xi)\big) - F^*\Big) = 0.
\]
Non-convex cases
Let the stepsizes satisfy (4), and let $(x^k)_{k\ge 0}$ be generated by Algorithm (6) with $X$ being the full space. Assume that for any $\xi \in \Xi$, $F(x;\xi)$ is differentiable and $\|\nabla F(x;\xi) - \nabla F(y;\xi)\| \le L\|x - y\|$. In addition, $\sup_{x\in X,\, \xi\in\Xi}\{\|\nabla F(x;\xi)\|\} < +\infty$ and $\nabla \mathbb{E}_\xi F(x;\xi) = \mathbb{E}_\xi \nabla F(x;\xi)$. Then, we have
\[
\lim_{k} \mathbb{E}\,\big\|\nabla \mathbb{E}_\xi\big(F(x^k;\xi)\big)\big\| = 0.
\]