Access to this full-text is provided by Springer Nature.
Content available from Computational Optimization and Applications
This content is subject to copyright. Terms and conditions apply.
Vol.:(0123456789)
Computational Optimization and Applications (2022) 83:465–524
https://doi.org/10.1007/s10589-022-00399-3
1 3
Stochastic relaxed inertial forward‑backward‑forward
splitting formonotone inclusions inHilbert spaces
ShishengCui1· UdayShanbhag1· MathiasStaudigl2 · PhanVuong3
Received: 2 November 2021 / Accepted: 30 June 2022 / Published online: 30 July 2022
© The Author(s) 2022
Abstract
We consider monotone inclusions defined on a Hilbert space where the operator is
given by the sum of a maximal monotone operator T and a single-valued mono-
tone, Lipschitz continuous, and expectation-valued operator V. We draw motivation
from the seminal work by Attouch and Cabot (Attouch in AMO 80:547–598, 2019,
Attouch in MP 184: 243–287) on relaxed inertial methods for monotone inclusions
and present a stochastic extension of the relaxed inertial forward–backward-for-
ward method. Facilitated by an online variance reduction strategy via a mini-batch
approach, we show that our method produces a sequence that weakly converges to
the solution set. Moreover, it is possible to estimate the rate at which the discrete
velocity of the stochastic process vanishes. Under strong monotonicity, we demon-
strate strong convergence, and give a detailed assessment of the iteration and oracle
complexity of the scheme. When the mini-batch is raised at a geometric (polyno-
mial) rate, the rate statement can be strengthened to a linear (suitable polynomial)
rate while the oracle complexity of computing an
𝜖
-solution improves to
O(1∕
𝜖
)
.
Importantly, the latter claim allows for possibly biased oracles, a key theoretical
advancement allowing for far broader applicability. By defining a restricted gap
function based on the Fitzpatrick function, we prove that the expected gap of an
averaged sequence diminishes at a sublinear rate of
O(1∕k)
while the oracle com-
plexity of computing a suitably defined
𝜖
-solution is
O(
1
∕
𝜖
1+a)
where
a>1
. Numer-
ical results on two-stage games and an overlapping group Lasso problem illustrate
the advantages of our method compared to competitors.
Keywords Monotone operator splitting· Stochastic approximation· Complexity·
Variance reduction· Dynamic sampling
* Mathias Staudigl
m.staudigl@maastrichtuniversity.nl
Extended author information available on the last page of the article
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
466
S.Cui et al.
1 3
1 Introduction
1.1 Problem formulation andmotivation
A wide range of problems in areas such as optimization, variational inequalities,
game theory, signal processing, or traffic theory, can be reduced to solving inclu-
sions involving set-valued operators in a Hilbert space
𝖧
, i.e. to find a point
x∈𝖧
such that
0∈F(x)
, where
F∶𝖧→2𝖧
is a set-valued operator. In many applications
such inclusion problems display specific structure revealing that the operator F can
be additively decomposed. This leads us to the main problem we consider in this
paper.
Problem1 Let
𝖧
be a real separable Hilbert space with inner product
⟨⋅,⋅⟩
and asso-
ciated norm
‖
⋅
‖
=
√⟨
⋅,⋅
⟩
. Let
T∶𝖧→2𝖧
and
V∶𝖧→𝖧
be maximally monotone
operators, such that V is L-Lipschitz continuous. The problem is to
We assume that Problem1 is well-posed:
Assumption 1
𝖲≜𝖹𝖾𝗋(F)≠∅
.
We are interested in the case where (MI) is solved by an iterative algorithm based
on a stochastic oracle (SO) representation of the operator V. Specifically, when
solving the problem, the algorithm calls to the SO. At each call, the SO receives as
input a search point
x∈𝖧
generated by the algorithm on the basis of past informa-
tion so far, and returns the output
V(x,𝜉)
, where
𝜉
is a random variable defined on
some given probability space
(Ω,F,ℙ)
, taking values in a measurable set
Ξ
with
law
𝖯=
ℙ
◦𝜉−1
. In most parts of this paper, and the vast majority of contributions on
stochastic variational problems in general, it is assumed that the output of the SO is
unbiased,
Such stochastic inclusion problems arise in numerous problems of fundamental
importance in mathematical optimization and equilibrium problems, either directly
or through an appropriate reformulation. An excellent survey on the existing tech-
niques for solving problem (MI) can be found in [3] (in general Hilbert spaces) and
[4] (in the finite-dimensional case).
1.2 Motivating examples
In what follows, we provide some motivating examples.
(MI)
find x∈𝖧such that 0 ∈F(x)≜V(x)+T(x),
(1)
V
(x)=𝔼𝜉[
V(x,𝜉)] =
∫Ξ
V(x,z)d𝖯(z)∀x∈𝖧
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
467
1 3
Stochastic relaxed inertial forward‑backward‑forward…
Example 1 (Stochastic Convex Optimization) Let
𝖧1,𝖧2
be separable Hilbert spaces.
A large class of stochastic optimization problems, with wide range of applications in
signal processing, machine learning and control, is given by
where
h∶𝖧1→ℝ
is a convex differentiable function with a Lipschitz con-
tinuous gradient
∇h
, represented as
h
(u)=𝔼
𝜉
[
h(u,𝜉
)]
.
f∶𝖧1→(−∞,∞]
and
g∶𝖧2→(−∞,∞]
are proper, convex lower semi-continuous functions, and
L∶𝖧1→𝖧
2
is a bounded linear operator. Problem (2) gains particular relevance in
machine learning, where usually h(u) is a convex data fidelity term (e.g. a popula-
tion risk functional), and g(Lu) and f(u) embody penalty or regularization terms; see
e.g. total variation [5], hierarchical variable selection [6, 7], and graph regulariza-
tion [8, 9]. Applications in control and engineering are given in [10, 11]. We refer
to (2) as the primal problem. Using Fenchel-Rockafellar duality [3, ch.19], the dual
problem of (2) is given by
where
g∗
is the Fenchel conjugate of g and
(f+h)∗(w)=f∗◻h∗(w)=infu∈𝖧1{f∗(u)+h∗(w−u)}
represents the infimal convo-
lution of the functions f and h. Combining the primal problem (2) with its dual (3),
we obtain the saddle-point problem
Following classical Karush-Kuhn-Tucker theory [12], the primal-dual optimality
conditions associated with (4) are concisely represented by the following monotone
inclusion: Find
x=(u,v)∈𝖧1×𝖧2≡𝖧
such that
We may compactly summarize these conditions in terms of the zero-finding problem
(MI) using the operators V and T, defined as
Note that the operator
V∶𝖧→𝖧
is the sum of a maximally monotone and a skew-
symmetric operator. Hence, in general, it is not cocoercive. Conditions on the data
guaranteeing Assumption 1 are stated in [13].
Since h(u) is represented as an expected value, we need to appeal to simula-
tion based methods to evaluate its gradient. Also, significant computational speed-
ups can be made if we are able to sample the skew-symmetric linear operator
(u,v)↦(L∗u,−Lu)
in an efficient way. Hence, we assume that there exists a SO that
can provide unbiased estimator to the gradient operators
∇h(u)
and
(L∗v,−Lu)
. More
specifically, given the current position
x=(u,v)∈𝖧1×𝖧2
, the oracle will output
the random estimators
H
(u,
𝜉
),
L
u
(u,
𝜉
),
L
v
(v,
𝜉)
such that
(2)
min
u
∈𝖧
1
{f(u)+h(u)+g(Lu)}
(3)
min
v∈𝖧2
{(f+h)∗(−L∗v)+g∗(v)},
(4)
inf
u
∈𝖧1
sup
v∈𝖧2
{f(u)+h(u)−g∗(v)+⟨Lu,v⟩}.
(5)
−L∗v∈𝜕f(u)+∇h(u), and Lu∈𝜕g∗(v).
V(u,v)≜(∇h(u)+L∗v,−Lu)and T(u,v)≜𝜕f(u)×𝜕g∗(v).
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
468
S.Cui et al.
1 3
This oracle feedback generates the random operator
V(x,𝜉)=(
H(u,𝜉)+
Lv(v,𝜉),−
Lu(u,𝜉))
, which allows us to approach the saddle-
point problem (4) via simulation-based techniques.
Example 2 (Stochastic variational inequality problems) There are a multitude of
examples of monotone inclusion problems (MI) where the single-valued map V is
not the gradient of a convex function. An important model class where this is the
case is the stochastic variational inequality (SVI) problem. Due to their huge num-
ber of applications, SVI’s received enormous interest over the last several years from
various communities [14–17]. This problem emerges when V(x) is represented as
an expected value as in (1) and
T(x)=𝜕g(x)
for some proper lower semi-continuous
function
g∶𝖧→(−∞,∞]
. In this case, the resulting structured monotone inclusion
problem can be equivalently stated as
An important and frequently studied special case of (6) arises if g is the indicator
function of a given closed and convex subset
𝖢⊂𝖧
. In this cases the set-valued
operator T becomes the normal cone map
This formulation includes many fundamental problems including fixed point prob-
lems, Nash equilibrium problems and complementarity problems [4]. Consequently,
the equilibrium condition (6) reduces to
1.3 Contributions
Despite the advances in stochastic optimization and variational inequalities, the
algorithmic treatment of general monotone inclusion problems under stochastic
uncertainty is a largely unexplored field. This is rather surprising given the vast
amount of applications of maximally monotone inclusions in control and engi-
neering, encompassing distributed computation of generalized Nash equilibria
[18–20], traffic systems [21–23], and PDE-constrained optimization [24]. The first
major aim of this manuscript is to introduce and investigate a relaxed inertial sto-
chastic forward-backward-forward (RISFBF) method, building on an operator split-
ting scheme originally due to Paul Tseng [25]. RISFBF produces three sequences
{(
X
k,
Y
k,
Z
k);
k
∈
ℕ
}
, defined as
𝔼𝜉
[
H(u,𝜉)]=∇h(u),𝔼
𝜉
[
L
u
(u,𝜉)] = Lu, and 𝔼
𝜉
[
L
v
(v,𝜉)] = L
∗
v
.
(6)
find x∈𝖧s.t. ⟨V(x),x−x⟩+g(x)−g(x)≥0∀x∈𝖧.
(7)
T
(x)= 𝖭𝖢(x)
≜
p∈𝖧
supy∈𝖢
y−x,p
≤
0
if x∈𝖢
,
∅else.
find x∈𝖢s.t. ⟨V(x),x−x⟩≥0∀x∈𝖢.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
469
1 3
Stochastic relaxed inertial forward‑backward‑forward…
The data involved in this scheme are explained as follows:
•
Ak(Zk)
and
Bk(Yk)
are random estimators of V obtained by consulting the SO at
search points
Zk
and
Yk
, respectively;
•
(𝛼k)k∈ℕ
is a sequence of non-negative numbers regulating the memory, or inertia
of the method;
•
(𝜆k)k∈ℕ
is a positive sequence of step-sizes;
•
(𝜌k)k∈ℕ
is a non-negative relaxation sequence.
If
𝛼k=0
and
𝜌k=1
the above scheme reduces to the stochastic forward-backward-
forward method developed in [26, 27], with important applications in Gaussian
communication networks [16] and dynamic user equilibrium problems [28]. How-
ever, even more connections to existing methods can be made.
Stochastic Extragradient If
T={0}
, we obtain the inertial extragradient method
If
𝛼k=0
, this reduces to a generalized extragradient method
recently introduced in [29].
Proximal Point Method If
V=0
, the method reduces to the well-known deter-
ministic proximal point algorithm [2], overlaid by inertial and relaxation effects. The
scheme reads explicitly as
The list of our contributions reads as follows:
(i) Wide Applicability A key argument in favor of Tseng’s operator splitting
method is that it is provably convergent when solving structured monotone
inclusions of the type (MI), without imposing cocoercivity of the single-valued
part V. This is a remarkable advantage relative to the perhaps more familiar
and direct forward-backward splitting methods (aka projected (stochastic) gra-
dient descent in the potential case). In particular, our scheme is applicable to
the primal-dual splitting described in Example1.
(RISFBF)
Z
k
=X
k
+𝛼
k
(X
k
−X
k−1
),
Yk=J𝜆kT(Zk−𝜆kAk(Zk)),
X
k+1=(1−𝜌k)
Z
k+𝜌k[
Y
k+𝜆k(
A
k(
Z
k)−
B
k(
Y
k))].
Z
k
=X
k
+𝛼
k
(X
k
−X
k−1
),
Yk=Zk−𝜆kAk(Zk),
Xk+1
=Z
k
−
𝜌k𝜆k
B
k
(Y
k
).
Y
k
=X
k
−𝜆
k
A
k
(X
k
),
Xk+1=Xk−𝜆k𝜌kBk(Yk),
Z
k
=X
k
+𝛼
k
(X
k
−X
k−1
),
X
k+1
=(1−𝜌
k
)Z
k
+𝜌
k
J
𝜆kT
(Z
k
)
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
470
S.Cui et al.
1 3
(ii) Asymptotic guarantees We show that under suitable assumptions on the relaxa-
tion sequence
(𝜌k)k∈ℕ
, the non-decreasing inertial sequence
(𝛼k)k∈ℕ
, and step-
length sequence
(𝜆k)k∈ℕ
, the generated stochastic process
(Xk)k∈ℕ
weakly
almost surely converges to a random variable with values in
𝖲
. Assuming
demiregularity of the operators yields strong convergence in the real (possibly
infinite-dimensional) Hilbert space.
(iii) Non-asymptotic linear rate under strong monotonicity of V When V is strongly
monotone, strong convergence of the last iterate is shown and the sequence
admits a non-asymptotic linear rate of convergence without a conditional
unbiasedness of the SO. In particular, we show that the iteration and oracle
complexity of computing an
𝜖
-solution is no worse than
O
(log(
1
𝜖))
and
O
(
1
𝜖)
,
respectively.
(iv) Non-asymptotic sublinear rate under monotonicity of V When V is monotone,
by leveraging the Fitzpatrick function [3, 30, 31] associated with the structured
operator
F=T+V
, we propose a restricted gap function. We then prove that
the expected gap of an averaged sequence diminishes at the rate of
O
(
1
k)
. This
allows us to derive an
O
(
1
𝜖)
upper bound on the iteration complexity, and an
O
(
1
𝜖
2+𝛿
)
upper bound (for
𝛿>0)
on the oracle complexity for computing an
𝜖
-solution.
The above listed contributions shed new light on a set of open questions, which we
summarize below:
(i) Absence of rigorous asymptotics So far no aymptotic convergence guarantees
have been available when considering relaxed inertial FBF schemes when T is
maximally monotone and V is a single-valued monotone expectation-valued
map.
(ii) Unavailability of rate statements We are not aware of any known non-asymp-
totic rate guarantees for algorithms solving (MI) under stochastic uncertainty.
A key barrier in monotone and stochastic regimes in developing such state-
ments has been in the availability of a residual function. Some recent progress
in the special stochastic variational inequality case has been made by [26, 32,
33], but the general Hilbert-space setting involving set-valued operators seems
to be largely unexplored (we will say more in Sect.1.4).
(iii) Bias requirements A standard assumption in stochastic optimization is that
the SO generates signals which are unbiased estimators of the deterministic
operator V(x). Of course, the requirement that the noise process is unbiased
may often fail to hold in practice. In the present Hilbert space setting this is in
some sense even expected to be the rule rather than the exception, since most
operators are derived from complicated dynamical systems or the optimiza-
tion method is applied to discretized formulations of the original problem.
See the recent work [34, 35] for an interesting illustration in the context of
PDE-constrained optimization. Some of our results go beyond the standard
unbiasedness assumption.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
471
1 3
Stochastic relaxed inertial forward‑backward‑forward…
1.4 Related research
Understanding the role of inertial and relaxation effects in numerical schemes is a
line of research which received enormous interest over the last two decades. Below,
we try to give a brief overview about related algorithms.
Inertial, Relaxation, and Proximal schemes
In the context of convex optimization, Polyak [36] introduced the Heavy-ball
method. This is a two-step method for minimizing a smooth convex function f. The
algorithm reads as
The difference from the gradient method is that the base point of the gradient
descent step is taken to be the extrapolated point
Zk
, instead of
Xk
. This small dif-
ference has the surprising consequence that (HB) attains optimal complexity guar-
antees for strongly convex functions with Lipschitz continuous gradients. Hence,
(HB) resembles an optimal method [37]. The acceleration effects can be explained
by writing the process entirely in terms of a single updating equation as
Choosing
𝛼k=1−ak𝛿k
and
𝜆k
=𝛾
k
𝛿
2
k
for
𝛿k
a small parameter, we arrive at
This can be seen as a discrete-time approximation of the second-order dynamical
system
introduced by [38]. Since then, it has received significant attention in the potential,
as well as in the non-potential case (see e.g [39–41] for an appetizer). As pointed out
in [42], if
𝛾(t)=1
, the above system reduces to a continuous version of Nesterov’s
fast gradient method [43]. Recently, [44] defined a stochastic version of the Heavy-
ball method.
Motivated by the development of such fast methods for convex optimization,
Attouch and Cabot [1] studied a relaxed-inertial forward-backward algorithm, read-
ing as
(HB)
{
Zk=Xk+𝛼k(Xk−Xk−1)
,
X
k+1
=Z
k
−𝜆
k
∇f(X
k
)
Xk+1−2Xk−Xk−1+(1−𝛼k)(Xk−Xk−1)+𝜆k∇f(Xk)=0.
1
𝛿
2
k
(Xk+1−2Xk−Xk−1)+
a
k
𝛿k
(Xk−Xk−1)+𝛾k∇f(Xk)=
0.
x
(t)+
a
t
x(t)+𝛾(t)∇f(x(t)) =
0,
(RIFB)
⎧
⎪
⎨
⎪
⎩
Zk=Xk+𝛼k(Xk−Xk−1),
Yk=J𝜆kT(Zk−𝜆kV(Zk))
Xk+1=(1−𝜌k)Zk+𝜌kYk
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
472
S.Cui et al.
1 3
If
V=0
, this reduces to a relaxed inertial proximal point method analyzed by
Attouch and Cabot [2]. If
𝜌k=1
, an inertial forward-backward splitting method is
recovered, first studied by Lorenz and Pock [45].
Convergence guarantees for the forward-backward splitting rely on the cocoer-
civity (inverse strong monotonicity) of the single-valued operator V. Example1, in
which V is given by a monotone plus a skew-symmetric linear operator, illustrates
an important instance for which this assumption is not satisfied (see [46] for fur-
ther examples). A general-purpose operator splitting framework, relaxing the coco-
ercivity property, is the forward-backward-forward (FBF) method due to Tseng [25].
Inertial [47] and relaxed-inertial [48] versions of FBF have been developed. An all-
encompassing numerical scheme can be compactly described as
Weak and strong convergence under appropriate conditions on the involved oper-
ators and parameter sequences are established in [48], but no rate statements are
given.
Related work on stochastic approximation Efforts in extending stochastic approx-
imation methods to variational inequality problems have considered standard projec-
tion schemes [14] for Lipschitz and strongly monotone operators. Extragradient and
(more generally) mirror-prox algorithms [49, 50] can contend with merely mono-
tone operators, while iterative smoothing [51] schemes can cope with with the lack
of Lipschitz continuity. It is worth noting that extragradient schemes have recently
assumed relevance in the training of generative adversarial networks (GANS) [52,
53]. Rate analysis for stochastic extragradient (SEG) have led to optimal rates for
Lipschitz and monotone operators [50], as well as extensions to non-Lipschitzian
[51] and pseudomonotone settings [32, 54]. To alleviate the computational com-
plexity single-projection schemes, such as the stochastic forward-backward-forward
(SFBF) method [26, 27], as well as subgradient-extragradient and projected reflected
algorithms [55] have been studied as well.
SFBF has been shown to be nearly optimal in terms of iteration and oracle com-
plexity, displaying significant empirical improvements compared to SEG. While the
role of inertia in optimization is well documented, in stochastic splitting problems,
the only contribution we are aware of is the work by Rosasco et al. [56]. In that
paper asymptotic guarantees for an inertial stochastic forward-backward (SFB) algo-
rithm are presented under the hypothesis that the operators V and T are maximally
monotone and the single-valued operator V is cocoercive.
Variance reduction approaches Variance-reduction schemes address the deterio-
ration in convergence rate and the resulting poorer practical behavior via two com-
monly adopted avenues:
(i) If the single-valued part V appears as a finite-sum (see e.g. [52, 57]), variance-
reduction ideas from machine learning [58] can be used.
(RIFBF)
⎧
⎪
⎨
⎪
⎩
Zk=Xk+𝛼k(Xk−Xk−1),
Yk=J𝜆kT(Zk−𝜆kV(Zk)),
Xk+1=(1−𝜌k)Zk+𝜌k[Yk−𝜆k(V(Yk)−V(Zk))]
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
473
1 3
Stochastic relaxed inertial forward‑backward‑forward…
(ii) Mini-batch schemes that employ an increasing batch-size of gradients [59]
lead to deterministic rates of convergence for stochastic strongly convex [60],
convex [61], and nonconvex optimization [62], as well as for pseudo-monotone
SVIs via extragradient [32], and splitting schemes [26].
In terms of run-time, improvements in iteration complexities achieved by mini-batch
approaches are significant; e.g. in strongly monotone regimes, the iteration complex-
ity improves from
O
(
1
𝜖)
to
O
(ln(
1
𝜖))
[27, 55]. Beyond run-time advantages, such ave-
nues provide asymptotic and rate guarantees under possibly weaker assumptions on
the problem as well as the oracle; in particular, mini-batch schemes allow for possi-
bly biased oracles and state-dependency of the noise [55]. Concerns about the sam-
pling burdens are, in our opinion, often overstated since such schemes are meant to
provide
𝜖
-solutions; e.g. if
𝜖=10−3
and the obtained rate is
O(1∕k)
, then the batch-
size
mk=⌊ka⌋
where
a>1
, implying that the batch-sizes are
O(103a)
, a relatively
modest requirement, given the advances in computing.
Outline The remainder of the paper is organized in five sections. After dispens-
ing with the preliminaries in Sect. 2, we present the (RISFBF) scheme in Sect. 3.
Asymptotic and rate statements are developed in Sect.4 and preliminary numerics
are presented in Sect.5. We conclude with some brief remarks in Sect.6. Technical
results are collected in Appendix1.
2 Preliminaries
Throughout,
𝖧
is a real separable Hilbert space with scalar product
⟨
⋅
,
⋅
⟩
, norm
‖
⋅
‖
, and Borel
𝜎
-algebra
B
. The symbols
→
and
⇀
denote strong and weak con-
vergence, respectively.
Id ∶ 𝖧→𝖧
denotes the identity operator on
𝖧
. Stochastic
uncertainty is modeled on a complete probability space
(Ω,F,ℙ)
, endowed with
a filtration
𝔽=(F
k
)
k
∈
ℕ
0
. By means of the Kolmogorov extension theorem, we
assume that
(Ω,F,ℙ)
is large enough so that all random variables we work with
are defined on this space. A
𝖧
-valued random variable is a measurable function
X∶ (Ω,F)→(𝖧,B)
. Let
G⊂F
be a given sub-sigma algebra. The conditional
expecation of the random variable X is denoted by
𝔼(X|G)
. If
A⊂G⊂F
, the tower-
property says that
We denote by
𝓁0(
𝔽
)
the set of sequences of real-valued random variables
(𝜉k)k∈ℕ
such that, for every
k∈ℕ
,
𝜉k
is
Fk
-measurable. For
p∈[1, ∞]
, we set
We denote the set of summable non-negative sequences by
𝓁1
+
(ℕ
)
.
We now collect some concepts from monotone operator theory. For more details, we
refer the reader to [3]. Let
F∶𝖧→2𝖧
be a set-valued operator. Its domain and graph
𝔼[
𝔼
(X|G)|A]=
𝔼
[
𝔼
(X|A)|G]=
𝔼
(X|A).
𝓁
p(𝔽)
≜
(𝜉k)k∈ℕ∈𝓁0(𝔽)
k≥1
𝜉k
p<∞ℙ-a.s.
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
474
S.Cui et al.
1 3
are defined as
dom F≜{x∈𝖧|F(x)≠∅}, and gr (F)≜{(x,u)∈𝖧×𝖧|u∈F(x)},
respectively. A single-valued operator
C∶𝖧→𝖧
is cocoercive if there exists
𝛽>0
such that
⟨C(x)−C(y),x−y⟩≥𝛽‖C(x)−C(y)‖2
. A set-valued operator
F∶𝖧→2𝖧
is called monotone if
The set of zeros of F, denoted by
𝖹𝖾𝗋(T)
, defined as
𝖹𝖾𝗋(F)≜{x∈𝖧|0∈T(x)}
.
The inverse of F is
F−1∶𝖧→2𝖧,u↦F−1(u)={x∈𝖧|u∈F(x)}
. The resolvent
of F is
JF≜( Id +
F
)−1.
If F is maximally monotone, then
JF
is a single-valued
map. We also need the classical notion of demiregularity of an operator.
Definition 1 An operator
F∶𝖧→2𝖧
is demiregular at
x∈ dom (F)
if for every
sequence
{(yn,un)}n∈
ℕ
⊂gr (F)
and every
u∈F(y)
, we have
The notion of demiregularity captures various properties typically used to estab-
lish strong convergence of dynamical systems. [10] exhibits a large class of possibly
set-valued operators F which are demiregular. In particular, demiregularity holds if
F is uniformly or strongly monotone, or when F is the subdifferential of a uniformly
convex lower semi-continuous function f. We often use the Young inequality
3 Algorithm
Our aim is to solve the monotone inclusion problem (MI) under the following
assumption:
Assumption 2 Consider Problem 1. The set-valued operator
T∶𝖧→2𝖧
is maxi-
mally monotone with an efficiently computable resolvent. The single-valued opera-
tor
V∶𝖧→𝖧
is maximally monotone and L-Lipschitz continuous (
L>0)
with full
domain
dom V=𝖧
.
Assumption2 guarantees that the operator
F=T+V
is maximally monotone [3,
Corolllary 24.4].
For numerical tractability, we make a finite-dimensional noise assumption, com-
mon to stochastic optimization problems in (possibly infinite-dimensional) Hilbert
spaces [63].1
(8)
⟨v−w,x−y⟩≥0∀(x,v),(y,w) ∈ gr (F).
[yn⇀y,vn→v]⇒yn→y.
(9)
ab ≤
a
2
2𝜀
+𝜀b
2
2
(a,b∈ℝ)
.
1 Our analysis does not rely on this assumption. It is made here only for concreteness and because it is
the most prevalent one in applications.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
475
1 3
Stochastic relaxed inertial forward‑backward‑forward…
Assumption 3 (Finite-dimensional noise) All randomness can be described via a
finite dimensional random variable
𝜉∶ (Ω,F)→(Ξ,E)
, where
Ξ⊆
ℝ
d
is a measur-
able set with Borel sigma algebra
E
. The law of the random variable
𝜉
is denoted by
𝖯
, i.e.
𝖯(Γ) ≜
ℙ
({𝜔∈Ω|𝜉(𝜔) ∈ Γ})
for all
Γ∈E
.
To access new information about the values of the operator V(x), we adopt a sto-
chastic approximation (SA) approach where samples are accessed iteratively and
online: At each iteration, we assume to have access to a stochastic oracle (SO) which
generates some estimate on the value of the deterministic operator V(x) when the
current position is x. This information is obtained by drawing an iid sample form the
law
𝖯
. These fresh samples are then used in the numerical algorithm after an initial
extrapolation step delivering the point
Zk=Xk+𝛼k(Xk−Xk−1)
, for some extrapola-
tion coefficient
𝛼k∈[0, 1]
. Departing from
Zk
, we call the SO to retrieve the mini-
batch estimator with sample rate
mk∈ℕ
:
𝜉
k≜(𝜉
(1)
k
,…,𝜉
(m
k
)
k)
is the data sample employed by the SO to return the estimator
Ak(Zk)
. Subsequently we perform a forward-backward update with step size
𝜆k>0
:
In the final updates, a second independent call of the SO is made, using the data set
𝜂k
=(𝜂
(1)
k
,…,𝜂
(m
k
)
k)
, yielding the estimator
and the new state
This iterative procedure generates a stochastic process
{(Zk,Yk,Xk)}k∈ℕ
, defining the
relaxed inertial stochastic forward-backward-forward (RISFBF) scheme. A pseu-
docode is given as Algorithm1 below.
(10)
Ak(Zk,𝜔)
≜
1
m
k
m
k
∑
t=1
V(Zk,𝜉(t)
k(𝜔))
.
(11)
Y
k=J
𝜆k
T
(
Zk−𝜆kAk(Zk)
).
(12)
Bk(Yk,𝜔)
≜
1
m
k
m
k
∑
t=1
V(Yk,𝜂(t)
k(𝜔))
,
(13)
Xk+1
=(1−𝜌
k
)Z
k
+𝜌
k[
Y
k
+𝜆
k
(A
k
(Z
k
)−B
k
(Y
k
))
]
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
476
S.Cui et al.
1 3
Note that RISFBF is still conceptual since we have not explained how the
sequences
(𝛼k)k∈ℕ,(𝜆k)k∈ℕ
and
(𝜌k)k∈ℕ
should be chosen. We will make this precise
in our complexity analysis, starting in Sect.4.
3.1 Equivalent form ofRISFBF
We can collect the sequential updates of RISFBF as the fixed-point iteration
where
Φk,𝜆∶𝖧×Ω→𝖧
is the time-varying map given by
Formulating the algorithm in this specific way establishes the connection between
RISFBF and the heavy-ball system. Indeed, combining the iterations in (15) in one,
we get a second-order difference equations, closely resembling the structure present
in (HB):
Also, it reveals the Markovian nature of the process
(Xk)k∈ℕ
; It is clear from the for-
mulation (15) that
Xk
is Markov with respect to the sigma-algebra
𝜎({X0,…,Xk−1})
.
(14)
X
k=
K
�
k=1
𝜌kYk
∑
K
k=1
𝜌k
(15)
{
Zk=Xk+𝛼k(Xk−Xk−1)
,
X
k+1
=Z
k
−𝜌
k
Φ
k,𝜆k
(Z
k
)
Φk,𝜆
(x,𝜔)
≜
x−𝜆A
k
(x,𝜔) − ( Id
𝖧
−𝜆B
k
(⋅,𝜔))◦J
𝜆T
◦( Id
𝖧
−𝜆A
k
(⋅,𝜔))(x)
.
1
𝜌k
(Xk+1−2Xk−Xk−1)+
(1−𝛼
k
)
𝜌k
(Xk−Xk−1)+Φ
k,𝜆k(Xk+𝛼k(Xk−Xk−1)) =
0.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
477
1 3
Stochastic relaxed inertial forward‑backward‑forward…
3.2 Assumptions onthestochastic oracle
In order to tame the stochastic uncertainty in RISFBF, we need to impose some
assumptions on the distributional properties of the random fields
(Ak(x))k∈ℕ
and
(Bk(x))k∈ℕ
. One crucial statistic we need to control is the SO variance. Define the
oracle error at a point
x∈𝖧
as
Assumption 4 (Oracle Noise) We say that the SO
(i) is conditionally unbiased if
𝔼𝜉[𝜀(x,𝜉)|x]=0
for all
x∈𝖧
;
(ii) enjoys a uniform variance bound:
𝔼𝜉
[
‖
𝜀(x,𝜉)
‖2�
x]
≤
𝜎
2
for some
𝜎>0
and all
x∈𝖧
.
Define
The introduction of these two processes allows us to decompose the random estima-
tor into a mean component and a residual, so that
If Assumption 4(i) holds true then
𝔼
[W
k|
F
k
]=0=𝔼[U
k|
F
k
]=
0
. Hence, under
conditional unbiasedness, the processes
{(Uk,Fk);k∈ℕ}
and
{(
W
k
,
F
k
);k∈ℕ
}
are martingale difference sequences, where the filtrations are defined as
F0≜
F
0≜
F
1≜𝜎
(X
0
,X
1)
, and iteratively, for
k≥1
,
Observe that
Fk
⊆
F
k
⊆F
k+1
for all
k≥1
. The uniform variance bound, Assump-
tion 4(ii), ensures that the processes
{(Uk,
F
k);k∈
ℕ
},{(Wk,
F
k);k∈
ℕ
}
have finite
second moment.
Remark 1 For deriving the stochastic estimates in the analysis to come, it is impor-
tant to emphasize that
Xk
is
Fk
-measurable for all
k≥0
, and
Yk
is
Fk
-measurable.
The mini-batch sampling technology implies an online variance reduction effect,
summarized in the next lemma, whose simple proof we omit.
Lemma 1 (Variance of the SO) Suppose Assumption4 holds. Then for
k≥1
,
(16)
𝜀(x,𝜉)≜
V(x,𝜉)−V(x).
U
k(𝜔)
≜
1
mk
m
k
∑
t=1
𝜀(Zk(𝜔),𝜉(t)
k(𝜔)), and Wk(𝜔)
≜
1
mk
m
k
∑
t=1
𝜀(Yk(𝜔),𝜂(t)
k(𝜔))
.
Ak(Zk)=V(Zk)+Uk, and Bk(Yk)=V(Yk)+Wk
Fk≜𝜎
(X
0
,X
1
,
𝜉1
,
𝜂1
,…,
𝜂k−1
,
𝜉k
),F
k+1≜𝜎
(X
0
,X
1
,
𝜉1
,
𝜂1
,…,
𝜉k
,
𝜂k
)
.
(17)
𝔼
[
‖
‖
Wk
‖
‖
2
|
Fk]
≤
𝜎
2
mk
and 𝔼[
‖
‖
Uk
‖
‖
2
|
Fk]
≤
𝜎
2
mk
,ℙ−
a.s.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
478
S.Cui et al.
1 3
We see that larger sampling rates lead to more precise point estimates of the sin-
gle-valued operator. This comes at the cost of more evaluations of the stochastic
operator. Hence, any mini-batch approach faces a trade-off between the oracle com-
plexity and the iteration complexity. We want to use mini-batch estimators to achieve
an online variance reduction scheme, motivating the next assumption.
Assumption 5 (Batch Size) The batch size sequence
(mk)k∈ℕ
is non-decreasing and
satisfies
∑∞
k=1
1
mk
<
∞
.
4 Analysis
This section is organized into three subsections. The first subsection derives asymp-
totic convergence guarantees, while the second and third subsections provides lin-
ear and sublinear rate statements in strongly monotone and monotone regimes,
respectively.
4.1 Asymptotic convergence
Given
𝜆>0
, we define the residual function for the monotone inclusion (MI) as
Clearly, for every
𝜆>0
,
x∈𝖲 ⇔ 𝗋𝖾𝗌𝜆(x)=0
. Hence,
𝗋𝖾𝗌𝜆(⋅)
is a merit function for
the monotone inclusion problem. To put this merit function into context, let us con-
sider the special case where T is the subdifferential of a lower semi-continuous con-
vex function
g∶𝖧→(−∞,∞]
, i.e.
T=𝜕g
. In this case, the resolvent
J𝜆T
reduces to
the well-known proximal-operator
In the potential case, where
V(x)=∇f(x)
for some smooth convex function
f∶𝖧→ℝ
, the residual function is thus seen to be a constant multiple of the norm
of the so-called gradient mapping
‖
‖
‖
x− pr ox 𝜆g(x−𝜆V(x))
‖
‖
‖
, which is a standard
merit function in convex [64] and stochastic [65, 66] optimization. We use this func-
tion to quantify the per-iteration progress of RISFBF. The main result of this subsec-
tion is the following.
Theorem2 (Asymptotic Convergence) Let
𝛼 ,𝜀 ∈(0, 1)
be fixed parameters. Sup-
pose that Assumption1-5 hold true. Let
(𝛼k)k∈ℕ
be a non-decreasing sequence such
that
limk→∞𝛼k=𝛼
. Let
(𝜆k)k∈ℕ
be a converging sequence in
(
0,
1
4L)
such that
limk→∞
𝜆
k
=𝜆∈(0,
1
4L)
. If
𝜌
k=5(1−
𝜀
)(1−
𝛼
)
2
4(2𝛼2
k
−𝛼
k
+1)(1+L𝜆
k)
for all
k≥1
, then
(i)
limk→∞𝗋𝖾𝗌𝜆k(Zk)=0
in
L2(
ℙ
)
;
(18)
𝗋𝖾𝗌𝜆
(x)
≜‖
‖
x−J
𝜆T
(x−
𝜆
V(x))
‖
‖.
prox
𝜆g(x)
≜
argmin
u∈𝖧
{𝜆g(u)+
1
2‖
u−x
‖
2}
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
479
1 3
Stochastic relaxed inertial forward‑backward‑forward…
(ii) the stochastic process
(Xk)k∈ℕ
generated by algorithm RISFBF weakly con-
verges to a
𝖲
-valued limiting random variable X;
(iii)
∞
k=1
(1−𝛼k)
5(1−𝛼k
4𝜌k(1+L𝜆k)
−1
−2𝛼2
k
Xk−Xk−1
2<∞
ℙ
-a.s.
We prove this Theorem via a sequence of technical Lemmas.
Lemma 3 For all
k≥1
, we have
Proof By definition,
where the last inequality uses the non-expansivity property of the resolvent operator.
Rearranging terms gives the claimed result.
◻
Next, for a given pair
(p,p∗) ∈ gr (F)
, we define the stochastic processes
(ΔMk)k∈
ℕ
,(ΔNk(p,p∗))k∈ℕ
, and
(𝚎k)k∈ℕ
as
Key to our analysis is the following energy bound on the evolution of the anchor
sequence
(‖
‖
Xk−p
‖
‖
2
)k∈ℕ
.
Lemma 4 (Fundamental Recursion) Let
(Xk)k∈ℕ
be the stochastic process generated
by RISFBF with
𝛼k∈(0, 1)
,
0≤
𝜌k<
5
4(1+L𝜆k)
, and
𝜆k∈(0, 1∕4L)
. For all
k≥1
and
(p,p∗) ∈ gr (F)
, we have
(19)
−‖
‖
Zk−Yk
‖
‖
2
≤
𝜆2
k
‖
‖
Uk
‖
‖
2−
1
2
𝗋𝖾𝗌2
𝜆
k
(Zk)
.
1
2
𝗋𝖾𝗌2
𝜆k
(Zk)= 1
2
‖
‖
‖
Zk−J𝜆kT(Zk−𝜆kV(Zk))
‖
‖
‖
2
=1
2
‖
‖
‖
Zk−Yk+J𝜆kT(Zk−𝜆kAk(Zk)) − J𝜆kT(Zk−𝜆kV(Zk))‖
‖
‖
2
≤
‖
‖
Zk−Yk
‖
‖
2+
‖
‖
‖
J𝜆kT(Zk−𝜆kAk(Zk)) − J𝜆kT(Zk−𝜆kV(Zk))
‖
‖
‖
2
≤
‖
‖
Z
k
−Y
k‖
‖
2+𝜆2
k
‖
‖
U
k‖
‖
2,
(20)
Δ
Mk
≜
5𝜌k𝜆
2
k
2(1+L𝜆k)
‖
‖
𝚎k
‖
‖
2+
𝜌k𝜆
2
k
2
‖
‖
Uk
‖
‖
2
,
(21)
Δ
N
k
(p,p
∗
)
≜
2
𝜌k𝜆k⟨
W
k
+p
∗
,p−Y
k⟩, and
(22)
𝚎k≜
W
k
−U
k.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
480
S.Cui et al.
1 3
Proof To simplify the notation, let us call
Ak≡Ak(Zk)
and
Bk≡Bk(Yk)
. We also
introduce the intermediate update
Rk≜
Y
k
+
𝜆k
(A
k
−B
k)
. For all
k≥0
, it holds true
that
Since
Introducing the process
(𝚎k)k∈ℕ
from eq. (22), the aforementioned set of inequalities
reduces to
Hence,
But
Yk+𝜆kT(Yk)∋Zk−𝜆kAk
, implying that
Xk+1−p
2
≤
(1+𝛼k)
Xk−p
2−𝛼k
Xk−1−p
2−
𝜌
k
4
𝗋𝖾𝗌2
𝜆k
(Zk)
+ΔMk+ΔNk(p,p∗)−2𝜌k𝜆kV(Yk)−V(p),Yk−p
+𝛼k
Xk−Xk−1
22𝛼k+5(1−𝛼k)
4𝜌k(1+L𝜆k)
−(1−𝛼k)
5
4
𝜌k
(1+L
𝜆k
)−1
Xk+1−Xk
2.
�
�
Zk−p
�
�
2
=
�
�Zk−Yk+Yk−Rk+Rk−p
�
�
2
=�
�Zk−Yk�
�
2+�
�Yk−Rk�
�
2+�
�Rk−p�
�
2+2⟨Zk−Yk,Yk−p⟩
+2⟨Yk−Rk,Rk−p⟩
=�
�Zk−Yk�
�
2+�
�Yk−Rk�
�
2+�
�Rk−p�
�
2+2⟨Zk−Yk,Yk−p⟩
+2⟨Yk−Rk,Rk−p⟩
=�
�Zk−Yk�
�
2+�
�Yk−Rk�
�
2+�
�Rk−p�
�
2+2⟨Zk−Yk,Yk−p⟩
+2⟨Yk−Rk,Yk−p⟩+2⟨Yk−Rk,Rk−Yk⟩
=�
�Zk−Yk�
�
2+�
�Yk−Rk�
�
2+�
�Rk−p�
�
2+2⟨Zk−Rk,Yk−p⟩
+2
⟨
Yk−Rk,Rk−Yk
⟩
=
�
�
Z
k
−Y
k�
�
2−
�
�
Y
k
−R
k�
�
2+
�
�
R
k
−p
�
�
2+2
⟨
Z
k
−R
k
,Y
k
−p
⟩.
�
�
Yk−Rk
�
�
2
=𝜆2
k
�
�Bk(Yk)−Yk(Zk)
�
�
2
≤𝜆2
k�
�V(Yk)−V(Zk)+Wk+1−Uk+1�
�
2
≤𝜆2
k�
�V(Yk)−V(Zk)�
�
2+𝜆2
k�
�Wk−Uk�
�
2+2𝜆2
k⟨V(Yk)−V(Zk),Wk−Uk
⟩
≤L2𝜆2
k
�
�
Yk−Zk
�
�
2+𝜆2
k
�
�
Wk−Uk
�
�
2+2𝜆2
k
⟨
V(Yk)−V(Zk),Wk−Uk
⟩
≤2L2𝜆2
k�
�
Y
k
−Z
k�
�
2+2𝜆2
k�
�
W
k
−U
k�
�
2.
‖
‖
Y
k
−R
k‖
‖
2≤
2L2𝜆2
k
‖
‖
Y
k
−Z
k‖
‖
2
+2𝜆2
k
‖
‖
𝚎
k‖
‖
2.
�
�
Z
k
−p
�
�
2≥
(1−2L2𝜆2
k
)
�
�
Z
k
−Y
k�
�
2
−2𝜆2
k�
�
𝚎
k�
�
2
+
�
�
R
k
−p
�
�
2
+2
⟨
Z
k
−R
k
,Y
k
−p
⟩.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
481
1 3
Stochastic relaxed inertial forward‑backward‑forward…
Pick
(p,p∗) ∈ gr (F)
, so that
p∗−V(p)∈T(p)
. Then, the monotonicity of T yields
the estimate
This is equivalent to
This implies that
Hence, we obtain the following,
Rearranging terms, we arrive at the following bound on
‖
‖
R
k
−p
‖
‖
2
:
Next, we observe that
‖
‖
X
k+1
−p
‖
‖
2
may be bounded as follows.
We may then derive a bound on the expression in(25),
1
𝜆k
(Zk−Yk−𝜆kAk)∈T(Yk)
.
⟨
1
𝜆k
(Zk−Yk−𝜆kAk)−p∗+V(p),Yk−p
⟩≥0.
(23)
1
𝜆k
(Zk−Rk−𝜆kBk)−p∗+V(p),Yk−p
≥
0,
or Zk−Rk,Yk−p≥𝜆kWk+p
∗
,Yk−p+𝜆kV(Yk)−V(p),Yk−p.
⟨Zk−Rk,Yk−x∗⟩≥𝜆k⟨Wk,Yk−x∗⟩.
�
�
Zk−p
�
�
2≥
(1−2L2𝜆2
k)
�
�
Yk−Zk
�
�
2
+
�
�
Rk−p
�
�
2
−2𝜆2
k
�
�
𝚎k
�
�
2
+2𝜆k⟨Wk+p
∗
,Yk−p⟩+2𝜆k⟨V(Yk)−V(p),Yk−p⟩.
(24)
�
�
Rk−p
�
�
2≤�
�
Zk−p
�
�
2
−(1−2L2𝜆2
k)
�
�
Yk−Zk
�
�
2
+2𝜆2
k
�
�
𝚎k
�
�
2
+2𝜆k
⟨
Wk+p∗,p−Yk
⟩
+2𝜆k⟨V(Yk)−V(p),p−Yk⟩
(25)
�
�
Xk+1−p
�
�
2
=
�
�(1−𝜌k)Zk+𝜌kRk−p
�
�
2
=�
�(1−𝜌k)(Zk−p)−𝜌k(Rk−p)�
�
2
=(1−𝜌k)2�
�Zk−p�
�
2+𝜌2
k�
�Rk−p�
�
2−2𝜌k(1−𝜌k)⟨Zk−p,Rk−p
⟩
=(1−𝜌k)�
�Zk−p�
�
2−𝜌k(1−𝜌k)�
�Zk−p�
�
2
+𝜌k�
�Rk−p�
�
2−𝜌k(1−𝜌k)�
�Rk−p�
�
2
+2𝜌k(1−𝜌k)⟨Zk−p,Rk−p⟩
=(1−𝜌k)�
�
Zk−p�
�
2+𝜌k�
�
Rk−p�
�
2−𝜌k(1−𝜌k)�
�
Rk−Zk�
�
2
=(1−𝜌k)
�
�
Zk−p
�
�
2+𝜌k
�
�
Rk−p
�
�
2−1−𝜌k
𝜌k
�
�
Xk+1−Zk
�
�
2.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
482
S.Cui et al.
1 3
By invoking (19), we arrive at the estimate
Furthermore,
which implies that
Multiplying both sides by
𝜌
k
(
1
∕
2
−
2L
𝜆
k
)
1+L𝜆k
, a positive scalar since
𝜆k
∈(0,
1
4L)
, we obtain
Rearranging terms, and noting that
(
1∕2−2L𝜆
k
)(1+L𝜆
k
)
≤
1∕2−2L
2
𝜆
2
k
, the above
estimate becomes
(26)
(
1−𝜌k)�
�Zk−p�
�
2+𝜌k�
�Rk−p�
�
2−
1−𝜌
k
𝜌k
�
�Xk+1−Zk�
�
2
≤
�
�
Zk−p
�
�
2−1−𝜌k
𝜌k
�
�
Xk+1−Zk
�
�
2−𝜌k(1−2L2𝜆2
k)
�
�
Zk−Yk
�
�
2
+2𝜆2
k
𝜌
k�
�
𝚎
k�
�
2−2𝜌
k
𝜆
k⟨
W
k
+p∗,Y
k
−p
⟩
+2𝜌
k
𝜆
k⟨
V(Y
k
)−V(p),p−Y
k⟩
(27)
=
�
�Zk−p�
�
2−
1−𝜌
k
𝜌k
�
�Xk+1−Zk�
�
2−𝜌k(1∕2−2L2𝜆2
k)�
�Zk−Yk�
�
2
+2𝜆2
k𝜌k
�
�
𝚎k
�
�
2−2𝜌k𝜆k
⟨
Wk+p∗,Yk−p
⟩
−𝜌k
2
�
�
Yk−Zk
�
�
2+2𝜌k𝜆k
⟨
V(Yk)−V(p),p−Yk
⟩
.
�
�
Xk+1−p�
�
2
≤
�
�Zk−p�
�
2−
1−𝜌
k
𝜌k
�
�Xk+1−Zk�
�
2+2𝜆2𝜌k�
�𝚎k�
�
2
−2𝜌k𝜆k⟨Wk+p∗,Yk−p⟩
−𝜌k(1∕2−2L2𝜆2
k)�
�Yk−Zk�
�
2−𝜌k
4
𝗋𝖾𝗌2
𝜆k(Zk)
+𝜌k𝜆2
k
2
�
�
Uk
�
�
2+2𝜌k𝜆k
⟨
V(Yk)−V(p),p−Yk
⟩
.
1
𝜌
k
‖
‖Xk+1−Zk
‖
‖=
‖
‖Rk−Zk
‖
‖
≤‖
‖Rk−Yk
‖
‖+
‖
‖Yk−Zk
‖
‖
≤𝜆k
‖
‖
Bk−Ak
‖
‖
+
‖
‖
Yk−Zk
‖
‖
≤(1+L𝜆
k
)
‖
‖
Y
k
−Z
k‖
‖
+𝜆
k‖
‖
𝚎
k‖
‖
,
(28)
1
2
𝜌2
k
‖
‖
Xk+1−Zk
‖
‖
2
≤
(1+L𝜆k)2
‖
‖
Yk−Zk
‖
‖
2+𝜆2
k
‖
‖
𝚎k
‖
‖
2
.
(29)
1∕2−2L𝜆
k
2
𝜌k(1+L𝜆k)
‖
‖Xk+1−Zk‖
‖
2
≤
𝜌k(1∕2−2L𝜆k)(1+L𝜆k)‖
‖Yk−Zk‖
‖
2
+𝜆2
k𝜌k(1∕2−2L𝜆k)
1+L𝜆k
‖
‖
𝚎k
‖
‖
2.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
483
1 3
Stochastic relaxed inertial forward‑backward‑forward…
Substituting this bound into the first majorization of the anchor process
‖
‖
X
k+1
−p
‖
‖
2
,
we see
Observe that
and Lemma16 gives
By hypothesis,
𝛼k,𝜌k,𝜆k
are defined such that
5∕2−2𝜌
k
(1+L𝜆
k
)
2𝜌k(1+L𝜆k)
>
0
. Then, using both of
these relations in the last estimate for
‖
‖
X
k+1
−p
‖
‖
2
, we arrive at
(30)
−
𝜌k(1∕2−2L2𝜆2
k)‖
‖Yk−Zk‖
‖
2
≤
−
1∕2−2L𝜆
k
2𝜌k(1+L𝜆k)
‖
‖Xk+1−Zk‖
‖
2
+𝜌k𝜆2
k(1∕2−2L𝜆k)
1+L
𝜆k
‖
‖
𝚎k
‖
‖
2.
Xk+1−p
2
≤
Zk−p
2−
1−𝜌
k
𝜌k
+
1∕2−2L𝜆
k
2𝜌k(1+L𝜆k)
Xk+1−Zk
2
+𝜌k𝜆2
k
𝚎k
22+1∕2−2L𝜆k
1+L𝜆k+2𝜌k𝜆kV(Yk)−V(p),p−Yk
−2𝜌k𝜆kWk+p∗,Yk−p−𝜌k
4
𝗋𝖾𝗌2
𝜆k(Zk)+ 𝜌k𝜆2
k
2
Uk
2
=
Zk−p
2−𝜌k
4
𝗋𝖾𝗌2
𝜆k(Zk)+ 𝜌k𝜆2
k
2
Uk
2−2𝜌k𝜆kWk+p∗,Yk−p
−5∕2−2𝜌k(1+L𝜆k)
2𝜌k(1+L𝜆k)
Xk+1−Zk
2
+
5𝜌k𝜆2
k
2(1+L𝜆k)
𝚎k
2+2𝜌k𝜆k
V(Yk)−V(p),p−Yk
.
(31)
‖
‖
Xk+1−Zk
‖
‖
2
=
‖
‖
(Xk+1−Xk)−𝛼k(Xk−Xk−1)
‖
‖
2
≥(1−𝛼
k
)
‖
‖
X
k+1
−X
k‖
‖
2+(𝛼2
k
−𝛼
k
)
‖
‖
X
k
−X
k−1‖
‖
2
,
(32)
‖
‖
Z
k
−p
‖
‖
2
=(1+𝛼
k
)
‖
‖
X
k
−p
‖
‖
2
−𝛼
k‖
‖
X
k−1
−p
‖
‖
2
+𝛼
k
(1+𝛼
k
)
‖
‖
X
k
−X
k−1‖
‖
2.
Xk+1−p
2≤
(1+𝛼k)
Xk−p
2
−𝛼k
Xk−1−p
2
+𝛼k(1+𝛼k)
Xk−Xk−1
2
−2𝜌k𝜆kWk+1+p∗,Yk−p
−𝜌k
4
𝗋𝖾𝗌2
𝜆k
(Zk)+
5𝜌k𝜆2
k
2(1+L𝜆k)
𝚎k
2
+
𝜌k𝜆2
k
2
Uk
2+2𝜌k𝜆kV(Yk)−V(p),p−Yk
−
5
4
𝜌k
(1+L
𝜆k
)−1
(1−𝛼k)
Xk+1−Xk
2+(𝛼2
k−𝛼k)
Xk−Xk−1
2
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
484
S.Cui et al.
1 3
Using the respective definitions of the stochastic increments
ΔMk,ΔNk(p,p∗)
in (20)
and (21), we arrive at
◻
Recall that
Yk
is
Fk
-measurable. By the law of iterated expectations, we therefore
see
for all
(p,p∗) ∈ gr (F)
. Observe that if we choose
(p,0) ∈ gr (F)
, meaning that
p∈𝖲
, then
ΔNk(p,0)≡ΔNk(p)
is a martingale difference sequence. Furthermore,
for all
k≥1
,
where
𝚊
k
≜
𝜆2
k
(
10𝜌k
1+L
𝜆k
+𝜌k
2
)
.
To prove the a.s. convergence of the stochastic process
(Xk)k∈ℕ
, we rely on the fol-
lowing preparations. Motivated by the analysis of deterministic inertial schemes, we
are interested in a regime under which
𝛼k
is non-decreasing.
For a fixed reference point
p∈𝖧
, define the anchor sequences
𝜙
k(p)
≜
1
2
‖
‖
Xk−p
‖
‖
2
, and the energy sequence
Δ
k
≜
1
2
‖
‖
Xk−Xk−1
‖
‖
2.
In terms of
these sequences, we can rearrange the fundamental recursion from Lemma 4 to
obtain
For a given pair
(p,p∗) ∈ gr (F)
, define
(33)
Xk+1−p
2
≤
(1+𝛼k)
Xk−p
2−𝛼k
Xk−1−p
2−
𝜌
k
4
𝗋𝖾𝗌2
𝜆k
(Zk)
+ΔMk+ΔNk(p,p∗)−2𝜌k𝜆kV(Yk)−V(p),Yk−p
+𝛼k
Xk−Xk−1
22𝛼k+5(1−𝛼k)
4𝜌k(1+L𝜆k)
−(1−𝛼k)
5
4
𝜌k
(1+L
𝜆k
)−1
Xk+1−Xk
2.
𝔼
[ΔN
k
(p,p∗)
F
k
]=𝔼
𝔼[ΔN
k
(p,p∗)
F
k
]
F
k
=2𝜌
k
𝜆
k
𝔼[
p∗,p−Y
k
F
k
]
,
(34)
𝔼
[ΔMk
|
Fk]
≤
5𝜌k𝜆
2
k
1+L𝜆k
𝔼[
‖
‖
Wk
‖
‖
2
|
Fk]+𝜆2
k
(
5𝜌k
1+L𝜆k
+𝜌k
2)
𝔼[
‖
‖
Uk
‖
‖
2
|
Fk]
≤
𝚊k𝜎2
mk
,
𝜙
k+1(p)−𝛼k𝜙k(p)−(1−𝛼k)
5
4𝜌k(1+L𝜆k)−1
Δk+1
≤
𝜙k(p)−𝛼k𝜙k−1(p)
−(1−𝛼k)5
4𝜌k(1+L𝜆k)−1Δk+1
2ΔMk+1
2ΔNk(p,p∗)
−𝜌k𝜆kV(Yk)−V(p),Yk−p+Δ
k2𝛼2
k+(1−𝛼k)1−5(1−𝛼k)
4𝜌k(1+L𝜆k)
−𝜌k
8
𝗋𝖾𝗌2
𝜆k
(Zk).
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
485
1 3
Stochastic relaxed inertial forward‑backward‑forward…
Then, in terms of the sequence
and using the monotonicity of V, guaranteeing that
⟨
V
(
Y
k)−
V
(
p
),
Y
k−
p
⟩≥0
, we
get
Defining
we arrive at
Our aim is to use
Qk(p)
as a suitable energy function for RISFBF. For that to work,
we need to identify a specific parameter sequence pair
(𝜌k,𝛼k)
so that
𝛽k≥0
and
𝜃k≥0
, taking the following design criteria into account:
1.
𝛼k∈(0, 𝛼 ]⊂(0, 1)
for all
k≥1
;
2.
𝛼k
is non-decreasing with
Incorporating these two restrictions on the inertia parameter
𝛼k
, we are left with the
following constraints:
To identify a constellation of parameters
(𝛼k,𝜌k)
satisfying these two conditions,
define
Then,
(35)
Q
k(p)
≜
𝜙k(p)−𝛼k𝜙k−1(p)+(1−𝛼k)
(
5
4𝜌
k
(1+L𝜆
k
)−1
)
Δk
.
(36)
𝛽
k+1
≜
(1−𝛼k)
(
5
4𝜌k(1+L𝜆k)
−1
)
−(1−𝛼k+1)
(
5
4𝜌k+1(1+L𝜆k+1)
−1
),
Q
k+1(p)
≤
Qk(p)−
𝜌
k
8
𝗋𝖾𝗌2
𝜆k
(Zk)+
1
2ΔMk+
1
2ΔNk(p,p∗)+(𝛼k−𝛼k+1)𝜙k(p
)
+
[
2𝛼2
k+(1−𝛼k)
(
1−5(1−𝛼k)
4
𝜌k
(1+L
𝜆k
)
)]
Δk−𝛽k+1Δk+1.
𝜃
k
≜𝜌
k
8
𝗋𝖾𝗌2
𝜆k
(Zk)−
[
2𝛼2
k+(1−𝛼k)
(
1−
5(1−𝛼
k
)
4𝜌k(1+L𝜆k))]
Δk
,
(37)
Q
k+1(p)
≤
Qk(p)−𝜃k+
1
2
ΔMk+
1
2
ΔNk(p,p∗)+(𝛼k−𝛼k+1)𝜙k(p)−𝛽k+1Δk+1
.
(38)
sup
k≥1
𝛼
k
=𝛼 , and inf
k
≥
1
𝛼
k
>0.
(39)
𝛽
k
≥
0 and 2𝛼2
k+(1−𝛼k)
(
1−
5(1−𝛼
k
)
4𝜌k(1+L𝜆k))≤0.
(40)
h
k(x,y)
≜
(1−x)
(
5
4y(1+L
𝜆k
)−1
).
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
486
S.Cui et al.
1 3
which gives
Solving this condition for
𝜌k
reveals that 1
𝜌k
≥
4(2𝛼
2
k−𝛼k+1)(1+L𝜆k)
5(1−𝛼k)
2
.
Using the design
condition
𝛼k≤𝛼 < 1
, we need to choose the relaxation parameter
𝜌k
so that
𝜌
k
≤
5
(
1
−𝛼
k
)2
4(1+L𝜆
k
)(2𝛼2
k
−𝛼
k
+1
)
. This suggests to use the relaxation sequence
𝜌
k=𝜌k(𝛼k,𝜆k)
≜
5(1−𝜀 )(1−𝛼 )
2
4(1+L𝜆
k
)(2𝛼2
k
−𝛼
k
+1
)
. It remains to verify that with this choice we can
guarantee
𝛽k≥0.
This can be deduced as follows: Recalling (40), we get
In particular, we note that if f(𝛼)
≜(
1
−𝛼)(
2
𝛼2−𝛼+
1
)
(1−𝜀 )(1−𝛼 )
2+𝛼−
1
, then
We consider two cases:
Case 1:
𝛼 ≤1∕2
. In this case
Case 2:
1∕2< 𝛼 < 1
. In this case
Thus,
f(𝛼)
is decreasing in
𝛼∈(0, 𝛼 ]
, where
0< 𝛼 < 1
.
Using these relations, we see that (37) reduces to
where
𝜃k≥0
. This is the basis for our proof of Theorem2.
Proof of Theorem2 We start with (i). Consider (42), with the special choice
p∗=0
,
so that
p∈𝖲
. Taking conditional expectations on both sides of this inequality, we
arrive at
0≥
2𝛼
2
k−(1−𝛼k)
(
hk(𝛼k,𝜌k)+(1−𝛼k)−1
)
=2𝛼2
k+𝛼k(1−𝛼k)−(1−𝛼k)hk(𝛼k,𝜌k)
=
𝛼k
(1+
𝛼k
)−(1−
𝛼k
)h
k
(
𝛼k
,
𝜌k
),
(41)
h
k(𝛼k,𝜌k)
≥𝛼
k
(1+𝛼
k
)
1−𝛼k
.
h
k(𝛼k,𝜌k)=(1−𝛼k)
(
5
4
𝜌k
(1+L
𝜆
)−1
)
=(1−
𝛼
k)(2
𝛼2
k−
𝛼
k+1)
(1−
𝜀
)(1−
𝛼
)2+𝛼k−
1.
f�(𝛼)= (1−𝛼)(4𝛼−1)−(2𝛼2−𝛼+1)
(1−𝜀 )(1−𝛼 )
2+1=−6𝛼2+6𝛼−2+(1−𝜀 )(1−𝛼 )2
(1−𝜀 )(1−𝛼 )
2=−6(𝛼−
1
2)2−
1
2+(1−𝜀 )(1−𝛼 )
2
(1−𝜀 )(1−𝛼 )
2
f�(𝛼)
≤
−6(𝛼 −
1
2)2−
1
2+(1−𝜀 )(1−𝛼 )2
(1−𝜀 )(1−𝛼 )2
≤
−5𝛼
2
+4𝛼 −1
(1−𝜀 )(1−𝛼 )
2<
0.
f�(𝛼)
≤
−6(1∕2−1∕2)
2
−1∕2+(1−𝜀 )(1−𝛼 )
2
(1−𝜀 )(1−𝛼 )
2
≤
−1∕2+(1−𝜀 )(1−1∕2)
2
(1−𝜀 )(1−𝛼 )
2<
0.
(42)
Q
k+1(p)
≤
Qk(p)−𝜃k+
1
2
ΔMk+
1
2
ΔNk(p,p∗)
,
𝔼[Qk+1(p)|Fk]≤Qk(p)−𝜃k+𝜓k,
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
487
1 3
Stochastic relaxed inertial forward‑backward‑forward…
where
𝜓
k
≜
𝚊k𝜎
2
2mk
. By design of the relaxation sequence
𝜌k
, we see that
Since
limk→∞𝜆k=𝜆∈(0, 1∕4
L
)
, and
limk→∞𝛼k=𝛼 ∈(0, 1)
, we conclude that the
sequence
(𝚊k)k∈ℕ
is bounded. Consequently, thanks to Assumption5, the sequence
(𝜓k)k∈ℕ
is in
𝓁1
+
(ℕ
)
. We next claim that
Qk(p)≥0
. To verify this, note that
where the first and second inequality uses
𝜀 < 1
and
𝛼k≤𝛼 ∈(0, 1)
, the third ine-
quality makes use of the Young inequality:
1−a
2a
‖
‖
Xk−p
‖
‖
2
+
a
2(1−a)
‖
‖
Xk−Xk−1
‖
‖
2≥‖
‖
Xk−p
‖
‖
⋅
‖
‖
Xk−Xk−1
‖
‖
. Finally, the fourth
inequality uses the triangle inequality
‖
‖
X
k−1
−p
‖
‖≤‖
‖
X
k
−X
k−1‖
‖
+
‖
‖
X
k
−p
‖
‖
.
Lemma 17 readily yields the existence of an a.s. finite limiting random variable
Q∞(p)
such that
Qk(p)→Q∞(p)
,
ℙ
-a.s., and
(
𝜃
k
)
k∈
ℕ∈𝓁
1
+
(𝔽
)
. Since
𝜆k→𝜆
, we get
lim
k→∞𝜌k=5(1−
𝜀
)(1−
𝛼
)
2
4(1+L𝜆)(2𝛼
2
+1−𝛼 )
. Hence,
𝚊
k=𝜆2
k
(10𝜌
k
1+L𝜆k
+
𝜌
k
2
)
=𝜆2
k
(
10
1+L𝜆k
+1
2
)
5(1−
𝜀
)
4(2𝛼2
k
−𝛼
k
+1)(1+L𝜆
k
)
.
Q
k(p)= 1
2
Xk−p
2−
𝛼
k
2
Xk−1−p
2+(1−
𝛼
k)
2
5
4𝜌k(1+L𝜆k)−1
Xk−Xk−1
2
=1
2
Xk−p
2+(1−𝛼k)(2𝛼2
k+1−𝛼k)
(1−𝜀 )(1−𝛼 )2−1+𝛼k1
2
Xk−Xk−1
2
−𝛼k
2
Xk−1−p
2
≥1
2
Xk−p
2+(1−𝛼k)(𝛼2
k+1−𝛼k)
(1−𝜀 )(1−𝛼 )2−1+𝛼k1
2
Xk−Xk−1
2
−𝛼k
2
Xk−1−p
2
≥1
2
Xk−p
2+(1−𝛼k)(𝛼2
k+1−𝛼k)
(1−𝛼k)2−1+𝛼k1
2
Xk−Xk−1
2
−𝛼k
2
Xk−1−p
2
=(𝛼k+(1−𝛼k))
Xk−p
2+𝛼k+
𝛼2
k
1−𝛼k
Xk−Xk−1
2
−𝛼k
Xk−1−p
2
≥𝛼k
2
Xk−p
2+
Xk−Xk−1
2
−𝛼k
2
Xk−1−p
2+𝛼k
Xk−p
⋅
Xk−Xk−1
≥𝛼k
2
Xk−p
+
Xk−Xk−1
2−𝛼k
2
Xk−1−p
2≥0.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
488
S.Cui et al.
1 3
ℙ
-a.s. We conclude that
lim
k→∞𝗋𝖾𝗌
2
𝜆k
(Zk)=
0
,
ℙ
-a.s..
To prove (ii) observe that, since
𝜀 ∈(0, 1)
and
limk→∞𝛼k=𝛼
, it follows
Consequently,
limk→∞
‖
‖
X
k
−X
k−1‖
‖
2
=
0
,
ℙ
-a.s., and
(
𝜙
k
(p)−𝛼
k
𝜙
k−1
(p)
)k∈ℕ
is
almost surely bounded. Hence, for each
𝜔∈Ω
, there exists a bounded random vari-
able
C1(𝜔)∈[0, ∞)
such that
Iterating this relation, using the fact that
𝛼 ∈[0, 1)
, we easily derive
Hence,
(𝜙k(p))k∈ℕ
is a.s. bounded, which implies that
(Xk)k∈ℕ
is bounded
ℙ
-a.s. We next claim that
(‖
‖
X
k
−p
‖
‖
)
k∈ℕ
converges to a
[0, ∞)
-valued random
variable
ℙ
-a.s. Indeed, take
𝜔∈Ω
such that
𝜙k(p,𝜔)≡𝜙k(𝜔)
is bounded.
Suppose there exists
𝚝1(𝜔)∈[0, ∞),
𝚝
2(𝜔)∈[0, ∞)
, and subsequences
(𝜙
kj
(𝜔))
j∈
ℕ
and
(𝜙
l
j(𝜔))
j∈
ℕ
such that
𝜙
k
j(𝜔)→𝚝
1
(𝜔)
and
𝜙
l
j(𝜔)→𝚝
2
(𝜔)>𝚝
1
(𝜔)
. Then,
lim
j→∞
Q
k
j(p)(
𝜔
)=Q
∞
(p)(
𝜔
)=(1−
𝛼
)
𝚝1
(
𝜔
)
<
(1−
𝛼
)
𝚝2
(
𝜔
)=lim
j→∞
Q
l
j(p)(
𝜔
)=Q
∞
(p)(
𝜔
)
, a
contradiction. It follows that
𝚝1(𝜔)=𝚝2(𝜔)
and, in turn,
𝜙k(𝜔)→𝚝(𝜔)
. Thus, for
each
p∈𝖲
,
𝜙k(p)→𝚝
ℙ
-a.s.
Since we assume that
𝖧
is separable, [67, Prop 2.3(iii)] guarantees that there
exists a set
Ω0∈F
with
ℙ(Ω0)=1
, and, for every
𝜔∈Ω
0
and every
p∈𝖲
, t he
sequence
(‖
‖
X
k
(
𝜔
)−p
‖
‖
)
k∈ℕ
converges.
We next show that all weak limit points of
(Xk)k∈ℕ
are contained in
𝖲
. Let
𝜔∈Ω
such that
(Xk(𝜔))k∈ℕ
is bounded. Thanks to [3, Lemma 2.45], we can find a weakly
convergent subsequence
(X
k
j(𝜔))
j∈
ℕ
with limit
𝜒(𝜔)
, i.e. for all
u∈𝖧
we have
lim
j→∞
Xkj(𝜔),u
=
𝜒(𝜔),u
. This implies
showing that
Z
k
j(𝜔)⇀𝜒(𝜔)
. Along this weakly converging subsequence, define
Clearly,
𝗋𝖾𝗌
𝜆k
j
(Zkj(𝜔)) =
‖
‖
‖
rkj(𝜔)
‖
‖
‖
, so that
lim
j→∞
r
k
j(𝜔)=0
. By definition
lim
k
→∞
(
2𝛼2
k−(1−𝛼k)
(
1−5(1−𝛼k)
4𝜌k(1+L𝜆k)
))‖
‖
Xk(𝜔)−Xk−1(𝜔)
‖
‖
2=
0, and
lim
k→∞
𝜌k
4
𝗋𝖾𝗌2
𝜆k
(Zk(𝜔)) = 0.
[
2𝛼2
k+(1−𝛼k)
(
1−
5(1−𝛼
k
)
4𝜌
k
(1+L𝜆
k
)
)]≤
−𝜀
1−𝜀 (2𝛼 2+1−𝛼 )<
0.
𝜙k(p,𝜔)≤C1(𝜔)+𝛼k𝜙k−1(p,𝜔)≤C1(𝜔)+ 𝛼 𝜙 k−1(p,𝜔)∀k≥1.
𝜙
k(p,𝜔)
≤C
1
(𝜔)
1−𝛼
+𝛼 k𝜙1(p,𝜔)
.
lim
j→∞
Zkj(𝜔),u
=lim
j→∞
Xkj(𝜔),u
+lim
j→∞
𝛼kj
Xkj(𝜔)−Xkj−1(𝜔),u
=
𝜒(𝜔),u
,
r
kj(𝜔)
≜
Zkj(𝜔)−J𝜆k
j
T(Zkj(𝜔)−𝜆kjV(Zkj(𝜔)))
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
489
1 3
Stochastic relaxed inertial forward‑backward‑forward…
Since V and
F=T+V
are maximally monotone, their graphs are sequentially
closed in the weak-strong topology
𝖧weak ×𝖧strong
[3, Prop. 20.33(ii)]. Therefore, by
the strong convergence of the sequence
(r
k
j(𝜔))
j∈
ℕ
, we deduce weak convergence of
the sequence
(Z
k
j(
𝜔
)−r
k
j(
𝜔
)
,
Z
k
j(
𝜔
))
j∈ℕ
⇀
(
𝜒
(
𝜔
)
,𝜒
(
𝜔
))
. Therefore
1
𝜆
rk
j
(𝜔)−V(Zk
j
(𝜔
))
+
V
(
Zkj(𝜔)−rkj(𝜔)
)
→
0
. Hence,
0∈(T+V)(𝜒(𝜔))
, showing that
𝜒(𝜔)∈𝖲
.
Invoking [67, Prop 2.3(iv)], we conclude that
(Xk)k∈ℕ
converges weakly
ℙ
-a.s to an
𝖲
-valued random variable.
We now establish (iii). Let
qk≜
𝔼[Q
k
(p
)]
, so that (42) yields the recursion
By Assumption 5, and the definition of all sequences involved, we see that
∑∞
k=1
𝜓
k
<
∞
. Hence, a telescopian argument gives
Hence, for all
k≥1
, rearranging the above reveals
Letting
k→∞
, we conclude
(
𝔼[𝜃k]
)k∈
ℕ∈𝓁
1
+
(ℕ
)
. Classically, this implies
𝜃k→0
ℙ
-a.s. By a simple majorization argument, we deduce that
ℙ
-a.s.
◻
Remark 2 The above result gives some indication of the balance between the inertial
effect and the relaxation effect. Our analysis revealed that the maximal value of the
relaxation parameter is
𝜌≤
5(1−
𝜀
)(1−
𝛼
)
2
4(1+L𝜆)(2𝛼
2
−𝛼+1)
. This is closely aligned with the maximal
relaxation value exhibited in Remark 2.13 of [2]. Specifically, the function
𝜌
m(𝛼,𝜀)= 5(1−
𝜀
)(1−
𝛼
)
2
4(1+L𝜆)(2𝛼
2
−𝛼+1)
. This function is decreasing in
𝛼
. For this choice of param-
eters, one observes that for
𝛼→0
we get
𝜌
→
5(1−𝜀)
4(1+L𝜆)
and for
𝛼→1
it is observed
𝜌→0
.
1
𝜆
k
j
rkj(𝜔)−V(Zkj(𝜔)) + V
(
Zkj(𝜔)−rkj(𝜔)
)
∈F(Zkj(𝜔)−rkj(𝜔))
.
qk≤qk−1−𝔼[𝜃k]+𝜓k.
q
k−q0=
k
∑
i=1
(qi−qi−1)
≤
−
k
∑
i=1
𝔼[𝜃i]+
k
∑
i=1
𝜓i
≤
−
k
∑
i=1
𝔼[𝜃i]+
∞
∑
i=1
𝜓i
.
k
∑
i=1
𝔼[𝜃i]
≤
q0+
∞
∑
i=1
𝜓i<∞
.
∞
>
∞
∑
k=1
{
𝜌k
8
𝗋𝖾𝗌2
𝜆k
(Zk)−
[
2𝛼2
k+(1−𝛼k)
(
1−
5(1−𝛼k)
4𝜌k(1+L𝜆k)
)]
Δk
}
≥
∞
∑
k=1[
(1−𝛼k)
(
5(1−𝛼k)
4𝜌
k
(1+L𝜆
k
)−1
)
−2𝛼2
k
]
Δk.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
490
S.Cui et al.
1 3
As an immediate corollary of Theorem2, we obtain a convergence result when
all parameter sequences are constant.
Corollary 5 (Asymptotic convergence under constant inertia and relaxation) Let the
same Assumptions as in Theorem2 hold. Consider Algorithm RISFBF with the con-
stant parameter sequences
𝛼
k
≡
𝛼∈(0, 1),𝜆k
≡
𝜆∈(0,
1
4L)
and
𝜌
k=𝜌< 5(1−𝛼)
2
4(1+L𝜆)(2𝛼
2
+1−𝛼)
. Then
(Xk)k∈ℕ
converges weakly
ℙ
-a.s. to a limiting random
variable with values in
𝖲
.
In fact, the a.s. convergence with a larger
𝜆k
is allowed as shown in the following
corollary.
Corollary 6 (Asymptotic convergence under larger steplength) Let the same Assump-
tions as in Theorem2 hold. Consider Algorithm RISFBF with the constant parame-
ter sequences
𝛼
k
≡
𝛼∈(0, 1),𝜆k
≡
𝜆∈(0,
1−𝜈
2L)
and
𝜌
k=𝜌< (3−
𝜈
)(1−
𝛼
)
2
2(1+L𝜆)(2𝛼
2
+1−𝛼)
, where
0<𝜈<1
. Then
(Xk)k∈ℕ
converges weakly
ℙ
-a.s. to a limiting random variable with
values in
𝖲
.
Proof First we make a slight modification to (27) that the following relation holds
for
0<𝜈<1
Then similarly with (29), we multiply both sides of (28) by
𝜌
k
((1−𝜈)−2L𝜆
k
)
1+L𝜆k
, which is
positive since
𝜆
k∈(0,
1−𝜈
2L)
. The convergence follows in a similar fashion to Theo-
rem2.
◻
Another corollary of Theorem2 is a strong convergence result, assuming that F is
demiregular (cf. Definition 1).
Corollary 7 (Strong Convergence under demiregularity) Let the same Assumptions
as in Theorem2 hold. If
F=T+V
is demiregular, then
(Xk)k∈ℕ
converges strongly
ℙ
-a.s. to a
𝖲
-valued random variable.
Proof Set
y
k
j
(𝜔)
≜
Zk
j
(𝜔)−rk
j
(𝜔
)
, and
u
k
j
(𝜔)
≜1
𝜆
rk
j
(𝜔)−V(Zk
j
(𝜔)) + V(Zk
j
(𝜔)−rk
j
(𝜔
))
.
We know from the proof of Theorem 2 that
y
k
j(𝜔)⇀𝜒(𝜔)
and
u
k
j(𝜔)→0
. If
F=T+V
is demiregular then
y
k
j(𝜔)→𝜒(𝜔)
. Since we know
r
k
j(𝜔)→0
, we con-
clude
Z
k
j(𝜔)→𝜒(𝜔)
. Since
Zk
and
Xk
have the same limit points, it follows
Xk→𝜒
.
◻
(1−𝜌k)�
�Zk−p�
�
2+𝜌k�
�Rk−p�
�
2−
1−𝜌
k
𝜌k
�
�Xk+1−Zk�
�
2
≤�
�
Zk−p
�
�
2−1−𝜌k
𝜌k
�
�
Xk+1−Zk
�
�
2−𝜌k((1−𝜈)−2L2𝜆2
k)
�
�
Zk−Yk
�
�
2
+2
𝜆
2
𝜌k�
�
𝚎
k�
�
2−2
𝜌k𝜆k⟨
W
k
+p∗,Y
k
−p
⟩
−
𝜌k𝜈�
�
Y
k
−Z
k�
�
2+2
𝜌k𝜆k⟨
V(Y
k
)−V(p),p−Y
k⟩.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
491
1 3
Stochastic relaxed inertial forward‑backward‑forward…
4.2 Linear convergence
In this section, we derive a linear convergence rate and prove strong convergence
of the last iterate in the case where the single-valued operator V is strongly mono-
tone. Various linear convergence results in the context of stochastic approximation
algorithms for solving fixed-point problems are reported in [68] in the context of
the random sweeping processes. In a general structured monotone inclusion setting
[69] derive rate statements for cocoercive mean operators in the context of forward-
backward splitting methods. More recently, Cui and Shanbhag [27] provide linear
and sublinear rates of convergence for a variance-reduced inexact proximal-point
scheme for both strongly monotone and monotone inclusion problems. However, to
the best of our knowledge, our results are the first published for a stochastic operator
splitting algorithm, featuring relaxation and inertial effects. Notably, this result does
not require imposing Assumption4(i) (i.e. the noise process be conditionally unbi-
ased.) Instead our derivations hold true under a weaker notion of an asymptotically
unbiased SO.
Assumption 6 (Asymptotically unbiased SO) There exists a constant
𝚜>0
such that
for all
k≥1
.
This definition is rather mild and is imposed in many simulation-based optimiza-
tion schemes in finite dimensions. Amongst the more important ones is the simulta-
neous perturbation stochastic approximation (SPSA) method pioneered by Spall [70,
71]. In this scheme, it is required that the gradient estimator satisfies an asymptotic
unbiasedness requirement; in particular, the bias in the gradient estimator needs to
diminish at a suitable rate to ensure asymptotic convergence. In fact, this setting
has been investigated in detail in the context of stochastic Nash games [72]. Further
examples for stochastic approximation schemes in a Hilbert-space setting obeying
Assumption 6 are [73, 74] and [35]. We now discuss an example that further clari-
fies the requirements on the estimator.
Example 3 Let
{
V
k
(x,
𝜉
)}
k∈ℕ
be a collection of independent random
𝖧
-valued vector
fields of the form
Vk
(x,
𝜉
)=V(x)+
𝜀k
(x,
𝜉)
such that
where
𝜎 > 0
and
b>0
such that
(Bk)k∈ℕ
is an
𝖧
-valued sequence satisfying
‖
B
k‖
2
≤
b
2
in an a.s. sense. These statistics can be obtained as
(43)
𝔼
[
‖
‖
Uk
‖
‖
2
|
Fk]
≤
𝚜
2
mk
and 𝔼[
‖
‖
Wk
‖
‖
2
|
Fk]
≤
𝚜
2
mk
,ℙ−
a.s.
𝔼
𝜉[𝜀k(x,𝜉)
x]=
B
k
mk
and 𝔼𝜉[
𝜀k(x,𝜉)
2
x]
≤
𝜎 2ℙ−
a.s.,
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
492
S.Cui et al.
1 3
Setting
𝚜
2
≜𝜎
2
+
b2
, we see that condition (43) holds. A similar estimate holds for
the random noise
‖
‖
W
k‖
‖
2
.
Assumption 7
V∶𝖧→𝖧
is
𝜇
-strongly monotone (
𝜇>0
), i.e.
Combined with Assumption1, strong monotonicity implies that
𝖲={x}
for some
x∈𝖧
.
Remark 3 In the context of a structured operator
F=T+V
, the assump-
tion that the single-valued part V is strongly monotone can be done without loss
of generality. Indeed, if instead T is assumed to be
𝜇
-strongly monotone, then
(V+𝜇Id)+(T−𝜇Id )
is maximally monotone and Lipschitz continuous while
V≜V+𝜇Id
may be seen to be
𝜇
-strongly monotone operator.
Our first result establishes a “perturbed linear convergence” rate on the anchor
sequence
(‖
‖
Xk−x
‖
‖
2
)k∈ℕ
, similar to the one derived in [68, Corollary 3.2] in the
context of randomized fixed point iterations.
Theorem8 (Perturbed linear convergence) Consider RISFBF with
X0=X1
. Suppose
Assumptions 1-3, Assumption 6 and Assumption 7 hold. Let
𝖲={x}
denotes the
unique solution of (MI). Suppose
𝜆
k
≡
𝜆
≤
min
{
a
2𝜇
,b𝜇,1−a
2
L}
, where
0<a,b<1
,
L
2
≜
L2+
1
2
,
𝜂k≡𝜂≜(1−b)𝜆𝜇
. Define
Δ
Mk
≜
2𝜌k
‖
‖
Wk
‖
‖
2+(3−a)
𝜌
k
𝜆2
k
1+
L𝜆k
‖
‖
𝚎k
‖
‖
2
. Let
(𝛼k)k∈ℕ
be a non-decreasing sequence such that
0<𝛼
k≤𝛼 < 1
, and define
𝜌
k
≜
(3−a)(1−
𝛼
k)
2
2(2
𝛼
2
k
−0.5
𝛼k
+1)(1+
L
𝜆)
for every
k∈ℕ
. Set
where
q=1−𝜌𝜂 ∈(0, 1)
,
𝜌
=16(3−a)(1−
𝛼
)
2
31(1+
L𝜆)
. Then the following hold:
𝔼
[
Uk
2Fk]=𝔼
1
mk
mk
t=1
𝜀t(Zk)
2
Fk
=1
m2
k
mk
t=1
𝔼[
𝜀t(Zk)
2Fk]+ 2
m2
k
mk
t=1
l>t
𝔼[𝜀t(Zk),𝜀l(Zk)Fk
]
≤𝜎 2
m
k
+(mk−1)
m
k
Bk
2
m
k
≤𝜎 2+
b2
m
k
ℙ−a.s.
(44)
⟨
V
(
x
)−
V
(
y
),
x
−
y
⟩≥𝜇‖
x
−
y
‖2∀
x
,
y
∈ dom
V
=𝖧.
(45)
H
k
≜‖
‖Xk−x
‖
‖
2+(1−𝛼k)
(
3−a
2𝜌k(1+
L𝜆)−1
)‖
‖Xk−Xk−1
‖
‖
2−𝛼k
‖
‖Xk−1−x
‖
‖
2
,
c
k≜𝔼[ΔMk
|
Fk], and ck≜
k
∑
i=1
qk−i𝔼[ci
|
F1],
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
493
1 3
Stochastic relaxed inertial forward‑backward‑forward…
(i)
(
c
k
)
k∈
ℕ∈𝓁
1
+
(ℕ
)
.
(ii) For all
k≥1
In particular, this implies a perturbed linear rate of the sequence
(‖Xk−x‖2)k∈ℕ
as
(iii)
∞
k=1(1−𝛼k)
3−a
2
𝜌k
(1+
L
𝜆
)−1
Xk−Xk−1
2<
∞
ℙ-a.s.
.
Proof Our point of departure for the analysis under the stronger Assumption7 is eq.
(23), which becomes
Repeating the analysis of the previous section with reference point
p=x
and
p∗=0
, the unique solution of (MI), yields the bound
The triangle inequality
‖
‖
Z
k
−x
‖
‖
2≤
2
‖
‖
Y
k
−Z
k‖
‖
2
+2
‖
‖
Y
k
−x
‖
‖
2
gives
By (9), we have for all
c>0
Observe that this estimate is crucial in weakening the requirement of conditional
unbiasedness. Choose
c
=
𝜆k
2
to get
Assume that
𝜆k𝜇≤a
2<1
. Then,
(46)
𝔼
[H
k+1|
F
1
]
≤
q
k
H
1
+c
k.
(47)
𝔼
[
‖
‖
Xk+1−x
‖
‖
2
|
F0]
≤
qk
(2(1−𝛼
1
)
1−𝛼 ‖
‖
X1−x
‖
‖
2
)
+2
1−𝛼
ck
.
⟨
Z
k
−R
k
,Y
k
−p
⟩≥
𝜆
k⟨
W
k
+p∗,Y
k
−p
⟩
+𝜆
k
𝜇
�
�
Y
k
−p
�
�
2
∀(p,p∗) ∈ gr (F)
.
�
�
Rk−x
�
�
2≤�
�
Zk−x
�
�
2
−(1−2L2𝜆2
k)
�
�
Yk−Zk
�
�
2
+2𝜆2
k
�
�
𝚎k
�
�
2
+2𝜆
k⟨
W
k
,x−Y
k⟩
−2𝜆
k
𝜇
�
�
x−Y
k�
�
2.
�
�
Rk−x
�
�
2≤�
�
Zk−x
�
�
2
−(1−2L2𝜆2
k)
�
�
Yk−Zk
�
�
2
+2𝜆2
k
�
�
𝚎k
�
�
2
+2𝜆k
⟨
Wk,x−Yk
⟩
+2𝜆
k
𝜇
�
�
Y
k
−Z
k�
�
2−𝜆
k
𝜇
�
�
Z
k
−x
�
�
2.
(48)
Wk,x−Yk
≤1
2c
Wk
2+
c
2
Yk−x
2
≤1
2c
Wk
2+c
Zk−x
2+
Zk−Yk
2
.
‖
‖
Rk−x
‖
‖
2≤‖
‖Zk−x
‖
‖
2
−(1−2L2𝜆2
k)
‖
‖Yk−Zk
‖
‖
2
+2𝜆2
k
‖
‖𝚎k
‖
‖
2
+2
‖
‖Wk
‖
‖
2
+𝜆2
k
‖
‖x−Zk
‖
‖
2
+2𝜆k𝜇‖
‖Yk−Zk‖
‖
2−𝜆k𝜇‖
‖Zk−x‖
‖
2+𝜆2
k
‖
‖Zk−Yk‖
‖
2
=(1+𝜆2
k−𝜆k𝜇)
‖
‖
Zk−x
‖
‖
2+2𝜆2
k
‖
‖
𝚎k
‖
‖
2+2
‖
‖
Wk
‖
‖
2
−(1−2L2𝜆2
k
−2𝜆
k
𝜇−𝜆2
k
)
‖
‖
Y
k
−Z
k‖
‖
2.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
494
S.Cui et al.
1 3
where
L
2≜
L
2+1∕2
. Moreover, choosing
𝜆k≤
b
𝜇
, we see
Using these bounds, we readily deduce for
0<𝜆
k
≤
min{
a
2𝜇
,b
𝜇}
, that
Proceeding as in the derivation of eq. (30), one sees first that
and therefore,
Define
𝜂k=(1−b)𝜆k𝜇
. Using the equality (25),
1
−2L
2
𝜆
2
k
−2𝜆
k
𝜇−𝜆
2
k≥
(1−a)−2L
2
𝜆
2
k
−𝜆
2
k
=(1−a)−2
L
2
𝜆
2
k,
1
+𝜆
2
k
−𝜆
k
𝜇
≤
1−(1−b)𝜆
k
𝜇
.
(49)
‖
‖
Rk−x
‖
‖
2≤(
1−(1−b)𝜆k𝜇
)‖
‖
Zk−x
‖
‖
2
−
(
(1−a)−2
L2𝜆2
k
)‖
‖
Yk−Zk
‖
‖
2
+2𝜆2
k‖
‖
𝚎
k‖
‖
2+2
‖
‖
W
k‖
‖
2.
1
2
𝜌2
k
‖
‖
Xk+1−Zk
‖
‖
2
≤
(1+
L𝜆k)2
‖
‖
Yk−Zk
‖
‖
2+𝜆2
k
‖
‖
𝚎k
‖
‖
2
,
(50)
−
𝜌k((1−a)−2
L2𝜆2
k)‖
‖Yk−Zk‖
‖
2
≤
−(1−a)−2
L𝜆k
2𝜌k(1+
L𝜆k)
‖
‖Xk+1−Zk‖
‖
2
+
𝜌k𝜆2
k((1−a)−2
L𝜆k)
1+
L𝜆k
‖
‖
𝚎k
‖
‖
2
.
Xk+1−x
2
=(1−𝜌k)
Zk−x
2
+𝜌k
Rk−x
2
−
1−𝜌
k
𝜌k
Xk+1−Zk
2
(49)
≤(1−𝜌k𝜂k)
Zk−x
2−1−𝜌k
𝜌k
Xk+1−Zk
2−𝜌k((1−a)−2
L2𝜆2
k)
Zk−Yk
2
+2𝜆2
k𝜌k
𝚎k
2+2𝜌k
Wk
2
(50)
≤(1−𝜌k𝜂k)
Zk−x
2−(3−a)−2𝜌k(1+
L𝜆k)
2𝜌k(1+
L𝜆k)
Xk+1−Zk
2+2𝜌k
Wk
2+(3−a)𝜌k𝜆2
k
1+
L𝜆k
𝚎k
2
(32),(31)
≤(1−𝜌k𝜂k)[(1+𝛼k)
Xk−x
2−𝛼k
Xk−1−x
2+𝛼k(1+𝛼k)
Xk−Xk−1
2]
−(3−a)−2𝜌k(1+
L𝜆k)
2𝜌k(1+
L𝜆k)[(1−𝛼k)
Xk+1−Xk
2+(𝛼2
k−𝛼k)
Xk−Xk−1
2]
+2𝜌k
Wk
2+(3−a)𝜌k𝜆2
k
(1+
L𝜆k)
𝚎k
2
≤(1+𝛼k)(1−𝜌k𝜂k)Xk−x2−𝛼k(1−𝜌k𝜂k)Xk−1−x2+ΔMk
+𝛼k
Xk−Xk−1
2(1+𝛼k)(1−𝜌k𝜂k)+(𝛼k−1)+ (3−a)(1−𝛼k)
2𝜌k(1+
L𝜆k)
−(1−𝛼k)
3−a
2
𝜌k
(1+
L
𝜆k
)−1
Xk+1−Xk
2,
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
495
1 3
Stochastic relaxed inertial forward‑backward‑forward…
with stochastic error term
Δ
Mk
≜
2𝜌k
‖
‖
Wk
‖
‖
2+(3−a)𝜌k𝜆
2
k
1+
L𝜆k
‖
‖
𝚎k
‖
‖
2
. From here, it follows
that
Since
𝜆k=𝜆
, and
𝜌
k=(3−a)(1−𝛼k)
2
2(2𝛼2
k
−0.5𝛼
k
+1)(1+
L𝜆
)
, we claim that
𝜌
k
≤1−𝛼
k
(1+4𝛼k)𝜂
for
𝜂≡(1−b)𝜆𝜇
. Indeed,2
In particular, this implies
𝜂𝜌k∈(0, 1)
for all
k∈ℕ
. We then have
Next, we show that
H
k
≥1−𝛼
2‖
Xk−x
‖2
, for
Hk
defined in (45). This can be seen
from the next string of inequalities:
(51)
Xk+1−x
2+(1−k)
3−a
2k(1+
Lk)−1
Xk+1−Xk
2−kXk−x2
≤(1−kk)
Xk−x
2+(1−k)3−a
2k(1+
Lk)−1
Xk−Xk−1
2−kXk−1−x2
−(1−kk)(1−k)3−a
2k(1+
Lk)−1
−k(1+k)(1−kk)+(k−1)+ (3−a)(1−k)
2k(1+
Lk)
Xk−Xk−1
2
−kkk
Xk−x
2+ΔMk
=(1−kk)
Xk−x
2+(1−k)3−a
2k(1+
Lk)−1
Xk−Xk−1
2−kXk−1−x2
−(1−k−kk)(3−a)(1−k)
2k(1+
Lk)−1−2
k(2−kk)
≜
I
Xk−Xk−1
2−kkk
Xk−x
2+ΔMk
.
1−𝛼
k
(1+4𝛼k)𝜂
𝜌k
=2(2𝛼2
k−0.5𝛼k+1)(1+
L𝜆)
(3−a)(1−𝛼k)(1+4𝛼k)𝜂
≥
2(2𝛼2
k−0.5𝛼k+1)(1+
L𝜆)
(3−a)(1−𝛼k)(1+4𝛼k)a
2
(1−b)
≥
2⋅
31
32
⋅1
25
16
⋅1
=31
25 >
1.
(52)
I
=(1−𝛼k−𝜌k𝜂)
(
(3−a)(1−𝛼k)
2𝜌k(1+
L𝜆)−1
)
−𝛼2
k(2−𝜌k𝜂
)
≥(1−𝛼k−1−𝛼k
1+4𝛼k)(2𝛼2
k−0.5𝛼k+1
1−𝛼k
−1)−2𝛼2
k
=(1−𝛼k)4𝛼k
1+4𝛼k
⋅
2𝛼2
k+0.5𝛼k
1−𝛼k
−2𝛼2
k(1+4𝛼k)
1+4𝛼k
=0.
2 To wit, the function
x
↦
2x2−0.5x+1
is attains a global minumum at x
=1∕8
, which gives the
global lower bound 31/32. Furthermore, the function
x
↦
(1−
x
)(1+4
x
)
attains a global maximum at
x=3∕8
, with corresponding value 25/16.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
496
S.Cui et al.
1 3
In this derivation we have used the (9) to estimate
1
−𝛼k
2
‖
‖Xk−x‖
‖
2+
2𝛼
2
k
1−𝛼
k
‖
‖
X
k
−X
k−1
‖
‖
2
≥2𝛼
k
‖
‖
X
k
−x‖
‖
⋅‖
‖
X
k
−X
k−1
‖
‖,
and the specific choice
𝜌
k=(3−a)(1−𝛼k)
2
2(2𝛼2
k
−1
2
𝛼k+1)(1+
L𝜆
)
.
By recalling (51) and invoking (52), we are left with the stochastic recursion
where
qk≜
1−𝜌
k𝜂
and
bk≜𝛼k𝜌k𝜂k‖Xk−x‖2.
Since
𝜌
k=(3−a)(1−𝛼k)
2
2(2𝛼2
k−1
2
𝛼k+1)(1+
L𝜆)≥𝜌=16(3−a)(1−𝛼 )2
31(1+
L𝜆)
for every k, we have that
qk≤q=1−𝜂𝜌
for every k. Furthermore,
1> 𝜂𝜌k≥𝜂𝜌
, so
that
q∈(0, 1)
. Taking conditional expectations on both sides on (53), we get
using the notation
ck≜
𝔼
[ΔMk|
F
k]
. Applying the operator
𝔼[⋅|Fk−1]
and using the
tower property of conditional expectations, this gives
Proceeding inductively, we see that
H
k=
Xk−x
2−𝛼k
Xk−1−x
2+(1−𝛼k)
3−a
2𝜌k(1+
L𝜆)−1
Xk−Xk−12
≥
Xk−x
2+(1−𝛼k)(2𝛼2
k+1−0.5𝛼k)
(1−𝛼k)2−1+𝛼k
Xk−Xk−1
2
−𝛼k
Xk−1−x
2
≥
Xk−x
2+(1−𝛼k)(2𝛼2
k+1−𝛼k)
(1−𝛼k)2−1+𝛼k
Xk−Xk−1
2
−𝛼k
Xk−1−x
2
=𝛼k+
1−𝛼k
2
Xk−x
2+𝛼k+
2𝛼2
k
1−𝛼k
Xk−Xk−1
2−𝛼k
Xk−1−x
2
+
1−𝛼k
2
Xk−x
2
≥𝛼k
Xk−x
2+
Xk−Xk−1
2−𝛼k
Xk−1−x
2
+2𝛼k
Xk−x
⋅
Xk−Xk−1
+
1−𝛼k
2
Xk−x
2
≥𝛼k
Xk−x
+
Xk−Xk−1
2−𝛼k
Xk−1−x
2
+
1−𝛼k
2
Xk−x
2≥
1−𝛼k
2
Xk−x
2
≥1−𝛼
2
Xk−x
2.
(53)
Hk+1≤qkHk−
bk+ΔMk.
𝔼
[H
k+1|
F
k
]+
b
k≤
qH
k
+c
k
ℙ
-a.s.
𝔼
[H
k+1|
F
k−1
]
≤
q
2
H
k−1
−q𝔼[
b
k−1|
F
k−1
]−𝔼[
b
k|
F
k−1
]+q𝔼[c
k−1|
F
k−1
]+𝔼[c
k|
F
k−1
]
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
497
1 3
Stochastic relaxed inertial forward‑backward‑forward…
This establishes eq. (46). To validate eq. (47), recall that we assume
X1=X0
, so that
H1
=(1−𝛼
1
)
‖
‖
X
1
−x
‖
‖
2
. Furthermore,
Hk+1≥1−𝛼
2
‖
‖
X
k+1
−x
‖
‖
2
, so that
We now show that
(
c
k
)
k∈
ℕ∈𝓁
1
+
(ℕ
)
. Simple algebra, combined with Assumption6,
gives
Hence, since
(𝜌k)k∈ℕ
is bounded, Assumption 5 gives
limk→∞ck=0
a.s. Using again
the tower property, we see
𝔼
[ck
|
F1]=𝔼
[
𝔼(ck
|
Fk)
|
F1
]≤
𝜅𝜌k𝚜
2
mk
≤
𝜅𝜌 𝚜2
mk
, where
𝜌
k=(3−a)(1−𝛼k)
2
2(2𝛼2
k
−1
2
𝛼
k
+1)(1+
L𝜆)
≤
𝜌 =3−a
2(1+
L𝜆
)
for every k. Consequently, the discrete convolu-
tion
k−1
i=1qk−i𝔼[ci
F1]
k∈ℕ
is summable. Therefore
∑k≥1
𝔼[H
k
]
<∞
and
∑k≥1
𝔼[
b
k
]<
∞
. Clearly, this implies
limk→∞
𝔼
[
bk]=0,
and consequently the sub-
sequently stated two implication follow as well:
◻
Remark 4 It is worth remarking that the above proof does not rely on unbiasedness
of the random estimators. The reason why we can lift this rather typical assumption
lies in our application Young’s inequality in the estimate (48). The only assump-
tion needed is a summable oracle variance as formulated in Assumption 6 to get the
above result working.
Remark 5 The above result illustrates again nicely the well-known trade-off between
relaxation and inertial effects (cf. Remark2). Indeed, up to constant factors, the cou-
pling between inertia and relaxation is expressed by the function
𝛼
↦(1−
𝛼
)
2
2𝛼2−1
2
𝛼+
1
.
Basic calculus reveals that this function is decreasing for
𝛼
increasing. In the extreme
case when
𝛼↑1
, it is necessary to let
𝜌↓0
, and vice versa. When
𝛼→0
then the
limiting value of our specific relaxation policy is
3−a
1+
L𝜆
. In practical applications, it is
𝔼
[Hk+1
|
F1]
≤
qkH1+
k−1
∑
i=1
qk−i𝔼[ci
|
F1]=qkH1+ck
.
𝔼
[
‖
‖
Xk+1−x
‖
‖
2
|
F1]
≤
qk
(2(1−𝛼
1
)
1−𝛼 ‖
‖
X1−x
‖
‖
2
)
+2
1−𝛼
ck
.
(54)
c
k=𝔼[ΔMk|Fk]
≤
2𝜌k
(
1+(3−a)𝜆2
1+
L𝜆
)
𝔼[‖
‖Wk‖
‖
2|Fk]+
2(3−a)𝜌k𝜆
2
1+
L𝜆
𝔼[‖
‖Uk‖
‖
2|F1
]
≤
2𝚜2𝜌k
mk(
1+2(3−a)𝜆2
1+
L𝜆)
≡𝜌k𝚜2
mk
𝜅.
lim
k
→∞
‖
‖Xk−x
‖
‖=0ℙ-a.s., and
∞
∑
k=1
(1−𝛼k)
(
3−a
2𝜌k(1+
L𝜆)−1
)‖
‖
Xk−Xk−1
‖
‖
2<∞ℙ
-a.s..
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
498
S.Cui et al.
1 3
advisable to choose b small in order to make q large. The value a must be calibrated
in a disciplined way in order to allow for a sufficiently large step size
𝜆
. This requires
some knowledge of the condition number of the problem
𝜇∕L
. As a heuristic argu-
ment, a good strategy, anticipating that b should be close to 0, is to set
a
2𝜇
=
1−a
2
L
.
This means
a=𝜇
L+𝜇
.
We obtain a full linear rate of convergence when a more aggressive sample rate is
employed in the SO. We achieve such global linear rates, together with tuneable iter-
ation and oracle complexity estimates in two settings: First, we consider an aggres-
sive simulation strategy, where the sample size grows over time geometrically. Such
a sampling frequency can be quite demanding in some applications. As an alterna-
tive, we then move on and consider a more modest simulation strategy under which
only polynomial growth of the batch size is required. Whatever simulation strategy
is adopted, key to the assessment of the iteration and oracle complexity is to bound
the stopping time
In order to understand the definition of this stopping time, recall that RISFBF com-
putes the last iterate
XK+1
by extrapolating between the current base point
Zk
and
the correction step involving
Yk+𝜆K(Ak−Bk)
, which requires
2mk
iid realizations
from the law
𝖯
. In total, when executing the algorithm until the terminal time
K𝜖
, we
therefore need to simulate
2∑K
𝜖
k=1
m
k
random variables. We now estimate the integer
K𝜖
under a geometric sampling strategy.
Proposition 9 (Non-asymptotic linear convergence under geometric sampling) Sup-
pose the conditions of Theorem8 hold. Let p∈(0, 1),𝙱=2𝜌 𝚜2
(
1+2(3−a)𝜆2
1+
L𝜆),
and
choose the sampling rate
mk
=
⌊
p
−k⌋
. Let
p∈(p,1)
, and define
Then, whenever
p≠q
, we see that
and whenever
p=q
,
In particular, the stochastic process
(Xk)k∈ℕ
converges strongly and
ℙ
-a.s. to the
unique solution
x
at a linear rate.
(55)
K
𝜖
≜
inf{k∈ℕ
|
𝔼
(‖
‖
Xk+1−x
‖
‖
2
)≤
𝜖}
.
(56)
C
(p,q)
≜2(1−𝛼
1
)
1−𝛼
‖
‖
X1−x
‖
‖
2
+
4
𝙱
(1−𝛼 )(1−min{p∕q,q∕p})
if p
≠
q
, and
(57)
C≜2(1−𝛼
1
)
1−𝛼
‖
‖
X1−x
‖
‖
2
+4𝙱
(1−𝛼 )exp(1)ln(p∕q)
if p=q
.
𝔼(‖
‖
Xk+1−x
‖
‖
2
)≤
C(p,q)max{p,q}k
,
𝔼(‖
‖
Xk+1−x
‖
‖
2
)≤
Cpk
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
499
1 3
Stochastic relaxed inertial forward‑backward‑forward…
Proof Departing from (53), ignoring the positive term
bk
from the right-hand side,
and taking expectations on both sides leads to
where the equality follows from
ck
being deterministic. The sequence
(ck)k∈ℕ
is fur-
ther upper bounded by the following considerations: First, the relaxation sequence is
bounded by
𝜌
k
≤
𝜌 =
3−a
2(1+
L𝜆)
; Second, the sample rate is bounded by
m
k=
p−k
≥
1
2p−k
≥
1
2p−
k
. Using these facts, eq. (54) yields
where
𝙱
=2𝜌 𝚜2
(
1+2(3−a)𝜆2
1+
L𝜆
)
. Iterating the recursion above, one readily sees that
Consequently, by recalling that
h1=(1−𝛼1)‖X1−x‖2
and
h
k
≥
1−
𝛼
2
𝔼(
‖
‖
Xk−x
‖
‖
2)
,
the bound (59) allows us to derive the recursion
We consider three cases.
(i)
0<q<p<1
: Defining
𝚌
1
≜2(1−𝛼
1
)
1−
𝛼
‖
‖
X1−x
‖
‖
2
+
4
𝙱
(1−
𝛼
)(1−q∕p
)
, we obtain from
(61)
(ii)
0<p<q<1
. Akin to (i) and defining
𝚌
2
≜2(1−𝛼
1
)
1−𝛼
‖
‖
X1−x
‖
‖
2
+
4𝙱
(1−𝛼 )(1−p∕q)
,
we arrive as above at the bound
𝔼
(
‖
‖
X
k
−x
‖
‖
2
)
≤
qk𝚌
2
.
(iii)
p=q<1
. Choose
p∈(q,1)
and
𝚌
3
≜1
exp(1)ln(p∕q)
, so that Lemma 18 yields
kqk≤
𝚌
3
pk
for all
k≥1
. Therefore, plugging this estimate in eq. (61), we see
(58)
1−𝛼
2
𝔼(
‖
‖
Xk+1−x
‖
‖
2)
≤
hk+1
≜
𝔼(Hk+1)
≤
q𝔼(Hk)+ck=qhk+ck
,
(59)
c
k
≤𝜌
k𝚜
2𝜅
mk
≤
2𝙱pk∀k
≥1,
(60)
h
k+1
≤
qkh1+
k
∑
i=1
qk−ici∀k
≥1.
(61)
𝔼
Xk+1−x
2
≤
qk
2(1−𝛼1)
1−𝛼
X1−x
2
+4𝙱
1−𝛼
k
i=1
qk−ipi
.
𝔼
(
‖
‖
Xk+1−x
‖
‖
2)
≤
qk
(
2(1−𝛼1)
1−𝛼
‖
‖
X1−x
‖
‖
2
)
+4𝙱
1−𝛼
k
∑
i=1
(q∕p)k−ipk
≤
𝚌1pk
.
𝔼
(‖
‖Xk−x‖
‖
2)
≤
qk
(
2(1−𝛼1)
1−𝛼 ‖
‖X1−x‖
‖
2
)
+4𝙱
1−𝛼
k
∑
i=1
q
k
≤pk(2(1−𝛼1)
1−𝛼
‖
‖
X1−x
‖
‖
2)+4𝙱
1−𝛼
𝚌3pk
=
𝚌
4
pk
,
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
500
S.Cui et al.
1 3
after setting
𝚌4≜
2
(
1
−𝛼
1
)
1−𝛼
‖
‖
X
1
−x
‖
‖
2
+
4
𝙱𝚌
3
1−𝛼
. Collecting these three cases
together, verifies the first part of the proposition.
◻
Proposition 10 (Oracle and Iteration Complexity under geometric sampling) Given
𝜖>0
, define the stopping time
K𝜖
as in eq. (55). Define
and the same hypothesis as in Theorem8 hold true. Then,
K𝜀≤𝜏𝜖
(p,q)=O(ln(
𝜀−1))
.
The corresponding oracle complexity of RISFBF is upper bounded as
2𝜏
𝜖
(p,q)
i=1
m
i
=O
(1∕𝜖)1+𝛿(p,q)
, where
Proof First, let us recall that the total oracle complexity of the method is assessed by
If
p≠q
define
𝜏
𝜖
≡
𝜏𝜖(p,q)=
⌈
ln(C(p,q)𝜖
−1
)
ln(1∕max{p,q}) ⌉
. Then,
𝔼
(
‖
‖
‖
X𝜏𝜖+1−x
‖
‖
‖
2
)
≤𝜖
, and hence
K𝜖≤𝜏𝜖
. We now compute
This gives the oracle complexity bound
If
p=q
, we can replicate this calculation, after setting
𝜏
𝜖=
⌈
ln(𝜖
−1
C)
ln(1∕
p
)⌉
. After so many
iterations, we can be ensured that
𝔼
(
‖
‖
‖
X𝜏𝜖+1−x
‖
‖
‖
2
)
≤𝜖
, with an oracle complexity
◻
(62)
𝜏
𝜖(p,q)≜
⎧
⎪
⎨
⎪
⎩
⌈
ln(C(p,q)𝜖−1)
ln(1∕max{p,q})
⌉
if p
≠
q
,
⌈
ln(
C𝜖−1)
ln(1∕p)
⌉
if p=q
𝛿
(p,q)≜
⎧
⎪
⎨
⎪
⎩
0 if p>q,
ln(p)
ln(q)−1 if p∈(0, q)
,
ln(p)
ln(p)−1 if p=q.
2
K
𝜖
�
i=1
mi=2
K
𝜖
�
i=1⌊
p−i
⌋≤
2
K
𝜖
�
i=1
p−i
.
𝜏𝜖
i=1
(1∕p)i=1
p
(1∕p)ln(C(p,q)𝜖
−1
)
ln(1∕max{p,q}) −1
1∕p−1≤1
p2
(1∕p)
ln(C(p,q)𝜖
−1
)
ln(1∕max{p,q})
1∕p−1
=
𝜖−1C(p,q)
ln(1∕p)∕ ln(1∕max{p,q})
p(1−p)
.
2
𝜏𝜖
∑
i=1
mi
≤
2
(
𝜖−1C(p,q)
)
ln(1∕p)∕ ln(1∕max{p,q})
p(1−p)
.
2
𝜏
𝜖
∑
i=1
mi
≤
2
p(1−p)
(
C
𝜖
)ln(p)∕ ln(p)
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
501
1 3
Stochastic relaxed inertial forward‑backward‑forward…
To the best of our knowledge, the provided non-asymptotic linear convergence
guarantee appears to be amongst the first in relaxed and inertial splitting algorithms.
In particular, by leveraging the increasing nature of mini-batches, this result no
longer requires the unbiasedness assumption on the SO, a crucial benefit of the pro-
posed scheme.
There may be settings where geometric growth of
mk
is challenging to adopt.
To this end, we provide a result where the sampling rate is polynomial rather than
geometric. A polynomial sampling rate arises if
mk
=
⌈
a
k
(k+k
0
)
𝜃
+b
k⌉
for some
parameters
ak,bk,𝜃>0
. Such a regime has been adopted in related mini-batch
approaches [75, 76]. This allows for modulating the growth rate by changing the
exponent in the sampling rate. We begin by providing a supporting result. We make
the specific choice
ak=bk=1
for all
k≥1
, and
k0=0
, leaving essentially the expo-
nent
𝜃>0
as a free parameter in the design of the stochastic oracle.
Proposition 11 (Polynomial rate of convergence under polynomially increasing
mk
)
Suppose the conditions of Theorem8 hold. Choose the sampling rate
mk=⌊k𝜃⌋
where
𝜃>0
. Then, for any
k≥1
,
Proof From the relation (60), we obtain
A standard bound based on the integral criterion for series with non-negative sum-
mands gives
The upper bounding integral can be evaluated using integration-by-parts, as follows:
(63)
𝔼
(‖
‖Xk+1−x‖
‖
2)
≤
qk
(
2(1−𝛼1)
1−𝛼 ‖
‖X1−x‖
‖
2+2
1−𝛼
q
−1
exp(2𝜃)−1
1−q
)
+4𝙱
(1−𝛼 )qln(1∕q)
(k+1)−𝜃
h
k+1
≤
qkh1+
k
i=1
qk−ici
≤
qkh1+𝙱
k
i=1
qk−ii−𝜃
=qkh1+𝙱
k
i=1
q−ii−𝜃
=qk
h1+𝙱2𝜃∕ln(1∕q)
i
=1
q−ii−𝜃+𝙱
k
i=2𝜃∕ln(1∕q)+1
q−ii−𝜃
.
k
�
i=⌈2𝜃∕ln(1∕q)⌉+1
q−ii−𝜃
≤
�k+1
⌈
2𝜃∕ln(1∕q)
⌉
(1∕q)t
t𝜃dt
.
∫k+1
⌈2𝜃∕ln(1∕q)⌉
(1∕q)t
t𝜃dt =t𝜃etln(1∕q)
ln(1∕q)
�
t=k+1
t=
⌈
2𝜃∕ln(1∕q)
⌉
+∫
k+1
⌈2𝜃∕ln(1∕q)⌉
𝜃t−(𝜃+1)etln(1∕q)
ln(1∕q)dt
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
502
S.Cui et al.
1 3
Note that
𝜃
tln(1∕q)≤1
2
when
t≥⌈2𝜃∕ln(1∕q)⌉
. Therefore, we can attain a simpler
bound from the above by
Consequently,
Furthermore,
Note that
(1∕q)
2
𝜃
∕ln(1∕q)
=(exp(ln(1∕q)))2𝜃∕ln(1∕q)=exp(2𝜃)
. Hence,
Plugging this into the opening string of inequalities shows
Since
h1
=(1−𝛼
1
)
‖
‖
X
1
−x
‖
‖
2
and
h
k+1
≥
1−𝛼
2
𝔼
(‖
‖
Xk+1−x
‖
‖
2
)
, we finally arrive at
the desired expression (63).
◻
Proposition 12 (Oracle and Iteration complexity under polynomial sampling)
Let all Assumptions as in Theorem8 hold. Given
𝜖>0
, define
K𝜖
as in (55). Then
the iteration and oracle complexity to obtain an
𝜖
-solution are
O(𝜃𝜀−1∕𝜃)
and
O(
exp
(𝜃)𝜃𝜃(
1
∕𝜖)1+1∕𝜃)
, respectively.
Proof We first note that
(k+1)−𝜃≤k−𝜃
for all
k≥1
. Hence, the bound established
in Proposition11 yields
�k+1
⌈2𝜃∕ln(1∕q)⌉
(1∕q)t
t𝜃dt
≤
(1∕q)k
+
1
ln(1∕q)(k+1)𝜃+1
2�
k+1
⌈2𝜃∕ln(1∕q)⌉
(1∕q)t
t𝜃
dt
�k+1
⌈2𝜃∕ln(1∕q)⌉
(1∕q)
t
t𝜃dt
≤
2(1∕q)
k+1
(k+1)
−𝜃
ln(1∕q)
.
⌈
2
𝜃
∕ln(1∕q)
⌉
�
i=1
q−ii−𝜃
≤⌈
2
𝜃
∕ln(1∕q)
⌉
�
i=1
q−i=1
q
(1∕q)⌈2𝜃∕ln(1∕q)⌉−1
1∕q−1
≤
1
q
(1∕q)2𝜃∕ln(1∕q)+1−1
1∕q−1
.
⌈2𝜃∕ln(1∕q)⌉
�
i=1
q−ii−𝜃
≤
1
q
q−1exp(2𝜃)−1
1∕q−1=q−1exp(2𝜃)−1
1−q
.
h
k+1
≤
qk
h1+𝙱
2𝜃∕ln(1∕q)
i=1
q−i+2𝙱(1∕q)k+1(k+1)−𝜃
ln(1∕q)
≤qkh1+𝙱q−1exp(2𝜃)−1
1−q+2𝙱(1∕q)k+1(k+1)−𝜃
ln(1∕q)
=qk
h1+𝙱q−1exp(2𝜃)−1
1−q
+2𝙱∕q
ln(1∕q)(k+1)−𝜃.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
503
1 3
Stochastic relaxed inertial forward‑backward‑forward…
Consider the function
𝜓(t)≜t𝜃qt
for
t>0
. Then, straightforward calculus shows
that
𝜓(t)
is unimodal on
(0, ∞)
, with unique maximum
t
∗=
𝜃
ln(1∕q)
and associated
value
𝜓
(t∗)=exp(−𝜃)
(
𝜃
ln(1∕q))𝜃
. Hence, for all
t>0
, we have
t
𝜃qt
≤
exp(−𝜃)
(
𝜃
ln(1∕q))𝜃
, and consequently,
q
k
≤
exp(−𝜃)
(
𝜃
ln(1∕q))𝜃
k−
𝜃
for all
k≥1
.
This allows us to conclude
where
Then, for any
k≥
K
𝜖≜⌈
(𝚌
q,𝜃
∕𝜖)
1∕𝜃⌉
, we are ensured that
𝔼
(
‖
‖
X
k+1
−x
‖
‖
2
)
≤𝜀
. Since
(
𝚌
q,𝜃
)
1∕𝜃
=O(exp(−1)𝜃
)
, we conclude that
K𝜖=
O
(𝜃𝜖−1∕𝜃)
. The corresponding ora-
cle complexity is bounded as follows:
◻
Remark 6 It may be observed that if the
𝜃=1
or
mk=k
, there is a worsening of
the rate and complexity statements from their counterparts when the sampling rate
is geometric; in particular, the iteration complexity worsens from
O
(ln(
1
𝜖))
to
O
(
1
𝜖)
while the oracle complexity degrades from the optimal level of
O
(
1
𝜖)
to
O
(
1
𝜖
2
)
. But
this deterioration comes with the advantage that the sampling rate is far slower and
this may be of significant consequence in some applications.
𝔼
(‖
‖Xk+1−x‖
‖
2)
≤
qk
(
2(1−
𝛼
1)
1−𝛼 ‖
‖X1−x‖
‖
2+2
1−𝛼
q
−1
exp(2𝜃)−1
1−q
)
+4𝙱
(1−𝛼 )qln(1∕q)
k−𝜃
𝔼
(‖
‖Xk+1−x‖
‖
2)
≤
exp(−𝜃)
(
𝜃
ln(1∕q)
)𝜃
k−𝜃
(2(1−𝛼1)
1−𝛼 ‖
‖X1−x‖
‖
2+2
1−𝛼
q−1exp(2𝜃)−1
1−q
)
+4𝙱
(1−𝛼 )qln(1∕q)
k−𝜃=𝚌q,𝜃k−𝜃,
(64)
𝚌
q,𝜃≜exp(−𝜃)
(
𝜃
ln(1∕q)
)𝜃(
2(1−𝛼1)
1−𝛼 ‖
‖X1−x‖
‖
2+2
1−𝛼
q−1exp(2𝜃)−1
1−q
)
+4𝙱
(1−
𝛼
)qln(1∕q)
2
K
𝜖
i=1
mi
≤
2
K
𝜖
i=1
i𝜃
≤
2�K𝜖+1
1
t𝜃dt
≤
2
1+𝜃
𝚌q,𝜃
𝜖
1∕𝜃+1
1+𝜃
=O(exp(𝜃)𝜃𝜃(1∕𝜖)1+1∕𝜃)
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
504
S.Cui et al.
1 3
4.3 Rates interms ofmerit functions
In this subsection we estimate the iteration and oracle complexity of RISFBF with
the help of a suitably defined gap function. Generally, a gap function associated with
the monotone inclusion problem (MI) is a function
𝖦𝖺𝗉 ∶𝖧→ℝ
such that (i)
𝖦𝖺𝗉
is
sign restricted on
𝖧
; and (ii)
𝖦𝖺𝗉(x)=0
if and only if
x∈𝖲
. The Fitzpatrick function
[3, 30, 31, 77] is a useful tool to construct gap functions associated with a set-valued
operator
F∶𝖧→2𝖧
. It is defined as the function
GF∶𝖧×𝖧→[−∞,∞]
given by
This function allows us to recover the operator F, by means of the follow-
ing result (cf. [3, Prop. 20.58]): If
F∶𝖧→2𝖧
is maximally monotone, then
GF(x,x∗)≥⟨x,x∗⟩
for all
(x,x∗)∈𝖧×𝖧
, with equality if and only if
(x,x∗) ∈ gr (F)
.
In particular,
gr (F) = {(x,x∗)∈𝖧×𝖧�GF(x,x∗)≥⟨x,x∗⟩}
. In fact, it can be
shown that the Fitzpatrick function is minimal in the family of convex functions
f∶𝖧×𝖧→(−∞,∞]
such that
f(x,x∗)≥⟨x,x∗⟩
for all
(x,x∗)∈𝖧×𝖧
, with equal-
ity if
(x,x∗) ∈ gr (F)
[77].
Our gap function for the structured monotone operator
F=V+T
is derived from
its Fitzpatrick function by setting
𝖦𝖺𝗉
(x)
≜
G
F
(x,0
)
for
x∈𝖧
. This reads explicitly
as
It immediately follows from the definition that
𝖦𝖺𝗉(x)≥0
for all
x∈𝖧
. It is also
clear, that
x↦ 𝖦𝖺𝗉(x)
is convex and lower semi-continuous and
𝖦𝖺𝗉(x)=0
if and
only if
x∈𝖲=𝖹𝖾𝗋(F)
. Let us give some concrete formulae for the gap function.
Example 4 (Variational Inequalities) We reconsider the problem described in Exam-
ple 2. Let
V∶𝖧→𝖧
be a maximally monotone and L-Lipschitz continuous map,
and
T(x)=
𝖭
𝖢(x)
the normal cone of a given closed convex set
𝖢⊂𝖧
. Then, by
[77, Prop. 3.3], the gap function (66) reduces to the well-known dual gap function,
due to [78],
Example 5 (Convex Optimization) Reconsider the general non-smooth con-
vex optimization problem in Example 1, with primal objective function
𝖧1∋u
↦
f(u)+g(Lu)+h(u)
. Let us introduce the convex-concave function
Define
(65)
G
F
(x,x∗)=⟨x,x∗⟩−inf
(y,y
∗
)∈ gr (F)⟨x−y,x∗−y∗⟩.
(66)
𝖦𝖺𝗉
(x)
≜
sup
(y,y
∗
)∈ gr (F)
⟨
y
∗
,x−y
⟩
=sup
p∈ dom T
sup
p
∗
∈T(p)
⟨
V(p)+p
∗
,x−p
⟩
∀x∈𝖧
.
𝖦𝖺𝗉(
x
)=
sup
p∈𝖢
⟨
V
(
p
)
,x
−
p
⟩.
L(u,v)≜f(u)+h(u)−g∗(v)+⟨Lu,v⟩∀(u,v)∈𝖧1×𝖧2.
(67)
Γ(
x
�
)
≜
sup
u∈𝖧1,v∈𝖧2
(
L(u
�
,v)−L(u,v
�
)
)
∀x
�
=(u
�
,v
�
)∈𝖧=𝖧1×𝖧2
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
505
1 3
Stochastic relaxed inertial forward‑backward‑forward…
It is easy to check that
Γ(x�)≥0
, and equality holds only for a primal-dual pair
(saddle-point)
x∈𝖲
. Hence,
Γ(⋅)
is a gap function for the monotone inclusion
derived from the Karush-Kuhn-Tucker conditions (5). In fact, the function (67) is a
standard merit function for saddle-point problems (see e.g. [79]). To relate this gap
function to the Fitzpatrick function, we exploit the maximally monotone operators
V and T introduced Example 1. In terms of these mappings, first observe that for
p=(u,v),x=(u,v)
we have
Since h is convex differentiable, the classical gradient inequality reads as
h(u)−h(u)≥⟨∇h(u),u−u⟩
. Using this estimate in the previous display shows
For
p∗=(u∗
,v∗)∈T(p)
, we again employ convexity to get
Hence,
Therefore, we see
Hence,
It is clear from the definition that a convex gap function can be extended-valued
and its domain is contingent on the boundedness properties of
dom T
. In the setting
where T(x) is bounded for all
x∈𝖧
, the gap function is clearly globally defined.
However, the case where
dom T
is unbounded has to be handled with more care.
There are potentially two approaches to cope with such a situation: One would be to
introduce a perturbation-based termination criterion as defined in [80], and recently
used in [81] to solve a class of structured stochastic variational inequality problems.
The other solution strategy is based on the notion of restricted merit functions, first
introduced in [82], and later on adopted in [83]. We follow the latter strategy.
Let
xs∈ dom T
denote an arbitrary reference point and
D>0
a suitable constant.
Define the closed set
𝖢≜dom T∩{x∈𝖧�‖x−x
s
‖≤D}
, and the restricted gap
function
⟨V(p),x−p⟩=⟨∇h(u),u−u⟩+⟨v,Lu⟩−⟨Lu,v⟩
⟨V(p),x−p⟩≤h(u)−h(u)−⟨Lu,v⟩+⟨v,Lu⟩.
f(u)
≥
f(
u)+
⟨
u
∗
,u−
u
⟩
∀u∈H1,
g
∗
(v)≥g
∗
(v)+⟨v
∗
,v−v⟩∀v∈H2.
⟨u∗
,u−u⟩+⟨v∗
,v−v⟩≤(f(u)−f(u))+(g∗(v)−g∗(v)).
⟨V(p)+p∗
,x−p⟩≤(f(u)+h(u)−g∗(v)+⟨v,Lu⟩)−(f(u)+h(u)−g∗(v)+⟨v,Lu⟩)
=L(u,v)−L(u,v).
𝖦𝖺𝗉
(x)= sup
(p,p
∗
)∈ gr (T)
⟨
V(p)+p
∗
,x−p
⟩≤
sup
(u,v)∈𝖧1×𝖧2
(
L
(u,
v)−
L
(
u,v))= Γ(x)
.
(68)
𝖦𝖺𝗉(x�𝖢)≜sup{⟨y∗,x−y⟩�y∈𝖢,y∗∈F(y)}.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
506
S.Cui et al.
1 3
Clearly,
𝖦𝖺𝗉(x|dom T)=𝖦𝖺𝗉(x)
. The following result explains in a precise way the
meaning of the restricted gap function. It extends the variational case in [82, Lemma
1] and [83, Lemma 3] to the general monotone inclusion case.
Lemma 13 Let
𝖢⊂𝖧
be nonempty closed and convex. The function
𝖧∋x↦ 𝖦𝖺𝗉(x|𝖢)
is well-defined and convex on
𝖧
. For any
x∈𝖢
we have
𝖦𝖺𝗉(x|𝖢)≥0
. Moreover, if
x∈𝖢
is a solution to (MI), then
𝖦𝖺𝗉(x|𝖢)=0
. Moreo-
ver, if
𝖦𝖺𝗉(x|𝖢)=0
for some
x∈ dom T
such that
‖x−xs‖<D
, then
x∈𝖲
.
Proof The convexity and non-negativity for
x∈𝖢
of the restricted function is clear.
Since
𝖦𝖺𝗉(x|C)≤𝖦𝖺𝗉(x)
for all
x∈𝖧
, we see
To show the converse implication, suppose
𝖦𝖺𝗉(x|𝖢)=0
for some
x∈𝖢
with
‖x−xs‖<D
. Without loss of generality we can choose
x∈𝖢
in this particular
way, since we may choose the radius of the ball as large as desired. It follows that
⟨y∗,x−y⟩≤0
for all
y∈𝖢,y∗∈F(y)
. Hence,
x∈𝖢
is a Minty solution to the Gen-
eralized Variational inequality with maximally monotone operator
F(x)+ 𝖭𝖢(x)
.
Since F is upper semi-continuous and monotone, Minty solutions coincide with
Stampacchia solutions, implying that there exists
x∗∈F(x)
such that
⟨x∗,y−x⟩≥0
for all
y∈𝖢
(see e.g. [84]). Consider now the gap program
This program is solved at
y=x
, which is a point for which
‖x−xs‖<D
. Hence, the
constraint can be removed, and we conclude
⟨x∗,y−x⟩≥0
for all
y∈ dom (F)
. By
monotonicity of F, it follows
Hence,
𝖦𝖺𝗉(x)=0
and we conclude
x∈𝖲
.
◻
In order to state and prove our complexity results in terms of the proposed merit
function, we start with the first preliminary result.
Lemma 14 Consider the sequence
(Xk)k∈ℕ
generated by RISFBF with the initial con-
dition
X0=X1
. Suppose
𝜆k=𝜆∈(0, 1∕(2L))
for every
k∈ℕ
. Moreover, suppose
(𝛼k)k∈ℕ
is a non-decreasing sequence such that
0<𝛼
k≤𝛼 < 1
,
𝜌
k=3(1−
𝛼
)
2
2(2𝛼2
k
−𝛼
k
+1)(1+L𝜆
)
for every
k∈ℕ
. Define
and for
(p,p∗) ∈ gr (F)
, we define
ΔNk(p,p∗)
as in (21). Then, for all
(p,p∗) ∈ gr (F)
, we have
x∈𝖲 ⇔ 𝖦𝖺𝗉(x)=0⇒ 𝖦𝖺𝗉(x|𝖢)=0.
g𝖢(x,x∗)≜inf{⟨x∗,y−x⟩�y∈𝖢}.
⟨y∗,y−x⟩≥⟨x∗,y−x⟩≥0∀(y,y∗) ∈ gr (F).
(69)
Δ
Mk
≜
3𝜌k𝜆
2
k
1+
L
𝜆k
‖
‖
𝚎k
‖
‖
2+
𝜌k𝜆
2
k
2
‖
‖
Uk
‖
‖
2
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
507
1 3
Stochastic relaxed inertial forward‑backward‑forward…
Proof For
(p,p∗) ∈ gr (V+T)
, we know from eq. (23)
where the last inequality uses the monotonicity of V. We first derive a recursion
which is similar to the fundamental recursion in Lemma4. Invoking (25) and (26),
we get
Multiplying both sides of (28) and noting that
(
1−2L
𝜆k
)(1+L
𝜆k
)
≤
1−2L
2𝜆2
k
, we
obtain the following inequality
Inserting the above inequality to (71) and using the same fashion in deriving (33),
we arrive at
Invoking the monotonicity of V and rearranging (72), it follows that
(70)
K
k=1
2𝜌k𝜆
p∗,Yk−p
≤
(1−𝛼1)
X1−p
2+
K
k=1
ΔMk+
K
k=1
ΔNk(p,0)
.
⟨
Zk−Rk,Yk−p
⟩≥𝜆
k
⟨
Wk+p
∗
,Yk−p
⟩
+
𝜆
k
⟨
V(Yk)−V(p),Yk−p
⟩
≥⟨p
∗
,Yk−p⟩+𝜆k⟨Wk,Yk−p⟩,
(71)
�
�
Xk+1−p�
�
2
≤
�
�Zk−p�
�
2−
1−𝜌
k
𝜌k
�
�Xk+1−Zk�
�
2+2𝜆2𝜌k�
�𝚎k�
�
2
−2𝜌k𝜆k⟨Wk+p∗,Yk−p⟩
−𝜌k(1−2L2𝜆2
k)
�
�
Yk−Zk
�
�
2+
𝜌k𝜆2
k
2
�
�
Uk
�
�
2+2𝜌k𝜆k
⟨
V(Yk)−V(p),p−Yk
⟩.
−
𝜌k(1−2L2𝜆2
k)
‖
‖
Yk−Zk
‖
‖
2
≤
−
1−2L𝜆k
2𝜌k(1+L𝜆k)
‖
‖
Xk+1−Zk
‖
‖
2+
𝜌k𝜆
2
k(1−2L𝜆k)
1+L𝜆k
‖
‖
𝚎k
‖
‖
2
.
(72)
Xk+1−p
2≤
(1+𝛼k)
Xk−p
2
−𝛼k
Xk−1−p
2
+ΔMk
+ΔNk(p,p∗)−2𝜌k𝜆kV(Yk)−V(p),Yk−p
+𝛼k
Xk−Xk−1
22𝛼k+3(1−𝛼k)
2𝜌k(1+L𝜆k)
−(1−𝛼k)
3
2𝜌k(1+L𝜆k)
−1
Xk+1−Xk
2
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
508
S.Cui et al.
1 3
We define
𝛽k+1
as
and similarly with (36), we can show
{𝛽k}
is non-increasing by choosing
𝜌
k=3(1−𝛼 )
2
2(2𝛼2
k
−𝛼
k
+1)(1+L𝜆
k)
and
𝜆k≡𝜆
. Thus,
(
1−𝛼k+1)
(
3
2𝜌k
+
1(1+L𝜆k
+
1)−1
)
≤(1−𝛼k)
(
3
2𝜌k(1+L𝜆k)−1
)
.
Together with
𝛼k+1≥𝛼k
, the last inequality gives
Recall that
ΔNk(p,p∗)=ΔNk(p,0)+2𝜌k𝜆⟨p∗,p−Yk⟩
. Hence, after setting
ΔNk(p,0)=ΔNk(p)
, rearranging the expression given in the previous display shows
that
Summing over
k=1, …,K
, we obtain
Xk+1−p2+(1−k)
3
2k(1+Lk)−1
Xk+1−Xk2−kXk−p
2
≤Xk−p2+(1−k)3
2k(1+Lk)−1Xk−Xk−12
−kXk−1−p2+ΔMk+ΔNk(p,p∗)
+22
k+(1−k)1−3(1−k)
2k(1+Lk)
≤0
Xk−Xk−12
≤
Xk−p
2+(1−k)3
2k(1+Lk)−1
Xk−Xk−1
2
−kXk−1−p
2
+ΔMk+ΔNk(p,p
∗
).
𝛽
k+1
≜
(1−𝛼k)
(
3
2
𝜌k
(1+L
𝜆k
)−1
)
−(1−𝛼k+1)
(
3
2
𝜌k+1
(1+L
𝜆k+1
)−1
),
Xk+1−p2+(1−𝛼k+1)
3
2𝜌k+1(1+L𝜆)−1
Xk+1−Xk2−𝛼k+1Xk−p
2
≤
Xk−p
2+(1−𝛼k)3
2𝜌k(1+L𝜆)−1
Xk−Xk−1
2
−𝛼kXk−1−p
2
+ΔMk+ΔNk(p,p
∗
).
2
𝜌k𝜆p∗,Yk−p
≤
Xk−p2+(1−𝛼k)
3
2𝜌k(1+L𝜆)−1
Xk
−X2
k−1−𝛼kXk−1−p2
−Xk+1−p2+(1−𝛼k+1)3
2𝜌k+1(1+L𝜆)−1Xk+
1
−X2
k
−𝛼
k+1
X
k
−p2
+ΔM
k
+ΔN
k
(p)
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
509
1 3
Stochastic relaxed inertial forward‑backward‑forward…
where we notice
X1=X0
in the last inequality.
◻
Next, we derive a rate statement in terms of the gap function, using the averaged
sequence
Theorem15 (Rate and oracle complexity under monotonicity of V) Consider the
sequence
(Xk)k∈ℕ
generated RISFBF. Suppose Assumptions 1-5 hold. Suppose
mk≜⌊k
a
⌋
and
𝜆k=𝜆∈(0, 1∕(2L))
for every
k∈ℕ
where
a>1
. Suppose
(𝛼k)k∈ℕ
is
a non-decreasing sequence such that
0<𝛼
k≤𝛼 < 1
,
𝜌
k=3(1−
𝛼
)
2
2(2𝛼2
k
−𝛼
k
+1)(1+L𝜆
)
for every
k∈ℕ
. Then the following hold for any
K∈ℕ
:
(i)
𝔼
[𝖦𝖺𝗉(
XK
|
𝖢)]
≤
O
(
1
K).
(ii) Given
𝜀>0
, define
K𝜀≜{
k
∈
ℕ
|
𝔼
[
𝖦𝖺𝗉
(
X
k|
𝖢
)] ≤𝜀}
, then
K𝜀
k=1mk
≤
O
1
𝜀
1+a
.
The proof of this Theorem builds on an idea which is frequently used in the anal-
ysis of stochastic approximation algorithms, and can at least be traced back to the
robust stochastic approximation approach of [49]. In order to bound the expectation
of the gap function, we construct an auxiliary process which allows us to majorize
the gap via a quantity which is independent of the reference points. Once this is
achieved, a simple variance bound completes the result.
Proof of Theorem15 We define an auxiliary process
(Ψk)k∈ℕ
such that
K
k
=1
2𝜌k𝜆p∗,Yk−p
≤
K
k=1
Xk−p2+(1−𝛼k)
3
2𝜌k(1+L𝜆)−1
Xk
−Xk−12−𝛼kXk−1−p2
−Xk+1−p2+(1−𝛼k+1)3
2𝜌k+1(1+L𝜆)−1Xk+1−Xk2−𝛼k+1Xk−p2
+
K
k=1
ΔMk+
K
k=1
ΔNk(p)
≤X1−p2+(1−𝛼1)3
2𝜌1(1+L𝜆)−1X1−X02−𝛼1X0−p2
+
K
k=1
ΔMk+
K
k=1
ΔNk(p)
=(1−𝛼1)
X1−p
2+
K
k=1
ΔMk+
K
k=1
ΔNk(p),
(73)
XK
≜∑K
k=1𝜌kYk
∑
K
k=1𝜌k
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
510
S.Cui et al.
1 3
Then,
so that
Introducing the iterate
Yk
, the above implies
As
ΔNk(p)=2𝜌k𝜆k⟨Wk,p−Yk⟩
, this implies via a telescopian sum argument
Using Lemma14 and setting
𝜆k≡𝜆
, for any
(p,p∗) ∈ gr (F)
it holds true that
Define
𝚌1≜
(1−𝛼
1
)
‖
‖
X
1
−p
‖
‖
2
, divide both sides by
∑K
k=1
𝜌
k
and using our defini-
tion of an ergodic average (73), this gives
Using the bound established in eq. (75), it follows
Choosing
Ψ1,p∈𝖢
and introducing
𝚌2≜
𝚌
1
+4D
2
, we see that the above can be
bounded by a random quantity which is independent of p:
Taking the supremum over pairs
(p,p∗)
such that
p∈C
and
p∗∈F(y)
, it follows
(74)
Ψk+1≜
Ψ
k
+
𝜌k𝜆k
W
k
,Ψ
1
∈ dom (T)
.
�
�
Ψ
k+1
−p
�
�
2
=
�
�
(Ψ
k
−p)+𝜌
k
𝜆
k
W
k�
�
2
=
�
�
Ψ
k
−p
�
�
2
+𝜌2
k
𝜆2
k�
�
W
k�
�
2
+2𝜌
k
𝜆
k⟨
Ψ
k
−p,W
k⟩,
2
𝜌
k
𝜆
k⟨
W
k
,p−Ψ
k⟩
=
�
�
Ψ
k
−p
�
�
2
−
�
�
Ψ
k+1
−p
�
�
2
+𝜌2
k
𝜆2
k�
�
W
k�
�
2.
2𝜌
k
𝜆
k
⟨W
k
,p−Y
k
⟩=2𝜌
k
𝜆
k
⟨W
k
,p−Ψ
k
⟩+2𝜌
k
𝜆
k
⟨W
k
,Ψ
k
−Y
k
⟩
=
�
�
Ψ
k
−p
�
�
2−
�
�
Ψ
k+1
−p
�
�
2+𝜌2
k
𝜆2
k�
�
W
k�
�
2+2𝜌
k
𝜆
k⟨
W
k
,Ψ
k
−Y
k⟩.
(75)
K
k=1
ΔNk(p)
≤
Ψ1−p
2+
K
k=1
𝜌2
k𝜆2
k
Wk
2+
K
k=1
2𝜌k𝜆k
Wk,Ψk−Yk
.
K
k=1
2𝜌k𝜆
p∗,Yk−p
≤
(1−𝛼1)
X1−p
2+
K
k=1
ΔMk+
K
k=1
ΔNk(p)
.
2
𝜆
p∗,
XK−p
≤
1
K
k=1
𝜌
k
𝚌1+
K
k=1
ΔMk+
K
k=1
ΔNk(p)
.
2
𝜆
p∗,
XK−p
≤
1
K
k=1𝜌k
𝚌1+
K
k=1
ΔMk+
Ψ1−p
2
+
K
k=1
𝜌2
k𝜆2
Wk
2+
K
k=1
2𝜌k𝜆
Wk,Ψk−Yk
.
2
𝜆
p∗,
XK−p
≤
1
K
k=1
𝜌
k
𝚌2+
K
k=1
ΔMk+
K
k=1
𝜌2
k𝜆2
Wk
2+
K
k=1
2𝜌k𝜆k
Wk,Ψk−Yk
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
511
1 3
Stochastic relaxed inertial forward‑backward‑forward…
In order to proceed, we bound the first moment of the process
ΔMk
in the same way
as in (34), in order to get
Next, we take expectations on both sides of inequality (76), and use the bound (17),
and
𝔼
[
W
k
,Ψ
k
−Y
k
]=𝔼
𝔼
W
k
,Ψ
k
−Y
k
F
k
=
0.
This yields
Since
𝛼k↑𝛼 ∈(0, 1)
, we know that
𝜌
k
≥
𝜌
≜
3(1−
𝛼 2
)
2(1+L𝜆)(2𝛼
2
+1)
. Similarly, since
2𝛼2
k
−
𝛼k
+1
≥
7∕
8
for all k, it follows
𝜌
k
≤
𝜌
≜
12(1−𝛼 )
2
7
. Using this upper and lower
bound on the relaxation sequence, we also see that
𝚊
k
≤
𝜆2
(
12 𝜌
1+L𝜆+𝜌
2
)≡
𝚊
, so that
where
𝚌
3
≜𝚌
2
𝜌
+
1
𝜌
𝚊𝜎2+𝜌 𝜆 2𝜎2
∞
k=1
1
m
k
. Hence, defining the deterministic stop-
ping time
K𝜀
={k∈ℕ
|
𝔼[𝖦𝖺𝗉(
X
k|
𝖢)]
≤𝜀}
, we see
K𝜀≥𝚌
3
2𝜆𝜀
=
𝚌
4
𝜀
.
(ii). Suppose
mk=⌊ka⌋
, for
a>1
. Then the oracle complexity to compute an
X
K
such that
𝔼[
𝖦𝖺𝗉
(
X
k|
𝖢
)] ≤𝜖
is bounded as
◻
Remark 7 In the prior result, we employ a sampling rate
mk=⌊ka⌋
where
a>1
.
This achieves the optimal rate of convergence. In contrast, the authors in [32]
employ a sampling rate, loosely given by
mk
=
⌊
k
1+a
(ln(k))
1+b⌋
where
a>0, b≥−1
or
a=0, b>0
. We observe that when
a>0
and
b≥−1
, the mini-batch size grows
faster than our proposed
mk
while it is comparable in the other case.
(76)
2
𝜆𝖦𝖺𝗉(
XK𝖢)
≤
𝚌
2
K
k=1𝜌k
+
K
k=1ΔMk+
K
k=1𝜌2
k𝜆2
Wk
2+
K
k=12𝜌k𝜆k
Wk,Ψk−Yk
K
k=1
𝜌
k
𝔼
[ΔMkFk]
≤
6𝜌k𝜆2
k
1+L𝜆k
𝔼[
Wk
2Fk]+𝜆2
k
6𝜌k
1+L𝜆k
+
𝜌k𝜆2
k
2
𝔼[
Uk
2Fk
]
=12𝜌k𝜆2
k
1+L𝜆k
𝜎2+𝜌k𝜆2
k
2𝜎2
m
k
≜𝚊k𝜎2
m
k
.
2
𝜆𝔼
𝖦𝖺𝗉(
XK
𝖢)
≤
𝚌2
K
k=1
𝜌
k
+1
K
k=1
𝜌
kK
k=1
𝚊k𝜎2
mk
+
K
k=1
𝜌2
k𝜆2𝜎2
mk
.
2
𝜆𝔼
[
𝖦𝖺𝗉(
XK
|
𝖢)
]≤
𝚌2
𝜌 K+1
𝜌 K
(
𝚊𝜎2+𝜌 2𝜆2𝜎2
)K
∑
k=1
1
m
k
≤
𝚌
3
K
K
k=1
mk
≤(
𝚌4
∕𝜀)
k=1
mk
≤(
𝚌4
∕𝜀)
k=1
ka
≤
�(𝚌4∕𝜀)+1
k=1
xadx
≤
((𝚌4∕𝜀)+1)a+1
a+1
≤
𝚌4
𝜀a+1
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
512
S.Cui et al.
1 3
5 Applications
In this section, we compare the proposed scheme with its SA counterparts on a class
of monotone two-stage stochastic variational inequality problems (Sec.5.1) and a
supervised learning problem (Sec.5.2) and discuss the resulting performance.
5.1 Two‑stage stochastic variational inequality problems
In this section, we describe some preliminary computational results obtained from
Algorithm 1 when applied to a class of two-stage stochastic variational inequality
problems, recently introduced by [85].
Consider an imperfectly competitive market with N firms playing a two-stage
game. In the first stage, the firms decide upon their capacity level
xi∈[li,ui]
, antici-
pating the expected revenues to be obtained in the second stage in which they com-
pete by choosing quantities à la Cournot. The second-stage market is characterized
by uncertainty as the per-unit cost
hi(𝜉i)
is realized on the spot and cannot be antici-
pated. To compute an equilibrium in this game, we assume that each player is able
to take stochastic recourse by determining production levels
yi(𝜉)
, contingent on ran-
dom convex costs and capacity levels
xi
. In order to bring this into the terminology
for our problem, let use define the feasible set for capacity decisions of firm i as
Xi≜
[l
i
,u
i
]⊂ℝ
+
. The joint profile of capacity decisions is denoted by an N-tuple
x
=(x
1
,…,x
N
)∈X
≜∏N
i=1
X
i
=X . The capacity choice of player i is then deter-
mined as a solution to the parametrized problem (Play
i(x−i)
)
where
ci∶Xi→ℝ+
is a
Lc
i
-smooth and convex cost function and
p(⋅)
denotes the
inverse-demand function defined as
p(X)=d−rX
,
d,r>0
. The function
Qi(
⋅
,𝜉)
denotes the optimal cost function of firm i in scenario
𝜉∈Ξ
, assuming a value
Qi(xi,𝜉)
when the capacity level
xi
is chosen. The recourse function
𝔼𝜉[Qi(
⋅,
𝜉)]
denotes the expectation of the optimal value of the player i’s second stage problem
and is defined as
A Nash equilibrium of this game is given by a tuple
(x∗
1,⋯,x∗
N)
where
x∗
isolves (Playi(x∗
−i))
for each
i=1, 2, …,N
. A simple computation shows that
Qi(xi,𝜉)=min{0, hi(𝜉)xi}
, and hence it is nonsmooth. In order to obtain a smoothed
variant, we introduce
Q𝜖
i
(⋅,
𝜉i)
, defined as
This is the value function of a quadratic program, requiring the maximization of
an
𝜖
-strongly concave function. Hence,
Q𝜖
i(xi,𝜉)
is single-valued and
∇
x
iQ𝜖
i(⋅,𝜉)
is
min
xi
∈X
i
ci(xi)−
(
p(X)xi−𝔼𝜉[Qi(xi,𝜉)]
)
,(Playi(x−i
))
Q
i(xi,
𝜉
)
≜
min{hi(
𝜉
)yi(
𝜉
)
|
yi(
𝜉
)∈[0, xi]}
=max{
𝜋i
(
𝜉
)x
i|𝜋i
(
𝜉
)
≤
0, h
i
(
𝜉
)−
𝜋i
(
𝜉
)
≥
0}.(Rec
i
(x
−i))
Q𝜖
i
(xi,𝜉)
≜
max{xi𝜋i(𝜉)−
𝜖
2
(𝜋i(𝜉))
2|
𝜋i(𝜉)
≤
0, 𝜋i(𝜉)
≤
hi(𝜉)},𝜖>
0.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
513
1 3
Stochastic relaxed inertial forward‑backward‑forward…
1
𝜖
-Lipschitz and
𝜖
-strongly monotone [86, Prop.12.60] for all
𝜉∈Ξ
. The latter is
explicitly given by
Employing this smoothing strategy in our two-stage noncooperative game yields the
individual decision problem
The necessary and sufficient equilibrium conditions of this
𝜖
-smoothed game can be
compactly represented as
and C, R, and
D𝜖
are single-valued maps given by
We note that the interchange between the expectation and the gradient operator can
be invoked based on smoothness requirements (cf. [87, Th. 7.47]). The problem
(SGE
𝜖
) aligns perfectly with the structured inclusion (MI), in which T is a maxi-
mal monotone map and V is an expectation-valued maximally monotone map. In
addition, we can quantify the Lipschitz constant of V as
LV=LC+LR+L𝜖
D
,where
LC
=max
1≤i≤N
L
c
i
,
LR
=r
‖
‖
Id + 𝟏𝟏
⊤‖
‖2
=r(N+1
)
and L
𝜖
D
=
1
𝜖
. Here,
Id
is the
N×N
identity matrix, and
𝟏
is the
N×1
vector consisting only of ones.
Problem parameters for 2-stage SVI. Our numerics are based on specifying
N=10
,
r=0.1
, and
d=1
. We consider four problem settings of
LV
ranging from
10,
⋯
, 104
(See Table1). For each setting, the problem parameters are defined as
follows.
(i) Specification of
hi(𝜉)
. The cost parameters
hi(𝜉i)≜𝜉i
where
𝜉i∼𝚄𝚗𝚒𝚏𝚘𝚛𝚖[−5, 0]
and
i=1, ⋯,N
.
(ii) Specification of
LV,LR,
L𝜖
D
,
LC
, and
b1
. Since
‖
‖
Id + 𝟏𝟏
⊤‖
‖2
=
11
when
N=10
,
LR
=r
‖
‖
Id + 𝟏𝟏
⊤‖
‖
=
1.1
. Let
𝜖
be defined as
𝜖
=
10
LV
and
L
𝜖
D
=
1
𝜖
=
L
V
10
. It fol-
lows that
LC=LV−LR−L𝜖
D
and
b1=LC
.
(iii) Specification of
ci(xi)
. The cost function
ci
is defined as
c
i(xi)=
1
2
bix2
i
+aix
i
where
a1,…,aN∼𝚄𝚗𝚒𝚏𝚘𝚛𝚖[2, 3]
and
b2,
⋯
,
bN∼
𝚄𝚗𝚒𝚏𝚘𝚛𝚖
[0,
b1].
Further,
a≜[a1,…,aN]⊤∈
ℝ
N
and
B≜diag (
b1,…,
bN)
is a diagonal matrix with
nonnegative elements.
Algorithm specifications We compare Algorithm 1 (RISFBF) with a stochas-
tic forward-backward (SFB) scheme and a stochastic forward-backward-forward
∇
x
i
Q
𝜖
i
(xi,𝜉)
≜
argmax{xi𝜋i(𝜉)−
𝜖
2
(𝜋i(𝜉))
2|
𝜋i(𝜉)
≤
0, 𝜋i(𝜉)
≤
hi(𝜉)}
.
(∀i∈{1, …,N}) ∶ min
xi∈
X
i
c
i
(x
i
)−p(X)x
i
+𝔼
𝜉
[Q𝜖
i
(x
i
,𝜉)].(Play𝜖
i
(x
−i
))
0
∈F
𝜖
(x)
≜
V
𝜖
(x)+T(x), where
V
𝜖
(x)=C(x)+R(x)+D
𝜖
(x), and T(x)= 𝖭X(x),(SGE
𝜖
)
C
(x)≜
⎛
⎜
⎜
⎝
c�
1(x1)
⋮
c�
N
(xN)
⎞
⎟
⎟
⎠
,R(x)≜r(X1+x)−d, and D𝜖(x)≜
⎛
⎜
⎜
⎝
𝔼𝜉[∇x1Q
𝜖
1(x1,𝜉)]
⋮
𝔼
𝜉
[∇x
N
Q𝜖
N
(xN,𝜉)]
⎞
⎟
⎟
⎠
.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
514
S.Cui et al.
1 3
(SFBF) scheme. Solution quality is compared by estimating the residual function
𝗋𝖾𝗌(x)=‖x−Π
X(x−𝜆V𝜖(x))‖
. All of the schemes were implemented in MATLAB
on a PC with 16GB RAM and 6-Core Intel Core i7 processor (2.6GHz).
(i) (SFB): The (SFB) scheme is defined as the recursion
where
V
𝜖(X
k
)=𝔼
𝜉
[
V𝜖(X
k
,𝜉
)]
and
𝜆
k=
1
√k
. The operator
ΠX[⋅]
means the orthogo-
nal projection onto the set
X
. Note that
x0
is randomly generated in
[0, 1]N
.
(ii) (SFBF): The Variance-reduced stochastic modified forward-backward scheme
we employ is defined by the updates
where
A
k(Xk)=
1
mk
∑m
k
t=1
V𝜖(Xk,𝜉k
)
,
B
k(Yk)=
1
mk
∑m
k
t=1
V𝜖(Yk,𝜂k
)
. We choose a con-
stant
𝜆
k
≡
𝜆=
1
4LV
. We assume
mk=⌊k1.01⌋
for merely monotone problems and
mk=⌊1.01k⌋
for strongly monotone problems.
(iii) (RISFBF): In the implementation of Algorithm 1 we choose a constant
steplength
𝜆
k
≡
𝜆=
1
4LV
. In merely monotone settings, we utilize an increasing
sequence
𝛼
k=𝛼0(1−
1
k+1)
, where
𝛼0=0.1
, the relaxation parameter sequence
𝜌k
defined as
𝜌
k=3(1−𝛼0)
2
2(2𝛼2
k
−𝛼
k
+1)(1+L
V
𝜆
)
, and
mk=⌊k1.01⌋
. In strongly monotone regimes,
we choose a constant inertial parameter
𝛼k≡𝛼=0.1
, a constant relaxation parame-
ter
𝜌k≡𝜌=1
, and
mk=⌊1.01k⌋
.
(SFB)
Xk+1∶= ΠX
[
Xk−𝜆k
V𝜖(Xk,𝜉k)
],
(SFBF)
{
Yk=Π
X[Xk−𝜆kAk(Xk)],
X
k+1
=Y
k
−𝜆
k
(B
k
(Y
k
)−A
k
(X
k
))
.
Fig. 1 Trajectories for (SFB), (SFBF), and (RISFBF) (left: monotone, right: s-monotone)
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
515
1 3
Stochastic relaxed inertial forward‑backward‑forward…
In Fig.1, we compare the three schemes under maximal monotonicity and strong
monotonicity, respectively and examine their sensitivities to inertial and relaxation
parameters. Both sets of plots are based on selecting
LV=102
.
Key insights Several insights may be drawn from Table1 and Figure1.
(a) First, from Table1, one may conclude that on this class of problems, (RISFBF)
and (SFBF) significantly outperform (SFB) schemes, which is less surprising
given that both schemes employ an increasing mini-batch sizes, leading to per-
formance akin to that seen in deterministic schemes. We should note that when
X
is somewhat more complicated, the difference in run-times between SA schemes
and mini-batch variants becomes more pronounced; in this instance, the set
X
is
relatively simple to project onto and there is little difference in run-time across
the three schemes.
(b) Second, we observe that while both (SFBF) and (RISFBF) schemes can con-
tend with poorly conditioned problems, as seen by noting that as
LV
grows,
their performance does not degenerate significantly in terms of empirical error;
However, in both monotone and strongly monotone regimes, (RISFBF) provides
consistently better solutions in terms of empirical error over (SFBF). Figure1
displays the range of trajectories obtained for differing relaxation and inertial
parameters and in the instances considered, (RISFBF) shows consistent benefits
over (SFBF).
(c) Third, since such schemes display geometric rates of convergence for strongly
monotone inclusion problems, this improvement is reflected in terms of the
empirical errors for strongly monotone vs monotone regimes.
Table 1 Comparison of (RISFBF) with (SFB) and (SFBF) under various Lipschitz constant
merely monotone, 20000 evaluations
LV
RISFBF SFBF SFB
error time CI error time CI error time CI
1e1 2.2e-4 2.7 [2.0e-4,2.5e-4] 1.6e-3 2.6 [1.3e-3,1.8e-3] 5.3e-2 2.7 [5.0e-2,5.7e-2]
1e2 2.7e-4 2.7 [2.5e-4,3.0e-4] 1.9e-3 2.6 [1.6e-3,2.1e-3] 6.1e-2 2.7 [5.8e-2,6.4e-2]
1e3 6.9e-4 2.7 [6.7e-3,7.1e-4] 2.2e-3 2.6 [2.0e-3,2.5e-3] 7.6e-2 2.5 [7.3e-2,7.9e-2]
1e4 2.7e-3 2.7 [2.5e-3,3.0e-3] 5.9e-3 2.6 [5.4e-3,6.2e-3] 9.4e-2 2.6 [9.0e-1,9.7e-1]
strongly monotone, 20000 evaluations
LV
RISFBF SFBF SFB
error time CI error time CI error time CI
1e1 1.5e-6 2.6 [1.3e-6,1.7e-6] 1.5e-5 2.6 [1.2e-5,1.7e-5] 2.9e-2 2.5 [2.7e-2,3.1e-2]
1e2 3.7e-6 2.6 [3.5e-6,3.9e-6] 3.6e-5 2.5 [3.3e-5,3.9e-5] 4.1e-2 2.5 [3.8e-2,4.4e-2]
1e3 4.5e-6 2.6 [4.3e-6,4.7e-6] 5.6e-5 2.5 [4.2e-6,4.7e-6] 5.5e-2 2.4 [5.2e-2,5.7e-2]
1e4 1.4e-5 2.6 [1.1e-5,1.7e-5] 7.4e-5 2.5 [7.1e-5,7.7e-5] 6.0e-2 2.5 [5.7e-2,6.3e-2]
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
516
S.Cui et al.
1 3
5.2 Supervised learning withgroup variable selection
Our second numerical example considers the following population risk formula-
tion of a composite absolute penalty (CAP) problem arising in supervised statistical
learning [7]
where the feasible set
W⊆
ℝ
d
is a Euclidean ball with
W≜
{w∈ℝ
d
∣
‖
w
‖2≤
D
}
,
𝜉=(a,b)∈
ℝ
d×ℝ
denotes the random variable consisting of a set of predictors a
and output b. The parameter vector w is the sparse linear hypothesis to be learned.
The sparsity structure of w is represented by group
S∈2{1,…,l}
. When the groups in
S
do not overlap,
∑g∈
S
‖w
g
‖2
is referred to as the group lasso penalty [6, 88]. When
the groups in
S
form a partition of the set of predictors, then
∑g∈
S
‖w
g
‖2
is a norm
afflicted by singularities when some components
wg
are equal to zero. For any
g∈{1, ⋯,l}
,
wg
is a sparse vector constructed by components of x whose indices
are in
g
, i.e.,
wg∶= (wi)i∈g
with few non-zero components in
wg
. Here, we assume
that each group
g∈S
consists of k elements. Introduce the linear operator
L
∶ℝ
d
→ℝ
k
×⋯×
l−times
ℝ
k
, given by
Lw =[𝜂w
g
1,…,𝜂w
g
l]
. Let us also define
where
𝛿W(⋅)
denotes the indicator function with respect to the set
W
. Then (CAP)
becomes
This is clearly seen to be a special instance of the convex programming problem (2).
Specifically, we let
𝖧1=
ℝ
d
with the standard Euclidean norm, and
𝖧
2=ℝ
k
×⋯×
l−times
ℝ
k
with the product norm
Since
the Fenchel-dual takes the form (3). Accordingly, a primal-dual pair for (CAP) is a
root of the monotone inclusion (MI) with
(CAP)
min
w
∈W
1
2
𝔼(a,b)[(a
⊤
w−b)2]+𝜂
�
g∈S‖
wg
‖
2
,
Q
=𝔼𝜉[aa⊤],q=𝔼𝜉[ab],c=
1
2
𝔼𝜉[b2],
h
(w)≜1
2
w⊤Qw −w⊤q+c, and f(w)≜𝛿W(w)
,
min
w
∈ℝd{h(w)+g(Lw)+f(w)}, where g(y1,…,yl)
≜
l
∑
i=1‖
‖
yi
‖
‖.
‖
‖
(y1,…,yl)
‖
‖
𝖧2
≜
l
∑
i=1‖
‖
yi
‖
‖
2
.
g
∗(v1,…,vl)=
l
∑
i=1
𝔹(0,1)(vi)∀v=(v1,…,vl)∈ℝk×⋯×
l−times
ℝk
,
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
517
1 3
Stochastic relaxed inertial forward‑backward‑forward…
involving
d+kl
variables.
Problem parameters for (CAP) We simulated data with
d=82
, covered by 10
groups of 10 variables with 2 variables of overlap between two successive groups:
{1, …, 10},{9, …, 18},…,{73, …, 82}
. We assume the nonzeros of
wtr ue
lie in the
union of groups 4 and 5 and sampled from i.i.d. Gaussian variables. The operator
V(w,v) is estimated by the mini-batch estimator using
mk
iid copies of the random
input-output pair
𝜉=(a,b)∈
ℝ
d×ℝ
. Specifically, we draw each coordinate of
the random vector a from the standard Gaussian distribution
𝙽(0, 1)
and generate
b=a⊤wtr ue +𝜀
, for
𝜀
∼𝙽(0,
𝜎2
𝜀)
. In the concrete experiment reported here, the error
variance is taken as
𝜎𝜀
=0.1
. In all instances, the regularization parameter is chosen
as
𝜂=10−4
. The accuracy of feature extraction of algorithm output w is evaluated
by the relative error to the ground truth, defined as
Algorithm specifications We compare (RISFBF) with stochastic extragradient (SEG)
and stochastic forward-backward-forward (SFBF) schemes and specify their algo-
rithm parameters. Again, all the schemes are run on MATLAB 2018b on a PC with
16GB RAM and 6-Core Intel Core i7 processor (2.6
×
8GHz).
(i) (SEG): Set
X≜W× dom (g∗)
. The (SEG) scheme [32] utilizes the updates
where Ak(Xk)=
1
m
k
∑m
k
t=1V(Xk,𝜉k
)
,
B
k(Yk)=
1
m
k
∑m
k
t=1V(Yk,𝜂k
)
. In this
scheme,
𝜆k≡𝜆
is chosen to be
1
4
L
V
(
LV
is the Lipschitz constant of V). We
assume
m
k=
⌊
k1.1
n
⌋
.
(ii) (SFBF): We employ the algorithm parameters employed in (i). Specifically,
we choose a constant
𝜆
k
≡
𝜆=
1
4
L
V
and
m
k=
⌊
k1.1
n
⌋
.
V(w,v)=(∇h(w)+L∗v,−Lw)and T(w,v)≜𝜕f(w)×𝜕g∗(v)
‖w−w
tr ue
‖
2
‖
w
tr ue ‖2
.
(SEG)
Yk∶= ΠX
[
Xk−𝜆kAk(Xk)
],
Xk+1
∶= Π
X[
X
k
−𝜆
k
B
k
(Y
k
)
],
Table 2 The comparison of
the RISFBF, SFBF and SEG
algorithms in solving (CAP)
The relative error and CPU time in the table is the average results of
20 runs
Iteration RISFBF SFBF SEG
NRel. error CPU Rel. error CPU Rel. error CPU
v400 5.4e-1 0.1 34.6 0.1 34.7 0.1
v800 8.1e-3 0.5 1.1e-1 0.5 1.5e-1 0.5
1200 6.0e-3 1.1 2.4e-2 1.1 2.4e-2 1.1
1600 5.2e-3 2.0 2.0e-2 2.0 1.9e-2 2.0
2000 4.6e-3 3.1 1.6e-2 3.1 1.5e-2 3.1
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
518
S.Cui et al.
1 3
(iii) (RISFBF): Here, we employ a constant step-length
𝜆
k
≡
𝜆=
1
4LV
, an increasing
sequence
𝛼
k=𝛼0(1−
1
k+1)
, where
𝛼0=0.85
, a relaxation parameter sequence
𝜌
k=3(1−
𝛼
0)
2
2(2
𝛼
2
k
−
𝛼k
+1)(1+L
V𝜆)
, and assume
m
k=
⌊
k1.1
n
⌋
.
Insights We compare the performance of the schemes in Table2 and observe that
(RISFBF) outperforms its competitors others in extracting the underlying feature of
the datasets. In Fig.2, trajectories for (RISFBF), (SFBF) and (SEG) are presented
where a consistent benefit of employing (RISFBF) can be seen for a range of choices
of
𝛼0
.
6 Conclusion
In a general structured monotone inclusion setting in Hilbert spaces, we introduce
a relaxed inertial stochastic algorithm based on Tseng’s forward-backward-forward
splitting method. Motivated by the gaps in convergence claims and rate statements
in both deterministic and stochastic regimes, we develop a variance-reduced frame-
work and make the following contributions: (i) Asymptotic convergence guarantees
are provided under both increasing and constant mini-batch sizes, the latter requir-
ing somewhat stronger assumptions on V; (ii) When V is monotone, rate statements
Fig. 2 Trajectories for (SEG), (SFBF), and (RISFBF) for problem (CAP)
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
519
1 3
Stochastic relaxed inertial forward‑backward‑forward…
provided in terms of a restricted gap function, inspired by the Fitzpatrick function
for inclusions, show that the expected gap of an averaged sequence diminishes at the
rate of
O(1∕k)
and oracle complexity of computing an
𝜖
-solution is
O(1∕𝜖1+a)
where
a>1
; (iii) When V is strongly monotone, a non-asymptotic linear rate statement
can be proven with an oracle complexity of
O(log(1∕𝜖))
of computing an
𝜖
-solution.
In addition, a perturbed linear rate is also developed. It is worth emphasizing that
the rate statements in the strongly monotone regime accommodate the possibility
of a biased stochastic oracle. Unfortunately, the growth rates in batch-size may be
onerous in some situations, motivating the analysis of a polynomial growth rate in
sample-size which is easily modulated. This leads to an associated polynomial rate
of convergence.
Various open questions arise from our analysis. First, we exclusively focused on
a variance reduction technique based on increasing mini-batches. From the point of
view of computations and oracle complexity, this approach can become quite costly.
Exploiting different variance reduction techniques, taking perhaps special structure
of the single-valued operator V into account (as in [57]), has the potential of improv-
ing the computational complexity of our proposed method. At the same time, this
will complicate the analysis of the variance of the stochastic estimators considerably
and consequently, we leave this as an important question for future research.
Second, our analysis needs knowledge about the Lipschitz constant L. While in
deterministic regimes, line search techniques have obviated such a need, such ave-
nues are far more challenging to adopt in stochastic regimes. Efforts to address this
in variational regimes have centered around leveraging empirical process theory
[33]. This remains a goal of future research. Another avenue emerges in applica-
tions where we can gain a reasonably good estimate about this quantity via some
pre-processing of the data (see e.g. Section6 in [62]). Developing such an adaptive
framework robust to noise is an important topic for future research.
Appendix
Appendix AAuxiliary results
Lemma 16 For
x,y∈𝖧
and scalars
𝛼,𝛽≥0
with
𝛼+𝛽=1
, it holds that
We recall the Minkowski inequality: For
X,Y∈Lp(Ω,F,ℙ;𝖧),G⊆F
and
p∈[1, ∞]
,
In the convergence analysis, we use the Robbins-Siegmund Lemma [38, Lemma
11,pg.50].
(A1)
‖𝛼x+𝛽y‖2=𝛼‖x‖2+𝛽‖y‖2−𝛼𝛽‖x−y‖2.
(A2)
𝔼[‖X+Y‖p�
G
]1∕p≤
𝔼
[‖X‖p�
G
]1∕p+
𝔼
[‖Y‖p�
G
]1∕p.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
520
S.Cui et al.
1 3
Lemma 17 (Robbins-Siegmund) Let
(Ω,F,𝔽=(Fn)n≥0,ℙ)
be a discrete stochastic
basis. Let
(
v
n
)
n≥1
,(u
n
)
n≥1
∈𝓁
0
+
(𝔽
)
and
(𝜃n
)
n≥1
,(
𝛽n
)
n≥1
∈𝓁
1
+
(𝔽
)
be such that for all
n≥0
,
Then
(vn)n≥0
converges a.s. to a random variable v, and
(
u
n
)
n≥1
∈𝓁
1
+
(𝔽
)
.
Lemma 18 Let
z≥0
and
0<q<p<1
. Then, if
D≥1
exp
(
1
)
ln
(
p
∕
q
)
, it holds true that
zqz≤Dpz
for all
z≥0
.
Proof We want to find a positive constant
Dmin >0
such that
Dmin exp(zln(p))=zexp(zln(q))
for all
z>0
. Choosing D larger than this, gives a
valid value. Rearranging, this is equivalent to
D
=z
(
q
p)z
≥0
for all
z≥0
, or, which
is still equivalent to
ln(D)−ln(z)−zln(q∕p)=0.
Define the extended-valued func-
tion
f∶[0, ∞) →[−∞,∞]
by
f(z)=ln(D)−ln(z)−ln(q∕p)
if
z>0
, and
f(z)=∞
if
z=0
. Then, for all
z>0
, simple calculus show
f�(z)=−1∕z−ln(q∕p)
and
f��(z)=1∕z2
. Hence,
z↦f(z)
is a convex function with a unique minimum
z
min =
1
ln
(p∕q)
>
0
and a corresponding function value
f(zmin)=ln(D)+ln(ln(p∕q)) + 1
.
Hence, for
D≥
Dmin =
1
exp(1)ln(p∕q)
, we see that
f(zmin)>0
, and thus
zqz≤Dpz
for
all
z≥0
.
◻
Acknowledgements M. Staudigl acknowledges support from the COST Action CA16228 “European
Network for Game Theory”.
Author Contributions All authors contributed equally to the completion of this manuscript.
Availability of data materials All data generated or analysed during this study are included in this pub-
lished article.
Declarations
Conflict of interest The authors have no competing interests to declare that are relevant to the content of
this article
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as
you give appropriate credit to the original author(s) and the source, provide a link to the Creative Com-
mons licence, and indicate if changes were made. The images or other third party material in this article
are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is
not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission
directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen
ses/ by/4. 0/.
𝔼[vn+1|Fn]≤(1+𝜃n)vn−un+𝛽nℙ−a.s. .
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
521
1 3
Stochastic relaxed inertial forward‑backward‑forward…
References
1. Attouch, H., Cabot, A.: Convergence of a relaxed inertial forward-backward algorithm for struc-
tured monotone inclusions. Appl. Math. Optim. 80(3), 547–598 (2019). https:// doi. org/ 10. 1007/
s00245- 019- 09584-z
2. Attouch, H., Cabot, A.: Convergence of a relaxed inertial proximal algorithm for maxi-
mally monotone operators. Math. Program. 184(1), 243–287 (2020). https:// doi. org/ 10. 1007/
s10107- 019- 01412-0
3. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert
Spaces. Springer, (2016)
4. Facchinei, F., Pang, J.-S.: Finite-Dimensional Variational Inequalities and Complementarity Prob-
lems - Volume I and Volume II. Springer, (2003)
5. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica
D 60(1), 259–268 (1992). https:// doi. org/ 10. 1016/ 0167- 2789(92) 90242-F
6. Jacob, L., Obozinski, G., Vert, J.-P.: Group lasso with overlap and graph lasso. Proceedings of the
26th annual international conference on machine learning, pp. 433–440 (2009)
7. Zhao, P., Rocha, G., Yu, B.: The composite absolute penalties family for grouped and hierarchical
variable selection. Ann. Stat. 37(6A), 3468–3497 (2009). https:// doi. org/ 10. 1214/ 07- AOS584
8. Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K.: Sparsity and smoothness via the fused
lasso. J. Royal Statistical Soc.: Series B (Statistical Methodology) 67(1), 91–108 (2005). https:// doi.
org/ 10. 1111/j. 1467- 9868. 2005. 00490.x
9. Tibshirani, R.J., Taylor, J.: The solution path of the generalized lasso. Ann. Statist. 39(3), 1335–
1371 (2011). https:// doi. org/ 10. 1214/ 11- AOS878
10. Attouch, H., Briceno-Arias, L.M., Combettes, P.L.: A parallel splitting method for coupled mono-
tone inclusions. SIAM J. Control. Optim. 48(5), 3246–3270 (2010). https:// doi. org/ 10. 1137/ 09075
4297
11. Latafat, P., Freris, N.M., Patrinos, P.: A new randomized block-coordinate primal-dual proximal
algorithm for distributed optimization. IEEE Trans. Autom. Control 64(10), 4050–4065 (2019).
https:// doi. org/ 10. 1109/ TAC. 2019. 29069 24
12. Rockafellar, R.T.: Conjugate Duality and Optimization. Society for Industrial and Applied Math-
ematics (1974)
13. Combettes, P.L., Pesquet, J.-C.: Primal-dual splitting algorithm for solving inclusions with mixtures
of composite, Lipschitzian, and parallel-sum type monotone operators. Set-Valued and variational
analysis 20(2), 307–330 (2012)
14. Jiang, H., Xu, H.: Stochastic approximation approaches to the stochastic variational inequality prob-
lem. IEEE Trans. Autom. Control 53(6), 1462–1475 (2008). https:// doi. org/ 10. 1109/ TAC. 2008.
925853
15. Shanbhag, U.V.: Chapter5. Stochastic Variational Inequality Problems: Applications, Analysis, and
Algorithms, pp. 71–107 (2013). https:// doi. org/ 10. 1287/ educ. 2013. 0120
16. Staudigl, M., Mertikopoulos, P.: Convergent noisy forward-backward-forward algorithms in non-
monotone variational inequalities. IFAC-PapersOnLine 52(3), 120–125 (2019)
17. Mertikopoulos, P., Staudigl, M.: Convergence to Nash Equilibrium in Continuous Games with
Noisy First-order Feedback. In: 56th IEEE Conference on Decision and Control (2017)
18. Briceno-Arias, L.M., Combettes, P.L.: Monotone operator methods for Nash equilibria in non-
potential games, pp. 143–159. Springer, ??? (2013)
19. Yi, P., Pavel, L.: An operator splitting approach for distributed generalized Nash equilibria computa-
tion. Automatica 102, 111–121 (2019). https:// doi. org/ 10. 1016/j. autom atica. 2019. 01. 008
20. Franci, B., Staudigl, M., Grammatico, S.: Distributed forward-backward (half) forward algorithms
for generalized nash equilibrium seeking. In: 2020 European Control Conference (ECC), pp. 1274–
1279 (2020). https:// doi. org/ 10. 23919/ ECC51 009. 2020. 91436 76
21. Friesz, T.L., Bernstein, D., Smith, T.E., Tobin, R.L., Wie, B.W.: Variational inequality formulation
of the dynamic network user equilibrium. Oper. Res. 41(1), 179–191 (1993)
22. Fukushima, M.: The primal Douglas-Rachford splitting algorithm for a class of monotone mappings
with application to the traffic equilibrium problem. Math. Program. 72(1), 1–15 (1996). https:// doi.
org/ 10. 1007/ BF025 92328
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
522
S.Cui et al.
1 3
23. Han, K., Eve, G., Friesz, T.L.: Computing dynamic user equilibria on large-scale networks with software implementation. Netw. Spat. Econ. 19(3), 869–902 (2019). https://doi.org/10.1007/s11067-018-9433-y
24. Börgens, E., Kanzow, C.: ADMM-type methods for generalized Nash equilibrium problems in Hilbert spaces. SIAM J. Optim., 377–403 (2021). https://doi.org/10.1137/19M1284336
25. Tseng, P.: A modified forward-backward splitting method for maximal monotone mappings. SIAM J. Control Optim. 38(2), 431–446 (2000). https://doi.org/10.1137/S0363012998338806
26. Boţ, R.I., Mertikopoulos, P., Staudigl, M., Vuong, P.T.: Minibatch forward-backward-forward methods for solving stochastic variational inequalities. Stochastic Syst. (2021). https://doi.org/10.1287/stsy.2019.0064
27. Cui, S., Shanbhag, U.V.: On the computation of equilibria in monotone and potential stochastic hierarchical games. arXiv preprint arXiv:2104.07860 (2021)
28. Thong, D.V., Gibali, A., Staudigl, M., Vuong, P.T.: Computing dynamic user equilibrium on large-scale networks without knowing global parameters. Netw. Spat. Econ. 21, 735–768 (2021)
29. Diakonikolas, J., Daskalakis, C., Jordan, M.: Efficient methods for structured nonconvex-nonconcave min-max optimization. In: International Conference on Artificial Intelligence and Statistics, pp. 2746–2754 (2021)
30. Fitzpatrick, S.: Representing monotone operators by convex functions. In: Workshop/Miniconference on Functional Analysis and Optimization, pp. 59–65 (1988). Centre for Mathematics and its Applications, Mathematical Sciences Institute
31. Simons, S., Zălinescu, C.: A new proof for Rockafellar's characterization of maximal monotone operators. Proc. Am. Math. Soc. 132(10), 2969–2972 (2004)
32. Iusem, A., Jofré, A., Oliveira, R.I., Thompson, P.: Extragradient method with variance reduction for stochastic variational inequalities. SIAM J. Optim. 27(2), 686–724 (2017)
33. Iusem, A.N., Jofré, A., Oliveira, R.I., Thompson, P.: Variance-based extragradient methods with line search for stochastic variational inequalities. SIAM J. Optim. 29(1), 175–206 (2019). https://doi.org/10.1137/17M1144799
34. Geiersbach, C., Pflug, G.C.: Projected stochastic gradients for convex constrained problems in Hilbert spaces. SIAM J. Optim. 29(3), 2079–2099 (2019). https://doi.org/10.1137/18M1200208
35. Geiersbach, C., Wollner, W.: A stochastic gradient method with mesh refinement for PDE-constrained optimization under uncertainty. SIAM J. Sci. Comput. 42(5), 2750–2772 (2020). https://doi.org/10.1137/19M1263297
36. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964). https://doi.org/10.1016/0041-5553(64)90137-5
37. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization, vol. 87. Kluwer Academic Publishers (2004)
38. Polyak, B.T.: Introduction to Optimization. Optimization Software (1987)
39. Attouch, H., Maingé, P.-E.: Asymptotic behavior of second-order dissipative evolution equations combining potential with non-potential effects. ESAIM Control Optim. Calc. Var. 17(3), 836–857 (2011)
40. Boţ, R.I., Csetnek, E.R.: Second order forward-backward dynamical systems for monotone inclusion problems. SIAM J. Control Optim. 54(3), 1423–1443 (2016). https://doi.org/10.1137/15M1012657
41. Attouch, H., Peypouquet, J.: Convergence of inertial dynamics and proximal algorithms governed by maximally monotone operators. Math. Program. 174(1), 391–432 (2019). https://doi.org/10.1007/s10107-018-1252-x
42. Su, W., Boyd, S., Candes, E.J.: A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. J. Mach. Learn. Res. (2016)
43. Nesterov, Y.: A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Math. Doklady 27(2), 372–376 (1983)
44. Gadat, S., Panloup, F., Saadane, S.: Stochastic heavy ball. Electron. J. Stat. 12(1), 461–529 (2018). https://doi.org/10.1214/18-EJS1395
45. Lorenz, D.A., Pock, T.: An inertial forward-backward algorithm for monotone inclusions. J. Math. Imag. Vis. 51(2), 311–325 (2015). https://doi.org/10.1007/s10851-014-0523-2
46. Briceño-Arias, L.M., Combettes, P.L.: A monotone+skew splitting model for composite monotone inclusions in duality. SIAM J. Optim. 21(4), 1230–1250 (2011). https://doi.org/10.1137/10081602X
47. Boţ, R.I., Csetnek, E.R.: An inertial forward-backward-forward primal-dual splitting algorithm for solving monotone inclusion problems. Numer. Algorithms 71(3), 519–540 (2016). https://doi.org/10.1007/s11075-015-0007-5
48. Boţ, R.I., Sedlmayer, M., Vuong, P.T.: A relaxed inertial forward-backward-forward algorithm for solving monotone inclusions with application to GANs. arXiv preprint arXiv:2003.07886 (2020)
49. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
50. Juditsky, A., Nemirovski, A., Tauvel, C.: Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Syst. 1(1), 17–58 (2011). https://doi.org/10.1214/10-SSY011
51. Yousefian, F., Nedić, A., Shanbhag, U.V.: On smoothing, regularization, and averaging in stochastic approximation methods for stochastic variational inequality problems. Math. Program. 165(1), 391–431 (2017). https://doi.org/10.1007/s10107-017-1175-y
52. Gidel, G., Berard, H., Vignoud, G., Vincent, P., Lacoste-Julien, S.: A variational inequality perspective on generative adversarial networks. arXiv preprint arXiv:1802.10551 (2018)
53. Mishchenko, K., Kovalev, D., Shulgin, E., Richtárik, P., Malitsky, Y.: Revisiting stochastic extragradient. In: International Conference on Artificial Intelligence and Statistics, pp. 4573–4582. PMLR (2020)
54. Kannan, A., Shanbhag, U.V.: Optimal stochastic extragradient schemes for pseudomonotone stochastic variational inequality problems and their variants. Comput. Optim. Appl. 74(3), 779–820 (2019)
55. Cui, S., Shanbhag, U.V.: On the analysis of variance-reduced and randomized projection variants of single projection schemes for monotone stochastic variational inequality problems. Set-Valued and Variational Analysis (to appear) (2021)
56. Rosasco, L., Villa, S., Vũ, B.C.: A stochastic inertial forward-backward splitting algorithm for multivariate monotone inclusions. Optimization 65(6), 1293–1314 (2016). https://doi.org/10.1080/02331934.2015.1127371
57. Palaniappan, B., Bach, F.: Stochastic variance reduction methods for saddle-point problems. In: Advances in Neural Information Processing Systems, pp. 1416–1424 (2016)
58. Gower, R.M., Schmidt, M., Bach, F., Richtárik, P.: Variance-reduced methods for machine learning. Proc. IEEE 108(11), 1968–1983 (2020). https://doi.org/10.1109/JPROC.2020.3028013
59. Friedlander, M.P., Schmidt, M.: Hybrid deterministic-stochastic methods for data fitting. SIAM J. Sci. Comput. 34(3), 1380–1405 (2012)
60. Jalilzadeh, A., Shanbhag, U.V., Blanchet, J.H., Glynn, P.W.: Smoothed variable sample-size accelerated proximal methods for nonsmooth stochastic convex programs. arXiv preprint arXiv:1803.00718 (2018)
61. Jofré, A., Thompson, P.: On variance reduction for stochastic smooth convex optimization with multiplicative noise. Math. Program. 174(1–2), 253–292 (2019)
62. Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155(1–2), 267–305 (2016)
63. Gunzburger, M.D., Webster, C.G., Zhang, G.: Stochastic finite element methods for partial differential equations with random input data. Acta Numer. 23, 521–650 (2014)
64. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Dordrecht (2004)
65. Davis, D., Drusvyatskiy, D.: Stochastic model-based minimization of weakly convex functions. SIAM J. Optim. 29(1), 207–239 (2019)
66. Lan, G.: First-order and Stochastic Optimization Methods for Machine Learning. Springer Series in the Data Sciences. Springer (2020)
67. Combettes, P.L., Pesquet, J.-C.: Stochastic Quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 25(2), 1221–1248 (2015). https://doi.org/10.1137/140971233
68. Combettes, P.L., Pesquet, J.-C.: Stochastic Quasi-Fejér block-coordinate fixed point iterations with random sweeping II: mean-square and linear convergence. Math. Program. 174(1), 433–451 (2019). https://doi.org/10.1007/s10107-018-1296-y
69. Rosasco, L., Villa, S., Vũ, B.C.: Stochastic Forward-Backward splitting for monotone inclusions. J. Optim. Theory Appl. 169(2), 388–406 (2016). https://doi.org/10.1007/s10957-016-0893-2
70. Spall, J.C.: Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans. Autom. Control 37(3), 332–341 (1992)
71. Spall, J.C.: A one-measurement form of simultaneous perturbation stochastic approximation. Automatica 33(1), 109–112 (1997)
72. Duvocelle, B., Mertikopoulos, P., Staudigl, M., Vermeulen, D.: Learning in time-varying games. arXiv preprint arXiv:1809.03066 (2018)
73. Barty, K., Roy, J.-S., Strugarek, C.: Hilbert-valued perturbed subgradient algorithms. Math. Oper. Res. 32(3), 551–562 (2007). https://doi.org/10.1287/moor.1070.0253
74. Barty, K., Roy, J.-S., Strugarek, C.: A stochastic gradient type algorithm for closed-loop problems. Math. Program. 119(1), 51–78 (2009). https://doi.org/10.1007/s10107-007-0201-x
75. Lei, J., Shanbhag, U.V.: Distributed variable sample-size gradient-response and best-response schemes for stochastic Nash equilibrium problems over graphs. arXiv:1811.11246 (2019)
76. Lei, J., Shanbhag, U.V.: Asynchronous variance-reduced block schemes for composite non-convex stochastic optimization: block-specific steplengths and adapted batch-sizes. Optimization Methods and Software, pp. 1–31 (2020)
77. Borwein, J.M., Dutta, J.: Maximal monotone inclusions and Fitzpatrick functions. J. Optim. Theory Appl. 171(3), 757–784 (2016)
78. Auslender, A., Gourgand, M., Guillet, A.: Résolution numérique d'inégalités variationnelles. In: Lecture Notes in Economics and Mathematical Systems (Mathematical Economics) (1974)
79. Chen, Y., Lan, G., Ouyang, Y.: Optimal primal-dual methods for a class of saddle point problems. SIAM J. Optim. 24(4), 1779–1814 (2014). https://doi.org/10.1137/130919362
80. Monteiro, R.D.C., Svaiter, B.F.: On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean. SIAM J. Optim. 20(6), 2755–2787 (2010). https://doi.org/10.1137/090753127
81. Chen, Y., Lan, G., Ouyang, Y.: Accelerated schemes for a class of variational inequalities. Math. Program. (2017). https://doi.org/10.1007/s10107-017-1161-4
82. Nesterov, Y.: Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Program. 109(2), 319–344 (2007). https://doi.org/10.1007/s10107-006-0034-z
83. Malitsky, Y.: Golden ratio algorithms for variational inequalities. Math. Program. 184(1), 383–410 (2020). https://doi.org/10.1007/s10107-019-01416-w
84. Burachik, R.S., Millán, R.D.: A projection algorithm for non-monotone variational inequalities. Set-Valued and Variational Analysis 28(1), 149–166 (2020). https://doi.org/10.1007/s11228-019-00517-0
85. Rockafellar, R.T., Wets, R.J.: Stochastic variational inequalities: single-stage to multistage. Math. Program. 165(1), 331–360 (2017). https://doi.org/10.1007/s10107-016-0995-5
86. Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis. Springer (1998)
87. Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on Stochastic Programming: Modeling and Theory. SIAM (2009)
88. Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning, vol. 1. Springer (2001)

Authors and Affiliations

Shisheng Cui¹ · Uday Shanbhag¹ · Mathias Staudigl² · Phan Vuong³

Shisheng Cui
suc256@psu.edu

Uday Shanbhag
udaybag@psu.edu

Phan Vuong
T.V.Phan@soton.ac.uk

1 Department of Industrial and Manufacturing Engineering, Pennsylvania State University, University Park, PA 16802, USA
2 Department of Advanced Computing Sciences (DACS), Maastricht University, Paul-Henri-Spaaklaan 1, 6229 EN Maastricht, The Netherlands
3 Mathematical Sciences, University of Southampton, Highfield, Southampton SO17 1BJ, UK