
Pseudogradient adaptation and training algorithms

Authors: B. T. Polyak and Ya. Z. Tsypkin
AURCAT 34(3) 341-422 (1973)

AUTOMATION AND REMOTE CONTROL
Russian Original Vol. 34, No. 3, Part 1, March 1973
August 1, 1973

AVTOMATIKA I TELEMEKHANIKA
Translated from Russian. This translation is published under the editorial direction of the Instrument Society of America.
Consultants Bureau, New York
ADAPTIVE SYSTEMS

PSEUDOGRADIENT ADAPTATION AND TRAINING ALGORITHMS

B. T. Polyak and Ya. Z. Tsypkin    UDC 62-50

The authors propose a unified approach, based on the notion of the pseudogradient, to the analysis of various stochastic algorithms for minimizing functionals. A general theorem regarding convergence is proved; this is used to substantiate stochastic-approximation algorithms (regular and search), random-search algorithms, generalized stochastic gradient algorithms, and various particular algorithms for identification and pattern-recognition problems.
Introduction

At present, there is no lack of various algorithms for finding the unconditional extremum of a functional J(c) which defines an optimality criterion.

When the functional J(c) is differentiable and the gradient ∇J(c) can be measured (even if noise is present), regular algorithms are used in which the random realization of the gradient appears. Many algorithms can be regarded as "distorted" gradient algorithms; steps are made not directly in the direction of the antigradient, but they involve the use of this quantity. For example, the gradient may be multiplied by a matrix, only the signs of its components may be allowed for, certain terms in the gradient expression may be discarded, etc. If the gradient cannot be measured but it is possible to compute the values of realizations of J(c), various search algorithms are used. Here the direction of motion is taken to be either a finite-difference approximation of the gradient (methods of the Kiefer-Wolfowitz type), or a stochastic vector (random-search methods), or a deterministic vector that is not directly related to the gradient (e.g., methods of coordinate descent). More complex situations in which the optimality criterion is not differentiable are also possible. In some of these cases it is possible to use generalized concepts of the gradient (i.e., the method of generalized stochastic gradients). Finally, many adaptation and training algorithms are essentially of a nongradient nature: no functional exists with respect to which these algorithms are gradient algorithms.

The aim of our paper is to develop a general approach which will encompass all the above diverse situations from a unified viewpoint. This approach, which rests on the notion of the pseudogradient, consists in the following. We consider an iteration algorithm of the form

    c[n] = c[n−1] − γ[n] s[n],

where s[n] is a certain random direction of motion, which in general depends on all the preceding values of c and on n, and γ[n] > 0 is a scalar factor. We assume that there exists a certain deterministic smooth functional J(c), i.e., the optimality criterion. This functional can either be specified a priori (if the initial problem is to minimize it) or be introduced artificially. It is important to stress that it is not assumed that it is possible to compute the values of J(c) or ∇J(c) even with a random error (i.e., these quantities may not be accessible to measurement). We call s[n] the pseudogradient of J(c) at the point c[n−1] if we have the condition

    ∇J(c[n−1])^T M s[n] ≥ 0,

i.e., if the vector s[n] on average makes an acute angle with the gradient; in other words, if −s[n] is on average a direction of decrease of the functional. Of course, at each point there exists a set of pseudogradients of one functional J(c). We should also note that no assumptions are made regarding the continuity of Ms[n] as a function of
Translated from Avtomatika i Telemekhanika, No. 3, pp. 45-68, March, 1973. Original article submitted July 20, 1972.
c[n−1]. If at each step s[n] is a pseudogradient of J(c), then the iteration algorithm will be called a pseudogradient algorithm. It turns out that the class of pseudogradient algorithms is very broad and encompasses almost all known adaptation and training algorithms.

Below we will prove a general theorem regarding convergence of pseudogradient algorithms and will give various examples of its use. Thus we will be able to obtain a number of both old and new results regarding the convergence of various algorithms from a unified viewpoint. Analysis of the algorithms reduces basically to a check on their pseudogradient nature.
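By way of a modern illustration (not part of the original paper), the following minimal Python sketch, assuming only NumPy and with all names ours, implements the basic iteration c[n] = c[n−1] − γ[n]s[n] for an arbitrary user-supplied direction generator; in the example the direction is a gradient realization corrupted by additive noise, which is a pseudogradient for J(c) = ½‖c‖².

```python
import numpy as np

def pseudogradient_descent(s, c0, gamma, n_steps, seed=0):
    """Run the basic iteration c[n] = c[n-1] - gamma(n) * s(c, n, rng).

    s(c, n, rng) -- returns a (random) direction of motion at the point c;
                    it is a pseudogradient if grad J(c)^T E[s] >= 0.
    gamma(n)     -- deterministic positive step factors.
    """
    rng = np.random.default_rng(seed)
    c = np.asarray(c0, dtype=float)
    for n in range(1, n_steps + 1):
        c = c - gamma(n) * s(c, n, rng)
    return c

# Example: J(c) = 0.5*||c||^2; the direction is the gradient plus additive noise.
# E[s] = c, so grad J(c)^T E[s] = ||c||^2 >= 0 and s is a pseudogradient.
if __name__ == "__main__":
    s_noisy = lambda c, n, rng: c + rng.normal(size=c.shape)
    c_final = pseudogradient_descent(s_noisy, c0=np.ones(5),
                                     gamma=lambda n: 1.0 / n, n_steps=5000)
    print(np.linalg.norm(c_final))  # should be close to 0
```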
1. Gradient and Pseudogradient Algorithms

Let us recall the adaptive approach to the construction of training algorithms [1, 2]. We will denote the functional to be minimized (the optimality criterion) by J(c), and its realization at the n-th step (at the point c[n−1]) by Q(x[n], c[n−1]), i.e.,

    J(c) = M_x{Q(x, c)} = ∫ Q(x, c) p(dx).    (1.1)

The measurement error for J(c) at the n-th step will be denoted by ξ[n]:

    ξ[n] = Q(x[n], c[n−1]) − J(c[n−1]).    (1.2)

We introduce similar notation for the gradient of J(c), its realizations, and the error in measuring the gradient:

    ∇J(c) = M_x{∇_c Q(x, c)} = ∫ ∇_c Q(x, c) p(dx),    (1.3)

    ζ[n] = ∇_c Q(x[n], c[n−1]) − ∇J(c[n−1]).    (1.4)

In accordance with the extremum condition ∇J(c) = 0, the gradient algorithm for minimizing J(c) can be written in the form

    c[n] = c[n−1] − γ[n] ∇_c Q(x[n], c[n−1]),    (1.5)

where γ[n] > 0 is a scalar factor. Examples of gradient algorithms for various problems can be found in [1-4]. Moreover, many algorithms which are not strictly gradient algorithms in form (e.g., Kiefer-Wolfowitz algorithms or random-search algorithms) can be treated as gradient algorithms by changing to different (smoothed) functionals [5].
In this paper we will investigate general training algorithms from a different point of view. We will not attempt to construct, for each algorithm, a functional for which the algorithm is a gradient algorithm. This eliminates the difficulties associated with the complexity or impossibility of obtaining such a functional. A change to pseudogradient algorithms yields a number of advantages as compared to strictly gradient algorithms. In the first place, direct computation of the gradient is impossible in many cases. This is the situation, for example, when only values of the functional are accessible to measurement. Second, frequently it is theoretically possible but extremely laborious to compute the gradient (e.g., it is necessary to construct sensitivity functions). Pseudogradient algorithms can be much simpler. Third, in certain problems strictly gradient algorithms converge very slowly. Convergence can be accelerated by choosing directions of motion that differ from the gradient. For example, in deterministic minimization problems, methods of the Newton type converge much more rapidly than gradient methods. Finally, the notion of gradient becomes meaningless for nonsmooth functionals. At the same time, algorithms obtained, e.g., from heuristic considerations are pseudogradient with respect to other, artificially introduced smooth functionals (which act as Lyapunov functions). An example can be provided by J(c) = ½‖c − c*‖², where c* is the optimum value of c. Obviously, it is impossible to compute J(c) or ∇J(c) since the value of c* is unknown. The pseudogradient condition, however (which in this case assumes the form (c[n−1] − c*)^T Ms[n] ≥ 0), can usually be checked without difficulty. A number of such examples will be described in §5.
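One simple way to check a pseudogradient condition of this kind in practice is Monte Carlo estimation of Ms[n]; the following sketch (ours, not from the paper; NumPy assumed, and the sign-based direction is purely illustrative) verifies (c[n−1] − c*)^T Ms[n] ≥ 0 for a heuristic direction with respect to J(c) = ½‖c − c*‖².

```python
import numpy as np

# Monte Carlo check of the pseudogradient condition with respect to
# J(c) = 0.5*||c - c_star||^2: the direction s must satisfy
# (c - c_star)^T E[s] >= 0 at the current point c.
rng = np.random.default_rng(0)
c_star = np.array([1.0, -2.0, 0.5])
c = np.array([3.0, 0.0, 0.0])                     # current iterate c[n-1]

def s(rng):
    """Heuristic direction: sign of a noisy gradient measurement."""
    return np.sign((c - c_star) + rng.normal(size=3))

mean_s = np.mean([s(rng) for _ in range(100000)], axis=0)   # estimate of M s[n]
print((c - c_star) @ mean_s >= 0)                 # True: s is a pseudogradient here
```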
Table 1 shows several typical pseudogradient algorithms. It is assumed that the functional is of the form (1.1), and that realizations of the functional Q(x, c) or of the gradient ∇_c Q(x, c) can be measured. The meaning of certain notation, as well as the more precise conditions imposed on J(c) and on the algorithm parameters under which the algorithms become pseudogradient algorithms, will be formulated in §§3-5. The literature references in Table 1 are not intended to be complete and indicate only the papers in which the idea of the algorithm was first formulated (although perhaps in a somewhat different form). In addition to the general algorithms in Table 1, we will also investigate certain specific algorithms which allow for the specific features of the problems in question (in particular, adaptive identification and pattern-recognition algorithms).
TABLE 1

Gradient (Robbins-Monro stochastic approximation algorithm):
    c[n] = c[n−1] − γ[n] ∇_c Q(x[n], c[n−1])

Transformed gradient (Γ[n] is a matrix):
    c[n] = c[n−1] − Γ[n] ∇_c Q(x[n], c[n−1])

Sign:
    c[n] = c[n−1] − γ[n] sign ∇_c Q(x[n], c[n−1])

Simplified gradient (∇_c Q(x, c) = A(c) q(x, c), A is a matrix, q is a vector):
    c[n] = c[n−1] − γ[n] q(x[n], c[n−1])

Search (Kiefer-Wolfowitz stochastic approximation) (e_i are unit vectors, α[n] is a scalar):
    c_i[n] = c_i[n−1] − (γ[n]/α[n]) [Q(x_i[n], c[n−1] + α[n] e_i) − Q(x_0[n], c[n−1])]

Regular random-search algorithm (q[n] is a random vector):
    c[n] = c[n−1] − γ[n] q[n] (q[n]^T ∇_c Q(x[n], c[n−1]))

Random search with paired trials (q[n] is a random vector):
    c[n] = c[n−1] − (γ[n]/(2α[n])) q[n] [Q(x_1[n], c[n−1] + α[n] q[n]) − Q(x_0[n], c[n−1] − α[n] q[n])]

Coordinate descent (q[n] = e_i with probability p_i[n] ≥ ε > 0):
    c[n] = c[n−1] − γ[n] q[n] (q[n]^T ∇_c Q(x[n], c[n−1]))

Generalized gradient (Q(x, c) is nondifferentiable but convex in c, ∇̃_c Q(x, c) is the generalized gradient):
    c[n] = c[n−1] − γ[n] ∇̃_c Q(x[n], c[n−1])
Of particular importance for evaluating the effectiveness of pseudogradient algorithms are the problems of rate of convergence, stability, and overall quality of the algorithms. In this paper we will limit ourselves to establishing the facts of convergence.
2. Conveгgеnсe of Pseudogгadient A1яorithms
Until reсеntly' сonveгgenсe ofvarious сlasses ofstoсhastiс algoгithms has beеn invеstigated independently.
The known faсts in this area aтe summarized in [3, I41. Let us Ieсall only сеrtain basiс тesulB.
Тhe simplest сase of sfiiсtly gadiеnt аlgorithms was сonsideгed as eaгly as the сlassiсal papeг of Robbins
and Morшo [6]. Iп the most general situаtion, iп whiсh neither uпiquеness noг even existenсе of a minimum point
of the funсtional is assumed, the сonvergen"e Ьf gгadient аlgorithms was established iп [15, 12] (Thеoтem 2). The
investigation of searсh algorithms begins with the work of кiefer and Wo1forwitz |1.o7. The сonveгgenсе of thе
multidimensional anаlog of the Kiefeг-Wo1fowitz method was proved by B1um [16] (seе also [15, 11]). Rastrigin
initiated the usе of гandom-searсh methфs (sеe, е.g., [9]). Ceгtain rеfinеmеnts and substantiаtioпs of random-
seaгсh algoтithms may be found in [11, 1?-19]. Finally, gradient algoгithms wеге gеnегalized to the сase of non-
diffeтentiable сoпvex funсtioпаls in [ 1 1. 13 ].
As for the substantiation of general nongradient algorithms, the first result in this area is Blum's [16]. He considered an iterative process for finding the root of the regression equation and proved its convergence by means of an analog of the Lyapunov function which he introduced. It is interesting to note that, although Blum's work is often cited, his results are evidently not sufficiently well known. This is probably the common fate of many classical studies. Blum's approach was subsequently developed and strengthened by Aizerman, Braverman, and Rozonoer [20]. A number of interesting results can be found in the monograph of Nevel'son and Khas'minskii [35].

The above papers considered algorithms which operate in the presence of additive noise resulting from inaccuracies in measuring the gradients or functionals. On the other hand, many papers were devoted to nongradient algorithms for deterministic extremum problems (see, in particular, [21, 22] and the bibliography given there). As a rule, such algorithms cannot be used when noise is present.

It is possible to have intermediate types of algorithms which operate only in the absence of additive noise but which make it possible to find the extremums of functionals when the error in measuring the gradients or functionals is of a relative nature. This means that the error decreases and tends to zero as the minimum point is approached. This kind of error is generated, for example, by multiplicative noise.
Below we will prove a basic theorem regarding the convergence of discrete pseudogradient algorithms. It turns out that, by using just this theorem alone, it is possible to establish convergence of almost all known adaptation and training algorithms, both with and without additive and multiplicative noise. Thus the theorem encompasses both deterministic and stochastic optimization problems.

Let us now give an exact formulation of the theorem. Consider the algorithm

    c[n] = c[n−1] − γ[n] s[n],    (2.1)

where all the c[n] belong to the Hilbert space H, c[0] is fixed, γ[n] > 0 are deterministic scalar factors that depend only on n, and s[n] are realizations of the random vector s with values in H, whose distribution depends, generally speaking, on c[0], ..., c[n−1], s[1], ..., s[n−1], and n. Obviously, any iteration algorithm may be written in the form (2.1). For example, if the initial algorithm has the form

    c[n] = c[n−1] − Γ[n] ∇_c Q(x[n], c[n−1]),

where Γ[n] is an operator that depends on c[0], ..., c[n−1], x[1], ..., x[n], and n, we may take s[n] = γ[n]^(−1) Γ[n] ∇_c Q(x[n], c[n−1]), where γ[n] is a deterministic scalar factor that is chosen in some fashion, and thus reduce the algorithm to the basic form. We will repeatedly use this possibility in the future.

Assume that for given c[0], ..., c[n−1], s[1], ..., s[n−1], and n there exist the conditional mathematical expectations of s and ‖s‖² (which will be denoted by Ms[n] and M‖s[n]‖² respectively). Assume also that in H we have specified a deterministic functional J(c) (the optimality criterion) that is bounded from below and differentiable, and whose gradient ∇J(c) satisfies a Lipschitz condition, i.e.,

    J(c) ≥ J* > −∞,    (2.2)

    ‖∇J(c + ā) − ∇J(c)‖ ≤ L‖ā‖,  c, ā ∈ H.    (2.3)

Assume that algorithm (2.1) is a pseudogradient algorithm:

    ∇J(c[n−1])^T Ms[n] ≥ 0.    (2.4)

We impose a constraint on the step length in the algorithm:

    M‖s[n]‖² ≤ λ[n] + K₁ J(c[n−1]) + K₂ ∇J(c[n−1])^T Ms[n].    (2.5)

This condition means that M‖s[n]‖², as a function of n, does not increase more rapidly than some sequence λ[n] (where possibly λ[n] → ∞), and as a function of c it increases not more rapidly than J(c) or ∇J(c)^T Ms[n]. The factors γ[n] and λ[n] must satisfy the conditions
    Σ_{n=1}^∞ γ[n] = ∞,    (2.6)

    Σ_{n=1}^∞ γ²[n] λ[n] < ∞.    (2.7)
It is clear that if the λ[n] are bounded, then condition (2.7) is a corollary of (2.8).

Theorem 1. Assume that conditions (2.2)-(2.7) hold, as well as either of the following:

    Σ_{n=1}^∞ γ²[n] < ∞,    (2.8)

    λ[n] ≡ 0,  K₁ = 0,  lim sup γ[n] < 2/(LK₂).    (2.9)

Then for any c[0] the sequence c[n] generated by algorithm (2.1) is almost certainly such that the limit of J(c[n]) exists and

    lim_{n→∞} ∇J(c[n−1])^T Ms[n] = 0.    (2.10)
The proof of the theorem, which is based on a standard application of facts regarding the convergence of semimartingales, is given in the appendix. The condition Σγ²[n] < ∞, which implies that γ[n] → 0, insures the convergence of the algorithm when additive noise is present. In the absence of such noise (formally, for λ[n] = 0, K₁ = 0), expression (2.8) is replaced by the less rigid requirement lim sup γ[n] < 2/(LK₂), which does not require that γ[n] tend to zero. We should note that in the second case it is possible to obtain a higher rate of convergence.

We should emphasize that, under the assumptions made in the theorem, it is impossible to assert that algorithm (2.1) minimizes the functional J(c). This is not surprising since, for example, all the conditions of the theorem hold for s[n] ≡ 0. If in the pseudogradient condition we require strict inequality for all c[n−1] other than the minimum points, we can obtain the stronger assertions given below.
Corollary 1. Assume that, in addition to the conditions of Theorem 1,

    ∇J(c[n−1])^T Ms[n] ≥ δ(ε) > 0  for  J(c[n−1]) ≥ J* + ε    (2.11)

for all ε > 0. Then J(c[n]) → J* almost certainly.

Corollary 2. Assume that, in addition to the conditions of Theorem 1, the space H is finite-dimensional (H = R^k), the sets of the form {c : J(c) ≤ const} are bounded, and

    ∇J(c[n−1])^T Ms[n] ≥ δ(ε) > 0  for  ‖∇J(c[n−1])‖ ≥ ε    (2.12)

for all ε > 0. Then there almost certainly exist a subsequence n_i and a point c* such that

    ∇J(c*) = 0,  c[n_i] → c*,  J(c[n]) → J(c*).    (2.13)

For an arbitrary set C ⊂ H and point c ∈ H we denote the distance from c to C by ρ(c, C), i.e.,

    ρ(c, C) = inf_{c̄ ∈ C} ‖c − c̄‖.    (2.14)

Corollary 3. Assume that, in addition to the conditions of Theorem 1, the set C* of minimum points of J(c) is not empty and

    inf J(c) > J*  for  ρ(c, C*) ≥ ε,    (2.15)

    ∇J(c[n−1])^T Ms[n] ≥ δ(ε) > 0  for  ρ(c[n−1], C*) ≥ ε    (2.16)

for all ε > 0. Then ρ(c[n], C*) → 0, J(c[n]) → J*. In particular, if C* consists of the single point c*, then c[n] → c*.
The proofs of Corollaries 1-3 are very simple and are therefore omitted.

For the case in which C* consists of a single point and (2.8) holds, Corollary 3 is a direct generalization of the above results of Blum [16]. As compared to the latter, however, it is not assumed that a second derivative of J(c) exists, that M‖s[n]‖² is bounded, or that H is finite-dimensional. Corollary 3 is also similar to Theorem VIII of Aizerman, Braverman, and Rozonoer [20] and to Theorem 4.4 in [35] (in [20, 35] it is assumed that J(c) is twice differentiable, and λ[n] = const in the condition analogous to (2.5)). Theorem 1 and Corollaries 1 and 2 with condition (2.8), in which it is not assumed that minimum points exist, are themselves generalizations of the above results relating to strictly gradient algorithms [15, 12]. Finally, given condition (2.9), the assertions of the theorem and corollaries relate to certain results regarding the convergence of deterministic minimization algorithms (see [21, 22] and the bibliography given there).
The material which follows (§§3-5) is devoted to an analysis of the convergence of various algorithms by means of Theorem 1. If the initial problem involves the minimization of a smooth functional J(c), then the substantiation of a specific algorithm reduces to checking the conditions of Theorem 1, primarily the pseudogradient condition (2.4). Here algorithms are divided into two large classes, regular and search (which will be dealt with in §§3 and 4 respectively). In the former, a stochastic realization of the gradient is used in constructing the algorithm, while in the second it is assumed that only realizations of the functional are accessible. In the more complex situation in which the function to be minimized is not smooth, it is possible to utilize Theorem 1 by constructing a different, artificially introduced functional with respect to which the algorithm is a pseudogradient algorithm. Examples of this approach will be given in §5.
3. Regular Algorithms

Assume that in the space H we are given a functional J(c) satisfying conditions (2.2) and (2.3). It is assumed that at the point c[n−1] the random realization of the gradient ∇_c Q(x[n], c[n−1]) is known. We recall that ζ[n] = ∇_c Q(x[n], c[n−1]) − ∇J(c[n−1]) denotes the random error in measuring the gradient, Mζ[n] = 0, and we assume that the variance of ζ[n] satisfies the condition

    M‖ζ[n]‖² ≤ λ[n] + K₁ J(c[n−1]) + K₃ ‖∇J(c[n−1])‖².    (3.1)

The most important particular cases of this condition are as follows:

    M‖ζ[n]‖² ≤ σ²,    (3.1′)

i.e., additive noise (the gradient is measured with a specified absolute error);

    M‖ζ[n]‖² ≤ K₃ ‖∇J(c[n−1])‖²,    (3.1″)

i.e., multiplicative noise (the gradient is measured with a specified relative error).

Let us consider the gradient algorithm (the multidimensional Robbins-Monro stochastic approximation algorithm)

    c[n] = c[n−1] − γ[n] ∇_c Q(x[n], c[n−1]).    (3.2)
Theorem 2. Assume that conditions (2.2), (2.3), (2.6), (2.7), (3.1), and either of (2.8) or (2.9) hold (with K₂ = 1 + K₃). Then for any c[0] in algorithm (3.2), the quantity lim J(c[n]) almost certainly exists, and lim ‖∇J(c[n])‖ = 0. If, moreover, H = R^k and the sets {c : J(c) ≤ const} are bounded, then there almost certainly exist a point c*, ∇J(c*) = 0, and a sequence n_i such that c[n_i] → c*, J(c[n]) → J(c*). If the point c* for which ∇J(c*) = 0 is unique, then c[n] → c*.

Proof. We have Ms[n] = ∇J(c[n−1]), so that ∇J(c[n−1])^T Ms[n] = ‖∇J(c[n−1])‖² and expressions (2.4) and (2.12) hold. On the other hand, M‖s[n]‖² = ‖∇J(c[n−1])‖² + M‖ζ[n]‖², so that expression (3.1) implies that (2.5) holds with K₂ = 1 + K₃. Using Corollary 2, we obtain all the assertions of Theorem 2.

The proof is given to show how simple it is to apply Theorem 1 in this case. Henceforth we will omit elementary computations of this type. It is for this reason, in particular, that the proofs of Theorems 3, 4, and 6 are omitted.
Theorem 2 (allowing for (2.8)) strengthens the result of Theorem 1 in [15]. In a somewhat more general situation (when Mζ[n] = b[n], Σ‖b[n]‖ < ∞), a similar assertion was proved in [12] (Theorem 2). Finally, subject to condition (2.9), Theorem 2 implies results that generalize those already known for the deterministic case [21].
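As a present-day illustration of algorithm (3.2) (a sketch of ours, assuming NumPy; the mean-estimation example and all names are hypothetical), with γ[n] = a/n chosen so that (2.6) and (2.8) hold:

```python
import numpy as np

def robbins_monro(grad_Q, sample_x, c0, a=1.0, n_steps=20000, seed=0):
    """Gradient (stochastic approximation) algorithm (3.2):
        c[n] = c[n-1] - gamma[n] * grad_c Q(x[n], c[n-1]),
    with gamma[n] = a/n, so that sum gamma[n] = inf (2.6) and
    sum gamma[n]^2 < inf (2.8)."""
    rng = np.random.default_rng(seed)
    c = np.asarray(c0, dtype=float)
    for n in range(1, n_steps + 1):
        c = c - (a / n) * grad_Q(sample_x(rng), c)
    return c

# Example: Q(x, c) = 0.5*||c - x||^2, so J(c) = M_x Q(x, c) is minimized at
# c* = M x; the realization of the gradient is simply c - x[n].
if __name__ == "__main__":
    mean = np.array([2.0, -1.0, 0.5])
    c_hat = robbins_monro(grad_Q=lambda x, c: c - x,
                          sample_x=lambda rng: mean + rng.normal(size=3),
                          c0=np.zeros(3), a=1.0)
    print(c_hat)  # close to [2, -1, 0.5]
```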
Let us now analyze the direct generalization of the gradient method in which the scalar factor γ[n] is replaced by the linear operator Γ[n] (in the finite-dimensional case, Γ[n] is a matrix):

    c[n] = c[n−1] − Γ[n] ∇_c Q(x[n], c[n−1]).    (3.3)

Let us assume that Γ[n] satisfies the following conditions:

    Σ_{n=1}^∞ ‖Γ[n]‖ = ∞,    (3.4)

    c^T Γ[n] c ≥ λ ‖Γ[n]‖ ‖c‖²,  λ > 0,  for all c ∈ H.    (3.5)

Conditions (3.4) and (3.5) hold, for example, if Γ[n] = γ[n]Γ, and Γ is a symmetrical positive definite operator (this being equivalent to a transformation of the metric of the initial space in conjunction with a strictly gradient method; see [7]). However, the operator Γ[n] need not be symmetrical in general. For example, it can have the form Γ[n] = γ[n]Γ, Γ = Γ₁ + Γ₂, where Γ₁ is a symmetrical positive definite operator and Γ₂ is a skew-symmetrical operator, i.e., c^T Γ₂ c = 0 (for the deterministic case with a quadratic functional J(c), iterative methods with such operators were considered in [23]). We write γ[n] = ‖Γ[n]‖.

Theorem 3. When conditions (2.2), (2.3), (2.7), (3.4), (3.5), and either of conditions (2.8) and (2.9) hold, the assertions of Theorem 2 are valid for algorithm (3.3).
We can investigate in exactly the same way the algorithm in which the gradient is subject to a transformation of a somewhat different type. Specifically, let us assume that

    ∇_c Q(x, c) = A(c) q(x, c),    (3.6)

where A(c) is a linear operator from H to H, and the vector q(x, c) is capable of being measured. For example, A(c) can be a sensitivity matrix, whose computation is extremely complex. Thus, if the functional has the form

    Q(x, c) = ½‖R(c) − x‖²,  ∇_c Q(x, c) = R′(c)^T (R(c) − x),    (3.7)

then A(c) = R′(c)^T is the sensitivity matrix (Jacobi matrix), whose construction requires that we calculate the partial derivatives, and q(x, c) = R(c) − x is a measurable vector. Of interest is the method in which the operator A(c) is discarded, namely

    c[n] = c[n−1] − γ[n] q(x[n], c[n−1]),    (3.8)

where the noise ζ[n] = q(x[n], c[n−1]) − M_x q(x, c[n−1]) satisfies conditions (3.1). Similar methods were considered in [24], as regards the functional J(c) = ½ M‖R(c) − x‖² in the absence of noise. We will assume that A(c) is positive definite (although not necessarily symmetrical) and bounded, i.e.,

    c̄^T A(c) c̄ ≥ λ‖c̄‖²,  λ > 0,  and  ‖A(c)‖ ≤ a    (3.9)

for all c, c̄ ∈ H.

Theorem 4. When conditions (2.2), (2.3), (2.6), (2.7), (3.7), (3.9), and either of conditions (2.8) and (2.9) hold, for any c[0] in algorithm (3.8) we will have lim ‖M_x q(x, c[n])‖ = 0 (and consequently lim ‖∇J(c[n])‖ = 0).
Let us now analyze another group of pseudogradient algorithms in which the components of the gradient are subject to a nonlinear transformation, and let us discuss the "sign" algorithm which is most characteristic for this group. Henceforth a subscript next to a vector will denote the corresponding component of the vector; the notation sign c, where c = (c₁, ..., c_k), will denote a vector with components (sign c₁, ..., sign c_k); and the function sign t (t being a scalar) is equal to 1 for t > 0, −1 for t < 0, and 0 for t = 0. Let us consider an algorithm in which only the signs of the gradient components are allowed for (see [8]):

    c[n] = c[n−1] − γ[n] sign ∇_c Q(x[n], c[n−1]).    (3.10)

We will assume that each noise component assumes positive and negative values equiprobably and that there exists a nonzero probability that the noise is small in absolute value, i.e.,

    P(ζ[n]_i > 0) = P(ζ[n]_i < 0),    (3.11)

    P(0 ≤ ζ[n]_i ≤ ε) ≥ δ(ε),  P(−ε ≤ ζ[n]_i ≤ 0) ≥ δ(ε)    (3.12)

for all ε > 0 and all c[0], ..., c[n−1], n, where δ(ε) is a monotonically increasing function, δ(ε) > 0 for ε > 0.

Theorem 5. Given conditions (2.2), (2.3), (2.6), (2.8), (3.11), and (3.12), the same assertions as in Theorem 2 hold for algorithm (3.10).

The proof is given in the appendix. We should emphasize that both conditions (3.11) and (3.12) are essential (when they are violated, the algorithm can lose the pseudogradient property). The necessity of (3.11) has already been noted in the literature [25]. We will give a simple example that shows that failure to observe (3.12) can cause the algorithm to diverge. Assume that the vector c is one-dimensional, J = c²/2, and ζ[n] assumes the values ±1 equiprobably for all c[0], ..., c[n−1], n. Then for |c[n−1]| > 1 a step in algorithm (3.10) is always made in the requisite direction (toward the point 0). But for |c[n−1]| < 1 we will have sign ∇_c Q(x[n], c[n−1]) = sign ζ[n], so that a step will be made in either direction with probability 1/2. Thus we will have a random walk on the segment (−1, 1).

We should also note that in algorithm (3.10) we have ‖s[n]‖² = k, except for points at which some of the components of ∇_c Q(x[n], c[n−1]) are equal to zero, so that we cannot have λ[n] = 0, K₁ = 0 in (2.5). Therefore condition (2.8) cannot be replaced by condition (2.9) that γ[n] is bounded. For the algorithm to converge without assuming that γ[n] → 0, it must be modified. For example, we can take the algorithm

    c[n] = c[n−1] − γ[n] ‖∇_c Q(x[n], c[n−1])‖ sign ∇_c Q(x[n], c[n−1]),

and substantiate its convergence in the absence of additive noise with condition (2.9).
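A minimal sketch of the "sign" algorithm (3.10) (ours, assuming NumPy; the noise is chosen so that (3.11) and (3.12) hold, and γ[n] = a/n so that (2.6) and (2.8) hold):

```python
import numpy as np

def sign_algorithm(grad_Q, sample_x, c0, a=1.0, n_steps=20000, seed=0):
    """'Sign' algorithm (3.10): only the signs of the gradient components
    are used,  c[n] = c[n-1] - gamma[n] * sign(grad_c Q(x[n], c[n-1])),
    with gamma[n] = a/n."""
    rng = np.random.default_rng(seed)
    c = np.asarray(c0, dtype=float)
    for n in range(1, n_steps + 1):
        c = c - (a / n) * np.sign(grad_Q(sample_x(rng), c))
    return c

# Example: J(c) = 0.5*||c - c_star||^2 with additive gradient noise whose
# components are symmetric and have positive density near zero, so that
# conditions (3.11) and (3.12) hold.
if __name__ == "__main__":
    c_star = np.array([1.0, -3.0])
    g = lambda x, c: (c - c_star) + x          # x plays the role of the noise
    c_hat = sign_algorithm(g, lambda rng: rng.normal(size=2), np.zeros(2))
    print(c_hat)                               # approximately [1, -3]
```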
Let us now consider regular algorithms in which the vector q[n] is chosen in some random fashion (independently of the realization of the gradient) and steps are made either in the direction of q[n] or of −q[n], depending on which of these vectors makes an acute angle with the realization of the gradient:

    c[n] = c[n−1] − γ[n] q[n] (∇_c Q(x[n], c[n−1])^T q[n]).    (3.13)

Assume that the vector q[n] is distributed such that for any c ∈ H, ‖c‖ = 1, the rms value of its projection onto c is nonzero (in other words, for each hyperplane there exists a nonzero probability that q[n] will appear outside it, or, equivalently, the covariation matrix of q[n] is nonsingular):

    M(q[n]^T c)² ≥ λ > 0.    (3.14)

For example, expression (3.14) holds in the space R^k if q[n] has a positive distribution density on some bounded convex body for which 0 is an interior point. Another example in which (3.14) holds is random coordinate descent, in which unit vectors e_i with probabilities p_i[n] ≥ ε > 0 are taken as q[n].

Henceforth we will assume that the random quantities q[n] and ζ[n] are independent, the variance of ζ[n] is bounded, and the vectors q[n] are not excessively large:

    M{ζ[n]^T q[n]} = 0,  M‖ζ[n]‖² ≤ σ²,  M‖q[n]‖^i ≤ a^i  (i = 1, ..., 4).    (3.15)

Theorem 6. Assume that conditions (2.2), (2.3), (2.6), (3.14), and (3.15) hold, and c[0] is arbitrary. Then if either expression (2.8) holds, or σ² = 0 and lim sup γ[n] < 2λ/(La⁴), then the assertions of Theorem 2 hold for algorithm (3.13).

We should note that if in (3.13) we allow only for the sign of ∇_c Q(x[n], c[n−1])^T q[n], i.e., consider the algorithm

    c[n] = c[n−1] − γ[n] q[n] sign(∇_c Q(x[n], c[n−1])^T q[n]),

it will converge only subject to additional assumptions of the type of (3.11) and (3.12) regarding the error ζ[n].
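A minimal sketch of one step of algorithm (3.13) with q[n] uniform on the unit sphere (ours, assuming NumPy; the quadratic example is illustrative only):

```python
import numpy as np

def random_direction_step(grad_Q, x, c, gamma, rng):
    """One step of algorithm (3.13): a direction q[n] is drawn at random and
    the step is made along +/- q[n] according to the projection of the
    gradient realization onto it:
        c[n] = c[n-1] - gamma[n] * q[n] * (grad_c Q(x[n], c[n-1])^T q[n])."""
    q = rng.normal(size=c.shape)
    q /= np.linalg.norm(q)              # uniform direction on the unit sphere
    return c - gamma * q * (grad_Q(x, c) @ q)

# Example with J(c) = 0.5*||c - c_star||^2 and exact gradient realizations.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c_star, c = np.array([2.0, 0.0, -1.0]), np.zeros(3)
    for n in range(1, 20001):
        c = random_direction_step(lambda x, c: c - c_star, None, c, 1.0 / n, rng)
    print(c)                            # approximately [2, 0, -1]
```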
Except for algorithm (3.8), we have nowhere made specific the form of the functional to be minimized. Naturally, for narrower classes of functionals the convergence conditions of the general algorithms can be refined and, moreover, specific minimization algorithms are possible for them.

Let us consider one example of this type, an identification problem. Assume that we have a linear object to whose input we feed vectors x[n] ∈ R^k with certain specified (but unknown) distributions, and at whose output we measure the scalar quantity (c*)^T x[n] in the presence of noise ξ[n]. We are required to recover the vector c* from the measured values of x[n] and y[n] = (c*)^T x[n] + ξ[n]. We introduce the functional

    J(c) = ½ M_{x,ξ}{(y − c^T x)²}.    (3.16)

Then the gradient algorithm for minimizing it has the form

    c[n] = c[n−1] − γ[n] (c[n−1]^T x[n] − y[n]) x[n].    (3.17)

Various modifications of (3.17) are also possible, e.g.,

    c[n] = c[n−1] − (c[n−1]^T x[n] − y[n]) x[n] / ‖x[n]‖²,    (3.18)

    c[n] = c[n−1] − γ[n] (c[n−1]^T x[n] − y[n]) sign x[n],    (3.19)

    c[n] = c[n−1] − γ[n] sign(c[n−1]^T x[n] − y[n]) x[n].    (3.20)
Bibliographical data regarding these algorithms can be found in [1, 2, 4]. We note only that (3.18) is the Kaczmarz algorithm, (3.19) is the Nagumo-Noda algorithm, and (3.20) is the "perceptron" algorithm.

Let us make some assumptions regarding the random readings x[n] and the noise ξ[n].

1. The noise ξ[n] is unbiased and of bounded variance:

    Mξ[n] = 0,  Mξ²[n] ≤ σ².    (3.21)

2. The vectors x[n] are not excessively large, and the covariation matrix is nonsingular:

    M‖x[n]‖^i ≤ a^i  (i = 1, ..., 4),  M{(c^T x)²} ≥ λ‖c‖².    (3.22)

3. x[n] and ξ[n] are independent.    (3.23)

In substantiating algorithm (3.19), we will require a stronger assumption:

4. All the components of x[n] are independent, and the mean value of each is equal to zero:

    p(dx) = p₁(dx₁) ⋯ p_k(dx_k),  Mx_i = 0,  M|x_i| ≥ ε > 0.    (3.24)

Theorem 7. Assume that conditions (3.21)-(3.23) hold, and c[0] is arbitrary. Then:

1. If either conditions (2.6) and (2.8) hold, or Mξ²[n] = 0 and lim sup γ[n] < 2λ²/a⁶, then we have c[n] → c* in algorithm (3.17).

2. If Mξ²[n] = 0, then we have c[n] → c* in algorithm (3.18).

3. If conditions (3.24) hold and either we have (2.6) and (2.8), or Mξ²[n] = 0 and lim sup γ[n] < 2ε/a², then in algorithm (3.19) we have c[n] → c*.

4. If conditions (2.6), (2.8), (3.11), and (3.12) hold, then in algorithm (3.20) we have c[n] → c*.
The proof of the theorem is given in the appendix. It is based on a check of the pseudogradient nature of the algorithms with respect to either functional (3.16) or the functional J(c) = ½‖c − c*‖². We should note that the conditions indicated in the theorem are essential. For example, Kaczmarz' algorithm (3.18) may not converge in the presence of noise (i.e., for Mξ²[n] > 0). Condition (3.24) cannot be discarded for algorithm (3.19). For example, assume that k = 2 and both components x₁[n], x₂[n] have the same sign (in particular, x[n] may be uniformly distributed in the region |x₁| ≤ 1, |x₂| ≤ 1, x₁x₂ ≥ 0, and then (3.22) holds but not (3.24)). Here the vector sign x[n] will always be collinear to the vector (1, 1). Therefore, except for the case in which c₁[0] − c₂[0] = c₁* − c₂*, algorithm (3.19) will not converge. Finally, assumption (3.11), (3.12) regarding the noise ξ[n] is also essential for algorithm (3.20). For example, if ξ[n] = ±1 equiprobably, then as c[n] approaches c* the sign of the difference c[n−1]^T x[n] − y[n] = (c[n−1] − c*)^T x[n] − ξ[n] will be determined only by the sign of ξ[n], and with equal probability the step will be made either in the direction of x[n] or of −x[n], and thus convergence will not occur. Moreover, it is impossible to take γ[n] = γ > 0 in algorithm (3.20) even in the absence of noise (since in this case the step length will not tend to zero).
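A minimal sketch of the identification algorithms (3.17) and (3.18) (ours, assuming NumPy; the data and names are hypothetical); it also illustrates assertion 2 of Theorem 7, in that the Kaczmarz algorithm is used here only with noise-free measurements:

```python
import numpy as np

def identify(xs, ys, variant="gradient", a=0.5):
    """Linear identification by algorithms (3.17) and (3.18).

    (3.17)  c[n] = c[n-1] - gamma[n] * (c[n-1]^T x[n] - y[n]) * x[n]
    (3.18)  c[n] = c[n-1] - (c[n-1]^T x[n] - y[n]) * x[n] / ||x[n]||^2  (Kaczmarz)
    """
    c = np.zeros(xs.shape[1])
    for n, (x, y) in enumerate(zip(xs, ys), start=1):
        r = c @ x - y
        if variant == "gradient":
            c = c - (a / n) * r * x          # gamma[n] = a/n
        else:                                # Kaczmarz
            c = c - r * x / (x @ x)
    return c

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c_star = np.array([1.0, -2.0, 3.0])
    X = rng.normal(size=(20000, 3))
    y_noisy = X @ c_star + 0.1 * rng.normal(size=20000)   # noisy readings
    y_exact = X @ c_star                                   # noise-free readings
    print(identify(X, y_noisy, "gradient"))   # (3.17) tolerates additive noise
    print(identify(X, y_exact, "kaczmarz"))   # (3.18) assumes M xi^2[n] = 0
```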
4. Search Algorithms

Now assume that the value of the functional J(c) (and not of the gradient ∇J(c), as before) is capable of being measured at an arbitrary point c.

Let us consider a general scheme which embraces most search algorithms. At the point c[n−1] we choose, in a stochastic or deterministic fashion, l vectors q¹[n], ..., q^l[n]. The number l may assume any value. For example, in the random-search or coordinate-descent method we have l = 1, in the Kiefer-Wolfowitz method l = k (where k is the dimension of the space), and in methods of experiment planning l ≥ k. Then at the l + 1 points c[n−1], c[n−1] + α[n]q¹[n], ..., c[n−1] + α[n]q^l[n] we compute the values of J(c) with respective errors ξ⁰[n], ξ¹[n], ..., ξ^l[n]. We denote the measured values respectively by

    Q(x⁰[n], c[n−1]) = J(c[n−1]) + ξ⁰[n],
    Q(x¹[n], c[n−1] + α[n]q¹[n]) = J(c[n−1] + α[n]q¹[n]) + ξ¹[n], ...,
    Q(x^l[n], c[n−1] + α[n]q^l[n]) = J(c[n−1] + α[n]q^l[n]) + ξ^l[n].

Then we make a step in accordance with the formula

    c[n] = c[n−1] − (γ[n]/α[n]) Σ_{i=1}^{l} [Q(x^i[n], c[n−1] + α[n]q^i[n]) − Q(x⁰[n], c[n−1])] q^i[n].    (4.1)

The numerical coefficients γ[n], α[n] satisfy the conditions

    α[n] → 0,    (4.2)

    Σ_{n=1}^∞ (γ[n]/α[n])² < ∞    (4.3)

(the second condition means that the lengths of the "working steps" γ[n] must tend to zero more rapidly than the lengths of the "trial steps" α[n]). As regards the vectors q^i[n] we will assume that the following condition holds: for each c ∈ H and all n

    M Σ_{i=1}^{l} (q^i[n]^T c)² ≥ λ‖c‖²,  λ > 0.    (4.4)

This condition has the same meaning as (3.14).

Assume, moreover, that the q^i[n] are not excessively large, and the noise ξ^i[n] is not correlated with q^i[n], unbiased, and of bounded variance:

    M‖q^i[n]‖^j ≤ a^j  (j = 1, ..., 6),  Mξ^i[n] = 0,  Mξ^i[n]q^i[n] = 0,  M(ξ^i[n])² ≤ σ².    (4.5)
TABLE 2

Asymmetrical variant of Kiefer-Wolfowitz algorithm: l = k; q^i[n] = e_i (i = 1, ..., k). [16, 15]
Symmetrical variant of Kiefer-Wolfowitz algorithm: l = 2k; q^i[n] = e_i (i = 1, ..., k), q^i[n] = −e_{i−k} (i = k + 1, ..., 2k). [10, 26, 27]
Random search with unidirectional trial: l = 1; q¹[n] is uniformly distributed on the unit sphere. [17]
Random search with paired trials: l = 2; q¹[n] is uniformly distributed on the unit sphere, q²[n] = −q¹[n]. [9, 18]
Stochastic m-gradient algorithm: l = m, 1 ≤ m ≤ k; the q^i[n] are random orthonormalized vectors. [9]
Random coordinate descent: l = 1; q¹[n] = e_i with probability p_i[n] ≥ ε > 0. [34]
Experiment planning methods: l ≥ k; the q^i[n] are deterministic vectors that do not lie in one subspace. [28]
Theorem 8. Assume that conditions (2.2), (2.3), (4.2), (2.6), (4.4), and (4.5) hold, and c[0] is arbitrary. Assume that either (2.8) and (4.3) hold, or σ² = 0 and lim sup γ[n] < γ̄, where γ̄ is sufficiently small. Then in algorithm (4.1) we will have

    lim_{n→∞} ‖∇J(c[n])‖ = 0.

The proof of the theorem (see the appendix) implies that method (4.1) is not in general a pseudogradient method; the pseudogradient condition can be violated in a few steps (at points at which α[n] is not small as compared to ‖∇J(c[n−1])‖). "Basically," however, the pseudogradient property holds, and this makes it possible to use Theorem 1.

Table 2 gives examples of algorithms to which Theorem 8 is applicable. In the table, k denotes the dimension of the space, and e₁, ..., e_k are unit vectors. We should emphasize that in a number of algorithms the vectors q^i[n] are deterministic. Table 2 also indicates the paper in which the particular algorithm was proposed or substantiated. In most of these (except for [15, 27]) the proof of convergence is given for the case in which the minimum point c* exists and is unique. In some cases, condition (4.2) on the trial steps is replaced by the more rigid condition Σ α[n]γ[n] < ∞, and it is possible to prove a stronger assertion regarding convergence (e.g., convergence with respect to the functional). In [11] the idea was first expressed of substantiating random-search methods by means of general theorems on convergence. Table 2 indicates a large group of methods known as "methods of experiment planning" [28]. Despite the common nature of the problem, the methods in this group were developed independently of stochastic-approximation search algorithms, and, as far as we know, no systematic investigation was made of their convergence.

Like the gradient algorithm, algorithms of type (4.1) may be subjected to various transformations. For example, the vector on the right side of (4.1) may be multiplied by a positive definite matrix [27]. Instead of the difference Q(x^i[n], c[n−1] + α[n]q^i[n]) − Q(x⁰[n], c[n−1]), we may allow only for its sign in (4.1). Such a method (for l = 1) was proposed in [8] and substantiated in [17]. To prove convergence of this "sign" algorithm, it is necessary to make special assumptions regarding the noise of the same type as (3.11) and (3.12). Moreover, we may change c[n−1] only when the measured value of the functional decreases ("random search with return when the step is unsuccessful" [9]), where the trial and working steps may be combined. The proof of convergence of a similar algorithm is given in [19]. Finally, from among the vectors q¹[n], ..., q^l[n] we may choose only the one for which the quantity Q(x^i[n], c[n−1] + α[n]q^i[n]) − Q(x⁰[n], c[n−1]) is maximum ("random search with optimum trial"). We will not discuss the substantiation of all these algorithms, since the technique of proof remains the same as above.
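A minimal sketch of random search with paired trials, a special case of scheme (4.1) with l = 1 (ours, assuming NumPy; the step sequences are one admissible choice satisfying (4.2), (4.3), and (2.6)):

```python
import numpy as np

def paired_trial_search(J, c0, n_steps=20000, g0=1.0, a0=1.0, seed=0):
    """Random search with paired trials:
        c[n] = c[n-1] - (gamma[n] / (2*alpha[n])) * q[n]
               * [ J(c[n-1] + alpha[n] q[n]) - J(c[n-1] - alpha[n] q[n]) ],
    with gamma[n] = g0/n and alpha[n] = a0/n**0.25, so that alpha[n] -> 0,
    sum gamma[n] = inf, and sum (gamma[n]/alpha[n])**2 < inf."""
    rng = np.random.default_rng(seed)
    c = np.asarray(c0, dtype=float)
    for n in range(1, n_steps + 1):
        gamma, alpha = g0 / n, a0 / n ** 0.25
        q = rng.normal(size=c.shape)
        q /= np.linalg.norm(q)                      # uniform on the unit sphere
        c = c - (gamma / (2 * alpha)) * q * (J(c + alpha * q) - J(c - alpha * q))
    return c

# Example: only (noisy) values of the functional are measurable.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    c_star = np.array([1.0, 2.0, -1.0])
    J_noisy = lambda c: 0.5 * np.sum((c - c_star) ** 2) + 0.01 * rng.normal()
    print(paired_trial_search(J_noisy, np.zeros(3)))   # approximately [1, 2, -1]
```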
5. Algorithms for Minimizing Nonsmooth Functionals

In all the examples that we have considered heretofore, the initial problem involves the minimization of a functional satisfying conditions (2.2), (2.3). If the functional to be minimized is nonsmooth (or not bounded from below), then to apply Theorem 1 we must construct a smooth functional J(c) artificially. There is no need for the values of J(c) or ∇J(c) to be accessible to measurement; it is sufficient that it be possible to check the pseudogradient condition.
Let us first consider the case in which we are required to minimize a convex continuous (but nondifferentiable) functional J̃(c) in the space H. At each point, this functional has a generalized gradient (support functional), which we will denote by ∇̃J̃(c). This is a vector from H which is determined (generally speaking, not uniquely) from the condition

    J̃(c + ā) ≥ J̃(c) + ∇̃J̃(c)^T ā    (5.1)

for all c, ā ∈ H. At the points of differentiability, ∇̃J̃(c) = ∇J̃(c). Let us assume that at the point c[n−1] the generalized gradient ∇̃J̃(c[n−1]) is capable of being measured with error ζ[n], as a result of which we obtain the random vector ∇̃_c Q(x[n], c[n−1]) = ∇̃J̃(c[n−1]) + ζ[n]. We will assume that

    Mζ[n] = 0,  M‖ζ[n]‖² ≤ σ²(1 + ‖c‖²).    (5.2)

The algorithm we are investigating is a direct analog of the gradient algorithm (the method of generalized stochastic gradients)

    c[n] = c[n−1] − γ[n] ∇̃_c Q(x[n], c[n−1]).    (5.3)

If there is no noise, then, unlike the smooth case, algorithm (5.3) will not converge for γ[n] = γ > 0. However, we can propose a somewhat different method for choosing the step length (which recalls Kaczmarz' algorithm (3.18)), which converges fairly rapidly in the absence of noise:

    c[n] = c[n−1] − (J̃(c[n−1]) − J̃*) ∇̃J̃(c[n−1]) / ‖∇̃J̃(c[n−1])‖².    (5.4)

Here J̃* = inf J̃(c) is assumed to be known, and the quantities J̃(c[n−1]), ∇̃J̃(c[n−1]) are capable of being measured. We will assume that ∇̃J̃(c) does not increase very fast:

    ‖∇̃J̃(c)‖² ≤ K(1 + ‖c‖²),    (5.5′)

    sup ‖∇̃J̃(c)‖ < ∞  for  ‖c‖ ≤ R < ∞.    (5.5″)

Theorem 9. Assume that J̃(c) is a convex continuous functional, and that the value of c[0] is arbitrary.

1. If conditions (2.6), (2.8), and (5.5′) hold, then in algorithm (5.3) we have lim J̃(c[n]) = J̃* almost certainly (in particular, for J̃* = −∞). If, moreover, the set C* of minimum points is not empty, then lim J̃(c[n]) = J̃* and c[n] → c̄ ∈ C* almost certainly.

2. If (5.5″) holds and J̃* > −∞, then in algorithm (5.4) we have lim J̃(c[n]) = J̃*. If C* is not empty, then lim J̃(c[n]) = J̃* and c[n] → c̄ ∈ C*.

The proof of the theorem (see the appendix) is based on the introduction of the functional J(c) = ½‖c − c*‖², where c* is any minimum point, and on a check of the pseudogradient condition for algorithms (5.3) and (5.4) with respect to this functional. The result of Theorem 9, with slight variations, has already been given (see [11-13] as regards algorithm (5.3), and [29] as regards (5.4)). Here we only show how easily it can be obtained from the general theorem (Theorem 1).
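A minimal sketch of the method of generalized stochastic gradients (5.3) (ours, assuming NumPy; the l1 criterion and the Cauchy noise are illustrative assumptions):

```python
import numpy as np

def generalized_gradient_method(subgrad_Q, sample_x, c0, a=1.0, n_steps=20000, seed=0):
    """Method of generalized stochastic gradients (5.3):
        c[n] = c[n-1] - gamma[n] * tilde_grad_c Q(x[n], c[n-1]),
    where tilde_grad is any generalized gradient (subgradient) realization of
    the convex, possibly nondifferentiable functional, and gamma[n] = a/n."""
    rng = np.random.default_rng(seed)
    c = np.asarray(c0, dtype=float)
    for n in range(1, n_steps + 1):
        c = c - (a / n) * subgrad_Q(sample_x(rng), c)
    return c

# Example: the nonsmooth criterion J(c) = M_x ||c - x||_1; a subgradient
# realization is sign(c - x[n]), and the minimum is the componentwise median.
if __name__ == "__main__":
    med = np.array([1.0, -1.0, 0.0])
    c_hat = generalized_gradient_method(
        subgrad_Q=lambda x, c: np.sign(c - x),
        sample_x=lambda rng: med + rng.standard_cauchy(size=3),
        c0=np.zeros(3), a=1.0)
    print(c_hat)   # close to the median [1, -1, 0]
```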
Let us now consider the situation in which the functional to be minimized, J̃(c), is nonconvex and discontinuous. Here we could introduce a certain functional J(c) obtained by smoothing J̃(c) and apply some regular or search algorithm. But each computation of J(c) or ∇J(c) would require repeated calculation of J̃(c). We could proceed otherwise and apply some random-search algorithm directly to J̃(c). Then each step would require only two computations of J̃(c), and the method would be pseudogradient with respect to the smoothed functional J(c) (since the random steps in the algorithm realize a process of smoothing J̃(c)).

Thus, let us consider the search algorithm

    c[n] = c[n−1] − γ[n] q[n] (J̃(c[n−1] + q[n]) − J̃(c[n−1])).    (5.6)

Here q[n] is a random vector, and it is assumed that the values of J̃(c[n−1]) are measured without error. We should emphasize that, as compared to the versions of the random-search method that we considered earlier, here there is no parameter α[n] that tends to zero: the length of the trial steps in algorithm (5.6) does not decrease. On the contrary, the value of γ[n] must tend to zero, despite the absence of noise. The distribution of q[n] will be assumed to be stationary, with density p(q) that is radial, i.e., depends only on the length of q:

    p(dq[n]) = p(q) dq,  p(q) = p(q′)  for  ‖q‖ = ‖q′‖    (5.7)

(for example, q[n] may be uniformly distributed on a unit ball or have a normal distribution with zero mean value and variance σ²I).

We construct a new density p̄(q) from the condition ∇p̄(q) = −q p(q). It is clear that since p(q) depends only on ‖q‖, this equation has a solution, and it is not difficult to determine it explicitly. We introduce the smoothed functional

    J(c) = ∫_H J̃(c + q) p̄(q) dq.    (5.8)

If

    ∫ ‖q‖² p(q) dq < ∞,  ∫ ‖q p(q) − (q + ā) p(q + ā)‖ dq ≤ K‖ā‖,  |J̃(c)| ≤ R    (5.9)

for all c, then, as is easy to show (see the appendix), this functional satisfies conditions (2.2), (2.3).

Theorem 10. Under conditions (2.6), (2.8), (5.7), and (5.9), and for any c[0], in algorithm (5.6) we have lim ‖∇J(c[n])‖ = 0. If the only stationary point of J(c) is a minimum point, then algorithm (5.6) almost certainly converges to it.

Thus algorithm (5.6) minimizes not the initial functional J̃(c) but the "smoothed" functional J(c). This approach to the problem is natural, since it is meaningless to seek the minimum of a discontinuous functional J̃(c): the value of J̃(c) on any ensemble of points conveys no information regarding its values at other points. We could similarly investigate the matter of the smoothed functionals to which other minimization search algorithms correspond.
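A minimal sketch of the search algorithm (5.6) for a discontinuous criterion (ours, assuming NumPy; the quantized quadratic criterion and all parameters are chosen ad hoc for illustration):

```python
import numpy as np

def smoothing_search(J, c0, gamma0=0.5, sigma=0.3, n_steps=50000, seed=0):
    """Search algorithm (5.6) for a nonsmooth (even discontinuous) criterion:
        c[n] = c[n-1] - gamma[n] * q[n] * ( J(c[n-1] + q[n]) - J(c[n-1]) ),
    with q[n] drawn from a radially symmetric density (here N(0, sigma^2 I))
    and a fixed trial-step length; the method is pseudogradient with respect
    to the smoothed functional, not to J itself."""
    rng = np.random.default_rng(seed)
    c = np.asarray(c0, dtype=float)
    for n in range(1, n_steps + 1):
        q = sigma * rng.normal(size=c.shape)
        c = c - (gamma0 / n) * q * (J(c + q) - J(c))
    return c

# Example: a discontinuous (staircase) criterion.
if __name__ == "__main__":
    c_star = np.array([1.0, -1.0])
    J_disc = lambda c: np.floor(5.0 * np.sum((c - c_star) ** 2))
    print(smoothing_search(J_disc, c0=np.array([2.0, 0.0])))   # near [1, -1]
```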
In conclusion, let us consider certain iterative algorithms for recognition training. It is known (see, e.g., [1, 4, 20, 30-32]) that many pattern-recognition problems can be reduced to the problem of solving a system of homogeneous linear inequalities in Hilbert space:

    c^T x ≤ 0  for all  x ∈ A.    (5.10)

Here A is a finite or infinite set whose elements appear randomly in succession in accordance with a certain specified probability distribution P(dx) on A. The problem has much in common with the identification problem considered in §3 (which reduces to the solution of a system of linear equations that appear at random), and similar algorithms can be used for its solution. We denote by x[n] a point from A which appears at the n-th step.
Let us consider the following iterative algorithms:

    c[n] = c[n−1] − γ[n] ((1 + sign c[n−1]^T x[n])/2) (c[n−1]^T x[n]) x[n],    (5.11)

    c[n] = c[n−1] − ((1 + sign c[n−1]^T x[n])/2) (c[n−1]^T x[n]) x[n] / ‖x[n]‖²,    (5.12)

    c[n] = c[n−1] − ...,    (5.13)

    c[n] = c[n−1] − γ[n] ((1 + sign c[n−1]^T x[n])/2) x[n].    (5.14)
The connections between algorithms (5.11) and (3.17), (5.12) and (3.18), and (5.14) and (3.20) are clear. Bibliographical data regarding algorithms (5.11)-(5.14) can be found in the works cited above. It is also shown there that the attempt to consider these algorithms as gradient algorithms generally leads to nonsmooth functionals.

We will assume that inequalities (5.10) have a solution:

    C* = {c ∈ H : c^T x ≤ 0 for all x ∈ A} ≠ ∅    (5.15)

(in terms of the initial problem, this means that error-free recognition is possible).

Sometimes we will require a stronger assumption regarding the existence of a point c* ∈ C* such that

    (c*)^T x ≤ −ε < 0  for all  x ∈ A    (5.16)

(the "representability hypothesis").

We will assume that the statistics of the vectors x[n] are sufficiently representative; in particular, if c does not satisfy (5.10), then there exists a nonzero probability that this fact will be revealed by means of the vectors in question:

    P{c^T x > 0} > 0  for  c ∉ C*.    (5.17)

Finally, we assume that the set A is bounded:

    ‖x‖ ≤ d  for  x ∈ A.    (5.18)
Theorem 11. Assume that conditions (5.15), (5.17), (5.18) hold, and that c[0] is arbitrary. Then:

1. If Σγ[n] = ∞, γ[n] > 0, lim sup γ[n] < 2/d² (e.g., γ[n] = γ, 0 < γ < 2/d²), then in algorithm (5.11) we have c[n] → c̄ ∈ C*.

2. In algorithm (5.12) we have c[n] → c̄ ∈ C*.

3. If (5.16) holds, then algorithm (5.13) is almost certainly finite (i.e., c[n] = c̄ ∈ C* from some n on).

4. If (2.6) and (2.8) hold, then in algorithm (5.14) we have c[n] → c̄ ∈ C*. If (5.16) holds, then algorithm (5.14) is almost certainly finite for any γ[n] = γ > 0.

The proof of the theorem, which is given in the appendix, is based on a check of conditions (2.4) and (2.5) as applied to the functional J(c) = ½‖c − c*‖². What is new as compared to the preceding theorems in this section is the assertion that certain algorithms are finite; this assertion is also implied by Theorem 1. The first result regarding the fact that the perceptron algorithm (5.14) with γ[n] = const is finite is that of Novikoff [30]. It was subsequently strengthened in [20]. Proofs of the convergence of algorithms of the same types as (5.11)-(5.13) (and certain more general ones) were obtained in particular in [31, 32]. Theorem 11 offers an approach to uniform derivation of such assertions (some of which, e.g., those regarding the convergence of algorithms in the absence of the representability hypothesis (5.16), are perhaps new). We should also note that algorithms (5.11)-(5.14) can readily be generalized to the case in which (5.10) is replaced by a system of more general inequalities (e.g., convex nonlinear and inhomogeneous ones).
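A minimal sketch of the "perceptron" algorithm (5.14) for the homogeneous inequalities (5.10) (ours, assuming NumPy; the data generation and the cycling over a finite set of readings are illustrative assumptions):

```python
import numpy as np

def perceptron_type(xs, c0, gamma=1.0, max_epochs=200):
    """'Perceptron' algorithm (5.14) for the inequalities c^T x <= 0:
        c[n] = c[n-1] - gamma * (1 + sign(c[n-1]^T x[n])) / 2 * x[n],
    i.e. the correction -gamma*x[n] is made only when the current reading
    violates its inequality.  Under the representability hypothesis (5.16)
    the number of corrections is finite even for a constant step gamma > 0."""
    c = np.asarray(c0, dtype=float)
    for _ in range(max_epochs):                 # cycle through the readings
        corrections = 0
        for x in xs:
            if c @ x > 0:                       # inequality violated
                c = c - gamma * x
                corrections += 1
        if corrections == 0:                    # finite termination
            break
    return c

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c_sol = np.array([1.0, 1.0, -1.0])
    A = rng.normal(size=(2000, 3))
    A = A[A @ c_sol <= -0.2]                    # keep readings with a margin
    c_hat = perceptron_type(A, c0=rng.normal(size=3))
    print(bool(np.all(A @ c_hat <= 0)))         # True: all inequalities satisfied
```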
APPENDIX

Proof of Theorem 1. From (2.3) we obtain for any c, ā ∈ H

    |J(c + ā) − J(c) − ∇J(c)^T ā| ≤ (L/2)‖ā‖².    (A.1)

If in (A.1) we substitute c[n−1] for c and c[n] from (2.1) for c + ā, we obtain

    J(c[n]) ≤ J(c[n−1]) − γ[n]∇J(c[n−1])^T s[n] + (L/2)γ²[n]‖s[n]‖².    (A.2)
We denote by F the minimum σ-algebra generated by the random quantities s[1], ..., s[n−1], and by ρ[n] the quantity J(c[n]) − J*. We take the conditional mathematical expectation of both sides of inequality (A.2) and use (2.5):

    M{ρ[n]|F} ≤ ρ[n−1] − γ[n]∇J(c[n−1])^T M{s[n]|F} + (L/2)γ²[n]M{‖s[n]‖²|F}
    ≤ ρ[n−1](1 + (K₁L/2)γ²[n]) − γ[n](1 − (K₂L/2)γ[n])∇J(c[n−1])^T M{s[n]|F} + (L/2)γ²[n]λ[n] + (K₁L/2)γ²[n]J*.    (A.3)

In view of (2.8) or (2.9) we have 1 − (K₂L/2)γ[n] > 0 for sufficiently large n, and therefore, in accordance with (2.4), the second term on the right side of (A.3) is nonpositive, i.e., (A.3) assumes the form

    M{ρ[n]|F} ≤ ρ[n−1](1 + (K₁L/2)γ²[n]) + (L/2)γ²[n]λ[n] + (K₁L/2)γ²[n]J*.    (A.4)

In the case of (2.9) we have K₁ = 0, λ[n] = 0, so that M{ρ[n]|F} ≤ ρ[n−1], i.e., ρ[n] is a semimartingale. For the case of (2.8), we can use the semimartingale theorem in exactly the same way as is done in [33]. In all cases we obtain that lim ρ[n] almost certainly exists, and the unconditional expectations Mρ[n] are uniformly bounded: sup_n Mρ[n] ≤ C.
We should note that this result is valid even if (2.8) is replaced by the weaker condition γ[n] → 0.

In inequality (A.3) we now change to the unconditional expectations:

    Mρ[n] ≤ (1 + (K₁L/2)γ²[n]) Mρ[n−1] − γ[n](1 − (K₂L/2)γ[n]) M{∇J(c[n−1])^T M{s[n]|F}} + (L/2)γ²[n]λ[n] + (K₁L/2)γ²[n]J*.    (A.5)

We sum (A.5) over n from 1 to ∞. In view of (A.4) and (2.7) we have, both for condition (2.8) and for (2.9),

    Σ (K₁L/2)γ²[n] Mρ[n−1] < ∞,  Σ (L/2)γ²[n]λ[n] < ∞,  Σ (K₁L/2)γ²[n] < ∞,

and therefore

    Σ γ[n](1 − (K₂L/2)γ[n]) M{∇J(c[n−1])^T M{s[n]|F}} < ∞.

For sufficiently large n we will have 1 − (K₂L/2)γ[n] ≥ θ > 0, for both (2.8) and (2.9), and therefore

    Σ γ[n] M{∇J(c[n−1])^T M{s[n]|F}} < ∞.    (A.6)
But Σγ[n] = ∞, while ∇J(c[n−1])^T M{s[n]|F} ≥ 0, and therefore it follows from (A.6) that there exists a subsequence n_i for which

    lim M{∇J(c[n_i − 1])^T M{s[n_i]|F}} = 0.    (A.7)

Equation (A.7) implies that there exists a subsequence n_i′ such that ∇J(c[n_i′ − 1])^T M{s[n_i′]|F} → 0 almost certainly. Allowing for (2.4), this yields that lim ∇J(c[n−1])^T M{s[n]|F} = 0, and the theorem is thus proved.

We should note that if in (2.9) the condition lim sup γ[n] < 2/(LK₂) is replaced by the more rigid condition 0 < ε₁ ≤ γ[n] ≤ 2(1 − ε₂)/(LK₂), ε₂ > 0, and we also require that the inequality

    ∇J(c[n−1])^T Ms[n] ≥ K‖∇J(c[n−1])‖²,  K > 0,

hold (a particular case of (2.12)), then the above calculations would also imply the convergence J(c[n]) → J* at the rate of a geometrical progression, and c[n] would converge to the minimum point c* at the same rate. We will not discuss this matter in detail, since the topic of rate of convergence is not considered in this paper.
Proof of Theorem 5. For algorithm (3.10) we have s[n] = sign ∇_c Q(x[n], c[n−1]), so that

    Ms[n]_i = P{∇_c Q(x[n], c[n−1])_i > 0} − P{∇_c Q(x[n], c[n−1])_i < 0}.

If ∇J(c[n−1])_i > 0, we obtain from this, using (3.11) and (3.12),

    Ms[n]_i = P{ζ[n]_i > −∇J(c[n−1])_i} − P{ζ[n]_i < −∇J(c[n−1])_i} ≥ P{−∇J(c[n−1])_i ≤ ζ[n]_i ≤ 0} ≥ δ(∇J(c[n−1])_i).

Similarly, for ∇J(c[n−1])_i < 0 we have Ms[n]_i ≤ −δ(−∇J(c[n−1])_i). Therefore we always have ∇J(c[n−1])_i Ms[n]_i ≥ |∇J(c[n−1])_i| δ(|∇J(c[n−1])_i|), so that

    ∇J(c[n−1])^T Ms[n] ≥ (‖∇J(c[n−1])‖/√k) δ(‖∇J(c[n−1])‖/√k).

Thus expressions (2.4) and (2.12) hold. Since ‖s[n]‖ ≤ √k, expression (2.5) also holds, and therefore Corollary 2 is applicable.
Proof of Theorem 7. It can be checked directly that for functional (3.16)

    J(c) ≥ 0,  ∇J(c) = M_x{((c − c*)^T x) x},
    ‖∇J(c + ā) − ∇J(c)‖ = ‖M_x{(ā^T x) x}‖ ≤ ‖ā‖ M‖x‖² ≤ a²‖ā‖,

so that conditions (2.2) and (2.3) hold with L = a². Furthermore, from (3.22) we have

    ‖∇J(c)‖ ‖c − c*‖ ≥ ∇J(c)^T (c − c*) = M_x{[(c − c*)^T x]²} ≥ λ‖c − c*‖²,
    ‖∇J(c)‖ ≥ λ‖c − c*‖,  J(c) − J(c*) = ½ M_x{[(c − c*)^T x]²} ≥ (λ/2)‖c − c*‖²,

i.e., J(c) satisfies conditions (2.15) and (2.16). For algorithm (3.17) we check conditions (2.4) and (2.5):

    Ms[n] = ∇J(c[n−1]),  ∇J(c[n−1])^T Ms[n] = ‖∇J(c[n−1])‖²,
    M‖s[n]‖² = M_{x,ξ}{(c[n−1]^T x − y)² ‖x‖²} = M_x{[(c[n−1] − c*)^T x]² ‖x‖²} + Mξ²[n] M‖x‖²
    ≤ ‖c[n−1] − c*‖² M‖x‖⁴ + σ²a² ≤ σ²a² + (a⁴/λ²)‖∇J(c[n−1])‖².

Application of Corollary 3 yields the first assertion of the theorem.
In algorithm (3.18) we will assume that γ[n] = 1, while as J(c) we take ½‖c − c*‖². Then

    ∇J(c[n−1])^T Ms[n] = M_x{[(c[n−1] − c*)^T x]² / ‖x‖²},
    M‖s[n]‖² = M_x{[(c[n−1] − c*)^T x]² / ‖x‖²},

i.e., expressions (2.4) and (2.5) hold with λ[n] = 0, K₁ = 0, K₂ = 1. Since γ[n] = 1 < 2/(LK₂) = 2, expression (2.9) holds, and thus Theorem 1 and Corollary 3 are applicable; this yields the second assertion of the theorem.
For algorithm (3.19) with J(c) = ½‖c − c*‖² we obtain, allowing for (3.24),

    Ms[n]_i = M_x{[Σ_j (c_j[n−1] − c_j*) x_j] sign x_i} = (c_i[n−1] − c_i*) M|x_i|,
    ∇J(c[n−1])^T Ms[n] = Σ_i (c_i[n−1] − c_i*)² M|x_i| ≥ ε‖c[n−1] − c*‖²,
    M‖s[n]‖² ≤ M_x{[(c[n−1] − c*)^T x]²} + Mξ²[n] ≤ a²‖c[n−1] − c*‖² + σ².

In other words, expressions (2.4) and (2.5) hold with λ[n] = σ², K₁ = 0, K₂ = a²/ε, and this yields the third assertion of the theorem.
Finally, for (3.20) with J(c) = ½‖c − c*‖² we obtain, using (3.11) and (3.12),

    Ms[n] = M_{x,ξ}{x sign(c[n−1]^T x − y)} = M_x{x [P(ξ[n] < (c[n−1] − c*)^T x) − P(ξ[n] > (c[n−1] − c*)^T x)]},
    ∇J(c[n−1])^T Ms[n] ≥ M_x{|(c[n−1] − c*)^T x| δ(|(c[n−1] − c*)^T x|)},  M‖s[n]‖² ≤ M‖x‖² ≤ a².

Thus Theorem 1 is applicable, and we obtain that lim M_x{|(c[n−1] − c*)^T x| δ(|(c[n−1] − c*)^T x|)} = 0, while the set {c[n]} is almost certainly bounded. Therefore there almost certainly exist a point c̄ and a subsequence n_i such that c[n_i] → c̄ and

    lim M_x{|(c[n_i − 1] − c*)^T x| δ(|(c[n_i − 1] − c*)^T x|)} = 0.

Consider the function φ(c) = M_x{|(c − c*)^T x| δ(|(c − c*)^T x|)}. Since δ(ε) in (3.12) can always be chosen to be continuous, φ(c) is also a continuous function of c. Therefore φ(c̄) = 0. But since δ(ε) > 0 for ε > 0, the condition φ(c̄) = 0 and (3.22) imply that c̄ = c*. Since J(c[n]) = ½‖c[n] − c*‖² converges, and for the sequence n_i we have J(c[n_i]) → 0, it follows that c[n] → c* for the entire sequence c[n], and this completes the proof.
Proof of Theorem 8. Using (4.4), (4.5), and (A.1), we obtain

    ∇J(c[n−1])^T Ms[n] = (1/α[n]) M{Σ_{i=1}^{l} (∇J(c[n−1])^T q^i[n]) [J(c[n−1] + α[n]q^i[n]) − J(c[n−1])]}
    = M{Σ_{i=1}^{l} (∇J(c[n−1])^T q^i[n])²} + (1/α[n]) M{Σ_{i=1}^{l} (∇J(c[n−1])^T q^i[n]) [J(c[n−1] + α[n]q^i[n]) − J(c[n−1]) − α[n]∇J(c[n−1])^T q^i[n]]}
    ≥ λ‖∇J(c[n−1])‖² − (La³l/2) α[n] ‖∇J(c[n−1])‖ = ‖∇J(c[n−1])‖ (λ‖∇J(c[n−1])‖ − (La³l/2) α[n]).

As can be seen from the resultant bound, algorithm (4.1) is not in general a pseudogradient algorithm. But if α[n] is small as compared to ‖∇J(c[n−1])‖, the pseudogradient condition holds; this fact will be subsequently utilized.

Let us now estimate M‖s[n]‖²:

    M‖s[n]‖² ≤ (2l/α²[n]) M{Σ_{i=1}^{l} [(J(c[n−1] + α[n]q^i[n]) − J(c[n−1]))² + (ξ^i[n] − ξ⁰[n])²] ‖q^i[n]‖²}
    ≤ 4a⁴l²‖∇J(c[n−1])‖² + L²a⁶l²α²[n] + 8σ²a²l²/α²[n].

We consider those realizations of the stochastic process (4.1) for which ‖∇J(c[n])‖ ≥ ε > 0 for all n. Then the above estimates imply, for α[n] ≤ λε/(La³l) (which holds for sufficiently large n in view of (4.2)),

    ∇J(c[n−1])^T Ms[n] ≥ ‖∇J(c[n−1])‖ (λ‖∇J(c[n−1])‖ − λε/2) ≥ (λ/2)‖∇J(c[n−1])‖².

Therefore for α[n] ≤ λ‖∇J(c[n−1])‖/(La³l) we have

    M‖s[n]‖² ≤ 8σ²a²l²/α²[n] + (4a⁴l² + λ²)‖∇J(c[n−1])‖².

Thus expressions (2.4) and (2.5) hold for such realizations with λ[n] = 8σ²a²l²/α²[n], K₁ = 0, K₂ = 2(4a⁴l² + λ²)/λ. If (4.3) holds, condition (2.7) is also valid. If, however, σ = 0 and γ[n] is sufficiently small, expression (2.9) holds. Consequently, in both cases, for ‖∇J(c[n])‖ ≥ ε, Theorem 1 is applicable, and therefore lim ∇J(c[n−1])^T Ms[n] = 0; and since ‖∇J(c[n−1])‖² ≤ (2/λ)∇J(c[n−1])^T Ms[n], the probability of the realizations for which ‖∇J(c[n])‖ ≥ ε > 0 for all n is equal to zero. This is equivalent to the assertion of the theorem.
Proof of Theorem 9. We take an arbitrary J̄ > J* and a point c̄ for which J(c̄) < J̄, and introduce the functional I(c) = ½||c − c̄||². Of course, I(c) satisfies (2.2) and (2.3) with L = 1, and ∇I(c) = c − c̄. Furthermore, for algorithm (5.3) we have, in view of (5.1), (5.2), and (5.5′),

∇I(c[n−1])ᵀ M s[n] = (c[n−1] − c̄)ᵀ ∇J(c[n−1]) ≥ J(c[n−1]) − J(c̄),

M||s[n]||² = ||∇J(c[n−1])||² + M||ξ[n]||² ≤ (K + σ²)(1 + ||c[n−1]||²) ≤ 2(K + σ²)(1 + ||c̄||² + 2 I(c[n−1])).
Let us consider those realizations of algorithm (5.3) for which J(c[n]) ≥ J̄ for all n. Then the above inequalities imply for them that conditions (2.4) and (2.5) hold and Theorem 1 is applicable. Therefore lim ∇I(c[n−1])ᵀ M s[n] = 0, but ∇I(c[n−1])ᵀ M s[n] ≥ J̄ − J(c̄) > 0. This implies that the probability of such realizations (with J(c[n]) ≥ J̄) is zero. In view of the fact that J̄ > J* is arbitrary, this means that lim J(c[n]) = J*.
Now assume C* is nonempty. We introduce I(c) = ½||c − c*||², c* ∈ C*. As above,

∇I(c[n−1])ᵀ M s[n] ≥ J(c[n−1]) − J* ≥ 0,

M||s[n]||² ≤ 2(K + σ²)(1 + ||c*||² + 2 I(c[n−1])).
Using Theorem 1, we obtain that lim J(c[n]) = J*, while lim I(c[n]) almost certainly exists, i.e., the sequence {c[n]} is almost certainly bounded. Therefore we can almost certainly choose a subsequence n_i and a point ĉ ∈ H such that c[n_i] → ĉ weakly and J(c[n_i]) → J* (in view of the fact that bounded sets in H are weakly compact). But the functional J(c) is convex and continuous, and therefore it is weakly lower semicontinuous. Consequently, J(ĉ) = J*, i.e., ĉ ∈ C*. We replace c* by ĉ in the definition of I(c). Then I(c[n_i]) → 0, but as before lim I(c[n]) almost certainly exists. Thus I(c[n]) → 0, i.e., c[n] → ĉ.
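The two properties of algorithm (5.3) used in the argument above are M s[n] = ∇J(c[n−1]) and M||s[n]||² ≤ (K + σ²)(1 + ||c[n−1]||²). A minimal sketch of a stochastic-gradient iteration possessing exactly these two properties (the concrete statement of (5.3) is in the body of the paper and is assumed, not quoted, here):

```python
import numpy as np

# Minimal sketch of an iteration with M s[n] = grad J(c[n-1]) and
# M||s[n]||^2 <= (K + sigma^2)(1 + ||c[n-1]||^2); the concrete algorithm (5.3)
# of the paper is assumed here, not reproduced.
rng = np.random.default_rng(2)
dim, sigma = 4, 0.5
A = np.diag([1.0, 2.0, 3.0, 4.0])
b = np.array([1.0, -1.0, 2.0, 0.5])
J = lambda c: 0.5 * c @ A @ c - b @ c             # convex functional
grad_J = lambda c: A @ c - b

c = np.zeros(dim)
for n in range(1, 50001):
    gamma = 1.0 / n                               # gamma[n] as in the convergence conditions
    s = grad_J(c) + sigma * rng.normal(size=dim)  # unbiased noisy gradient
    c -= gamma * s

c_star = np.linalg.solve(A, b)
print("J(c[n]) - J* =", J(c) - J(c_star))
```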
Let us now prove convergence of algorithm (5.4), limiting ourselves to the case of nonempty C*. Taking γ[n] = 1, I(c) = ½||c − c*||², c* ∈ C*, we obtain
∇I(c[n−1])ᵀ M s[n] ≥ (J(c[n−1]) − J*)² / ||∇J(c[n−1])||²,   M||s[n]||² = (J(c[n−1]) − J*)² / ||∇J(c[n−1])||²,

i.e., conditions (2.2)-(2.5) hold with L = 1, λ[n] = 0, K₁ = 0, K₂ = 1. Therefore expression (2.9) also holds, and Theorem 1 implies that

lim (J(c[n−1]) − J*)² / ||∇J(c[n−1])||² = 0.
Moreover, I(c[n]) tends to a limit, i.e., all the {c[n]} are bounded. Therefore expression (5.5″) implies that the ||∇J(c[n−1])|| are bounded, and hence the assertion we have just proved yields lim J(c[n]) = J*. The concluding part of the proof is the same as for algorithm (5.3).
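The bounds above are exactly those of a Fejér-type relaxation step. Here is a minimal noiseless sketch, under the assumption that (5.4) takes steps of the form s[n] = (J(c[n−1]) − J*) ∇J(c[n−1]) / ||∇J(c[n−1])||² with γ[n] = 1 (the exact statement of (5.4) is given earlier in the paper; the test functional is an assumption of the sketch).

```python
import numpy as np

# Noiseless sketch of the assumed relaxation step behind the bounds above:
# s[n] = (J(c) - J*) grad J(c) / ||grad J(c)||^2, taken with gamma[n] = 1.
J = lambda c: float(np.sum(np.abs(c) ** 1.5))        # convex, J* = 0 attained at c = 0
grad_J = lambda c: 1.5 * np.sign(c) * np.sqrt(np.abs(c))

J_star = 0.0
c = np.array([3.0, -2.0, 5.0])
for n in range(200):
    g = grad_J(c)
    if not np.any(g):                                # gradient vanishes only at the minimizer
        break
    c = c - (J(c) - J_star) * g / (g @ g)            # ||c[n] - c*|| is nonincreasing for convex J
print("J(c[n]) after the iterations:", J(c))
```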
Proof of Theorem 10. By writing J̄(c) in the form J̄(c) = ∫ J(z) p(z − c) dz, we can use the rule of differentiation with respect to a parameter. As a result we obtain

∇J̄(c) = −∫ J(z) ∇p(z − c) dz = ∫ J(c + q) q p(q) dq.

Allowing for this expression, as well as for (5.7) and (5.9), we find that J̄(c) satisfies condition (2.3). The validity of (2.2) follows immediately from (5.9). Furthermore, because of (5.7) we have ∫ q p(q) dq = 0, so that

M s[n] = ∫ M( Ĵ(c[n−1] + q) − Ĵ(c[n−1]) ) q p(q) dq = ∇J̄(c[n−1]),

M||s[n]||² = ∫ M( Ĵ(c[n−1] + q) − Ĵ(c[n−1]) )² ||q||² p(q) dq ≤ 4R²σ²,

i.e., conditions (2.4) and (2.5) hold and therefore Theorem 1 is applicable.
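The identity M s[n] = ∇J̄(c[n−1]) can be verified numerically. The sketch below assumes a standard Gaussian smoothing kernel p (for which ∫ q p(q) dq = 0 and ∇J̄(c) = ∫ J(c + q) q p(q) dq indeed hold); the kernel choice and the test functional are assumptions of the sketch, not of the theorem.

```python
import numpy as np

# Monte Carlo check of M s[n] = grad Jbar(c) for the smoothed functional
# Jbar(c) = integral of J(c + q) p(q) dq, with J(c) = ||c||^2 + sin(c_1) and a
# standard Gaussian kernel p (kernel and test functional are assumptions here).
rng = np.random.default_rng(3)
dim, n_samples = 3, 400000
c = np.array([0.3, -1.0, 2.0])

q = rng.normal(size=(n_samples, dim))
J_shift = np.sum((c + q) ** 2, axis=1) + np.sin(c[0] + q[:, 0])   # J(c + q) per sample
J_zero = np.sum(c ** 2) + np.sin(c[0])                            # J(c)
s_mean = ((J_shift - J_zero)[:, None] * q).mean(axis=0)           # estimate of M s[n]

# grad Jbar(c) in closed form for this J and unit Gaussian smoothing:
grad_Jbar = 2.0 * c + np.array([np.exp(-0.5) * np.cos(c[0]), 0.0, 0.0])
print("Monte Carlo M s[n]:", np.round(s_mean, 2))
print("grad Jbar(c):      ", np.round(grad_Jbar, 2))
```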
Proof of Theorem 11. We will write (cᵀx)₊ = max{0, cᵀx}. Assuming that c* is an arbitrary point from C*, we introduce I(c) = ½||c − c*||²; then conditions (2.2) and (2.3) hold with L = 1.

For algorithm (5.11), since c*ᵀx ≤ 0 for all x, we obtain

∇I(c[n−1])ᵀ M s[n] = M{ (c[n−1]ᵀx)₊ (c[n−1] − c*)ᵀx } ≥ M{ ((c[n−1]ᵀx)₊)² },

M||s[n]||² = M{ ((c[n−1]ᵀx)₊)² ||x||² } ≤ d² M{ ((c[n−1]ᵀx)₊)² },
i.e., (2.4) and (2.5) hold with λ[n] = 0, K₁ = 0, K₂ = d², and (2.6) and (2.9) hold under the assumptions we have made regarding γ[n]. From Theorem 1 we obtain that lim M{ ((c[n−1]ᵀx)₊)² } = 0, and the set {c[n]} is almost certainly bounded. Therefore there exists a point ĉ ∈ H to which the subsequence c[n_i] converges weakly. The functional φ(c) = M{ ((cᵀx)₊)² }, as can be checked directly, is continuous and convex in c, and therefore it is weakly lower semicontinuous. From this we obtain that φ(ĉ) = 0. But if ĉ ∉ C*, expression (5.17) implies that M{ ((ĉᵀx)₊)² } > 0. Therefore ĉ ∈ C*. In view of the fact that c* is arbitrary in the definition of I(c), we take c* = ĉ. Then the almost certain convergence of I(c[n]) and the convergence of the subsequence c[n_i] → ĉ imply that the entire sequence c[n] converges to ĉ.
In algorithm (5.12) we have

∇I(c[n−1])ᵀ M s[n] = M{ ((c[n−1]ᵀx)₊ / ||x||²) (c[n−1] − c*)ᵀx } ≥ M{ ((c[n−1]ᵀx)₊)² / ||x||² },

M||s[n]||² = M{ ((c[n−1]ᵀx)₊)² / ||x||² }.
The remaining reasoning is as above.
Algorithm (5.14) can be similarly investigated:

∇I(c[n−1])ᵀ M s[n] ≥ M{ (c[n−1]ᵀx)₊ },

M||s[n]||² ≤ M||x||² ≤ d².
Let us now prove that algorithms (5.13) and (5.14) are finite when the "representability hypothesis" (5.16) holds. Let us note first of all that ε in (5.16) can be regarded as arbitrarily large because of the fact that inequalities (5.10) are inhomogeneous (e.g., by replacing c* by λc* for large positive λ). Furthermore, in processes (5.13) and (5.14) we can discard all points at which c[n−1]ᵀx[n] < 0 (since c does not change at them). For the resulting "compressed" processes, we should understand M s[n], M||s[n]||² to be the conditional expectations of the corresponding quantities subject to the condition c[n−1]ᵀx ≥ 0. For the "compressed" process (5.14) we can write
∇I(c[n−1])ᵀ M s[n] = M{ (c[n−1] − c*)ᵀx | c[n−1]ᵀx ≥ 0 } ≥ M{ −c*ᵀx | c[n−1]ᵀx ≥ 0 } ≥ ε,

M||s[n]||² ≤ d².
Therefore we may assume that conditions (2.4) and (2.5) hold with λ[n] = 0, K₁ = 0, K₂ = d²/ε. Since ε may be taken to be arbitrarily large, for any γ > 0 the condition γ < 2/(L K₂) = 2ε/d² holds, i.e., (2.9) holds. Using Theorem 1, we obtain that lim ∇I(c[n−1])ᵀ M s[n] = 0 for the "compressed" process. But this contradicts the above bound ∇I(c[n−1])ᵀ M s[n] ≥ ε > 0. Consequently, with probability 1 it is impossible to construct a "compressed" process consisting of an infinite number of iterations. As a result, the initial process terminates almost certainly after a finite number of steps. The resultant point is the solution, in view of condition (5.16).
The reasoning for the case of algorithm (5.13) is similar. We will give only the corresponding bounds. Taking γ[n] = 1, we obtain
∇I(c[n−1])ᵀ M s[n] = M{ ((c[n−1]ᵀx + δ)/||x||²) (c[n−1] − c*)ᵀx | c[n−1]ᵀx ≥ 0 },

M||s[n]||² = M{ (c[n−1]ᵀx + δ)² / ||x||² | c[n−1]ᵀx ≥ 0 }.
Since we may assume that δ ≤ ε, the resultant bounds imply that M||s[n]||² ≤ ∇I(c[n−1])ᵀ M s[n]. Thus condition (2.5) holds with λ[n] = 0, K₁ = 0, K₂ = 1, and therefore (2.9) holds as well, since γ[n] = 1 < 2/(L K₂) = 2.
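The finiteness argument above is the stochastic counterpart of the classical perceptron convergence theorem of Novikoff. A minimal sketch, under the assumption that (5.14) is the perceptron-type correction c[n] = c[n−1] − γ x[n] applied whenever c[n−1]ᵀx[n] ≥ 0, for a stream of vectors satisfying the representability hypothesis c*ᵀx ≤ −ε < 0:

```python
import numpy as np

# Sketch assuming (5.14) is the perceptron-type correction: whenever the current
# c fails on a sample (c.x >= 0), set c <- c - gamma * x.  Under the margin
# (representability) hypothesis c*.x <= -eps, the number of corrections is finite.
rng = np.random.default_rng(4)
dim, gamma, eps = 5, 1.0, 0.2
c_star = rng.normal(size=dim)
c_star /= np.linalg.norm(c_star)

def sample():
    # draw x until it satisfies the representability condition c*.x <= -eps
    while True:
        x = rng.normal(size=dim)
        if c_star @ x <= -eps:
            return x

c = np.zeros(dim)
corrections = 0
n_samples = 100000
for _ in range(n_samples):
    x = sample()
    if c @ x >= 0.0:
        c -= gamma * x
        corrections += 1
print("corrections made:", corrections, "out of", n_samples, "samples")
```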
LITERATURE CITED
1. Ya. Z. Tsypkin, Adaptation and Training in Automatic Systems [in Russian], Nauka, 1968.
2. Ya. Z. Tsypkin, Fundamentals of the Theory of Learning Systems [in Russian], Nauka, 1970.
3. M. Vazan, Stochastic Approximation [Russian translation], Mir, 1972.
4. J. M. Mendel and K. S. Fu (editors), Adaptive, Learning, and Pattern Recognition Systems: Theory and Applications, Acad. Press, 1970.
5. Ya. Z. Tsypkin, "Smoothed randomized functionals and algorithms in adaptation and training theory," Avtomat. i Telemekhan., No. 8, 1971.
6. H. Robbins and S. Monro, "A stochastic approximation method," Annals Math. Stat., 22, No. 1, 1951.
7. J. B. Crockett and H. Chernoff, "Gradient methods of maximization," Pacific J. Math., 5, No. 1, 1955.
8. V. Fabian, "Stochastic approximation methods," Czechoslovak Mathematical Journal, 10, No. 1, 1960.
9. L. A. Rastrigin, Stochastic Search Methods [in Russian], Nauka, 1968.
10. J. Kiefer and J. Wolfowitz, "Stochastic estimation of the maximum of a regression function," Annals Math. Stat., 23, No. 3, 1952.
11. Yu. M. Ermol'ev and Z. V. Nekrylova, "Some stochastic approximation methods," Kibernetika, No. 6, 1966.
12. Yu. M. Ermol'ev, "Method of generalized stochastic gradients and stochastic quasi-Fejér sequences," Kibernetika, No. 2, 1969.
13. B. M. Litvakov, "Convergence of recursion algorithms for teaching pattern recognition," Avtomat. i Telemekhan., No. 1, 1968.
14. N. V. Loginov, "Stochastic approximation methods," Avtomat. i Telemekhan., 1966.
15. I. P. Devyaterikov, A. I. Kaplinskii, and Ya. Z. Tsypkin, "Convergence of training algorithms," Avtomat. i Telemekhan., No. 10, 1969.
16. J. R. Blum, "Multidimensional stochastic approximation methods," Annals Math. Stat., 25, No. 4, 1954.
17. É. M. Vaisbord and D. B. Yudin, "Multiextremum stochastic approximation," Tekhnicheskaya Kibernetika, No. 5, 1968.
18. Ya. S. Rubinshtein, "Comparison of random search and stochastic approximation," in: Theory and Applications of Random Search [in Russian], Zinatne, Riga, 1969.
19. O. V. Guseva, "Convergence of one random-search algorithm," Kibernetika, No. 6, 1971.
20. M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer, Method of Potential Functions in the Theory of Machine Training [in Russian], Nauka, 1970.
21. Yu. I. Lyubich and G. D. Maistrovskii, "General theory of relaxation processes for convex functionals," Usp. Matem. Nauk, 25, No. 1, 1970.
22. B. T. Polyak, "Convergence of the method of feasible directions in extremum problems," Zh. Vychislit. Matem. i Matem. Fiz., 11, No. 4, 1971.
23. A. A. Samarskii, Introduction to the Theory of Difference Schemes [in Russian], Nauka, 1971.
24. V. Ya. Katkovnik, "Sensitivity of gradient procedures," Avtomat. i Telemekhan., No. 12, 1970.
25. É. D. Aved'yan, "One modification of the Robbins-Monro algorithm," Avtomat. i Telemekhan., No. 4, 1967.
26. K. B. Gray, "Applications of stochastic approximation to the optimization of random circuits," Proc. Symp. Appl. Math., 16, 1964.
27. H. J. Kushner, "Stochastic approximation algorithms for the local optimization of functions with nonunique stationary points," Brown Univ., 1972 (preprint).
28. V. V. Nalimov and N. A. Chernova, Statistical Methods of Planning Extremum Experiments [in Russian], Nauka, 1965.
29. B. T. Polyak, "Minimization of nonsmooth functionals," Zh. Vychislit. Matem. i Matem. Fiz., 9, No. 3, 1969.
30. A. Novikoff, "On convergence proofs for perceptrons," Proc. Symp. Math. Theor. Autom., 12, 1963.
31. V. A. Yakubovich, "Certain general theoretical principles regarding the construction of trainable recognition systems," in: Computer Techniques and Programming Problems [in Russian], No. 4, Izd. LGU, 1965.
32. V. N. Fomin, "Stochastic analogs for finite-convergent training algorithms for recognition systems," in: Computer Techniques and Problems of Cybernetics [in Russian], No. 6, Izd. LGU, 1971.
33. E. G. Gladyshev, "Stochastic approximation," Teor. Veroyatn. i ee Primen., 10, No. 2, 1965.
34. R. C. Buck, "Stochastic ascent," Numer. Math., 4, No. 3, 1962.
35. M. B. Nevel'son and R. Z. Khas'minskii, Stochastic Approximation and Recurrent Evaluation [in Russian], Nauka, 1972.