Neural, Parallel & Scientific Computations 9 (2001) 19-28
Local Convergence of Tri-Level Alternating Optimization

Richard J. Hathaway¹, Yingkang Hu¹, and James C. Bezdek²

¹Mathematics and Computer Science Department, Georgia Southern University, Statesboro, GA 30460
²Computer Science Department, University of West Florida, Pensacola, FL 32514
Abstract
Tri-level alternating optimization (TLAO) of a real-valued function f(w) consists of partitioning the vector variable w into three parts, say w = (x,y,z), and alternating optimizations over each of the three parts while holding the other two at their newest values. Alternating optimization is not usually the best approach to optimizing a function. However, in cases when (x,y,z) has special structure such that each of the partial optimizations can be performed very easily, the method can be simple to implement and computationally competitive with other popular approaches such as quasi-Newton or conjugate gradient methods. A convergence analysis of tri-level alternating optimization is given which shows that the method is locally, q-linearly convergent to minimizers at which the second derivative of the objective function is positive definite. A useful recent application of tri-level alternating optimization in the area of pattern recognition is described.
Keywords - alternating optimization, local convergence, pattern recognition
1. INTRODUCTION
In this paper we consider the convergence analysis of a technique for computing local solutions to the problem:

$$ \min_{w \in \mathbb{R}^s} f(w), \qquad (1) $$

where f: R^s → R is twice differentiable. The technique is called tri-level alternating optimization (TLAO). Application of this technique requires partitioning the variable w ∈ R^s as w = (x,y,z), with x ∈ R^p, y ∈ R^q, and z ∈ R^r. TLAO attempts to minimize f using an iteration that sequentially minimizes f over each of the grouped subsets of x, y,
and z variables. The TLAO procedure is stated next, and the notation "arg min" is used to denote the argument that minimizes; i.e., the minimizer.
Tri-Level Alternating Optimization (TLAO) of f: R^s → R

TLAO-1  Partition w ∈ R^s as w = (x,y,z), with x ∈ R^p, y ∈ R^q, and z ∈ R^r, p + q + r = s. Pick the initial iterate w^(0) = (x^(0), y^(0), z^(0)) and a stopping criterion. For example, choose a vector norm ||·|| and termination threshold ε, and stop when ||w^(k+1) − w^(k)|| ≤ ε or when k > T, where T is a maximum iteration limit. Set k = 0.

TLAO-2  Compute x^(k+1) = arg min_{x ∈ R^p} f(x, y^(k), z^(k)).  (2)

TLAO-3  Compute y^(k+1) = arg min_{y ∈ R^q} f(x^(k+1), y, z^(k)).  (3)

TLAO-4  Compute z^(k+1) = arg min_{z ∈ R^r} f(x^(k+1), y^(k+1), z).  (4)

TLAO-5  If ||w^(k+1) − w^(k)|| ≤ ε or k > T, then quit; otherwise, set k = k + 1 and go to TLAO-2.
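To make steps TLAO-1 through TLAO-5 concrete, here is a minimal Python sketch of the iteration. It leans on SciPy's general-purpose minimizer for the three partial optimizations; in a real application each inner call would be replaced by whatever closed-form or specialized solver the structure of f permits. The function and variable names are ours, not part of the paper.

```python
import numpy as np
from scipy.optimize import minimize

def tlao(f, x0, y0, z0, eps=1e-8, T=100):
    """Tri-level alternating optimization of f(x, y, z): a sketch.

    Each partial minimization uses SciPy's generic minimizer; TLAO is
    attractive only when these inner problems are cheap to solve.
    """
    x = np.asarray(x0, dtype=float)
    y = np.asarray(y0, dtype=float)
    z = np.asarray(z0, dtype=float)
    for k in range(T):
        w_old = np.concatenate([x, y, z])
        x = minimize(lambda u: f(u, y, z), x).x   # TLAO-2: newest y, z held fixed
        y = minimize(lambda u: f(x, u, z), y).x   # TLAO-3: new x, old z held fixed
        z = minimize(lambda u: f(x, y, u), z).x   # TLAO-4: new x, y held fixed
        # TLAO-5: stop when the full iterate w = (x, y, z) has nearly converged.
        if np.linalg.norm(np.concatenate([x, y, z]) - w_old) <= eps:
            break
    return x, y, z

# A separable objective is minimized in a single sweep (see Section 4):
f = lambda x, y, z: x @ x + y @ y + z @ z
print(tlao(f, [1.0], [2.0], [3.0]))
```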
A bi-level version of this approach is analyzed in Bezdek et al. (1987). The bi-level version has been widely used to optimize numerous fuzzy clustering criteria. Our interest in the tri-level version is motivated in part by the need to validate the optimization procedure employed in the recently devised pattern recognition tool in Hathaway and Bezdek (1999) that is briefly described in Section 3. This technique is a modification of the popular fuzzy c-means algorithm (Bezdek, 1981) and is capable of effectively clustering incomplete data. Incomplete data vectors are data vectors that are missing values for some (but not all) components. This note will help supply the underlying convergence theory for the useful new clustering technique in Hathaway and Bezdek (1999) and other statistical and fuzzy methods for pattern recognition that alternate optimizations over three sets of variables.
We mention that the global convergence of TLAO to minimizers (or in some cases, saddle points) of f follows easily from the general convergence theory of Zangwill (1969), and is based on the monotonic decrease in the objective function values as iteration proceeds. In short, Zangwill's theory can be used to show that under mild assumptions any limit point of a TLAO sequence is a point (x*, y*, z*) satisfying (2-4) with x^(k) = x^(k+1) = x*, y^(k) = y^(k+1) = y*, and z^(k) = z^(k+1) = z*. This type of point could either be a minimizer or a saddle point, but in practice, computed (x*, y*, z*) values are almost never saddle points.
The next section gives the local analysis of TLAO. Section 3 briefly describes a new clustering algorithm that uses TLAO to optimize a particular clustering criterion. The final section contains concluding remarks and some ideas regarding worthwhile future work.
2. LOCAL CONVERGENCE ANALYSIS OF TLAO
Let f: R^p × R^q × R^r → R and partition w = (x,y,z) ∈ R^p × R^q × R^r. We show in this section that TLAO is locally, q-linearly convergent to any minimizer of f for which the Hessian of f is positive definite. Corresponding to (2-4) we define X: R^q × R^r → R^p, Y: R^p × R^r → R^q, and Z: R^p × R^q → R^r, as:

$$ X(y,z) = \arg\min_{x \in \mathbb{R}^p} f(x,y,z), \qquad (5) $$
$$ Y(x,z) = \arg\min_{y \in \mathbb{R}^q} f(x,y,z), \qquad (6) $$
$$ Z(x,y) = \arg\min_{z \in \mathbb{R}^r} f(x,y,z). \qquad (7) $$

The reasoning used in this section is invariant to translation of any minimizer of f by a constant vector, so we can simplify our notation by assuming that the local minimizer of interest is (0,0,0) ∈ R^p × R^q × R^r. We first show that under reasonable assumptions, X, Y, and Z are continuously differentiable near (0,0) ∈ R^q × R^r, R^p × R^r, and R^p × R^q, respectively. (Hereafter, we sometimes leave it to the reader to infer the applicable dimensions of points such as (0,0) rather than explicitly mentioning them.) We let f''(w) denote the s×s Hessian matrix of f.
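As a concrete illustration of (5)-(7), consider the quadratic special case f(w) = ½wᵀAw with A symmetric positive definite and partitioned into blocks A_xx, A_xy, ..., A_zz as in (8) below (this worked example is ours, not one treated explicitly in the paper). Setting the partial gradient f_x to zero gives the minimizing functions in closed form:

$$ X(y,z) = -A_{xx}^{-1}(A_{xy}\,y + A_{xz}\,z), \qquad Y(x,z) = -A_{yy}^{-1}(A_{yx}\,x + A_{yz}\,z), \qquad Z(x,y) = -A_{zz}^{-1}(A_{zx}\,x + A_{zy}\,y). $$

Differentiating these closed forms reproduces exactly the partials X_y = −A_xx⁻¹A_xy, etc., obtained in the proof of Lemma 2.2 below; for a general C² function the same formulas hold at the minimizer via the implicit function theorem.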
Lemma 2.1. Let f: R^p × R^q × R^r → R satisfy the conditions:
(i) f is C² in a neighborhood of (0,0,0);
(ii) f''(0,0,0) is positive definite; and
(iii) (0,0,0) is a local minimizer of f.
Then in some neighborhood of (0,0) ∈ R^q × R^r, the minimizing function X(y,z) in (5) exists and is continuously differentiable. Similar results hold for Y(x,z) and Z(x,y).
Proof. Partition f''(x,y,z) as

$$ f''(x,y,z) = \begin{bmatrix} f_{xx}(x,y,z) & f_{xy}(x,y,z) & f_{xz}(x,y,z) \\ f_{yx}(x,y,z) & f_{yy}(x,y,z) & f_{yz}(x,y,z) \\ f_{zx}(x,y,z) & f_{zy}(x,y,z) & f_{zz}(x,y,z) \end{bmatrix}. $$

By (i) and (ii), f_xx(x,y,z) is positive definite and nonsingular in a neighborhood of (0,0,0). The implicit function theorem guarantees a continuously differentiable function X: R^q × R^r → R^p, defined in a neighborhood of (0,0) ∈ R^q × R^r, satisfying f_x(X(y,z),y,z) = 0. This implies x = X(y,z) is a critical point of f(·,y,z), and this together with (iii) gives us that X(0,0) = 0. Since (X(y,z),y,z) is near (0,0,0) for (y,z) near (0,0), it follows using (i) and (ii) that f_xx(X(y,z),y,z) is positive definite for (y,z) near (0,0), and this implies that X(y,z) is a minimizer of f(·,y,z). This shows that the continuously differentiable function guaranteed by the implicit function theorem is in fact the minimizing function in (5). Similar arguments give the results for Y(x,z) and Z(x,y). ∎
For notational convenience in the following, we define A = f''(0,0,0) and partition it as

$$ A = \begin{bmatrix} A_{xx} & A_{xy} & A_{xz} \\ A_{yx} & A_{yy} & A_{yz} \\ A_{zx} & A_{zy} & A_{zz} \end{bmatrix} = \begin{bmatrix} f_{xx}(0,0,0) & f_{xy}(0,0,0) & f_{xz}(0,0,0) \\ f_{yx}(0,0,0) & f_{yy}(0,0,0) & f_{yz}(0,0,0) \\ f_{zx}(0,0,0) & f_{zy}(0,0,0) & f_{zz}(0,0,0) \end{bmatrix} = f''(0,0,0). \qquad (8) $$

Define the mapping S: R^p × R^q × R^r → R^p × R^q × R^r corresponding to one iteration through steps TLAO-2, 3 and 4 as:

$$ S(x,y,z) = (\,S_1(x,y,z),\ S_2(x,y,z),\ S_3(x,y,z)\,) \qquad (9a) $$
$$ = (\,X(y,z),\ Y(X(y,z),z),\ Z(X(y,z), Y(X(y,z),z))\,). \qquad (9b) $$

The results of Lemma 2.1 imply that S is continuously differentiable in a neighborhood of (0,0,0) with S(0,0,0) = (0,0,0). Let ρ(S'(0,0,0)) denote the spectral radius of S'(0,0,0). As will be seen in the proof of Theorem 2.1, the fundamental property needed to establish convergence of a TLAO sequence is that ρ(S'(0,0,0)) < 1, which is proved in Lemma 2.2.

Lemma 2.2. Let f: R^p × R^q × R^r → R satisfy the conditions of Lemma 2.1 and let S: R^p × R^q × R^r → R^p × R^q × R^r be defined by (9). Then ρ(S'(0,0,0)) < 1.

Proof. Partition S'(0,0,0) as

$$ S'(0,0,0) = \begin{bmatrix} S_{1x}(0,0,0) & S_{1y}(0,0,0) & S_{1z}(0,0,0) \\ S_{2x}(0,0,0) & S_{2y}(0,0,0) & S_{2z}(0,0,0) \\ S_{3x}(0,0,0) & S_{3y}(0,0,0) & S_{3z}(0,0,0) \end{bmatrix}. $$

In calculating S'(0,0,0), we will need the various partials X_y(0,0), X_z(0,0), Y_x(0,0), Y_z(0,0), Z_x(0,0), and Z_y(0,0), which are obtained first. We suppress the argument (0,0) in the following. To obtain X_y, differentiate f_x(X(y,z),y,z) = 0 with respect to y and evaluate at (0,0) to get:
$$ f_{xx}(0,0,0)\,X_y + f_{xy}(0,0,0) = 0, $$

which yields

$$ X_y = -A_{xx}^{-1} A_{xy}. \qquad (10a) $$

Differentiating f_x(X(y,z),y,z) = 0 with respect to z and evaluating at (0,0) gives

$$ X_z = -A_{xx}^{-1} A_{xz}. \qquad (10b) $$

The remaining partials are calculated by differentiating f_y(x,Y(x,z),z) = 0 with respect to x and z, and f_z(x,y,Z(x,y)) = 0 with respect to x and y. They are:

$$ \begin{aligned} Y_x &= -A_{yy}^{-1} A_{yx}, \qquad (10c) \\ Y_z &= -A_{yy}^{-1} A_{yz}, \qquad (10d) \\ Z_x &= -A_{zz}^{-1} A_{zx}, \qquad (10e) \\ Z_y &= -A_{zz}^{-1} A_{zy}. \qquad (10f) \end{aligned} $$

Now the components of S'(0,0,0) are calculated using (10) and the chain rule applied to (9b) as:

$$ \begin{aligned} S_{1x}(0,0,0) &= 0 \in \mathbb{R}^{p \times p}; \qquad (11a) \\ S_{2x}(0,0,0) &= 0 \in \mathbb{R}^{q \times p}; \qquad (11b) \\ S_{3x}(0,0,0) &= 0 \in \mathbb{R}^{r \times p}; \qquad (11c) \\ S_{1y}(0,0,0) &= X_y = -A_{xx}^{-1} A_{xy}; \qquad (11d) \\ S_{2y}(0,0,0) &= Y_x X_y = A_{yy}^{-1} A_{yx} A_{xx}^{-1} A_{xy}; \qquad (11e) \\ S_{3y}(0,0,0) &= Z_x X_y + Z_y Y_x X_y = A_{zz}^{-1} A_{zx} A_{xx}^{-1} A_{xy} - A_{zz}^{-1} A_{zy} A_{yy}^{-1} A_{yx} A_{xx}^{-1} A_{xy}; \qquad (11f) \\ S_{1z}(0,0,0) &= X_z = -A_{xx}^{-1} A_{xz}; \qquad (11g) \\ S_{2z}(0,0,0) &= Y_x X_z + Y_z = A_{yy}^{-1} A_{yx} A_{xx}^{-1} A_{xz} - A_{yy}^{-1} A_{yz}; \qquad (11h) \\ S_{3z}(0,0,0) &= Z_x X_z + Z_y Y_x X_z + Z_y Y_z = A_{zz}^{-1} A_{zx} A_{xx}^{-1} A_{xz} - A_{zz}^{-1} A_{zy} A_{yy}^{-1} A_{yx} A_{xx}^{-1} A_{xz} + A_{zz}^{-1} A_{zy} A_{yy}^{-1} A_{yz}. \qquad (11i) \end{aligned} $$
We can now establish that ρ(S'(0,0,0)) < 1 by recognizing an important relationship between S'(0,0,0) and A. Define the matrices B, C, and D as

$$ B = \begin{bmatrix} A_{xx} & 0 & 0 \\ A_{yx} & A_{yy} & 0 \\ A_{zx} & A_{zy} & A_{zz} \end{bmatrix}, \quad C = \begin{bmatrix} 0 & -A_{xy} & -A_{xz} \\ 0 & 0 & -A_{yz} \\ 0 & 0 & 0 \end{bmatrix}, \quad \text{and} \quad D = \begin{bmatrix} A_{xx} & 0 & 0 \\ 0 & A_{yy} & 0 \\ 0 & 0 & A_{zz} \end{bmatrix}. \qquad (12) $$

Note that A = B − C. Since A is positive definite, it follows that D is positive definite and B is nonsingular. A straightforward but tedious calculation yields

$$ S'(0,0,0) = B^{-1} C. \qquad (13) $$

By Theorem 7.1.9 in Ortega (1972) and the assumption that A is symmetric and positive definite, we have that ρ(S'(0,0,0)) = ρ(B⁻¹C) < 1 if A = B − C is a P-regular splitting. By definition, B − C is a P-regular splitting if B is nonsingular and B + C is positive definite. By earlier comments, it only remains to show that B + C is positive definite. The symmetric part of B + C is

$$ \tfrac{1}{2}\big[(B+C) + (B+C)^T\big] = \tfrac{1}{2}(B + B^T) + \tfrac{1}{2}(C + C^T) = D, \qquad (14) $$

which is positive definite. ∎
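The splitting identity (13) and the spectral-radius bound of Lemma 2.2 are easy to check numerically. The following sketch is ours (the block sizes and the random test matrix are arbitrary choices): it builds a random symmetric positive definite A, forms the splitting A = B − C of (12), and confirms that ρ(B⁻¹C) < 1.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, r = 2, 3, 4                    # block sizes: x in R^p, y in R^q, z in R^r
s = p + q + r
M = rng.standard_normal((s, s))
A = M @ M.T + s * np.eye(s)          # random symmetric positive definite A

# Split A = B - C as in (12): B keeps the blocks of A on and below the
# block diagonal; C is minus the strictly upper block triangle.
cuts = [0, p, p + q, s]
B = np.zeros_like(A)
for i in range(3):
    for j in range(i + 1):
        B[cuts[i]:cuts[i+1], cuts[j]:cuts[j+1]] = \
            A[cuts[i]:cuts[i+1], cuts[j]:cuts[j+1]]
C = B - A                            # so that A = B - C exactly

S_prime = np.linalg.solve(B, C)      # S'(0,0,0) = B^{-1} C, eq. (13)
rho = max(abs(np.linalg.eigvals(S_prime)))
print(f"spectral radius = {rho:.4f}")  # < 1, as Lemma 2.2 guarantees
assert rho < 1
```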
We now give the main result for local convergence, which is essentially an adaptation of Ostrowski's theorem (Theorem 8.1.7 in Ortega, 1972).

Theorem 2.1. Let w* be a local minimizer of f: R^s → R for which f''(w*) is positive definite, and let f be C² on a neighborhood of w*. Then there is a neighborhood U of w* such that for any w^(0) ∈ U, the corresponding TLAO iteration sequence {w^(k)}, defined using S in (9) as w^(k+1) = S(w^(k)), converges q-linearly to w*.
Proof. As discussed earlier, we can assume that w* = 0. It is necessary to show

$$ \lim_{k \to \infty} w^{(k)} = \lim_{k \to \infty} S^k(w^{(0)}) = 0 \qquad (15) $$

for all choices of w^(0) close enough to w*. Apply Lemma 2.2 to obtain

$$ \rho = \rho(S'(0,0,0)) < 1. \qquad (16) $$

Pick δ > 0 such that ρ + 2δ < 1. By Theorem 3.8 in Stewart (1973), there exists a norm ||·||_δ on R^s such that for all w ∈ R^s,

$$ \| S'(0,0,0)\,w \|_\delta \le (\rho + \delta)\,\| w \|_\delta. \qquad (17) $$

Since S' is continuous near w* = (x*,y*,z*) = (0,0,0), there is a number r > 0 such that

$$ \| S'(w_1)\,w_2 \|_\delta \le (\rho + 2\delta)\,\| w_2 \|_\delta \qquad (18) $$

for all w_1 and w_2 ∈ B_r = {w ∈ R^s : ||w||_δ ≤ r}. From (18) and the fact that S(0) = 0, we have:

$$ \| S(w) \|_\delta = \Big\| \int_0^1 S'(tw)\,w\,dt \Big\|_\delta \le \int_0^1 \| S'(tw)\,w \|_\delta\,dt \le (\rho + 2\delta)\,\| w \|_\delta. \qquad (19) $$

The result of (19) establishes that for initialization of TLAO near w*, the error is reduced by the factor (ρ + 2δ) < 1 at each iteration, which gives the local q-linear convergence of {w^(k)} to w*. ∎
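For a strictly convex quadratic, the three partial minimizations are exact linear solves, so the conclusion of Theorem 2.1 can be observed directly. In the sketch below (our illustration, reusing the random matrix construction above), one TLAO sweep on f(w) = ½wᵀAw is precisely the block Gauss-Seidel map w ↦ B⁻¹Cw, and the printed error ratios settle near ρ(B⁻¹C), exhibiting the q-linear rate.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, r = 2, 3, 4
s = p + q + r
M = rng.standard_normal((s, s))
A = M @ M.T + s * np.eye(s)          # f(w) = 0.5 w'Aw has unique minimizer w* = 0

blocks = [slice(0, p), slice(p, p + q), slice(p + q, s)]  # x, y, z coordinates
w = rng.standard_normal(s)           # initial iterate w^(0)
err_old = np.linalg.norm(w)
for k in range(12):
    # One TLAO sweep (TLAO-2..4): minimizing the quadratic over block b
    # with the other blocks fixed solves A_bb w_b = -sum_{c != b} A_bc w_c.
    for b in range(3):
        rhs = -sum(A[blocks[b], blocks[c]] @ w[blocks[c]]
                   for c in range(3) if c != b)
        w[blocks[b]] = np.linalg.solve(A[blocks[b], blocks[b]], rhs)
    err = np.linalg.norm(w)
    print(f"k = {k:2d}   error ratio = {err / err_old:.4f}")
    err_old = err
```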
3. AN EXAMPLE OF TLAO
An important problem from the area of pattern recognition is that of partitioning a set of data X = {x_1,...,x_n} ⊂ R^s into natural data groupings (or clusters). A popular and effective method for partitioning data into c fuzzy clusters {C_1,...,C_c} is based on minimization of the fuzzy c-means functional (Bezdek, 1981):

$$ J_m(U,V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^m \, \| x_k - v_i \|^2, \qquad (20) $$

where: m > 1 is the fuzzification parameter;
U = [u_ik], where u_ik = degree to which x_k ∈ C_i;
V = [v_1,...,v_c], where v_i ∈ R^s is the center of C_i; and
||·|| is an inner product norm on R^s.
The optimization of (20) is attempted over all V ∈ R^{s×c} and U ∈ M_fcn, where

$$ M_{fcn} = \Big\{ U \in \mathbb{R}^{c \times n} : u_{ik} \in [0,1]\ \forall\, i,k;\ \sum_{i=1}^{c} u_{ik} = 1\ \forall\, k;\ \sum_{k=1}^{n} u_{ik} > 0\ \forall\, i \Big\}. \qquad (21) $$
The most popular method for optimizing (20) is a bi-level alternating optimization over the U and V variables known as the fuzzy c-means algorithm (Bezdek, 1981). It calculates a new V from the most recent U via

$$ v_i = \sum_{k=1}^{n} u_{ik}^m \, x_k \Big/ \sum_{k=1}^{n} u_{ik}^m, \qquad \forall\, i, \qquad (22) $$
and a new U from the most recent V by

$$ u_{ik} = \bigg( \sum_{j=1}^{c} \Big( \| x_k - v_i \| \,/\, \| x_k - v_j \| \Big)^{2/(m-1)} \bigg)^{-1}, \qquad \forall\, i,k, \qquad (23) $$

where ||x_k − v_j|| > 0 ∀ j, k. (See Bezdek (1981) for the necessary condition that supplants (23) when one or more ||x_k − v_j|| = 0.)
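The updates (22) and (23) vectorize naturally. The following is our NumPy sketch of the standard bi-level algorithm, with a small epsilon guard standing in for the singularity condition mentioned above:

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, eps=1e-12, seed=0):
    """Bi-level AO for fuzzy c-means: alternate (22) and (23).

    X: (n, s) data array. Returns memberships U (c, n) and centers V (c, s).
    """
    n, s = X.shape
    rng = np.random.default_rng(seed)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                          # valid fuzzy partition, eq. (21)
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)          # eq. (22)
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)   # ||x_k - v_i||^2
        w = np.maximum(d2, eps) ** (-1.0 / (m - 1.0))
        U = w / w.sum(axis=0)                                 # eq. (23)
    return U, V
```

For example, `U, V = fcm(X, c=3)` produces a fuzzy 3-partition of the rows of X; a practical implementation would also monitor the change in U between sweeps as the stopping test, as in TLAO-5.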
Real-world data sets sometimes contain incomplete data: feature vectors with one or more missing values (Jain and Dubes, 1988; Dixon, 1979). Cases of this kind can arise from human or sensor error, subsequent data corruption, etc. For example, datum component x_jk may be missing, so that x_k ∈ R^s has the form (x_1k,...,x_(j-1)k, ?, x_(j+1)k,...,x_sk)^T. Unfortunately, the iteration described by (22) and (23) requires access to complete data. If the number of data with missing components is small, then one option is to delete that portion of the data set and base the clustering entirely on those data containing no missing components. This approach is problematic if the proportion of data with missing components is significant, and in all cases it fails to provide a classification of those data deleted from the cluster analysis. Recently a missing data version of the fuzzy c-means algorithm (Hathaway and Bezdek, 1999) has been devised for the incomplete data case, and we give the algorithm here as a useful example of TLAO.
In the following we will use x̃ to represent missing data components; e.g., x_k = (x_1k, x_2k, x̃_3k, x_4k,...,x_sk)^T. The collection of all missing data components will be denoted by X̃ ⊂ X. We adapt fuzzy c-means to missing data using a principle of optimality similar to that used in maximum likelihood methods from statistics. In this case, we assume that the missing data is consistent with the data set having strong cluster structure. We implement this optimal completion strategy by dynamically estimating the missing data components to be those numbers that optimize the cluster structure, as measured by minimum values of the criterion in (20).

The incomplete data version of fuzzy c-means from Hathaway and Bezdek (1999) minimizes (20) over U, V, and X̃. The minimization is done using TLAO based on the following updating. The current values of V and U are used to estimate the missing data components X̃ by:

$$ \tilde{x}_{jk} = \sum_{i=1}^{c} u_{ik}^m \, v_{ji} \Big/ \sum_{i=1}^{c} u_{ik}^m, \qquad \forall\, \tilde{x}_{jk} \in \tilde{X}, \qquad (24) $$

where v_ji and x_jk denote the jth components of v_i and x_k, respectively.
The current missing data values X̃ are then used to complete X so that (22) can be used to calculate the new V. The third inner step of one complete TLAO iteration is then done using the completed X and the new V in (23) to calculate the new U. Preliminary testing of this missing data approach for clustering has demonstrated it to be highly effective. The convergence theory of the last section guarantees the procedure to be locally convergent to minimizers, at a q-linear rate.
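Putting the pieces together, one TLAO sweep for incomplete data cycles through (24), (22), and (23). Below is our self-contained sketch of this optimal completion strategy; the boolean mask, the initialization choices, and all names are ours.

```python
import numpy as np

def ocs_fcm(X, missing, c, m=2.0, n_iter=100, eps=1e-12, seed=0):
    """TLAO for fuzzy c-means on incomplete data (optimal completion).

    X: (n, s) data; entries where the boolean mask `missing` is True are
    placeholders. Sweep: missing values (24), centers (22), memberships (23).
    """
    X = X.copy()
    n, s = X.shape
    rng = np.random.default_rng(seed)
    X[missing] = 0.0                     # placeholder; (24) overwrites it below
    U = rng.random((c, n))
    U /= U.sum(axis=0)
    V = X[rng.choice(n, size=c, replace=False)]  # crude initial centers (ours)
    for _ in range(n_iter):
        Um = U ** m                                   # (c, n)
        # Eq. (24): x_jk <- sum_i u_ik^m v_ji / sum_i u_ik^m, missing entries only.
        Xhat = (Um.T @ V) / Um.sum(axis=0)[:, None]   # (n, s) completions
        X[missing] = Xhat[missing]
        # Eq. (22): new centers from the completed data.
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Eq. (23): new memberships from the completed data and new centers.
        d2 = np.maximum(((X[None] - V[:, None]) ** 2).sum(-1), eps)
        w = d2 ** (-1.0 / (m - 1.0))
        U = w / w.sum(axis=0)
    return U, V, X
```

Because the completed X is returned along with U and V, every datum, including those with missing components, receives a classification, avoiding the deletion drawback discussed above.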
4. DISCUSSION
Tri-level alternating optimization attempts to optimize a function f(x,y,z) using an
iteration that alternates optimizations over each of three (vector) variables while holding
the other rwo fixed. The method is locally, q-linearly conversent to any minimizer at
which the second derivative is positive definite. The giobal analysis of this method fits
into the general convergence theory of Zangwill (1969), which guarantees that any limit
point of an iteration sequence must satisry the first order necessary conditions for
minimizing f.
A recent example of TLAO for an important problem in pattern recognition was given. The TLAO scheme applied to the fuzzy c-means function allows clustering of data sets where some of the data vectors have missing components. Preliminary numerical tests of the approach have shown it to produce good results even when there is a substantial proportion of incomplete data.

Alternating optimization is often not the best approach to optimization, but it is certainly worth consideration if there is a natural partitioning of the variables so that each of the partial minimizations is simple. While the convergence rate of the AO approach is in general only q-linear, if the partial minimizations are simple, the method can still be competitive or superior to joint optimization approaches with faster rates (e.g., q-superlinear) of convergence (Hu and Hathaway, 1999).
One of the most interesting mathematical questions concerning this approach is how to systematically and efficiently determine the "best" partitioning of the variables so that the value of ρ in (16) is as small as possible. We expect the value of ρ to be small when the partitioning produces groups of variables that are "largely independent". For example, minimization of f(x,y,z) = x² + y² + z² can be done in one TLAO iteration (ρ = 0) because there is complete independence among the three variables. Other computationally oriented questions concern how to best formulate a relaxation scheme and how an alternating optimization approach can best be hybridized with a q-superlinearly (or faster) convergent local method. Finally, the authors plan to unify the convergence theory of grouped coordinate descent and extend it to the case of m-level alternating optimization, and to survey many important instances of alternating optimization type schemes in pattern recognition and statistics.
REFERENCES
Bezdek, J.C. (1981). Pattern Recognition with Fuzzy Objective Functions. New York: Plenum Press.

Bezdek, J.C., Hathaway, R.J., Howard, R.E., Wilson, C.A., & Windham, M.P. (1987). Local convergence analysis of a grouped variable version of coordinate descent. Journal of Optimization Theory and Applications, v. 54, 471-477.

Dixon, J.K. (1979). Pattern recognition with partly missing data. IEEE Transactions on Systems, Man and Cybernetics, v. 9, 617-621.

Hu, Y., & Hathaway, R.J. (1999). On efficiency of optimization in fuzzy c-means, preprint.

Hathaway, R.J., & Bezdek, J.C. (1999). Fuzzy c-means clustering of incomplete data, preprint.

Jain, A.K., & Dubes, R.C. (1988). Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall.

Ortega, J.M. (1972). Numerical Analysis: A Second Course. New York: Academic Press.

Stewart, G.W. (1973). Introduction to Matrix Computations. New York: Academic Press.

Zangwill, W. (1969). Nonlinear Programming: A Unified Approach. Englewood Cliffs, NJ: Prentice-Hall.