
Inexact higher-order proximal algorithms for tensor factorization

Valentin Leplat¹, Anh-Huy Phan¹, Andersen Ang²
¹Center of Artificial Intelligence Technology, Skoltech, Moscow, Russia; ²School of Electronics and Computer Science, University of Southampton, UK
Abstract
Higher-order Methods (HoM) for factorization models
Two new efficient & implementable proximal HoMs

Setup: minimizing a p-times differentiable convex function
We solve

    argmin_{x∈E} f(x),  (1)

where:
- E = R^n is a vector space (with inner product and norm),
- f: E → R is closed, convex, and p-times differentiable with bounded p-th derivative:

    sup_{x∈E} ||D^p f(x)|| =: M_p < +∞,  (2)

- D^p f(x)[h_1, ..., h_p] is the p-th-order directional derivative of f at x along the directions h = [h_1, ..., h_p],
- ||D^p f(x)|| = max_h { D^p f(x)[h]^p : ||h|| ≤ 1 }, where [h]^p means the direction h repeated p times.
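As a sanity check on this definition: for p = 2 and a quadratic f, D²f(x) is the constant Hessian Q, and for PSD Q the maximum over unit directions is attained by the top eigenvector. A minimal numpy sketch (our illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = 0.5 * x^T Q x with Q symmetric PSD, so D^2 f(x) = Q for every x.
A = rng.standard_normal((5, 5))
Q = A.T @ A  # symmetric positive semidefinite

# ||D^2 f(x)|| = max { D^2 f(x)[h, h] : ||h|| <= 1 } = lambda_max(Q) for PSD Q.
eigvals, eigvecs = np.linalg.eigh(Q)
lam_max, h_top = eigvals[-1], eigvecs[:, -1]

# The top eigenvector attains the maximum...
attained = h_top @ Q @ h_top
# ...and any other unit direction gives a smaller value.
h = rng.standard_normal(5)
h /= np.linalg.norm(h)
sampled = h @ Q @ h

print(attained, lam_max, sampled)
```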
Higher-order proximal point methods

p-th-order prox operator:

    prox^p_f(x̄) := argmin_{x∈E} { f^p_{x̄,λ}(x) := f(x) + λ d_{p+1}(x − x̄) },  (3)

where λ ∈ R_+, p ∈ N, f^p_{x̄,λ} is the p-th-order model, and d_{p+1}(a) = (1/(p+1)) ||a||^{p+1}.
(3) generalizes the classical prox operator

    prox_f(x̄) := argmin_{x∈E} { f(x) + (1/(2λ)) ||x − x̄||^2 }.
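To make (3) concrete, here is a small numerical sketch (our own illustration, not the authors' code): we compute prox^p_f(x̄) for p = 2 and a simple quadratic f with a generic solver, then verify the first-order optimality condition ∇f(x*) + λ||x* − x̄||^{p−1}(x* − x̄) = 0.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p, lam = 4, 2, 2.0
b = rng.standard_normal(n)
xbar = rng.standard_normal(n)

f = lambda x: 0.5 * np.sum((x - b) ** 2)   # a simple smooth convex f
grad_f = lambda x: x - b

def model(x):
    # f^p_{xbar,lam}(x) = f(x) + lam * ||x - xbar||^{p+1} / (p + 1)
    return f(x) + lam * np.linalg.norm(x - xbar) ** (p + 1) / (p + 1)

res = minimize(model, xbar, method="BFGS")
xstar = res.x

# optimality: grad f(x*) + lam * ||x* - xbar||^{p-1} (x* - xbar) = 0
r = xstar - xbar
resid = grad_f(xstar) + lam * np.linalg.norm(r) ** (p - 1) * r
print(np.linalg.norm(resid))
```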
Inexact proximal method

Often (3) can't be computed efficiently ⇒ compute an approximate solution.
HoM achieves fast convergence even if (3) is not solved exactly.
(Nesterov, 2021) defines a set of acceptable solutions to (3) as

    A^p_{λ,f}(x̄, β) = { x ∈ E : ||∇f^p_{x̄,λ}(x)|| ≤ β ||∇f(x)|| },  (4)

where β ∈ [0,1) is a tolerance parameter.
If we have 1st-order optimality ||∇f(x)|| = 0, or we set β = 0, we have the ideal cases of A^p_{λ,f}(x̄, β) in which (3) is solved exactly.
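The acceptance test (4) is cheap to evaluate, since ∇f^p_{x̄,λ}(x) = ∇f(x) + λ||x − x̄||^{p−1}(x − x̄). A hedged sketch (the choices of f, step size, and parameters are ours, for illustration only):

```python
import numpy as np

# Acceptance test (4): x is acceptable iff
# ||grad f^p_{xbar,lam}(x)|| <= beta * ||grad f(x)||.
def accepted(grad_f, x, xbar, lam, p, beta):
    r = x - xbar
    grad_model = grad_f(x) + lam * np.linalg.norm(r) ** (p - 1) * r
    return np.linalg.norm(grad_model) <= beta * np.linalg.norm(grad_f(x))

# demo: drive the model's gradient down with plain gradient steps until accepted
rng = np.random.default_rng(2)
b = rng.standard_normal(5)
grad_f = lambda x: x - b                      # f(x) = 0.5 ||x - b||^2
xbar = rng.standard_normal(5)
lam, p, beta = 1.0, 2, 0.25

x = xbar.copy()
for _ in range(500):
    if accepted(grad_f, x, xbar, lam, p, beta):
        break
    r = x - xbar
    g = grad_f(x) + lam * np.linalg.norm(r) ** (p - 1) * r
    x -= 0.1 * g                              # small fixed step for the sketch
print(accepted(grad_f, x, xbar, lam, p, beta))
```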
AIPPA: Accelerated Inexact Pth-order Proximal Algorithm
Let A_k := (2(1−β)/λ) (k/(2p+2))^{p+1} and a_{k+1} := A_{k+1} − A_k, and define Φ_0(x) := d_{p+1}(x − x_0). Algorithm 1 shows the accelerated inexact p-th-order proximal algorithm, abbreviated AIPPA.
Algorithm 1 AIPPA
Input: x_0 ∈ E, β ∈ [0,1), λ > 0, Φ_0(x) := d_{p+1}(x − x_0)
Output: An approximate solution to Problem (1)
1: for k = 0, 1, ... do
2:   v_k := argmin_{x∈E} Φ_k(x) and y_k := (A_k/A_{k+1}) x_k + (a_{k+1}/A_{k+1}) v_k
3:   Compute T_k ∈ A^p_{λ,f}(y_k, β) and update
       Φ_{k+1}(x) = Φ_k(x) + a_{k+1} [ f(T_k) + ⟨∇f(T_k), x − T_k⟩ ]
4:   Choose x_{k+1} such that f(x_{k+1}) ≤ f(T_k).
5: end for
Convergence rate: the sequence {x_k}_k generated by AIPPA satisfies f(x_k) − f* ≤ O(1/k^{p+1}) [11, Theorem 2].
Stopping criterion: condition (4) can be checked in practice and integrated directly as a stopping criterion for any procedure used to compute T_k given y_k.
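For intuition on the rate, the coefficients A_k grow like k^{p+1}, which is exactly what yields the O(1/k^{p+1}) bound. A quick check (the parameter values are illustrative):

```python
# Coefficients used by AIPPA (p-th order, tolerance beta, regularization lam):
# A_k = (2(1-beta)/lam) * (k/(2p+2))**(p+1),  a_{k+1} = A_{k+1} - A_k.
p, beta, lam = 3, 1.0 / 3.0, 2.0

def A(k):
    return (2 * (1 - beta) / lam) * (k / (2 * p + 2)) ** (p + 1)

a = [A(k + 1) - A(k) for k in range(10)]      # the increments a_{k+1} > 0

# A_k scales like k^{p+1}: doubling k multiplies A_k by 2**(p+1) = 16 for p = 3.
ratios = [A(2 * k) / A(k) for k in range(1, 6)]
print(ratios)
```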
Bi-level framework: BLUM has two levels:
- an upper level, corresponding to a chosen p-th-order proximal algorithm;
- a lower level, where an algorithm running on lower-order derivatives (for instance, the first (p−1) derivatives) is used to approximately solve Step 3 in AIPPA.
We are now ready to discuss the setup of the BLUM framework in the newly proposed algorithm for solving CPD.
III. PROPOSED ALGORITHM FOR CPD

Now we present a new inexact 3rd-order proximal algorithm for computing a nonnegative CPD of an input nonnegative N-way tensor, i.e., approximating the tensor by a sum of R rank-one nonnegative tensors. Let the tensor T have dimensions I_1 × I_2 × ··· × I_N, and let the matrices U^(n) have size I_n × R for 1 ≤ n ≤ N. Computing a rank-R nonnegative CPD of T is achieved by solving:

    argmin_{U^(n) ≥ 0, 1 ≤ n ≤ N}  D( T, I ×_1 U^(1) ×_2 ··· ×_N U^(N) ) =: F( {U^(n)} ),  (5)

where D is a measure of discrepancy between two tensors (we consider two cases later), I is the identity tensor with all dimensions equal to R, and ×_n is the n-mode product [2]. Note that the proposed method can be used to compute a low-rank matrix approximation of a matrix, which is a special case of nonnegative CPD where the input T is a 2-way tensor and the feasible set is E.

We solve (5) by Block-Coordinate Descent (BCD), which consists of optimizing alternately over one factor of the factorization while the others are kept fixed at their most recent values; i.e., at each iteration we successively solve N subproblems for CPD: one in U^(n), with U^(1), ..., U^(n−1), U^(n+1), ..., U^(N) fixed, alternating over n (after rewriting the tensor decomposition model using tensor matricization along the different modes). Each subproblem is solved using a variant of AIPPA. In the following we present the update for one factor U^(n) to tackle (5). The updates of the other factors are performed in a similar fashion under permutation.

Consider the update of U^(1). After matricization along the 1st mode we have:

    T = I ×_1 U^(1) ×_2 ··· ×_N U^(N)   ⇔   T_(1) = U^(1) ( U^(2) ⊙ ··· ⊙ U^(N) )^T,

where T_(i) is the unfolding of the tensor T along mode i and ⊙ is the Khatri-Rao product. Note that this unfolded form is an NMF of V = T_(1)^T with factors X = (U^(1))^T and W = U^(2) ⊙ ··· ⊙ U^(N).
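The unfolding identity is easy to verify numerically. Note that the ordering of the factors inside the Khatri-Rao product depends on the unfolding convention; the sketch below (ours) uses the convention in which the last mode varies slowest, which gives T_(1) = U^(1)(U^(3) ⊙ U^(2))^T for a 3-way tensor:

```python
import numpy as np

def khatri_rao(A, B):
    # column-wise Kronecker product: column r is kron(A[:, r], B[:, r])
    I, R = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

rng = np.random.default_rng(3)
I1, I2, I3, R = 4, 3, 5, 2
U1, U2, U3 = (rng.random((d, R)) for d in (I1, I2, I3))

# rank-R CPD tensor: T[i,j,k] = sum_r U1[i,r] * U2[j,r] * U3[k,r]
T = np.einsum("ir,jr,kr->ijk", U1, U2, U3)

# mode-1 unfolding with the last mode varying slowest (Fortran order):
T1 = T.reshape(I1, I2 * I3, order="F")

# under this convention: T_(1) = U1 (U3 ⊙ U2)^T
print(np.allclose(T1, U1 @ khatri_rao(U3, U2).T))
```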
We now consider two functions D, written as f(X):

    min_X f(X) := (1/2) ||V − WX||_F^2 − γ Σ_{i,j} log(X_{ij}),  (6)

    min_X f(X) := (1/24) ||V − WX||_4^4 − γ Σ_{i,j} log(X_{ij}),  (7)

where γ ≥ 0 is a penalty weight and −log(x) is the log-barrier function that promotes nonnegativity of x. Note that both f in (6) and (7) are separable w.r.t. the columns of X: denoting V_{:,j} = v and X_{:,j} = x (the j-th columns of V and X, respectively), solving (6) or (7) boils down to solving m minimization problems in parallel (m = number of columns of X), each performed over one particular column of X. For instance, (6) becomes:

    min_{x∈E} f(x) := (1/2) ||v − Wx||^2 − γ Σ_i log(x_i),

which belongs to problem class (1).
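For this column subproblem, the gradient and Hessian needed by the inner solver are ∇f(x) = W^T(Wx − v) − γ/x and ∇²f(x) = W^T W + γ diag(1/x²). A sketch with a finite-difference check (the test data is ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, gamma = 6, 4, 0.1
W = rng.random((m, n))
v = rng.random(m)

# column subproblem of (6): f(x) = 0.5 ||v - Wx||^2 - gamma * sum_i log(x_i)
def f(x):
    return 0.5 * np.sum((v - W @ x) ** 2) - gamma * np.sum(np.log(x))

def grad(x):
    return W.T @ (W @ x - v) - gamma / x

def hess(x):
    return W.T @ W + gamma * np.diag(1.0 / x ** 2)

# finite-difference check of the gradient at an interior (positive) point
x0 = 0.5 + rng.random(n)
eps = 1e-6
fd = np.array([(f(x0 + eps * np.eye(n)[i]) - f(x0 - eps * np.eye(n)[i])) / (2 * eps)
               for i in range(n)])
print(np.max(np.abs(fd - grad(x0))))
```

The barrier makes the Hessian positive definite on the interior whenever γ > 0, which is what the higher-order model exploits.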
To solve each of these subproblems in the CPD, we use a variant of AIPPA with p = 3 and β := 1/p = 1/3 (in the "upper-level" part). For these choices, we assume that the 4th derivative of f is bounded on Q ⊆ E with constant M_4 > 0 and let λ = 3M_4 [11]; here Q is the nonnegative orthant of appropriate size.
If we drop the constraints (set γ = 0), one need only consider (7) for computing X, since M_4 = 0 for (6) (and we would then lose the high convergence rates offered by the higher-order proximal algorithm).
A. The step 2 in AIPPA

We now present the solution to the problem at Step 2 in AIPPA, i.e., solving the minimization v_k = argmin_{x∈E} Φ_k(x), where

    Φ_k(x) = Φ_{k−1}(x) + a_k [ f(T_k) + ⟨∇f(T_k), x − T_k⟩ ].

Defining g_0 = 0 and g_k = g_{k−1} + a_k ∇f(T_k), the minimization problem simplifies to

    v_k = argmin_{x∈E}  g_k^T x + d_{p+1}(x − x_k),

which has the optimal solution

    v*_k = x_k − g_k / ||g_k||^{1−1/p}.

Note that y_k can be written as

    y_k = x_k − ( 1 − (k/(k+1))^4 ) g_k / ||g_k||^{1−1/p}.
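The closed form is easy to verify: with r = v* − x_k we have ||r|| = ||g_k||^{1/p}, so ||r||^{p−1} r = −g_k and the gradient g_k + ||r||^{p−1} r of the subproblem vanishes. Numerically (with illustrative data):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 3, 4
xk = rng.standard_normal(n)
gk = rng.standard_normal(n)

# closed-form minimizer of  g_k^T x + ||x - x_k||^{p+1} / (p + 1)
v = xk - gk / np.linalg.norm(gk) ** (1.0 - 1.0 / p)

# first-order optimality: g_k + ||v - x_k||^{p-1} (v - x_k) = 0
r = v - xk
resid = gk + np.linalg.norm(r) ** (p - 1) * r
print(np.linalg.norm(resid))
```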
B. The step 3 in AIPPA

Two issues arise when solving the higher-order proximal subproblem at Step 3 in AIPPA.

a) Determining M (the upper bound of the directional derivative): For (7), one can show that computing the uniform upper bound for the 4th derivative boils down to solving an eigenvalue problem for the 4th-order derivative tensor

    I ×_1 W^T ×_2 W^T ×_3 W^T ×_4 W^T.

Moreover, if γ > 0, one can easily see that D^4(−log(x_i)) is not bounded above as x_i → 0. In this case, we tune M_4 numerically along the iterations of our algorithm to ensure numerical stability and that the objective function decreases monotonically along the iterations.

b) Solving the high-order prox: Recall from Section II that a crucial step for AIPPA is the computation of the "lower-level" part in BLUM: solving T_k ∈ A_{3M_4,f}(y_k, 1/3), where we dropped the symbol p to ease the notation. In our case, we want to compute an approximate solution of the following 3rd-order proximal operator:

    prox_{f,3M_4}(y_k) := argmin_{x∈E} { f̄(x) := f(x) + 3M_4 d_4(x − y_k) },  (8)

where f is the objective function of (6) or (7).

For an efficient computation of T_k, we use Bregman gradient descent (BGD, Algorithm 2), where the update of x_{i+1} involves minimizing a linearization of f̄ (defined in (8)).

In Step 2 of BGD, we use L = 3/2 (suggested in [11]), and the term ρ_{y_k}(x_i, x) is the Bregman divergence between x_i and x with respect to ρ_{y_k}, defined as follows [11]:

    ρ_{y_k}(x_i, x) = ρ_{y_k}(x) − ρ_{y_k}(x_i) − ⟨∇ρ_{y_k}(x_i), x − x_i⟩,
Algorithm 2 Bregman gradient descent (BGD)
Input: y_k, β, M_4, γ ≥ 0; set x_0 = y_k
Output: An approximate solution to Problem (8)
1: while ||∇f̄(x_i)|| > β ||∇f(x_i)|| do
2:   x_{i+1} := argmin_{x∈E} ⟨∇f̄(x_i), x − x_i⟩ + L ρ_{y_k}(x_i, x)
3: end while

where

    ρ_{y_k}(x) := (1/2) ⟨∇²f(y_k)(x − y_k), x − y_k⟩ + 3M_4 d_4(x − y_k).
After some algebra, Step 2 of BGD can be simplified to the quartic minimization

    x_{i+1} = argmin_x  (1/2)(x − y_k)^T Q (x − y_k) + (2/3) g_{ki}^T x + (3M_4/4) ||x − y_k||^4,

where
- Q = ∇²f(y_k) is the Hessian of f,
- g_{ki} = ∇f(y_k) − (3/2) Q (x_i − y_k) − (3M_4/4) ||x_i − y_k||^2 (x_i − y_k).

We now discuss how to solve the quartic problem. Recall that by assumption f is convex, so the Hessian admits an eigenvalue decomposition (EVD). Let U diag(σ) U^T be the EVD of the Hessian Q, and let c = U^T g_{ki}, where g_{ki} is defined above. The optimal solution x*_{i+1} of the quartic minimization is given by

    x*_{i+1} = y_k − (2/3) U ( c / (σ + λ*) )    (entry-wise division),

where λ* is the unique nonnegative solution of the following nonlinear scalar problem:

    λ* = argmin_λ  (M_4/3) ( Σ_n c_n^2 / (σ_n + λ)^2 )^2 − Σ_n c_n^2 (λ + σ_n/2) / (σ_n + λ)^2.

We obtain λ* by a fixed-point iteration: setting the gradient of the above function w.r.t. λ to zero gives the update

    λ ← (4M_4/3) Σ_n c_n^2 / (σ_n + λ)^2.
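A sketch of this fixed-point scheme (test data and iteration count are our choices; for the values used here the iteration is a contraction), checking stationarity of the quartic at the recovered minimizer:

```python
import numpy as np

rng = np.random.default_rng(6)
n, M4 = 3, 1.0
yk = rng.standard_normal(n)
g = rng.standard_normal(n) * 0.3          # plays the role of g_ki

# convex quadratic part: Q PSD with EVD Q = U diag(sigma) U^T
A = rng.standard_normal((n, n))
Q = A.T @ A + np.eye(n)
sigma, U = np.linalg.eigh(Q)
c = U.T @ g

# fixed-point iteration for the secular equation
# lambda = (4 M4 / 3) * sum_n c_n^2 / (sigma_n + lambda)^2
lam = 0.0
for _ in range(200):
    lam = (4.0 * M4 / 3.0) * np.sum(c ** 2 / (sigma + lam) ** 2)

# recover the minimizer of
# 0.5 (x-y)^T Q (x-y) + (2/3) g^T x + (3 M4 / 4) ||x-y||^4
x = yk - (2.0 / 3.0) * U @ (c / (sigma + lam))

# stationarity check: Q(x-y) + (2/3) g + 3 M4 ||x-y||^2 (x-y) = 0
r = x - yk
resid = Q @ r + (2.0 / 3.0) * g + 3.0 * M4 * np.linalg.norm(r) ** 2 * r
print(np.linalg.norm(resid))
```

At the fixed point, λ = 3M_4||x − y_k||², so Q(x − y_k) + λ(x − y_k) = −(2/3)Uc = −(2/3)g_{ki}, which is exactly the stationarity condition of the quartic.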
Remark (on the quartic function): For the cases (6) and (7) with W full rank, the function f is strongly convex. However, the model f̄ is not globally strongly convex due to d_4, since x^4 is not bounded below by x^2 when |x| < 1.

Remark (why BCD): At first glance, it seems possible to optimize all the factor matrices at once [15] by stacking all the block variables into one large variable. However, this creates an explosion in dimension, making each BGD iteration very expensive to compute, as we would need to run a huge EVD of order Π_{i=1}^N I_i at every iteration.
C. IAHOM: Inexact Accelerated Higher-Order Method

Lastly, Algorithm 3 (IAHOM) summarizes the proposed general method for computing the rank-R CPD of an N-way input tensor T (with or without nonnegativity constraints), for both objective functions (6) and (7).
Algorithm 3 IAHOM for nonnegative CPD
Input: a nonnegative N-way tensor T, M_4 > 0, γ ≥ 0, rank R
Output: nonnegative factors U^(1), ..., U^(N)
Initialization: {U^(1)_0, ..., U^(N)_0}
1: for k = 0, 1, ... do
2:   for n = 1, ..., N do
3:     Update U^(n)_k as an inexact solution of
         min_{U^(n) ≥ 0} F(U^(1)_k, ..., U^(n−1)_k, U^(n), U^(n+1)_{k−1}, ...)
       by Algorithms 1 and 2.
4:   end for
5: end for
IV. NUMERICAL EXPERIMENTS

We now compare Algorithm 3 for Problems (6) and (7), respectively dubbed IAHOM-O2 and IAHOM-O4, with the well-known methods Hierarchical Alternating Least Squares (HALS) [16] and SDF-NLS [17] (an L-BFGS method) implemented in TensorLab [18]. We consider low-rank synthetic datasets: we generate each entry of {U^(n)} (1 ≤ n ≤ N) from the uniform distribution on [0,1] and let T := I ×_1 U^(1) ×_2 ··· ×_N U^(N). We consider N = 3, R ∈ {5, 10} and I_n ∈ {50, 100}. To compare the solutions generated by the algorithms, we report the evolution of the relative data-fitting error

    E(k) := || T − I ×_1 U^(1)_k ×_2 ··· ×_N U^(N)_k ||_F / || T ||_F

along iterations k. The results for the different datasets (differing in rank R and dimensions) are shown in Figs. 1-3. IAHOM-O2 and IAHOM-O4 converged faster than HALS and SDF-NLS in all cases.
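The error measure E(k) is straightforward to compute from the factors; a small sketch for N = 3 (our illustration, not the paper's experiment code):

```python
import numpy as np

def rel_error(T, factors):
    # E = || T - [[U1, U2, U3]] ||_F / || T ||_F  for a 3-way CPD
    U1, U2, U3 = factors
    approx = np.einsum("ir,jr,kr->ijk", U1, U2, U3)
    return np.linalg.norm(T - approx) / np.linalg.norm(T)

rng = np.random.default_rng(7)
R, dims = 5, (10, 10, 10)
U = [rng.random((d, R)) for d in dims]
T = np.einsum("ir,jr,kr->ijk", *U)

e_exact = rel_error(T, U)                              # exact factors: error ~ 0
e_rand = rel_error(T, [rng.random((d, R)) for d in dims])  # random factors: large error
print(e_exact, e_rand)
```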
V. CONCLUSIONS AND FURTHER WORK

We presented a first application of 3rd-order proximal methods within the BLUM framework. We extended the BLUM framework to solve constrained minimization problems, in particular with nonnegativity constraints, by using a log barrier. We provided two tractable algorithms to approximately solve the 3rd-order proximal operators and, ultimately, to compute low-rank approximations of tensors under the BCD framework. In experiments on synthetic data sets, we showed that the proposed algorithms can efficiently compute the nonnegative CPD of input tensors of moderate size, and that they enjoy faster convergence than the state-of-the-art methods.

This work is only preliminary, and further work will focus on:
- developing new approaches for more constraints,
- developing efficient routines for the fast estimation of the uniform bound M, which is critical for the algorithm to achieve faster convergence,
- the theoretical analysis of the BLUM framework for constrained minimization problems.
Fig. 1. Results for [I_1, I_2, I_3, R] = [50, 50, 50, 5].
Fig. 2. Results for [I_1, I_2, I_3, R] = [50, 50, 50, 10].
Fig. 3. Results for [I_1, I_2, I_3, R] = [100, 100, 100, 10].
(The plots show the relative fitting error E(k) versus iterations, on a log scale from 10^0 down to about 10^-7.)
VI. ACKNOWLEDGEMENT
This research was partially supported by the Ministry
of Education and Science of the Russian Federation (grant
075.10.2021.068).
Other information

Contact: V.Leplat@skoltech.ru, a.phan@skoltech.ru, andersen.ang@soton.ac.uk
22nd IEEE Statistical Signal Processing Workshop, 2-5 July 2023, Hanoi, Vietnam