
Journal of Multivariate Analysis 98 (2007) 625–637

www.elsevier.com/locate/jmva

Properties of cyclic subspace regression

Patrick Lang∗, Ann Gironella, Rienk Venema

Idaho State University, ID, USA

Received 1 June 2005

Available online 7 July 2006

Abstract

Various properties of the regression vector $\hat{\beta}_{kl}$ produced by cyclic subspace regression with regard to the mean-centered linear regression equation $\tilde{y} = X\beta + \tilde{\epsilon}$ are put forth. In particular, the subspace associated with the creation of $\hat{\beta}_{kl}$ is shown to contain a basis that maximizes certain covariances with respect to $P_l\tilde{y}$, the orthogonal projection of $\tilde{y}$ onto a specific subspace of the range of $X$. This basis is constructed. Moreover, this paper shows how the maximum covariance values affect the $\hat{\beta}_{kl}$. Several alternative representations of $\hat{\beta}_{kl}$ are also developed. These representations show that $\hat{\beta}_{kl}$ is a modified version of the $l$-factor principal components regression vector $\hat{\beta}_{ll}$, with the modification occurring by a nonorthogonal projection. Additionally, these representations enable prediction properties associated with $\hat{\beta}_{kl}$ to be explicitly identified. Finally, methods for choosing factors are spelled out.

© 2006 Elsevier Inc. All rights reserved.

AMS 1991 subject classification: 62J05; 62J10; 62H12

Keywords: Cyclic subspace regression; Covariance maximization; Regression vector representation; Prediction; Partial

principal components; Factor selection

1. Introduction

In applications one often tries to fit a linear function of variables $x_1,\dots,x_p$ to another variable $y$. This is typically accomplished by obtaining $n$ pieces of data for each variable, relating them by the expression $y = X\beta + \epsilon$, where $y$ is an $n \times 1$ vector whose components are the $n$ data values for $y$, $X$ is an $n \times p$ matrix with rank $r$ whose columns are the $n$ data values of each of the $x_i$, and $\epsilon$ represents error, and then finding a $p \times 1$ vector $\hat{\beta}$ that “best fits” the information contained in the equation. The method of least squares (LS) is often employed to find such a best fit. When employed it finds the unique vector in the $r$-dimensional space $R(X^t)$, the range of $X^t$, that is sent

∗Fax: +12082822636.

E-mail address: langpatr@isu.edu (P. Lang).

0047-259X/$-see front matter © 2006 Elsevier Inc. All rights reserved.

doi:10.1016/j.jmva.2006.05.004


by X to the orthogonal projection of y onto the r-dimensional space R(X). This solution is said

to depend on r factors.

Sometimes, because of issues of noise, multicollinearity, and so forth, a solution dependent upon a smaller number of factors is desired. Two commonly used methods for finding a $k \le r$ factor solution are principal components regression (PCR) and partial least squares (PLS). To describe these solutions it is noted that associated with $X$ are positive singular values $\sigma_1,\dots,\sigma_r$, satisfying $\sigma_1 \ge \cdots \ge \sigma_r$, and singular vectors $v_1,\dots,v_r$ and orthonormal vectors $u_1,\dots,u_r$ satisfying $Xv_i = \sigma_i u_i$ for $i = 1,\dots,r$, $\mathrm{span}\{v_1,\dots,v_r\} = R(X^t)$, and $\mathrm{span}\{u_1,\dots,u_r\} = R(X)$. PCR obtains its $k$-factor solution by finding the unique vector in the $k$-dimensional space $\mathrm{span}\{v_1,\dots,v_k\} \subseteq R(X^t)$ that is sent by $X$ to the orthogonal projection of $y$ onto the $k$-dimensional space $\mathrm{span}\{u_1,\dots,u_k\} \subseteq R(X)$. It is often claimed that this solution is better than the LS solution because it does not carry the information in directions $v_{k+1},\dots,v_r$ which may be associated with noise, error, etc. PLS obtains its $k$-factor solution by finding the unique vector in the $k$-dimensional space $\mathrm{span}\{z_r,(X^tX)z_r,\dots,(X^tX)^{k-1}z_r\} \subseteq R(X^t)$ that is sent by $X$ to the orthogonal projection of $y$ onto the $k$-dimensional space $\mathrm{span}\{Xz_r,X(X^tX)z_r,\dots,X(X^tX)^{k-1}z_r\} \subseteq R(X)$, where $z_r = \sum_{i=1}^{r}\sigma_i(u_i^t y)v_i$. This solution is sometimes said to be better than both the PCR and LS solutions because it comes from a subspace that depends on the information contained in both $y$ and $X$, whereas the solutions from PCR and LS come from subspaces that depend on some or all of the information in $X$ and not at all on $y$.

Recently, in [7], another $k$-factor solution method was put forward. This method, known as cyclic subspace regression (CSR), obtains its $k$-factor solution by choosing $l$, where $1 \le l \le r$, and then choosing $k \le l$. Next it finds the unique vector in the $k$-dimensional space $\mathrm{span}\{z_l,(X^tX)z_l,\dots,(X^tX)^{k-1}z_l\} \subseteq \mathrm{span}\{v_1,\dots,v_l\} \subseteq R(X^t)$ that is sent by $X$ to the orthogonal projection of $y$ onto the $k$-dimensional space $\mathrm{span}\{Xz_l,X(X^tX)z_l,\dots,X(X^tX)^{k-1}z_l\} \subseteq \mathrm{span}\{u_1,\dots,u_l\} \subseteq R(X)$, where $z_l = \sum_{i=1}^{l}\sigma_i(u_i^t y)v_i$. Should $k = l = r$, this method is just that of LS, should $k = l < r$, this method is just PCR, and if $k < l = r$ this method is PLS. All other situations are different from LS, PCR, and PLS, and the solutions are referred to as partial principal components solutions. In [7] it has been shown that there are situations where use of a $k$-factor solution obtained by use of CSR with $k < l < r$ has led to better predictive models than those obtained by LS, PCR, or PLS.

The method of CSR has seen limited application; for example [1,5]. However, the authors believe it should see more use, because it has both of the qualities that have made PCR and PLS individually useful, namely it has the “noise” reduction feature of PCR in it, obtained by eliminating use of information in the directions $v_{l+1},\dots,v_r$, and it has the usage of information from both $y$ and $X$ that makes PLS appealing. Additionally, it is no harder to implement than either PCR or PLS. Based on this belief the authors have, in this paper, taken the time to further develop issues associated with CSR that were not discussed in [7]. In particular, after establishing some background basics, sample covariance maximizing bases for subspaces associated with CSR are identified and used to represent and obtain regression results that parallel those of PCR and PLS. Additionally, guidelines based on covariance properties are set out for choosing the appropriate number of factors in CSR.
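The PCR description in the Introduction can be checked numerically. The following is a minimal sketch, not the authors' code: the function name `pcr_solution` and the random test data are assumptions made for illustration. It builds the $k$-factor PCR vector from the singular value decomposition and confirms that $X$ sends it to the orthogonal projection of $y$ onto $\mathrm{span}\{u_1,\dots,u_k\}$, and that taking $k = r$ recovers the LS solution.

```python
import numpy as np

def pcr_solution(X, y, k):
    """k-factor PCR vector: the element of span{v_1..v_k} that X maps to
    the orthogonal projection of y onto span{u_1..u_k}."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # beta = sum_{i<=k} (u_i^t y / sigma_i) v_i
    return Vt[:k].T @ ((U[:, :k].T @ y) / s[:k])

# Hypothetical data: a random 20 x 5 matrix has rank r = 5 almost surely.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 3
beta_k = pcr_solution(X, y, k)
proj_k = U[:, :k] @ (U[:, :k].T @ y)   # projection of y onto span{u_1..u_k}
pcr_hits_projection = np.allclose(X @ beta_k, proj_k)

# With k = r the PCR vector coincides with the least squares solution.
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
pcr_equals_ls = np.allclose(pcr_solution(X, y, 5), beta_ls)
print(pcr_hits_projection, pcr_equals_ls)
```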

2. Background, assumptions and notation

Let $y$ and $x_1,\dots,x_p$ be variables believed to be related by

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon,$$


where $\epsilon$ represents error and the $\beta_i$, $i = 0,\dots,p$, are coefficients to be estimated from $n$ observations of $y$: $y_1,\dots,y_n$ and of $x_1,\dots,x_p$: $x_{i1},\dots,x_{ip}$, $i = 1,\dots,n$. In particular, the equations

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i,$$

for $i = 1,\dots,n$ are used to estimate the regression coefficients.

Let $y = (y_i) \in R^n$, $x_j = (x_{ij}) \in R^n$, $j = 1,\dots,p$, $\mathbf{1} = (1) \in R^n$, and $\epsilon = (\epsilon_i) \in R^n$. Using this notation the equations above can be rewritten as

$$y = (\mathbf{1}\ \ x_1\ \cdots\ x_p)\begin{pmatrix}\beta_0\\ \beta_1\\ \vdots\\ \beta_p\end{pmatrix} + \epsilon. \qquad (2.1)$$

In what follows (2.1) is decomposed into a vector equation involving the unknowns $\beta_1,\dots,\beta_p$ and a scalar equation involving all of the $\beta_i$.

Set $V = \{v \in R^n \mid v = \alpha\mathbf{1},\ \alpha \in R\}$ and $W = \{w \in R^n \mid \mathbf{1}^t w = 0\}$. These two sets are orthogonal subspaces of $R^n$ having only the zero vector in common. Moreover, they are such that every vector in $R^n$ can be written uniquely as the sum of a vector from $V$ and a vector from $W$. In other words, $R^n$ can be expressed as an orthogonal direct sum of these spaces.

If $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$, and $\bar{\epsilon} = \frac{1}{n}\sum_{i=1}^{n} \epsilon_i$, then the following are sample mean-centered vectors:

$$\tilde{y} = y - \bar{y}\mathbf{1}, \qquad \tilde{x}_j = x_j - \bar{x}_j\mathbf{1}, \qquad \tilde{\epsilon} = \epsilon - \bar{\epsilon}\mathbf{1},$$

for $j = 1,\dots,p$. Note that $\mathbf{1}^t\tilde{y} = \mathbf{1}^t\tilde{x}_j = \mathbf{1}^t\tilde{\epsilon} = 0$, which implies that $\tilde{y}$, $\tilde{x}_j$ and $\tilde{\epsilon}$ belong to $W$. Solving for $y$, $x_j$, and $\epsilon$ in these equations and substituting those values into (2.1) yields

$$0 = (\beta_0 + \beta_1\bar{x}_1 + \cdots + \beta_p\bar{x}_p + \bar{\epsilon} - \bar{y})\mathbf{1} + (\beta_1\tilde{x}_1 + \cdots + \beta_p\tilde{x}_p + \tilde{\epsilon} - \tilde{y}),$$

which shows $0$ to be a sum of vectors from $V$ and $W$. This fact, combined with the uniqueness of vector decomposition with respect to $V$ and $W$, implies that

$$\tilde{y} = \beta_1\tilde{x}_1 + \cdots + \beta_p\tilde{x}_p + \tilde{\epsilon} \qquad (2.2)$$

and

$$\bar{y} = \beta_0 + \beta_1\bar{x}_1 + \cdots + \beta_p\bar{x}_p + \bar{\epsilon}. \qquad (2.3)$$

In what follows the regression coefficients $\beta_0,\dots,\beta_p$ are estimated by first “solving” (2.2) using the method of cyclic subspace regression (described below), yielding $\hat{\beta}_1,\dots,\hat{\beta}_p$, and then estimating $\beta_0$ by $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}_1 - \cdots - \hat{\beta}_p\bar{x}_p$.

Eqs. (2.2) and (2.3) result from a rigorous development of mean centering. It is noted that sometimes in regression analysis standardization techniques are also employed, particularly in situations where the independent variables are on radically different scales. In presenting the material for this paper the authors had to make a choice: either present the material from a mean-centered viewpoint or present it from a standardized viewpoint. The former was chosen as it is a simpler modification of the data.
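The centering step and the intercept estimate can be illustrated with a short sketch. The helper names `mean_center` and `intercept_estimate` and the noise-free data are hypothetical. Ordinary least squares is used on the centered equation purely as a stand-in for the cyclic subspace regression step, which is only defined later in this section.

```python
import numpy as np

def mean_center(y, X_raw):
    """Return centered y and X plus the sample means that were removed."""
    y_bar = y.mean()
    x_bar = X_raw.mean(axis=0)
    return y - y_bar, X_raw - x_bar, y_bar, x_bar

def intercept_estimate(y_bar, x_bar, beta_hat):
    """beta0_hat = y_bar - beta1_hat * x1_bar - ... - betap_hat * xp_bar."""
    return y_bar - x_bar @ beta_hat

# Illustrative noise-free data generated from y = 2 + 3*x1 - x2 (assumed values).
rng = np.random.default_rng(0)
X_raw = rng.standard_normal((15, 2)) + 5.0
y = 2.0 + X_raw @ np.array([3.0, -1.0])

y_c, X_c, y_bar, x_bar = mean_center(y, X_raw)

# Centered vectors lie in W: they are orthogonal to the vector of ones.
in_W = np.allclose(X_c.sum(axis=0), 0.0) and np.isclose(y_c.sum(), 0.0)

# Stand-in solver for (2.2): least squares instead of CSR (an assumption).
beta_hat = np.linalg.lstsq(X_c, y_c, rcond=None)[0]
beta0_hat = intercept_estimate(y_bar, x_bar, beta_hat)
print(in_W, beta_hat, beta0_hat)
```

Because the data carry no noise, the slopes and the intercept used to generate them are recovered up to rounding, which makes the role of (2.3) in estimating $\beta_0$ easy to see.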


Let $X = (\tilde{x}_1\ \cdots\ \tilde{x}_p)$ and $\beta = (\beta_1\ \cdots\ \beta_p)^t$. With this notation (2.2) can be rewritten as

$$\tilde{y} = X\beta + \tilde{\epsilon}. \qquad (2.4)$$

Throughout this paper, the rank of $X$, i.e., the number of linearly independent columns in $X$, is assumed to be $r$, where $1 \le r \le \min\{n,p\}$. A direct consequence of this assumption is that $X^tX$ has $p$ orthonormal eigenvectors $v_1,\dots,v_p$ with the first $r$ being associated with the $r$ eigenvalues $\lambda_1 \ge \cdots \ge \lambda_r > 0$ and the last $p - r$ being associated with the eigenvalue $0$. Let $u_i = \frac{1}{\sigma_i}Xv_i$ for $i = 1,\dots,r$, where $\sigma_i = \sqrt{\lambda_i}$. These vectors $u_i$, $v_i$ and numbers $\sigma_i$, for $i = 1,\dots,r$, are known as the singular vectors and values associated with $X$. The $u_i$, for $i = 1,\dots,r$, form an orthonormal basis for $R(X)$ and the $v_i$, for $i = 1,\dots,r$, form an orthonormal basis for $R(X^t)$ [8]. The singular values $\sigma_i$ are assumed to be distinct in this paper.

The method of cyclic subspace regression is described now. For a more complete description see [7]. Fix an integer $l$ so that $1 \le l \le r$ and let

$$P_l = (u_1\ u_2\ \cdots\ u_l)(u_1\ u_2\ \cdots\ u_l)^t.$$

Then $P_l$ is an orthogonal projection of $R^n$ onto the span of $u_1,\dots,u_l$ and the vector $P_l\tilde{y} = \sum_{i=1}^{l}(u_i^t\tilde{y})u_i$ is the orthogonal projection of $\tilde{y}$ into the space spanned by $u_1,\dots,u_l$. In this paper, it is assumed that $(u_i^t\tilde{y}) \ne 0$ for $i = 1,\dots,l$. Set $z_l = X^tP_l\tilde{y}$ and let $k$ be a fixed integer such that $1 \le k \le l$. Associated with $z_l$ is the $k$-dimensional subspace of $R^p$

$$V_{k,l} = \mathrm{span}\{z_l,(X^tX)z_l,\dots,(X^tX)^{k-1}z_l\}.$$

It should be noted that the assumptions $(u_i^t\tilde{y}) \ne 0$ and $\sigma_i$ distinct are needed to ensure that $V_{k,l}$ is indeed $k$-dimensional. Furthermore, these assumptions allow it to be shown that $V_{l,l} = \mathrm{span}\{v_1,\dots,v_l\}$ (see Claim 1 in the Appendix). Define

$$A_{kl} = (z_l\ \ (X^tX)z_l\ \cdots\ (X^tX)^{k-1}z_l)$$

and set $B_{kl} = XA_{kl}$. The “solution” to (2.4) by cyclic subspace regression is

$$\hat{\beta}_{kl} = A_{kl}(B_{kl}^tB_{kl})^{-1}B_{kl}^t\tilde{y} \in R^p. \qquad (2.5)$$

This vector $\hat{\beta}_{kl}$ is the unique vector in the span of the columns of $A_{kl}$ that is sent by $X$ onto the orthogonal projection of $\tilde{y}$ onto the span of the columns of $B_{kl}$.
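A minimal numerical sketch of (2.5) follows. It is one possible implementation rather than the authors' own, and the names `krylov_basis` and `csr_solution` are hypothetical. Instead of forming $A_{kl}$ and inverting $B_{kl}^tB_{kl}$ directly, the sketch orthonormalizes the vectors $z_l,(X^tX)z_l,\dots$ by Gram-Schmidt and solves a small least squares problem. In exact arithmetic this gives the same vector, because (2.5) depends only on the column space of $A_{kl}$, and it avoids the poor conditioning of raw matrix powers.

```python
import numpy as np

def krylov_basis(G, z, k):
    """Orthonormal basis of span{z, Gz, ..., G^(k-1) z} (assumes dimension k)."""
    Q = np.zeros((z.size, k))
    Q[:, 0] = z / np.linalg.norm(z)
    for j in range(1, k):
        w = G @ Q[:, j - 1]
        for _ in range(2):  # two Gram-Schmidt passes for numerical safety
            w = w - Q[:, :j] @ (Q[:, :j].T @ w)
        Q[:, j] = w / np.linalg.norm(w)
    return Q

def csr_solution(X, y_tilde, k, l):
    """CSR vector for mean-centered X and y_tilde, with 1 <= k <= l <= rank(X)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Pl_y = U[:, :l] @ (U[:, :l].T @ y_tilde)   # P_l y_tilde
    z_l = X.T @ Pl_y                           # z_l = X^t P_l y_tilde
    Q = krylov_basis(X.T @ X, z_l, k)          # columns span V_{k,l}
    # Project y_tilde onto span(XQ) and pull the result back into V_{k,l}.
    coef = np.linalg.lstsq(X @ Q, y_tilde, rcond=None)[0]
    return Q @ coef

# Hypothetical centered data with p = 4, so r = 4 almost surely.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
X -= X.mean(axis=0)
y_tilde = rng.standard_normal(30)
y_tilde -= y_tilde.mean()

beta_23 = csr_solution(X, y_tilde, k=2, l=3)   # a case with k < l < r
beta_ls = np.linalg.lstsq(X, y_tilde, rcond=None)[0]
print(np.allclose(csr_solution(X, y_tilde, 4, 4), beta_ls))
```

The final line compares the case $k = l = r$ with least squares, which the Introduction states should agree.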

3. Maximization properties

The space $V_{k,l}$ has many bases, in particular, the columns of $A_{kl}$ are a basis. If $k = l$, then $v_1,\dots,v_l$ also form a basis. In principal components analysis these vectors $v_i$ can be used to create new variables $z_i = v_i^t x$ that have maximum variance subject to decorrelation constraints [4], where $x$ is a $p \times 1$ vector consisting of the original independent variables shifted by their sample means. The next result shows that each $V_{k,l}$ also contains a basis that maximizes the sample covariance with $P_l\tilde{y}$. Before stating and proving this fact it is noted that for $w$ in $R^p$, the sample covariance between $Xw$ and $P_l\tilde{y}$ is

$$\mathrm{cov}(Xw,P_l\tilde{y}) = \frac{(Xw)^tP_l\tilde{y}}{n-1} = \frac{w^tz_l}{n-1}.$$
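The covariance identity can be checked numerically. This is a small sketch under the assumption that $X$ and $\tilde{y}$ are already mean-centered; the function name `cov_with_projection` and the random data are illustrative only. It compares NumPy's sample covariance of $Xw$ and $P_l\tilde{y}$ with the shortcut $w^tz_l/(n-1)$.

```python
import numpy as np

def cov_with_projection(X, y_tilde, l, w):
    """Return (sample covariance of Xw and P_l y_tilde, w^t z_l / (n - 1))."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Pl_y = U[:, :l] @ (U[:, :l].T @ y_tilde)   # P_l y_tilde
    z_l = X.T @ Pl_y                           # z_l = X^t P_l y_tilde
    n = X.shape[0]
    direct = np.cov(X @ w, Pl_y)[0, 1]         # uses the n - 1 divisor
    shortcut = (w @ z_l) / (n - 1)
    return direct, shortcut

# Hypothetical centered data and an arbitrary direction w.
rng = np.random.default_rng(0)
X = rng.standard_normal((25, 4))
X -= X.mean(axis=0)
y_tilde = rng.standard_normal(25)
y_tilde -= y_tilde.mean()
w = rng.standard_normal(4)

direct, shortcut = cov_with_projection(X, y_tilde, l=2, w=w)
print(np.isclose(direct, shortcut))
```

The two values agree because both $Xw$ and $P_l\tilde{y}$ have sample mean zero when the columns of $X$ are centered, so the covariance reduces to an inner product divided by $n-1$.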


Theorem 1. There exists a list of linearly independent $R^p$ vectors $w_1,\dots,w_l$ such that

1. $w_i$ maximizes $w^tz_l$ subject to the constraints $w_i^tw_i = 1$ for $i = 1,\dots,l$ and $w_i^tX^tXw_j = 0$ for $i = 2,\dots,l$ and $j = 1,\dots,i-1$, and
2. $V_{k,l} = \mathrm{span}\{w_1,\dots,w_k\}$ for any $1 \le k \le l$.
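The construction used in the proof that follows (normalize $z_l$, then repeatedly normalize the part of $z_l$ orthogonal to $\mathrm{span}\{X^tXw_1,\dots,X^tXw_j\}$) can be sketched as below. The function name `covariance_basis` and the random data are assumptions for illustration, and the check covers only a single case of the span identity in item 2.

```python
import numpy as np

def covariance_basis(X, y_tilde, l):
    """Build w_1..w_l as in the proof of Theorem 1; also return z_l and X^t X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    z = X.T @ (U[:, :l] @ (U[:, :l].T @ y_tilde))   # z_l = X^t P_l y_tilde
    G = X.T @ X
    W = np.zeros((X.shape[1], l))
    W[:, 0] = z / np.linalg.norm(z)                 # w_1 = z_l / ||z_l||
    for j in range(1, l):
        T = G @ W[:, :j]                 # columns span T_j
        Q, _ = np.linalg.qr(T)           # orthonormal basis of T_j
        s_j = z - Q @ (Q.T @ z)          # component of z_l orthogonal to T_j
        W[:, j] = s_j / np.linalg.norm(s_j)
    return W, z, G

# Hypothetical centered data with p = 5 and l = 3.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
X -= X.mean(axis=0)
y_tilde = rng.standard_normal(30)
y_tilde -= y_tilde.mean()
W, z, G = covariance_basis(X, y_tilde, l=3)

# Constraint check: W^t (X^t X) W should be diagonal.
M = W.T @ G @ W
constraints_ok = np.allclose(M - np.diag(np.diag(M)), 0.0, atol=1e-8)

# Span check for k = 2: z_l and (X^t X) z_l should lie in span{w_1, w_2}.
A = np.column_stack([z, G @ z])
A /= np.linalg.norm(A, axis=0)
Qw, _ = np.linalg.qr(W[:, :2])
span_ok = np.linalg.norm(A - Qw @ (Qw.T @ A)) < 1e-8
print(constraints_ok, span_ok)
```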

Proof. The proof will proceed by two separate induction arguments. To begin, note that the function $f$ whose domain is $D_0 = \{w \in R^p \mid w^tw = 1\}$ and whose value at $w$ is $f(w) = w^tz_l$ is a continuous real-valued function defined on a compact set. This implies that there is a vector in $D_0$ that maximizes the value of $f$. To find such a maximizer note that $f(w) = \sqrt{z_l^tz_l}\cos\theta$, where $\theta$ is the angle (in radians) between $w$ and $z_l$. Clearly this expression is maximized when $\theta = 0$, i.e., when $w$ is a positive multiple of $z_l$. The condition $w^tw = 1$ forces the multiple to be $1/\sqrt{z_l^tz_l}$. Thus, $f$ has a unique maximizer and it is $w_1 = (1/\sqrt{z_l^tz_l})z_l$. (As an aside, it should be noted that the argument just presented is similar to the ones used to establish conditions for equality in the Cauchy–Schwarz inequality.)

Now consider the problem: maximize $w^tz_l$ subject to the conditions $w^tw = 1$ and $w^tX^tXw_1 = \cdots = w^tX^tXw_j = 0$, where $1 \le j \le l$. The function $f$ restricted to $D_j = \{w \in R^p \mid w^tw = 1$ and $w^tX^tXw_1 = \cdots = w^tX^tXw_j = 0\}$ is still continuous and defined on a compact set. Therefore, there exists a vector that maximizes the value of $f$ on $D_j$. To find this maximizer, set $S_j = \{w \in R^p \mid w^tX^tXw_1 = \cdots = w^tX^tXw_j = 0\}$ and $T_j = \mathrm{span}\{X^tXw_1,\dots,X^tXw_j\}$ and note that $R^p$ can be expressed as an orthogonal direct sum of these two spaces. This implies that $z_l = s_j + t_j$, where $s_j \in S_j$ and $t_j \in T_j$. Now for $w \in D_j$, it follows that $w^tz_l = w^t(s_j + t_j) = w^ts_j = \sqrt{s_j^ts_j}\cos\theta$, where $\theta$ is the angle between $w$ and $s_j$. The conditions of membership in $D_j$, together with the fact that $w^tz_l$ is maximized when $w = \alpha s_j$, where $\alpha > 0$, imply that this problem has $w_{j+1} = (1/\sqrt{s_j^ts_j})s_j$ as the unique solution.

The above has used induction to show that there exist $l$ unique vectors $w_1,\dots,w_l$ such that $w_i^tz_l$ is maximal subject to the constraints $w_i^tw_i = 1$ and $w_i^tX^tXw_j = 0$ when $i \ne j$.

For $j = 1,\dots,l$, let $W_{j,l} = \mathrm{span}\{w_1,\dots,w_j\}$. By way of induction, it will be shown that $W_{j,l} = V_{j,l}$. To begin, let $j = 1$. Since $w_1 = (1/\sqrt{z_l^tz_l})z_l$, it follows that $W_{1,l} = V_{1,l}$, thereby proving the result for $j = 1$. Now assume for $j = s$ that $W_{s,l} = V_{s,l}$ and consider what this implies for $j = s + 1$.

Let $w_0 \in V_{s+1,l}$ be such that $w_0^tw_0 = 1$ and $w_0^t(X^tX)^iz_l = 0$ for $i = 1,\dots,s$. These conditions imply, after some elementary considerations, that there exists a nonzero constant $\gamma$ such that $w_0 = \gamma(z_l + z)$, where

$$z \in X^tX(V_{s,l}) = \mathrm{span}\{(X^tX)z_l,(X^tX)^2z_l,\dots,(X^tX)^sz_l\}$$

(see Claim 2 in the Appendix). It should be noted that the induction hypothesis implies that

$$X^tX(V_{s,l}) = X^tX(W_{s,l}) = \mathrm{span}\{(X^tX)w_1,(X^tX)w_2,\dots,(X^tX)w_s\}.$$

This fact implies that if $x \in R^p$ is such that $x^tx = 1$ and $x^tX^tXw_i = 0$ for $i = 1,\dots,l$, then $x^t(X^tX)^iz_l = 0$ for $i = 1,\dots,l$. Consequently, $x^tz_l = x^t(z_l + z) = x^tw_0/\gamma$. This last term is maximized when $x = \pm w_0$, where the sign is chosen to correspond to the sign of $\gamma$.

The results of the last paragraph, together with the properties held by $w_{s+1}$, imply that $w_{s+1}^tz_l = w_{s+1}^tw_0/\gamma$. As $w_{s+1}$ maximizes $w^tz_l$, it follows that $w_{s+1} = \pm w_0$ and hence $W_{s+1,l} \subseteq V_{s+1,l}$.