
Journal of Multivariate Analysis 98 (2007) 625–637

www.elsevier.com/locate/jmva

Properties of cyclic subspace regression

Patrick Lang∗, Ann Gironella, Rienk Venema

Idaho State University, ID, USA

Received 1 June 2005

Available online 7 July 2006

Abstract

Various properties of the regression vector $\hat\beta_{kl}$ produced by cyclic subspace regression with regard to the mean-centered linear regression equation $\tilde y = X\beta + \tilde\varepsilon$ are put forth. In particular, the subspace associated with the creation of $\hat\beta_{kl}$ is shown to contain a basis that maximizes certain covariances with respect to $P_l\tilde y$, the orthogonal projection of $\tilde y$ onto a specific subspace of the range of $X$. This basis is constructed. Moreover, this paper shows how the maximum covariance values affect $\hat\beta_{kl}$. Several alternative representations of $\hat\beta_{kl}$ are also developed. These representations show that $\hat\beta_{kl}$ is a modified version of the $l$-factor principal components regression vector $\hat\beta_{ll}$, with the modification occurring by a nonorthogonal projection. Additionally, these representations enable prediction properties associated with $\hat\beta_{kl}$ to be explicitly identified. Finally, methods for choosing factors are spelled out.

© 2006 Elsevier Inc. All rights reserved.

AMS 1991 subject classification: 62J05; 62J10; 62H12

Keywords: Cyclic subspace regression; Covariance maximization; Regression vector representation; Prediction; Partial principal components; Factor selection

1. Introduction

In applications one often tries to fit a linear function of variables $x_1,\dots,x_p$ to another variable $y$. This is typically accomplished by obtaining $n$ pieces of data for each variable, relating them by the expression $y = X\beta + \varepsilon$, where $y$ is an $n \times 1$ vector whose components are the $n$ data values for $y$, $X$ is an $n \times p$ matrix with rank $r$ whose columns are the $n$ data values of each of the $x_i$, and $\varepsilon$ represents error, and then finding a $p \times 1$ vector $\hat\beta$ that "best fits" the information contained in the equation. The method of least squares (LS) is often employed to find such a best fit. When employed it finds the unique vector in the $r$-dimensional space $R(X^t)$, the range of $X^t$, that is sent

∗Fax: +1 208 282 2636.

E-mail address: langpatr@isu.edu (P. Lang).

0047-259X/$ - see front matter © 2006 Elsevier Inc. All rights reserved.

doi:10.1016/j.jmva.2006.05.004


by X to the orthogonal projection of y onto the r-dimensional space R(X). This solution is said

to depend on r factors.

Sometimes, because of issues of noise, multicollinearity, and so forth, a solution dependent upon a smaller number of factors is desired. Two commonly used methods for finding a $k \leq r$ factor solution are principal components regression (PCR) and partial least squares (PLS). To describe these solutions it is noted that associated with $X$ are positive singular values $\sigma_1,\dots,\sigma_r$, satisfying $\sigma_1 \geq \cdots \geq \sigma_r$, and singular vectors $v_1,\dots,v_r$ and orthonormal vectors $u_1,\dots,u_r$ satisfying $Xv_i = \sigma_i u_i$ for $i = 1,\dots,r$, $\mathrm{span}\{v_1,\dots,v_r\} = R(X^t)$, and $\mathrm{span}\{u_1,\dots,u_r\} = R(X)$. PCR obtains its $k$-factor solution by finding the unique vector in the $k$-dimensional space $\mathrm{span}\{v_1,\dots,v_k\} \subseteq R(X^t)$ that is sent by $X$ to the orthogonal projection of $y$ onto the $k$-dimensional space $\mathrm{span}\{u_1,\dots,u_k\} \subseteq R(X)$. It is often claimed that this solution is better than the LS solution because it does not carry the information in directions $v_{k+1},\dots,v_r$, which may be associated with noise, error, etc. PLS obtains its $k$-factor solution by finding the unique vector in the $k$-dimensional space $\mathrm{span}\{z_r, (X^tX)z_r, \dots, (X^tX)^{k-1}z_r\} \subseteq R(X^t)$ that is sent by $X$ to the orthogonal projection of $y$ onto the $k$-dimensional space $\mathrm{span}\{Xz_r, X(X^tX)z_r, \dots, X(X^tX)^{k-1}z_r\} \subseteq R(X)$, where $z_r = \sum_{i=1}^{r}\sigma_i(u_i^t y)v_i$. This solution is sometimes said to be better than both the PCR and LS solutions because it comes from a subspace that depends on the information contained in both $y$ and $X$, whereas the solutions from PCR and LS come from subspaces that depend on some or all of the information in $X$ and not at all on $y$.

Recently, in [7], another $k$-factor solution method was put forward. This method, known as cyclic subspace regression (CSR), obtains its $k$-factor solution by choosing $l$, where $1 \leq l \leq r$, and then choosing $k \leq l$. Next it finds the unique vector in the $k$-dimensional space $\mathrm{span}\{z_l, (X^tX)z_l, \dots, (X^tX)^{k-1}z_l\} \subseteq \mathrm{span}\{v_1,\dots,v_l\} \subseteq R(X^t)$ that is sent by $X$ to the orthogonal projection of $y$ onto the $k$-dimensional space $\mathrm{span}\{Xz_l, X(X^tX)z_l, \dots, X(X^tX)^{k-1}z_l\} \subseteq \mathrm{span}\{u_1,\dots,u_l\} \subseteq R(X)$, where $z_l = \sum_{i=1}^{l}\sigma_i(u_i^t y)v_i$. Should $k = l = r$, this method is just that of LS; should $k = l < r$, this method is just PCR; and if $k < l = r$, this method is PLS. All other situations are different from LS, PCR, and PLS, and the solutions are referred to as partial principal components solutions. In [7] it has been shown that there are situations where use of a $k$-factor solution obtained by use of CSR with $k < l < r$ has led to better predictive models than those obtained by LS, PCR, or PLS.

The method of CSR has seen limited application; see, for example, [1,5]. However, the authors believe it should see more use, because it has both of the qualities that have made PCR and PLS individually useful: it has the "noise" reduction feature of PCR, obtained by eliminating use of information in the directions $v_{l+1},\dots,v_r$, and it has the usage of information from both $y$ and $X$ that makes PLS appealing. Additionally, it is no harder to implement than either PCR or PLS. Based on this belief the authors have, in this paper, taken the time to further develop issues associated with CSR that were not discussed in [7]. In particular, after establishing some background basics, sample covariance maximizing bases for subspaces associated with CSR are identified and used to represent and obtain regression results that parallel those of PCR and PLS. Additionally, guidelines based on covariance properties are set out for choosing the appropriate number of factors in CSR.

2. Background, assumptions and notation

Let $y$ and $x_1,\dots,x_p$ be variables believed to be related by

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon,$$

where $\varepsilon$ represents error and the $\beta_i$, $i = 0,\dots,p$, are coefficients to be estimated from $n$ observations of $y$: $y_1,\dots,y_n$, and of $x_1,\dots,x_p$: $x_{i1},\dots,x_{ip}$, $i = 1,\dots,n$. In particular, the equations

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i$$

for $i = 1,\dots,n$ are used to estimate the regression coefficients.

Let $y = (y_i) \in \mathbb{R}^n$, $x_j = (x_{ij}) \in \mathbb{R}^n$, $j = 1,\dots,p$, $1 = (1) \in \mathbb{R}^n$, and $\varepsilon = (\varepsilon_i) \in \mathbb{R}^n$. Using this notation the equations above can be rewritten as

$$y = \begin{pmatrix} 1 & x_1 & \cdots & x_p \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \varepsilon. \qquad (2.1)$$

In what follows (2.1) is decomposed into a vector equation involving the unknowns $\beta_1,\dots,\beta_p$ and a scalar equation involving all of the $\beta_i$.

Set $V = \{v \in \mathbb{R}^n \mid v = \alpha 1,\ \alpha \in \mathbb{R}\}$ and $W = \{w \in \mathbb{R}^n \mid 1^t w = 0\}$. These two sets are orthogonal subspaces of $\mathbb{R}^n$ having only the zero vector in common. Moreover, they are such that every vector in $\mathbb{R}^n$ can be written uniquely as the sum of a vector from $V$ and a vector from $W$. In other words, $\mathbb{R}^n$ can be expressed as an orthogonal direct sum of these spaces.

If $\bar y = \frac{1}{n}\sum_{i=1}^{n} y_i$, $\bar x_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$, and $\bar\varepsilon = \frac{1}{n}\sum_{i=1}^{n} \varepsilon_i$, then the following are sample mean-centered vectors:

$$\tilde y = y - \bar y 1, \qquad \tilde x_j = x_j - \bar x_j 1, \qquad \tilde\varepsilon = \varepsilon - \bar\varepsilon 1$$

for $j = 1,\dots,p$. Note that $1^t\tilde y = 1^t\tilde x_j = 1^t\tilde\varepsilon = 0$, which implies that $\tilde y$, $\tilde x_j$, and $\tilde\varepsilon$ belong to $W$. Solving for $y$, $x_j$, and $\varepsilon$ in these equations and substituting those values into (2.1) yields

$$0 = (\beta_0 + \beta_1\bar x_1 + \cdots + \beta_p\bar x_p + \bar\varepsilon - \bar y)1 + (\beta_1\tilde x_1 + \cdots + \beta_p\tilde x_p + \tilde\varepsilon - \tilde y),$$

which shows $0$ to be a sum of vectors from $V$ and $W$. This fact, combined with the uniqueness of vector decomposition with respect to $V$ and $W$, implies that

$$\tilde y = \beta_1\tilde x_1 + \cdots + \beta_p\tilde x_p + \tilde\varepsilon \qquad (2.2)$$

and

$$\bar y = \beta_0 + \beta_1\bar x_1 + \cdots + \beta_p\bar x_p + \bar\varepsilon. \qquad (2.3)$$

In what follows the regression coefficients $\beta_0,\dots,\beta_p$ are estimated by first "solving" (2.2) using the method of cyclic subspace regression (described below), yielding $\hat\beta_1,\dots,\hat\beta_p$, and then estimating $\beta_0$ by $\hat\beta_0 = \bar y - \hat\beta_1\bar x_1 - \cdots - \hat\beta_p\bar x_p$.

Eqs. (2.2) and (2.3) result from a rigorous development of mean centering. It is noted that sometimes in regression analysis standardization techniques are also employed, particularly in situations where the independent variables are on radically different scales. In presenting the material for this paper the authors had to make a choice: either present the material from a mean-centered viewpoint or present it from a standardized viewpoint. The former was chosen as it is a simpler modification of the data.


Let $X = \begin{pmatrix}\tilde x_1 & \cdots & \tilde x_p\end{pmatrix}$ and $\beta = (\beta_1 \cdots \beta_p)^t$. With this notation (2.2) can be rewritten as

$$\tilde y = X\beta + \tilde\varepsilon. \qquad (2.4)$$

Throughout this paper, the rank of $X$, i.e., the number of linearly independent columns in $X$, is assumed to be $r$, where $1 \leq r \leq \min\{n,p\}$. A direct consequence of this assumption is that $X^tX$ has $p$ orthonormal eigenvectors $v_1,\dots,v_p$, with the first $r$ being associated with the $r$ eigenvalues $\lambda_1 \geq \cdots \geq \lambda_r > 0$ and the last $p - r$ being associated with the eigenvalue $0$. Let $u_i = \frac{1}{\sigma_i}Xv_i$ for $i = 1,\dots,r$, where $\sigma_i = \sqrt{\lambda_i}$. These vectors $u_i$, $v_i$ and numbers $\sigma_i$, for $i = 1,\dots,r$, are known as the singular vectors and values associated with $X$. The $u_i$, for $i = 1,\dots,r$, form an orthonormal basis for $R(X)$ and the $v_i$, for $i = 1,\dots,r$, form an orthonormal basis for $R(X^t)$ [8]. The singular values $\sigma_i$ are assumed to be distinct in this paper.

The method of cyclic subspace regression is described now. For a more complete description see [7]. Fix an integer $l$ so that $1 \leq l \leq r$ and let

$$P_l = (u_1\, u_2\, \cdots\, u_l)(u_1\, u_2\, \cdots\, u_l)^t.$$

Then $P_l$ is an orthogonal projection of $\mathbb{R}^n$ onto the span of $u_1,\dots,u_l$, and the vector $P_l\tilde y = \sum_{i=1}^{l}(u_i^t\tilde y)u_i$ is the orthogonal projection of $\tilde y$ onto the space spanned by $u_1,\dots,u_l$. In this paper, it is assumed that $u_i^t\tilde y \neq 0$ for $i = 1,\dots,l$. Set $z_l = X^tP_l\tilde y$ and let $k$ be a fixed integer such that $1 \leq k \leq l$. Associated with $z_l$ is the $k$-dimensional subspace of $\mathbb{R}^p$

$$V_{k,l} = \mathrm{span}\{z_l, (X^tX)z_l, \dots, (X^tX)^{k-1}z_l\}.$$

It should be noted that the assumptions $u_i^t\tilde y \neq 0$ and $\sigma_i$ distinct are needed to ensure that $V_{k,l}$ is indeed $k$-dimensional. Furthermore, these assumptions allow it to be shown that $V_{l,l} = \mathrm{span}\{v_1,\dots,v_l\}$ (see Claim 1 in the Appendix). Define

$$A_{kl} = \begin{pmatrix} z_l & (X^tX)z_l & \cdots & (X^tX)^{k-1}z_l \end{pmatrix}$$

and set $B_{kl} = XA_{kl}$. The "solution" to (2.4) by cyclic subspace regression is

$$\hat\beta_{kl} = A_{kl}(B_{kl}^tB_{kl})^{-1}B_{kl}^t\tilde y \in \mathbb{R}^p. \qquad (2.5)$$

This vector $\hat\beta_{kl}$ is the unique vector in the span of the columns of $A_{kl}$ that is sent by $X$ onto the orthogonal projection of $\tilde y$ onto the span of the columns of $B_{kl}$.
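The construction above lends itself to a direct numerical sketch. The following Python fragment is our own illustration, not the authors' code; the helper name `csr_beta` and the random test data are assumptions. It computes $\hat\beta_{kl}$ from Eq. (2.5) and checks the boundary cases $k = l = r$ (LS) and $k = l < r$ (PCR) noted in the Introduction.

```python
import numpy as np

def csr_beta(X, y_t, k, l):
    """CSR regression vector beta_hat_{kl} of Eq. (2.5).

    X : mean-centered n x p design matrix; y_t : mean-centered response.
    Assumes 1 <= k <= l <= rank(X), distinct singular values, u_i^t y_t != 0.
    """
    U = np.linalg.svd(X, full_matrices=False)[0]
    Ul = U[:, :l]
    z = X.T @ (Ul @ (Ul.T @ y_t))              # z_l = X^t P_l y~
    cols, v = [], z
    for _ in range(k):                         # A_kl = (z_l, (X^tX)z_l, ...)
        cols.append(v)
        v = X.T @ (X @ v)
    A = np.column_stack(cols)
    B = X @ A                                  # B_kl = X A_kl
    return A @ np.linalg.solve(B.T @ B, B.T @ y_t)

# mean-centered synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4)); X -= X.mean(axis=0)
y = rng.standard_normal(30); y -= y.mean()

r = np.linalg.matrix_rank(X)
b_ls = np.linalg.lstsq(X, y, rcond=None)[0]    # least-squares vector (k = l = r)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
b_pcr2 = sum((U[:, i] @ y / s[i]) * Vt[i] for i in range(2))  # 2-factor PCR vector
```

With $k = l = r$ the Krylov-type basis spans $R(X^t)$ and the CSR vector reduces to the LS solution; with $k = l = 2$ it reproduces the 2-factor PCR vector, matching the case analysis given in the Introduction.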

3. Maximization properties

The space $V_{k,l}$ has many bases; in particular, the columns of $A_{kl}$ are a basis. If $k = l$, then $v_1,\dots,v_l$ also form a basis. In principal components analysis these vectors $v_i$ can be used to create new variables $z_i = v_i^t x$ that have maximum variance subject to decorrelation constraints [4], where $x$ is a $p \times 1$ vector consisting of the original independent variables shifted by their sample means. The next result shows that each $V_{k,l}$ also contains a basis that maximizes the sample covariance with $P_l\tilde y$. Before stating and proving this fact it is noted that for $w$ in $\mathbb{R}^p$, the sample covariance between $Xw$ and $P_l\tilde y$ is

$$\mathrm{cov}(Xw, P_l\tilde y) = \frac{(Xw)^tP_l\tilde y}{n-1} = \frac{w^tz_l}{n-1}.$$


Theorem 1. There exists a list of linearly independent $\mathbb{R}^p$ vectors $w_1,\dots,w_l$ such that

1. $w_i$ maximizes $w^tz_l$ subject to the constraints $w_i^tw_i = 1$ for $i = 1,\dots,l$ and $w_i^tX^tXw_j = 0$ for $i = 2,\dots,l$ and $j = 1,\dots,i-1$, and

2. $V_{k,l} = \mathrm{span}\{w_1,\dots,w_k\}$ for any $1 \leq k \leq l$.

Proof. The proof will proceed by two separate induction arguments. To begin, note that the function $f$ whose domain is $D_0 = \{w \in \mathbb{R}^p \mid w^tw = 1\}$ and whose value at $w$ is $f(w) = w^tz_l$ is a continuous real-valued function defined on a compact set. This implies that there is a vector in $D_0$ that maximizes the value of $f$. To find such a maximizer note that $f(w) = \sqrt{z_l^tz_l}\cos\theta$, where $\theta$ is the angle (in radians) between $w$ and $z_l$. Clearly this expression is maximized when $\theta = 0$, i.e., when $w$ is a positive multiple of $z_l$. The condition $w^tw = 1$ forces the multiple to be $1/\sqrt{z_l^tz_l}$. Thus, $f$ has a unique maximizer and it is $w_1 = (1/\sqrt{z_l^tz_l})z_l$. (As an aside, it should be noted that the argument just presented is similar to the ones used to establish conditions for equality in the Cauchy–Schwarz inequality.)

Now consider the problem: maximize $w^tz_l$ subject to the conditions $w^tw = 1$ and $w^tX^tXw_1 = \cdots = w^tX^tXw_j = 0$, where $1 \leq j < l$. The function $f$ restricted to $D_j = \{w \in \mathbb{R}^p \mid w^tw = 1 \text{ and } w^tX^tXw_1 = \cdots = w^tX^tXw_j = 0\}$ is still continuous and defined on a compact set. Therefore, there exists a vector that maximizes the value of $f$ on $D_j$. To find this maximizer, set $S_j = \{w \in \mathbb{R}^p \mid w^tX^tXw_1 = \cdots = w^tX^tXw_j = 0\}$ and $T_j = \mathrm{span}\{X^tXw_1,\dots,X^tXw_j\}$ and note that $\mathbb{R}^p$ can be expressed as an orthogonal direct sum of these two spaces. This implies that $z_l = s_j + t_j$, where $s_j \in S_j$ and $t_j \in T_j$. Now for $w \in D_j$, it follows that $w^tz_l = w^t(s_j + t_j) = w^ts_j = \sqrt{s_j^ts_j}\cos\theta$, where $\theta$ is the angle between $w$ and $s_j$. The conditions of membership in $D_j$, together with the fact that $w^tz_l$ is maximized when $w = \alpha s_j$, where $\alpha > 0$, imply that this problem has $w_{j+1} = (1/\sqrt{s_j^ts_j})s_j$ as the unique solution.

The above has used induction to show that there exist $l$ unique vectors $w_1,\dots,w_l$ such that $w_i^tz_l$ is maximal subject to the constraints $w_i^tw_i = 1$ and $w_i^tX^tXw_j = 0$ when $i \neq j$.

For $j = 1,\dots,l$, let $W_{j,l} = \mathrm{span}\{w_1,\dots,w_j\}$. By way of induction, it will be shown that $W_{j,l} = V_{j,l}$. To begin, let $j = 1$. Since $w_1 = (1/\sqrt{z_l^tz_l})z_l$, it follows that $W_{1,l} = V_{1,l}$, thereby proving the result for $j = 1$. Now assume for $j = s$ that $W_{s,l} = V_{s,l}$ and consider what this implies for $j = s + 1$.

Let $w_0 \in V_{s+1,l}$ be such that $w_0^tw_0 = 1$ and $w_0^t(X^tX)^iz_l = 0$ for $i = 1,\dots,s$. These conditions imply, after some elementary considerations, that there exists a nonzero constant $\gamma$ such that $w_0 = \gamma(z_l + z)$, where

$$z \in X^tX(V_{s,l}) = \mathrm{span}\{(X^tX)z_l, (X^tX)^2z_l, \dots, (X^tX)^sz_l\}$$

(see Claim 2 in the Appendix). It should be noted that the induction hypothesis implies that

$$X^tX(V_{s,l}) = X^tX(W_{s,l}) = \mathrm{span}\{(X^tX)w_1, (X^tX)w_2, \dots, (X^tX)w_s\}.$$

This fact implies that if $x \in \mathbb{R}^p$ is such that $x^tx = 1$ and $x^tX^tXw_i = 0$ for $i = 1,\dots,s$, then $x^t(X^tX)^iz_l = 0$ for $i = 1,\dots,s$. Consequently, $x^tz_l = x^t(z_l + z) = x^tw_0/\gamma$. This last term is maximized when $x = \pm w_0$, where the sign is chosen to correspond to the sign of $\gamma$.

The results of the last paragraph, together with the properties held by $w_{s+1}$, imply that $w_{s+1}^tz_l = w_{s+1}^tw_0/\gamma$. As $w_{s+1}$ maximizes $w^tz_l$, it follows that $w_{s+1} = \pm w_0$ and hence $W_{s+1,l} \subseteq V_{s+1,l}$.


To show that the opposite inclusion holds, it is enough to show that $w_{s+1}$ is linearly independent of the vectors $w_1,\dots,w_s$, which are linearly independent by virtue of the induction hypothesis. Suppose $w_{s+1}$ were dependent on these vectors. Then $X^tXw_{s+1}$ would be a linear combination of $X^tXw_1,\dots,X^tXw_s$. A direct consequence of this observation is that $w_{s+1}^tX^tXw_{s+1} = 0$, i.e. $Xw_{s+1} = 0$. This implies that $w_{s+1}$ is a member of the nullspace of $X$ and the range of $X^t$. As these spaces share only the zero vector, it follows that $w_{s+1} = 0$, a contradiction. Thus, $W_{s+1,l} = V_{s+1,l}$. This completes the second induction and the proof. $\square$

The vectors $w_i$ have nice sample covariance maximization properties, but are not readily obtainable. In the next three paragraphs an easily obtained basis for $V_{l,l}$ is constructed using the method of Gram–Schmidt. These new basis vectors are shown to equal, up to identified scalars, the vectors $w_1,\dots,w_l$.

To begin, observe that the function $[\cdot,\cdot]$ defined on $R(X^t) \times R(X^t)$ by $[c,d] = c^tX^tXd$ defines an inner product. With respect to this inner product the vectors $w_1,\dots,w_l$ satisfy $[w_i,w_j] = 0$ whenever $i \neq j$, i.e. the vectors $w_i$ form an $[\cdot,\cdot]$-orthogonal basis for $V_{l,l}$. Applying the method of Gram–Schmidt orthogonalization to the vectors $z_l, (X^tX)z_l, \dots, (X^tX)^{l-1}z_l$ using the inner product $[\cdot,\cdot]$ yields the vectors $r_1 = z_l$ and

$$r_i = (X^tX)^{i-1}z_l - \sum_{j=1}^{i-1}\frac{[r_j,(X^tX)^{i-1}z_l]}{[r_j,r_j]}r_j$$

for $i = 2,\dots,l$, which satisfy $[r_i,r_i] \neq 0$,

$$\mathrm{span}\{r_1,\dots,r_i\} = \mathrm{span}\{z_l, (X^tX)z_l, \dots, (X^tX)^{i-1}z_l\}$$

for $i = 1,\dots,l$, and $[r_i,r_j] = 0$ whenever $i \neq j$.

Induction is now used to show that $w_i = (\pm 1/\sqrt{r_i^tr_i})r_i$ for $i = 1,\dots,l$. For $i = 1$ it is clear that $w_1 = (1/\sqrt{z_l^tz_l})z_l = (1/\sqrt{r_1^tr_1})r_1$. Now assume the claim holds for $i = s$ and consider $w_{s+1} \in V_{s+1,l}$. As $V_{s+1,l} = \mathrm{span}\{r_1,\dots,r_{s+1}\}$ it follows that $w_{s+1} = \sum_{j=1}^{s+1}\gamma_jr_j$, where $\gamma_j = [w_{s+1},r_j]/[r_j,r_j]$. The induction hypothesis and the spanning properties already attributed to the $w_i$ imply that

$$V_{s,l} = \mathrm{span}\{r_1,\dots,r_s\} = \mathrm{span}\{w_1,\dots,w_s\}.$$

Thus, every $r_i$, $i = 1,\dots,s$, can be expressed as a linear combination of $w_1,\dots,w_s$. Since $[w_i,w_j] = 0$ for $i \neq j$, it follows that $\gamma_j = 0$ for $j = 1,\dots,s$. Thus, $w_{s+1} = \gamma_{s+1}r_{s+1}$. Since $w_{s+1}^tw_{s+1} = 1$, it must be the case that $\gamma_{s+1} = \pm 1/\sqrt{r_{s+1}^tr_{s+1}}$.

To choose the correct sign in the relationship between the $w_i$ and $r_i$, consider the product $w_i^tz_l$ using both multiples of $r_i$ and choose the one that makes $w_i^tz_l$ positive. Should a choice not be made, then the correct values of $w_i^tz_l$ can still be obtained by taking absolute values.
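This construction is easy to carry out numerically. The sketch below is our own illustration, not the authors' code; the helper name `csr_w_basis` and the random data are assumptions. It builds the $r_i$ by Gram–Schmidt in the inner product $[c,d] = c^tX^tXd$ and rescales them to the covariance-maximizing vectors $w_i$.

```python
import numpy as np

def csr_w_basis(X, y_t, l):
    """Covariance-maximizing basis w_1,...,w_l of V_{l,l} (Section 3 sketch)."""
    G = X.T @ X
    U = np.linalg.svd(X, full_matrices=False)[0]
    z = X.T @ (U[:, :l] @ (U[:, :l].T @ y_t))       # z_l = X^t P_l y~
    rs, v = [], z
    for _ in range(l):
        r = v.copy()
        for rj in rs:                               # subtract [r_j, .]/[r_j, r_j] components
            r -= ((rj @ (G @ v)) / (rj @ (G @ rj))) * rj
        rs.append(r)
        v = G @ v                                   # next vector (X^tX)^i z_l
    # w_i = +/- r_i/||r_i||, sign chosen so that w_i^t z_l > 0
    W = np.column_stack([r / np.linalg.norm(r) * np.sign(r @ z) for r in rs])
    return W, z, G

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4)); X -= X.mean(axis=0)
y = rng.standard_normal(30); y -= y.mean()
W, z, G = csr_w_basis(X, y, 3)
covs = W.T @ z        # proportional to the maximized covariances w_i^t z_l
```

The columns of `W` are $[\cdot,\cdot]$-orthogonal, have unit Euclidean length, and their covariances $w_i^tz_l$ come out positive and non-increasing, as Theorem 1 and its proof predict.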

4. Matrix and solution representations

The representation of the vector $\hat\beta_{kl}$ in Eq. (2.5) is not unique. To see that this is the case let $K$ denote an invertible $k \times k$ matrix. Since

$$A_{kl}(B_{kl}^tB_{kl})^{-1}B_{kl}^t = A_{kl}K(K^tA_{kl}^tX^tXA_{kl}K)^{-1}K^tA_{kl}^tX^t,$$

it follows that

$$\hat\beta_{kl} = L_{kl}(G_{kl}^tG_{kl})^{-1}G_{kl}^t\tilde y, \qquad (4.1)$$

where $L_{kl} = A_{kl}K$ and $G_{kl} = XA_{kl}K$, is also a representation form for $\hat\beta_{kl}$.

For a fixed $K$, Eq. (4.1) indicates that $\hat\beta_{kl}$ results from multiplication of $\tilde y$ by the transforming matrix $L_{kl}(G_{kl}^tG_{kl})^{-1}G_{kl}^t$, which is the product of three, in general, complicated matrices. Since a matrix is nothing more than a representation of a linear transformation with respect to specified bases, it is natural to ask if there are alternative bases that will make the form of the transforming matrix simpler. If so, then analysis of the relationships between $\tilde y$ and $\hat\beta_{kl}$ can be simplified. Alternative representations for this transforming matrix are now developed.

To begin, let the columns of $A_{kl}$ and $B_{kl}$ be denoted by $a_1,\dots,a_k$ and $b_1,\dots,b_k$, respectively. With respect to the standard inner product in $\mathbb{R}^p$, the method of Gram–Schmidt orthogonalization produces from $a_1,\dots,a_k$ the vectors $q_1,\dots,q_k$ such that $q_i^tq_j = \delta_{ij}$, where $\delta_{ii} = 1$ and $\delta_{ij} = 0$ when $i \neq j$, $\mathrm{span}\{q_1,\dots,q_j\} = \mathrm{span}\{a_1,\dots,a_j\}$ for $j = 1,\dots,k$, and $A_{kl} = Q_kR_k$, where $Q_k = (q_1,\dots,q_k)$ and $R_k$ is a $k \times k$ invertible upper triangular matrix whose $ij$th entry is $q_i^ta_j$ for $i \leq j$ and is $0$ otherwise [9]. Similarly, with respect to the standard inner product in $\mathbb{R}^n$, the method of Gram–Schmidt orthogonalization produces from $b_1,\dots,b_k$ new vectors $p_1,\dots,p_k$ such that $p_i^tp_j = \delta_{ij}$, $\mathrm{span}\{p_1,\dots,p_j\} = \mathrm{span}\{b_1,\dots,b_j\}$ for $j = 1,\dots,k$, and $B_{kl} = P_kS_k$, where $P_k = (p_1,\dots,p_k)$ and $S_k$ is a $k \times k$ invertible upper triangular matrix whose $ij$th entry is $p_i^tb_j$ for $i \leq j$ and is $0$ otherwise. These decompositions of $A_{kl}$ and $B_{kl}$ imply that $Q_k = A_{kl}R_k^{-1}$ and $P_kS_kR_k^{-1} = XA_{kl}R_k^{-1}$. Let $K = R_k^{-1}$. Then Eq. (4.1) implies, after simplification, that

$$\hat\beta_{kl} = Q_kT_k^{-1}P_k^t\tilde y, \qquad (4.2)$$

where $T_k = S_kR_k^{-1}$.

Theorem 2. The matrix $T_k$ is a bidiagonal matrix.

Proof. Since the inverse of an upper triangular invertible matrix is upper triangular and since the product of upper triangular matrices is upper triangular, it follows that $T_k$ is upper triangular. The fact that $S_k = P_k^tXA_{kl}$ implies $T_k = P_k^tXQ_k$ and hence the $ij$th component of $T_k$ is equal to $p_i^tXq_j$. To show that $T_k$ is bidiagonal it is enough to show, in view of the upper triangularity of $T_k$, that $p_i^tXq_j = 0$ for $1 \leq i \leq j - 2$, where $3 \leq j \leq k$. This will be accomplished for these stated $i$ and $j$ by showing, using induction, that $X^tp_i \in \mathrm{span}\{q_1,\dots,q_{i+1}\}$. Let $j$ be fixed such that $3 \leq j \leq k$ and consider the vector $X^tp_1$. By construction $p_1 = \frac{1}{\sqrt{b_1^tb_1}}b_1$. Since $b_1 = Xa_1$ and $X^tXa_1 = a_2$, it follows that

$$X^tp_1 = \frac{1}{\sqrt{b_1^tb_1}}X^tXa_1 = \frac{1}{\sqrt{b_1^tb_1}}a_2 \in \mathrm{span}\{q_1,q_2\}.$$

This proves the containment for $i = 1$. The containment is now assumed to hold for $i = h$. For the case $i = h + 1$, the construction of the vectors $b_{h+1}$ and $a_{h+1}$ implies that $X^tb_{h+1} = X^tXa_{h+1} = a_{h+2}$. The Gram–Schmidt process implies that there exist $\alpha_1,\dots,\alpha_{h+1}$ such that $p_{h+1} = \alpha_{h+1}b_{h+1} - \sum_{m=1}^{h}\alpha_mp_m$. From the induction hypothesis it follows that

$$X^tp_{h+1} = \alpha_{h+1}a_{h+2} - \sum_{m=1}^{h}\alpha_mX^tp_m \in \mathrm{span}\{q_1,\dots,q_{h+2}\}.$$

This completes the induction. Now for $1 \leq i \leq j - 2$, the orthonormality of the $q_i$ and the fact that $X^tp_i \in \mathrm{span}\{q_1,\dots,q_{i+1}\}$ imply

$$p_i^tXq_j = (X^tp_i)^tq_j = 0,$$

showing $T_k$ to be bidiagonal. $\square$

In light of Theorem 2 and (4.2), it follows that the matrix that transforms $\tilde y$ into $\hat\beta_{kl}$ can be decomposed into the product of three matrices $Q_k$, $T_k^{-1}$, and $P_k^t$, which have some rather simple properties. In particular, $Q_k$ and $P_k$ have orthonormal columns and $T_k^{-1}$ is upper triangular, being the inverse of a bidiagonal matrix. This form of the transformation is easy to understand and is computationally robust. The next theorem shows that the second of the three multiplying matrices can be simplified even further, but at the cost of losing orthogonality in the first multiplying matrix.
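The decomposition in Eq. (4.2) can be checked numerically. The sketch below is our own illustration on random data, not code from the paper. It forms $A_{kl}$ and $B_{kl}$, takes their QR factorizations, and verifies that $T_k = S_kR_k^{-1} = P_k^tXQ_k$ is upper bidiagonal and that (4.2) reproduces the direct formula (2.5). (NumPy's QR may flip column signs relative to classical Gram–Schmidt, which changes neither $\hat\beta_{kl}$ nor the bidiagonal pattern.)

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 5)); X -= X.mean(axis=0)
y = rng.standard_normal(40); y -= y.mean()
k, l = 3, 4

U = np.linalg.svd(X, full_matrices=False)[0]
z = X.T @ (U[:, :l] @ (U[:, :l].T @ y))             # z_l
A = np.column_stack([np.linalg.matrix_power(X.T @ X, i) @ z for i in range(k)])
B = X @ A
beta_direct = A @ np.linalg.solve(B.T @ B, B.T @ y)  # Eq. (2.5)

Qk, Rk = np.linalg.qr(A)            # A_kl = Q_k R_k
Pk, Sk = np.linalg.qr(B)            # B_kl = P_k S_k
Tk = Sk @ np.linalg.inv(Rk)         # T_k = S_k R_k^{-1} = P_k^t X Q_k
beta_qr = Qk @ np.linalg.solve(Tk, Pk.T @ y)         # Eq. (4.2)
```

The assertions below confirm the bidiagonal structure (no entries below the diagonal or above the first superdiagonal) and the agreement of the two representations.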

Theorem 3. Let $W_k = (w_1 \cdots w_k)$ and $D_k = \mathrm{diag}([w_1,w_1],\dots,[w_k,w_k])$. Then

$$\hat\beta_{kl} = W_kD_k^{-1}(XW_k)^t\tilde y.$$

Proof. In Section 3 of this paper the vectors $w_1,\dots,w_k$ were shown to form a basis for $V_{k,l}$, the $k$-dimensional space spanned by the columns of $A_{kl}$. This implies that there exists a $k \times k$ invertible matrix $K$ such that $A_{kl}K = W_k$, where $W_k = (w_1 \cdots w_k)$. This fact, together with Eq. (4.1), shows that $\hat\beta_{kl} = W_k(W_k^tX^tXW_k)^{-1}W_k^tX^t\tilde y$. The orthogonality properties of $w_1,\dots,w_k$ with respect to the inner product $[\cdot,\cdot]$ imply that the matrix $W_k^tX^tXW_k$ is a $k \times k$ invertible diagonal matrix with diagonal entries $[w_i,w_i]$, $1 \leq i \leq k$. Let $D_k$ denote this matrix. $\square$

Theorem 3 shows that the transforming matrix can be represented as the product of $W_k$, a matrix whose columns are in general nonorthogonal, but have covariance properties described in Theorem 1, with $D_k^{-1}$, a matrix that is diagonal, and finally with $(XW_k)^t$, a matrix whose transpose has orthogonal columns. Additionally, Theorem 3 implies that

$$\hat\beta_{kl} = \sum_{i=1}^{k}\frac{w_i^tX^t\tilde y}{[w_i,w_i]}w_i. \qquad (4.3)$$

It should be noted that this representation is the analog of the classic $k$-factor PCR solution to (2.4) using the singular vectors $v_1,\dots,v_k$ and $u_1,\dots,u_k$. To see that this is the case note that the classic $k$-factor PCR solution representation is

$$\hat\beta_{kk} = \sum_{i=1}^{k}\frac{u_i^t\tilde y}{\sigma_i}v_i. \qquad (4.4)$$


Since $Xv_i = \sigma_iu_i$ it follows that $u_i^t\tilde y = (v_i^tX^t\tilde y)/\sigma_i$ and that $\sigma_i^2 = [v_i,v_i]$. Thus, Eq. (4.4) can be rewritten as

$$\hat\beta_{kk} = \sum_{i=1}^{k}\frac{v_i^tX^t\tilde y}{[v_i,v_i]}v_i.$$

This form parallels the form displayed in (4.3). The primary difference in the representing vectors $v_i$ and $w_i$ is that the former are a basis for $V_{k,k}$ and the latter are a basis for $V_{k,l} \neq V_{k,k}$ when $k < l$.

Properties held by the numerator and denominator of the expression multiplying the $w_i$ in (4.3) are now set forth.

Theorem 4. For $i = 1,\dots,k$,

$$w_i^tX^t\tilde y = w_i^tz_l.$$

Proof. Since $V_{l,l} = \mathrm{span}\{v_1,\dots,v_l\}$ and $X^t\tilde y = \sum_{j=1}^{r}\sigma_j(u_j^t\tilde y)v_j$, it follows that

$$w_i^tX^t\tilde y = w_i^t\sum_{j=1}^{l}\sigma_j(u_j^t\tilde y)v_j = w_i^tX^tP_l\tilde y = w_i^tz_l. \qquad \square$$

This result implies (4.3) can be rewritten as

$$\hat\beta_{kl} = \sum_{i=1}^{k}\frac{w_i^tz_l}{[w_i,w_i]}w_i. \qquad (4.5)$$

Recall from Section 3 that the value $w_i^tz_l$ is proportional to the maximized sample covariance between $Xw_i$ and $P_l\tilde y$. Thus, the effect that these maximized covariance values have on the solution $\hat\beta_{kl}$ is explicitly seen in the representation given in (4.5).

Theorem 5. For $i = 1,\dots,l$,

$$\sigma_1^2 \geq [w_i,w_i] \geq \sigma_l^2.$$

Proof. Since the vector $w_i \in V_{l,l}$ for $i = 1,\dots,k$, it follows that $w_i = \sum_{j=1}^{l}\alpha_{ij}v_j$, where $\alpha_{ij} = v_j^tw_i$ and $\sum_{j=1}^{l}\alpha_{ij}^2 = 1$. This information implies that $\sigma_1^2 \geq [w_i,w_i] = \sigma_1^2\alpha_{i1}^2 + \cdots + \sigma_l^2\alpha_{il}^2 \geq \sigma_l^2$. $\square$

Theorem 5 shows that $[w_i,w_i]$ is a positive, bounded quantity. As an aside, it should be noted that this last result is a special case of Rayleigh's principle [8].

Let $M_k = W_kD_k^{-1/2}$. Then the columns $m_i$ of $M_k$ satisfy $m_i = w_i/\sqrt{[w_i,w_i]}$ for $1 \leq i \leq k$, and hence Eq. (4.5) can be rewritten as

$$\hat\beta_{kl} = \sum_{i=1}^{k}(m_i^tz_l)m_i = M_kM_k^tz_l. \qquad (4.6)$$

Note that $z_l = X^tX\hat\beta_{ll}$. Substituting this into (4.6) yields

$$\hat\beta_{kl} = M_kM_k^tX^tX\hat\beta_{ll}. \qquad (4.7)$$


This shows that for $k < l$ the solution $\hat\beta_{kl}$ is simply a modification of the $l$-factor principal components solution by the matrix $N_k = M_kM_k^tX^tX$.

Note that $X\hat\beta_{ll} = P_l\tilde y$. This result implies, after applying $X$ to the expression in (4.7), that $X\hat\beta_{kl} = XM_kM_k^tX^tP_l\tilde y$. Since $W_k^tX^tXW_k = D_k$, it can be shown that $XM_kM_k^tX^t$ is an orthogonal projection, implying that $X\hat\beta_{kl}$ is nothing more than the orthogonal projection of $P_l\tilde y$ by $XM_kM_k^tX^t$.

The equation $W_k^tX^tXW_k = D_k$ can also be used to show that $N_kN_k = N_k$. Thus $N_k$ is a projection; in particular it projects $\mathbb{R}^p$ onto $R(N_k)$ along its nullspace $N(N_k)$. Let $k < l$; then

$$R(N_k) = \mathrm{span}\{m_1,\dots,m_k\} = \mathrm{span}\{z_l,\dots,(X^tX)^{k-1}z_l\}.$$

As $N(N_k)$ is the orthogonal complement of $R(N_k^t)$, it follows that $N(N_k)$ is the orthogonal complement of $\mathrm{span}\{(X^tX)z_l,\dots,(X^tX)^kz_l\}$. By construction the vectors $m_{k+1},\dots,m_l$ belong to this complement. Since $\mathrm{span}\{m_1,\dots,m_l\}$ is equal to $\mathrm{span}\{v_1,\dots,v_l\}$, it follows that the vectors $v_{l+1},\dots,v_p$ are also in this complement. These facts, together with vector space dimension counts, imply that

$$N(N_k) = \mathrm{span}\{m_{k+1},\dots,m_l,v_{l+1},\dots,v_p\}.$$

Thus, when $k < l$, it follows that $N_k$ is, in general, a nonorthogonal projection. However, when $k = l$, the above implies that $N_l$ is an orthogonal projection onto $\mathrm{span}\{v_1,\dots,v_l\}$ along $\mathrm{span}\{v_{l+1},\dots,v_p\}$.
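A quick numerical sketch makes the projection claims around Eqs. (4.6)–(4.7) concrete. This is our own illustration on random data (the construction of the $m_i$ repeats the Section 3 Gram–Schmidt recipe): $N_k = M_kM_k^tX^tX$ is idempotent but generally nonsymmetric, while $XM_kM_k^tX^t$ is an orthogonal projection.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 4)); X -= X.mean(axis=0)
y = rng.standard_normal(30); y -= y.mean()
k, l = 2, 3
G = X.T @ X

U = np.linalg.svd(X, full_matrices=False)[0]
z = X.T @ (U[:, :l] @ (U[:, :l].T @ y))             # z_l

# [.,.]-Gram-Schmidt basis of V_{l,l}, normalized so that [m_i, m_i] = 1
rs, v = [], z
for _ in range(l):
    r = v.copy()
    for rj in rs:
        r -= ((rj @ (G @ v)) / (rj @ (G @ rj))) * rj
    rs.append(r)
    v = G @ v
M = np.column_stack([r / np.sqrt(r @ (G @ r)) for r in rs])

Mk = M[:, :k]
beta_ll = M @ (M.T @ z)            # Eq. (4.6) with k = l (l-factor PCR vector)
beta_kl = Mk @ (Mk.T @ z)          # Eq. (4.6)
Nk = Mk @ Mk.T @ G                 # modifying projection of Eq. (4.7)
H = X @ Mk @ Mk.T @ X.T            # orthogonal projection onto X(V_{k,l})
```

The assertions check idempotence of $N_k$, the relation $\hat\beta_{kl} = N_k\hat\beta_{ll}$, the (generic) nonsymmetry of $N_k$ for $k < l$, and that $XM_kM_k^tX^t$ is symmetric and idempotent.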

In [2], Goutis established the following result for PLS:

$$\hat\beta_{1r}^t\hat\beta_{1r} < \hat\beta_{2r}^t\hat\beta_{2r} < \cdots < \hat\beta_{rr}^t\hat\beta_{rr}.$$

This is generalized in the next theorem.

Theorem 6. For $l = 2,\dots,r$,

$$\hat\beta_{1l}^t\hat\beta_{1l} < \hat\beta_{2l}^t\hat\beta_{2l} < \cdots < \hat\beta_{ll}^t\hat\beta_{ll}.$$

Proof. To begin, note that Eq. (4.6) implies for $j = 2,\dots,l$ that

$$\hat\beta_{jl} = \hat\beta_{(j-1)l} + (m_j^tz_l)m_j.$$

From this representation it follows that

$$\hat\beta_{jl}^t\hat\beta_{jl} = \hat\beta_{(j-1)l}^t\hat\beta_{(j-1)l} + (m_j^tz_l)^2m_j^tm_j + 2(m_j^tz_l)(\hat\beta_{(j-1)l}^tm_j).$$

Clearly, the first two added terms on the right side of the above equation are nonnegative. It is claimed that the third term is positive. To see that this is the case, observe that $m_j^tz_l = w_j^tz_l/\sqrt{[w_j,w_j]} > 0$ by the construction of the $w_j$. Eq. (4.6) implies that

$$\hat\beta_{(j-1)l}^tm_j = m_j^t\hat\beta_{(j-1)l} = m_j^tM_{j-1}M_{j-1}^tz_l = \sum_{i=1}^{j-1}(m_j^tm_i)(m_i^tz_l).$$

Since the term $m_i^tz_l$ is positive, the claim will follow if $m_j^tm_i$ can be shown to be positive for $i = 1,\dots,j-1$. Recall from Section 3 that $T_i$ is equal to $\mathrm{span}\{(X^tX)w_1,\dots,(X^tX)w_i\}$ for $i = 1,\dots,l-1$. By construction $m_i = \alpha_i(z_l + t_i)$, where $t_i \in T_i$, $\alpha_i > 0$, and $m_i^tt_i = 0$. As $T_i \subseteq T_{i+1}$, it follows that $m_j^tm_i = m_j^t\alpha_i(z_l + t_i) = \alpha_im_j^tz_l > 0$, proving the result. $\square$

5. Prediction

By definition the residual associated with $\hat\beta_{kl}$ is $e_{kl} = \tilde y - X\hat\beta_{kl}$. Let $K$ denote a fixed invertible $k \times k$ matrix. The solution form given in Eq. (4.1) shows $e_{kl} = (I_n - G_{kl}(G_{kl}^tG_{kl})^{-1}G_{kl}^t)\tilde y$. For notational purposes let $H_{kl} = G_{kl}(G_{kl}^tG_{kl})^{-1}G_{kl}^t$. Since $H_{kl}^2 = H_{kl}^t = H_{kl}$, it follows that $H_{kl}$ is an orthogonal projection, as is $I_n - H_{kl}$. Since the columns of $G_{kl}$ are a basis for $X(V_{k,l})$, it follows that the residual $e_{kl} = (I_n - H_{kl})\tilde y$ is the orthogonal projection of $\tilde y$ onto the orthogonal complement of $X(V_{k,l})$.

Clearly, the representation form of $H_{kl}$ depends on the $K$ used. Should $K = I_k$, then $H_{kl} = B_{kl}(B_{kl}^tB_{kl})^{-1}B_{kl}^t$, and if $K$ is the matrix $R_k^{-1}$ appearing in Section 4, then $H_{kl} = P_kP_k^t$, which is particularly simple. This simplicity results from the orthonormality of the columns in $P_k$. More generally, if $F_k$ is an $n \times k$ matrix whose columns are an orthonormal basis for $X(V_{k,l})$, then $H_{kl} = F_kF_k^t$. It should be noted that if $k = l = r$, then $H_{rr}$ is the classic "hat" matrix of least squares [3].

Using the notation established in Section 2, let $x_0 = (x_{i0}) \in \mathbb{R}^p$ be a new set of values for the variables $x_1,\dots,x_p$. The predicted value for the variable $y$ based on the cyclic subspace regression vector $\hat\beta_{kl}$ is

$$\hat y_{kl} = \bar y + \hat\beta_{kl}^tx_0^* = \bar y + x_0^{*t}\hat\beta_{kl} = \bar y + b_{kl}, \qquad (5.1)$$

where $x_0^* = (x_{i0} - \bar x_i)$ and $b_{kl} = x_0^{*t}\hat\beta_{kl}$. Clearly, changing the $k$ and $l$ will change, in general, the value of $\hat y_{kl}$, with the change being effected by the term $b_{kl}$. This term is now studied in terms of $H_{kl}$. To begin, observe that $x_0^*$, being a vector in $\mathbb{R}^p$, can be represented as follows:

$$x_0^* = \sum_{i=1}^{p}(v_i^tx_0^*)v_i = X^t\sum_{i=1}^{p}\frac{(v_i^tx_0^*)}{\sigma_i}u_i.$$

From (2.5) it follows that

$$b_{kl} = \left(\sum_{i=1}^{p}\frac{(v_i^tx_0^*)}{\sigma_i}u_i\right)^tXA_{kl}\left(B_{kl}^tB_{kl}\right)^{-1}B_{kl}^t\tilde y = \left(\sum_{i=1}^{p}\frac{(v_i^tx_0^*)}{\sigma_i}u_i\right)^tH_{kl}\tilde y. \qquad (5.2)$$

It should be noted that the $p$ appearing in the summations above can be replaced by $l$ because $H_{kl}\tilde y$ is in the span of $u_1,\dots,u_l$.

Let $f_1,\dots,f_k$ be orthonormal vectors that span $X(V_{k,l})$, for $k = 1,\dots,l$, and set $F_k = (f_1 \cdots f_k)$. Using this matrix, the material presented earlier in this section implies that for $k = 2,\dots,l$

$$H_{kl} = F_kF_k^t = (F_{k-1}\ f_k)\begin{pmatrix}F_{k-1}^t \\ f_k^t\end{pmatrix} = F_{k-1}F_{k-1}^t + f_kf_k^t = H_{(k-1)l} + f_kf_k^t. \qquad (5.3)$$

Combining the results in (5.2) and (5.3) implies for $k = 2,\dots,l$ that

$$b_{kl} = b_{(k-1)l} + \left(\sum_{i=1}^{l}\frac{(v_i^tx_0^*)}{\sigma_i}u_i\right)^tf_kf_k^t\tilde y. \qquad (5.4)$$


Thus, once the vectors $x_0^*, f_k, v_k, u_k$ and the values $\sigma_k$ have been identified for $k = 1,\dots,l$, the prediction values associated with the subspaces $V_{k,l}$ can be quickly obtained by use of the recursive relationship in (5.4).

6. Factor selection

Factor selection is a difficult task. At best it is an art, honed by experience with the problem at hand; at its worst, it is a quagmire. In [4] over 30 pages are devoted to ways of choosing the correct number of factors for PCR. The following paragraph offers a way of choosing both $k$ and $l$ by means of prediction calculations.

CSR is capable of producing $r(r+1)/2$ regression vectors $\hat\beta_{kl}$. Given a new set of values for the variables $y$ and $x_1,\dots,x_p$, one can use the new $x_i$ values to obtain predictions of the $y$ values by use of Eq. (5.1). These predictions can then be compared to the actual values. By summing the squares of the differences between the observed $y$ data values and the predicted values, one obtains a PRESS value. There is a PRESS value for each allowed $k$ and $l$. The appropriate choice of $k$ and $l$ is obtained by choosing the $k$ and $l$ associated with the smallest PRESS value. In the absence of a new set of values, leave-one-out cross-validation can be used to produce PRESS values. It should be noted that the formula in (5.4) can speed up the PRESS calculation process.
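A direct (unoptimized) version of this selection rule can be sketched as follows. The helper names `csr_beta` and `loo_press` and the synthetic data are our own illustration; this sketch does not use the recursive speed-up of (5.4).

```python
import numpy as np

def csr_beta(X, y_t, k, l):
    # CSR regression vector of Eq. (2.5) on mean-centered data
    U = np.linalg.svd(X, full_matrices=False)[0]
    z = X.T @ (U[:, :l] @ (U[:, :l].T @ y_t))
    A = np.column_stack([np.linalg.matrix_power(X.T @ X, i) @ z for i in range(k)])
    B = X @ A
    return A @ np.linalg.solve(B.T @ B, B.T @ y_t)

def loo_press(Xraw, yraw, k, l):
    """Leave-one-out PRESS for the (k, l) CSR model (un-centered inputs)."""
    n = len(yraw)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = Xraw[keep], yraw[keep]
        xbar, ybar = Xi.mean(axis=0), yi.mean()
        beta = csr_beta(Xi - xbar, yi - ybar, k, l)
        yhat = ybar + (Xraw[i] - xbar) @ beta       # prediction, Eq. (5.1)
        press += (yraw[i] - yhat) ** 2
    return press

rng = np.random.default_rng(4)
Xraw = rng.standard_normal((25, 3))
yraw = Xraw @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(25)
r = np.linalg.matrix_rank(Xraw - Xraw.mean(axis=0))
scores = {(k, l): loo_press(Xraw, yraw, k, l)
          for l in range(1, r + 1) for k in range(1, l + 1)}
best_k, best_l = min(scores, key=scores.get)
```

For rank $r = 3$ this grid holds $r(r+1)/2 = 6$ PRESS values, and the $(k,l)$ pair with the smallest one is selected.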

The method described above for choosing $k$ and $l$ is computationally expensive. The following method is less so. It supposes that $l$ has been determined by methods similar to those described in [4] or by some other means. Let $w_1,\dots,w_l$ be the covariance maximizing vectors discussed in detail in Section 3. The value of the sample covariance between $Xw_i$ and $P_l\tilde y$ is $w_i^tz_l/(n-1)$. By definition, the total covariance associated with $w_1,\dots,w_l$ is $\frac{1}{n-1}\sum_{i=1}^{l}w_i^tz_l$. By construction, $w_1^tz_l \geq \cdots \geq w_l^tz_l \geq 0$. In analogy with factor choice for PCR by variance considerations, consider the following measure of total covariance:

$$c_{kl} = \frac{\sum_{i=1}^{k}w_i^tz_l}{\sum_{i=1}^{l}w_i^tz_l}, \qquad 1 \leq k \leq l.$$

Clearly, $0 < c_{1l} < c_{2l} < \cdots < c_{ll} = 1$. Let $c$ be a number between $0$ and $1$. To choose $k$, find the first value of $c_{kl}$ that is bigger than the chosen $c$ and use that $k$. The value of $c$ to be used is clearly problem dependent. Should both covariance and factor reduction issues be important, then the authors recommend use of a $c$ value of $0.8$ or higher, as is done with PCR in practice. However, if the overall goal is factor reduction, then use of a smaller $c$ value is recommended.
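The cumulative covariance measure $c_{kl}$ is cheap to compute once the $w_i$ are in hand. The sketch below is our own illustration on random data, reusing the Section 3 Gram–Schmidt construction; the threshold $c = 0.8$ follows the recommendation above.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((40, 6)); X -= X.mean(axis=0)
y = rng.standard_normal(40); y -= y.mean()
l = 5
G = X.T @ X

U = np.linalg.svd(X, full_matrices=False)[0]
z = X.T @ (U[:, :l] @ (U[:, :l].T @ y))             # z_l

# covariance-maximizing basis w_1,...,w_l (Section 3), via |w_i^t z_l|
rs, v = [], z
for _ in range(l):
    r = v.copy()
    for rj in rs:
        r -= ((rj @ (G @ v)) / (rj @ (G @ rj))) * rj
    rs.append(r)
    v = G @ v
wz = np.array([abs((r / np.linalg.norm(r)) @ z) for r in rs])  # w_i^t z_l

c = np.cumsum(wz) / wz.sum()                        # c_{1l}, ..., c_{ll}
k = int(np.searchsorted(c, 0.8)) + 1                # first k with c_{kl} > 0.8
```

The $c_{kl}$ values increase strictly to $1$, and `k` is the smallest factor count whose cumulative covariance share clears the chosen threshold.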

7. Summary

This paper has studied various aspects of the regression vector estimate $\hat\beta_{kl}$ produced by CSR based on the model given by Eq. (2.4). In particular, it has shown that the subspace $V_{k,l}$ containing $\hat\beta_{kl}$ also contains a basis that maximizes certain covariances with respect to $P_l\tilde y$. Moreover, it has shown how the maximum covariance values affect $\hat\beta_{kl}$. Several alternative representations of $\hat\beta_{kl}$ have also been produced. In particular, the representation in Eq. (4.7) shows that $\hat\beta_{kl}$ is a modified version of the $l$-factor PCR regression vector $\hat\beta_{ll}$, with the modification occurring by a nonorthogonal projection. Additionally, these representations have enabled prediction properties associated with $\hat\beta_{kl}$ to be explicitly identified. This paper has made no distributional assumptions. There is much work to be done in that direction.


Appendix

A couple of claims were made in the preceding sections. Their proofs have been deferred to this section to prevent the earlier reading from becoming too technical.

Claim 1. Clearly $z_l = \gamma_1v_1 + \cdots + \gamma_lv_l$, where $\gamma_i = \sigma_i(u_i^t\tilde y)$. This implies that

$$(X^tX)^iz_l = \gamma_1(\sigma_1^2)^iv_1 + \cdots + \gamma_l(\sigma_l^2)^iv_l$$

for $i = 1,\dots,l-1$. Writing these equations in matrix form yields

$$\begin{pmatrix} z_l & (X^tX)z_l & \cdots & (X^tX)^{l-1}z_l \end{pmatrix} = \begin{pmatrix} v_1 & v_2 & \cdots & v_l \end{pmatrix}DV,$$

where $D$ is an $l \times l$ diagonal matrix with $\gamma_1,\dots,\gamma_l$ on its main diagonal, and $V$ is an $l \times l$ matrix whose $ij$th entry is $(\sigma_i^2)^{j-1}$. The matrix $V$ is a Vandermonde matrix. The assumptions about $u_i^t\tilde y$ imply that $D$ is invertible and the assumptions on the $\sigma_i$ imply that $V$ is invertible [6]. Thus, the vectors $z_l, (X^tX)z_l, \dots, (X^tX)^{l-1}z_l$ and $v_1, v_2, \dots, v_l$ are bases for $V_{l,l}$. Since the vectors $z_l, (X^tX)z_l, \dots, (X^tX)^{l-1}z_l$ form a basis, they are linearly independent, and consequently, any nonempty subset of them also consists of linearly independent vectors. Thus, $V_{k,l}$ is $k$-dimensional.

Claim 2. If $w_0 \in V_{s+1,l}$ then $w_0 = \gamma z_l + \gamma_1(X^tX)z_l + \cdots + \gamma_s(X^tX)^sz_l$. Thus, if $w_0^tw_0 = 1$ and $w_0^t(X^tX)^iz_l = 0$ for $i = 1,\dots,s$, then $1 = \gamma w_0^tz_l$, implying that $\gamma \neq 0$. Factoring out the $\gamma$ yields $w_0 = \gamma(z_l + (1/\gamma)[\gamma_1(X^tX)z_l + \cdots + \gamma_s(X^tX)^sz_l])$. Setting $z = (1/\gamma)[\gamma_1(X^tX)z_l + \cdots + \gamma_s(X^tX)^sz_l]$ shows $w_0 = \gamma(z_l + z)$, where $\gamma \neq 0$ and $z \in (X^tX)(V_{s,l})$.

References

[1] G. Bakken, T. Houghton, J. Kalivas, Cyclic subspace regression with analysis of wavelength selection criteria, Chemom. Intell. Lab. Syst. 45 (1998) 219–232.

[2] C. Goutis, Partial least squares algorithm yields shrinkage estimators, Ann. Statist. 24 (2) (1996) 816–824.

[3] R. Johnson, D. Wichern, Applied Multivariate Statistical Analysis, fifth ed., Prentice-Hall, Upper Saddle River, 2002.

[4] I. Jolliffe, Principal Component Analysis, second ed., Springer, New York, 2002.

[5] J. Kalivas, R. Green, Pareto optimal multivariate calibration for spectroscopic data, Appl. Spectrosc. 55 (2001) 1645–1652.

[6] P. Lancaster, M. Tismenetsky, The Theory of Matrices, second ed., Academic Press, Orlando, 1985.

[7] P. Lang, J. Brenchley, R. Nieves, J. Kalivas, Cyclic subspace regression, J. Multivariate Anal. 65 (1) (1998) 58–70.

[8] B. Noble, J. Daniel, Applied Linear Algebra, second ed., Prentice-Hall, Englewood Cliffs, NJ, 1977.

[9] G. Strang, Introduction to Linear Algebra, Wellesley-Cambridge Press, Wellesley, 1993.