
TESTING SIGNIFICANCE OF FEATURES BY LASSOED PRINCIPAL COMPONENTS

Daniela M. Witten1

Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305, USA. E-mail: dwitten@stanford.edu

Robert Tibshirani2

Departments of Health Research and Policy and Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305, USA. E-mail: tibs@stat.stanford.edu

Abstract

We consider the problem of testing the significance of features in high-dimensional settings. In

particular, we test for differentially-expressed genes in a microarray experiment. We wish to identify

genes that are associated with some type of outcome, such as survival time or cancer type. We propose

a new procedure, called Lassoed Principal Components (LPC), that builds upon existing methods

and can provide a sizable improvement. For instance, in the case of two-class data, a standard (albeit

simple) approach might be to compute a two-sample t-statistic for each gene. The LPC method

involves projecting these conventional gene scores onto the eigenvectors of the gene expression data

covariance matrix and then applying an L1 penalty in order to de-noise the resulting projections. We

present a theoretical framework under which LPC is the logical choice for identifying significant

genes, and we show that LPC can provide a marked reduction in false discovery rates over the

conventional methods on both real and simulated data. Moreover, this flexible procedure can be

applied to a variety of types of data and can be used to improve many existing methods for the

identification of significant features.

Keywords

Microarray; gene expression; multiple testing; feature selection

1. Introduction

In recent years new experimental technologies within the field of biology have led to data sets

in which the number of features p greatly exceeds the number of observations n. Two such

examples are gene expression data and data from genome-wide association studies. In the case

of gene expression (or microarray) data, it is often of interest to identify genes that are

differentially-expressed across conditions (for instance, some patients may have cancer and

others may not), or that are associated with some type of outcome, such as survival time. Such

genes might be used as features in a model for prediction or classification of the outcome in a

new patient, or they might be used as target genes for further experiments in order to better

understand the biological processes that contribute to the outcome.

© Institute of Mathematical Statistics, 2008

1Supported by a National Defense Science and Engineering Graduate Fellowship.

2Supported in part by NSF Grant DMS-99-71405 and the National Institutes of Health Contract N01-HV-28183.

SUPPLEMENTARY MATERIAL

Supplementary materials: Testing significance of features by lassoed principal components (DOI: 10.1214/08-AOAS182SUPP; .pdf). R code for simulations, details of variance derivations for latent variable model and supporting figures.

NIH Public Access Author Manuscript

Ann Appl Stat. Author manuscript; available in PMC 2009 September 14.

Published in final edited form as: Ann Appl Stat. 2008 September 1; 2(3): 986–1012. doi:10.1214/08-AOAS182SUPP.

Over the years, a number of methods have been developed for the identification of

differentially-expressed genes in a microarray experiment; for a review, see Cui and Churchill

(2003) or Allison et al. (2006). Usually, the association between a given gene and the outcome

is measured using a statistic that is a function of that gene only. Genes for which the statistic

exceeds a given value are considered to be differentially-expressed. Many methods combine

information, or borrow strength, across genes in order to make a more informed assessment of

the significance of a given gene. In the case of a two-class outcome, such methods include the

shrunken variance estimates of Cui et al. (2005), the empirical Bayes approach of Lönnstedt

and Speed (2002), the limma procedure of Smyth (2004) and the optimal discovery procedure

(ODP) of Storey, Dai and Leek (2007). We elaborate on the latter two procedures, as we will

compare them to our method throughout the paper in the case of a two-class outcome. Limma

assesses differential expression between conditions by forming a moderated t-statistic in which

posterior residual standard deviations are used instead of the usual standard deviation. The

ODP approach is quite different; it involves a generalization of the likelihood ratio statistic

such that an individual gene's significance is computed as a function of all of the genes in the

data set. In the case of a survival outcome, a standard method to assess a gene's significance

(and the one to which we will compare our proposed method in this paper) is the Cox score;

see, for example, Beer et al. (2002) and Bair and Tibshirani (2004).

We propose a new method, called Lassoed Principal Components (LPC), for the identification

of differentially-expressed genes. The LPC method is as follows. First, we compute scores for

each gene using an existing method, such as those mentioned above. These scores are then

regressed onto the eigenarrays of the data [Alter, Brown and Botstein (2000)]—principal

components of length equal to the number of genes—with an L1 constraint. The resulting fitted

values are the LPC scores. In this paper we demonstrate that LPC scores can result in more

accurate gene rankings than the conventional methods, and we present theoretical justifications

for the use of the LPC method.

Our method has two main advantages over existing methods:

1. LPC borrows strength across genes in an explicit manner. This benefit is rooted in

the biological context of the data. In biological systems genes that are involved in the

same biological process, pathway, or network may be co-regulated; if so, such genes

may exhibit similar patterns of expression. One would not expect a single gene to be

associated with the outcome, since, in practice, many genes work together to effect a

particular phenotype. LPC effectively down-weights individual genes that are

associated with the outcome but that do not share an expression pattern with a larger

group of genes, and instead favors large groups of genes that appear to be

differentially-expressed. By implicitly using prior knowledge about what types of

genes one expects to be differentially-expressed, LPC achieves improved power over

existing methods in many examples.

2. LPC can be applied on top of any existing method (regardless of outcome type) in

order to obtain potentially more accurate measures of differential expression. For

instance, in the case of a two-class outcome, many methods to detect differentially-

expressed genes exist, including SAM [Tusher, Tibshirani and Chu (2001)], limma

[Smyth (2004)] and ODP [Storey, Dai and Leek (2007)], as mentioned earlier. By

projecting any of these methods' gene scores onto the eigenarrays of the data with an

L1 constraint, we harness the power of these methods while incorporating the

biological context.


The idea behind this method is that a gene that resembles the outcome is more likely to be

significant if it is one of many genes with similar expression patterns than if it resembles no

other genes. From a biological standpoint, this is due to the hypothesis that a gene that truly is

associated with an outcome (such as cancer) will be involved in a biological pathway related

to the outcome that involves many genes. Other genes in the pathway may exhibit similar

expression patterns due to co-regulation. From a statistical standpoint, it is due to the fact that

while variance in the genes' expression levels may occasionally cause an individual

nonsignificant gene to be correlated with the outcome by chance, it is statistically quite unlikely

that a great number of genes will all be correlated with the outcome and with each other due

solely to random noise. By regressing the conventional gene scores onto the eigenarrays of the

gene expression data and using the fitted values to rank genes, we essentially only rank highly

the genes that have moderate to high gene scores and have large loadings in an eigenarray that

is correlated with the vector of gene scores. Thus, individual genes with expression patterns

that do not resemble those of other genes in the data set are not given high rankings by our

method. Genes with moderate scores that resemble a large block of genes with high scores are

given high LPC scores; they borrow strength from genes with similar expression profiles.

The LPC method bears similarities to the surrogate variable analysis (SVA) method of Leek

and Storey (2007). SVA attempts to adjust for expression heterogeneity among samples,

whereas we try to exploit heterogeneity in order to obtain more accurate gene scores. The

effects of the two methods are quite different, and the methods can be used together, as is shown

in Appendix D.2.

A simulated two-class example helps to illustrate how LPC can outperform standard

approaches. Suppose that expression profiles come from either cancer or normal tissue, and

that genes over-expressed in cancer also happen to be under-expressed in older individuals

(Figure 1; see the Supplementary Materials for R code for this simulation [Witten and

Tibshirani (2008)]). The expression of these genes is a function of both patient age and tissue

type. If it is known by the data analyst that age affects gene expression, then age can be used

as a covariate in determining whether a gene is differentially-expressed. However, in practice,

factors that affect gene expression are often unknown or unmeasured, and so are not included

as covariates. In this case, a two-sample t-statistic will have limited success in identifying the

genes associated with cancer type, because the age effect masks some of the correlation

between cancer type and gene expression. On the other hand, applying the LPC method to the

two-sample t-statistics results in high scores for the differentially-expressed genes, because

these genes will have high loadings on the eigenarray that is most correlated with the cancer

type. The resulting gene scores can be seen in Figure 2.

LPC is not restricted to the two-class problem, and findings in the context of survival outcomes

indicate its potential promise. Figure 3 shows the estimated false discovery rates in detecting

genes in lymphoma patients that are associated with altered survival. LPC clearly outperforms

a standard gene-specific analysis based on Cox scores (see Section 3.3).

The paper is organized as follows. In Section 2 we present the details of the LPC method, as

well as some theoretical results that justify its use. Then, in Section 3 we demonstrate by

example that LPC outperforms the conventional gene scores in simulations and on published

microarray data sets for two-class and survival outcomes. We also present a method for false

discovery rate estimation for LPC, which is described in greater detail in Appendix C. Section

4 contains the Discussion.


2. The LPC method

2.1. Description

Let X denote an n × p matrix of log transformed gene expression levels, where n is the number

of observations and p is the number of genes, and n ≪ p. Assume that the columns of X (the

genes) have been centered so that they have mean zero across all of the observations. Let $X_j$

denote the vector of expression levels for gene $j$. Let y denote a vector of length n containing

the outcomes for each observation. For instance, if this is a two-class problem, then y will be

a binary vector.

The LPC method involves using existing gene scores to develop LPC scores that aim to provide

a more accurate ranking of genes in terms of differential-expression. In principle, a wide variety

of gene scores could be used; however, in the simplest version of LPC one would use one of

the methods in Table 1, depending on the outcome variable. In the examples analyzed, a small

constant is added to the denominators of the gene scores in Table 1 in order to avoid large ratios

resulting from small estimated standard deviations; see, for example, Tusher, Tibshirani and

Chu (2001). We will refer to the statistics in Table 1 as T. For ease of notation, unless we

specify otherwise, the LPC scores discussed in this paper are formed by applying the LPC

method to these T scores. $T_j$ will refer to the gene score for gene $j$. The LPC method is as follows

for the quantitative, survival and two-class cases:

LPC Algorithm

1. Compute $T$, the vector of length $p$ with components $T_j$, the score for gene $j$, for $j \in 1, \ldots, p$.

2. Compute $v_1, \ldots, v_n$, the eigenarrays of $X$, where the $v_i$'s are the columns of the matrix $V$ in the singular value decomposition (SVD) of $X$, $X = UDV^T$.

3. For some value of the parameter $\lambda$, fit the model

$$T = \sum_{i=1}^{n} \beta_i v_i + \varepsilon,$$

where the vector $\beta$ with components $\beta_i$ is chosen to minimize the quantity

$$\Big\| T - \sum_{i=1}^{n} \beta_i v_i \Big\|^2 + \lambda \sum_{i=1}^{n} |\beta_i|.$$

This is multiple linear regression with an L1 constraint, also known as the “lasso” [Tibshirani (1996)].

4. Let $\hat{T}$ denote the fitted values obtained by the above model. The LPC score for gene $j$ is $\hat{T}_j$.

In the case of a multiple-class response, the procedure is slightly different, and is presented in

Appendix B.
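Steps 1 and 2 of the LPC Algorithm can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `two_sample_scores` and `eigenarrays`, the pooled-variance form of the t-statistic, and the default value of the small constant `s0` are all hypothetical choices made here, since the exact statistics of Table 1 are not reproduced in this section.

```python
import numpy as np

def two_sample_scores(X, y, s0=0.1):
    """Two-sample t-type score for each gene (column of X).

    X  : (n, p) array of expression levels.
    y  : length-n array of 0/1 class labels.
    s0 : small constant added to the denominator (assumed value),
         guarding against large ratios from tiny standard deviations.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    g1, g0 = X[y == 1], X[y == 0]
    n1, n0 = g1.shape[0], g0.shape[0]
    diff = g1.mean(axis=0) - g0.mean(axis=0)
    # Pooled within-class variance for each gene (an assumption of this sketch).
    pooled_var = ((n1 - 1) * g1.var(axis=0, ddof=1)
                  + (n0 - 1) * g0.var(axis=0, ddof=1)) / (n1 + n0 - 2)
    se = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n0))
    return diff / (se + s0)

def eigenarrays(X):
    """Return V (p x n), whose columns are the eigenarrays of the
    column-centered X, together with the singular values."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)            # center each gene across observations
    _, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt.T, d
```

Calling `two_sample_scores(X, y)` gives the length-p vector T, and `eigenarrays(X)` gives the matrix V whose orthonormal columns are the regressors used in Step 3.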

In Step 3 of the LPC algorithm, fitting a linear model with an L1 constraint is very fast, because

we are regressing the scores T on the columns of V, which are orthogonal. We use the following

soft thresholding approach in order to obtain the lasso coefficients:

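Soft-Thresholding Algorithm

1. Compute $\hat{\beta}$, the vector of coefficients obtained by regressing $T$ on the eigenarrays of $X$ using ordinary multiple least squares; that is, $\hat{\beta} = V^T T$.

2. Let $\hat{\beta}_i^{\lambda} = \mathrm{sgn}(\hat{\beta}_i)\,(|\hat{\beta}_i| - \lambda/2)_+$, $\forall i \in 1, \ldots, n$.

3. Compute $\hat{T} = \sum_{i=1}^{n} \hat{\beta}_i^{\lambda} v_i$; these are the LPC scores.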
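The soft-thresholding steps above can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and the threshold `lam / 2` assumes that the lasso criterion is the residual sum of squares plus λ times the L1 norm of the coefficients. A different scaling of the criterion would change the threshold by a constant factor.

```python
import numpy as np

def soft_threshold(b, t):
    """Shrink each entry of b toward zero by t, setting small entries to zero."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

def lpc_scores(X, T, lam):
    """LPC scores obtained by soft-thresholding in the eigenarray basis.

    X   : (n, p) expression matrix (centered column-wise inside the function).
    T   : length-p vector of conventional gene scores.
    lam : shrinkage parameter lambda (scaling convention assumed, see lead-in).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt.T                                   # columns are the eigenarrays
    beta_ols = V.T @ T                         # Step 1: least squares on orthonormal columns
    beta_lasso = soft_threshold(beta_ols, lam / 2.0)   # Step 2
    return V @ beta_lasso, beta_lasso          # Step 3: fitted values are the LPC scores
```

Because the columns of V are orthonormal, the least squares fit reduces to a single matrix product and the lasso solution separates coordinate by coordinate, which is why no iterative solver is needed.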


The LPC algorithm involves a shrinkage parameter, λ, which determines the amount of

regularization performed in the L1-constrained regression. An automated method for the

selection of the value for this parameter is presented in Appendix A.

Returning to the example from the Introduction, the value of the shrinkage parameter λ chosen

by our automated method was 5.5. This resulted in β̂1 nonzero and β̂i = 0 for i ∈ 2, …, n. The

first eigenarray is associated with the response.

In this example, LPC's success stems from the fact that the L1 constraint resulted in a nonzero

coefficient only for the correct eigenarray.

In the case of a quantitative response, the T score for gene $j$ is the quantitative-outcome statistic listed in Table 1. Suppose that the genes have been scaled appropriately so that the T scores are simply $X^T y$. From the LPC Algorithm and the Soft-Thresholding Algorithm, the LPC scores are given by the formula

$$\hat{T} = \sum_{i=1}^{n} \mathrm{sgn}(v_i^T X^T y)\,\big(|v_i^T X^T y| - \lambda/2\big)_+\, v_i,$$

where the columns of $V$ are linear combinations of the rows of $X$. The vector $X^T y$ is itself a linear combination of the rows of $X$, so it lies in the span of the $v_i$ and $V V^T X^T y = X^T y$. Therefore, if λ = 0 (i.e., in the absence of an L1 constraint), the LPC scores equal the T scores exactly. This means that T is a special case of LPC. This leads us to hope that if, on a given data set, T outperforms LPC with nonzero λ, our adaptive method of choosing λ will set λ to zero. If this is the case, then we will always end up using the approach that works best on a particular data set. A similar result holds for the case of a two-class response. Note that, in practice, however, one usually does not scale the genes as described here.
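The claim that the LPC scores reduce to the T scores at λ = 0 can be checked numerically. The sketch below uses simulated data with arbitrary dimensions and takes T = X^T y directly, as in the scaled setting described above. The helper `lpc` and the `lam / 2` threshold are assumptions of this illustration (residual sum of squares plus λ times the L1 norm), not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 40                         # arbitrary sizes with n much smaller than p

X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                  # center each gene (column)
y = rng.normal(size=n)
y -= y.mean()                        # center the quantitative outcome

T = X.T @ y                          # T scores in the scaled setting
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                             # columns are the eigenarrays

def lpc(T, V, lam):
    """Soft-threshold the least squares coefficients and map back to genes."""
    b = V.T @ T
    b = np.sign(b) * np.maximum(np.abs(b) - lam / 2.0, 0.0)
    return V @ b

lpc_zero = lpc(T, V, 0.0)            # no L1 penalty
lpc_pos = lpc(T, V, 1.0)             # some shrinkage

same_at_zero = np.allclose(lpc_zero, T)
differs_when_penalized = not np.allclose(lpc_pos, T)
```

Here `same_at_zero` holds because X^T y lies in the row space of X, which the eigenarrays span, whereas any positive λ shrinks the nonzero coefficients and moves the scores away from T.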

2.2. Motivating LPC via an underlying latent variable model

Consider a scenario in which a subset of genes is associated with the outcome because some

underlying process, or “latent variable,” affects both the expression of the genes and the

outcome measurements. In Appendix D.1 it is shown that in such a situation, under suitable

assumptions, LPC scores will have lower variance than T scores. This justifies the use of LPC

in situations where a latent variable model could reasonably describe the data set of interest.

2.3. Relationship with the eigengene space

In microarray data analysis the principal components of the columns of X are referred to as the

eigengenes, and the principal components of the rows of X are referred to as the eigenarrays.

We are interested in identifying significant genes; therefore, it may seem peculiar that our

proposed method works in the space of eigenarrays rather than in the space of eigengenes. For

instance, Bair and Tibshirani (2004) and Bair et al. (2006) perform supervised principal

components analysis in the eigengene space. We show here that, in the simple case of a

quantitative outcome, working in the eigenarray space is quite similar to working in the

eigengene space, but has a distinct advantage.

Assume that the columns of X are centered and that the quantitative outcome y is centered. Let

$X = UDV^T$ denote the singular value decomposition for X. Let T denote the vector of T statistics.

The LPC method fits the linear model with L1 constraint on the coefficients

$$T = \sum_{i=1}^{n} \beta_i v_i + \varepsilon, \tag{2.1}$$

where $E(\varepsilon) = 0$ and $v_i$ is the $i$th right singular vector of the gene expression data; in other words,

it is the $i$th eigenarray of the original data X.
