Testing significance of features by lassoed principal components

Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305, USA.
The Annals of Applied Statistics (Impact Factor: 1.46). 09/2008; 2(3):986-1012. DOI: 10.1214/08-AOAS182SUPP
Source: PubMed


We consider the problem of testing the significance of features in high-dimensional settings. In particular, we test for differentially expressed genes in a microarray experiment. We wish to identify genes that are associated with some type of outcome, such as survival time or cancer type. We propose a new procedure, called Lassoed Principal Components (LPC), that builds upon existing methods and can provide a sizable improvement. For instance, in the case of two-class data, a standard (albeit simple) approach might be to compute a two-sample t-statistic for each gene. The LPC method involves projecting these conventional gene scores onto the eigenvectors of the gene expression data covariance matrix and then applying an L1 penalty in order to de-noise the resulting projections. We present a theoretical framework under which LPC is the logical choice for identifying significant genes, and we show that LPC can provide a marked reduction in false discovery rates over the conventional methods on both real and simulated data. Moreover, this flexible procedure can be applied to a variety of types of data and can be used to improve many existing methods for the identification of significant features.
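
The three steps named in the abstract (per-gene scores, projection onto covariance eigenvectors, L1 de-noising) can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names, the use of Welch t-statistics, the penalty level `lam`, and the use of soft-thresholding are all choices made here. Soft-thresholding is used because, when the eigenvectors are orthonormal, it is the closed-form solution of an L1-penalized least-squares fit of the scores on those eigenvectors. The published procedure may differ in details such as how the penalty is chosen and which components are retained.

```python
import numpy as np


def two_sample_t(X, y):
    """Welch two-sample t-statistic for each gene.

    X is a (samples x genes) array; y holds 0/1 class labels per sample.
    """
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (b.mean(axis=0) - a.mean(axis=0)) / se


def soft_threshold(c, lam):
    """Shrink each coefficient toward zero by lam (the lasso shrinkage operator)."""
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)


def lpc_sketch(X, y, lam):
    """Hypothetical LPC-style scores: project t-statistics, shrink, map back.

    Returns (denoised_scores, raw_t_statistics), each of length n_genes.
    """
    t = two_sample_t(X, y)

    # The right singular vectors of the column-centered data are the
    # eigenvectors of the genes' sample covariance matrix.
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vt = Vt[s > 1e-10 * s.max()]  # drop directions with (numerically) zero variance

    coef = Vt @ t                       # projections of the scores onto each eigenvector
    shrunk = soft_threshold(coef, lam)  # L1 penalty with an orthonormal design
    return Vt.T @ shrunk, t             # de-noised scores in gene space
```

With `lam=0` the sketch simply projects the t-statistics onto the span of the retained eigenvectors; larger values of `lam` zero out weak projections, and a very large `lam` shrinks every score to zero.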

  • Source
    • "Recently, Witten and Tibshirani [7] introduced the notion of Lassoed principal components for identifying differentially-expressed genes, and considered the problem of testing the significance of features in high dimensional data. Our approach is rather different and is designed to be used after satisfactory PCA has been achieved rather than, as in other methods, to produce principal components with particular characteristics (e.g., some coefficients that are zero) such that only interpretable principal components are produced."
    ABSTRACT: In this article, we introduce a procedure for selecting variables in principal components analysis. The procedure was developed to identify a small subset of the original variables that "best explain" the principal components through nonparametric relationships. There are usually some "noisy" uninformative variables in a dataset, and some variables that are strongly related to each other because of their general interdependence. The procedure is designed to be used following the satisfactory initial use of a principal components analysis with all variables, and its aim is to help to interpret underlying structures, particularly in high dimensional data. We analyse the asymptotic behaviour of the method and provide an example by applying the procedure to some real data.
    Full-text · Article · Aug 2013
  • Source
    ABSTRACT: Group-wise pattern analysis of genes, known as gene-set analysis (GSA), addresses the differential expression pattern of biologically pre-defined gene sets. GSA exhibits high statistical power and has revealed many novel biological processes associated with specific phenotypes. In most cases, however, GSA relies on the invalid assumption that the members of each gene set are sampled independently, which increases false predictions. We propose an algorithm, termed DECO, to remove (or alleviate) the bias caused by the correlation of the expression data in GSAs. This is accomplished through the eigenvalue-decomposition of covariance matrixes and a series of linear transformations of data. In particular, moderate de-correlation methods that truncate or re-scale eigenvalues were proposed for a more reliable analysis. Tests of simulated and real experimental data show that DECO effectively corrects the correlation structure of gene expression and improves the prediction accuracy (specificity and sensitivity) for both gene- and sample-randomizing GSA methods. The MATLAB codes and the tested data sets are available at or from the author.
    Full-text · Article · Sep 2010 · Bioinformatics
  • Source
    ABSTRACT: Multivariable regression models are widely used in health science research, mainly for two purposes: prediction and effect estimation. Various strategies have been recommended when building a regression model: a) use the right statistical method that matches the structure of the data; b) ensure an appropriate sample size by limiting the number of variables according to the number of events; c) prevent or correct for model overfitting; d) be aware of the problems associated with automatic variable selection procedures (such as stepwise), and e) always assess the performance of the final model in regard to calibration and discrimination measures. If resources allow, validate the prediction model on external data.
    Preview · Article · Jun 2011 · Revista Española de Cardiología