Effective Dimension Reduction Using Sequential Projection Pursuit on Gene Expression Data for Cancer Classification.
ABSTRACT Motivation: Classification is a powerful tool for uncovering interesting phenomena, for example classes of cancer, in microarray data. Because the number of observations (n) is small relative to the number of variables (p), the genes, classification on microarray data is challenging. Multivariate dimension reduction techniques are therefore commonly used as a precursor to classification of microarray data; typically this is principal component analysis (PCA) or singular value decomposition (SVD). Since PCA and SVD are concerned with explaining the variance-covariance structure of the data, they may not be the best choice when the between-cluster variance is smaller than the within-cluster variance. Recently, sequential projection pursuit (SPP), an attractive alternative to PCA that is designed to elicit clustering tendencies in the data, has been introduced. Thus, in some cases SPP may be more appropriate when performing clustering or classification analysis. Results: We compare the performance of SPP to PCA on two cancer gene expression datasets related to leukemia and colon cancer. Using PCA and SPP to reduce the dimensionality of the data to m
- Source available from: Anestis Antoniadis
ABSTRACT: MOTIVATION: One particular application of microarray data is to uncover the molecular variation among cancers. A characteristic feature of microarray studies is that the number n of samples collected is relatively small compared to the number p of genes per sample, which is usually in the thousands. In statistical terms, this very large number of predictors compared to a small number of samples or observations makes the classification problem difficult. An efficient way to solve this problem is to use statistical dimension reduction techniques in conjunction with nonparametric discriminant procedures. RESULTS: We view the classification problem as a regression problem with few observations and many predictor variables. We use an adaptive dimension reduction method for generalized semi-parametric regression models that allows us to solve the 'curse of dimensionality' problem arising in the context of expression data. The predictive performance of the resulting classification rule is illustrated on two well-known data sets in the microarray literature: the leukemia data, which is known to contain classes that are easily 'separable', and the colon data set.
Bioinformatics 04/2003; 19(5):563-70. DOI:10.1093/bioinformatics/btg062 · 4.62 Impact Factor
Article: Projection methods in chemistry
ABSTRACT: Visualization of a data set's structure is one of the most challenging goals in data mining. Chemical data sets are often multidimensional, so their structure cannot be visualized directly. To overcome this problem, the original data is compressed to a few new features using projection techniques that preserve the original data structure as well as possible and allow it to be visualized. In this paper, a survey of different projection techniques, linear and nonlinear, is given. Their performance is illustrated on chemical data sets, and their advantages and disadvantages are pointed out.
Chemometrics and Intelligent Laboratory Systems 01/2003; 65(1-65):97-112. DOI:10.1016/S0169-7439(02)00107-7 · 2.38 Impact Factor