Effect of training-sample size and classification difficulty on the accuracy of genomic predictors

Bioinformatics Core Facility, Swiss Institute of Bioinformatics, Génopode Building, Quartier Sorge, Lausanne CH-1015, Switzerland.
Breast Cancer Research: BCR (Impact Factor: 5.88). 01/2010; 12(1):R5. DOI: 10.1186/bcr2468
Source: PubMed

ABSTRACT As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints.
We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set.
A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models.
We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.
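
The study design summarized above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the method and endpoint names are placeholders (the abstract gives only their counts), and the sample count, number of folds, and number of bootstrap replicates are assumptions chosen for the example. The sketch enumerates the 5 x 8 x 3 grid of models and shows how cross-validation and bootstrap resampling split training-set indices.

```python
import itertools
import random

# Placeholder labels: the abstract gives only the counts, not the names.
FEATURE_SELECTORS = [f"fs_{i}" for i in range(1, 6)]   # 5 univariate methods
CLASSIFIERS = [f"clf_{i}" for i in range(1, 9)]        # 8 classifiers
ENDPOINTS = [f"endpoint_{i}" for i in range(1, 4)]     # 3 clinical endpoints


def model_grid():
    """Every (endpoint, feature selector, classifier) combination."""
    return list(itertools.product(ENDPOINTS, FEATURE_SELECTORS, CLASSIFIERS))


def kfold_splits(n, k, rng):
    """Yield (train, test) index lists for one shuffled k-fold partition."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def bootstrap_splits(n, replicates, rng):
    """Yield (train, test): train is drawn with replacement, test is out-of-bag."""
    for _ in range(replicates):
        train = [rng.randrange(n) for _ in range(n)]
        in_bag = set(train)
        test = [j for j in range(n) if j not in in_bag]
        yield train, test


rng = random.Random(0)  # fixed seed so the example is reproducible
grid = model_grid()
# n=100, k=5 and replicates=10 are hypothetical values, not from the study.
cv = list(kfold_splits(n=100, k=5, rng=rng))
boot = list(bootstrap_splits(n=100, replicates=10, rng=rng))
print(len(grid))  # 120 models: 40 predictors for each of 3 endpoints
```

In both schemes a predictor would be fitted on the `train` indices and scored on the `test` indices; the abstract reports that the bootstrap estimates were closer to the validation performance than the cross-validation estimates.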

Available from: Vlad Popovici, Aug 17, 2015
  • Source
    • "All classifications are performed by nearest mean classifiers (NMC). We chose the NMC for the following reasons: (i) the NMC provides performances comparable to other classifiers on expression data (Wessels et al., 2005; Popovici et al., 2010), (ii) the NMC is a simple base-line classifier, and (iii) compared to other non-linear classifiers it offers an easier way to biologically interpret the use of features. "
    ABSTRACT: Integrating gene expression data with secondary data such as pathway or protein-protein interaction data has been proposed as a promising approach for improved outcome prediction of cancer patients. Methods employing this approach usually aggregate the expression of genes into new composite features, while the secondary data guide this aggregation. Previous studies were limited to few data sets with a small number of patients. Moreover, each study used different data and evaluation procedures. This makes it difficult to objectively assess the gain in classification performance. Here we introduce the Amsterdam Classification Evaluation Suite (ACES). ACES is a Python package to objectively evaluate classification and feature-selection methods and contains methods for pooling and normalizing Affymetrix microarrays from different studies. It is simple to use and therefore facilitates the comparison of new approaches to best-in-class approaches. In addition to the methods described in our earlier study (Staiger et al., 2012), we have included two prominent prognostic gene signatures specific for breast cancer outcome, one more composite feature selection method and two network-based gene ranking methods. Employing the evaluation pipeline we show that current composite-feature classification methods do not outperform simple single-gene classifiers in predicting outcome in breast cancer. Furthermore, we find that the stability of features across different data sets is also not higher for composite features. Most strikingly, we observe that prediction performances are not affected when extracting features from randomized PPI networks.
    Frontiers in Genetics 12/2013; 4:289. DOI:10.3389/fgene.2013.00289
  • Source
    • "Affymetrix control probesets were discarded after probeset summarisation. The concept of classification difficulty estimation was introduced in Popovici et al. (2010) for the purpose of exploring associations between variables and a binary endpoint. In brief, for each probeset a squared t-score is evaluated and the cumulative sum of the ordered squares is plotted to evaluate whether or not there is a strong association between the endpoint values and the data. "
    ABSTRACT: Model selection between competing models is a key consideration in the discovery of prognostic multigene signatures. The use of appropriate statistical performance measures as well as verification of biological significance of the signatures is imperative to maximise the chance of external validation of the generated signatures. Current approaches in time-to-event studies often use only a single measure of performance in model selection, such as logrank test p-values, or dichotomise the follow-up times at some phase of the study to facilitate signature discovery. In this study we improve the prognostic signature discovery process through the application of the multivariate partial Cox model combined with the concordance index, hazard ratio of predictions, independence from available clinical covariates and biological enrichment as measures of signature performance. The proposed framework was applied to discover prognostic multigene signatures from early breast cancer data. The partial Cox model combined with the multiple performance measures was used both in guiding the selection of the optimal panel of prognostic genes and in predicting risk within cross validation, without dichotomising the follow-up times at any stage. The signatures were successfully externally cross validated in independent breast cancer datasets, yielding a hazard ratio of 2.55 [1.44, 4.51] for the top ranking signature.
    Statistical Applications in Genetics and Molecular Biology 10/2013; 12(5):619-35. DOI:10.1515/sagmb-2012-0047 · 1.52 Impact Factor
  • Source
    • "All methods, simple or complicated ones, are actually susceptible to overfitting to a certain degree. The large MicroArray Quality Consortium II, led by the FDA has recently shown that variations on univariate gene selection methods and prediction rules have only a modest impact on performance (Shi et al., 2010) and several statistically equally good predictors can be developed for any given classification problem (Popovici et al., 2010). Still the same question remains: how should the huge amount of biomarker information be summarized into one single multi-biomarker equation? "
    ABSTRACT: A typical array experiment yields at least tens of thousands of measurements on often not more than a hundred patients, a situation often denoted as the curse of dimensionality. With a focus on prognostic multi-biomarker scores derived from microarrays, we highlight the multidimensionality of the problem and the issues in the multidimensionality of the data. We go over several statistical challenges raised by this curse occurring in each step of microarray analysis on patient data, from the hypothesis and the experimental design to the analysis methods, interpretation of results and clinical utility. Different analytical tools and solutions to answer these challenges are provided and discussed.
    Molecular oncology 02/2011; 5(2):190-6. DOI:10.1016/j.molonc.2011.01.002 · 5.94 Impact Factor
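
The classification-difficulty estimate quoted in the second citing article (a squared t-score per probeset, then the cumulative sum of the ordered squares) can be sketched as follows. This is a minimal illustration under stated assumptions, not the published procedure: the excerpt does not say which t-statistic is used or in which direction the squares are ordered, so the sketch assumes a Welch t-statistic and decreasing order. The function names and the toy data are hypothetical.

```python
import math
from statistics import mean, variance


def squared_t_score(values, labels):
    """Squared Welch t-statistic for one probeset (assumed variant).

    Needs at least two samples in each class, labelled 0 and 1.
    """
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    se2 = variance(a) / len(a) + variance(b) / len(b)
    diff = mean(a) - mean(b)
    if se2 == 0:
        # Degenerate case: no within-class spread at all.
        return 0.0 if diff == 0 else math.inf
    return diff ** 2 / se2


def difficulty_curve(expression, labels):
    """Cumulative sum of squared t-scores, largest first (assumed order)."""
    scores = sorted(
        (squared_t_score(row, labels) for row in expression), reverse=True
    )
    curve, total = [], 0.0
    for s in scores:
        total += s
        curve.append(total)
    return curve


# Toy data: one row per probeset, one column per sample (hypothetical values).
labels = [0, 0, 0, 1, 1, 1]
expression = [
    [5.0, 4.0, 6.0, 5.0, 6.0, 4.0],  # no difference between the classes
    [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],  # strong difference between the classes
]
curve = difficulty_curve(expression, labels)
```

Plotting `curve` against the probeset rank gives the kind of plot the excerpt describes: a curve that rises steeply over the first few probesets suggests a strong association between the data and the endpoint, while a flat curve suggests a harder classification problem.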