Effect of training-sample size and classification difficulty on the accuracy of genomic predictors

Bioinformatics Core Facility, Swiss Institute of Bioinformatics, Génopode Building, Quartier Sorge, Lausanne CH-1015, Switzerland.
Breast cancer research: BCR (Impact Factor: 5.49). 01/2010; 12(1):R5. DOI: 10.1186/bcr2468
Source: PubMed


As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints.
We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set.
A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models.
We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.

Download full-text


Available from: Vlad Popovici
  • Source
    • "All classifications are performed by nearest mean classifiers (NMC). We chose the NMC for the following reasons: (i) the NMC provides performances comparable to other classifiers on expression data (Wessels et al., 2005; Popovici et al., 2010), (ii) the NMC is a simple base-line classifier, and (iii) compared to other non-linear classifiers it offers an easier way to biologically interpret the use of features. "
    [Show abstract] [Hide abstract]
    ABSTRACT: Integrating gene expression data with secondary data such as pathway or protein-protein interaction data has been proposed as a promising approach for improved outcome prediction of cancer patients. Methods employing this approach usually aggregate the expression of genes into new composite features, while the secondary data guide this aggregation. Previous studies were limited to few data sets with a small number of patients. Moreover, each study used different data and evaluation procedures. This makes it difficult to objectively assess the gain in classification performance. Here we introduce the Amsterdam Classification Evaluation Suite (ACES). ACES is a Python package to objectively evaluate classification and feature-selection methods and contains methods for pooling and normalizing Affymetrix microarrays from different studies. It is simple to use and therefore facilitates the comparison of new approaches to best-in-class approaches. In addition to the methods described in our earlier study (Staiger et al., 2012), we have included two prominent prognostic gene signatures specific for breast cancer outcome, one more composite feature selection method and two network-based gene ranking methods. Employing the evaluation pipeline we show that current composite-feature classification methods do not outperform simple single-genes classifiers in predicting outcome in breast cancer. Furthermore, we find that also the stability of features across different data sets is not higher for composite features. Most stunningly, we observe that prediction performances are not affected when extracting features from randomized PPI networks.
    Full-text · Article · Dec 2013 · Frontiers in Genetics
  • Source
    • "Specifically, we used an all-purpose gene/protein interaction network based on the STRING database [13], into which large genome-scale datasets, assembled from more than 200 patients from various breast cancer subtypes [14] were integrated. Patient collectives of this size enable unprecedented statistical power and robustness despite subgroup differences. "
    [Show abstract] [Hide abstract]
    ABSTRACT: In silico approaches are increasingly considered to improve breast cancer treatment. One of these treatments, neoadjuvant TFAC chemotherapy, is used in cases where application of preoperative systemic therapy is indicated. Estimating response to treatment allows or improves clinical decision-making and this, in turn, may be based on a good understanding of the underlying molecular mechanisms. Ever increasing amounts of high throughput data become available for integration into functional networks. In this study, we applied our software tool ExprEssence to identify specific mechanisms relevant for TFAC therapy response, from a gene/protein interaction network. We contrasted the resulting active subnetwork to the subnetworks of two other such methods, OptDis and KeyPathwayMiner. We could show that the ExprEssence subnetwork is more related to the mechanistic functional principles of TFAC therapy than the subnetworks of the other two methods despite the simplicity of ExprEssence. We were able to validate our method by recovering known mechanisms and as an application example of our method, we identified a mechanism that may further explain the synergism between paclitaxel and doxorubicin in TFAC treatment: Paclitaxel may attenuate MELK gene expression, resulting in lower levels of its target MYBL2, already associated with doxorubicin synergism in hepatocellular carcinoma cell lines. We tested our hypothesis in three breast cancer cell lines, confirming it in part. In particular, the predicted effect on MYBL2 could be validated, and a synergistic effect of paclitaxel and doxorubicin could be demonstrated in the breast cancer cell lines SKBR3 and MCF-7.
    Full-text · Article · Dec 2013 · PLoS ONE
  • Source
    • "Affymetrix control probesets were discarded after probeset summarisation. The concept of classification difficulty estimation was introduced in Popovici et al. (2010) for the purpose of exploring associations between variables and a binary endpoint. In brief, for each probeset a squared t-score is evaluated and the cumulative sum of the ordered squares is plotted to evaluate whether or not there is a strong association between the endpoint values and the data. "
    [Show abstract] [Hide abstract]
    ABSTRACT: Abstract Model selection between competing models is a key consideration in the discovery of prognostic multigene signatures. The use of appropriate statistical performance measures as well as verification of biological significance of the signatures is imperative to maximise the chance of external validation of the generated signatures. Current approaches in time-to-event studies often use only a single measure of performance in model selection, such as logrank test p-values, or dichotomise the follow-up times at some phase of the study to facilitate signature discovery. In this study we improve the prognostic signature discovery process through the application of the multivariate partial Cox model combined with the concordance index, hazard ratio of predictions, independence from available clinical covariates and biological enrichment as measures of signature performance. The proposed framework was applied to discover prognostic multigene signatures from early breast cancer data. The partial Cox model combined with the multiple performance measures were used in both guiding the selection of the optimal panel of prognostic genes and prediction of risk within cross validation without dichotomising the follow-up times at any stage. The signatures were successfully externally cross validated in independent breast cancer datasets, yielding a hazard ratio of 2.55 [1.44, 4.51] for the top ranking signature.
    Full-text · Article · Oct 2013 · Statistical Applications in Genetics and Molecular Biology
Show more