Article

Effect of training-sample size and classification difficulty on the accuracy of genomic predictors

Bioinformatics Core Facility, Swiss Institute of Bioinformatics, Génopode Building, Quartier Sorge, Lausanne CH-1015, Switzerland.
Breast cancer research: BCR (Impact Factor: 5.88). 01/2010; 12(1):R5. DOI: 10.1186/bcr2468
Source: PubMed

ABSTRACT As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints.
We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set.
A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models.
We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations in the univariate feature-selection method and the choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.
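The comparison between resampling estimates described in the abstract can be illustrated with a minimal sketch. It is not the study's pipeline: the data are synthetic, the single-feature nearest-centroid classifier is a toy stand-in, and the out-of-bag bootstrap variant, the fold count, and all function names (`fit_centroids`, `cross_validation`, `bootstrap_oob`) are assumptions made for illustration only.

```python
import random
from statistics import mean


def fit_centroids(xs, ys):
    """Toy nearest-centroid model on one feature: store the mean per class."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label)
            for label in set(ys)}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def accuracy(train_idx, test_idx, xs, ys):
    """Fit on the training indices, return accuracy on the test indices."""
    centroids = fit_centroids([xs[i] for i in train_idx],
                              [ys[i] for i in train_idx])
    hits = sum(predict(centroids, xs[i]) == ys[i] for i in test_idx)
    return hits / len(test_idx)


def cross_validation(xs, ys, k, rng):
    """k-fold cross-validation: each sample is tested exactly once."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    scores = []
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        scores.append(accuracy(train, folds[f], xs, ys))
    return mean(scores)


def bootstrap_oob(xs, ys, rounds, rng):
    """Out-of-bag bootstrap: train on a resample drawn with replacement,
    test on the samples that were not drawn."""
    n = len(xs)
    scores = []
    for _ in range(rounds):
        train = [rng.randrange(n) for _ in range(n)]
        drawn = set(train)
        test = [i for i in range(n) if i not in drawn]
        # Skip degenerate resamples (no held-out samples or a single class).
        if not test or len({ys[i] for i in train}) < 2:
            continue
        scores.append(accuracy(train, test, xs, ys))
    return mean(scores)


# Synthetic two-class data (hypothetical, not the breast cancer data set).
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(30)] + \
     [rng.gauss(1.5, 1.0) for _ in range(30)]
ys = [0] * 30 + [1] * 30

cv_acc = cross_validation(xs, ys, k=5, rng=rng)
boot_acc = bootstrap_oob(xs, ys, rounds=200, rng=rng)
print(f"5-fold cross-validation accuracy: {cv_acc:.3f}")
print(f"out-of-bag bootstrap accuracy:    {boot_acc:.3f}")
```

In a real analysis, both estimates would be computed on the training set only and then compared with the accuracy on a held-out validation set, with any feature selection repeated inside every resampling iteration so that the held-out samples do not influence which features are chosen.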

  • ABSTRACT: In this study, a set of classification models was developed to categorize organic solvents with respect to their ability to disperse single-walled carbon nanotubes (SWNTs). The organic solvents were split into solvents and nonsolvents based on their ability to disperse the SWNTs. Various feature selection techniques combined with different classifier algorithms, namely linear and quadratic discriminant analysis (LDA and QDA), decision trees (random forest and J48), neural networks and support vector machine (SVM), were explored on a data set consisting of structurally diverse organic solvents. Physicochemical descriptors such as partial charges, VolSurf, subdivided surface area and some shape descriptors contributed to the classification models. The validation studies, using test set, leave-one-out and 10-fold cross-validation methods, provide statistical parameters such as specificity, sensitivity, accuracy, Matthews correlation coefficient and Kappa index to evaluate the developed classification models. The sum of ranking differences (SRD) procedure reveals that the random forest classifier based on descriptors selected by the wrapper feature selection method is the best classification model, while the models containing SVM, MLP and QDA are ranked as good models. The structural features, along with electrostatic interactions of solvent molecules, play a significant role in discriminating good solvents from nonsolvents in SWNT dispersion.
    RSC Advances 02/2015; DOI:10.1039/C5RA01261A · 3.71 Impact Factor
  • ABSTRACT: Background: Applying machine learning methods to microarray gene expression profiles for disease classification problems is a popular way to derive biomarkers, i.e. sets of genes that can predict disease state or outcome. Traditional approaches in which the expression of genes is treated independently suffer from low prediction accuracy and difficulty of biological interpretation. Current research efforts focus on integrating information on protein interactions through biochemical pathway datasets with expression profiles to propose pathway-based classifiers that can enhance disease diagnosis and prognosis. As most of the pathway activity inference methods in the literature are either unsupervised or applied to two-class datasets, there is good scope to address such limitations by proposing novel methodologies. Results: A supervised multiclass pathway activity inference method using optimisation techniques is reported. For each pathway expression dataset, patterns of its constituent genes are summarised into one composite feature, termed pathway activity, and a novel mathematical programming model is proposed to infer this feature as a weighted linear summation of the expression of its constituent genes. Gene weights are determined by the optimisation model in such a way that the resulting pathway activity has the optimal discriminative power with regard to disease phenotypes. Classification is then performed on the resulting low-dimensional pathway activity profile. Conclusions: The model was evaluated on a variety of published gene expression profiles that cover different types of disease. We show that not only does it improve classification accuracy, but it can also perform well on multiclass disease datasets, a limitation of other approaches from the literature. Desirable features of the model include the ability to control the maximum number of genes that may participate in determining pathway activity, which may be pre-specified by the user. Overall, this work highlights the potential of building pathway-based multi-phenotype classifiers for accurate disease diagnosis and prognosis problems.
    BMC Bioinformatics 12/2014; 15(1):390. DOI:10.1186/s12859-014-0390-2 · 2.67 Impact Factor
  • ABSTRACT: ADAM12-L and ADAM12-S represent two major splice variants of human metalloproteinase-disintegrin 12 mRNA, which differ in their 3'-untranslated regions (3'UTRs). ADAM12-L, but not ADAM12-S, has prognostic and chemopredictive values in breast cancer. Expression levels of the two ADAM12 splice variants in clinical samples are highly discordant, suggesting post-transcriptional regulation of the ADAM12 gene. The miR-29, miR-30, and miR-200 families have potential target sites in the ADAM12-L 3'UTR and they may negatively regulate ADAM12-L expression. miR-29b/c, miR-30b/d, miR-200b/c, or control miRNA mimics were transfected into SUM159PT, BT549, SUM1315MO2, or Hs578T breast cancer cells. ADAM12-L and ADAM12-S mRNA levels were measured by qRT-PCR, and ADAM12-L protein was detected by Western blotting. Direct targeting of the ADAM12-L 3'UTR by miRNAs was tested using an ADAM12-L 3'UTR luciferase reporter. The rate of ADAM12-L translation was evaluated by metabolic labeling of cells with (35)S cysteine/methionine. The roles of endogenous miR-29b and miR-200c were tested by transfecting cells with miRNA hairpin inhibitors. Transfection of miR-29b/c mimics strongly decreased ADAM12-L mRNA levels in SUM159PT and BT549 cells, whereas ADAM12-S levels were not changed. ADAM12-L, but not ADAM12-S, levels were also significantly diminished by miR-200b/c in SUM1315MO2 cells. In Hs578T cells, miR-200b/c mimics impeded translation of ADAM12-L mRNA. Importantly, both miR-29b/c and miR-200b/c strongly decreased steady state levels of ADAM12-L protein in all breast cancer cell lines tested. miR-29b/c and miR-200b/c also significantly decreased the activity of an ADAM12-L 3'UTR reporter, and this effect was abolished when miR-29b/c and miR-200b/c target sequences were mutated. In contrast, miR-30b/d did not elicit consistent and significant effects on ADAM12-L expression. Analysis of a publicly available gene expression dataset for 100 breast tumors revealed a statistically significant negative correlation between ADAM12-L and both miR-29b and miR-200c. Inhibition of endogenous miR-29b and miR-200c in SUM149PT and SUM102PT cells led to increased ADAM12-L expression. The ADAM12-L 3'UTR is a direct target of miR-29 and miR-200 family members. Since the miR-29 and miR-200 families play important roles in breast cancer progression, these results may help explain the different prognostic and chemopredictive values of ADAM12-L and ADAM12-S in breast cancer.
    BMC Cancer 12/2015; 15(1):1108. DOI:10.1186/s12885-015-1108-1 · 3.32 Impact Factor
