The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models.

National Center for Toxicological Research, US Food and Drug Administration, Jefferson, Arkansas, USA.
Nature Biotechnology (Impact Factor: 39.08). 08/2010; 28(8):827-38. DOI: 10.1038/nbt.1665
Source: PubMed

ABSTRACT Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, >30,000 models were built using many combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model performance depended largely on the endpoint and team proficiency and that different approaches generated models of similar performance. The conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees and independent investigators that evaluate methods for global gene expression analysis.

  • ABSTRACT: Acute Myeloid Leukemia (AML) is characterized by various cytogenetic and molecular abnormalities. Detection of these abnormalities is important in the risk-classification of patients but requires laborious experimentation. Various studies showed that gene expression profiles (GEP), and the gene signatures derived from GEP, can be used for the prediction of subtypes in AML. Similarly, successful prediction was also achieved by exploiting DNA-methylation profiles (DMP). There are, however, no studies that compared classification accuracy and performance between GEP and DMP, nor are there studies that integrated both types of data to determine whether predictive power can be improved. Here, we used 344 well-characterized AML samples for which both gene expression and DNA-methylation profiles are available. We created three different classification strategies including early, late and no integration of these datasets and used them to predict AML subtypes using a logistic regression model with Lasso regularization. We illustrate that both gene expression and DNA-methylation profiles contain distinct patterns that contribute to discriminating AML subtypes and that an integration strategy can exploit these patterns to achieve synergy between both data types. We show that concatenation of features from both data sets, i.e. early integration, improves the predictive power compared to classifiers trained on GEP or DMP alone. A more sophisticated strategy, i.e. the late integration strategy, employs a two-layer classifier which outperforms the early integration strategy. We demonstrate that prediction of known cytogenetic and molecular abnormalities in AML can be further improved by integrating GEP and DMP profiles.
    BMC Bioinformatics 02/2015; 16 Suppl 4:S5. DOI:10.1186/1471-2105-16-S4-S5 · 2.67 Impact Factor
  • ABSTRACT: A small set of gastric adenocarcinomas (9%) harbor Epstein-Barr virus (EBV) DNA within malignant cells, and the virus is not an innocent bystander but rather is intimately linked to pathogenesis and tumor maintenance. Evidence comes from unique genomic features of host DNA, mRNA, microRNA and CpG methylation profiles as revealed by recent comprehensive genomic analysis by The Cancer Genome Atlas Network. Their data show that gastric cancer is not one disease but rather comprises four major classes: EBV-positive, microsatellite instability (MSI), genomically stable and chromosome instability. The EBV-positive class has even more marked CpG methylation than does the MSI class, and viral cancers have a unique pattern of methylation linked to the downregulation of CDKN2A (p16) but not MLH1. EBV-positive cancers often have mutated PIK3CA and ARID1A and an amplified 9p24.1 locus linked to overexpression of JAK2, CD274 (PD-L1) and PDCD1LG2 (PD-L2). Multiple noncoding viral RNAs are highly expressed. Patients who fail standard therapy may qualify for enrollment in clinical trials targeting cancer-related human gene pathways or promoting destruction of infected cells through lytic induction of EBV genes. Genomic tests such as the GastroGenus Gastric Cancer Classifier are available to identify actionable variants in formalin-fixed cancer tissue of affected patients.
    Experimental & Molecular Medicine 01/2015; 47(1):e134. DOI:10.1038/emm.2014.93
  • ABSTRACT: Biological data analysis is frequently performed with command line software. While this practice provides considerable flexibility for computationally savvy individuals, such as investigators trained in bioinformatics, this also creates a barrier to the widespread use of data analysis software by investigators trained as biologists and/or clinicians. Workflow systems such as Galaxy and Taverna have been developed to try to provide generic user interfaces that can wrap command line analysis software. These solutions are useful for problems that can be solved with workflows, and that do not require specialized user interfaces. However, some types of analyses can benefit from custom user interfaces. For instance, developing biomarker models from high-throughput data is a type of analysis that can be expressed more succinctly with specialized user interfaces. Here, we show how Language Workbench (LW) technology can be used to model the biomarker development and validation process. We developed a language that models the concepts of Dataset, Endpoint, Feature Selection Method and Classifier. These high-level language concepts map directly to abstractions that analysts who develop biomarker models are familiar with. We found that user interfaces developed in the Meta-Programming System (MPS) LW provide convenient means to configure a biomarker development project, to train models and view the validation statistics. We discuss several advantages of developing user interfaces for data analysis with a LW, including increased interface consistency, portability and extension by language composition. The language developed during this experiment is distributed as an MPS plugin (available at
    PeerJ 02/2015; 3:e800. DOI:10.7717/peerj.800

