Simple and flexible classification of gene expression microarrays via Swirls and Ripples

Biometry Research Group, Division of Cancer Prevention, National Cancer Institute, EPN 3131, 6130 Executive Blvd MSC 7354, Bethesda, MD 20892-7354, USA.
BMC Bioinformatics (Impact Factor: 2.58). 09/2010; 11(1):452. DOI: 10.1186/1471-2105-11-452


A classification rule with few genes and few parameters is desirable when the rule is to be applied to new data. One popular simple rule, diagonal discriminant analysis, yields linear or curved classification boundaries, called Ripples, that are optimal when gene expression levels are normally distributed with the appropriate variance, but may yield poor classification in other situations.
A simple modification of diagonal discriminant analysis yields smooth, highly nonlinear classification boundaries, called Swirls, that sometimes outperform Ripples. In particular, if the data are normally distributed with different variances in each class, Swirls substantially outperforms Ripples when a pooled variance is used to reduce the number of parameters. The proposed classification rule for two classes selects either Swirls or Ripples after parsimoniously selecting the number of genes and distance measures. Applications to five cancer microarray data sets identified predictive genes related to the tissue organization theory of carcinogenesis.
The parsimonious selection of classifiers coupled with the selection of either Swirls or Ripples provides a good basis for formulating a simple, yet flexible, classification rule. Open source software is available for download.
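
As an illustration of the diagonal discriminant analysis rule mentioned above (the rule whose boundaries the abstract calls Ripples), the following is a minimal sketch and not the authors' software. It assumes equal class priors, rows as samples and columns as genes, and uses hypothetical function names (fit_dda, predict_dda). With a pooled variance the boundary between two classes is linear; with class-specific variances it is curved. The Swirls modification is not shown, because the abstract does not specify it.

```python
import numpy as np

def fit_dda(X, y, pooled=True, eps=1e-12):
    """Fit diagonal discriminant analysis (assumed layout: samples x genes)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)  # sorted class labels
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    if pooled:
        # One variance per gene, shared by all classes.
        resid = X - means[np.searchsorted(classes, y)]
        var = (resid ** 2).sum(axis=0) / (len(y) - len(classes))
        variances = np.tile(var, (len(classes), 1))
    else:
        # One variance per gene and per class.
        variances = np.array([X[y == c].var(axis=0, ddof=1) for c in classes])
    return classes, means, variances + eps  # eps guards against zero variance

def predict_dda(model, X_new):
    """Assign each sample to the class with the smallest diagonal Gaussian score."""
    classes, means, variances = model
    X_new = np.atleast_2d(np.asarray(X_new, dtype=float))
    # Squared standardized distances, shape (n_samples, n_classes, n_genes).
    z = (X_new[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
    # Adding the log-variance term gives minus twice the Gaussian log-likelihood
    # up to a constant; under a pooled variance it is the same for every class.
    scores = z.sum(axis=2) + np.log(variances).sum(axis=1)[None, :]
    return classes[np.argmin(scores, axis=1)]
```

Calling fit_dda(X, y, pooled=True) reduces the number of variance parameters to one per gene, which is the setting in which the abstract reports that Ripples can perform poorly when the true class variances differ.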



Available from: PubMed Central · License: CC BY
  • Source
    ABSTRACT: Multi-gene interactions likely play an important role in the development of complex phenotypes, and relationships between interacting genes pose a challenging statistical problem in microarray analysis, since the genes involved in these interactions may not exhibit marginal differential expression. As a result, it is necessary to develop tools that can identify sets of interacting genes that discriminate phenotypes without requiring that the classification boundary between phenotypes be convex. We describe an extension and application of a new unsupervised statistical learning technique, known as the Partition Decoupling Method (PDM), to gene expression microarray data. This method may be used to classify samples based on multi-gene expression patterns and to identify pathways associated with phenotype, without relying upon the differential expression of individual genes. The PDM uses iterated spectral clustering and scrubbing steps, revealing at each iteration progressively finer structure in the geometry of the data. Because spectral clustering has the ability to discern clusters that are not linearly separable, it is able to articulate relationships between samples that would be missed by distance- and tree-based classifiers. After projecting the data onto the cluster centroids and computing the residuals ("scrubbing"), one can repeat the spectral clustering, revealing clusters that were not discernible in the first layer. These iterations, each of which provides a partition of the data that is decoupled from the others, are carried forward until the structure in the residuals is indistinguishable from noise, preventing over-fitting. We describe the PDM in detail and apply it to three publicly available cancer gene expression data sets. 
By applying the PDM on a pathway-by-pathway basis and identifying those pathways that permit unsupervised clustering of samples that match known sample characteristics, we show how the PDM may be used to find sets of mechanistically-related genes that may play a role in disease. An R package to carry out the PDM is available for download. We show that the PDM is a useful tool for the analysis of gene expression data from complex diseases, where phenotypes are not linearly separable and multi-gene effects are likely to play a role. Our results demonstrate that the PDM is able to distinguish cell types and treatments with higher accuracy than is obtained through other approaches, and that the Pathway-PDM application is a valuable technique for identifying disease-associated pathways.
    Full-text · Article · Feb 2010 · BMC Bioinformatics
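
    The "scrubbing" step described in the abstract above (project the data onto the cluster centroids, then keep the residuals) can be sketched as follows. This is a minimal illustration under stated assumptions, not the PDM R package: the function name scrub is hypothetical, rows are assumed to be samples, the cluster labels are assumed to come from an earlier clustering step, and the projection is taken to be an ordinary least-squares fit of each sample onto the centroids.

```python
import numpy as np

def scrub(X, labels):
    """Remove the part of each sample explained by the cluster centroids.

    X      : array of shape (n_samples, n_genes)  (assumed layout)
    labels : cluster label for each sample, from a previous clustering step
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # One centroid per cluster, shape (n_clusters, n_genes).
    centroids = np.array([X[labels == k].mean(axis=0) for k in np.unique(labels)])
    # Least-squares coefficients expressing each sample as a combination of
    # the centroids (this projection choice is an assumption of the sketch).
    coef, *_ = np.linalg.lstsq(centroids.T, X.T, rcond=None)
    fitted = (centroids.T @ coef).T
    # The residuals are the input to the next round of clustering.
    return X - fitted
```

    The residuals returned by scrub are orthogonal to every centroid, so a second round of clustering on them can only pick up structure that the first partition did not capture.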
  • Source
    ABSTRACT: We define personalized medicine as the administration of treatment only to persons thought most likely to benefit, typically those at high risk for mortality or another detrimental outcome. To evaluate personalized medicine, we propose a new design for a randomized trial that makes efficient use of high-throughput data (such as gene expression microarrays) and clinical data (such as tumor stage) collected at baseline from all participants. Under this design for a randomized trial involving experimental and control arms with a survival outcome, investigators first estimate the risk of mortality in the control arm based on the high-throughput and clinical data. Then investigators use data from both randomization arms to estimate the effect of treatment both among all participants and among participants in the highest prespecified category of risk. This design requires only an 18.1% increase in sample size compared with a standard randomized trial. A trial based on this design that has 90% power to detect a realistic increase in survival from 70% to 80% among all participants would also have 90% power to detect an increase in survival from 50% to 73% in the highest quintile of risk.
    Full-text · Article · Nov 2010 · Journal of the National Cancer Institute
  • Source
    ABSTRACT: Systems biology uses systems of mathematical rules and formulas to study complex biological phenomena. In cancer research there are three distinct threads in systems biology research: modeling biology or biophysics with the goal of establishing plausibility or obtaining insights, modeling based on statistics, bioinformatics, and reverse engineering with the goal of better characterizing the system, and modeling with the goal of clinical predictions. Using illustrative examples we discuss these threads in the context of cancer research.
    Preview · Article · Mar 2011 · Progress in Biophysics and Molecular Biology