c-REDUCE: Incorporating sequence conservation to detect motifs that correlate with expression

Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver, 4200 East Ninth Avenue, B-119, Denver, CO 80262, USA.
BMC Bioinformatics (Impact Factor: 2.58). 12/2008; 9(1):506. DOI: 10.1186/1471-2105-9-506
Source: PubMed


Computational methods for characterizing novel transcription factor binding sites search for sequence patterns or "motifs" that appear repeatedly in genomic regions of interest. Correlation-based motif-finding strategies identify motifs that correlate with expression data and do not rely on promoter sequences from a pre-determined set of genes.
In this work, we describe a method for predicting motifs that combines the correlation-based strategy with phylogenetic footprinting, where motifs are identified by evaluating orthologous sequence regions from multiple species. Our method, c-REDUCE, can account for variability at a motif position inferred from evolutionary information. c-REDUCE has been tested on ChIP-chip data for yeast transcription factors and on gene expression data in Drosophila.
Our results indicate that utilizing sequence conservation information in addition to correlation-based methods improves the identification of known motifs.
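
The correlation-based idea described above can be illustrated with a minimal sketch. This is not the c-REDUCE implementation: the function names (`count_motif`, `pearson`, `score_motifs`), the toy promoter sequences, and the expression values are all hypothetical, and the sketch uses a simple Pearson correlation between motif counts and expression rather than the authors' model. Degenerate IUPAC symbols stand in for variability at a motif position; the conservation step is not modeled here.

```python
import math
import re

# IUPAC nucleotide codes mapped to regex character classes.
# Degenerate symbols let one motif position match several bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def count_motif(motif, sequence):
    """Count matches of an IUPAC motif in a sequence, overlaps included."""
    # A lookahead matches at every start position without consuming text.
    pattern = "(?=" + "".join(IUPAC[ch] for ch in motif) + ")"
    return len(re.findall(pattern, sequence))


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def score_motifs(motifs, promoters, expression):
    """Rank motifs by |correlation| between per-gene counts and expression."""
    scores = []
    for motif in motifs:
        counts = [count_motif(motif, seq) for seq in promoters]
        scores.append((motif, pearson(counts, expression)))
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: one promoter sequence and one expression value per gene.
promoters = ["ACGTACGT", "TTTTTTTT", "ACGTTTTT", "GGGGACGA"]
expression = [2.0, 0.1, 1.0, 0.2]

ranked = score_motifs(["ACGT", "TTTT", "ACGW"], promoters, expression)
for motif, r in ranked:
    print(f"{motif}\t{r:+.3f}")
```

Because every gene contributes a count and an expression value, no pre-selected gene set is needed; motifs are ranked across the whole data set.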

Available from: PubMed Central
  • ABSTRACT: De novo identification of transcription factor binding sites (TFBSs) is a challenging computational problem because TFBSs are relatively short sequences buried in long genomic regions. Earlier methods incorporated genome-wide expression data and promoter sequences into a linear-model framework, regressing expression on counts of putative TFBSs in promoters for a single species. More recently, it has been shown that examining sequence data across multiple species improves the prediction of TFBSs. In this work, we describe an extension of the single-species, linear-model framework for the analysis of paired cross-species sequence and expression data. A repeated-measures model for gene-expression measurements across species is used, accounting for phylogenetic relationships among species through the error covariance structure. This multiple-species algorithm is applied to a data set of four yeast species grown under heat-shock conditions, and comparisons are made to the single-species algorithm. Using evaluations based on transcription factor binding strength and an independent source of expression data, we find that the multiple-species results show an improvement in the prediction of TFBSs.
    Statistical Applications in Genetics and Molecular Biology 01/2009; 8(1):Article 36. DOI:10.2202/1544-6115.1464 · 1.13 Impact Factor
  • ABSTRACT: Although genome-wide expression data sets from multiple species are now more commonly generated, there have been few studies on how to best integrate this type of correlated data into models. Starting with a single-species, linear regression model that predicts transcription factor binding sites as a case study, we investigated how best to take into account the correlated expression data when extending this model to multiple species. Using a multivariate regression model, we accounted for the phylogenetic relationships among the species in two ways: (i) a repeated-measures model, where the error term is constrained; and (ii) a Bayesian hierarchical model, where the prior distributions of the regression coefficients are constrained. We show that both multiple-species models improve predictive performance over the single-species model. When compared with each other, the repeated-measures model outperformed the Bayesian model. We suggest a possible explanation for the better performance of the model with the constrained error term. Copyright © 2013 John Wiley & Sons, Ltd.
    Statistics in Medicine 10/2013; 32(23). DOI:10.1002/sim.5850 · 1.83 Impact Factor