Article

On the detection and refinement of transcription factor binding sites using ChIP-Seq data

Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan 48109, USA.
Nucleic Acids Research (Impact Factor: 8.81). 04/2010; 38(7):2154-67. DOI: 10.1093/nar/gkp1180
Source: PubMed

ABSTRACT Coupling chromatin immunoprecipitation (ChIP) with recently developed massively parallel sequencing technologies has enabled genome-wide detection of protein-DNA interactions with unprecedented sensitivity and specificity. This new technology, ChIP-Seq, presents opportunities for in-depth analysis of transcription regulation. In this study, we explore the value of using ChIP-Seq data to better detect and refine transcription factor binding sites (TFBS). We introduce a novel computational algorithm named Hybrid Motif Sampler (HMS), specifically designed for TFBS motif discovery in ChIP-Seq data. We propose a Bayesian model that incorporates sequencing depth information to aid motif identification. Our model also allows intra-motif dependency to describe more accurately the underlying motif pattern. Our algorithm combines stochastic sampling and deterministic 'greedy' search steps into a novel hybrid iterative scheme. This combination accelerates the computation process. Simulation studies demonstrate favorable performance of HMS compared to other existing methods. When applying HMS to real ChIP-Seq datasets, we find that (i) the accuracy of existing TFBS motif patterns can be significantly improved; and (ii) there is significant intra-motif dependency inside all the TFBS motifs we tested; modeling these dependencies further improves the accuracy of these TFBS motif patterns. These findings may offer new biological insights into the mechanisms of transcription factor regulation.
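To make the hybrid scheme described in the abstract concrete, here is a minimal Python sketch of a motif search that alternates stochastic sampling sweeps with deterministic greedy sweeps. It is an illustration under stated assumptions, not the published HMS implementation. The function names, the uniform background, the pseudocount, the `greedy_every` schedule, and the use of per-position `depth` values as simple multiplicative weights are all hypothetical. The sketch uses an independent-position weight matrix and does not model the intra-motif dependency the abstract mentions.

```python
import random

BASES = "ACGT"  # assumption: sequences contain only these four letters


def build_pwm(seqs, starts, width, skip=None, pseudo=0.5):
    """Estimate a position weight matrix from the current site alignment.

    The sequence at index `skip` is left out, as in a standard Gibbs update.
    """
    counts = [[pseudo] * 4 for _ in range(width)]
    for i, (seq, start) in enumerate(zip(seqs, starts)):
        if i == skip:
            continue
        for j in range(width):
            counts[j][BASES.index(seq[start + j])] += 1.0
    return [[c / sum(col) for c in col] for col in counts]


def site_scores(seq, pwm, bg=0.25):
    """Likelihood ratio (motif vs. uniform background) for every start position."""
    width = len(pwm)
    scores = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for j in range(width):
            ratio *= pwm[j][BASES.index(seq[start + j])] / bg
        scores.append(ratio)
    return scores


def hybrid_motif_search(seqs, width, depth=None, n_iter=50, greedy_every=5, seed=0):
    """Alternate stochastic sampling sweeps with greedy (argmax) sweeps.

    `depth[i][start]` is an optional positive weight for each candidate start
    in sequence i, standing in for sequencing-depth information (hypothetical
    simplification of the Bayesian prior described in the abstract).
    """
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for it in range(1, n_iter + 1):
        greedy = it % greedy_every == 0  # every k-th sweep is deterministic
        for i, seq in enumerate(seqs):
            pwm = build_pwm(seqs, starts, width, skip=i)
            weights = site_scores(seq, pwm)
            if depth is not None:
                weights = [w * d for w, d in zip(weights, depth[i])]
            positions = range(len(weights))
            if greedy:
                # Greedy step: take the best-scoring start position.
                starts[i] = max(positions, key=weights.__getitem__)
            else:
                # Stochastic step: sample a start in proportion to its weight.
                starts[i] = rng.choices(positions, weights=weights)[0]
    return starts
```

A call such as `hybrid_motif_search(["TTACGTACTT", "GGACGTACGG", "CACGTACCCA"], width=6)` returns one start position per sequence. The greedy sweeps are what make the scheme "hybrid": they push the alignment toward a local optimum between sampling sweeps, which is the kind of speed-up the abstract attributes to combining the two step types.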

  • 05/2013; 19(A):28. DOI:10.14806/ej.19.A.629
  • ABSTRACT: Understanding transcriptional regulatory elements, and particularly transcription factor binding sites, represents a significant challenge in computational biology. Chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq) experiments provide an unprecedented opportunity to study transcription factor binding sites on a genome-wide scale. Here we describe a recently developed tool, SIOMICS, to systematically discover motifs and binding sites of transcription factors and their cofactors from ChIP-seq data. Unlike other tools, SIOMICS explores the co-binding properties of multiple transcription factors in short regions to predict motifs and binding sites. We have previously shown that the original SIOMICS method predicts motifs and binding sites of more cofactors, more accurately and in less time, than two popular methods. In this paper, we present the extended SIOMICS method, SIOMICS_Extension, and demonstrate its usage for systematic discovery of cofactor motifs and binding sites. The SIOMICS tool, including SIOMICS and SIOMICS_Extension, is available at http://www.eecs.ucf.edu/~xiaoman/SIOMICS/SIOMICS.html.
    Methods 08/2014; 79-80. DOI:10.1016/j.ymeth.2014.08.006 · 3.22 Impact Factor
  • ABSTRACT: A recurring topic of my research has been the development and application of computational methods for DNA sequence analysis. The methods I have been developing aim at improving our understanding of the regulation processes happening in normal and cancer cells. This topic connects the projects presented in this thesis. Two chapters of the thesis represent major areas of my research interests: (1) methods for deciphering transcriptional regulation and their application to answer specific biological questions, and (2) methods to study genome structure and their application in cancer studies. The first chapter predominantly focuses on transcriptional regulation. Here I describe my contribution to the development of methodology for the discovery of transcription factor binding sites and the positioning of histone proteins. I also explain how sequence analysis, in combination with gene expression data, can allow the identification of direct target genes of a transcription factor under study, as well as the physical mechanisms of its action. As two examples, I provide the results of my study of transcriptional regulation by (i) the oncogenic protein EWS-FLI1 in Ewing sarcoma and (ii) the oncogenic transcription factor Spi-1/PU.1 in erythroleukemia. In the second chapter, I describe sequence analysis methods aimed at identifying genomic rearrangements in species with an existing reference genome. I explain how the developed methodology can be applied to detect the structure of cancer genomes. I provide an example of how such an analysis of tumor genomes can lead to the discovery of a new phenomenon: chromothripsis, in which hundreds of rearrangements occur in a single cellular catastrophe. The thesis concludes by listing the major challenges in high-throughput sequencing analysis. I also discuss the most pressing current questions that demand the integration of sequencing data.